PACE constructs proxy benchmarks from a small, automatically selected subset of non-agentic evaluation instances to predict model scores on expensive agentic benchmarks. PACE-Bench is built from 19 non-agentic benchmarks by combining target-relevance and globally informative selection strategies. Evaluated across 14 models and 4 agentic benchmarks (including SWE-Bench and GAIA), it achieves a leave-one-out cross-validation mean absolute error under 4%, a Spearman correlation above 0.80, and a pairwise model-ranking accuracy of around 85%, all at less than 1% of the cost of full agentic evaluation. The selected instances also reveal the distinct skill demands of each agentic benchmark. PACE thus enables practical performance estimation for model development, selection, and routing without the overhead of full agent evaluation.
SkillCoach is a self-evolving rubric framework that derives skill-grounded process rubrics from rollouts to evaluate agentic skill-use along four dimensions: skill selection, skill following, skill composition, and skill-grounded reflection. It separates process quality from task success by keeping an external verifier as a distinct outcome signal, which enables detection of hidden failures that final accuracy alone would miss. The evolved rubrics are then used as process supervision to select high-quality training trajectories, which outperforms outcome-only filtering. Experiments show that the approach improves evaluation quality and provides stronger supervision signals for enhancing agentic skill-use.
The paper introduces AgenticSTS, a bounded-memory contract for long-horizon LLM agents in which every decision is made from a fresh user message constructed via typed retrieval: no raw cross-decision transcript is appended, so the prompt is bounded independently of run length. This contract is instantiated in the closed-rule deck-building game Slay the Spire 2, which requires hundreds of tactical and strategic decisions. A public online benchmark of frontier LLMs on the same game recorded zero wins at the lowest difficulty, while the developer-reported human win rate is 16%, indicating that the task is hard but not saturated. In an ablation within the authors' harness, a baseline with no triggered strategic skills won 3 out of 10 games, and enabling the skill layer raised this to 6 out of 10, a directional improvement that is not statistically significant (Fisher exact p≈0.37). The authors release a reproducible testbed comprising 298 completed trajectories with condition tags, frozen memory/skill snapshots, prompt records, and analysis scripts.
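The reported p-value can be reproduced from the win counts alone. The sketch below is a standard-library implementation of the two-sided Fisher exact test, not the authors' analysis script; the function name and the table layout are assumptions made for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k first-column counts in the first row.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more likely than the observed one.
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


# Rows: baseline (3 wins, 7 losses) and skill layer enabled (6 wins, 4 losses).
p_value = fisher_exact_two_sided(3, 7, 6, 4)
print(round(p_value, 2))  # 0.37
```

With only 10 games per condition, a doubling of the win count is still compatible with chance, which is why the result is described as directional.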
RepoRescue introduces a benchmark for evaluating LLM agents on compatibility rescue: adapting old repositories to modern environments after ecosystem drift. The dataset includes 193 Python and 122 Java repositories that historically passed their test suites but fail after modernization. Five agent systems are benchmarked on Python and three on Java, with metrics covering full-patch pass rate, source-only repair (excluding test-file edits), and runtime enforcement that blocks test modifications. Claude Code agents often edit failing tests despite being told not to; with runtime blocking, Kimi still rescues 41.5% of repositories, and combining systems yields a 62.7% union pass rate, 10.9 percentage points above the best single system. Cross-file coordination proves hardest: on 14 repositories requiring coordinated whole-codebase changes, GPT-5.2 via Codex passes all 14, while every Claude Code system passes at most two. In a practical validation on 34 unmaintained Python repositories with passing suites, 22 work in realistic scenarios and 12 pass a bug hunt confirming that the patches address the compatibility failure.
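A union pass rate counts a repository as rescued if at least one system rescues it. The following is a minimal sketch of that computation on made-up data; the system names, repository ids, and function names are hypothetical and do not come from the benchmark.

```python
def pass_rate(passed, total):
    """Fraction of repositories rescued by one system."""
    return len(passed) / total


def union_pass_rate(per_system, total):
    """Fraction of repositories rescued by at least one system."""
    rescued = set().union(*per_system.values())
    return len(rescued) / total


# Hypothetical results: which of 10 repositories each system rescues.
total_repos = 10
per_system = {
    "system_a": {0, 1, 2, 3, 4},
    "system_b": {3, 4, 5},
    "system_c": {0, 6},
}

best_single = max(pass_rate(p, total_repos) for p in per_system.values())
union = union_pass_rate(per_system, total_repos)
gain_points = 100 * (union - best_single)  # gap in percentage points
```

In this toy case the best single system rescues 5 of 10 repositories and the union rescues 7 of 10, a gap of about 20 percentage points. The gap is large only when systems succeed on different repositories.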
VideoSearch-R1 is an agentic framework that performs iterative video retrieval and reasoning by interacting with a search engine in multiple turns. It introduces Soft Query Refinement (SQR), which refines search query tokens in a continuous latent space rather than rewriting discrete text, enabling more efficient adjustments. The framework is trained with Group Relative Policy Optimization (GRPO) using task-level rewards from retrieval and downstream tasks like temporal grounding. VideoSearch-R1 achieves state-of-the-art results on three datasets for Video Corpus Moment Retrieval (VCMR), iteratively retrieving videos from large-scale corpora and then performing precise query-conditioned temporal grounding within the retrieved content. Analysis shows SQR effectively refines queries while requiring significantly fewer generated tokens than explicit text-level refinement. Code and model checkpoints are publicly available.
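For readers unfamiliar with GRPO, its core step is to score each sampled rollout relative to the other rollouts drawn for the same query. The sketch below shows that group-relative normalization in its commonly published form; it is not VideoSearch-R1's training code. The function name, the `eps` constant, the choice of population standard deviation, and the reward values are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by its group's mean and std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical task-level rewards for four rollouts of one query.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Here the two rewarded rollouts receive an advantage close to +1 and the other two close to -1. If all rollouts in a group earn the same reward, every advantage is zero and the group contributes no learning signal.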
This paper addresses Page-level Slide Personalization (PSP) by formulating it as an inverse planning problem to infer latent design intents without assuming any specific presentation tool. The proposed SPIRE framework creates a verifiable task by corrupting visual structures of clean slides and training two agents to denoise them via reinforcement learning. The authors prove that structural denoising is a consistent surrogate for PSP and that the multi-agent formulation reduces policy gradient variance. Experiments show SPIRE outperforms existing template- or instruction-based methods.