AGVBench is a comprehensive benchmark evaluating 30 data augmentation strategies on five public palm- and finger-vein datasets with seven backbone architectures, including CNNs, vision transformers, and vein-specific models. Multi-image mixing methods such as MixUp, PuzzleMix, and StarMixup achieve the highest recognition accuracy but exhibit poor calibration and high vulnerability to adversarial perturbations. Severe geometric transformations often degrade performance, likely due to feature misalignment or spatial cropping. The results demonstrate that accuracy-centric evaluation is insufficient for biometric data augmentation and that calibration and adversarial robustness must be assessed alongside accuracy. AGVBench provides standardized protocols and open-source code to advance reproducible and secure vein recognition research.
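To make the multi-image mixing idea concrete, here is a minimal sketch of the standard MixUp formulation (a convex combination of two samples and their one-hot labels, with the mixing weight drawn from a Beta distribution). It is not AGVBench's implementation: the function name `mixup_pair`, the flat-list image representation, and `alpha=0.2` are assumptions chosen for illustration.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=None):
    """Blend two samples and their one-hot labels (standard MixUp).

    x_a, x_b: flattened images as lists of floats (illustrative format).
    y_a, y_b: one-hot label vectors as lists of floats.
    alpha:    Beta(alpha, alpha) concentration; 0.2 is an assumed value.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


# Toy two-pixel "images" from two different identities.
mixed_x, mixed_y, lam = mixup_pair(
    [0.0, 1.0], [1.0, 0.0],
    [1.0, 0.0], [0.0, 1.0],
    rng=random.Random(0),
)
print(lam, mixed_x, mixed_y)
```

The soft label `mixed_y` always sums to 1, so the model is trained toward blended targets rather than hard identities. Whether that explains the calibration behavior reported above is not something this sketch can show.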
PACE constructs proxy benchmarks from a small, automatically selected subset of non-agentic evaluation instances to predict model scores on expensive agentic benchmarks. PACE-Bench is built from 19 non-agentic benchmarks by combining target-relevant and globally informative selection strategies. Evaluated across 14 models and 4 agentic benchmarks (including SWE-Bench and GAIA), it achieves a leave-one-out cross-validation mean absolute error under 4%, a Spearman correlation above 0.80, and a pairwise model-ranking accuracy of around 85%, all at less than 1% of the full agentic evaluation cost. The selected instances also reveal the distinct skill demands of each agentic benchmark. PACE enables practical performance estimation for model development, selection, and routing without the overhead of full agent evaluation.
SkillCoach proposes a self-evolving rubric framework that derives skill-grounded process rubrics from rollouts to evaluate agentic skill-use along four dimensions: skill selection, skill following, skill composition, and skill-grounded reflection. It separates process quality from task success by keeping an external verifier as a distinct outcome signal, enabling detection of hidden failures that final accuracy alone would miss. The evolved rubrics are then used as process supervision to select high-quality training trajectories, outperforming outcome-only filtering. Experiments show the approach improves evaluation quality and provides stronger supervision signals for enhancing agentic skill-use.
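The separation of process quality from task success can be illustrated with a small filtering sketch. Everything here is hypothetical rather than SkillCoach's actual interface: the `Trajectory` fields, the unweighted mean over the four dimensions, the 0.7 threshold, and the choice to require a verifier pass in addition to a good process score are all assumptions.

```python
from dataclasses import dataclass

# The four rubric dimensions named in the summary above.
DIMENSIONS = ("selection", "following", "composition", "reflection")


@dataclass
class Trajectory:
    task_id: str
    verifier_passed: bool  # outcome signal from the external verifier
    rubric_scores: dict    # hypothetical per-dimension scores in [0, 1]


def process_score(traj):
    """Unweighted mean over the four dimensions (an assumed aggregation)."""
    return sum(traj.rubric_scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


def select_for_training(trajs, min_process=0.7):
    """Keep trajectories that pass the verifier and also score well on process."""
    return [t for t in trajs
            if t.verifier_passed and process_score(t) >= min_process]


def hidden_failures(trajs, min_process=0.7):
    """Trajectories that succeed on outcome but score poorly on process."""
    return [t for t in trajs
            if t.verifier_passed and process_score(t) < min_process]


rollouts = [
    Trajectory("t1", True, dict.fromkeys(DIMENSIONS, 0.9)),
    Trajectory("t2", True, dict.fromkeys(DIMENSIONS, 0.4)),
    Trajectory("t3", False, dict.fromkeys(DIMENSIONS, 0.8)),
]
print([t.task_id for t in select_for_training(rollouts)])
print([t.task_id for t in hidden_failures(rollouts)])
```

Outcome-only filtering would keep both `t1` and `t2`; the process score is what distinguishes them in this toy example.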
The paper introduces AgenticSTS, a bounded-memory contract for long-horizon LLM agents: every decision is made from a fresh user message constructed via typed retrieval, no raw cross-decision transcript is appended, and the prompt size is therefore bounded independently of run length. This contract is instantiated in the closed-rule deck-building game Slay the Spire 2, which requires hundreds of tactical and strategic decisions. On a public online benchmark for the same game, frontier LLMs recorded zero wins at the lowest difficulty, while the developer-reported human win rate is 16%, indicating the task is hard but not saturated. In an ablation within the authors' harness, a baseline with no triggered strategic skills won 3 out of 10 games, and enabling the skill layer raised this to 6 out of 10 (a directional difference that is not statistically significant; Fisher exact p≈0.37). The authors release a reproducible testbed comprising 298 completed trajectories with condition tags, frozen memory/skill snapshots, prompt records, and analysis scripts.
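The reported p-value for the ablation can be checked with a two-sided Fisher exact test on the 2x2 table of wins and losses (3 wins and 7 losses for the baseline, 6 wins and 4 losses with the skill layer). The sketch below is a plain standard-library implementation written for this check, not the authors' analysis script. It uses the common two-sided definition that sums the probabilities of all tables no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    lo, hi = max(0, col1 - row2), min(row1, col1)

    def weight(k):
        # Unnormalised hypergeometric weight of a table whose top-left cell
        # is k; kept as an integer so ties are compared exactly.
        return comb(row1, k) * comb(row2, col1 - k)

    observed = weight(a)
    extreme = sum(weight(k) for k in range(lo, hi + 1)
                  if weight(k) <= observed)
    return extreme / comb(total, col1)


# Rows: baseline, skill layer enabled. Columns: wins, losses.
p = fisher_exact_two_sided(3, 7, 6, 4)
print(round(p, 2))  # 0.37
```

With only 10 games per condition, a doubling of the win count is still compatible with chance at this sample size, which is why the result is described as directional.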
AnyGroundBench is a new benchmark for evaluating spatio-temporal video grounding (STVG) in vision-language models, shifting from zero-shot testing to rigorous domain adaptation. It covers five specialized domains: animal, industry, sports, surgery, and public security, using newly captured videos and established datasets with dense annotations. The benchmark includes dedicated training subsets to systematically measure domain adaptability. Evaluation of 15 state-of-the-art VLMs reveals that all models fail to adapt under zero-shot and in-context learning settings, exposing critical flaws in their spatio-temporal reasoning capabilities.
The paper adapts a mixture-of-experts discrete diffusion language model, DiffusionGemma-26B, to medical visual question answering and benchmarks it against the autoregressive (AR) Gemma-4-26B. Using the same LoRA fine-tuning recipe, the diffusion model matches or exceeds AR performance, as scored by a verbosity-robust LLM judge, while decoding 3.5–4.4× faster. The fine-tuned model (3.8B active parameters) is competitive with frontier vision-language models. Crucially, the diffusion paradigm enables any-order infill: a radiologist can correct parts of a report and the model generates the text between the corrected spans, a capability inherent to diffusion that autoregressive models cannot easily replicate. This suits real-world radiology reports, which often vary in style and completeness across clinicians and institutions.