AGVBench is a comprehensive benchmark evaluating 30 data augmentation strategies on five public palm- and finger-vein datasets with seven backbone architectures, including CNNs, vision transformers, and vein-specific models. Multi-image mixing methods such as MixUp, PuzzleMix, and StarMixup achieve the highest recognition accuracy but exhibit poor calibration and high vulnerability to adversarial perturbations. Severe geometric transformations often degrade performance, likely due to feature misalignment or spatial cropping. The results demonstrate that accuracy-centric evaluation is insufficient for biometric data augmentation and that security and robustness need to be assessed alongside accuracy. AGVBench provides standardized protocols and open-source code to advance reproducible and secure vein recognition research.
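As an illustration of the multi-image mixing family named above, here is a minimal Python sketch of standard MixUp applied to flattened inputs with one-hot labels. It is not taken from AGVBench; the function name `mixup`, the `alpha` default, and the list-based image representation are assumptions made for illustration only.

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two samples and their one-hot labels with a shared weight.

    Hypothetical helper: x1/x2 are flattened images (lists of floats),
    y1/y2 are one-hot label vectors. The weight lam is drawn from
    Beta(alpha, alpha), as in the standard MixUp formulation.
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)
    # Convex combination of inputs and of labels with the same weight.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam
```

Because the mixed label is a soft target rather than a one-hot vector, a model trained this way is optimized against blended class probabilities, which is one reason calibration is worth checking separately from accuracy.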
This paper reveals that dense on-policy self-distillation (SDPO) accelerates in-domain specialization under stable teacher signals, but causes severe forgetting and even complete collapse during continual post-training. In contrast, on-policy reinforcement learning methods like GRPO adapt more conservatively and better preserve prior capabilities. Denser self-distillation induces larger drift in parameter and response spaces, and amplifies high-frequency formatting artifacts through a self-reinforcing teacher-student loop. The findings caution that on-policy data alone is insufficient for continual learning, and dense self-distillation should not be treated as a default stabilizer.
PACE constructs proxy benchmarks from a small, automatically selected subset of non-agentic evaluation instances to predict model scores on expensive agentic benchmarks. By combining target-relevance and globally informative selection strategies, PACE-Bench is formed from 19 non-agentic benchmarks. Evaluated across 14 models and 4 agentic benchmarks (including SWE-Bench and GAIA), it achieves leave-one-out cross-validation mean absolute error under 4%, Spearman correlation above 0.80, and pairwise model-ranking accuracy around 85%, all at less than 1% of the full agentic evaluation cost. The selected instances also reveal the distinct skill demands of each agentic benchmark. PACE enables practical performance estimation for model development, selection, and routing without full agent evaluation overhead.
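The three agreement measures reported above (mean absolute error, Spearman correlation, and pairwise model-ranking accuracy) can be sketched in plain Python as follows. This is a generic illustration under assumed definitions, not PACE's evaluation code: the function names are hypothetical, ties in the true scores are skipped when counting pairs, and the leave-one-out loop that would produce the predictions is omitted.

```python
from itertools import combinations

def mean_absolute_error(pred, true):
    """Average absolute gap between predicted and true benchmark scores."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def _ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(pred, true):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rp, rt = _ranks(pred), _ranks(true)
    n = len(rp)
    mp, mt = sum(rp) / n, sum(rt) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    return cov / (var_p * var_t) ** 0.5

def pairwise_ranking_accuracy(pred, true):
    """Fraction of model pairs whose predicted order matches the true order."""
    pairs = [(i, j) for i, j in combinations(range(len(true)), 2)
             if true[i] != true[j]]
    correct = sum((pred[i] - pred[j]) * (true[i] - true[j]) > 0
                  for i, j in pairs)
    return correct / len(pairs)
```

With predicted scores [10, 20, 30, 40] and true scores [12, 18, 35, 33], these functions give an MAE of 4.0, a Spearman correlation of 0.8, and a pairwise accuracy of 5/6, since only the last two models are ordered incorrectly.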
SkillCoach proposes a self-evolving rubric framework that derives skill-grounded process rubrics from rollouts to evaluate agentic skill-use along four dimensions: skill selection, skill following, skill composition, and skill-grounded reflection. It separates process quality from task success by keeping an external verifier as a distinct outcome signal, enabling detection of hidden failures that final accuracy alone would miss. The evolved rubrics are then used as process supervision to select high-quality training trajectories, outperforming outcome-only filtering. Experiments show the approach improves evaluation quality and provides stronger supervision signals for enhancing agentic skill-use.
WorldDirector is a controllable video world model framework that explicitly decouples semantic motion orchestration from visual generation. It uses a large language model to coordinate 3D object trajectories and camera movements, then employs these trajectories as control signals for a video generator. This design ensures strict physical consistency, stable appearance, and persistent memory of dynamic objects, which keep their exact visual identity even when they re-enter a scene after long occlusions. The framework supports unrestrained viewpoint exploration and can synthesize complex, extended events with high controllability.
The paper formalizes Representation Distribution Matching (RDM) for one-step image generation, analyzing two design axes: the distribution comparison method and the representation space. The authors find that classical MMD becomes a strong, scalable objective when estimated with large batches (>2048) and that any single representation can be gamed, motivating a battery of encoders and the SW_r14 metric. The improved RDM (iRDM) sets a new one-step state of the art on ImageNet (SW_r14 1.30) and is preferred by PickScore over the prior best on 71.2% of samples. The recipe also post-trains the four-step FLUX.2 into a one-step generator that surpasses its four-step version on GenEval (0.826 vs 0.794) and PickScore (22.76 vs 22.58), using 90 H200 GPU-hours.
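For reference, the classical MMD mentioned above can be estimated from two batches of representation vectors as sketched below. This is the standard biased (V-statistic) estimator of squared MMD with a Gaussian kernel, not the paper's iRDM objective; the kernel choice, the `bandwidth` default, and the function names are assumptions made for illustration.

```python
import math

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel between two feature vectors (lists of floats)."""
    sq_dist = sum((u - v) ** 2 for u, v in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between sample sets xs and ys.

    Computes mean k(x, x') + mean k(y, y') - 2 * mean k(x, y), where each
    mean runs over all pairs, including a sample paired with itself.
    """
    def mean_kernel(A, B):
        total = sum(rbf_kernel(a, b, bandwidth) for a in A for b in B)
        return total / (len(A) * len(B))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The estimate averages over all sample pairs, so its variance shrinks as the batch grows, which is consistent with the summary's observation that MMD works well when estimated with large batches.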