The paper proposes Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. It first trains a reasoning-aware retriever via gold-relevance distillation, so that contexts are ranked by expected reasoning benefit rather than semantic overlap. The policy model is then fine-tuned with reinforcement learning on retrieved analogous demonstrations under verifiable outcome rewards, so that it learns to make use of the reasoning traces they contain. Analysis shows that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct scaffolding for each problem. On AIME 2025, RA-RFT improves average@32 accuracy over GRPO by 7.1 points for Qwen3-1.7B and 2.8 points for Qwen3-4B, indicating that reasoning-aware retrieval is an improvement orthogonal to reward design and training curricula.
Papers | Source: ARXIV | Importance: 3/5
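For readers unfamiliar with the average@32 metric quoted for RA-RFT, the sketch below shows one common reading of average@k: sample k answers per problem, score each with a verifier, and average the per-problem accuracies. The function name `average_at_k` and the toy data are hypothetical, and the paper's exact evaluation protocol may differ.

```python
from typing import List, Sequence


def average_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Mean over problems of the fraction of sampled answers that are correct.

    correct[i][j] is True when sample j for problem i passed the verifier.
    Assumption: this is the usual reading of average@k, not a definition
    taken from the paper.
    """
    if not correct:
        raise ValueError("need at least one problem")
    per_problem: List[float] = []
    for samples in correct:
        if not samples:
            raise ValueError("every problem needs at least one sample")
        per_problem.append(sum(samples) / len(samples))
    return sum(per_problem) / len(per_problem)


# Toy example with k = 4 samples for each of three problems.
toy_results = [
    [True, True, False, False],    # 2 of 4 correct
    [True, False, False, False],   # 1 of 4 correct
    [False, False, False, False],  # 0 of 4 correct
]
toy_score = average_at_k(toy_results)  # (0.5 + 0.25 + 0.0) / 3
```

Unlike pass@k, this score does not reward a problem for having a single lucky sample, which makes it a steadier measure on small benchmarks.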
The paper analyzes on-policy distillation (OPD), a post-training method that combines on-policy student trajectories with dense teacher supervision. The study finds that OPD-style updates are small and coordinate-sparse, spread across layers, and concentrated in FFN parameters. Training only the discovered sparse subnetwork recovers nearly full OPD performance, but the sparsity-inducing SGD optimizer underperforms AdamW because dense supervision preserves heterogeneous gradient scales that benefit from adaptive scaling. Geometrically, the updates are numerically full-rank but spectrally concentrated; they lie away from the principal singular subspaces of the source weights and fall disproportionately on coordinates where the source weights are near zero. The results show that OPD retains the geometric signatures of on-policy post-training rather than behaving as dense parameter rewriting.
Papers | Source: ARXIV | Importance: 4/5
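Two of the coordinate-level claims in the OPD summary (updates are sparse, and they fall mostly where the source weights are near zero) can be measured with very little code. The sketch below works on flattened weight lists. The function names, the thresholds `tol` and `small`, and the toy numbers are all assumptions made for illustration, not the paper's measurement protocol.

```python
from typing import Sequence


def update_sparsity(src: Sequence[float], new: Sequence[float],
                    tol: float = 1e-8) -> float:
    """Fraction of coordinates whose change is at most `tol` in magnitude."""
    if len(src) != len(new) or not src:
        raise ValueError("weights must be non-empty and of equal length")
    unchanged = sum(1 for s, n in zip(src, new) if abs(n - s) <= tol)
    return unchanged / len(src)


def update_mass_on_small_weights(src: Sequence[float], new: Sequence[float],
                                 small: float = 1e-2) -> float:
    """Share of the squared update norm on coordinates with |src| <= small."""
    if len(src) != len(new):
        raise ValueError("weights must have equal length")
    total = 0.0
    on_small = 0.0
    for s, n in zip(src, new):
        sq_change = (n - s) ** 2
        total += sq_change
        if abs(s) <= small:
            on_small += sq_change
    return on_small / total if total > 0 else 0.0


# Toy flattened weights before and after a hypothetical update.
source_weights = [0.5, 0.001, -0.8, 0.0, 0.3, -0.002]
updated_weights = [0.5, 0.101, -0.8, 0.1, 0.31, -0.002]

sparsity = update_sparsity(source_weights, updated_weights)
small_share = update_mass_on_small_weights(source_weights, updated_weights)
```

In this toy case half of the coordinates are untouched and nearly all of the update energy sits on near-zero source weights. The spectral claims would need a singular value decomposition from an array library, which is left out here.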
The paper introduces SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, comprising 31 datasets across 7 task types. An evaluation of 31 embedding models shows that large instruction-tuned multilingual models perform best, while existing Slovak-specific NLU models transfer poorly to embedding tasks. The authors develop e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by vocabulary trimming and fine-tuning Multilingual E5 models. Despite size reductions of up to 62%, these open-source models perform competitively with proprietary APIs and are suitable for local deployment in semantic search and RAG. The benchmark, models, datasets, and code are released openly, offering a replicable path for other under-resourced languages.
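Vocabulary trimming, as mentioned for the e5-sk models, can be sketched in general terms: keep only the tokens observed in target-language text (plus special tokens), renumber them, and keep the matching rows of the embedding matrix. The function `trim_vocabulary`, the special-token names, and the toy vocabulary are hypothetical. The authors' actual procedure is not described in this summary.

```python
from typing import Dict, List, Sequence, Set, Tuple


def trim_vocabulary(
    vocab: Dict[str, int],
    embeddings: List[List[float]],
    keep_tokens: Set[str],
    special_tokens: Sequence[str] = ("<pad>", "<unk>"),
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Drop tokens outside `keep_tokens`, preserving the original id order.

    Returns a new token-to-id map with contiguous ids and the embedding
    rows that belong to the surviving tokens.
    """
    by_id = sorted(vocab.items(), key=lambda item: item[1])
    kept = [tok for tok, _ in by_id
            if tok in keep_tokens or tok in special_tokens]
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    new_embeddings = [embeddings[vocab[tok]] for tok in kept]
    return new_vocab, new_embeddings


# Toy multilingual vocabulary; row i of the matrix is [i, i].
toy_vocab = {"<pad>": 0, "<unk>": 1, "ahoj": 2, "hello": 3, "svet": 4, "world": 5}
toy_matrix = [[float(i), float(i)] for i in range(len(toy_vocab))]

# Tokens seen in a (hypothetical) Slovak corpus.
slovak_tokens = {"ahoj", "svet"}
small_vocab, small_matrix = trim_vocabulary(toy_vocab, toy_matrix, slovak_tokens)
```

Because multilingual encoders spend a large share of their parameters on the embedding table, removing unused rows shrinks the model without touching the transformer layers. Fine-tuning afterwards, as the summary describes, is a separate step.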
This paper introduces RACES, a framework that treats verifiable environments as composable building blocks, automatically fusing them into new training environments when their input-output types align. Using 300 base environments and composition operators (SEQUENTIAL, PARALLEL, SORT, SELECT), RL training on composite environments consistently enhances reasoning generalization. Experiments show a 3.1-point average gain on six unseen benchmarks for DeepSeek-R1-Distill-Qwen-14B (48.2→51.3) and a 2.3-point gain for Qwen3-14B (58.8→61.1). Training with only 50 base environments reaches performance comparable to using all 300, demonstrating efficient environment scaling.
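The idea of fusing environments when their input-output types align can be illustrated with a small sketch. Only the operator names SEQUENTIAL and PARALLEL come from the summary. The `Env` class, its fields, and the semantics given to each operator below are assumptions for illustration, not the RACES implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Env:
    """A verifiable environment with typed input and output (hypothetical)."""
    name: str
    in_type: type
    out_type: type
    solve: Callable[[Any], Any]  # reference solution that defines the answer

    def verify(self, task_input: Any, answer: Any) -> bool:
        """Outcome reward: the answer must match the reference solution."""
        return self.solve(task_input) == answer


def sequential(first: Env, second: Env) -> Env:
    """Feed the output of `first` into `second`; the types must align."""
    if first.out_type is not second.in_type:
        raise TypeError(
            f"cannot chain {first.name} -> {second.name}: "
            f"{first.out_type.__name__} is not {second.in_type.__name__}"
        )
    return Env(
        name=f"SEQUENTIAL({first.name}, {second.name})",
        in_type=first.in_type,
        out_type=second.out_type,
        solve=lambda x: second.solve(first.solve(x)),
    )


def parallel(left: Env, right: Env) -> Env:
    """Solve two independent tasks given as a pair of inputs."""
    return Env(
        name=f"PARALLEL({left.name}, {right.name})",
        in_type=tuple,
        out_type=tuple,
        solve=lambda pair: (left.solve(pair[0]), right.solve(pair[1])),
    )


# Two toy base environments.
length_env = Env("length", str, int, len)
double_env = Env("double", int, int, lambda n: 2 * n)

length_then_double = sequential(length_env, double_env)
both = parallel(length_env, double_env)
```

Here `length_then_double.verify("abcd", 8)` holds, while `sequential(double_env, length_env)` is rejected because an `int` output cannot feed a `str` input. Type checks of this kind are what would let composition be automated over a large pool of base environments.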
This paper introduces a data-centric post-training pipeline that applies interpretability protocols to preference datasets, uncovering latent concepts that distinguish preferred from dispreferred model outputs and making them explicit for user feedback. The approach diagnoses undesirable signals such as over-stylization and sycophancy, and mitigates off-target learning by intervening on the learning signal at the concept level. It unifies several interpretability-based training protocols as ways of shaping rewards through feature or data interventions. Empirically, the method amplifies desired properties like safeguards and model personality, turning opaque scalar reward optimization into an auditable process of sculpting the training signal.
This paper reinterprets supervised fine-tuning as a target distribution design problem. The Q-target framework decomposes SFT supervision into two choices: how strongly to rely on the observed token, and how to allocate the remaining probability mass to alternatives. This unifies many existing SFT variants as implicit selections of the target distribution Q. The authors propose Target-SFT, which constructs the training objective directly from the desired target distribution. Across ten dataset-model combinations on reasoning tasks, Target-SFT consistently outperforms conventional SFT and other variants, which the authors present as evidence for a more fundamental SFT design principle.
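The two choices named in the Q-target summary can be made concrete with a small sketch: a weight `alpha` on the observed token and a distribution over alternatives that receives the remaining mass, trained with cross-entropy against the model's logits. The parameter names, the helper functions, and this particular parameterization are assumptions. The summary does not specify how Target-SFT constructs its target.

```python
import math
from typing import List, Optional, Sequence


def build_target(vocab_size: int, observed: int, alpha: float,
                 alt_weights: Optional[Sequence[float]] = None) -> List[float]:
    """Target Q: mass `alpha` on the observed token, the rest on alternatives.

    `alt_weights` gives unnormalized preferences over alternatives; the entry
    for the observed token is ignored. Defaults to uniform alternatives.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if alt_weights is None:
        alt_weights = [1.0] * vocab_size
    alt = [0.0 if i == observed else w for i, w in enumerate(alt_weights)]
    q = [0.0] * vocab_size
    if alpha < 1.0:
        norm = sum(alt)
        if norm <= 0:
            raise ValueError("no alternative tokens to receive remaining mass")
        q = [(1.0 - alpha) * w / norm for w in alt]
    q[observed] = alpha
    return q


def cross_entropy(q: Sequence[float], logits: Sequence[float]) -> float:
    """-sum_i q_i * log softmax(logits)_i, computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return -sum(qi * (li - log_norm) for qi, li in zip(q, logits))


# alpha = 1 recovers the one-hot target of conventional SFT.
one_hot = build_target(vocab_size=4, observed=1, alpha=1.0)
# alpha < 1 with uniform alternatives gives a label-smoothing-like target.
smoothed = build_target(vocab_size=4, observed=1, alpha=0.9)
loss = cross_entropy(smoothed, [0.0, 2.0, 0.0, 0.0])
```

Under this reading, variants of SFT differ only in how they set `alpha` and `alt_weights`, for instance by taking the alternatives from the model's own predictions instead of a uniform distribution.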