The paper proposes Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. It first trains a reasoning-aware retriever via gold-relevance distillation, so that contexts are ranked by expected reasoning benefit rather than semantic overlap. The policy model is then fine-tuned with reinforcement learning on retrieved analogous demonstrations under verifiable outcome rewards, so that it learns to make use of the reasoning traces in those demonstrations. Analysis shows that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct scaffolding for each problem. On AIME 2025, RA-RFT improves average@32 accuracy over GRPO by 7.1 points for Qwen3-1.7B and 2.8 points for Qwen3-4B, indicating that reasoning-aware retrieval is an improvement orthogonal to reward design and training curricula.
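For readers unfamiliar with the metric, here is a minimal sketch of how an average@k score is commonly computed. It assumes average@32 means the fraction of correct answers among 32 sampled responses per problem, averaged over problems; the summary does not spell out the paper's exact protocol. The function name and toy data are illustrative only.

```python
from statistics import mean


def average_at_k(correct):
    """Mean per-problem accuracy over k sampled answers.

    `correct` is a list with one entry per problem; each entry is a list of
    booleans, one per sampled answer (k = 32 for average@32).
    """
    per_problem = [sum(samples) / len(samples) for samples in correct]
    return mean(per_problem)


# Toy data with k = 4 samples per problem (not results from the paper).
results = [
    [True, True, False, False],    # 2 of 4 correct
    [False, False, False, False],  # 0 of 4 correct
    [True, True, True, True],      # 4 of 4 correct
]
score = average_at_k(results)
```

Under this reading, a gain of 7.1 points corresponds to a difference of 0.071 in the value returned above.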
Papers · Source: arXiv · Importance: 4/5
The paper introduces SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, comprising 31 datasets across 7 task types. An evaluation of 31 embedding models shows that large instruction-tuned multilingual models perform best, while existing Slovak-specific NLU models transfer poorly to embedding tasks. The authors develop e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by trimming the vocabulary of Multilingual E5 models and fine-tuning them. Despite size reductions of up to 62%, these open-source models perform competitively with proprietary APIs and are suitable for local deployment in semantic search and RAG. The benchmark, models, datasets, and code are released openly, offering a replicable path for other under-resourced languages.
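The general idea behind vocabulary trimming can be sketched as follows. This is a toy illustration, not the authors' procedure: it assumes trimming keeps only the token ids observed in a target-language corpus (plus required special tokens), remaps them to a dense id range, and keeps the matching embedding rows. All names and data are hypothetical.

```python
def trim_vocabulary(vocab, embeddings, corpus_token_ids, always_keep=()):
    """Keep only tokens seen in the corpus and remap their ids densely.

    vocab: dict mapping token string -> old id
    embeddings: list of embedding rows, indexed by old id
    corpus_token_ids: iterable of old ids observed in target-language text
    always_keep: token strings to retain regardless (e.g. special tokens)
    """
    used = set(corpus_token_ids) | {vocab[tok] for tok in always_keep}
    kept_old_ids = sorted(used)
    old_to_new = {old: new for new, old in enumerate(kept_old_ids)}
    new_vocab = {tok: old_to_new[i] for tok, i in vocab.items() if i in old_to_new}
    new_embeddings = [embeddings[i] for i in kept_old_ids]
    return new_vocab, new_embeddings, old_to_new


# Toy multilingual vocabulary with 2-dimensional embeddings.
vocab = {"<pad>": 0, "ahoj": 1, "hello": 2, "svet": 3, "world": 4}
embeddings = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
slovak_corpus_ids = [1, 3, 1]  # only "ahoj" and "svet" appear

new_vocab, new_embeddings, old_to_new = trim_vocabulary(
    vocab, embeddings, slovak_corpus_ids, always_keep=("<pad>",)
)
```

Because multilingual models spend a large share of their parameters on the embedding table, dropping unused rows is one plausible source of the size reductions described above.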
Papers · Source: arXiv · Importance: 4/5
The paper introduces FORGE, a benchmark that measures how often search-augmented LLMs recommend fake products when retrieved web pages are polluted. FORGE rewrites real product descriptions into fake ones across 225 products, 15 categories, and 5 consumer scenarios, then tests 12 commercial and open-weights LLMs. A single polluted page yields fooled-recommendation rates of up to 27%, and replacing the top-3 search results raises the rate to 73.8%. Vulnerability varies by category, with less familiar products more easily exploited, and reasoning models sometimes worsen the problem by fabricating social proof. Three defenses are evaluated: skepticism prompting, consensus filtering over model priors, and cross-document evidence. However, skepticism prompting can backfire, and filtering may suppress legitimate recommendations.
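The measurement setup described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: pollution replaces the first k search results with fake pages, and the fooled-recommendation rate is the share of trials in which the model's recommendation is a fake product. The function names and identifiers are hypothetical and do not come from the benchmark's code.

```python
def pollute_top_k(results, fake_pages, k):
    """Return a copy of `results` with the first k entries replaced by fake pages."""
    k = min(k, len(results), len(fake_pages))
    return fake_pages[:k] + results[k:]


def fooled_rate(recommended_ids, fake_ids):
    """Fraction of trials whose recommended product is a known fake."""
    fake = set(fake_ids)
    fooled = sum(1 for rec in recommended_ids if rec in fake)
    return fooled / len(recommended_ids)


# Toy search results and fake pages (illustrative identifiers only).
search_results = ["real-1", "real-2", "real-3", "real-4"]
fakes = ["fake-1", "fake-2", "fake-3"]
polluted = pollute_top_k(search_results, fakes, k=3)

# One recommended product per trial, as a model might return them.
recommendations = ["fake-1", "real-4", "real-2", "fake-2"]
rate = fooled_rate(recommendations, fakes)
```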
SECDA-DSE is a framework that integrates Large Language Models into the SECDA ecosystem for design space exploration of FPGA-based accelerators. It combines a structured DSE Explorer with an LLM Stack using retrieval-augmented generation and chain-of-thought prompting, along with an iterative feedback loop for refinement. The paper extends the evaluation by generating three accelerator designs—element-wise vector multiplication, 2D convolution, and matrix transpose—and performing end-to-end execution on FPGA hardware. Results show that SECDA-DSE produces SECDA-compliant designs that are successfully synthesized and executed on FPGA, capturing kernel-specific trade-offs between compute parallelism and data movement. This demonstrates the potential of LLM-guided exploration to adapt architectural configurations across diverse workloads while reducing exploration time and the need for extensive human expertise.
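The iterative explore-and-refine loop described above can be sketched generically. This is not SECDA-DSE's implementation: the proposer below is a plain enumeration standing in for the LLM Stack, and the cost function is an invented toy model of the trade-off between compute parallelism and data movement. All names, parameters, and numbers are assumptions for illustration.

```python
from itertools import product
from math import inf


def toy_cost(config):
    """Hypothetical cost: more lanes cut compute time, larger buffers cut
    data-movement time, and both add a resource penalty."""
    compute = 1024 / config["lanes"]
    movement = 4096 / config["buffer_kb"]
    resources = 2 * config["lanes"] + 0.5 * config["buffer_kb"]
    return compute + movement + resources


def propose_untried(space, history):
    """Stand-in for an LLM proposer: return the next configuration not yet tried."""
    tried = [cfg for cfg, _ in history]
    for lanes, buffer_kb in product(space["lanes"], space["buffer_kb"]):
        candidate = {"lanes": lanes, "buffer_kb": buffer_kb}
        if candidate not in tried:
            return candidate
    return None


def explore(space, evaluate, propose, rounds):
    """Propose, evaluate, and record feedback; keep the best design seen."""
    best, best_cost, history = None, inf, []
    for _ in range(rounds):
        candidate = propose(space, history)
        if candidate is None:  # design space exhausted
            break
        cost = evaluate(candidate)
        history.append((candidate, cost))  # feedback visible to the next proposal
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best, best_cost, history


space = {"lanes": [4, 8, 16, 32], "buffer_kb": [16, 32, 64]}
best, best_cost, history = explore(space, toy_cost, propose_untried, rounds=12)
```

In a real flow, the evaluation step would be synthesis and execution on FPGA hardware rather than a closed-form cost, and the proposer would use the recorded history to steer later candidates.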