This paper introduces MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals, each paired with PI/ECO criteria, a 140k PubMed retrieval corpus, verified relevant studies, and hard negative distractors. Ten pipeline configurations (nine retrieval-augmented generation (RAG) variants and one protocol-driven agent) were benchmarked on the full retrieval-screening-synthesis workflow. While retrieval recall reached 90.9% at K=200, no system achieved more than 52.7% recall for ground-truth included studies, exposing a critical screening bottleneck: current LLMs struggle to reliably distinguish PI/ECO-eligible articles from topically similar but ineligible distractors. To isolate failure points, the authors propose stage-attributed metrics instead of a single end-to-end score.
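The idea of stage-attributed metrics can be illustrated with a minimal sketch that splits recall into a retrieval stage and a screening stage. The function name, its arguments, and the exact definitions are assumptions made for illustration; the paper's own metric definitions may differ.

```python
def stage_recall(relevant, retrieved, screened_in):
    """Return (retrieval, screening, end_to_end) recall for one meta-analysis.

    relevant    -- IDs of ground-truth included studies
    retrieved   -- IDs returned by the retriever (e.g. the top-K list)
    screened_in -- IDs the screening step decided to include
    All names here are hypothetical, not taken from the paper.
    """
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant must be non-empty")

    retrieved_hits = relevant & set(retrieved)
    included_hits = retrieved_hits & set(screened_in)

    # Share of relevant studies that the retriever surfaced.
    retrieval = len(retrieved_hits) / len(relevant)
    # Share of surfaced relevant studies that screening kept.
    screening = len(included_hits) / len(retrieved_hits) if retrieved_hits else 0.0
    # Share of relevant studies that survived both stages.
    end_to_end = len(included_hits) / len(relevant)
    return retrieval, screening, end_to_end
```

Under these definitions the end-to-end value is the product of the two stage values, so a low end-to-end score can be attributed to whichever stage contributes the smaller factor.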
This paper proposes an agentic large language model (LLM) framework for Canadian 10-digit Harmonized Tariff Schedule (HTS) code classification in maritime logistics. The framework combines multi-agent retrieval over official tariff documents, evidence-grounded reasoning, consensus-based validation with element-wise voting across hierarchical code components, confidence estimation, and human-in-the-loop escalation. Evaluation on a private dataset of 3,300 expert-labeled product records reveals that exact 10-digit classification remains difficult, with accuracy sharply declining from coarse chapter level to fine-grained tariff and statistical suffix levels. The results underscore the necessity of interpretable, uncertainty-aware, and human-centered classification workflows over fully autonomous single-step prediction. The code is publicly available.
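Element-wise voting with escalation might look roughly like the sketch below, which votes level by level on prefixes of the 10-digit code and hands off to a human when agreement is too low. The prefix lengths, the agreement threshold, and the function name are assumptions for illustration, not the framework's actual implementation.

```python
from collections import Counter

# Assumed prefix lengths: chapter, heading, subheading, tariff item,
# statistical suffix. The paper may segment the code differently.
LEVELS = (2, 4, 6, 8, 10)


def vote(codes, threshold=0.5):
    """Majority-vote a 10-digit code level by level.

    Returns (agreed_prefix, confidence, escalate). `threshold` is a
    hypothetical minimum share of all valid votes needed to accept a level.
    """
    candidates = [c for c in codes if len(c) == 10 and c.isdigit()]
    if not candidates:
        return "", 0.0, True

    total = len(candidates)
    prefix, confidence = "", 1.0
    for end in LEVELS:
        # Only predictions consistent with the agreed prefix vote on the next level.
        pool = [c for c in candidates if c.startswith(prefix)]
        winner, count = Counter(c[:end] for c in pool).most_common(1)[0]
        share = count / total
        if share <= threshold:
            # Too little agreement: keep the coarser prefix and escalate.
            return prefix, confidence, True
        prefix, confidence = winner, share
    return prefix, confidence, False
```

Returning the longest agreed prefix instead of a full guess keeps the coarse levels usable even when the fine-grained digits need human review.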
The paper proposes Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. It first trains a reasoning-aware retriever via gold-relevance distillation, so that contexts are ranked by expected reasoning benefit rather than semantic overlap. The policy model is then fine-tuned with reinforcement learning on retrieved analogous demonstrations under verifiable outcome rewards, enabling it to leverage their reasoning traces. Analysis shows that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct scaffolding per problem. On AIME 2025, RA-RFT improves average@32 accuracy over GRPO by 7.1 points for Qwen3-1.7B and 2.8 points for Qwen3-4B, suggesting that reasoning-aware retrieval is complementary to improvements in reward design or training curricula.
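For readers unfamiliar with the metric, average@32 is commonly read as the accuracy averaged over 32 sampled completions per problem and then over problems. The sketch below assumes that reading; the function name and input layout are hypothetical, and the paper's exact evaluation protocol is not given here.

```python
def average_at_k(correct, k):
    """Mean accuracy over k samples per problem, then averaged over problems.

    correct -- one list per problem, holding k booleans (True if that
               sampled completion reached the right final answer).
    """
    if not correct:
        raise ValueError("need at least one problem")

    per_problem = []
    for samples in correct:
        if len(samples) != k:
            raise ValueError("every problem needs exactly k samples")
        # True counts as 1, False as 0.
        per_problem.append(sum(samples) / k)
    return sum(per_problem) / len(per_problem)
```

Unlike pass@k, which credits a problem if any sample is correct, this average rewards a model for being right consistently across samples.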
The paper introduces SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, comprising 31 datasets across 7 task types. Evaluation of 31 embedding models shows that large instruction-tuned multilingual models perform best, while existing Slovak-specific NLU models transfer poorly to embedding tasks. The authors develop e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by vocabulary trimming and fine-tuning Multilingual E5 models. Despite size reductions of up to 62%, these open-source models perform competitively with proprietary APIs and are suitable for local deployment in semantic search and RAG. The benchmark, models, datasets, and code are released openly, offering a replicable path for other under-resourced languages.
The paper introduces FORGE, a benchmark that measures how often search-augmented LLMs recommend fake products when retrieved web pages are polluted. FORGE rewrites real product descriptions into fake ones across 225 products, 15 categories, and 5 consumer scenarios, then tests 12 commercial and open-weights LLMs. A single polluted page yields fooled-recommendation rates of up to 27%, and replacing the top-3 search results raises the rate to 73.8%. Vulnerability varies by category, with less familiar products more easily exploited, and reasoning models sometimes worsen the problem by fabricating social proof. Three defenses are evaluated (skepticism prompting, consensus filtering over model priors, and cross-document evidence), but skepticism can backfire and filtering may suppress legitimate recommendations.
SECDA-DSE is a framework that integrates Large Language Models into the SECDA ecosystem for design space exploration of FPGA-based accelerators. It combines a structured DSE Explorer with an LLM Stack using retrieval-augmented generation and chain-of-thought prompting, along with an iterative feedback loop for refinement. The paper extends the evaluation by generating three accelerator designs (element-wise vector multiplication, 2D convolution, and matrix transpose) and executing them end to end on FPGA hardware. Results show that SECDA-DSE produces SECDA-compliant designs that are successfully synthesized and executed on FPGA, capturing kernel-specific trade-offs between compute parallelism and data movement. This demonstrates the potential of LLM-guided exploration to adapt architectural configurations across diverse workloads while reducing exploration time and the need for extensive human expertise.