Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio
Abstract
This paper introduces MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals, each paired with PI/ECO criteria, a 140k-article PubMed retrieval corpus, verified relevant studies, and hard negative distractors. Twelve pipeline configurations, spanning retrieval-augmented generation (RAG) variants and a protocol-driven agent, were benchmarked on the full retrieval-screening-synthesis workflow. Although retrieval recall reached 90.9% at K=200, no system recovered more than 52.7% of the ground-truth included studies, which exposes screening as the critical bottleneck: current LLMs struggle to reliably distinguish PI/ECO-eligible articles from topically similar but ineligible distractors. To isolate failure points, the authors propose stage-attributed metrics in place of a single end-to-end score.
Key Points
MetaSyn dataset: 442 Nature Portfolio meta-analyses with PI/ECO criteria, PubMed search strategies, and hard negative distractors.
Twelve systems, covering RAG variants and a protocol-driven agent, were benchmarked on the end-to-end meta-analysis workflow.
Retrieval recall is high (90.9% at K=200), but screening is the bottleneck: the best system recalls only 52.7% of the ground-truth included studies.
LLMs cannot reliably separate PI/ECO-eligible studies from topically similar but ineligible articles, revealing a fundamental reasoning gap.
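The stage-attributed evaluation mentioned in the abstract can be illustrated with a minimal sketch. The function below is not the authors' implementation. It assumes that each meta-analysis provides a set of ground-truth included study IDs, a top-K retrieved list, and the subset a screening step kept. All names (`StageMetrics`, `stage_metrics`, the study IDs) are hypothetical, and the data is a toy example rather than anything drawn from MetaSyn.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    retrieval_recall: float     # share of ground-truth studies present in the retrieved set
    screening_recall: float     # share of retrieved ground-truth studies that screening kept
    screening_precision: float  # share of screened-in studies that are ground truth
    end_to_end_recall: float    # share of ground-truth studies surviving both stages


def stage_metrics(included, retrieved, screened_in):
    """Attribute recall loss to retrieval vs. screening (hypothetical helper)."""
    included = set(included)
    retrieved = set(retrieved)
    # Screening can only keep studies that retrieval actually returned.
    screened_in = set(screened_in) & retrieved

    found = included & retrieved   # ground truth that retrieval surfaced
    kept = included & screened_in  # ground truth that screening retained

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return StageMetrics(
        retrieval_recall=ratio(len(found), len(included)),
        screening_recall=ratio(len(kept), len(found)),
        screening_precision=ratio(len(kept), len(screened_in)),
        end_to_end_recall=ratio(len(kept), len(included)),
    )


# Toy example: "s*" are ground-truth studies, "d*" are distractors (made-up IDs).
included = {"s1", "s2", "s3", "s4"}
retrieved = ["s1", "s2", "s3", "d1", "d2"]
screened_in = ["s1", "d1"]

m = stage_metrics(included, retrieved, screened_in)
print(m)
```

Under these definitions, end-to-end recall is the product of retrieval recall and screening recall. A low end-to-end score can therefore be traced to the stage that lost the studies, which is the kind of diagnosis a single aggregate score hides.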