Source: arXiv · June 16, 2026 · Importance: 4/5

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

Abstract

This paper introduces MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals, each paired with PI/ECO criteria, a 140k-article PubMed retrieval corpus, verified relevant studies, and hard negative distractors. Twelve pipeline configurations, combining retrieval-augmented generation (RAG) variants with one protocol-driven agent, were benchmarked on the full retrieval-screening-synthesis workflow. Although retrieval recall reached 90.9% at K=200, no system recovered more than 52.7% of the ground-truth included studies, exposing a critical screening bottleneck: current LLMs struggle to reliably distinguish PI/ECO-eligible articles from topically similar but ineligible distractors. To isolate failure points, the authors propose stage-attributed metrics instead of a single end-to-end score.
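
The idea of stage-attributed metrics can be illustrated with a minimal sketch. The digest does not give the paper's exact metric definitions, so everything here is an assumption made for illustration: the function name `stage_attributed_metrics`, the `StageMetrics` fields, the hypothetical study IDs, and the choice to measure screening recall only over ground-truth studies that retrieval actually surfaced. Under that choice, end-to-end recall factors into retrieval recall times screening recall, which shows which stage loses the studies.

```python
from dataclasses import dataclass


@dataclass
class StageMetrics:
    retrieval_recall: float     # ground-truth studies present in the top-K retrieved set
    screening_recall: float     # retrieved ground-truth studies that survive screening
    end_to_end_recall: float    # ground-truth studies present in the final included set
    screening_precision: float  # share of included studies that are ground truth


def stage_attributed_metrics(ground_truth, retrieved_top_k, included):
    """Split end-to-end inclusion recall into per-stage components (illustrative only)."""
    truth = set(ground_truth)
    if not truth:
        raise ValueError("ground_truth must contain at least one study")
    retrieved = set(retrieved_top_k)
    # Screening can only keep studies that retrieval surfaced.
    kept = set(included) & retrieved

    found = truth & retrieved   # ground truth that retrieval surfaced
    kept_true = truth & kept    # ground truth that also passed screening

    return StageMetrics(
        retrieval_recall=len(found) / len(truth),
        screening_recall=len(kept_true) / len(found) if found else 0.0,
        end_to_end_recall=len(kept_true) / len(truth),
        screening_precision=len(kept_true) / len(kept) if kept else 0.0,
    )


# Toy data with hypothetical IDs: 10 ground-truth studies and 3 hard negatives.
truth_ids = [f"study-{i}" for i in range(1, 11)]
retrieved_ids = truth_ids[:9] + ["distractor-1", "distractor-2", "distractor-3"]
included_ids = truth_ids[:5] + ["distractor-1", "distractor-2"]

metrics = stage_attributed_metrics(truth_ids, retrieved_ids, included_ids)
print(metrics)
```

In this toy case retrieval finds 9 of the 10 ground-truth studies, while screening keeps only 5 of those 9 and admits 2 distractors. A single end-to-end score of 0.5 would hide that most of the loss happens at screening.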

Key points

MetaSyn dataset: 442 Nature Portfolio meta-analyses, each with PI/ECO criteria, PubMed search strategies, and hard negative distractors.
The benchmark covers 12 systems that combine RAG and agent pipelines across the end-to-end meta-analysis workflow.
The retrieval recall ceiling is high (90.9% at K=200), but screening fails: the best inclusion recall is only 52.7%.
LLMs cannot reliably separate PI/ECO-eligible studies from topically similar but ineligible articles, which reveals a fundamental reasoning gap.