ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
Abstract
This paper introduces ReproRepo, a scalable framework for evaluating LLM agents on research reproducibility that uses human-raised GitHub issues as naturally occurring supervision. The framework is instantiated on 1,149 recent machine learning papers from major conferences and evaluates four frontier model-agent configurations. The best configuration, Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for approximately 90% of the papers. Further analysis shows that agents are particularly effective at flagging visible failures and locating the correct semantic region, while exact bug localization remains a weakness. The code is publicly released.
Key points
ReproRepo uses real GitHub issues as natural supervision to evaluate reproducibility auditing by LLM agents, enabling scalable evaluation without manual curation.
The framework is instantiated on 1,149 recent ML papers from major conferences, evaluating four frontier model-agent configurations (including Codex with GPT-5.5).
The best configuration surfaces at least one semantically related human-reported blocker for roughly 90% of the papers, which indicates strong paper-level recall.
Agents excel at flagging visible failures and identifying the correct semantic area but often fail to pinpoint the exact location of the issue.
The code and framework are open-source, providing a reusable benchmark for future LLM agent evaluations on real-world reproducibility audits.
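The headline figure (at least one semantically related human-reported blocker surfaced per paper) is a paper-level statistic, which can differ a lot from the share of individual issues matched. The sketch below illustrates that distinction on toy data. It is not the ReproRepo implementation: the function names, the input layout, and the example values are all hypothetical, and the per-issue match flags are assumed to come from some separate semantic-matching step that is not specified here.

```python
from typing import Dict, List

# Hypothetical input layout: for each paper, one boolean per human-raised
# GitHub issue, True if some agent finding was judged semantically related.
MatchTable = Dict[str, List[bool]]


def paper_hit_rate(matches: MatchTable) -> float:
    """Fraction of papers with at least one matched human-reported issue."""
    if not matches:
        raise ValueError("no papers to score")
    hits = sum(1 for flags in matches.values() if any(flags))
    return hits / len(matches)


def issue_recall(matches: MatchTable) -> float:
    """Fraction of all human-reported issues that were matched."""
    total = sum(len(flags) for flags in matches.values())
    if total == 0:
        raise ValueError("no issues to score")
    matched = sum(sum(flags) for flags in matches.values())
    return matched / total


# Toy data, invented for illustration only.
example: MatchTable = {
    "paper_a": [True, False, False],
    "paper_b": [False, False],
    "paper_c": [True, True],
    "paper_d": [False, True, False],
}

if __name__ == "__main__":
    print(f"paper-level hit rate: {paper_hit_rate(example):.2f}")
    print(f"per-issue recall:     {issue_recall(example):.2f}")
```

On this toy table, three of four papers have at least one match while only four of ten issues are matched, so a high paper-level rate does not by itself imply that most individual issues were found.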