Automated reproducibility assessments in the social and behavioral sciences using large language models
Abstract
Researchers evaluated an LLM pipeline on 76 published social and behavioral science studies with predefined claims. Excluding 7 studies where the LLM failed to produce a viable effect size estimate, the pipeline recovered the original effect size within ±0.05 Cohen's d in 41% of the remaining 69 studies. It reached the same qualitative conclusion as the original study in 96% of cases. Human reanalysts, by comparison, achieved 34% effect-size recovery and 74% conclusion agreement. These findings suggest LLMs can automate and scale reproducibility assessments, providing a foundation for systematic auditing of empirical results.
Key points
LLM pipeline tested on 76 published studies with predefined claims in social and behavioral sciences.
In 41% of studies (excluding 7 failures), the LLM pipeline recovered the original effect size within ±0.05 Cohen's d.
LLM matched the original study's qualitative conclusion in 96% of cases, vs. 74% for human reanalysts.
Human reanalysts recovered effect sizes in 34% of studies, below the LLM pipeline's 41% (a figure that excludes the failed cases).
The findings suggest LLMs can serve as a scalable tool for automated reproducibility auditing of empirical results.
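The effect-size criterion in the points above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the use of `None` to mark a study where no viable estimate was produced, and the sample values are all hypothetical.

```python
from typing import Optional, Sequence

TOLERANCE = 0.05  # the ±0.05 Cohen's d tolerance quoted in the summary


def is_recovered(original_d: float, reproduced_d: float, tol: float = TOLERANCE) -> bool:
    """Return True when the reproduced effect size lies within ±tol of the original."""
    return abs(reproduced_d - original_d) <= tol


def recovery_rate(
    original: Sequence[float],
    reproduced: Sequence[Optional[float]],
) -> float:
    """Share of studies whose effect size was recovered.

    Studies with no viable estimate (None) are dropped from the
    denominator, mirroring the "excluding failures" convention.
    """
    usable = [(o, r) for o, r in zip(original, reproduced) if r is not None]
    if not usable:
        raise ValueError("no study produced a usable effect size estimate")
    hits = sum(is_recovered(o, r) for o, r in usable)
    return hits / len(usable)


# Hypothetical values for illustration only; they are not data from the study.
original_d = [0.30, 0.50, 0.20, 0.80]
reproduced_d = [0.32, 0.41, None, 0.79]  # None marks a failed run

rate = recovery_rate(original_d, reproduced_d)
print(f"{rate:.0%}")  # 67% (2 of the 3 usable studies)
```

Dropping failed runs from the denominator is why the summary's 41% refers to 76 − 7 = 69 studies rather than all 76.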