This paper introduces MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals, each paired with PI/ECO criteria, a 140k PubMed retrieval corpus, verified relevant studies, and hard negative distractors. Ten pipeline configurations (nine retrieval-augmented generation (RAG) variants and one protocol-driven agent) were benchmarked on the full retrieval-screening-synthesis workflow. Although retrieval recall reached 90.9% at K=200, no system achieved more than 52.7% recall for ground-truth included studies, exposing a critical screening bottleneck: current LLMs struggle to reliably distinguish PI/ECO-eligible articles from topically similar but ineligible distractors. To isolate failure points, the authors propose stage-attributed metrics instead of a single end-to-end score.
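To make the idea of stage-attributed metrics concrete, the following is a minimal sketch that splits end-to-end recall into a retrieval part and a screening part. The function name, the study-ID format, and the exact metric definitions are assumptions for illustration; the paper's own definitions may differ.

```python
def stage_attributed_recall(relevant, retrieved, screened_in):
    """Split end-to-end recall into a retrieval stage and a screening stage.

    All three arguments are iterables of study IDs (a hypothetical format).
    """
    relevant = set(relevant)
    retrieved = set(retrieved)
    # Screening can only keep studies that retrieval surfaced.
    screened_in = set(screened_in) & retrieved

    found = relevant & retrieved  # relevant studies that survived retrieval
    kept = found & screened_in    # ...and that also survived screening

    return {
        "retrieval_recall": len(found) / len(relevant) if relevant else 0.0,
        "screening_recall": len(kept) / len(found) if found else 0.0,
        "end_to_end_recall": len(kept) / len(relevant) if relevant else 0.0,
    }


# Toy example with made-up IDs: retrieval finds 3 of 4 relevant studies,
# screening keeps only 1 of those 3.
scores = stage_attributed_recall(
    relevant={"s1", "s2", "s3", "s4"},
    retrieved={"s1", "s2", "s3", "x1", "x2"},
    screened_in={"s1", "x1"},
)
print(scores)
```

Under these definitions the end-to-end recall is the product of the two stage recalls, so a low final score can be traced to the stage that lost the studies.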
DeepRubric is a data construction framework that reverses the usual process of writing rubrics for an existing query. It first builds an evidence tree by recursively expanding evidence-backed sub-questions from a seed topic, then uses the tree’s leaves as atomic, verifiable evaluation targets from which it synthesizes aligned query–rubric pairs. This ensures that the reward evaluates exactly the information the query requests. Using 9K such query–rubric pairs, the authors train DeepRubric-8B with rubric-based GRPO, achieving performance comparable to prior open state-of-the-art deep research models across three benchmarks while requiring roughly 13× fewer RL GPU-hours.
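The evidence-tree idea can be sketched as follows: leaves are collected from a recursively built tree and each leaf becomes one rubric item. This is only an illustration under stated assumptions. `EvidenceNode`, `collect_leaves`, `rubric_reward`, and the keyword-matching judge are hypothetical names, and the summary does not say how DeepRubric's actual judge works.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvidenceNode:
    """One sub-question in a hypothetical evidence tree."""
    question: str
    evidence: str = ""
    children: List["EvidenceNode"] = field(default_factory=list)


def collect_leaves(node):
    """Return the leaf nodes, which act as atomic rubric targets."""
    if not node.children:
        return [node]
    leaves = []
    for child in node.children:
        leaves.extend(collect_leaves(child))
    return leaves


def rubric_reward(answer, leaves, judge):
    """Fraction of rubric targets that the judge marks as satisfied."""
    if not leaves:
        return 0.0
    return sum(1 for leaf in leaves if judge(answer, leaf)) / len(leaves)


def keyword_judge(answer, leaf):
    # Stand-in judge (assumption): checks that the evidence string appears.
    return leaf.evidence.lower() in answer.lower()


tree = EvidenceNode("seed topic", children=[
    EvidenceNode("sub-question A", children=[
        EvidenceNode("A1", evidence="fact one"),
        EvidenceNode("A2", evidence="fact two"),
    ]),
    EvidenceNode("sub-question B", evidence="fact three"),
])
leaves = collect_leaves(tree)
reward = rubric_reward("The report cites fact one and fact three.", leaves, keyword_judge)
print(len(leaves), reward)
```

Because the rubric items come from the same leaves that would define the query, the reward in this sketch cannot ask for information the query never requested.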
Researchers introduce TuneJury, an open instance-level pairwise reward model for text-to-music that predicts preference scores from a text prompt and an audio clip. The model is trained on publicly available human-preference labels including arena votes, metric-alignment pairs, crowdsourced comparisons, and expert aesthetic ratings. Its score margin is well-calibrated on a held-out test split, enabling data filtering via a simple threshold, and it generalizes to out-of-distribution benchmarks. For generators released after training, the paper proposes anchor calibration, a post-hoc Bradley-Terry calibration that recovers agreement efficiently without retraining. The frozen reward drives consistent gains in three downstream tasks: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available open-source on GitHub.
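As a rough illustration of how a pairwise score margin can be used, the sketch below converts a margin into a Bradley-Terry preference probability, filters pairs by a margin threshold, and performs best-of-N selection with a frozen reward. All function names and the toy reward are assumptions; this is not TuneJury's API, and it does not implement the paper's anchor calibration.

```python
import math


def bt_preference_prob(score_a, score_b):
    """Bradley-Terry probability that A is preferred over B.

    Computed as a logistic function of the score margin, in a
    numerically stable form for large positive or negative margins.
    """
    margin = score_a - score_b
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def filter_pairs(scored_pairs, min_margin):
    """Keep only pairs whose absolute score margin reaches the threshold."""
    return [(a, b) for a, b in scored_pairs if abs(a - b) >= min_margin]


def best_of_n(candidates, reward_fn):
    """Return the candidate with the highest reward under a frozen scorer."""
    return max(candidates, key=reward_fn)


# Toy usage: string length stands in for a real reward model (assumption).
print(bt_preference_prob(2.0, 0.5))
print(filter_pairs([(2.0, 0.5), (1.0, 0.9)], min_margin=1.0))
print(best_of_n(["a", "bbb", "cc"], reward_fn=len))
```

In practice the reward function would score a prompt-audio pair. The length-based scorer here only keeps the example self-contained.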
This paper proposes a Bayesian inference framework to audit frontier AI evaluations using public leaderboard archives such as LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, and tau-bench. It demonstrates that terminal-only performance claims are ambiguous: a single snapshot can be compatible with vastly different pre-terminal histories, with the time needed to approach a ceiling differing by more than a factor of three. Synthetic experiments show that a candidate selection-aware frontier model fails in synthetic recovery, objective-archive prediction, preference transfer, and uncertainty calibration, leading audit gates to reject its stronger claims. The study introduces an archive-and-adjudication protocol that reconstructs public evaluation histories, isolates a verified timing boundary, and falsifies unsupported frontier assertions, providing a rigorous method for interpreting leaderboard data.
This paper introduces a multi-center benchmark for multi-organ abdominal disease diagnosis and automated radiology report generation using only non-contrast CT (NCCT), aiming to eliminate contrast agent risks. A large-scale paired NCCT-CECT dataset with corresponding reports was curated from two centers, split into internal and external cohorts. Five contemporary deep learning architectures, including chest-specific, abdomen-specific, and general-purpose multimodal models, were evaluated under a unified protocol. NCCT-based models achieved an average multi-organ AUC of 69.1% on the internal set and 63.1% on the external set, demonstrating retained diagnostic signals. The authors release the dataset, code, and benchmark publicly to advance contrast-free abdominal imaging research.
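For readers unfamiliar with the headline metric, the sketch below shows one common way an average multi-organ AUC can be computed: a per-organ AUC from pairwise ranking, followed by an unweighted mean. The organ names, labels, and scores are made up, and whether the benchmark uses exactly this macro-averaging is an assumption.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auc(per_organ):
    """Unweighted mean of per-organ AUCs; per_organ maps name -> (labels, scores)."""
    values = [auc(labels, scores) for labels, scores in per_organ.values()]
    return sum(values) / len(values)


# Hypothetical toy predictions for two organs.
toy = {
    "liver": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "kidney": ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.6]),
}
print(macro_auc(toy))
```

The pairwise loop is quadratic in the number of cases, which is fine for a sketch. A rank-based implementation would be preferable on full cohorts.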
The authors use an open problem from the 2025 ACM EC paper "Stable Menus of Public Goods" as a testbed to investigate the effectiveness of different AI-for-EconCS research workflows. They study three questions: whether including human intuition in the prompt helps, whether automated multi-turn interaction improves results, and whether a large language model (LLM) outperforms a first-year PhD student. The experiments indicate that prompting with human intuition encourages the LLM to exhibit better "taste," and multi-turn workflows are beneficial when they encourage "ambitious" problem-solving steps. When compared using an unpublished manuscript written by the senior authors before working with the PhD student, the LLM is found to be slightly less effective than the first-year PhD student. The paper offers workflow suggestions for integrating LLMs into economic research.