Researchers introduce TuneJury, an open instance-level pairwise reward model for text-to-music that predicts preference scores from a text prompt and an audio clip. The model is trained on publicly available human-preference labels including arena votes, metric-alignment pairs, crowdsourced comparisons, and expert aesthetic ratings. Its score margin is well-calibrated on a held-out test split, enabling data filtering via a simple threshold, and it generalizes to out-of-distribution benchmarks. For generators released after training, the paper proposes anchor calibration, a post-hoc Bradley-Terry calibration that efficiently recovers agreement with human preferences without retraining. The frozen reward drives consistent gains in three downstream tasks: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is released as open source on GitHub.
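Two of the uses named in the TuneJury summary, best-of-N selection and threshold-based data filtering, can be sketched generically. This is a minimal illustration under stated assumptions, not TuneJury's actual interface: the `Reward` signature, the function names, the use of the absolute margin, and the `toy_reward` stand-in (which scores text descriptions rather than audio) are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical reward signature: (prompt, clip) -> scalar preference score.
# Clips are plain strings here; a real model would score audio.
Reward = Callable[[str, str], float]
Pair = Tuple[str, str, str]  # (prompt, clip_a, clip_b)


def best_of_n(prompt: str, candidates: Sequence[str], reward: Reward) -> str:
    """Return the candidate with the highest reward for this prompt."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    return max(candidates, key=lambda clip: reward(prompt, clip))


def filter_by_margin(pairs: Sequence[Pair], reward: Reward, threshold: float) -> List[Pair]:
    """Keep pairs whose absolute score margin reaches the threshold.

    Assumption: a larger margin means a more confident preference.
    """
    kept = []
    for prompt, clip_a, clip_b in pairs:
        margin = reward(prompt, clip_a) - reward(prompt, clip_b)
        if abs(margin) >= threshold:
            kept.append((prompt, clip_a, clip_b))
    return kept


def toy_reward(prompt: str, clip: str) -> float:
    """Stand-in reward: counts prompt words that occur in the clip description."""
    return float(sum(word in clip for word in prompt.split()))


candidates = ["loud drums", "calm piano solo", "calm guitar"]
print(best_of_n("calm piano", candidates, toy_reward))
```

Because the reward stays frozen, the same scoring function can serve both selection and filtering; only the threshold would need tuning per dataset.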
ActiveSAM is a training-free, zero-shot inference framework that adapts SAM 3 for open-vocabulary semantic segmentation by pruning the full dataset vocabulary to an image-conditional active subset via a low-resolution presence preview. Only the retained classes are decoded at full resolution using the frozen SAM 3 decoder with bucketed prompt multiplexing and margin-aware background calibration. On eight OVSS benchmarks, ActiveSAM outperforms the prior state-of-the-art SegEarth-OV3 by +1.4 mIoU on average while running up to 5.5× faster on large-vocabulary datasets. The method requires no target-dataset training, no weight updates, and no oracle class-presence labels. It also exhibits strong robustness under image corruption, making it suitable for noisy-input domains like autonomous driving. Code is available at https://github.com/VILA-Lab/ActiveSAM.
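The vocabulary-pruning idea in the ActiveSAM summary can be illustrated with a small sketch. It assumes per-class presence scores are already available from the low-resolution preview; the function names, the fixed threshold, the `min_keep` fallback, and the reading of "bucketed" as fixed-size chunks of class prompts are assumptions made for illustration, not details from the paper.

```python
from typing import Dict, List


def active_vocabulary(presence: Dict[str, float], threshold: float = 0.5,
                      min_keep: int = 1) -> List[str]:
    """Return class names whose presence score reaches the threshold.

    Classes come back sorted by descending score. If fewer than `min_keep`
    classes pass, the top-scoring ones are kept instead (hypothetical fallback).
    """
    ranked = sorted(presence.items(), key=lambda item: item[1], reverse=True)
    kept = [name for name, score in ranked if score >= threshold]
    if len(kept) < min_keep:
        kept = [name for name, _ in ranked[:min_keep]]
    return kept


def make_buckets(classes: List[str], size: int) -> List[List[str]]:
    """Split the active classes into fixed-size groups for batched decoding."""
    if size < 1:
        raise ValueError("bucket size must be at least 1")
    return [classes[i:i + size] for i in range(0, len(classes), size)]


# Hypothetical preview scores for one image.
scores = {"road": 0.9, "car": 0.7, "boat": 0.1, "sky": 0.6}
active = active_vocabulary(scores, threshold=0.5)
print(active, make_buckets(active, size=2))
```

The speedup on large vocabularies would come from the full-resolution decoder seeing only the retained classes rather than every class in the dataset.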
This paper introduces a multi-center benchmark for multi-organ abdominal disease diagnosis and automated radiology report generation using only non-contrast CT (NCCT), aiming to eliminate contrast agent risks. A large-scale paired NCCT-CECT dataset with corresponding reports was curated from two centers and split into internal and external cohorts. Five contemporary deep learning architectures, including chest-specific, abdomen-specific, and general-purpose multimodal models, were evaluated under a unified protocol. NCCT-based models achieved an average multi-organ AUC of 69.1% on the internal set and 63.1% on the external set, indicating that non-contrast scans retain diagnostic signal. The authors release the dataset, code, and benchmark publicly to advance contrast-free abdominal imaging research.
The paper proposes AdaSR, an adaptive streaming reasoning framework that lets large language models reason during continuous input streaming and perform final deliberation after the stream ends, learning when and how much to think. To optimize this hierarchical process, the authors introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases, provides fine-grained advantage assignment, and combines format, accuracy, and adaptive thinking rewards. Experiments show AdaSR attains a better trade-off among reasoning accuracy, computational efficiency, and streaming latency compared to supervised fine-tuning baselines. Code is publicly released.
The paper presents EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery. It argues the key bottleneck has shifted from designing agent workflows to engineering agent environments that amplify productive behaviors and suppress harmful ones. EurekAgent engineers environments across four dimensions: permissions engineering for bounded execution and isolated evaluation, artifact engineering for filesystem and Git-based collaboration, budget engineering for budget-aware exploration, and human-in-the-loop engineering for easy oversight. The system achieves new state-of-the-art results on mathematics, kernel engineering, and machine learning tasks, including a novel 26-circle packing solution discovered with under $11 total API cost. Code and results are open-sourced, and the authors call for environment engineering as a core research direction for reliable autonomous research agents.
The paper introduces SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, comprising 31 datasets across 7 task types. Evaluation of 31 embedding models shows that large instruction-tuned multilingual models perform best, while existing Slovak-specific NLU models transfer poorly to embedding tasks. The authors develop e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by vocabulary trimming and fine-tuning Multilingual E5 models. Despite size reductions of up to 62%, these open-source models perform competitively with proprietary APIs and are suitable for local deployment in semantic search and RAG. The benchmark, models, datasets, and code are released openly, offering a replicable path for other under-resourced languages.
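Vocabulary trimming, as mentioned in the SkMTEB summary, generally means dropping tokens the target language does not need and keeping only the matching rows of the embedding matrix. The sketch below shows that general idea on toy data; the function name, the list-of-lists embedding table, and the way `keep_tokens` is chosen are assumptions, not the authors' procedure.

```python
from typing import Dict, List, Set, Tuple


def trim_vocabulary(vocab: Dict[str, int], embeddings: List[List[float]],
                    keep_tokens: Set[str]) -> Tuple[Dict[str, int], List[List[float]]]:
    """Keep only `keep_tokens`, re-indexing ids and embedding rows to match.

    Retained tokens preserve their original relative order, so token i in the
    new vocabulary maps to row i of the new embedding table.
    """
    kept = sorted((tok for tok in vocab if tok in keep_tokens), key=lambda tok: vocab[tok])
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    new_embeddings = [embeddings[vocab[tok]] for tok in kept]
    return new_vocab, new_embeddings


# Toy multilingual vocabulary with one embedding row per token id.
vocab = {"<pad>": 0, "the": 1, "mačka": 2, "dog": 3, "pes": 4}
embeddings = [[0.0], [1.0], [2.0], [3.0], [4.0]]

new_vocab, new_embeddings = trim_vocabulary(vocab, embeddings, {"<pad>", "mačka", "pes"})
print(new_vocab, new_embeddings)
```

In multilingual encoders the token embedding table accounts for a large share of the parameters, which is why trimming it can shrink a model substantially while leaving the transformer layers untouched.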