TuneJury: An Open Metric for Improving Music Generation Preference Alignment
Summary
Researchers introduce TuneJury, an open, instance-level pairwise reward model for text-to-music generation that predicts a preference score from a text prompt and an audio clip. The model is trained on publicly available human-preference labels, including arena votes, metric-alignment pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. On a held-out test split its score margin is well calibrated, so data can be filtered with a simple threshold, and the model generalizes to out-of-distribution benchmarks. For generators released after training, the paper proposes anchor calibration, a post-hoc Bradley-Terry calibration that efficiently recovers agreement without retraining. Used as a frozen reward, TuneJury yields consistent gains in three downstream tasks: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is open source and available on GitHub.
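To make the idea of a calibrated score margin concrete, here is a minimal sketch of how a margin between two clips could be read as a preference probability and used for threshold filtering. It assumes a Bradley-Terry (logistic) link on the margin, which is a common choice for pairwise reward models but is not confirmed by this summary. The function names, the scores, and the threshold value are hypothetical and are not part of TuneJury's released code.

```python
import math
from typing import List, Tuple


def preference_probability(score_a: float, score_b: float) -> float:
    """P(A preferred over B), assuming a logistic (Bradley-Terry) link on the margin."""
    margin = score_a - score_b
    # Branch on the sign so math.exp never receives a large positive argument.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    exp_margin = math.exp(margin)
    return exp_margin / (1.0 + exp_margin)


def filter_pairs(
    scored_pairs: List[Tuple[float, float]], min_margin: float
) -> List[Tuple[float, float]]:
    """Keep only pairs whose absolute score margin reaches min_margin (hypothetical threshold)."""
    return [(a, b) for a, b in scored_pairs if abs(a - b) >= min_margin]


# Illustrative scores only; they are not taken from the paper.
example_pairs = [(2.0, 0.5), (1.0, 0.9), (-0.3, 1.4)]
confident_pairs = filter_pairs(example_pairs, min_margin=1.0)
```

If the margin is well calibrated, a larger threshold should keep pairs on which the model's predicted preference is more reliable, at the cost of discarding more data.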
Key points
Open pairwise reward model for text-to-music, trained on diverse public human-preference labels: arena votes, metric-alignment pairs, crowdsourced comparisons, and expert ratings.
The score margin is well calibrated, which enables data filtering via a simple threshold, and the model generalizes to out-of-distribution benchmarks.
Anchor calibration, a post-hoc Bradley-Terry calibration, efficiently recovers agreement for new generators without retraining.
The frozen reward model yields consistent gains across three downstream applications: best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training.
Model and code are open source and available on GitHub.
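Of the three downstream uses, best-of-N selection is the simplest to illustrate: generate N candidates for a prompt, score each with the frozen reward, and keep the highest-scoring one. The sketch below is generic and makes no claim about TuneJury's actual interface. The `reward` callable, its `(prompt, clip)` signature, and the toy stand-in reward are assumptions for illustration.

```python
from typing import Callable, Sequence, TypeVar

Clip = TypeVar("Clip")


def best_of_n(
    prompt: str,
    candidates: Sequence[Clip],
    reward: Callable[[str, Clip], float],
) -> Clip:
    """Return the candidate with the highest reward for the given prompt."""
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    # The reward function is only queried, never updated (it stays frozen).
    return max(candidates, key=lambda clip: reward(prompt, clip))


def toy_reward(prompt: str, clip: str) -> float:
    """Hypothetical stand-in for a learned reward: counts prompt words found in the clip label."""
    return float(sum(word in clip for word in prompt.split()))


selected = best_of_n(
    "calm piano",
    ["loud drums", "calm piano solo", "calm guitar"],
    toy_reward,
)
```

In practice the candidates would be audio clips sampled from a text-to-music generator and `reward` would wrap the trained model; the selection logic itself stays the same.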