Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation
English summary
This paper studies two underexplored aspects of synthetic data curation for post-training: whether filtering signals are grounded in the source provenance of each generation, and whether rejected samples can be systematically recovered rather than discarded. Using adversarially injected corpora to obtain ground-truth failure labels, the authors show that exact source provenance improves faithfulness gating when the judge is a stronger model. Hallucination-based and reward-based gates reject largely disjoint sample populations, so neither gate can substitute for the other. An adaptive recovery pipeline that combines failure diagnosis with targeted regeneration achieves higher yield, recovery rate, and injection recall than naive resampling. Downstream fine-tuning quality is driven primarily by generator scale, with filtration and recovery making a secondary but meaningful contribution.
Chinese summary
本文研究了合成后训练数据策管中两个未被充分探讨的方面:过滤信号是否锚定在每次生成的源出处上,以及被拒绝的样本能否被系统性地恢复而非丢弃。作者通过对抗注入的语料库获取真实故障标签,发现精确的源出处能增强更强评判器的忠实度门控。幻觉门控和奖励门控会拒绝大部分不重叠的样本群体,因此两者都必不可少。一种结合故障诊断和定向再生成的自适应恢复流程在产出率、恢复率和注入召回率上均优于朴素重采样。下游微调质量主要由生成器规模驱动,过滤和恢复条件起次要但有意义的作用。
Key points
Exact source provenance improves faithfulness gating when the judge is a stronger model, acting as a grounding signal for filtering.
精确的源出处可以增强更强评判器的忠实度门控,为过滤提供锚定信号。
Hallucination gates and reward-model gates reject largely disjoint sets of samples, so both are necessary for comprehensive quality control.
幻觉门控和奖励模型门控拒绝的样本群体大部分不重叠,因此两者对于全面的质量控制都必不可少。
An adaptive recovery pipeline (failure diagnosis + targeted regeneration) outperforms naive resampling in yield, recovery rate, and injection recall.
自适应恢复流程(故障诊断+定向再生成)在产出率、恢复率和注入召回率方面均优于朴素重采样。
Downstream fine-tuning quality is predominantly determined by the generator model scale, with filtration and recovery playing a secondary but meaningful role.
下游微调质量主要由生成器模型规模决定,过滤和恢复条件起到次要但有意义的作用。
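The quantities named in the key points can be illustrated with a small sketch. The summary does not give the paper's formulas, so everything below is an assumption made for illustration: gate overlap is measured as Jaccard similarity over rejected sample IDs, yield is the fraction of generated samples that end up accepted (first pass plus recovered), recovery rate is the fraction of rejected samples that are accepted after regeneration, and injection recall is the fraction of adversarially injected samples that some gate rejects. The names `rejection_overlap`, `CurationCounts`, and `curation_metrics` are hypothetical, as are the toy numbers.

```python
from __future__ import annotations

from dataclasses import dataclass


def rejection_overlap(rejected_a: set[str], rejected_b: set[str]) -> float:
    """Jaccard overlap of two gates' rejected sample IDs (0.0 means disjoint).

    Assumed metric for illustration; the paper may measure overlap differently.
    """
    union = rejected_a | rejected_b
    if not union:
        return 0.0
    return len(rejected_a & rejected_b) / len(union)


@dataclass
class CurationCounts:
    """Hypothetical bookkeeping for one curation run."""

    generated: int            # samples produced by the generator
    accepted_first_pass: int  # samples that passed every gate on the first try
    recovered: int            # rejected samples accepted after regeneration
    injected: int             # samples carrying a deliberately injected failure
    injected_caught: int      # injected samples rejected by at least one gate


def curation_metrics(c: CurationCounts) -> dict[str, float]:
    """Compute yield, recovery rate, and injection recall under the assumed definitions."""
    rejected = c.generated - c.accepted_first_pass
    accepted_total = c.accepted_first_pass + c.recovered
    return {
        "yield": accepted_total / c.generated if c.generated else 0.0,
        "recovery_rate": c.recovered / rejected if rejected else 0.0,
        "injection_recall": c.injected_caught / c.injected if c.injected else 0.0,
    }


if __name__ == "__main__":
    # Toy data: the two gates share only one rejected sample out of five.
    hallucination_rejects = {"s1", "s2", "s3"}
    reward_rejects = {"s3", "s4", "s5"}
    print(rejection_overlap(hallucination_rejects, reward_rejects))

    counts = CurationCounts(
        generated=100,
        accepted_first_pass=60,
        recovered=20,
        injected=10,
        injected_caught=9,
    )
    print(curation_metrics(counts))
```

A low overlap value is what "largely disjoint" would look like under this measure: each gate catches failures the other misses, so dropping either one lets its share of failures through.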