Learning from the Self-future: On-policy Self-distillation for dLLMs
Summary
The paper introduces d-OPSD, which it presents as the first on-policy self-distillation framework designed for diffusion large language models (dLLMs). Instead of the prefix conditioning used with autoregressive models, d-OPSD conditions on a self-generated suffix, so the student learns from its own future self-experience. Supervision moves from the token level to the step level to match the iterative denoising process of dLLMs. On four reasoning benchmarks, d-OPSD consistently outperforms RLVR and SFT baselines while using only about 10% of RLVR's optimization steps, which indicates better sample efficiency. The code is open-sourced on GitHub.
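The contrast between prefix and suffix conditioning can be illustrated with a small sketch. This summary does not specify how d-OPSD builds its conditioning context, so the following is an assumption: a masked-diffusion setting in which the context reveals either the leading or the trailing tokens of the model's own sampled response. The `MASK` token and both function names are hypothetical and are not taken from the d-OPSD code.

```python
MASK = "[MASK]"  # hypothetical mask token, for illustration only


def prefix_conditioned(prompt, response, k):
    """Autoregressive-style context: reveal the first k response tokens, mask the rest."""
    return prompt + response[:k] + [MASK] * (len(response) - k)


def suffix_conditioned(prompt, response, k):
    """Suffix context: mask the early tokens, reveal the last k self-generated tokens."""
    split = len(response) - k
    return prompt + [MASK] * split + response[split:]


prompt = ["Q:", "2+3=?"]
response = ["2", "plus", "3", "is", "5"]  # stands in for a self-generated sample

print(prefix_conditioned(prompt, response, 2))
print(suffix_conditioned(prompt, response, 2))
```

In the suffix variant the revealed tokens come from the end of the sample (for example, the final answer), which is one way to read "future self-experience": the model sees where its own generation ends up before filling in the earlier positions.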
Key points
First on-policy self-distillation method designed specifically for diffusion LLMs.
Conditions on the suffix of self-generated answers (future self-experience) instead of a left-to-right, autoregressive-style prefix.
Applies step-level supervision aligned with the dLLM denoising steps instead of a token-level loss.
Outperforms RLVR and SFT on four reasoning benchmarks while using roughly 10% of RLVR's optimization steps.
Open-source code: github.com/xingzhejun/d-OPSD.
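The difference between token-level and step-level supervision can be sketched as a difference in how per-token distillation losses are aggregated. The summary does not give the d-OPSD objective, so this is an assumed reading: each response token is tagged with the denoising step at which it was unmasked, and the loss is averaged within each step before averaging across steps. The function names and the `step_ids` layout are hypothetical.

```python
import math


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two probability lists over the same vocabulary."""
    return sum(
        t * (math.log(max(t, eps)) - math.log(max(s, eps)))
        for t, s in zip(teacher, student)
        if t > 0
    )


def token_level_loss(token_kl):
    """Average the distillation loss over all tokens, ignoring denoising steps."""
    return sum(token_kl) / len(token_kl)


def step_level_loss(token_kl, step_ids):
    """Average within each denoising step first, then across steps (assumed scheme)."""
    by_step = {}
    for kl, step in zip(token_kl, step_ids):
        by_step.setdefault(step, []).append(kl)
    step_means = [sum(values) / len(values) for values in by_step.values()]
    return sum(step_means) / len(step_means)


# Three tokens were unmasked at step 0 and one token at step 1.
token_kl = [1.0, 1.0, 1.0, 4.0]
step_ids = [0, 0, 0, 1]

print(token_level_loss(token_kl))           # 1.75
print(step_level_loss(token_kl, step_ids))  # 2.5
```

Under this scheme a step that unmasks few tokens carries the same weight as a step that unmasks many, so the objective follows the denoising trajectory rather than the raw token count.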