Learning from the Self-future: On-policy Self-distillation for dLLMs

English summary

The paper introduces d-OPSD, which the authors present as the first on-policy self-distillation framework designed for diffusion large language models (dLLMs). It replaces the prefix conditioning used in autoregressive self-distillation with conditioning on a self-generated suffix, so the student learns from its own future output ("future self-experience"). Supervision moves from the token level to the step level to match the iterative denoising process of dLLMs. On four reasoning benchmarks, d-OPSD consistently outperforms baselines trained with reinforcement learning with verifiable rewards (RLVR) and supervised fine-tuning (SFT) while using only about 10% of RLVR's optimization steps, which the authors cite as evidence of better sample efficiency. Code is available at github.com/xingzhejun/d-OPSD.

Key points

First on-policy self-distillation method designed specifically for diffusion LLMs.
Uses suffix conditioning on self-generated answers (future self-experience) instead of left-to-right prefix conditioning.
Uses step-level supervision aligned with dLLM denoising steps instead of a token-level loss.
Outperforms RLVR and SFT on four reasoning benchmarks with roughly 10% of RLVR's optimization steps.
Open-source code at github.com/xingzhejun/d-OPSD.
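
To make the two central ideas concrete, here is a toy Python sketch of one possible reading of them. It is not taken from the paper or its repository: the function names (`student_view`, `teacher_view`, `step_level_loss`), the `<mask>` token, the use of KL divergence, and the averaging scheme are all assumptions chosen for illustration, and the actual d-OPSD objective may differ.

```python
import math

MASK = "<mask>"  # hypothetical mask token, for illustration only


def student_view(prompt, response, revealed):
    """Sequence the student denoises: response positions not in `revealed` stay masked."""
    return prompt + [tok if i in revealed else MASK for i, tok in enumerate(response)]


def teacher_view(prompt, response, revealed, suffix_len):
    """Same as the student view, but the last `suffix_len` self-generated tokens
    are also visible (assumed form of suffix conditioning)."""
    n = len(response)
    suffix = set(range(n - suffix_len, n))
    return student_view(prompt, response, set(revealed) | suffix)


def kl(p, q):
    """KL(p || q) for two discrete distributions; assumes q is strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def step_level_loss(teacher_probs, student_probs, steps):
    """Average the per-position KL within each denoising step, then across steps.

    `steps` lists, for each denoising step, the response positions decoded in it.
    This grouping is an assumed interpretation of "step-level supervision".
    """
    per_step = [
        sum(kl(teacher_probs[i], student_probs[i]) for i in positions) / len(positions)
        for positions in steps
    ]
    return sum(per_step) / len(per_step)


if __name__ == "__main__":
    prompt = ["Q"]
    response = ["a", "b", "c", "d"]  # a self-generated (on-policy) answer
    print(student_view(prompt, response, {0}))
    print(teacher_view(prompt, response, {0}, suffix_len=1))
```

In this sketch the teacher and the student are the same model seeing different contexts: the teacher additionally sees the tail of the answer the model itself produced. Grouping the divergence by denoising step gives every step equal weight, whereas a token-level mean would weight steps by how many tokens they decode.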
