Denser ≠ Better: Limits of On-Policy Self-Distillation for Continual Post-Training
English summary
This paper reveals that dense on-policy self-distillation (SDPO) accelerates in-domain specialization under stable teacher signals, but causes severe forgetting and even complete collapse during continual post-training. In contrast, on-policy reinforcement learning methods like GRPO adapt more conservatively and better preserve prior capabilities. Denser self-distillation induces larger drift in parameter and response spaces, and amplifies high-frequency formatting artifacts through a self-reinforcing teacher-student loop. The findings caution that on-policy data alone is insufficient for continual learning, and dense self-distillation should not be treated as a default stabilizer.
Chinese summary
该论文表明,密集的在策略自蒸馏(SDPO)在教师信号稳定时能加速领域内专化,但在持续后训练中会导致严重遗忘甚至完全崩溃。相比之下,在策略强化学习方法(如GRPO)的适应更为保守,能更好地保留先前能力。更密集的自蒸馏会引起参数空间和响应空间的更大漂移,并通过自我强化的师生循环放大高频格式化伪影。研究警告,仅在策略数据不足以实现持续学习,密集自蒸馏不应被当作默认的稳定器。
Key points
SDPO accelerates in-domain specialization when teacher signals are stable and aligned, but fails to generalize out-of-distribution.
当教师信号稳定且对齐时,SDPO 能加速领域内专化,但无法泛化到分布外场景。
In continual post-training, SDPO exhibits catastrophic forgetting and can collapse, while on-policy RL (GRPO) better preserves prior capabilities.
在持续后训练中,SDPO 会出现灾难性遗忘甚至崩溃,而在策略强化学习(GRPO)能更好地保留先前能力。
Denser self-distillation causes larger drift in parameter and response space, and amplifies high-frequency formatting artifacts via a self-reinforcing loop.
更密集的自蒸馏会导致参数和响应空间更大漂移,并通过自我强化的循环放大高频格式化伪影。
On-policy data alone is insufficient for continual learning; dense self-distillation should not be used as a default stabilizer.
仅在策略数据不足以实现持续学习;密集自蒸馏不应被当作默认的稳定器。
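The contrast the key points draw between a sparse RL signal and a dense distillation signal can be made concrete with a small sketch. It is illustrative only and is not the paper's implementation. It assumes a GRPO-style group-normalized advantage (one scalar per sampled response), a per-token KL term as a stand-in for the dense self-distillation signal (the KL direction is an assumption), and a relative L2 distance as one possible measure of parameter drift. All function names and toy numbers are hypothetical.

```python
import math


def group_normalized_advantages(rewards, eps=1e-8):
    """Sparse signal: one scalar advantage per sampled response,
    computed by standardizing rewards within the sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def per_token_kl(student_probs, teacher_probs):
    """Dense signal (assumed form): KL(student || teacher) at every token
    position, so each position contributes its own learning signal.
    Assumes teacher probabilities are strictly positive."""
    return [
        sum(s * math.log(s / t) for s, t in zip(sp, tp) if s > 0)
        for sp, tp in zip(student_probs, teacher_probs)
    ]


def relative_parameter_drift(before, after):
    """Relative L2 distance between two flattened parameter vectors,
    one simple (assumed) way to quantify drift in parameter space."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    den = math.sqrt(sum(b ** 2 for b in before))
    return num / den


if __name__ == "__main__":
    # Four sampled responses, two rewarded: one number per response.
    print(group_normalized_advantages([1.0, 0.0, 0.0, 1.0]))

    # Two token positions over a toy 2-word vocabulary: one number per token.
    student = [[0.5, 0.5], [0.9, 0.1]]
    teacher = [[0.5, 0.5], [0.6, 0.4]]
    print(per_token_kl(student, teacher))

    # Drift between a toy "base" and "post-trained" parameter vector.
    print(relative_parameter_drift([3.0, 4.0], [3.3, 4.4]))
```

The sparse signal yields one value per response while the dense signal yields one value per token, which shows why a dense objective can move the model further per update. Whether that extra movement helps or harms depends on the teacher signal, as the summary above notes.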