Denser ≠ Better: Limits of On-Policy Self-Distillation for Continual Post-Training

English summary

This paper reveals that dense on-policy self-distillation (SDPO) accelerates in-domain specialization under stable teacher signals, but causes severe forgetting and even complete collapse during continual post-training. In contrast, on-policy reinforcement learning methods like GRPO adapt more conservatively and better preserve prior capabilities. Denser self-distillation induces larger drift in parameter and response spaces, and amplifies high-frequency formatting artifacts through a self-reinforcing teacher-student loop. The findings caution that on-policy data alone is insufficient for continual learning, and dense self-distillation should not be treated as a default stabilizer.
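
As a rough illustration of what "dense" versus sparse supervision means here, the sketch below contrasts a GRPO-style group-relative advantage (one scalar per sampled response) with a per-token divergence from a teacher distribution (one value per token position). It is not the paper's implementation: the function names, the toy inputs, the use of the population standard deviation, and the choice of KL(student || teacher) as the distillation signal are all assumptions made for illustration.

```python
import math


def grpo_advantages(rewards, eps=1e-6):
    """Sparse signal: one group-normalized advantage per sampled response.

    Uses the population standard deviation of the group; real
    implementations may normalize differently (assumption).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def per_token_kl(student_probs, teacher_probs):
    """Dense signal: one KL(student || teacher) value per token position.

    Each argument is a list of per-token probability distributions.
    The KL direction is an assumption, not taken from the paper.
    Teacher probabilities must be nonzero wherever the student's are.
    """
    return [
        sum(s * math.log(s / t) for s, t in zip(s_dist, t_dist) if s > 0)
        for s_dist, t_dist in zip(student_probs, teacher_probs)
    ]


# Toy group of four sampled responses with binary rewards (hypothetical).
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# Toy response of three tokens over a two-word vocabulary (hypothetical).
student = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
teacher = [[0.8, 0.2], [0.5, 0.5], [0.4, 0.6]]
token_signal = per_token_kl(student, teacher)
```

The GRPO-style update receives a single number per response, while the distillation signal supplies a separate target at every token. That higher density is the property the paper associates with faster specialization and with larger drift.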

Key points

SDPO accelerates in-domain specialization when teacher signals are stable and aligned, but fails to generalize out-of-distribution.

In continual post-training, SDPO exhibits catastrophic forgetting and can collapse, while on-policy RL (GRPO) better preserves prior capabilities.

Denser self-distillation causes larger drift in parameter and response space, and amplifies high-frequency formatting artifacts via a self-reinforcing loop.

On-policy data alone is insufficient for continual learning; dense self-distillation should not be used as a default stabilizer.

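
One simple way to quantify the parameter-space drift mentioned in the key points is the relative L2 distance between a base checkpoint and a post-trained one. The sketch below assumes parameters are available as flat lists of floats. The metric and the name `relative_param_drift` are illustrative assumptions, and the paper may measure drift differently.

```python
import math


def relative_param_drift(base, tuned):
    """Return ||tuned - base||_2 / ||base||_2 over flattened parameters.

    `base` and `tuned` are equal-length lists of floats (hypothetical
    flattened checkpoints); `base` must not be all zeros.
    """
    diff_sq = sum((t - b) ** 2 for b, t in zip(base, tuned))
    base_sq = sum(b ** 2 for b in base)
    return math.sqrt(diff_sq / base_sq)


# Toy checkpoints (hypothetical values, not results from the paper).
base_params = [3.0, 4.0]
unchanged = relative_param_drift(base_params, [3.0, 4.0])
doubled = relative_param_drift(base_params, [6.0, 8.0])
```

A value near zero means the model stayed close to its starting point. Under this metric, comparing the drift of two training methods from the same base checkpoint gives a first, coarse view of which one adapts more conservatively.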