On the Geometry of On-Policy Distillation

English summary

This paper analyzes the training dynamics of On-Policy Distillation (OPD) for large language models. Unlike supervised fine-tuning (SFT), OPD updates operate in a relaxed off-principal regime: they affect fewer weights and avoid principal directions. OPD also exhibits subspace locking, entering a narrow low-dimensional channel early in training. Preserving this update subspace maintains OPD performance, while SFT degrades significantly without it. Sparsifying the update tokens or shifting rollout generation off-policy does not disrupt the rank dynamics, but mixing OPD with reinforcement learning alters the update geometry. Together, these findings establish OPD as a geometrically distinct training paradigm.

Key points

Unlike SFT, OPD updates operate in a relaxed off-principal regime, affecting fewer weights and avoiding principal directions.
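
To make "off-principal" and "fewer weights" concrete, here is a minimal NumPy sketch of two diagnostics one could compute for a single weight matrix. It assumes that "principal directions" means the top-k singular subspaces of the pre-update weight matrix W, which the summary does not state explicitly. The function names, the rank cutoff k, and the tolerance tol are illustrative choices, not the paper's definitions.

```python
import numpy as np

def principal_energy_fraction(W, dW, k):
    """Fraction of ||dW||_F^2 that lies in the top-k singular subspaces of W.

    Assumption: "principal directions" = top-k left/right singular vectors of W.
    A value near 0 means the update is mostly off-principal.
    """
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    U_k = U[:, :k]          # top-k left singular vectors of W
    V_k = Vt[:k, :].T       # top-k right singular vectors of W
    core = U_k.T @ dW @ V_k # coordinates of dW inside the principal block
    return float(np.sum(core ** 2) / np.sum(dW ** 2))

def update_sparsity(dW, tol=1e-8):
    """Fraction of weight entries whose change is at most tol in magnitude."""
    return float(np.mean(np.abs(dW) <= tol))
```

As a usage sketch, `dW` would be the difference between a fine-tuned and a base weight matrix; comparing both numbers between an OPD run and an SFT run is one way to probe the contrast described above.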

OPD exhibits subspace locking: it enters a narrow low-dimensional channel early in training, and this is critical for maintaining performance.
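
One way to look for subspace locking is to check whether a late update still lives in the subspace spanned by an early update, and whether the update stays low-rank. The sketch below is a hedged illustration and not the paper's measurement protocol. It assumes updates are compared through their top-r left singular vectors, and it uses the stable rank as a stand-in for "low-dimensional". All names and the choice of r are hypothetical.

```python
import numpy as np

def top_left_subspace(dW, r):
    """Orthonormal basis (columns) for the top-r left singular subspace of dW."""
    U, _, _ = np.linalg.svd(dW, full_matrices=False)
    return U[:, :r]

def subspace_overlap(U_a, U_b):
    """Mean squared cosine of the principal angles between two r-dim subspaces.

    Returns 1.0 for identical subspaces and 0.0 for orthogonal ones.
    Both inputs must have orthonormal columns and the same number of columns.
    """
    r = U_a.shape[1]
    return float(np.sum((U_a.T @ U_b) ** 2) / r)

def stable_rank(dW):
    """||dW||_F^2 / ||dW||_2^2, a smooth proxy for the rank of the update."""
    return float(np.linalg.norm(dW, "fro") ** 2 / np.linalg.norm(dW, 2) ** 2)
```

Under these assumptions, an overlap that stays close to 1 between an early-checkpoint update and later ones, together with a small stable rank, would be consistent with the locking behavior described above.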

Preserving the early update subspace prevents performance degradation in OPD, whereas SFT degrades significantly under the same condition.
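
The summary does not say how the early update subspace is preserved. One plausible reading, offered here only as an assumption, is that later updates are projected onto the singular subspaces of an early update before being applied. The helper names and the rank r below are hypothetical.

```python
import numpy as np

def early_subspace(early_dW, r):
    """Top-r left and right singular bases of an early-training update."""
    U, _, Vt = np.linalg.svd(early_dW, full_matrices=False)
    return U[:, :r], Vt[:r, :].T

def project_update(dW, U_r, V_r):
    """Keep only the part of dW inside span(U_r) on the left and span(V_r) on the right.

    Computes U_r U_r^T dW V_r V_r^T; applying it twice gives the same result.
    """
    return U_r @ (U_r.T @ dW @ V_r) @ V_r.T
```

In a training loop, one would compute `U_r, V_r` once from an early checkpoint and replace each later per-matrix update `dW` with `project_update(dW, U_r, V_r)`. Running the same constraint on OPD and SFT would mirror the comparison stated above.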

Sparsifying update tokens and generating rollouts off-policy both preserve OPD's rank dynamics, but mixing OPD with reinforcement learning alters the update geometry.
