This paper analyzes the training dynamics of On-Policy Distillation (OPD) for large language models. Unlike supervised fine-tuning (SFT), OPD updates operate in a relaxed off-principal regime: they affect fewer weights and avoid principal directions. OPD also exhibits subspace locking, entering a narrow low-dimensional channel early in training; preserving this update subspace maintains OPD performance, while SFT degrades significantly without it. Sparsifying update tokens and shifting rollout generation off-policy do not disrupt the rank dynamics, but mixing OPD with reinforcement learning alters the update geometry. These findings establish OPD as a geometrically distinct training paradigm.
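To make "affecting fewer weights" and "avoiding principal directions" concrete, the sketch below computes two toy diagnostics for a weight update: the fraction of entries left unchanged, and the share of the update's squared Frobenius norm that falls along the top right singular direction of the base weights. This is an illustration under assumed definitions, not the paper's measurement protocol; the function names, the tolerance, and the restriction to a single principal direction are all hypothetical choices.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def top_right_singular_vector(W, iters=200, seed=0):
    """Power iteration on W^T W; returns a unit vector (sign is arbitrary)."""
    rng = random.Random(seed)
    Wt = transpose(W)
    v = [rng.gauss(0, 1) for _ in range(len(W[0]))]
    for _ in range(iters):
        v = matvec(Wt, matvec(W, v))
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v


def principal_energy_fraction(W, dW):
    """Share of ||dW||_F^2 along W's top right singular direction (assumed metric)."""
    v = top_right_singular_vector(W)
    along = sum(x * x for x in matvec(dW, v))
    total = sum(x * x for row in dW for x in row)
    return along / total


def update_sparsity(dW, tol=1e-8):
    """Fraction of entries the update leaves (numerically) unchanged."""
    flat = [x for row in dW for x in row]
    return sum(abs(x) <= tol for x in flat) / len(flat)


# Toy base weights whose principal direction is the first coordinate axis.
W = [[3.0, 0.0], [0.0, 1.0]]
off_principal = [[0.0, 0.1], [0.0, 0.0]]  # touches only the minor direction
on_principal = [[0.1, 0.0], [0.0, 0.0]]   # touches only the principal direction

print(principal_energy_fraction(W, off_principal))
print(principal_energy_fraction(W, on_principal))
print(update_sparsity(off_principal))
```

In this toy setting, an update in the off-principal regime would show a low principal-energy fraction and high sparsity; a real analysis would use more than one singular direction and actual model checkpoints.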
Code2LoRA is a hypernetwork framework built on Qwen2.5-Coder-32B-Instruct that generates repository-specific LoRA adapters for code language models without adding token overhead at inference. It supports both static adaptation for stable codebases and evolving adaptation for actively changing ones, injecting repository context such as imports, APIs, and project conventions. The method was evaluated on RepoPeftBench, a benchmark of 604 Python repositories, where it achieved high accuracy on both tracks and outperformed traditional fine-tuning approaches. The code, model checkpoints, and datasets are publicly available.
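The idea of generating an adapter from repository context, then folding it into the weights so that no extra prompt tokens are needed, can be sketched with the standard LoRA update W' = W + (alpha / r) * B * A. The hypernetwork here is a hypothetical single linear projection from a repository embedding to the flattened factors; Code2LoRA's real architecture, embedding source, and shapes are not described in this summary, so every name and dimension below is an assumption.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def reshape(flat, rows, cols):
    """Reshape a flat list into a rows x cols matrix."""
    assert len(flat) == rows * cols
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def toy_hypernetwork(repo_embedding, proj, d_out, d_in, rank):
    """Hypothetical hypernetwork: one linear map from embedding to LoRA factors."""
    flat = [sum(p * e for p, e in zip(row, repo_embedding)) for row in proj]
    B = reshape(flat[:d_out * rank], d_out, rank)   # shape (d_out, rank)
    A = reshape(flat[d_out * rank:], rank, d_in)    # shape (rank, d_in)
    return B, A


def merge_lora(W, B, A, alpha, rank):
    """Standard LoRA merge: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


# Toy example: a 2-dimensional repository embedding and a 2x2 base weight.
repo_embedding = [1.0, 2.0]
proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]  # assumed, untrained
W = [[1.0, 0.0], [0.0, 1.0]]

B, A = toy_hypernetwork(repo_embedding, proj, d_out=2, d_in=2, rank=1)
W_adapted = merge_lora(W, B, A, alpha=2.0, rank=1)
print(W_adapted)
```

Because the repository information ends up in `W_adapted` rather than in the prompt, the input sequence length at inference is unchanged, which is the sense in which such an approach adds no token overhead.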
This paper introduces a reinforcement learning method that trains large language models to translate previously unseen languages by leveraging contextual linguistic knowledge rather than memorization. Prior approaches, such as continued pretraining or incorporating grammar books, led to overfitting and limited transfer. By optimizing for a surface-level translation metric as a reward, RL-trained models surpass in-context learning and supervised fine-tuning baselines. The results indicate that RL can cultivate meta-learning abilities for extremely low-resource translation, extending its utility beyond traditional reasoning tasks.
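A surface-level translation metric used as a reward can be as simple as character n-gram overlap between the model's output and a reference translation. The sketch below is a simplified, chrF-like F-score written for illustration; the summary does not say which metric the paper uses, and this function is not the official chrF implementation (it averages per-order F-scores and does no special whitespace handling). The function names and defaults are assumptions.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count all character n-grams of order n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_ngram_f_score(hypothesis, reference, max_n=3, beta=2.0):
    """Simplified chrF-like score in [0, 1], usable as a scalar reward."""
    scores = []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        b2 = beta * beta
        scores.append((1 + b2) * precision * recall / (b2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0


# Reward for a sampled translation against a reference (toy strings).
reward = char_ngram_f_score("the cat sat", "the cat sat down")
print(reward)
```

In an RL loop, a score like this would be computed for each sampled translation and passed to the policy-gradient update as the reward; with `beta=2.0`, recall is weighted more heavily than precision.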
Trust Region On-Policy Distillation (TrOPD) aims to improve on-policy distillation for large language models by mitigating the instability caused by distribution mismatch between teacher and student. The method integrates trust region constraints, outlier estimation for token-level credit assignment, and off-policy guidance to stabilize policy gradients. Experiments show TrOPD outperforms existing on-policy distillation baselines across mathematical reasoning, code generation, and general-domain benchmarks.
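One generic way to see why teacher-student mismatch destabilizes on-policy distillation, and how a trust-region-style bound helps, is to look at the per-token log-ratio between teacher and student on tokens the student sampled. The sketch below clips that log-ratio to a fixed band so that a few extreme tokens cannot dominate the gradient. This is a common heuristic shown for intuition only; it is not TrOPD's actual constraint, outlier estimator, or off-policy guidance, and the function name and `delta` threshold are hypothetical.

```python
def clipped_token_signals(student_logprobs, teacher_logprobs, delta=2.0):
    """Clip per-token (teacher - student) log-ratios to [-delta, delta].

    Returns the bounded signals and a flag per token marking whether it was
    clipped, i.e. treated as outside the assumed trust region.
    """
    signals, clipped = [], []
    for s, t in zip(student_logprobs, teacher_logprobs):
        raw = t - s  # negative when the teacher finds the token less likely
        bounded = max(-delta, min(delta, raw))
        signals.append(bounded)
        clipped.append(bounded != raw)
    return signals, clipped


# Toy log-probabilities for three student-sampled tokens.
student = [-1.0, -0.5, -0.2]
teacher = [-1.5, -0.5, -9.0]  # the teacher strongly rejects the last token

signals, clipped = clipped_token_signals(student, teacher, delta=2.0)
print(signals)
print(clipped)
```

Without the clip, the last token's raw log-ratio would be several times larger in magnitude than the others and would dominate a policy-gradient step; bounding it keeps each token's contribution within a fixed range.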