Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
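Chinese title (translated): Zone of Proximal Development Policy Optimization: Placing the Teacher in the Prompt Rather Than the Gradient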
Abstract
The paper introduces Zone of Proximal Policy Optimization (ZPPO), a method that keeps a stronger teacher inside the prompt rather than in the policy gradient. This avoids the drift that arises when teacher responses are injected because every student rollout on a hard question has failed. For such questions, ZPPO constructs two reformulated prompts: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates for the student to discriminate between, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts to surface shared failure modes. A prompt replay buffer recirculates each hard question until the student's mean rollout accuracy reaches 0.5 or the question is evicted, which keeps training focused on the student's current zone of proximal development. Evaluated on the Qwen3.5 family at four student scales (0.8B–9B) with a 27B teacher, post-trained as vision-language models, ZPPO outperforms off-policy and on-policy distillation and GRPO across a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), with the largest gains at the smallest scale.
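The two reformulated prompts can be illustrated with a minimal sketch. The prompt wording, the function names `build_bcq` and `build_ncq`, the A/B labels, and the removal of exact duplicate rollouts are all assumptions made for illustration; the summary does not give the paper's actual templates. The only properties taken from the text are that BCQ shows one correct teacher response and one incorrect student response without revealing their origin, and that NCQ gathers the student's wrong rollouts into one prompt.

```python
import random
from typing import List, Tuple


def build_bcq(question: str, teacher_correct: str, student_wrong: str,
              rng: random.Random) -> Tuple[str, str]:
    """Build a BCQ-style prompt; return (prompt, label of the correct candidate).

    Hypothetical template: the real ZPPO wording is not given in the summary.
    """
    candidates = [(teacher_correct, True), (student_wrong, False)]
    rng.shuffle(candidates)  # random order, so position does not reveal the source
    lines = [f"Question: {question}", ""]
    answer = ""
    for label, (text, is_correct) in zip(["A", "B"], candidates):
        lines.append(f"Candidate {label}: {text}")
        if is_correct:
            answer = label
    lines.append("")
    lines.append("Exactly one candidate is correct. Decide which one, "
                 "then answer the question.")
    return "\n".join(lines), answer


def build_ncq(question: str, wrong_rollouts: List[str]) -> str:
    """Build an NCQ-style prompt that lists the student's failed rollouts.

    Dropping exact repeats is an assumption, not something the summary states.
    """
    unique = list(dict.fromkeys(wrong_rollouts))  # keeps first-seen order
    lines = [f"Question: {question}", "",
             "The following attempts are all incorrect:"]
    for i, text in enumerate(unique, start=1):
        lines.append(f"Attempt {i}: {text}")
    lines.append("")
    lines.append("Identify what these attempts get wrong, "
                 "then answer the question.")
    return "\n".join(lines)
```

In this sketch the teacher's answer only ever appears as prompt text, so any gradient update would still be computed on the student's own rollouts for the reformulated question.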
Key Takeaways
ZPPO replaces teacher gradient injection with a teacher-in-prompt design built on the BCQ and NCQ reformulated prompts, avoiding on-policy drift when every student rollout fails.
BCQ presents one correct teacher response and one incorrect student response as anonymized candidates for the student to discriminate between; NCQ aggregates the student's failed rollouts to surface shared failure modes.
A prompt replay buffer recirculates each hard question until the student's mean rollout accuracy reaches 0.5, keeping training focused on the student's current zone of proximal development.
On Qwen3.5 vision-language models (0.8B–9B) trained with a 27B teacher, ZPPO beats off-policy and on-policy distillation and GRPO on 31 benchmarks, with the largest improvements at the smallest scale.
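The replay rule in the third takeaway can be sketched as a small queue. This is an illustration under stated assumptions, not the paper's implementation: the class and method names are hypothetical, and because the summary does not say when a question is evicted, the `max_replays` cap below is a stand-in eviction rule. Only the 0.5 mean-accuracy threshold comes from the text.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class HardQuestion:
    question: str
    replays: int = 0  # how many times this question has been re-queued


class PromptReplayBuffer:
    """Recirculate hard questions until the student reaches the target accuracy."""

    def __init__(self, target_accuracy: float = 0.5, max_replays: int = 4):
        self.target_accuracy = target_accuracy  # 0.5 is the threshold from the summary
        self.max_replays = max_replays          # assumed eviction rule, not from the paper
        self._queue: Deque[HardQuestion] = deque()

    def add(self, question: str) -> None:
        self._queue.append(HardQuestion(question))

    def next(self) -> Optional[HardQuestion]:
        """Pop the oldest hard question, or None if the buffer is empty."""
        return self._queue.popleft() if self._queue else None

    def report(self, item: HardQuestion, rollout_correct: List[bool]) -> str:
        """Record rollout outcomes for an item and decide what happens to it."""
        if not rollout_correct:
            raise ValueError("need at least one rollout outcome")
        accuracy = sum(rollout_correct) / len(rollout_correct)
        if accuracy >= self.target_accuracy:
            return "graduated"  # student has caught up; stop replaying
        item.replays += 1
        if item.replays >= self.max_replays:
            return "evicted"    # give up on this question for now
        self._queue.append(item)
        return "requeued"

    def __len__(self) -> int:
        return len(self._queue)
```

A training loop built on this sketch would call `next()` to fetch a hard question, sample student rollouts for it, and pass the per-rollout correctness flags to `report()`.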