SELF-ALIGNED REWARD: TOWARDS EFFECTIVE AND EFFICIENT REASONERS
Abstract
The paper introduces Self-Aligned Reward (SAR), a fine-grained RL signal that complements verifiable rewards to improve both the accuracy and the efficiency of LLM reasoning. SAR is defined as the relative difference between the perplexity of an answer conditioned on the query and the perplexity of the same answer taken on its own, so it favors concise, query-specific responses and penalizes redundancy. Quantitative analysis confirms that SAR reliably ranks answer quality, scoring concise correct answers higher than verbose ones. Integrating SAR with PPO or GRPO reduces average answer length by 30% while boosting accuracy by 4% across four model families and seven benchmarks, with strong out-of-domain generalization. The approach reaches a Pareto-optimal trade-off between correctness and efficiency, trimming unnecessary elaboration without hurting advanced reasoning behaviors. Code and data are publicly released.
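The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation. It assumes the gap is normalized by the standalone perplexity and signed so that an answer the query makes more predictable scores higher. The function names and the choice of per-token log-probabilities as inputs are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def self_aligned_reward(logprobs_given_query, logprobs_standalone):
    """Relative perplexity gap between the standalone and query-conditioned answer.

    Assumed form (normalization and sign are assumptions of this sketch):
        (ppl(answer) - ppl(answer | query)) / ppl(answer)
    Both inputs are log-probabilities of the same answer tokens, scored
    once with the query in context and once without it.
    """
    ppl_cond = perplexity(logprobs_given_query)
    ppl_alone = perplexity(logprobs_standalone)
    return (ppl_alone - ppl_cond) / ppl_alone


# A query-specific answer: the query makes every token much more likely.
specific = self_aligned_reward([-0.1, -0.1], [-1.0, -1.0])
# A generic answer: the query does not change the token probabilities.
generic = self_aligned_reward([-0.5, -0.5], [-0.5, -0.5])
print(specific > generic)
```

Under this reading, text that would be equally likely without the query (filler, restated boilerplate) pulls the score toward zero, which matches the stated preference for concise, query-specific answers.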
Key Points
SAR is a self-guided signal defined as the relative gap between an answer's perplexity conditioned on the query and its perplexity as a standalone text, rewarding concise, query-specific responses.
It complements the coarse-grained verifiable rewards used in RL algorithms such as PPO and GRPO, improving both reasoning accuracy and token efficiency.
Across four model families and seven benchmarks, SAR reduces average answer length by 30% and boosts accuracy by 4%.
SAR generalizes well to out-of-domain tasks and reaches a Pareto-optimal trade-off between correctness and efficiency.
The code and data are publicly available.
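To make the second key point concrete, here is a rough sketch of how a fine-grained score such as SAR might be added to a binary verifiable reward before a GRPO-style group normalization. The additive combination, the weight `alpha`, and the function names are assumptions made for illustration. The summary does not state how the two rewards are combined, and real implementations differ on details such as which standard deviation estimator they use.

```python
import statistics


def combined_rewards(verifiable, sar, alpha=0.2):
    """Add a weighted fine-grained score to each verifiable reward.

    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    if len(verifiable) != len(sar):
        raise ValueError("reward lists must have the same length")
    return [v + alpha * s for v, s in zip(verifiable, sar)]


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Three sampled answers to one query: two correct, one wrong.
verifiable = [1.0, 1.0, 0.0]
sar_scores = [0.6, 0.1, 0.3]  # illustrative values only

rewards = combined_rewards(verifiable, sar_scores, alpha=0.5)
advantages = group_advantages(rewards)
print(advantages)
```

With the verifiable reward alone, the two correct answers would receive identical advantages. The added term separates them, which is the sense in which a fine-grained signal complements a coarse-grained one.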