ExpRL: Exploratory RL for LLM Mid-Training

Summary

ExpRL is an RL-based mid-training method that uses human-written question-answer pairs as reward scaffolds. The reference solutions are hidden from the policy; an LLM judge instead compares sampled reasoning traces against them and assigns dense outcome or process rewards. This reinforces partial progress and useful reasoning behaviors that sparse final-answer rewards often miss. On challenging math tasks, ExpRL primes the model for subsequent sparse-reward RL better than supervised fine-tuning, sparse-reward GRPO, or self-distillation. Mixed-domain experiments suggest the method also carries over beyond math.
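
To illustrate why dense rewards can help where sparse ones stall, here is a minimal sketch, not the authors' implementation. It assumes group-relative advantage normalization in the style of GRPO (which the summary names only as a baseline), and it replaces the LLM judge with a hypothetical toy function, toy_judge_reward, that scores a trace by the fraction of hidden reference steps it contains. All function names and the example data are invented for illustration.

```python
from statistics import mean, pstdev


def sparse_reward(final_answer: str, reference_answer: str) -> float:
    """Sparse outcome reward: 1.0 only for an exact final-answer match."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def toy_judge_reward(trace_steps: list[str], reference_steps: list[str]) -> float:
    """Hypothetical stand-in for the LLM judge.

    Returns the fraction of reference steps found in the sampled trace.
    A real judge would be a model call; the reference is seen only here,
    never by the policy that produced the trace.
    """
    if not reference_steps:
        return 0.0
    hits = sum(1 for step in reference_steps if step in trace_steps)
    return hits / len(reference_steps)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: center and scale rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Assumed toy problem: the hidden reference has three steps and answer "7".
    reference_steps = ["factor", "substitute", "solve"]
    reference_answer = "7"

    # Three sampled traces, none of which reaches the correct final answer.
    samples = [
        (["factor", "substitute", "guess"], "5"),
        (["factor", "guess"], "3"),
        (["guess"], "9"),
    ]

    sparse = [sparse_reward(ans, reference_answer) for _, ans in samples]
    dense = [toy_judge_reward(steps, reference_steps) for steps, _ in samples]

    print("sparse advantages:", group_advantages(sparse))
    print("dense advantages: ", group_advantages(dense))
```

In this toy group every sample gets a sparse reward of zero, so the group-relative advantages are all zero and the policy receives no learning signal. The judge-style scores differ across the same samples, so the trace that made the most partial progress gets a positive advantage and the one that made none gets a negative one.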

Key points

Uses human-written QA pairs as reward scaffolds: references stay hidden from the policy, and an LLM judge provides dense outcome or process rewards.
Reinforces partial progress and useful reasoning behaviors that sparse final-answer rewards miss.
Outperforms supervised fine-tuning, sparse-reward GRPO, and self-distillation as a mid-training step before sparse-reward RL.
Effective on challenging math reasoning tasks, with promising results in mixed-domain settings.
