AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization
English summary
The paper proposes AdaSR, an adaptive streaming reasoning framework in which a large language model reasons while its input is still arriving and then performs a final deliberation once the stream ends, learning both when to think and how much computation to allocate. To optimize this hierarchical process, the authors introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into a streaming reasoning phase and a deep reasoning phase, assigns advantages at a fine-grained level, and combines format, accuracy, and adaptive thinking rewards. Experiments show that AdaSR attains a better trade-off among reasoning accuracy, computational efficiency, and streaming latency than supervised fine-tuning baselines. The code is publicly released.
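The two-stage control flow described above (reason during the stream, then deliberate once it ends) can be pictured with a small Python sketch. Everything here is an illustrative assumption rather than the authors' implementation: the function name `stream_then_deliberate`, the callable signatures, and the convention that an empty string means "skip thinking at this step" are all hypothetical, and the toy policies stand in for a learned model.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical policy interfaces (not from the paper): each receives the
# chunks seen so far and the thoughts produced so far.
StreamPolicy = Callable[[List[str], List[str]], str]
FinalPolicy = Callable[[List[str], List[str]], str]


def stream_then_deliberate(
    chunks: Iterable[str],
    stream_think: StreamPolicy,
    final_think: FinalPolicy,
) -> Tuple[List[str], str]:
    """Interleave optional thinking with incoming chunks, then answer."""
    seen: List[str] = []
    thoughts: List[str] = []
    for chunk in chunks:
        seen.append(chunk)
        thought = stream_think(seen, thoughts)
        # Assumption: an empty string means the policy chose not to think now.
        if thought:
            thoughts.append(thought)
    # Final deliberation happens only after the stream has ended.
    answer = final_think(seen, thoughts)
    return thoughts, answer


def toy_stream_think(seen: List[str], thoughts: List[str]) -> str:
    """Toy policy: only 'think' when the newest chunk contains a number."""
    digits = [tok for tok in seen[-1].split() if tok.isdigit()]
    return "saw " + " ".join(digits) if digits else ""


def toy_final_think(seen: List[str], thoughts: List[str]) -> str:
    """Toy policy: sum every number that appeared in the stream."""
    total = sum(int(tok) for c in seen for tok in c.split() if tok.isdigit())
    return str(total)
```

With the toy policies, the model thinks on the first and third chunks and skips the second, which is the kind of "when to think" decision an adaptive policy would have to learn.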
Key points
AdaSR introduces a streaming reasoning paradigm where models reason while receiving continuous input and conduct final deliberation once the stream ends.
Hierarchical Relative Policy Optimization (HRPO) decomposes policy optimization into streaming and deep reasoning phases, offers fine-grained advantage assignment, and combines format, accuracy, and adaptive thinking rewards.
AdaSR achieves a better balance of reasoning accuracy, computational efficiency, and streaming latency than supervised fine-tuning baselines.
The authors have open-sourced the code at https://github.com/EIT-NLP/StreamingLLM/tree/main/AdaSR.
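To make the HRPO key point more concrete, here is a minimal Python sketch of one way phase-wise, group-relative advantages could be computed. It is a sketch under stated assumptions, not the paper's formulation: the summary does not give the reward weights, the normalization, or how rewards are split between phases, so the `Rollout` fields, the equal default weights in `combine_rewards`, and the mean/standard-deviation normalization (borrowed from common group-relative policy optimization practice) are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Sequence


@dataclass
class Rollout:
    """One sampled response for a prompt (all fields are illustrative)."""
    stream_reward: float   # reward credited to the streaming-reasoning phase
    deep_reward: float     # reward credited to the final deep-reasoning phase
    n_stream_tokens: int
    n_deep_tokens: int


def combine_rewards(fmt: float, acc: float, adaptive: float,
                    weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of format, accuracy, and adaptive-thinking rewards.

    Assumption: equal weights; the summary does not state the real ones.
    """
    return weights[0] * fmt + weights[1] * acc + weights[2] * adaptive


def group_relative(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards against the mean and std of their sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def hierarchical_advantages(group: Sequence[Rollout]) -> List[List[float]]:
    """Per-token advantages, normalized separately for each phase.

    Streaming tokens receive the streaming-phase advantage and deep-reasoning
    tokens receive the deep-phase advantage, so credit is not shared between
    the two phases.
    """
    stream_adv = group_relative([r.stream_reward for r in group])
    deep_adv = group_relative([r.deep_reward for r in group])
    return [
        [s] * r.n_stream_tokens + [d] * r.n_deep_tokens
        for r, s, d in zip(group, stream_adv, deep_adv)
    ]
```

Normalizing each phase within its own group means a rollout can be rewarded for good streaming-time reasoning even if its final deliberation was weaker than its peers', which is one plausible reading of "fine-grained advantage assignment".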