Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning
English summary
The paper proposes Perceive-to-Reason (P2R), a framework that decouples fine-grained visual reasoning into two stages: a Perceiver that localizes question-relevant evidence in the image, and a Reasoner that answers using the annotated image and the cropped regions. It also introduces Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates updates between perception-focused and reasoning-focused phases using only final-answer supervision. Built on Qwen3-VL-Instruct-2B/4B/8B, P2R improves performance at every model scale; P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its backbone. Further experiments show that the benefits extend beyond high-resolution benchmarks to broader multimodal reasoning tasks.
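The two-stage flow described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `perceiver` and `reasoner` callables, the `(x0, y0, x1, y1)` box format, and the nested-list image are all assumptions made here to keep the example self-contained.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x0, y0, x1, y1), x1/y1 exclusive
Image = List[List[int]]          # toy stand-in for an image: rows of pixel values


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box to the image bounds and order its corners."""
    x0, y0, x1, y1 = box
    x0, x1 = max(0, min(x0, width)), max(0, min(x1, width))
    y0, y1 = max(0, min(y0, height)), max(0, min(y1, height))
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the (clamped) box."""
    height = len(image)
    width = len(image[0]) if image else 0
    x0, y0, x1, y1 = clamp_box(box, width, height)
    return [row[x0:x1] for row in image[y0:y1]]


def perceive_then_reason(
    image: Image,
    question: str,
    perceiver: Callable[[Image, str], List[Box]],          # hypothetical stage 1
    reasoner: Callable[[Image, List[Box], List[Image], str], object],  # hypothetical stage 2
):
    """Stage 1 localizes evidence; stage 2 answers from the image, boxes, and crops."""
    boxes = perceiver(image, question)
    crops = [crop(image, b) for b in boxes]
    # The boxes stand in for the "annotated image" that the Reasoner receives.
    return reasoner(image, boxes, crops, question)
```

Keeping the two roles behind separate callables mirrors the decoupling in the summary: the Reasoner never has to search the full-resolution image itself, only to read the regions the Perceiver selected.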
Key points
P2R decouples fine-grained visual reasoning into two stages: perception (localizing question-relevant evidence) and reasoning (answering from that evidence).
PRA-GRPO alternates training updates between the perception and reasoning roles, using only final-answer supervision.
P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its backbone.
The framework is built on Qwen3-VL-Instruct (2B, 4B, and 8B) and improves performance at every model scale and on broader multimodal reasoning tasks.
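To make the alternating, answer-only training idea more concrete, here is a rough sketch of its ingredients. The summary does not give the schedule, the reward definition, or the normalization details, so everything below is an assumption: a normalized exact-match reward, the group-relative advantage commonly used in GRPO (reward minus group mean, divided by group standard deviation), and a fixed-length phase schedule that starts with the Perceiver. The names `answer_reward`, `group_relative_advantages`, `role_for_step`, and `phase_len` are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def answer_reward(predicted: str, gold: str) -> float:
    """Assumed final-answer reward: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each reward within its rollout group.

    Uses the population standard deviation; eps avoids division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def role_for_step(step: int, phase_len: int) -> str:
    """Assumed schedule: alternate fixed-length phases, Perceiver first."""
    return "perceiver" if (step // phase_len) % 2 == 0 else "reasoner"
```

In a training loop built on these pieces, the role returned by `role_for_step` would decide which role's outputs are optimized at that step, while the reward in both phases still comes only from the final answer.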