Papers | Source: HUGGINGFACE | July 1, 2026 | Importance: 4/5

Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

English summary

The paper proposes Perceive-to-Reason (P2R), a framework that splits fine-grained visual reasoning into two stages: a Perceiver localizes the question-relevant evidence in the image, and a Reasoner answers using the annotated image together with the cropped regions. It also introduces Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates between perception-focused and reasoning-focused update phases while using only final-answer supervision. P2R is built on Qwen3-VL-Instruct at the 2B, 4B, and 8B scales and improves performance at each of them; P2R-4B reaches 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its backbone. Further experiments show that the gains extend beyond high-resolution benchmarks to broader multimodal reasoning tasks.
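The two-stage flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Box` type, `clamp_box`, `crop`, `perceive_then_reason`, and both stub models are hypothetical names, the image is a toy grid of integers instead of real pixels, and the summary does not say how the image is annotated, so the boxes are simply passed along to the Reasoner.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Image = List[List[int]]  # toy grayscale image: rectangular rows of pixel values


@dataclass(frozen=True)
class Box:
    """Axis-aligned region in pixel coordinates; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int


# Hypothetical interfaces for the two roles (assumptions, not the paper's API).
Perceiver = Callable[[Image, str], Sequence[Box]]
Reasoner = Callable[[Image, Sequence[Box], Sequence[Image], str], str]


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a predicted box to the image bounds."""
    left = max(0, min(box.left, width))
    top = max(0, min(box.top, height))
    right = max(left, min(box.right, width))
    bottom = max(top, min(box.bottom, height))
    return Box(left, top, right, bottom)


def crop(image: Image, box: Box) -> Image:
    """Cut the boxed region out of the image."""
    return [row[box.left:box.right] for row in image[box.top:box.bottom]]


def perceive_then_reason(image: Image, question: str,
                         perceiver: Perceiver, reasoner: Reasoner) -> str:
    """Stage 1 localizes evidence; stage 2 answers from the image, boxes, and crops."""
    height = len(image)
    width = len(image[0]) if image else 0
    boxes = [clamp_box(b, width, height) for b in perceiver(image, question)]
    # Drop boxes that became empty after clipping.
    boxes = [b for b in boxes if b.right > b.left and b.bottom > b.top]
    crops = [crop(image, b) for b in boxes]
    return reasoner(image, boxes, crops, question)


# Stub models standing in for the two vision-language model roles.
def stub_perceiver(image: Image, question: str) -> Sequence[Box]:
    return [Box(1, 1, 3, 3), Box(2, 2, 10, 10), Box(5, 5, 9, 9)]


def stub_reasoner(image: Image, boxes: Sequence[Box],
                  crops: Sequence[Image], question: str) -> str:
    return f"{len(crops)} regions; first crop sum = {sum(map(sum, crops[0]))}"


toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
answer = perceive_then_reason(toy_image, "What is in the center?",
                              stub_perceiver, stub_reasoner)
```

Keeping the two roles behind separate callables mirrors the decoupling in the summary: the Reasoner never has to search the full image, it only consumes the evidence the Perceiver selected.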

Key points

P2R decouples fine-grained visual reasoning into perception (localizing evidence) and reasoning (answering) stages.
PRA-GRPO alternates training updates between perception and reasoning roles using only final-answer supervision.
P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, outperforming the backbone.
The framework is built on Qwen3-VL-Instruct and improves performance across model scales and on broader multimodal tasks.
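
The alternating, final-answer-only training idea in the key points can be illustrated with a small sketch. It is hedged heavily: the summary does not give the reward definition, the phase length, or how the non-updated role is treated, so `final_answer_reward` (exact match), `alternating_roles`, and `phase_length` are assumptions made for illustration. Only the within-group reward normalization follows the commonly used GRPO formulation, and the choice of population standard deviation here is one of several conventions.

```python
from statistics import mean, pstdev
from typing import Iterator, List, Sequence


def final_answer_reward(prediction: str, reference: str) -> float:
    """Assumed binary reward: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Center and scale rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def alternating_roles(num_steps: int, phase_length: int) -> Iterator[str]:
    """Yield which role receives the update at each step (hypothetical schedule)."""
    if phase_length <= 0:
        raise ValueError("phase_length must be positive")
    roles = ("perceiver", "reasoner")
    for step in range(num_steps):
        yield roles[(step // phase_length) % 2]


# Four rollouts for one question whose reference answer is "red".
rewards = [final_answer_reward(p, "red") for p in ["Red", "blue", "red ", "green"]]
advantages = group_relative_advantages(rewards)
schedule = list(alternating_roles(num_steps=6, phase_length=2))
```

In this sketch both roles are scored with the same final-answer reward; the schedule only decides which role's parameters would be updated in a given phase. When every rollout in a group gets the same reward, all advantages are zero and that group contributes no learning signal.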
