Gaze Heads: How VLMs Look at What They Describe
English summary
Researchers identify a mechanism in vision-language models (VLMs) that they call "gaze heads": a small set of attention heads in the language-model backbone whose attention patterns track the exact image region the model is currently describing. Using comic strips as a controlled testbed, they find these heads with a simple correlation score computed from a few forward passes. A single attention-mask intervention on the top-100 gaze heads (fewer than 9% of all heads) steers the VLM to describe a chosen comic panel with 83.1% accuracy, whereas the same intervention on random heads fails and applying it to every head destroys generation. The steering effect generalizes to natural COCO images, holds across model sizes from 2B to 32B parameters, and recurs in multiple VLM architectures, although some frozen-encoder families lack comparable heads. The work shows that mechanistic analysis can yield practical inference-time levers for controlling multimodal model behavior without any retraining. The authors have released code, a demo, and datasets.
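The summary mentions a "simple correlation score" but does not give its formula. The sketch below is one plausible reading, not the authors' definition: for each head, it takes the Pearson correlation between the attention mass the head places on each panel and a one-hot indicator of the panel being described. The function name `gaze_scores`, the array layout, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def gaze_scores(panel_attn, described_panel):
    """Score each head by how well its attention tracks the described panel.

    panel_attn:      (heads, tokens, panels) attention mass per panel (assumed layout).
    described_panel: (tokens,) index of the panel each generated token describes.
    Returns a (heads,) array of Pearson correlations.
    """
    n_heads, n_tokens, n_panels = panel_attn.shape
    target = np.eye(n_panels)[described_panel]      # (tokens, panels) one-hot labels
    x = panel_attn.reshape(n_heads, -1)             # flatten tokens x panels per head
    y = target.reshape(-1)
    x_c = x - x.mean(axis=1, keepdims=True)
    y_c = y - y.mean()
    denom = np.linalg.norm(x_c, axis=1) * np.linalg.norm(y_c)
    return (x_c @ y_c) / np.maximum(denom, 1e-12)   # guard against zero variance

# Synthetic stand-in for attention recorded over a few forward passes.
rng = np.random.default_rng(0)
n_heads, n_panels, tokens_per_panel = 8, 4, 10
described = np.repeat(np.arange(n_panels), tokens_per_panel)
attn = rng.dirichlet(np.ones(n_panels), size=(n_heads, described.size))
attn[3] = 0.8 * np.eye(n_panels)[described] + 0.05  # head 3 follows the description

scores = gaze_scores(attn, described)
top_heads = np.argsort(scores)[::-1][:2]            # keep the k highest-scoring heads
print(top_heads, np.round(scores, 2))
```

In this toy setup only head 3 is constructed to follow the described panel, so it ranks first. With a real model, `panel_attn` would have to be aggregated from recorded attention weights over each panel's image tokens.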
Key points
VLMs develop "gaze heads": a small subset of attention heads whose attention tracks the image region being described.
Gaze heads are identified with a simple correlation score, using comic strips as a controlled testbed with a spatial narrative order.
Masking the top-100 gaze heads (fewer than 9% of all heads) redirects the model to describe any chosen comic panel with 83.1% accuracy.
The steering effect extends beyond comics to natural COCO images, and holds across model sizes (2B–32B parameters) and diverse VLM architectures.
Some VLM families that use frozen encoders do not exhibit a comparable set of gaze heads.
The intervention demonstrates inference-time multimodal control without retraining, and the project releases code, an interactive demo, and datasets.
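The attention-mask intervention in the key points can be illustrated with a toy NumPy sketch. It assumes the mask works by blocking, for the selected heads only, every image-token key outside the chosen panel before the softmax. The source does not spell out the exact masking rule, so the function `steer_attention`, the token layout, and the head indices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def steer_attention(logits, gaze_heads, image_pos, keep_pos):
    """Mask image tokens outside the chosen panel, for the selected heads only.

    logits:     (heads, queries, keys) pre-softmax attention scores.
    gaze_heads: indices of heads to intervene on (assumed already ranked).
    image_pos:  key positions holding image tokens.
    keep_pos:   image-token positions of the chosen panel.
    """
    blocked = np.setdiff1d(image_pos, keep_pos)
    masked = logits.copy()
    queries = np.arange(logits.shape[1])
    masked[np.ix_(gaze_heads, queries, blocked)] = -np.inf  # zero weight after softmax
    return softmax(masked, axis=-1)

# Toy layout: keys 0-7 are image tokens (4 panels x 2 tokens), keys 8-11 are text.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 2, 12))
image_pos = np.arange(8)
keep_pos = np.array([4, 5])      # tokens of the chosen panel (index 2)
gaze_heads = np.array([1, 4])    # hypothetical top-scoring heads

steered = steer_attention(logits, gaze_heads, image_pos, keep_pos)
baseline = softmax(logits)
print(np.round(steered[1, 0], 2))
```

Heads outside `gaze_heads` keep their original attention, which matches the source's observation that intervening on a small subset of heads steers the model while intervening on all heads destroys generation. In a real VLM the same mask would have to be injected into the attention computation of the selected heads, and how to do that depends on the model implementation.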