Papers | Source: arXiv | June 16, 2026 | Importance: 3/5

FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models

English summary

The paper introduces FusionRS, which the authors present as the first large-scale RGB-infrared-text dataset for remote sensing vision-language learning. The dataset is built by translating public RGB remote sensing images into infrared-style counterparts, so each infrared image is generated from its RGB partner. Every aligned RGB-IR pair comes with two kinds of text: a conventional scene caption and an IR-aware caption that explicitly describes infrared-specific visual properties. The authors train CLIP-style models for RGB-IR-text alignment and fine-tune generative vision-language models for dual-modal captioning. In the reported experiments, training on FusionRS improves RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only and non-IR-aware training settings. Ablation studies indicate that the IR-aware captions are crucial for strengthening infrared-language alignment, which points to the importance of modality-specific textual supervision.
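
To make "CLIP-style RGB-IR-text alignment" concrete, here is a minimal NumPy sketch of one way such an objective could look: a symmetric contrastive (InfoNCE) loss applied to each of the three modality pairs and averaged. The summary does not state the paper's actual loss, so the pairwise averaging, the temperature of 0.07, and the function names `symmetric_info_nce` and `tri_modal_loss` are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(a, b, temperature=0.07):
    """CLIP-style loss for two embedding batches whose rows are paired.

    Row i of `a` should match row i of `b` and mismatch every other row.
    The temperature value is an assumption, not taken from the paper.
    """
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)

def tri_modal_loss(rgb, ir, text, temperature=0.07):
    """Hypothetical objective: average the loss over the three modality pairs."""
    return (
        symmetric_info_nce(rgb, ir, temperature)
        + symmetric_info_nce(rgb, text, temperature)
        + symmetric_info_nce(ir, text, temperature)
    ) / 3.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.normal(size=(8, 16))                  # stand-in RGB embeddings
    ir = rgb + 0.01 * rng.normal(size=rgb.shape)    # nearly aligned IR embeddings
    text = rgb + 0.01 * rng.normal(size=rgb.shape)  # nearly aligned caption embeddings
    print("aligned loss:", tri_modal_loss(rgb, ir, text))
    print("mismatched loss:", tri_modal_loss(rgb, ir[::-1].copy(), text))
```

In a real pipeline the three inputs would come from image and text encoders. Under the ablation described above, the text embeddings paired with the infrared branch would be computed from either the conventional captions or the IR-aware captions, and the two resulting models would be compared.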

Key points

FusionRS is presented as the first large-scale RGB-infrared-text dataset for remote sensing; it is created by translating public RGB images into infrared-style counterparts to form aligned image pairs.
The dataset includes both conventional scene captions and IR-aware captions that explicitly describe thermal, structural, and illumination-invariant features.
Dual-modal CLIP-style models and generative vision-language models trained on FusionRS achieve better RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning than RGB-only and non-IR-aware training settings.
Ablation studies demonstrate that IR-aware captions are essential for strong infrared-language alignment, validating the need for modality-specific textual supervision.
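
Retrieval results such as the infrared-to-text gains listed above are commonly reported as recall@k, although this summary does not say which metric the paper uses. The sketch below assumes that row i of the infrared embeddings is paired with row i of the caption embeddings; the function name `recall_at_k` and the toy inputs are illustrative only.

```python
import numpy as np

def recall_at_k(query_emb, gallery_emb, k=1):
    """Fraction of queries whose paired gallery item ranks in the top k.

    Pairing is by row index (an assumption for this sketch), and ranking
    uses cosine similarity between query and gallery embeddings.
    """
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T
    # Rank of the true match = number of gallery items scoring strictly higher.
    true_scores = np.diag(sims)[:, None]
    ranks = (sims > true_scores).sum(axis=1)
    return float((ranks < k).mean())

if __name__ == "__main__":
    ir_emb = np.eye(4)                       # toy infrared image embeddings
    text_emb = np.eye(4)                     # perfectly matched caption embeddings
    print(recall_at_k(ir_emb, text_emb, k=1))
    shifted = np.roll(text_emb, 1, axis=0)   # every caption moved to the wrong row
    print(recall_at_k(ir_emb, shifted, k=1))
```

Running the same function on embeddings from a model trained with IR-aware captions and from one trained without them would give the kind of comparison the ablation describes.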
