Source: arXiv · June 16, 2026 · Importance: 4/5

ActiveSAM: Image-Conditional Class Pruning for Fast and Accurate Open-Vocabulary Segmentation

Chinese title (translated): ActiveSAM: A Fast and Accurate Open-Vocabulary Segmentation Framework Based on Image-Conditional Class Pruning

Abstract (English)

ActiveSAM is a training-free, zero-shot inference framework that adapts SAM 3 for open-vocabulary semantic segmentation by pruning the full dataset vocabulary to an image-conditional active subset via a low-resolution presence preview. Only the retained classes are decoded at full resolution using the frozen SAM 3 decoder with bucketed prompt multiplexing and margin-aware background calibration. On eight OVSS benchmarks, ActiveSAM outperforms the prior state-of-the-art SegEarth-OV3 by +1.4 mIoU on average while running up to 5.5× faster on large-vocabulary datasets. The method requires no target-dataset training, no weight updates, and no oracle class-presence labels. It also exhibits strong robustness under image corruption, making it suitable for noisy-input domains like autonomous driving. Code is available at https://github.com/VILA-Lab/ActiveSAM.

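Abstract (translated from Chinese)

ActiveSAM is a training-free, zero-shot inference framework. It uses a low-resolution presence preview to prune the full vocabulary to an image-conditional active subset, then decodes only the retained classes at full resolution with the frozen SAM 3 decoder to perform open-vocabulary semantic segmentation. On eight OVSS benchmarks, ActiveSAM exceeds the previous best method, SegEarth-OV3, by about 1.4 mIoU on average and runs up to 5.5× faster on large-vocabulary datasets. The method needs no target-dataset training, weight updates, or ground-truth class labels. It also shows the strongest robustness under image corruptions that simulate real-world distribution shift, which makes it suited to noisy-input settings such as autonomous driving. The code is open source.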
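The pruning step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `select_active_classes`, the `threshold` and `min_keep` parameters, and the example scores are all hypothetical, and the real preview pass is replaced by a plain dictionary of per-class presence scores.

```python
from typing import Dict, List


def select_active_classes(
    presence: Dict[str, float],
    threshold: float = 0.5,
    min_keep: int = 1,
) -> List[str]:
    """Return the classes whose preview presence score clears the threshold.

    `presence` maps class name -> score in [0, 1], standing in for the output
    of a low-resolution preview pass (an assumption of this sketch). If fewer
    than `min_keep` classes pass, fall back to the top-scoring ones so the
    full-resolution decoder always receives at least one prompt.
    """
    ranked = sorted(presence, key=presence.get, reverse=True)
    kept = [name for name in ranked if presence[name] >= threshold]
    if len(kept) < min_keep:
        kept = ranked[:min_keep]
    return kept


# Hypothetical preview scores for a street scene over a 6-class vocabulary.
preview_scores = {
    "road": 0.97, "car": 0.88, "sky": 0.91,
    "giraffe": 0.02, "sofa": 0.01, "boat": 0.08,
}
active = select_active_classes(preview_scores, threshold=0.5)
print(active)
```

In this toy case only three of the six classes would reach the full-resolution decoder. How the paper actually scores presence and chooses its cutoff is not stated in the summary above.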

Key points

ActiveSAM turns SAM 3 into an active-vocabulary segmenter: a low-resolution presence preview prunes the full vocabulary to a small image-conditional active set, with no training involved.
Full-resolution decoding is applied only to the retained classes, using bucketed prompt multiplexing and margin-aware background calibration, which sharply reduces computation.
On eight OVSS benchmarks, it improves on the previous state-of-the-art SegEarth-OV3 by +1.4 mIoU on average and runs up to 5.5× faster on large-vocabulary datasets.
The framework requires no target-dataset training, weight updates, or oracle labels, and it shows the strongest robustness under image corruption, making it suitable for deployment in noisy real-world domains.
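The summary names bucketed prompt multiplexing and margin-aware background calibration without describing them, so the following is only one plausible reading, written as a minimal sketch. It assumes that bucketing means splitting the retained class prompts into fixed-size groups for separate decoder calls, and that the calibration assigns a pixel to a class only when its score beats a background score by a margin. The names `bucket_prompts` and `label_pixel` and the parameters `bucket_size`, `bg_score`, and `margin` are hypothetical.

```python
from typing import Dict, List, Optional


def bucket_prompts(classes: List[str], bucket_size: int) -> List[List[str]]:
    """Split retained class prompts into fixed-size groups (assumed: one decoder call each)."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be >= 1")
    return [classes[i:i + bucket_size] for i in range(0, len(classes), bucket_size)]


def label_pixel(scores: Dict[str, float], bg_score: float, margin: float) -> Optional[str]:
    """Return the best class only if it beats the background score by `margin`.

    Returns None to mean "background". This decision rule is an assumption
    of the sketch, not the calibration used in the paper.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] - bg_score >= margin else None


buckets = bucket_prompts(["road", "sky", "car", "person", "tree"], bucket_size=2)
confident = label_pixel({"road": 0.62, "car": 0.30}, bg_score=0.5, margin=0.1)
uncertain = label_pixel({"road": 0.55}, bg_score=0.5, margin=0.1)
print(buckets, confident, uncertain)
```

Under these assumptions, five retained classes with a bucket size of two need three decoder calls instead of one per class, and a pixel whose best class barely exceeds the background score stays background.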
