AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models
Abstract
AnyGroundBench is a benchmark for evaluating spatio-temporal video grounding (STVG) in vision-language models (VLMs). It shifts the evaluation paradigm from zero-shot testing to rigorous domain adaptation. The benchmark covers five specialized domains (animal, industry, sports, surgery, and public security), combining newly captured videos with established datasets, all with dense spatio-temporal annotations. Dedicated training subsets allow domain adaptability to be measured systematically. An evaluation of 15 state-of-the-art VLMs finds that every model fails to adapt under both zero-shot and in-context learning settings, exposing critical flaws in their spatio-temporal reasoning.
Key Points
AnyGroundBench shifts STVG evaluation from zero-shot to domain adaptation, targeting five specialized domains.
The benchmark provides training subsets and dense annotations to systematically measure adaptation capabilities.
Evaluation of 15 SOTA VLMs shows consistent failure in zero-shot and in-context learning adaptation, revealing critical spatio-temporal reasoning flaws.