VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement
English summary
VideoSearch-R1 is an agentic framework for iterative video retrieval and reasoning that interacts with a search engine over multiple turns. It introduces Soft Query Refinement (SQR), which refines search query tokens in a continuous latent space instead of rewriting discrete text, making query adjustments more efficient. The framework is trained with Group Relative Policy Optimization (GRPO) using task-level rewards from retrieval and from downstream tasks such as temporal grounding. On Video Corpus Moment Retrieval (VCMR), VideoSearch-R1 iteratively retrieves videos from a large-scale corpus and then performs precise query-conditioned temporal grounding within the retrieved content, achieving state-of-the-art results on three datasets. Analysis shows that SQR effectively refines the original query while generating significantly fewer tokens than explicit text-level refinement. Code and model checkpoints are publicly available.
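The multi-turn loop described above can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the function names, the three-dimensional embeddings, and especially `refine`, which uses a simple interpolation toward the top retrieved embedding (a pseudo-relevance-feedback update) as a stand-in for the learned Soft Query Refinement. It only shows the general shape of refining a query vector between search turns without generating any rewritten text; it is not the paper's method.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query_vec, corpus, k=2):
    """Hypothetical search engine: ids of the k most similar corpus vectors."""
    ranked = sorted(corpus, key=lambda vid: cosine(query_vec, corpus[vid]), reverse=True)
    return ranked[:k]


def refine(query_vec, feedback_vec, step=0.5):
    """Stand-in for SQR (assumption): shift the query embedding toward a
    feedback embedding in continuous space; no text tokens are generated."""
    return [(1 - step) * q + step * f for q, f in zip(query_vec, feedback_vec)]


def iterative_retrieve(query_vec, corpus, turns=3, k=2):
    """Alternate between searching and refining the query vector."""
    history = []
    for _ in range(turns):
        top = search(query_vec, corpus, k)
        history.append(top)
        # Use the best hit of this turn as feedback for the next query.
        query_vec = refine(query_vec, corpus[top[0]])
    return history


# Toy corpus of made-up video embeddings.
corpus = {
    "video_a": [1.0, 0.0, 0.0],
    "video_b": [0.7, 0.7, 0.0],
    "video_c": [0.0, 1.0, 0.0],
    "video_d": [0.0, 0.0, 1.0],
}
history = iterative_retrieve([0.9, 0.3, 0.1], corpus)
```

Here `history` holds the ranked ids returned at each turn. In the framework summarized above, the refinement is produced by the trained model rather than by a fixed interpolation rule.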
Key points
Introduces VideoSearch-R1, an agentic video retrieval and reasoning framework that iteratively interacts with a video search engine.
Proposes Soft Query Refinement (SQR), which refines queries in a continuous latent space, avoiding discrete text rewriting and improving efficiency.
Uses Group Relative Policy Optimization (GRPO) with task-level reward signals from retrieval and downstream temporal grounding tasks.
Achieves state-of-the-art performance on three Video Corpus Moment Retrieval (VCMR) datasets.
Code and model checkpoints are publicly available at mlvlab.github.io/VideoSearch-R1.
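To make the GRPO key point more concrete, the following is a minimal sketch of how a task-level reward could be turned into group-relative advantages. The normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO form. The reward itself is an assumption: `task_reward` adds a retrieval hit to a temporal IoU term that counts only when the target video was retrieved, and the function names, the equal weighting, and the `eps` constant are hypothetical rather than taken from the paper.

```python
import statistics


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def task_reward(retrieved_ids, target_id, pred_span, gt_span):
    """Assumed reward: retrieval hit plus grounding IoU (only credited on a hit)."""
    hit = 1.0 if target_id in retrieved_ids else 0.0
    return hit + hit * temporal_iou(pred_span, gt_span)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three hypothetical rollouts for the same query; the target is "v1",
# with a ground-truth moment at 5 s to 15 s.
rewards = [
    task_reward(["v1", "v7"], "v1", (5.0, 15.0), (5.0, 15.0)),  # hit, exact span
    task_reward(["v1", "v3"], "v1", (0.0, 10.0), (5.0, 15.0)),  # hit, partial overlap
    task_reward(["v4", "v9"], "v1", (5.0, 15.0), (5.0, 15.0)),  # retrieval miss
]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean receive positive advantages and the others negative ones, so no separate value network is needed to provide a baseline.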