WorldDirector is a controllable video world model framework that explicitly decouples semantic motion orchestration from visual generation. It uses a large language model to coordinate 3D object trajectories and camera movements, then employs these trajectories as control signals for a video generator. This design ensures strict physical consistency, stable appearance, and persistent memory of dynamic objects, so that they keep their exact visual identity even when they re-enter a scene after long occlusions. The framework supports unrestrained viewpoint exploration and can synthesize complex, extended events with high controllability.
AnyGroundBench is a new benchmark for evaluating spatio-temporal video grounding (STVG) in vision-language models, shifting from zero-shot testing to rigorous domain adaptation. It covers five specialized domains: animal, industry, sports, surgery, and public security, using newly captured videos and established datasets with dense annotations. The benchmark includes dedicated training subsets to systematically measure domain adaptability. Evaluation of 15 state-of-the-art VLMs reveals that all models fail to adapt under zero-shot and in-context learning settings, exposing critical flaws in their spatio-temporal reasoning capabilities.
The paper adapts a mixture-of-experts discrete diffusion language model, DiffusionGemma-26B, and benchmarks it against the autoregressive (AR) Gemma-4-26B on medical visual question answering. Using the same LoRA fine-tuning recipe, the diffusion model matches or exceeds the AR model's performance, as scored by a verbosity-robust LLM judge, while decoding 3.5–4.4× faster. The fine-tuned model (3.8B active parameters) is competitive with frontier vision-language models. Crucially, the diffusion paradigm enables any-order infill: a radiologist can correct individual parts of a report and the model regenerates the text between the corrected spans, a capability inherent to diffusion that autoregressive models cannot easily replicate. This suits real-world radiology reports, which often vary in style and completeness across clinicians and institutions.
This paper proposes Asymmetric Mutual Variational Learning (AMVL), a framework that addresses the train-inference mismatch in continuous latent reasoning for multimodal large language models. The mismatch arises because standard variational training forces the inference-time prior to mimic a posterior conditioned on ground-truth answers, causing answer leakage. AMVL uses a forward KL divergence to align the prior with the posterior and a novel reverse KL divergence to regularize the posterior, preventing collapse into inference-incompatible regions. The method is instantiated in a latent-integrated MLLM and evaluated on the BLINK benchmark, where it improves the average score by 10.83 and achieves gains of up to 32.00 on individual reasoning tasks, with analyses showing improved latent-space stability.
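The two KL terms in the AMVL summary can be made concrete with a minimal Python sketch. It assumes the prior and posterior over the latent are one-dimensional Gaussians, which the summary does not state. The function names `gaussian_kl` and `asymmetric_mutual_loss`, the weights `alpha` and `beta`, and the mapping of "forward" and "reverse" onto argument order are all illustrative assumptions, not the paper's actual objective.

```python
import math


def gaussian_kl(mu_a, sigma_a, mu_b, sigma_b):
    """Closed-form KL( N(mu_a, sigma_a^2) || N(mu_b, sigma_b^2) )."""
    return (
        math.log(sigma_b / sigma_a)
        + (sigma_a ** 2 + (mu_a - mu_b) ** 2) / (2.0 * sigma_b ** 2)
        - 0.5
    )


def asymmetric_mutual_loss(prior, posterior, alpha=1.0, beta=0.1):
    """Weighted sum of the two KL directions (hypothetical weighting).

    prior, posterior: (mean, std) tuples.
    Assumption: "forward" means KL(posterior || prior) and
    "reverse" means KL(prior || posterior).
    """
    mu_p, sigma_p = prior
    mu_q, sigma_q = posterior
    forward = gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)
    reverse = gaussian_kl(mu_p, sigma_p, mu_q, sigma_q)
    return alpha * forward + beta * reverse, forward, reverse


# Example: a narrow posterior shifted away from a unit-variance prior.
total, forward, reverse = asymmetric_mutual_loss((0.0, 1.0), (1.0, 0.5))
```

The two directions give different values for the same pair of distributions, which is why weighting them separately (here via the assumed `alpha` and `beta`) changes what the training signal penalizes.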
The paper proposes Perceive-to-Reason (P2R), a framework that splits fine-grained visual reasoning into two stages: a Perceiver that localizes question-relevant evidence in the image, and a Reasoner that answers using the annotated image and cropped regions. It introduces Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates updates between perception-focused and reasoning-focused phases using only final-answer supervision. Built on Qwen3-VL-Instruct-2B/4B/8B, P2R consistently improves performance; P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its backbone. Further experiments show the benefits extend beyond high-resolution benchmarks to broader multimodal reasoning tasks.
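As a rough illustration of the training signal described for PRA-GRPO, the Python sketch below computes group-relative advantages from final-answer rewards, the normalization commonly associated with GRPO, and pairs it with a simple alternation schedule. The helper names (`final_answer_reward`, `group_relative_advantages`, `role_for_step`), the exact-match reward, the use of the population standard deviation, and the fixed `phase_length` are assumptions made for this sketch; the summary does not specify how the paper implements any of them.

```python
import statistics


def final_answer_reward(prediction, answer):
    """Hypothetical reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == answer.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampled group.

    Advantage_i = (r_i - mean(group)) / (std(group) + eps).
    Population std is an assumption; implementations differ.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def role_for_step(step, phase_length=2):
    """Hypothetical schedule: alternate which role is updated every phase_length steps."""
    return "perceiver" if (step // phase_length) % 2 == 0 else "reasoner"


# One group of sampled answers to the same question, scored on the final answer only.
samples = ["A", "b", "C", "a"]
rewards = [final_answer_reward(s, "A") for s in samples]
advantages = group_relative_advantages(rewards)
schedule = [role_for_step(step) for step in range(6)]
```

Because the advantage is relative to the group mean, a group in which every sample receives the same reward yields zero advantage for all samples, so only questions with mixed outcomes contribute a learning signal in this sketch.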
VideoSearch-R1 is an agentic framework that performs iterative video retrieval and reasoning by interacting with a search engine in multiple turns. It introduces Soft Query Refinement (SQR), which refines search query tokens in a continuous latent space rather than rewriting discrete text, enabling more efficient adjustments. The framework is trained with Group Relative Policy Optimization (GRPO) using task-level rewards from retrieval and downstream tasks like temporal grounding. VideoSearch-R1 achieves state-of-the-art results on three datasets for Video Corpus Moment Retrieval (VCMR), iteratively retrieving videos from large-scale corpora and then performing precise query-conditioned temporal grounding within the retrieved content. Analysis shows SQR effectively refines queries while requiring significantly fewer generated tokens than explicit text-level refinement. Code and model checkpoints are publicly available.