AI intelligence feed

HUGGINGFACE · Jul 2, 2026 · Highlight

WorldDirector: Building Controllable World Simulators with Persistent Dynamic Memory

WorldDirector is a controllable video world model framework that explicitly decouples semantic motion orchestration from visual generation. It uses a large language model to coordinate 3D object trajectories and camera movements, then employs these trajectories as control signals for a video generator. This design ensures strict physical consistency, stable appearance, and persistent memory of dynamic objects, so that objects keep their exact visual identity even when they re-enter a scene after long occlusions. The framework supports unrestrained viewpoint exploration and can synthesize complex, extended events with high controllability.

HUGGINGFACE · Jul 2, 2026

AnyGroundBench: A Specialized-Domain Benchmark for Video Grounding in Vision-Language Models

AnyGroundBench is a new benchmark for evaluating spatio-temporal video grounding (STVG) in vision-language models, shifting from zero-shot testing to rigorous domain adaptation. It covers five specialized domains: animal, industry, sports, surgery, and public security, using newly captured videos and established datasets with dense annotations. The benchmark includes dedicated training subsets to systematically measure domain adaptability. Evaluation of 15 state-of-the-art VLMs reveals that all models fail to adapt under zero-shot and in-context learning settings, exposing critical flaws in their spatio-temporal reasoning capabilities.

HUGGINGFACE · Jul 1, 2026 · Highlight

Discrete Diffusion Language Models for Interactive Radiology Report Drafting

The paper adapts a mixture-of-experts discrete diffusion language model, DiffusionGemma-26B, and benchmarks it against the autoregressive Gemma-4-26B on medical visual question answering. Using the same LoRA fine-tuning recipe, the diffusion model matches or exceeds AR performance, scored by a verbosity-robust LLM judge, while decoding 3.5–4.4× faster. The fine-tuned model (3.8B active parameters) is competitive with frontier vision-language models. Crucially, the diffusion paradigm enables any-order infill: a radiologist can correct parts of a report and the model generates the text between them, a capability inherent to diffusion that autoregressive models cannot easily replicate. This suits real-world radiology reports, which often vary in style and completeness across clinicians and institutions.

HUGGINGFACE · Jul 1, 2026 · Highlight

Multimodal Continuous Reasoning via Asymmetric Mutual Variational Learning

This paper proposes Asymmetric Mutual Variational Learning (AMVL), a framework that addresses the train-inference mismatch in continuous latent reasoning for multimodal large language models. The mismatch arises because standard variational training forces the inference-time prior to mimic a posterior conditioned on ground-truth answers, causing answer leakage. AMVL uses a forward KL divergence to align the prior with the posterior and a novel reverse KL divergence to regularize the posterior, preventing collapse into inference-incompatible regions. The method is instantiated in a latent-integrated MLLM and evaluated on the BLINK benchmark, where it improves the average score by +10.83 and achieves gains of up to +32.00 on individual reasoning tasks, with analyses showing improved latent-space stability.
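For readers less familiar with the two divergences named in this summary, the textbook definitions are sketched below. The notation is an assumption made for illustration and is not taken from the paper: $x$ is the multimodal input, $y$ the ground-truth answer, $z$ the continuous latent, $p_\theta(z \mid x)$ the inference-time prior, and $q_\phi(z \mid x, y)$ the answer-conditioned posterior. Which direction the paper labels "forward" and which "reverse" is not stated in the summary, so the two terms are shown without those labels.

```latex
% Assumed notation (not from the paper): prior p_\theta(z|x), posterior q_\phi(z|x,y).
\begin{align}
D_{\mathrm{KL}}\!\left(q_\phi \,\|\, p_\theta\right)
  &= \mathbb{E}_{z \sim q_\phi(z \mid x, y)}
     \left[\log q_\phi(z \mid x, y) - \log p_\theta(z \mid x)\right], \\
D_{\mathrm{KL}}\!\left(p_\theta \,\|\, q_\phi\right)
  &= \mathbb{E}_{z \sim p_\theta(z \mid x)}
     \left[\log p_\theta(z \mid x) - \log q_\phi(z \mid x, y)\right].
\end{align}
```

The two terms are not interchangeable because KL divergence is asymmetric: the first is large when the posterior places mass where the prior has little, and the second is large when the prior places mass where the posterior has little.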

HUGGINGFACE · Jul 1, 2026 · Highlight

Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

The paper proposes Perceive-to-Reason (P2R), a framework that splits fine-grained visual reasoning into two stages: a Perceiver that localizes question-relevant evidence in the image, and a Reasoner that answers using the annotated image and cropped regions. It introduces Perception-Reasoning Alternating GRPO (PRA-GRPO), a role-aware reinforcement learning strategy that alternates updates between perception-focused and reasoning-focused phases using only final-answer supervision. Built on Qwen3-VL-Instruct-2B/4B/8B, P2R improves performance at every model size; P2R-4B achieves 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its backbone. Further experiments show the benefits extend beyond high-resolution benchmarks to broader multimodal reasoning tasks.

HUGGINGFACE · Jul 1, 2026 · Highlight

VideoSearch-R1: Iterative Video Retrieval and Reasoning via Soft Query Refinement

VideoSearch-R1 is an agentic framework that performs iterative video retrieval and reasoning by interacting with a search engine in multiple turns. It introduces Soft Query Refinement (SQR), which refines search query tokens in a continuous latent space rather than rewriting discrete text, enabling more efficient adjustments. The framework is trained with Group Relative Policy Optimization (GRPO) using task-level rewards from retrieval and downstream tasks like temporal grounding. VideoSearch-R1 achieves state-of-the-art results on three datasets for Video Corpus Moment Retrieval (VCMR), iteratively retrieving videos from large-scale corpora and then performing precise query-conditioned temporal grounding within the retrieved content. Analysis shows SQR effectively refines queries while requiring significantly fewer generated tokens than explicit text-level refinement. Code and model checkpoints are publicly available.
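As a rough illustration of the group-relative step in GRPO, the sketch below normalizes each rollout's reward against the mean and standard deviation of its own sampling group. It is a minimal sketch of the commonly described advantage computation, not the paper's training code: the function name `group_relative_advantages`, the `eps` term, the use of the population standard deviation, and the reward values are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other rollouts in its group.

    rewards: task-level scalar rewards, one per rollout sampled for the
             same input (hypothetical values in this sketch).
    eps:     small constant (an assumption) that avoids division by zero
             when every rollout in the group receives the same reward.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of the same search query.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group average receive a positive advantage and the rest a negative one, so no separately learned value function is needed to supply a baseline.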
