This paper derives the exact posterior score in closed form for linear Gaussian inverse problems under general Gaussian interpolants, reducing posterior sampling to a denoising problem at an operator-dependent shifted pivot with anisotropic noise covariance. The method, Exact Posterior Score (EPS), defines a denoising training objective that mirrors standard pretraining, so a model can be trained from scratch or fine-tuned from a pretrained denoiser. At inference, EPS uses the same sampler as the base model, with no likelihood gradients or projections. Evaluated on five linear inverse tasks across FFHQ and ImageNet, EPS surpasses both training-free and training-based baselines in fidelity, perceptual, and distributional metrics while requiring roughly an order of magnitude fewer denoiser evaluations than gradient-based posterior samplers.
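The reduction EPS describes can be illustrated with a standard Gaussian completion-of-squares argument. The notation below is an assumption, not taken from the paper: a Gaussian interpolant with coefficients \alpha_t > 0 and \sigma_t > 0, a linear operator A, and isotropic measurement noise with variance \sigma_y^2. Under those assumptions the two Gaussian likelihood terms in x_0 merge into one Gaussian with pivot \mu_t and precision \Lambda_t, and the posterior score follows from Tweedie's identity, because x_t is independent of y given x_0.

```latex
\begin{aligned}
x_t &= \alpha_t x_0 + \sigma_t \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I),
\qquad y = A x_0 + n, \quad n \sim \mathcal{N}(0, \sigma_y^2 I), \\
p(x_0 \mid x_t, y)
  &\propto p(x_0)\, \mathcal{N}(x_t;\, \alpha_t x_0,\, \sigma_t^2 I)\, \mathcal{N}(y;\, A x_0,\, \sigma_y^2 I) \\
  &\propto p(x_0)\, \exp\!\Big(-\tfrac{1}{2}\, (x_0 - \mu_t)^\top \Lambda_t\, (x_0 - \mu_t)\Big), \\
\Lambda_t &= \frac{\alpha_t^2}{\sigma_t^2}\, I + \frac{1}{\sigma_y^2}\, A^\top A,
\qquad
\mu_t = \Lambda_t^{-1}\Big(\frac{\alpha_t}{\sigma_t^2}\, x_t + \frac{1}{\sigma_y^2}\, A^\top y\Big), \\
\nabla_{x_t} \log p_t(x_t \mid y)
  &= \frac{\alpha_t\, \mathbb{E}[x_0 \mid x_t, y] - x_t}{\sigma_t^2}.
\end{aligned}
```

In this reading, \mathbb{E}[x_0 \mid x_t, y] is the output of a denoiser that sees x_0 corrupted around the shifted pivot \mu_t with covariance \Lambda_t^{-1}, which is anisotropic whenever A^\top A is not a multiple of the identity. The paper's own parameterization may differ.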
KVEraser is a learned method for post-hoc context erasing in long-context LLMs that avoids full recomputation. It replaces only the KV states of the to-be-erased span with learned steering values while keeping the rest of the cache intact. A two-stage training pipeline first pre-trains on generic span-neighbor suppression, then fine-tunes for downstream tasks. On in-domain tasks with 1K–32K context, KVEraser nearly matches the accuracy of full recomputation while increasing latency by only 24%, versus a 17.6× increase for full recomputation. The method also generalizes to unseen long-document QA with harmful distractors, achieving the best performance among approximate methods and a 3–4× speedup over full recomputation.
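The cache edit that KVEraser performs, overwriting only the erased span and leaving every other position untouched, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `erase_span`, the single-layer, single-head cache stored as a list of per-position vectors, and the hand-written steering vectors are all hypothetical. In the paper the steering values are learned.

```python
from typing import List, Sequence, Tuple

State = List[float]  # one key or value vector for a single cached position


def erase_span(
    keys: Sequence[State],
    values: Sequence[State],
    start: int,
    end: int,
    steer_keys: Sequence[State],
    steer_values: Sequence[State],
) -> Tuple[List[State], List[State]]:
    """Return a new cache with positions [start, end) replaced by steering states.

    Hypothetical helper: positions outside the span are carried over unchanged,
    so nothing outside the span has to be recomputed.
    """
    if not 0 <= start < end <= len(keys) or len(keys) != len(values):
        raise ValueError("span must lie inside a consistent cache")
    span = end - start
    if len(steer_keys) != span or len(steer_values) != span:
        raise ValueError("need exactly one steering state per erased position")
    new_keys = list(keys)
    new_values = list(values)
    # Same-length slice assignment keeps the cache length and positions intact.
    new_keys[start:end] = [list(k) for k in steer_keys]
    new_values[start:end] = [list(v) for v in steer_values]
    return new_keys, new_values


# Toy cache with six positions; erase positions 2 and 3.
keys = [[float(i), 0.0] for i in range(6)]
values = [[0.0, float(i)] for i in range(6)]
steer = [[-1.0, -1.0], [-1.0, -1.0]]  # stand-in for learned steering values
new_keys, new_values = erase_span(keys, values, 2, 4, steer, steer)
```

The cost of the edit scales with the span length, not the context length, which is the intuition behind the small latency overhead reported above.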
The paper introduces TokenPilot, a dual-granularity context management framework for long-horizon LLM agents that preserves prompt cache continuity while reducing token footprints. It combines global Ingestion-Aware Compaction, which stabilizes prompt prefixes and filters environmental noise, with local Lifecycle-Aware Eviction, which monitors segment utility and evicts a segment only once its task relevance has expired. On PinchBench and Claw-Eval, TokenPilot reduces costs by 61%/56% in isolated mode and 61%/87% in continuous mode versus prior systems, while maintaining competitive performance. The method has been integrated into the open-source LightMem2 library.
ActiveSAM is a training-free, zero-shot inference framework that adapts SAM 3 for open-vocabulary semantic segmentation by pruning the full dataset vocabulary to an image-conditional active subset via a low-resolution presence preview. Only the retained classes are decoded at full resolution using the frozen SAM 3 decoder with bucketed prompt multiplexing and margin-aware background calibration. On eight OVSS benchmarks, ActiveSAM outperforms the prior state-of-the-art SegEarth-OV3 by +1.4 mIoU on average while running up to 5.5× faster on large-vocabulary datasets. The method requires no target-dataset training, no weight updates, and no oracle class-presence labels. It also exhibits strong robustness under image corruption, making it suitable for noisy-input domains like autonomous driving. Code is available at https://github.com/VILA-Lab/ActiveSAM.
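The pruning step in ActiveSAM, reducing the full vocabulary to an image-conditional active subset before full-resolution decoding, can be sketched as follows. This is an illustrative sketch only: the thresholding rule, the fallback to the top-scoring class, the fixed bucket size, and the names `active_vocabulary` and `bucket_prompts` are assumptions, and the presence scores are made up rather than produced by SAM 3.

```python
from typing import Dict, List


def active_vocabulary(
    presence: Dict[str, float], threshold: float = 0.5, min_classes: int = 1
) -> List[str]:
    """Keep classes whose preview presence score reaches the threshold.

    Hypothetical rule: if too few classes pass, fall back to the top-scoring ones
    so that at least `min_classes` are always decoded.
    """
    ranked = sorted(presence.items(), key=lambda item: item[1], reverse=True)
    active = [name for name, score in ranked if score >= threshold]
    if len(active) < min_classes:
        active = [name for name, _ in ranked[:min_classes]]
    return active


def bucket_prompts(classes: List[str], bucket_size: int) -> List[List[str]]:
    """Split the active classes into fixed-size groups for batched decoding."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be positive")
    return [classes[i:i + bucket_size] for i in range(0, len(classes), bucket_size)]


# Made-up presence scores from a low-resolution preview of one image.
scores = {"road": 0.9, "car": 0.7, "zebra": 0.05, "sky": 0.6}
active = active_vocabulary(scores, threshold=0.5)
buckets = bucket_prompts(active, bucket_size=2)
```

Only the classes in `active` would then be passed to the full-resolution decoder, which is where the speedup on large vocabularies would come from.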
Researchers discover a mechanism in vision-language models (VLMs) called "gaze heads": a small set of attention heads in the language-model backbone whose attention patterns track the exact image region the model is currently describing. Using comic strips as a controlled testbed, they identify these heads via a simple correlation score computed from a few forward passes. A single attention-mask intervention on the top-100 gaze heads (fewer than 9% of all heads) forces the VLM to describe a chosen comic panel with 83.1% accuracy, while random-head interventions fail and full-head intervention destroys generation. The steering effect generalizes to natural COCO images, works across model sizes from 2B to 32B parameters, and recurs in multiple VLM architectures (though some frozen-encoder families lack comparable heads). The work demonstrates that mechanistic analysis can yield practical inference-time levers for controlling multimodal model behavior without any retraining. The authors have released code, a demo, and datasets.
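One way to picture the head-selection step for gaze heads is to correlate, per head, the attention mass placed on a panel with an indicator of whether that panel is the one being described, then keep the highest-scoring heads. The sketch below assumes a Pearson correlation and made-up attention numbers. The summary does not specify the exact score, and `pearson`, `rank_heads`, and the head labels are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_heads(
    attn_mass: Dict[str, Sequence[float]], described: Sequence[float], k: int
) -> List[str]:
    """Return the k heads whose attention mass best tracks the described panel."""
    scores = {head: pearson(mass, described) for head, mass in attn_mass.items()}
    return sorted(scores, key=lambda head: scores[head], reverse=True)[:k]


# Four samples: 1.0 means the panel is the one being described, 0.0 otherwise.
described = [1.0, 0.0, 1.0, 0.0]
attn_mass = {
    "L3H1": [0.9, 0.1, 0.8, 0.2],  # follows the described panel
    "L0H0": [0.5, 0.5, 0.5, 0.5],  # constant, uninformative
    "L7H2": [0.2, 0.8, 0.1, 0.9],  # looks away from the described panel
}
top = rank_heads(attn_mass, described, k=1)
```

An intervention would then mask attention only in the selected heads, leaving the remaining heads untouched.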
The paper proposes AdaSR, an adaptive streaming reasoning framework that lets large language models reason while input is still streaming in and perform final deliberation after the stream ends, learning when and how much to think. To optimize this hierarchical process, the authors introduce Hierarchical Relative Policy Optimization (HRPO), which decomposes policy optimization into streaming reasoning and deep reasoning phases, provides fine-grained advantage assignment, and combines format, accuracy, and adaptive thinking rewards. Experiments show AdaSR attains a better trade-off among reasoning accuracy, computational efficiency, and streaming latency than supervised fine-tuning baselines. Code is publicly released.