The Value Axis: Language Models Encode Whether They're on the Right Track
Processing is temporarily unavailable. The original item should be reviewed from its source link. This fallback keeps the item compatible with the processing contract.
AI intelligence feed
Infogap feed
Curated items are read from the processed items table and served as a bilingual feed.
The authors propose ContextRL, a context-aware reinforcement learning method that improves long-horizon reasoning and multimodal performance in LLMs. It uses an indirect objective: the model is rewarded for selecting which of two highly similar contexts supports a given query–answer pair, promoting fine-grained evidence grounding. Contrastive context data is constructed from coding agent trajectories (1K pairs) and multimodal images (7K pairs) via condition filtering and generative editing. ContextRL yields average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks and +1.8% on 12 visual question answering benchmarks. Data-augmentation baselines that repurpose the same contrastive data as standard examples show little improvement, confirming that the gains arise from the context-selection objective rather than from added data alone.
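The context-selection objective described above can be illustrated with a small sketch. It assumes a binary reward (1 when the model picks the context that supports the query–answer pair, 0 otherwise) and the commonly used GRPO normalization of rewards within a group of sampled responses. The function names and the exact reward shape are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def selection_reward(chosen: int, supporting: int) -> float:
    """Hypothetical reward: 1.0 if the model selected the supporting context."""
    return 1.0 if chosen == supporting else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one sampled group, GRPO-style.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one contrastive pair; context 0 is the
# supporting one (assumed data for illustration only).
choices = [0, 1, 0, 0]
rewards = [selection_reward(c, supporting=0) for c in choices]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, the response that picks the wrong context receives a negative advantage and the correct ones a positive advantage, which is the signal a policy-gradient update would then use.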
This paper derives the exact posterior score in closed form for linear Gaussian inverse problems under general Gaussian interpolants, reducing posterior sampling to a denoising problem at an operator-dependent shifted pivot with anisotropic noise covariance. The method, Exact Posterior Score (EPS), defines a denoising training objective that mirrors standard pretraining, enabling training from scratch or fine-tuning a pretrained denoiser. At inference, EPS uses the identical sampler as the base model, eliminating the need for likelihood gradients or projections. Evaluated on five linear inverse tasks across FFHQ and ImageNet, EPS surpasses both training-free and training-based baselines in fidelity, perceptual, and distributional metrics while requiring roughly an order of magnitude fewer denoiser evaluations than gradient-based posterior samplers.
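For orientation, the standard setup behind this summary can be written out as follows. The notation is assumed for illustration and is not taken from the paper; the last line is simply Bayes' rule applied to the scores, not the paper's closed-form result.

```latex
% Assumed notation (not from the paper):
%   x_0: clean signal, y: measurement, A: known linear operator,
%   x_t: noisy state of a Gaussian interpolant at time t.
\begin{aligned}
y &= A x_0 + n, \qquad n \sim \mathcal{N}(0, \sigma_y^2 I), \\
x_t &= \alpha_t x_0 + \sigma_t \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I), \\
\nabla_{x_t} \log p_t(x_t \mid y) &= \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t).
\end{aligned}
```

The final term is the likelihood gradient that gradient-based posterior samplers have to approximate at every step. According to the summary, EPS instead learns the left-hand side directly through a denoising objective, so the sampler needs no such term.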
The paper introduces the Geometric Action Model (GAM), a language-conditioned manipulation policy that leverages a pretrained geometric foundation model (GFM) to explicitly incorporate 3D geometry for contact-rich tasks. GAM splits the GFM at an intermediate layer, using shallow layers for observation encoding and inserting a causal future predictor that forecasts future latent tokens based on language, proprioception, and action history. The predicted tokens are then processed by the remaining GFM blocks, enabling a single backbone to jointly predict future scene geometry and robot actions with minimal architectural changes. Across simulation and real-robot benchmarks, GAM is reported to be more accurate, more robust, faster, and more compact than existing foundation-model-scale baselines.
Online RL fine-tuning of pretrained VLA policies suffers from sparse binary episode outcomes that conflate viability and efficiency and therefore provide poor per-transition supervision. Naive outcome assignment across human interventions additionally leads to incorrect credit. The paper proposes Hierarchical Advantage-Weighted Behavior Cloning (HABC), which trains separate critic heads for viability and efficiency on distinct data subsets and merges their one-step advantages via a state-adaptive gate that prioritizes viability when success is uncertain and shifts to efficiency only when viability is high. Intervention-aware credit assignment restricts outcome labels to autonomous segments, preventing supervision leakage. On three contact-rich bimanual real-robot tasks, HABC raises success rates from supervised fine-tuning baselines of 36%, 44%, and 12% to 92%, 88%, and 38%, respectively.
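One way to picture the state-adaptive gate is the sketch below. The sigmoid gate, its threshold and sharpness, and the exponential behavior-cloning weight are all assumptions chosen for illustration; the summary does not specify the functional form HABC actually uses.

```python
import math


def gate(viability: float, threshold: float = 0.8, sharpness: float = 20.0) -> float:
    """Hypothetical gate in (0, 1): near 0 when estimated viability is low
    or uncertain, near 1 once viability clearly exceeds the threshold."""
    return 1.0 / (1.0 + math.exp(-sharpness * (viability - threshold)))


def merged_advantage(adv_viability: float, adv_efficiency: float, viability: float) -> float:
    """Blend the two one-step advantages; viability dominates unless the gate opens."""
    g = gate(viability)
    return (1.0 - g) * adv_viability + g * adv_efficiency


def bc_weight(advantage: float, temperature: float = 1.0, max_weight: float = 20.0) -> float:
    """Advantage-weighted behavior-cloning weight, exp(A / T), clipped for stability."""
    return min(math.exp(advantage / temperature), max_weight)


# Same pair of advantages, two different viability estimates (made-up values).
uncertain = merged_advantage(adv_viability=1.0, adv_efficiency=-1.0, viability=0.3)
confident = merged_advantage(adv_viability=1.0, adv_efficiency=-1.0, viability=0.99)
```

With these assumed settings, the merged advantage follows the viability critic when success is uncertain and follows the efficiency critic once viability is high, which matches the behavior the summary attributes to the gate.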
This paper introduces MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals, each paired with PI/ECO criteria, a 140k PubMed retrieval corpus, verified relevant studies, and hard negative distractors. Twelve pipeline configurations—nine retrieval-augmented generation (RAG) variants and one protocol-driven agent—were benchmarked on the full retrieval-screening-synthesis workflow. While retrieval recall reached 90.9% at K=200, no system achieved more than 52.7% recall for ground-truth included studies, exposing a critical screening bottleneck. Current LLMs struggle to reliably distinguish PI/ECO-eligible articles from topically similar but ineligible distractors. To isolate failure points, the authors propose stage-attributed metrics instead of a single end-to-end score.
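The idea of stage-attributed metrics can be sketched as separate recall values for retrieval, screening, and the full pipeline. The definitions below are one plausible reading, assumed for illustration; the paper's exact metric definitions are not given in this summary, and the study identifiers are made up.

```python
def recall(found, relevant) -> float:
    """Fraction of relevant items that appear in `found` (0.0 if none are relevant)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(found)) / len(relevant)


def stage_recalls(retrieved, screened_in, relevant) -> dict[str, float]:
    """Attribute recall to each stage of a retrieve-then-screen pipeline."""
    retrieved_set = set(retrieved)
    # Relevant studies that survived retrieval and can still be screened in.
    reachable = set(relevant) & retrieved_set
    # Screening can only keep studies that were actually retrieved.
    kept = set(screened_in) & retrieved_set
    return {
        "retrieval": recall(retrieved_set, relevant),
        "screening": recall(kept, reachable),
        "end_to_end": recall(kept, relevant),
    }


# Toy example: four included studies, three retrieved, one kept by screening.
scores = stage_recalls(
    retrieved=["s1", "s2", "s3", "d1", "d2"],
    screened_in=["s1", "d1"],
    relevant=["s1", "s2", "s3", "s4"],
)
```

In this toy case retrieval recall is high while screening recall is low, so the end-to-end loss is attributable mainly to screening. Under these definitions, end-to-end recall equals retrieval recall multiplied by screening recall.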