AI intelligence feed

ARXIV | Jun 16, 2026 | Highlight

Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement

The paper proposes VERITAS, a generator-verifier framework for generalist robot policies. It pairs a pre-trained robot policy (generator) with a gradient-free visual verifier that evaluates actions at inference time, enabling policy steering without additional training. Verified rollouts are then used as supervision for offline fine-tuning, yielding consistent performance gains. The approach matches the efficiency of expert demonstrations but requires no human intervention, highlighting inference-time verification as a scalable mechanism for self-improvement in real-world deployment.
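
The generator-verifier loop described above can be pictured as a best-of-N selection. The sketch below is only an illustration under stated assumptions, not the VERITAS implementation: the summary does not say how steering is performed, so candidate sampling plus arg-max selection is an assumed mechanism, and `steer`, `keep_verified`, and `toy_verifier` are hypothetical names.

```python
from typing import Callable, List, Tuple

Action = Tuple[float, ...]


def steer(
    propose: Callable[[], Action],
    verify: Callable[[Action], float],
    num_candidates: int = 8,
) -> Tuple[Action, float]:
    """Sample candidates from the generator and keep the one the verifier scores highest.

    Assumption: steering is modeled as best-of-N selection. No gradients are
    taken through the verifier, and the policy itself is not updated here.
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    candidates = [propose() for _ in range(num_candidates)]
    scores = [verify(action) for action in candidates]
    best = max(range(num_candidates), key=scores.__getitem__)
    return candidates[best], scores[best]


def keep_verified(
    scored: List[Tuple[Action, float]], threshold: float
) -> List[Action]:
    """Keep actions whose verifier score clears a (hypothetical) threshold.

    The kept actions stand in for the verified rollouts used as fine-tuning data.
    """
    return [action for action, score in scored if score >= threshold]


def toy_verifier(target: Action) -> Callable[[Action], float]:
    """Toy stand-in for a visual verifier: negative squared distance to a target."""
    def score(action: Action) -> float:
        return -sum((a - t) ** 2 for a, t in zip(action, target))
    return score
```

In a real system `propose` would sample from the pre-trained policy and `verify` would score predicted or observed visual outcomes; both are placeholders here.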

ARXIV | Jun 16, 2026 | Highlight

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

The paper introduces Zone of Proximal Policy Optimization (ZPPO), a method that keeps a stronger teacher inside the prompt rather than in the policy gradient to avoid drift when student rollouts fail on hard questions. ZPPO constructs two reformulated prompts for difficult queries: a Binary Candidate-included Question (BCQ) pairs one correct teacher response with one incorrect student response as anonymized candidates for discrimination, and a Negative Candidate-included Question (NCQ) aggregates the student's wrong rollouts to surface shared failure modes. A prompt replay buffer recirculates each hard question until the student reaches a mean rollout accuracy of one half on it or the question is evicted, keeping training focused on the student's current zone of proximal development. Evaluated on the Qwen3.5 family at four student scales (0.8B–9B) with a 27B teacher, post-trained as vision-language models, ZPPO outperforms off/on-policy distillation and GRPO across a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), with the largest gains at the smallest scale.
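
The replay-buffer rule in the summary can be sketched in a few lines. This is a minimal illustration, not the paper's code: the 0.5 threshold comes from the summary, while the eviction criterion (a cap on failed attempts, `max_attempts`) and all names are assumptions, since the summary does not say when a question is evicted.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PromptReplayBuffer:
    threshold: float = 0.5   # mean rollout accuracy at which a question leaves the buffer
    max_attempts: int = 4    # hypothetical eviction rule: give up after this many failed rounds
    attempts: Dict[str, int] = field(default_factory=dict)

    def update(self, question: str, rollout_correct: List[bool]) -> str:
        """Record one round of rollouts and return 'solved', 'replay', or 'evicted'."""
        if not rollout_correct:
            raise ValueError("need at least one rollout")
        accuracy = sum(rollout_correct) / len(rollout_correct)
        if accuracy >= self.threshold:
            # The student now handles the question often enough; stop replaying it.
            self.attempts.pop(question, None)
            return "solved"
        failed_rounds = self.attempts.get(question, 0) + 1
        if failed_rounds >= self.max_attempts:
            # Still too hard after repeated rounds; drop it from the buffer.
            self.attempts.pop(question, None)
            return "evicted"
        self.attempts[question] = failed_rounds
        return "replay"

    def pending(self) -> List[str]:
        """Questions that will be recirculated in the next round."""
        return list(self.attempts)
```

Questions returned by `pending()` would be the ones reformulated as BCQ or NCQ prompts in the next round; that prompt construction is not modeled here.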

X | Jun 16, 2026 | Highlight

DecagonAI cut voice agent cost per turn by nearly 6x using fine-tuned open models on Together AI

DecagonAI reduced voice agent cost per turn nearly 6x by migrating from closed models to fine-tuned open models on Together AI. They maintained p95 model latency under 400 ms per turn, low enough for real-time voice, through custom speculative decoding, prompt caching, and optimized serving on NVIDIA Blackwell GPUs. The team deploys new models weekly or even daily, demonstrating rapid iteration and full control over their AI stack without proprietary API lock-in.

ARXIV | Jun 16, 2026 | Highlight

Learning from the Self-future: On-policy Self-distillation for dLLMs

The paper introduces d-OPSD, the first on-policy self-distillation framework designed for diffusion large language models (dLLMs). It replaces autoregressive-centric prefix conditioning with self-generated suffix conditioning, allowing the student to learn from future self-experience. Supervision shifts from token-level to step-level to align with the iterative denoising process. Experiments on four reasoning benchmarks show d-OPSD consistently outperforms RLVR and SFT baselines while requiring only about 10% of RLVR's optimization steps, demonstrating superior sample efficiency. Code is available on GitHub.

ARXIV | Jun 16, 2026

Multi-Source Cybersecurity Logs: An ATT&CK-Labeled Dataset and SLM Evaluation

The paper introduces a new dataset that combines system, network, and browser activity logs, containing 2.3 million events from 870 sessions (70 attack, 800 benign). All malicious events are labeled with MITRE ATT&CK technique IDs, covering 12 tactics and 53 techniques, and attacks were generated using real tools including RAT, C2 tunnels, and cloud exfiltration. The authors fine-tuned three Small Language Models (Qwen2.5-1.5B, Llama-3.2-3B, Phi-4-Mini) with LoRA and evaluated them on chunk classification and ATT&CK technique identification. Fine-tuning raised chunk classification accuracy from ~8% for base models to 90–97%. Technique identification remained hard, with the best exact-match accuracy at 42%, though high partial-match scores indicate the models learned the underlying reasoning.
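
The gap between exact-match and partial-match scores is easier to see with a concrete metric. The sketch below assumes partial match is the Jaccard overlap between predicted and gold technique-ID sets; the summary does not define the paper's partial-match score, so this definition and the function names are illustrative only.

```python
from typing import Iterable


def exact_match(predicted: Iterable[str], gold: Iterable[str]) -> bool:
    """True only when the predicted technique IDs equal the gold set (order ignored)."""
    return set(predicted) == set(gold)


def partial_match(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Jaccard overlap of the two ID sets (assumed definition, not the paper's)."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0  # nothing to predict and nothing predicted
    return len(pred_set & gold_set) / len(pred_set | gold_set)
```

A prediction of `["T1059", "T1071"]` against gold `["T1059", "T1041"]` fails exact match but still earns partial credit for the shared technique, which is how a model can score low on one metric and high on the other.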

LEIPHONE | Jun 16, 2026

Tencent Rhino Bird Elite Program Presents Three ICML 2026 Papers on Efficient Distillation, Long-Context Reasoning, and Sparse-View Video Generation

The Tencent Rhino Bird Elite Talent Program released three papers accepted at ICML 2026. The first, Hybrid Policy Distillation (HPD), unifies forward and reverse KL divergence and on/off-policy data to improve LLM distillation stability, efficiency, and performance across math reasoning, dialogue, and code generation. The second, Many-Shot CoT-ICL, studies in-context learning with many chain-of-thought examples for reasoning tasks, finding that similarity-based retrieval fails and proposing CDS, a method that orders examples by conceptual progression to boost reasoning accuracy by 3.81% on average. The third, CamGeo, distills 3D geometry priors from a video-to-3D model into a diffusion backbone via trajectory and cross-frame consistency distillation and a coarse-to-fine curriculum, achieving stable performance gains in sparse camera-conditioned image-to-video generation.
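
For readers unfamiliar with the two divergences HPD combines, the sketch below mixes forward KL (teacher to student) and reverse KL (student to teacher) with a single weight. The fixed weight `alpha` and the function names are assumptions for illustration; the summary does not give HPD's actual objective or how it handles on-policy versus off-policy data.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def hybrid_kl(teacher: Sequence[float], student: Sequence[float], alpha: float = 0.5) -> float:
    """Weighted mix of forward and reverse KL (alpha is a hypothetical mixing weight).

    Forward KL(teacher || student) pushes the student to cover all teacher modes;
    reverse KL(student || teacher) pushes it to avoid mass where the teacher has little.
    """
    forward = kl_divergence(teacher, student)
    reverse = kl_divergence(student, teacher)
    return alpha * forward + (1.0 - alpha) * reverse
```

Setting `alpha=1.0` recovers plain forward-KL distillation and `alpha=0.0` recovers the reverse-KL variant.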
