The paper proposes VERITAS, a generator-verifier framework for generalist robot policies. It pairs a pre-trained robot policy (generator) with a gradient-free visual verifier that evaluates actions at inference time, enabling policy steering without additional training. Verified rollouts are then used as supervision for offline fine-tuning, yielding consistent performance gains. The approach matches the efficiency of expert demonstrations but requires no human intervention, highlighting inference-time verification as a scalable mechanism for self-improvement in real-world deployment.
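One simple way to picture inference-time verification is best-of-N selection: the generator proposes several candidate actions and a verifier score picks one. The sketch below is illustrative only. The summary does not say how VERITAS steers the policy, so the selection rule, the function name `select_verified_action`, and the toy scalar actions are all assumptions.

```python
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def select_verified_action(
    candidates: Sequence[A],
    verifier_score: Callable[[A], float],
) -> A:
    """Return the candidate the verifier scores highest (hypothetical helper).

    No gradients are involved: the verifier is only called as a scoring function.
    """
    if not candidates:
        raise ValueError("need at least one candidate action")
    return max(candidates, key=verifier_score)


# Toy stand-ins: actions are scalars, and the "verifier" prefers actions
# close to a target value. A real verifier would score visual observations.
target = 0.3
candidates = [0.9, 0.1, 0.35, -0.4]
best = select_verified_action(candidates, lambda a: -abs(a - target))
```

Rollouts built from actions selected this way could then be logged as supervision for offline fine-tuning, which is the second stage the summary describes.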
The paper introduces Zone of Proximal Policy Optimization (ZPPO), a method that keeps a stronger teacher inside the prompt rather than in the policy gradient, avoiding drift when student rollouts fail on hard questions. ZPPO constructs two reformulated prompts for difficult queries: a Binary Candidate-included Question (BCQ), which pairs one correct teacher response with one incorrect student response as anonymized candidates for discrimination, and a Negative Candidate-included Question (NCQ), which aggregates the student's wrong rollouts to surface shared failure modes. A prompt replay buffer recirculates each hard question until the student reaches a mean rollout accuracy of one half on it or the question is evicted, keeping training focused on the student's current zone of proximal development. Evaluated on the Qwen3.5 family at four student scales (0.8B–9B) with a 27B teacher, post-trained as vision-language models, ZPPO outperforms off-policy distillation, on-policy distillation, and GRPO across a 31-benchmark suite (16 VLM, 10 LLM, 5 Video), with the largest gains at the smallest scale.
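The replay-buffer behaviour described above can be sketched as a small queue. This is a minimal sketch under stated assumptions, not the paper's implementation: the class and method names are hypothetical, and the eviction rule (a cap on attempts, `max_attempts`) is an assumption, since the summary does not say when a question is evicted.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional, Sequence


@dataclass
class HardQuestion:
    prompt: str
    attempts: int = 0


class PromptReplayBuffer:
    """Recirculates hard questions until they are solved or evicted."""

    def __init__(self, accuracy_threshold: float = 0.5, max_attempts: int = 3):
        self.accuracy_threshold = accuracy_threshold  # "half" mean rollout accuracy
        self.max_attempts = max_attempts  # assumed eviction rule, not from the source
        self._queue: Deque[HardQuestion] = deque()

    def add(self, prompt: str) -> None:
        self._queue.append(HardQuestion(prompt))

    def pop(self) -> Optional[HardQuestion]:
        return self._queue.popleft() if self._queue else None

    def report(self, question: HardQuestion, rollout_correct: Sequence[bool]) -> str:
        """Record one round of rollouts and decide what happens to the question."""
        if not rollout_correct:
            raise ValueError("need at least one rollout result")
        question.attempts += 1
        mean_accuracy = sum(rollout_correct) / len(rollout_correct)
        if mean_accuracy >= self.accuracy_threshold:
            return "solved"
        if question.attempts >= self.max_attempts:
            return "evicted"
        self._queue.append(question)  # still too hard: try again later
        return "recirculated"

    def __len__(self) -> int:
        return len(self._queue)


buffer = PromptReplayBuffer(max_attempts=2)
buffer.add("hard question 1")
question = buffer.pop()
first = buffer.report(question, [False, False, True, False])   # accuracy 0.25
question = buffer.pop()
second = buffer.report(question, [True, True, False, False])   # accuracy 0.50
```

In this toy run the question is recirculated once and then leaves the buffer as solved when half of its rollouts are correct.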
DecagonAI reduced voice agent cost per turn nearly 6x by migrating from closed models to fine-tuned open models on Together AI. They maintained p95 model latency under 400 ms per turn, low enough for real-time voice, through custom speculative decoding, prompt caching, and optimized serving on NVIDIA Blackwell GPUs. The team deploys new models weekly or even daily, demonstrating rapid iteration and full control over their AI stack without proprietary API lock-in.
The paper introduces d-OPSD, the first on-policy self-distillation framework designed for diffusion large language models (dLLMs). It replaces autoregressive-centric prefix conditioning with self-generated suffix conditioning, allowing the student to learn from future self-experience. Supervision shifts from token-level to step-level to align with the iterative denoising process. Experiments on four reasoning benchmarks show d-OPSD consistently outperforms RLVR and SFT baselines while requiring only about 10% of RLVR's optimization steps, demonstrating superior sample efficiency. Code is available on GitHub.
The paper introduces a new dataset that combines system, network, and browser activity logs, containing 2.3 million events from 870 sessions (70 attack, 800 benign). All malicious events are labeled with MITRE ATT&CK technique IDs, covering 12 tactics and 53 techniques, and attacks were generated using real tools, including RATs, C2 tunnels, and cloud exfiltration. The authors fine-tuned three Small Language Models (Qwen2.5-1.5B, Llama-3.2-3B, Phi-4-Mini) with LoRA and evaluated them on chunk classification and ATT&CK technique identification. Fine-tuning raised chunk classification accuracy from ~8% for base models to 90–97%. Technique identification remained hard, with the best exact-match accuracy reaching only 42%, though high partial-match scores indicate the models learned the underlying reasoning.
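The gap between exact-match and partial-match scores is easier to see with a concrete metric. The sketch below is an assumption about how such scores could be computed: exact match compares the predicted and gold sets of technique IDs, and partial match is taken to be Jaccard overlap. The summary does not give the paper's actual definitions, and the function names are hypothetical.

```python
from typing import Iterable


def exact_match(predicted: Iterable[str], gold: Iterable[str]) -> bool:
    """True only when the predicted technique IDs equal the gold set."""
    return set(predicted) == set(gold)


def partial_match(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Jaccard overlap between predicted and gold IDs (assumed definition)."""
    predicted_set, gold_set = set(predicted), set(gold)
    if not predicted_set and not gold_set:
        return 1.0  # nothing to predict and nothing predicted
    return len(predicted_set & gold_set) / len(predicted_set | gold_set)


# Illustrative labels only, not taken from the dataset.
gold = ["T1059", "T1071"]
predicted = ["T1059", "T1105"]

is_exact = exact_match(predicted, gold)
overlap = partial_match(predicted, gold)
```

Here the prediction gets one of two techniques right, so it counts as a failure under exact match while still earning partial credit, which is the pattern the reported scores suggest.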
The Tencent Rhino Bird Elite Talent Program released three papers accepted at ICML 2026. The first, Hybrid Policy Distillation (HPD), unifies forward and reverse KL divergence and on/off-policy data to improve LLM distillation stability, efficiency, and performance across math reasoning, dialogue, and code generation. The second, Many-Shot CoT-ICL, studies in-context learning with many chain-of-thought examples for reasoning tasks, finding that similarity-based retrieval fails and proposing CDS, a method that orders examples by conceptual progression to boost reasoning accuracy by 3.81% on average. The third, CamGeo, distills 3D geometry priors from a video-to-3D model into a diffusion backbone via trajectory and cross-frame consistency distillation and a coarse-to-fine curriculum, achieving stable performance gains in sparse camera-conditioned image-to-video generation.
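To make the forward versus reverse KL distinction in HPD concrete, the sketch below mixes the two divergences with a weight `alpha`. This is an illustrative assumption only: the summary says HPD unifies the two divergences but does not say how, so the weighted sum, the `alpha` parameter, and the function names are hypothetical rather than the paper's objective.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def hybrid_kl(
    teacher: Sequence[float],
    student: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Weighted mix of forward and reverse KL (assumed form, not from the paper)."""
    forward = kl_divergence(teacher, student)  # mode-covering: student spreads mass
    reverse = kl_divergence(student, teacher)  # mode-seeking: student concentrates mass
    return alpha * forward + (1.0 - alpha) * reverse


# Toy next-token distributions over a three-token vocabulary.
teacher_probs = [0.7, 0.2, 0.1]
student_probs = [0.4, 0.4, 0.2]

loss = hybrid_kl(teacher_probs, student_probs, alpha=0.5)
```

Setting `alpha` to 1 recovers pure forward KL and setting it to 0 recovers pure reverse KL, so a single knob spans the two standard distillation objectives in this sketch.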