The Qwen team released Qwen-RobotSuite, a suite of three independent embodied AI foundation models for robotics. Qwen-RobotManip is a Vision-Language-Action model based on Qwen3.5-4B that aligns heterogeneous manipulation data into a unified 80-dimensional action vector; it placed first on RoboChallenge Table30-v1 and shows strong cross-embodiment transfer. Qwen-RobotWorld is a language-conditioned video world model that uses a 60-layer dual-stream MMDiT and a frozen Qwen2.5-VL encoder, ranking first overall on EWMBench and DreamGen Bench. Qwen-RobotNav is a scalable navigation model built on Qwen3-VL with a parameterized observation interface; it reaches a 76.5% success rate on VLN-CE RxR and enables agentic planning. RobotManip and RobotNav have public GitHub repositories; RobotWorld is presented as a research paper.
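To make the idea of a unified action vector concrete, here is a minimal sketch of one common way to align robots with different action spaces: each embodiment writes into its own slice of a fixed-width vector, and a mask records which entries are real. Only the width of 80 comes from the summary above; the slot names, index ranges, and the pad-and-mask scheme itself are assumptions for illustration, not Qwen-RobotManip's documented layout.

```python
from typing import Dict, List, Tuple

ACTION_DIM = 80  # unified width stated in the summary

# Hypothetical slot layout (an assumption, not the model's real mapping):
# each action component owns a half-open index range in the unified vector.
SLOTS: Dict[str, Tuple[int, int]] = {
    "arm_joints": (0, 7),
    "gripper": (7, 8),
    "base_velocity": (8, 11),
}


def to_unified(parts: Dict[str, List[float]]) -> Tuple[List[float], List[int]]:
    """Place each component in its slot; unused entries stay zero and masked out."""
    vec = [0.0] * ACTION_DIM
    mask = [0] * ACTION_DIM
    for name, values in parts.items():
        start, end = SLOTS[name]
        if len(values) != end - start:
            raise ValueError(f"{name} expects {end - start} values, got {len(values)}")
        vec[start:end] = values
        mask[start:end] = [1] * (end - start)
    return vec, mask
```

A fixed-arm robot would fill only `arm_joints` and `gripper`, while a mobile manipulator would also fill `base_velocity`; a training loss could then be restricted to entries where the mask is 1.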
Zyphra has released Zamba2-VL, a family of open vision-language models in three sizes: 1.2B, 2.7B, and 7B parameters. Each model uses a hybrid backbone that combines Mamba2 state-space layers with a small number of shared transformer blocks in place of dense attention, giving near-linear inference scaling. The models pair a Qwen2.5-VL vision encoder with this backbone, supporting single- and multi-image understanding and grounding. On 14 benchmarks, Zamba2-VL shows strong visual counting and document understanding (e.g., 90.9 DocVQA for the 2.7B model) but lags larger baselines on knowledge-heavy reasoning benchmarks such as MMMU and MathVista. Its main advantage is an order-of-magnitude lower time-to-first-token than comparable Transformer VLMs, which is particularly beneficial for long multimodal inputs and on-device deployment. Weights are released under the Apache 2.0 license on Hugging Face, with inference code available.
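The near-linear scaling follows from how a state-space layer processes a sequence: it carries a fixed-size state forward one token at a time, rather than comparing every token with every other token as dense attention does. The toy recurrence below illustrates only that property. It is a deliberately simplified scalar sketch with fixed coefficients; Mamba2 itself uses input-dependent parameters and a much larger state, and none of the names here come from Zyphra's code.

```python
from typing import List


def ssm_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Toy linear state-space recurrence: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    One pass over the input with a constant-size state, so the work grows
    linearly with sequence length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # update the running state
        ys.append(c * h)    # read out from the state
    return ys
```

For example, `ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0)` returns `[1.0, 0.5, 0.25]`: the first input decays geometrically through the state, and doubling the input length doubles the number of loop iterations.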
Anthropic released two models, Claude Fable 5 and Claude Mythos 5, on June 9, 2026. Both belong to the new Mythos class, positioned above the Opus tier, and share the same underlying model. Fable 5 is generally available with safety classifiers that fall back to Opus 4.8 on flagged requests, while Mythos 5 has its cyber safeguards lifted and is limited to Project Glasswing. The models offer a 1M-token context window and up to 128k output tokens, priced at $10 per million input tokens and $50 per million output tokens. Anthropic reports that Fable 5 achieves state-of-the-art results across nearly all benchmarks, including software engineering, finance, vision, and long-context tasks, with Stripe demonstrating a 50-million-line code migration in one day. The classifiers trigger in under 5% of sessions, so more than 95% of Fable 5 sessions see no fallback and effectively match Mythos 5 performance.
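The pricing and limits quoted above translate into per-request cost with simple arithmetic. The sketch below uses only the four figures from the summary; the function name is hypothetical, the input and output limits are checked independently (whether output tokens count against the 1M window is not stated), and any discounts such as caching or batching are ignored because the summary does not mention them.

```python
INPUT_USD_PER_MTOK = 10.0     # $10 per million input tokens (from the summary)
OUTPUT_USD_PER_MTOK = 50.0    # $50 per million output tokens (from the summary)
MAX_CONTEXT_TOKENS = 1_000_000
MAX_OUTPUT_TOKENS = 128_000


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate list-price cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Assumption: the two limits are checked separately.
    if input_tokens > MAX_CONTEXT_TOKENS or output_tokens > MAX_OUTPUT_TOKENS:
        raise ValueError("request exceeds the stated token limits")
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000
```

Under these assumptions, a request that fills the full 1M-token input and produces the full 128k-token output costs about $16.40 at list price: $10.00 for input plus $6.40 for output.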
Google announced Gemini 3.5 Live Translate, a dedicated speech-to-speech audio model that continuously translates spoken audio into 70+ languages while preserving the speaker's intonation and pacing. Unlike turn-based agents, it processes audio as a stream, producing translated speech a few seconds behind the speaker. Developers can configure it via the Gemini Live API using a translationConfig with a BCP-47 target language code; the model accepts only raw 16-bit, 16kHz PCM audio as input and outputs 24kHz audio. It is rolling out in public preview on the Live API and Google AI Studio and in private preview in Google Meet (expanding from 5 to 70+ languages), and it will launch in the Google Translate app on Android and iOS. All generated audio is watermarked with SynthID for detectability.
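Because the model accepts only raw 16-bit, 16kHz PCM, client code typically has to convert and chunk audio before streaming it. The sketch below shows that preparation step using only the Python standard library. It assumes mono audio and little-endian byte order, neither of which is stated above, and the 100 ms chunk size is an arbitrary illustrative choice. The Live API call and the exact shape of translationConfig are deliberately left out, since the summary does not give its field names.

```python
import struct
from typing import Iterator, List

INPUT_RATE_HZ = 16_000   # input sample rate stated in the summary
BYTES_PER_SAMPLE = 2     # 16-bit samples


def floats_to_pcm16(samples: List[float]) -> bytes:
    """Clamp samples to [-1, 1] and pack them as signed 16-bit integers.

    Little-endian byte order and a single (mono) channel are assumptions.
    """
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_pcm(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration slices of a PCM buffer for streaming."""
    size = INPUT_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for i in range(0, len(pcm), size):
        yield pcm[i:i + size]
```

At these settings one second of mono audio is 32,000 bytes, so a 100 ms chunk is 3,200 bytes. Smaller chunks would reduce buffering delay on the client at the cost of more messages.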