Qwen-RobotSuite Releases Three Embodied AI Models for VLA Manipulation, Video World Modeling, and Navigation
Summary
The Qwen team has released Qwen-RobotSuite, a suite of three independent embodied AI foundation models for robotics. Qwen-RobotManip is a Vision-Language-Action (VLA) model based on Qwen3.5-4B that aligns heterogeneous manipulation data into a unified 80-dimensional action vector; it ranks first on RoboChallenge Table30-v1 and shows strong cross-embodiment transfer. Qwen-RobotWorld is a language-conditioned video world model built on a 60-layer dual-stream MMDiT and a frozen Qwen2.5-VL encoder, and it ranks first overall on EWMBench and DreamGen Bench. Qwen-RobotNav is a scalable navigation model built on Qwen3-VL with a parameterized observation interface; it reaches a 76.5% success rate on VLN-CE RxR and supports agentic planning. RobotManip and RobotNav have public GitHub repositories; RobotWorld is presented as a research paper.
Key takeaways
Qwen-RobotManip uses a unified alignment framework built around an 80-dimensional canonical action vector with per-dimension masking. This lets a single VLA model handle diverse robots, and it achieves state-of-the-art results on out-of-distribution manipulation benchmarks.
Qwen-RobotWorld employs natural language as the universal action interface for video prediction, with a 20B-parameter double-stream MMDiT architecture, ranking first on EWMBench and DreamGen Bench.
Qwen-RobotNav introduces a controllable observation interface with a configurable token budget, temporal decay, and per-camera importance weights. This lets a single model serve multiple navigation tasks, and it achieves a 76.5% success rate (SR) on VLN-CE RxR.
Two of the three models, RobotManip and RobotNav, have public GitHub repositories; RobotWorld is currently documented only in a research report.
The suite collectively addresses complementary layers of embodied AI: RobotWorld for simulation and data generation, RobotManip for physical manipulation, and RobotNav for mobility.
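To make the 80-dimensional canonical action vector with per-dimension masking more concrete, here is a minimal Python sketch of the general idea. It is not the Qwen team's implementation: the `Embodiment` class, the slot layouts, and the `masked_mse` loss are all hypothetical names and choices made for illustration. Only the vector size of 80 and the notion of a per-dimension mask come from the summary above.

```python
from __future__ import annotations

from dataclasses import dataclass

CANONICAL_DIM = 80  # unified action vector size stated in the summary


@dataclass(frozen=True)
class Embodiment:
    """Hypothetical description of one robot's action space."""
    name: str
    slots: tuple[int, ...]  # canonical indices this robot's native dims map to


def to_canonical(robot: Embodiment, native_action: list[float]):
    """Scatter a native action into the canonical vector and build its mask."""
    if len(native_action) != len(robot.slots):
        raise ValueError(f"{robot.name} expects {len(robot.slots)} values")
    vector = [0.0] * CANONICAL_DIM
    mask = [False] * CANONICAL_DIM
    for idx, value in zip(robot.slots, native_action):
        vector[idx] = float(value)
        mask[idx] = True  # only these dimensions are meaningful for this robot
    return vector, mask


def masked_mse(pred: list[float], target: list[float], mask: list[bool]) -> float:
    """Mean squared error over the unmasked dimensions only (assumed loss)."""
    errors = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    if not errors:
        raise ValueError("mask selects no dimensions")
    return sum(errors) / len(errors)


# Illustrative layouts only; the real slot assignment is not given in the source.
single_arm = Embodiment("single_arm_7dof", tuple(range(0, 7)))
dual_arm = Embodiment("dual_arm_14dof", tuple(range(0, 7)) + tuple(range(40, 47)))

target, mask = to_canonical(single_arm, [0.1] * 7)
prediction = [0.0] * CANONICAL_DIM
loss = masked_mse(prediction, target, mask)
```

Under these assumptions, a 7-DoF arm and a 14-DoF dual-arm robot share one output head, and the mask keeps unused dimensions from contributing to the training loss.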