Geometric Action Model for Robot Policy Learning
Abstract
The paper introduces the Geometric Action Model (GAM), a language-conditioned manipulation policy that repurposes a pretrained geometric foundation model (GFM) to bring explicit 3D geometry to contact-rich tasks. GAM splits the GFM at an intermediate layer: the shallow layers encode observations, and a causal future predictor inserted at the split forecasts future latent tokens from language, proprioception, and action history. The remaining GFM blocks then process the predicted tokens, so a single backbone jointly predicts future scene geometry and robot actions with minimal architectural changes. Across simulation and real-robot benchmarks, GAM is more accurate, more robust, faster, and more compact than existing foundation-model-scale baselines.
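The data flow described in the abstract can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the names `split_backbone`, `run_blocks`, and `gam_style_forward` are hypothetical, tokens are plain floats, each "block" just adds 1.0, and a single number stands in for the language, proprioception, and action-history conditioning.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[float]
Block = Callable[[Tokens], Tokens]


def split_backbone(blocks: Sequence[Block], split_idx: int) -> Tuple[List[Block], List[Block]]:
    """Split a block stack into shallow (encoder) and deep (decoder) parts."""
    if not 0 < split_idx < len(blocks):
        raise ValueError("split_idx must leave at least one block on each side")
    return list(blocks[:split_idx]), list(blocks[split_idx:])


def run_blocks(blocks: Sequence[Block], tokens: Tokens) -> Tokens:
    """Apply the blocks in order."""
    for block in blocks:
        tokens = block(tokens)
    return tokens


def gam_style_forward(blocks, split_idx, predictor, geometry_head, action_head,
                      observation, context):
    """Hypothetical forward pass: encode, predict future latents, decode, read out."""
    shallow, deep = split_backbone(blocks, split_idx)
    latent = run_blocks(shallow, observation)    # shallow layers: observation encoding
    future_latent = predictor(latent, context)   # inserted predictor: future latent tokens
    decoded = run_blocks(deep, future_latent)    # remaining blocks: shared decoder
    return geometry_head(decoded), action_head(decoded)


# Toy stand-ins (assumptions for illustration, NOT the real model components).
def add_one(tokens: Tokens) -> Tokens:
    return [x + 1.0 for x in tokens]


def toy_predictor(latent: Tokens, context: float) -> Tokens:
    # `context` stands in for language, proprioception, and action history.
    return [x + context for x in latent]


def toy_geometry_head(tokens: Tokens) -> Tokens:
    return list(tokens)


def toy_action_head(tokens: Tokens) -> float:
    return sum(tokens) / len(tokens)


toy_blocks = [add_one] * 4  # a 4-block "backbone", split after block 2
geometry, action = gam_style_forward(
    toy_blocks, 2, toy_predictor, toy_geometry_head, toy_action_head,
    observation=[0.0, 2.0], context=0.5,
)
print(geometry, action)  # prints [4.5, 6.5] 5.5
```

The point of the sketch is structural: both outputs are read from the same decoded tokens, and the only new component is the predictor placed between the two halves of the existing block stack.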
Key Points
Problem: Current vision-language-action models and video world-action models operate mainly on 2D frames or latent spaces, so they lack the explicit 3D geometric reasoning that precise manipulation requires.
Method: GAM repurposes a pretrained geometric foundation model by splitting it at an intermediate layer. The shallow layers act as the observation encoder, and a causal future predictor inserted at the split forecasts future latent tokens conditioned on language, proprioception, and action history, so the model predicts future geometry and actions together.
Architecture: The causal future predictor is inserted at the split layer, and the remaining GFM blocks serve as a shared decoder for both the future-geometry and action outputs, which preserves the GFM's rich geometric priors with minimal modification.
Performance: On simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and smaller than foundation-model-scale baselines.
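The summary does not spell out what makes the future predictor "causal". Assuming the usual sequence-modeling meaning (the output at step i may depend only on inputs at steps 0..i), the constraint can be illustrated with a lower-triangular mask. The helper names `causal_mask` and `causal_mean` below are hypothetical, and the averaging "mixer" is a toy stand-in for attention rather than anything taken from the paper.

```python
from typing import List


def causal_mask(num_steps: int) -> List[List[bool]]:
    """mask[i][j] is True when step i may use information from step j (j <= i)."""
    return [[j <= i for j in range(num_steps)] for i in range(num_steps)]


def causal_mean(history: List[float]) -> List[float]:
    """Toy causal mixer: output i is the mean of the inputs the mask allows (0..i)."""
    mask = causal_mask(len(history))
    outputs = []
    for row in mask:
        visible = [x for x, allowed in zip(history, row) if allowed]
        outputs.append(sum(visible) / len(visible))
    return outputs


original = causal_mean([1.0, 3.0, 5.0])
perturbed = causal_mean([1.0, 3.0, 100.0])  # change only the last (latest) input

print(original)                       # prints [1.0, 2.0, 3.0]
print(perturbed[:2] == original[:2])  # prints True: earlier outputs are unaffected
```

Changing the latest input leaves all earlier outputs untouched, which is the property a causal predictor needs in order to forecast from past language, proprioception, and action history without seeing the future.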