Orchestra-o1: An Omnimodal Agent Orchestration Framework That Reaches SOTA Performance on the OmniGAIA Benchmark via DA-GRPO Optimization
Abstract
The paper introduces Orchestra-o1, an omnimodal agent orchestration framework in which multiple agents collaborate efficiently on simultaneous text, image, audio, and video inputs. It addresses the limitations of existing systems in complex multimodal settings by streamlining task decomposition, sub-agent specialization, and parallel sub-task execution. The framework employs a novel decision-aligned group relative policy optimization (DA-GRPO) algorithm. On the OmniGAIA benchmark, Orchestra-o1 achieves state-of-the-art performance, surpassing the second-best approach by 10.3% in accuracy. The work demonstrates that coordinated multi-agent orchestration across modalities significantly boosts task performance.
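The decomposition, specialization, and parallel-execution pattern named in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: `SubTask`, `decompose`, `run_sub_agent`, and `orchestrate` are hypothetical names, the decomposition rule (one sub-task per modality) is an assumption made for illustration, and the specialist agent is a stub that only upper-cases its input.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SubTask:
    modality: str  # "text", "image", "audio", or "video"
    payload: str   # stand-in for the raw input


async def run_sub_agent(task: SubTask) -> str:
    """Hypothetical specialist; a real one would call a modality-specific model."""
    await asyncio.sleep(0)  # yield control, standing in for model latency
    return f"{task.modality}:{task.payload.upper()}"


def decompose(request: dict[str, str]) -> list[SubTask]:
    """Toy decomposition (assumption): one sub-task per modality in the request."""
    return [SubTask(modality, payload) for modality, payload in request.items()]


async def orchestrate(request: dict[str, str]) -> list[str]:
    tasks = decompose(request)
    # gather runs the sub-agents concurrently and returns results in input order
    return await asyncio.gather(*(run_sub_agent(t) for t in tasks))


def run(request: dict[str, str]) -> list[str]:
    """Synchronous entry point for the sketch."""
    return asyncio.run(orchestrate(request))
```

Calling `run({"text": "hi", "audio": "clip"})` dispatches one stub agent per modality and collects their outputs in the order the modalities were given.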
Key Points
Proposes Orchestra-o1, an omnimodal agent orchestration framework for concurrent text, image, audio, and video inputs.
Introduces decision-aligned group relative policy optimization (DA-GRPO) to improve multi-agent coordination.
Achieves state-of-the-art performance on the OmniGAIA benchmark, outperforming the next-best method by 10.3% in accuracy.
Streamlines task decomposition, sub-agent specialization, and parallel execution for efficient orchestration.
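For readers unfamiliar with the base algorithm, the sketch below shows the group-relative advantage used in standard GRPO: each sampled response's reward is normalized against the mean and standard deviation of its own sampling group. The decision-aligned modification that distinguishes DA-GRPO is not described in this summary, so it is not reproduced here. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group (standard GRPO).

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With rewards `[1.0, 0.0, 0.0, 1.0]`, the two successful samples receive advantages close to +1 and the two failures close to -1; a group with identical rewards yields all-zero advantages and therefore contributes no learning signal.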