Orchestra-o1: An Omnimodal Agent Orchestration Framework That Achieves SOTA Performance on the OmniGAIA Benchmark via DA-GRPO Optimization

Abstract

The paper introduces Orchestra-o1, an omnimodal agent orchestration framework that enables efficient collaboration among agents handling text, image, audio, and video inputs simultaneously. It addresses the limitation of existing systems in complex multi-modal settings by streamlining task decomposition, sub-agent specialization, and parallel sub-task execution. The framework employs a novel decision-aligned group relative policy optimization (DA-GRPO) algorithm. On the OmniGAIA benchmark, Orchestra-o1 achieves state-of-the-art performance, surpassing the second-best approach by 10.3% in accuracy. The work demonstrates that coordinated multi-agent orchestration across modalities significantly boosts task performance.
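The abstract names DA-GRPO but does not describe its decision-aligned component. As background only, the sketch below shows the group-relative advantage normalization used in plain GRPO, which DA-GRPO presumably builds on. The function name, the `eps` parameter, and the sample rewards are illustrative assumptions, not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Plain GRPO normalization (an assumption, not the paper's DA-GRPO):
    score each sampled response against the mean and standard deviation
    of its own sampling group."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every reward in the group is equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for the same task
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean get a positive advantage and the rest get a negative one, so no separate value model is needed to form a baseline.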

Key points

Proposes Orchestra-o1, an omnimodal agent orchestration framework for concurrent text, image, audio, and video inputs.

Introduces decision-aligned group relative policy optimization (DA-GRPO) to improve multi-agent coordination.

Achieves state-of-the-art performance on the OmniGAIA benchmark, outperforming the next-best method by 10.3% in accuracy.

Streamlines task decomposition, sub-agent specialization, and parallel execution for efficient orchestration.

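The decomposition, specialization, and parallel-execution pattern from the key points can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `SubTask`, `decompose`, `run_sub_agent`, and `orchestrate` are hypothetical names, and the sub-agent is a stub standing in for a real modality-specific model call.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class SubTask:
    modality: str  # "text", "image", "audio", or "video"
    payload: str   # e.g. a file name or prompt fragment


def decompose(inputs: dict[str, str]) -> list[SubTask]:
    """Hypothetical decomposition: one sub-task per input modality."""
    return [SubTask(modality, payload) for modality, payload in inputs.items()]


async def run_sub_agent(task: SubTask) -> str:
    """Stub specialist; a real sub-agent would call a modality-specific model."""
    await asyncio.sleep(0)  # yield control, as an I/O-bound call would
    return f"{task.modality}: processed {task.payload}"


async def orchestrate(inputs: dict[str, str]) -> list[str]:
    sub_tasks = decompose(inputs)
    # gather runs the sub-agents concurrently and returns results in input order
    results = await asyncio.gather(*(run_sub_agent(t) for t in sub_tasks))
    return list(results)


results = asyncio.run(orchestrate({"text": "question.txt", "audio": "clip.wav"}))
```

Using `asyncio.gather` keeps the results in the same order as the sub-tasks, which makes a later aggregation step straightforward. A real orchestrator would also need to handle dependencies between sub-tasks and sub-agent failures.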