The paper introduces the Geometric Action Model (GAM), which leverages a pretrained geometric foundation model to enhance language-conditioned manipulation in 3D physical environments. GAM splits the foundation model into an observation encoding layer and a future prediction layer, enabling it to predict future tokens from language, proprioception, and action history before decoding them into actions. This 3D-aware approach significantly improves accuracy, robustness, efficiency, and speed over standard 2D vision-language-action models in both simulated and real-robot contact-rich tasks.
JoyAI-VL-Interaction is an 8B-scale, vision-first model that decides on its own, without user prompting, whether to respond or delegate, aiming to react to environmental changes as a human would. The system processes streaming video for real-time interaction and pairs the model with pluggable ASR/TTS modules and a background brain. In evaluations, human raters preferred it over existing video-call assistants across multiple scenarios. The model and system are open-source, and the authors present them as a new paradigm in interaction modeling for always-on, perceptive agents.
The paper proposes Data2Story, a multi-agent framework that automates data journalism by mimicking a virtual newsroom with distinct roles. It generates evidence-based news stories in multiple formats, such as text articles, interactive maps, and audio, each linked to its data sources for verifiability. In evaluations against expert human journalists, Data2Story was competitive overall and did best on transparency and auditability, while human journalists still outperformed it on editorial angle and creative design. The system is designed as a collaborative tool for journalists, not a replacement.
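To make the idea of source-linked output concrete, here is a minimal sketch of how a claim-to-source link could be represented and audited. It is not Data2Story's actual data model: the `Claim` class, the `unsupported_claims` helper, and the source identifiers are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """One statement in a generated story (hypothetical structure)."""
    text: str
    source_ids: list = field(default_factory=list)


def unsupported_claims(claims, known_sources):
    """Return claims that cite no source or cite an unregistered source id."""
    return [
        c for c in claims
        if not c.source_ids or any(s not in known_sources for s in c.source_ids)
    ]


# Hypothetical registry of datasets the story is allowed to cite.
known_sources = {"census_2020", "city_budget_2023"}

story = [
    Claim("Population grew over the decade.", ["census_2020"]),
    Claim("Spending on parks doubled.", ["city_budget_2023"]),
    Claim("Residents are happier than ever.", []),          # no source at all
    Claim("Transit use fell sharply.", ["unknown_survey"]),  # unregistered source
]

flagged = unsupported_claims(story, known_sources)
for claim in flagged:
    print("needs a source:", claim.text)
```

An audit pass of this kind is one simple way to make verifiability checkable by a machine as well as by a reader.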
OmniDirector introduces a unified framework for camera motion cloning in video generation that uses grid motion videos to visually encode camera parameters, supporting diverse trajectories for multi-shot scenes. It trains on a large dataset of camera grid-video pairs, eliminating the need for cross-paired data. The framework integrates characters, actions, and cameras via multimodal diffusion transformers, providing director-level control. A hierarchical prompt expansion agent harmonizes different control signals to enhance camera motion and visual content descriptions. Extensive experiments demonstrate its superior performance and controllability over existing methods.
The paper introduces Orchestra-o1, an omnimodal agent orchestration framework that enables efficient collaboration among agents handling text, image, audio, and video inputs simultaneously. It addresses the limitations of existing systems in complex multimodal settings by streamlining task decomposition, sub-agent specialization, and parallel sub-task execution. The framework employs a novel decision-aligned group relative policy optimization (DA-GRPO) algorithm. On the OmniGAIA benchmark, Orchestra-o1 achieves state-of-the-art performance, surpassing the second-best approach by 10.3% in accuracy. The work demonstrates that coordinated multi-agent orchestration across modalities significantly boosts task performance.
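The summary does not say how the decision-aligned variant differs from ordinary group relative policy optimization, so the sketch below shows only the standard group-relative step that GRPO-style methods share: each sampled rollout's reward is normalized against the other rollouts drawn for the same task. The function name, the example rewards, and the use of the population standard deviation are assumptions made for illustration, and nothing here models the decision-alignment component.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    eps avoids division by zero when every rollout gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts of the same task (1.0 = solved).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where all rollouts tie carries no learning signal.
tied = group_relative_advantages([1.0, 1.0, 1.0])

print(advantages)
print(tied)
```

Because advantages are computed within a group, no separate value network is needed to provide a baseline. The group mean plays that role.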
MiniMax Sparse Attention (MSA) is a new method for efficient processing of ultra-long contexts (hundreds of thousands to millions of tokens) in large language models. It uses blockwise sparsity and an optimized GPU execution path to achieve significant speedups in both training and inference while maintaining performance. The method is built on Grouped Query Attention (GQA), introducing a lightweight Index Branch for group-specific sparse token retrieval and a Main Branch for exact block-sparse attention. MSA is co-designed with GPU kernels for cross-GPU scalability and has been deployed in a production-grade multimodal model, reducing per-token attention compute. Its inference kernel and model are openly available online.
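The two-branch idea can be illustrated with a small pure-Python sketch for a single query: a cheap index step scores key blocks and keeps the top few, then an exact softmax attention runs over only the tokens in those blocks. This is not the MSA kernel. Scoring a block by the query's dot product with the block's mean key is an assumption of this sketch, as are the function names and parameters. In a GQA setting the selected blocks would presumably be shared by the query heads of one group, which the sketch does not model.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Attend over only the top_k highest-scoring key blocks."""
    n, dim = len(keys), len(query)
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    # Index step (assumed scoring rule): query dot mean key of each block.
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])
    token_ids = [i for b in selected for i in blocks[b]]

    # Main step: exact scaled softmax attention restricted to selected tokens.
    scale = 1.0 / math.sqrt(dim)
    logits = [dot(query, keys[i]) * scale for i in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out_dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
        for d in range(out_dim)
    ]
    return output, selected


# Eight tokens in four blocks of two; only block 1 aligns with the query.
keys = [[0, 1], [0, 1], [1, 0], [1, 0], [-1, 0], [-1, 0], [0, -1], [0, -1]]
values = [[float(i)] for i in range(8)]
query = [1.0, 0.0]

output, selected = block_sparse_attention(query, keys, values, block_size=2, top_k=1)
print(selected, output)
```

With `top_k` equal to the number of blocks the function reduces to dense attention, so the savings come entirely from how few blocks the index step keeps.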