JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence
English summary
JoyAI-VL-Interaction is an 8B-parameter, vision-first model that decides on its own, without a user prompt, whether to respond or to delegate, with the aim of reacting to changes in its environment the way a person would. The system processes continuously streamed video for real-time interaction and includes pluggable ASR/TTS modules and a background brain. In evaluations, human raters preferred it over existing video-call assistants across multiple scenarios. Both the model and the system are open source, and the authors present them as a new paradigm of interaction modeling for always-on, perceptive agents.
Key points
8B-parameter, vision-first model that triggers its own responses from visual input, without user prompts.
Real-time streaming video interaction system with pluggable speech modules (ASR/TTS) and a background brain.
Human raters preferred JoyAI-VL-Interaction over existing video-call assistants in multiple scenarios.
Open-source release of both model and system, introducing a new paradigm for always-on interactive agents.
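The control flow implied by the first two key points (unprompted triggering, plus pluggable speech modules and a background brain) can be pictured with a small sketch. This is an illustration under stated assumptions, not the released implementation: the names `Action`, `Decision`, `TTS`, `RecordingTTS`, `run_loop`, `toy_policy`, and `toy_brain` are all hypothetical, frames are stood in for by text descriptions, a third "stay silent" outcome is assumed, and the brain is called synchronously for simplicity even though a real background component would presumably run asynchronously.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List, Protocol


class Action(Enum):
    """Hypothetical per-frame outcomes; SILENT is an assumption of this sketch."""
    SILENT = "silent"      # nothing worth reacting to
    RESPOND = "respond"    # answer directly
    DELEGATE = "delegate"  # hand the task to the background brain


@dataclass
class Decision:
    action: Action
    text: str = ""  # spoken reply (RESPOND) or task description (DELEGATE)


class TTS(Protocol):
    """Any object with a speak() method can be plugged in as the TTS module."""
    def speak(self, text: str) -> None: ...


class RecordingTTS:
    """Stand-in TTS that records what it would have said."""
    def __init__(self) -> None:
        self.spoken: List[str] = []

    def speak(self, text: str) -> None:
        self.spoken.append(text)


def run_loop(
    frames: Iterable[str],
    decide: Callable[[str], Decision],
    brain: Callable[[str], str],
    tts: TTS,
) -> List[Action]:
    """Feed frames one by one; the policy alone decides whether to act."""
    log: List[Action] = []
    for frame in frames:
        decision = decide(frame)
        log.append(decision.action)
        if decision.action is Action.RESPOND:
            tts.speak(decision.text)
        elif decision.action is Action.DELEGATE:
            # Assumption: the brain's result is spoken back to the user.
            tts.speak(brain(decision.text))
    return log


def toy_policy(frame: str) -> Decision:
    """Keyword rules standing in for the vision-language model."""
    if "smoke" in frame:
        return Decision(Action.RESPOND, "I see smoke. Is everything okay?")
    if "receipt" in frame:
        return Decision(Action.DELEGATE, "sum the items on the receipt")
    return Decision(Action.SILENT)


def toy_brain(task: str) -> str:
    """Placeholder for a slower background component."""
    return f"[brain] done: {task}"


if __name__ == "__main__":
    tts = RecordingTTS()
    frames = ["empty desk", "smoke near stove", "receipt on table"]
    print(run_loop(frames, toy_policy, toy_brain, tts))
    print(tts.spoken)
```

Because `TTS` is a structural protocol, swapping in a different speech backend only requires an object with a matching `speak` method; an ASR module could be made pluggable in the same way.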