JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence | thinkgap

Social | Source: TELEGRAM HUGGINGFACEPAPERS | June 16, 2026 | Importance: 4/5

Summary

JoyAI-VL-Interaction is an 8-billion-parameter, vision-first model that decides on its own when to respond or delegate, without waiting for a user prompt. The aim is for it to perceive and react to changes in its environment as a person would. The surrounding system processes continuous streaming video for real-time interaction and includes pluggable ASR/TTS modules and a background brain. In evaluations, human raters preferred it over existing video-call assistants across multiple scenarios. Both the model and the system are open-source, and the release is presented as a new paradigm in interaction modeling for always-on, perceptive agents.

Key points

An 8-billion-parameter, vision-first model that triggers responses autonomously from visual input, without user prompts.
A real-time streaming-video interaction system with pluggable speech modules (ASR/TTS) and a background brain.
Human raters preferred JoyAI-VL-Interaction over existing video-call assistants in multiple scenarios.
Both the model and the system are released as open source, introducing a new paradigm for always-on interactive agents.
