Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
English summary
The paper introduces a reinforcement learning post-training method to comprehensively improve interactivity in full-duplex spoken dialogue models. It addresses four axes of interactive behavior: pause handling, turn-taking, backchanneling, and user interruption, using axis-specific reward functions trained on short audio segments extracted from human conversation corpora. An auxiliary LLM-based reward preserves semantic response quality during optimization. The approach is applied to two open-source models, Moshi and PersonaPlex, and demonstrates consistent gains in both offline evaluation with pre-recorded audio and real-time multi-turn dialogue tests.
Chinese summary
论文提出一种基于强化学习的后训练方法,全面改进全双工口语对话模型的交互性。该方法针对暂停处理、轮换说话、反向通道和用户打断四个交互维度,利用从人类对话语料中提取的短音频片段训练特定维度的奖励函数。额外的基于大语言模型的奖励确保回复质量不下降。在 Moshi 和 PersonaPlex 两个开源模型上进行的离线预录音频评估和实时多轮对话评估均显示交互性持续提升。
Key points
Full-duplex models trained only with token-level likelihood show interactivity problems such as excessive silence and poor turn-taking; the paper uses reinforcement learning to optimize dialogue-level behaviors directly.
仅用 token 级别似然训练的全双工模型存在过度沉默和轮换不当等交互性问题;通过强化学习直接优化对话层行为。
The method defines four axis-specific reward functions covering pause handling, turn-taking, backchanneling, and user interruption, each trained on human conversation audio segments.
定义四个维度特定的奖励函数,涵盖暂停处理、轮换说话、反向通道和用户打断,均用人类对话音频片段训练。
An additional LLM-based reward is integrated to maintain semantic response quality and prevent degradation during RL fine-tuning.
引入额外的基于大语言模型的奖励,以保持回复的语义质量并防止强化学习微调中的质量下降。
Evaluation on Moshi and PersonaPlex shows consistent interactivity improvements in both offline and real-time multi-turn dialogue settings.
在 Moshi 和 PersonaPlex 上的评估表明,离线和实时多轮对话场景中的交互性均得到持续提升。
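The key points describe an axis-specific interactivity reward paired with an LLM-based semantic quality reward. The summary does not say how the two signals are combined, so the sketch below assumes a simple weighted sum purely for illustration. The names `AXES`, `RewardWeights`, and `combined_reward`, the weight values, and the [0, 1] score range are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# The four interactivity axes named in the summary.
AXES = ("pause_handling", "turn_taking", "backchanneling", "user_interruption")


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical mixing weights; the paper's actual values are not given here."""
    axis: float = 1.0      # weight on the axis-specific interactivity reward
    semantic: float = 0.5  # weight on the LLM-based semantic quality reward


def combined_reward(axis_name: str,
                    axis_score: float,
                    semantic_score: float,
                    weights: RewardWeights = RewardWeights()) -> float:
    """Combine one axis reward with the semantic reward (assumed weighted sum).

    Both scores are assumed to lie in [0, 1]; this range is an assumption
    made for the sketch, not something the summary states.
    """
    if axis_name not in AXES:
        raise ValueError(f"unknown axis: {axis_name!r}")
    for score in (axis_score, semantic_score):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores are assumed to lie in [0, 1]")
    return weights.axis * axis_score + weights.semantic * semantic_score


if __name__ == "__main__":
    # A response with good turn-taking but mediocre semantic quality.
    print(combined_reward("turn_taking", 0.5, 0.5))
```

Under this assumption, a nonzero semantic weight means a policy cannot raise its total reward through interactive behavior alone while its response content degrades, which is the role the summary assigns to the LLM-based reward.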