DecagonAI cuts voice agent cost per turn nearly 6x by fine-tuning open models on Together AI

Summary

DecagonAI reduced voice agent cost per turn nearly 6x by migrating from closed models to fine-tuned open models on Together AI. They maintained p95 model latency under 400 ms per turn, low enough for real-time voice, through custom speculative decoding, prompt caching, and optimized serving on NVIDIA Blackwell GPUs. The team deploys new models weekly or even daily, demonstrating rapid iteration and full control over their AI stack without proprietary API lock-in.

Key takeaways

DecagonAI cut voice agent cost per turn nearly 6x by switching from closed models to fine-tuned open models on Together AI.
p95 model latency stays under 400 ms per turn, low enough for real-time voice.
Latency optimization combined custom speculative decoding, prompt caching, and serving on NVIDIA Blackwell GPUs.
The team deploys new models weekly or even daily, which enables very rapid iteration.
The closed-to-open shift gives the team greater control and better token economics, and avoids vendor lock-in.
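
To make the latency figure concrete, here is a minimal sketch of how a p95 per-turn latency could be checked against the 400 ms budget cited above. It uses the nearest-rank percentile definition, which is only one of several common conventions. The function names and the sample latencies are hypothetical and are not taken from DecagonAI or Together AI.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_latency_budget(latencies_ms, budget_ms=400.0, pct=95):
    """True if the pct-th percentile latency is strictly under budget_ms."""
    return percentile_nearest_rank(latencies_ms, pct) < budget_ms


# Hypothetical per-turn model latencies in milliseconds (illustrative only).
example_latencies_ms = [
    210, 250, 230, 390, 270, 300, 220, 260, 240, 520,
    280, 310, 205, 295, 330, 245, 265, 350, 225, 255,
]

if __name__ == "__main__":
    p95 = percentile_nearest_rank(example_latencies_ms, 95)
    print(f"p95 = {p95} ms, within budget: {meets_latency_budget(example_latencies_ms)}")
```

In this made-up sample, one turn takes 520 ms, yet the p95 still falls under the budget. A p95 target tolerates a small share of slow turns, so it says nothing about worst-case latency.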
