Together AI Builds Custom Kernels for NVIDIA Blackwell GPUs, Delivering 31% Higher Throughput in Production Coding Agent Inference
English Summary
Together AI announced custom inference kernels optimized for NVIDIA's Blackwell Tensor Core instructions, achieving 31% more tokens per second (TPS) than the next-fastest open-source engine on the same Blackwell hardware. The gain was measured on coding agent benchmarks, with the hardware comparison provided by AgentPerf from Artificial Analysis. Cursor, the AI code editor, uses this inference stack to power its real-time coding agents in production.
Chinese Summary
Together AI 公布了针对 NVIDIA Blackwell GPU Tensor Core 指令优化的定制推理内核,在相同 Blackwell 硬件上比第二快的开源引擎实现了 31% 的每秒令牌数(TPS)提升。该性能在编码智能体基准测试中得到验证,硬件对比由 Artificial Analysis 的 AgentPerf 提供。AI 代码编辑器 Cursor 已将该推理栈用于生产环境中的实时编码智能体。
Key Takeaways
Together AI's custom kernels for Blackwell Tensor Core instructions yield 31% higher TPS compared to the next-fastest open-source inference engine on identical Blackwell hardware.
Together AI 针对 Blackwell Tensor Core 指令的定制内核,在相同 Blackwell 硬件上比第二快的开源推理引擎实现 31% 的 TPS 提升。
The performance gains were demonstrated on coding agent benchmarks, with hardware-level benchmarking from AgentPerf by Artificial Analysis.
性能提升在编码智能体基准上得到证明,硬件层面基准测试来自 Artificial Analysis 的 AgentPerf。
Cursor, the AI code editor, runs its production real-time coding agents on this inference stack.
AI 代码编辑器 Cursor 在实际生产环境中使用此推理栈驱动其实时编码智能体。
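To put the 31% figure in perspective, the short sketch below converts a relative TPS gain into the resulting throughput and the corresponding reduction in generation time for a fixed number of output tokens. The baseline of 100 tokens per second is a hypothetical placeholder, not a number from the announcement, and the helper names are illustrative only. The calculation also assumes that generation time scales inversely with TPS, which ignores prompt processing and network latency.

```python
def tps_after_gain(baseline_tps: float, gain: float) -> float:
    """Throughput after applying a relative gain (0.31 means 31% more TPS)."""
    return baseline_tps * (1.0 + gain)


def generation_time_reduction(gain: float) -> float:
    """Fraction of generation time saved for a fixed token count.

    Time is tokens / TPS, so the new time is 1 / (1 + gain) of the old one.
    """
    return 1.0 - 1.0 / (1.0 + gain)


# Hypothetical baseline, chosen only to make the arithmetic easy to follow.
BASELINE_TPS = 100.0
GAIN = 0.31  # the 31% TPS improvement reported in the announcement

faster_tps = tps_after_gain(BASELINE_TPS, GAIN)
time_saved = generation_time_reduction(GAIN)

print(f"Throughput: {BASELINE_TPS:.1f} -> {faster_tps:.1f} tokens/s")
print(f"Generation time for the same output: {time_saved:.1%} shorter")
```

Under these assumptions, 31% more tokens per second corresponds to roughly 24% less generation time for the same output, not 31% less, because time is the reciprocal of throughput.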