Together AI shared the three main optimizations it applied to accelerate GLM 5.1 inference. The team rewrote the indexer top-k kernel and fused the indexer kernel to reduce memory and launch overhead, and it eliminated CPU overhead that had been bottlenecking prefill throughput. The indexer changes yielded the largest performance gain. GLM 5.1 is now available on the Together AI platform.
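The summary does not say what the indexer computes. As a rough illustration only, assuming the indexer scores cached tokens against the current query and keeps the k highest-scoring positions (an assumption, not a detail from Together AI's write-up), the selection step can be sketched in plain Python. The names `index_scores` and `topk_indices` are hypothetical.

```python
import heapq


def index_scores(query, keys):
    """Dot-product score of one query vector against each cached key vector."""
    return [sum(q * x for q, x in zip(query, key)) for key in keys]


def topk_indices(scores, k):
    """Return the positions of the k largest scores, in ascending position order."""
    k = min(k, len(scores))  # never ask for more entries than exist
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)


# Toy data (made up for illustration): one 2-d query and four cached keys.
query = [1.0, 0.0]
keys = [[0.2, 0.9], [0.9, 0.1], [0.5, 0.5], [-0.3, 0.4]]

scores = index_scores(query, keys)
selected = topk_indices(scores, k=2)
print(selected)
```

Here scoring and selection are two separate passes over the data. A GPU implementation that fuses steps like these avoids writing intermediate results to memory and launching extra kernels, which is the kind of overhead the announcement says was reduced.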
Moonshot AI has introduced a high-speed mode for Kimi K2.7 Code, its open-source multimodal coding model. The new mode achieves up to 6× faster inference, delivering around 180 tokens per second on coding tasks with median-length inputs and up to 260 tokens per second on shorter-context tasks. The HighSpeed mode is currently rolling out to participants in the Kimi Code Beta Program, Kimi API developers, and Kimi Business users. Access remains limited by capacity, but no invitation is needed: anyone who joins the Beta Program can gain access. The company states it will continue improving the model and expanding access as capacity grows.
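To put the quoted figures in perspective, the sketch below works out what they would imply for a single response. It assumes the 6× speedup and the 180 tokens-per-second figure describe the same workload, which the announcement does not confirm, and the 1,000-token response length is a made-up example.

```python
def implied_baseline_tps(fast_tps, speedup):
    """Throughput of the slower mode implied by a speedup factor."""
    return fast_tps / speedup


def generation_seconds(num_tokens, tps):
    """Time to emit num_tokens at a steady rate of tps tokens per second."""
    return num_tokens / tps


fast_tps = 180          # quoted figure for median-length coding inputs
speedup = 6             # quoted as "up to", so treat this as a best case
response_tokens = 1000  # hypothetical response length

baseline_tps = implied_baseline_tps(fast_tps, speedup)
fast_time = generation_seconds(response_tokens, fast_tps)
slow_time = generation_seconds(response_tokens, baseline_tps)

print(f"implied baseline: {baseline_tps:.0f} tokens/s")
print(f"high-speed mode:  {fast_time:.1f} s for {response_tokens} tokens")
print(f"baseline mode:    {slow_time:.1f} s for {response_tokens} tokens")
```

Because the speedup is stated as an upper bound, the implied baseline of about 30 tokens per second is a lower bound on the standard mode's speed, not a measured value.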
Together AI has optimized its serving of DeepSeek V4 Pro to reach the top of the Artificial Analysis benchmark, ranking #1 for both output speed (tokens per second) and latency. The inference optimizations covered KV cache efficiency, prefix reuse, custom kernel implementation, and endpoint profiling. By that ranking, Together AI offers the fastest DeepSeek V4 Pro API currently available to developers. The company shared a detailed breakdown of its systems work in a linked blog post.
DeepSeek V4 Pro, when deployed on Together Compute's inference platform, has been ranked first in both latency and speed benchmarks. The announcement, originating from a tweet by Vipul Ved and retweeted by Together Compute, positions the model as the current leader in inference performance on the service. No specific metrics or comparative figures were disclosed in the social media post.
MiniMax-M3, an open-weight native multimodal model from MiniMax, is now available on Together AI, MiniMax’s preferred cloud partner. The model features a 1 million token context window, uses MiniMax Sparse Attention for efficiency, and supports both thinking and non-thinking inference modes. Together AI has optimized inference for MiniMax-M3, achieving up to 125% higher throughput across various concurrency levels.
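A "125% higher" figure is easy to misread as 1.25×. The short sketch below converts a percentage increase into a multiplier; the baseline of 1,000 tokens per second is a hypothetical number chosen only to make the arithmetic concrete, since the announcement gives no absolute throughput.

```python
def throughput_multiplier(percent_increase):
    """Convert 'X% higher' into a multiplier on the baseline."""
    return 1 + percent_increase / 100


def new_throughput(baseline, percent_increase):
    """Throughput after applying a percentage increase to a baseline."""
    return baseline * throughput_multiplier(percent_increase)


multiplier = throughput_multiplier(125)
example = new_throughput(1000, 125)  # hypothetical baseline of 1,000 tokens/s

print(f"125% higher = {multiplier:.2f}x the baseline")
print(f"1000 tokens/s -> {example:.0f} tokens/s")
```

So "up to 125% higher throughput" means up to 2.25 times the unoptimized rate at the best-performing concurrency level.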
Together AI announced custom inference kernels optimized for NVIDIA's Blackwell Tensor Core instructions, achieving 31% more tokens per second (TPS) than the next-fastest open-source engine on the same Blackwell hardware. The performance was measured on coding agent benchmarks, with the hardware picture provided by Artificial Analysis' AgentPerf. Cursor, the AI code editor, is using this inference stack to power its real-time coding agents in production.