Thinkgap feed

AI signal, minus the noise.

Curated items are read from the processed items table and served as a bilingual feed.

51 items

X · Jun 15, 2026

Together AI Details Optimizations for GLM 5.1 Inference: Indexer Kernel Rewrite and Overhead Eliminations

Together AI shared the three main optimizations it applied to accelerate GLM 5.1 inference. It rewrote the indexer top-k kernel and fused the indexer kernel to reduce memory and launch overhead, and it eliminated CPU overhead that was bottlenecking prefill throughput. The indexer changes yielded the largest performance gain. GLM 5.1 is now available on the Together AI platform.

X · Jun 15, 2026

Kimi Releases K2.7 Code HighSpeed Mode with Up to 6x Faster Inference

Moonshot AI has introduced a high-speed mode for its open-source multimodal coding model Kimi K2.7 Code. The new mode achieves up to 6× faster inference, delivering around 180 tokens per second on coding tasks with median-length inputs and up to 260 tokens per second on shorter-context tasks. The HighSpeed mode is currently rolling out to participants in the Kimi Code Beta Program, Kimi API developers, and Kimi Business users, though access remains limited due to capacity constraints. No invitation is needed; anyone joining the Beta Program can gain access. The company states it will continue improving the model and expanding access as capacity grows.
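To put those throughput figures in perspective, here is a rough sketch of how tokens per second translates into wall-clock generation time. The 2,000-token response length is an illustrative assumption, not a figure from the announcement, and the calculation ignores time to first token and network latency.

```python
def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to stream num_tokens at a steady decode rate (ignores time to first token)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Hypothetical response length, chosen for illustration only.
RESPONSE_TOKENS = 2000

# The two rates are the "around 180" and "up to 260" tokens per second quoted above.
for label, tps in [("median-length inputs", 180.0), ("shorter-context tasks", 260.0)]:
    print(f"{label}: {generation_seconds(RESPONSE_TOKENS, tps):.1f} s")
```

Under these assumptions, a 2,000-token response would stream in roughly 11 seconds at 180 tokens per second and roughly 8 seconds at 260 tokens per second.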

X · Jun 14, 2026

DeepSeek V4 Pro on Together AI Ranks #1 on Artificial Analysis for Both Output Speed and Latency

Together AI has optimized its serving of DeepSeek V4 Pro and now ranks #1 on the Artificial Analysis benchmark for both output speed (tokens per second) and latency. The inference optimizations covered KV cache efficiency, prefix reuse, custom kernel implementation, and endpoint profiling. According to the company, this gives developers the fastest DeepSeek V4 Pro API currently available. The company shared a detailed breakdown of its systems work in a linked blog post.

X · Jun 14, 2026

DeepSeek V4 Pro Achieves Number One Latency and Speed on Together Compute

DeepSeek V4 Pro, when deployed on Together Compute's inference platform, has been ranked first in both latency and speed benchmarks. The announcement, originating from a tweet by Vipul Ved and retweeted by Together Compute, positions the model as the current leader in inference performance on the service. No specific metrics or comparative figures were disclosed in the social media post.

X · Jun 13, 2026

MiniMax-M3 Open-Weight Multimodal Model with 1M Context Debuts on Together AI

MiniMax-M3, an open-weight native multimodal model from MiniMax, is now available on Together AI, MiniMax's preferred cloud partner. The model features a 1 million token context window, uses MiniMax Sparse Attention for efficiency, and supports both thinking and non-thinking inference modes. Together AI says its inference optimizations for MiniMax-M3 deliver up to 125% higher throughput across various concurrency levels.
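A "percent higher" figure is easy to misread as a plain multiplier, so a minimal sketch of the conversion may help. The baseline of 1,000 tokens per second is a made-up number used only to show the arithmetic; the announcement does not state a baseline.

```python
def speedup_multiplier(percent_higher: float) -> float:
    """Convert an 'N% higher' claim into a multiplier over the baseline."""
    return 1.0 + percent_higher / 100.0


# Hypothetical baseline throughput, for illustration only.
BASELINE_TPS = 1000.0

multiplier = speedup_multiplier(125.0)
print(f"125% higher = {multiplier:.2f}x the baseline")
print(f"{BASELINE_TPS:.0f} tok/s -> {BASELINE_TPS * multiplier:.0f} tok/s")
```

In other words, "up to 125% higher throughput" means up to 2.25 times the baseline, not 1.25 times.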

X · Jun 12, 2026

Together AI's Custom Kernels Deliver 31% Higher TPS on NVIDIA Blackwell GPUs for Production Coding Agents

Together AI announced custom inference kernels optimized for NVIDIA's Blackwell Tensor Core instructions, achieving 31% more tokens per second (TPS) than the next-fastest open-source engine on the same Blackwell hardware. The performance was measured on coding agent benchmarks, with the hardware comparison drawn from Artificial Analysis' AgentPerf. Cursor, the AI code editor, uses this inference stack to power its real-time coding agents in production.