Thinkgap feed

AI signal, minus the noise.

Curated items are read from the processed items table and served as a bilingual feed.

2 items

TELEGRAM | HUGGINGFACEPAPERS | Jun 13, 2026 | Highlight

MiniMax Sparse Attention

MiniMax Sparse Attention (MSA) is a method for efficiently processing ultra-long contexts (hundreds of thousands to millions of tokens) in large language models. It combines blockwise sparsity with an optimized GPU execution path to speed up both training and inference while maintaining model performance. The method builds on Grouped Query Attention (GQA) and adds two components: a lightweight Index Branch that retrieves a sparse set of tokens for each query group, and a Main Branch that computes exact block-sparse attention over the retrieved blocks. MSA is co-designed with GPU kernels for cross-GPU scalability and has been deployed in a production-grade multimodal model, where it reduces per-token attention compute. Its inference kernel and model are openly available online.
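The two-branch idea can be illustrated with a minimal single-query sketch. This is not the MSA kernel: the block scoring rule (dot product with each block's mean key), the function names, and the `block_size` and `top_k` parameters are all assumptions made for illustration, and the per-group sharing that GQA implies is left out.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_blocks(query, keys, block_size, top_k):
    """Hypothetical index branch: score each key block by its mean key
    and keep the top_k highest-scoring block indices."""
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scored.append((dot(query, mean_key), start // block_size))
    chosen = sorted(scored, reverse=True)[:top_k]
    return sorted(idx for _, idx in chosen)

def block_sparse_attention(query, keys, values, block_size, top_k):
    """Main branch: exact softmax attention restricted to the selected blocks."""
    blocks = select_blocks(query, keys, block_size, top_k)
    token_ids = [
        t
        for b in blocks
        for t in range(b * block_size, min((b + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[t]) * scale for t in token_ids])
    dim = len(values[0])
    out = [sum(w * values[t][d] for w, t in zip(weights, token_ids)) for d in range(dim)]
    return out, blocks

# Toy input: 8 tokens split into 4 blocks of 2 tokens each.
keys = [[0, 1], [0, 1], [5, 0], [4, 0], [-1, 0], [-1, 0], [3, 0], [3, 0]]
values = [[float(i), 1.0] for i in range(8)]
out, blocks = block_sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=2)
print(blocks, out)
```

In this sketch the attention cost scales with `top_k * block_size` tokens rather than the full context length, which is the kind of per-token saving the summary describes. Setting `top_k` to the total number of blocks recovers ordinary dense attention.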

TELEGRAM | HUGGINGFACEPAPERS | Jun 4, 2026 | Highlight

KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks

KVarN is a calibration-free KV-cache quantizer that mitigates error accumulation in autoregressive decoding of large language models. It applies Hadamard rotation and dual-scaling variance normalization to K and V matrices to correct token-scale errors, significantly reducing accumulation compared to existing methods. Evaluated on Qwen2.5-Coder-32B-Instruct, KVarN achieves improved results on generative benchmarks including MATH500, AIME24, and HumanEval at 2-bit precision. The implementation for vLLM is open-sourced on GitHub.
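To make the rotate-then-normalize idea concrete, here is a simplified sketch of quantizing one K or V token vector. It is not the KVarN implementation: it uses a single per-token mean and standard deviation rather than the dual-scaling scheme named above, and the function names, the uniform 2-bit grid, and the `CLIP` range are assumptions made for illustration.

```python
import math

CLIP = 1.5  # hypothetical clipping range, in standard deviations

def hadamard_rotate(vec):
    """Orthonormal fast Walsh-Hadamard transform.
    len(vec) must be a power of two; applying it twice returns the input."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [x / norm for x in out]

def quantize_token(vec, bits=2):
    """Rotate, normalize to zero mean and unit variance, then round to a uniform grid."""
    rotated = hadamard_rotate(vec)
    mean = sum(rotated) / len(rotated)
    std = math.sqrt(sum((x - mean) ** 2 for x in rotated) / len(rotated)) or 1.0
    levels = 2 ** bits
    step = 2 * CLIP / (levels - 1)
    codes = [
        min(levels - 1, max(0, round(((x - mean) / std + CLIP) / step)))
        for x in rotated
    ]
    return codes, mean, std

def dequantize_token(codes, mean, std, bits=2):
    """Invert the grid, the normalization, and the rotation."""
    step = 2 * CLIP / (2 ** bits - 1)
    rotated = [(c * step - CLIP) * std + mean for c in codes]
    return hadamard_rotate(rotated)

vec = [1.0, -2.0, 0.5, 3.0, -1.5, 0.0, 2.5, -0.5]
codes, mean, std = quantize_token(vec)
restored = dequantize_token(codes, mean, std)
print(codes, [round(x, 3) for x in restored])
```

The rotation spreads outlier coordinates across all dimensions, and the normalization puts every token on the same scale before rounding, so a fixed low-bit grid fits tokens of very different magnitudes. Only the integer codes plus two scalars per token would need to be stored in this sketch.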