KVarN: Variance-Normalized KV-Cache Quantization Mitigates Error Accumulation in Reasoning Tasks
English summary
KVarN is a calibration-free KV-cache quantizer that targets error accumulation during autoregressive decoding in large language models. It applies a Hadamard rotation and dual-scaling variance normalization to the K and V matrices to correct token-scale errors, which is reported to significantly reduce error accumulation relative to existing methods. Evaluated on Qwen2.5-Coder-32B-Instruct at 2-bit precision, KVarN shows improved results on the generative benchmarks MATH500, AIME24, and HumanEval. A vLLM implementation is open-sourced on GitHub.
Key points
KVarN is a calibration-free KV-cache quantizer that addresses error accumulation during autoregressive decoding.
It uses Hadamard rotation and dual-scaling variance normalization to correct token-scale errors in K and V matrices.
KVarN significantly reduces error accumulation compared to existing baselines, setting a new standard for KV-cache quantization.
At 2-bit precision, KVarN shows improved results on Qwen2.5-Coder-32B-Instruct across MATH500, AIME24, and HumanEval.
An open-source implementation for vLLM is available on GitHub.
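Illustrative sketch
The summary names two ingredients, a Hadamard rotation and dual-scaling variance normalization, without giving their exact form. The pure-Python sketch below shows one plausible reading and is not taken from the KVarN code: each token vector is rotated with an orthonormal Walsh-Hadamard transform, the rotated matrix is divided by a per-token scale and then a per-channel scale (RMS is used here as a stand-in for the standard deviation), and each row is rounded to 4 levels (2 bits). All function names, the order of the two scalings, and the min/max 2-bit quantizer are assumptions made for illustration.

```python
import math
import random


def hadamard_rotate(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two).

    With the 1/sqrt(n) factor the transform is its own inverse.
    """
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def dual_scale(mat, eps=1e-8):
    """Assumed 'dual scaling': divide by per-row RMS, then by per-column RMS."""
    row_s = [math.sqrt(sum(x * x for x in row) / len(row)) + eps for row in mat]
    by_row = [[x / s for x in row] for row, s in zip(mat, row_s)]
    n_cols = len(mat[0])
    col_s = [
        math.sqrt(sum(row[j] ** 2 for row in by_row) / len(by_row)) + eps
        for j in range(n_cols)
    ]
    normed = [[x / col_s[j] for j, x in enumerate(row)] for row in by_row]
    return normed, row_s, col_s


def quantize_2bit(row):
    """Uniform min/max quantizer with 4 levels; returns integer codes 0..3."""
    lo, hi = min(row), max(row)
    step = (hi - lo) / 3 or 1.0  # fall back to 1.0 for a constant row
    codes = [round((x - lo) / step) for x in row]
    return codes, lo, step


def dequantize(codes, lo, step):
    return [lo + c * step for c in codes]


def roundtrip(mat):
    """Rotate, normalize, quantize to 2 bits, then undo every step."""
    rotated = [hadamard_rotate(row) for row in mat]
    normed, row_s, col_s = dual_scale(rotated)
    recon = []
    for row, rs in zip(normed, row_s):
        codes, lo, step = quantize_2bit(row)
        deq = dequantize(codes, lo, step)
        unscaled = [x * col_s[j] * rs for j, x in enumerate(deq)]
        recon.append(hadamard_rotate(unscaled))  # inverse rotation
    return recon


if __name__ == "__main__":
    random.seed(0)
    # Toy "K" matrix: 4 tokens x 8 channels, with one outlier channel.
    keys = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
    for row in keys:
        row[3] *= 20.0
    recon = roundtrip(keys)
    err = sum((a - b) ** 2 for r, q in zip(keys, recon) for a, b in zip(r, q))
    print("squared reconstruction error:", err)
```

In this toy setup the rotation spreads a single large channel across all coordinates, so a 4-level grid is not dominated by one outlier, and the two scale vectors are stored alongside the codes so that dequantization can undo them. How KVarN itself computes and stores its scales is not stated in the summary.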