MTP draft acceptance rates for quantized Gemma 4-31B: Q5_K_S is highest, IQ4_XS and IQ3_M are nearly tied, and IQ2_M still reaches 84.5% at n=1


Summary

A community experiment measured MTP speculative-decoding acceptance rates for the Gemma 4-31B-it trunk quantized to Q5_K_S, IQ4_XS, IQ3_M, and IQ2_M, each paired with the model's MTP drafter. Single-token draft acceptance (n=1) fell from 88.5% with Q5_K_S to 84.5% with IQ2_M; at n=4, the rates dropped to 66.7% and 61.2%, respectively. IQ4_XS and IQ3_M performed nearly identically at every draft depth. The largest speed gain came from n=2 on CUDA, whereas on Apple Metal only n=1 helped, and only marginally. The IQ2_M trunk needs about 12 GB of memory, which makes speculative decoding feasible on consumer GPUs.
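To make the reported percentages more concrete, the sketch below converts an acceptance rate into an expected number of tokens per trunk verification pass. It rests on an assumption the source does not state: that the acceptance rate is the mean fraction of the n drafted tokens the trunk accepts, and that every pass also yields one token from the trunk itself. The function name `expected_tokens_per_step` is hypothetical and only for illustration.

```python
# Acceptance rates reported in the summary, keyed by quantization and draft depth n.
ACCEPTANCE = {
    "Q5_K_S": {1: 0.885, 4: 0.667},
    "IQ2_M": {1: 0.845, 4: 0.612},
}


def expected_tokens_per_step(acceptance: float, depth: int) -> float:
    """Expected tokens emitted per trunk verification pass.

    Assumption (not from the source): `acceptance` is the mean fraction of
    the `depth` drafted tokens that the trunk accepts, and each pass always
    contributes one additional token produced by the trunk itself.
    """
    return 1.0 + depth * acceptance


for quant, by_depth in ACCEPTANCE.items():
    for depth, rate in by_depth.items():
        tokens = expected_tokens_per_step(rate, depth)
        print(f"{quant} n={depth}: {tokens:.3f} tokens per trunk pass")
```

Tokens per pass is not the same as wall-clock speedup: each extra draft token also costs drafter time, which is consistent with the report that the best speed gain appears at n=2 rather than n=4.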

Key Takeaways

Acceptance rates were measured for the Gemma 4-31B-it trunk paired with its MTP drafter across four quantizations (Q5_K_S, IQ4_XS, IQ3_M, and IQ2_M) at draft depths n=1 to n=4.

Single-token draft acceptance (n=1) ranged from 88.5% (Q5_K_S) down to 84.5% (IQ2_M), so quantization has only a small effect at low draft depth.

At depth n=4, acceptance fell to 66.7% (Q5_K_S) and 61.2% (IQ2_M), a 5.5-point gap versus 4.0 points at n=1, indicating that the degradation is larger at higher speculation depths.

IQ4_XS and IQ3_M performed nearly identically, making them a good trade-off between model size and draft quality.

Hardware strongly influences the speed gain: n=2 yields the largest boost on CUDA, whereas on Apple Metal only n=1 helps, and only marginally.
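The claim that quantization costs more at higher draft depth can be checked with simple arithmetic on the reported figures. The sketch below computes the absolute gap (in percentage points) and the relative drop between Q5_K_S and IQ2_M at n=1 and n=4. The helper name `gap` is hypothetical; the input numbers are the ones reported above.

```python
def gap(high: float, low: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, relative drop in percent)."""
    absolute = high - low
    relative = 100.0 * absolute / high
    return absolute, relative


# Reported acceptance rates in percent: depth -> (Q5_K_S, IQ2_M).
REPORTED = {1: (88.5, 84.5), 4: (66.7, 61.2)}

for depth, (q5_k_s, iq2_m) in REPORTED.items():
    absolute, relative = gap(q5_k_s, iq2_m)
    print(f"n={depth}: {absolute:.1f} points absolute, {relative:.2f}% relative")
```

On these numbers the gap grows from 4.0 points (about 4.5% relative) at n=1 to 5.5 points (about 8.2% relative) at n=4.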