Quantized Gemma 4-31B MTP draft acceptance rates: Q5_K_S is highest, IQ4_XS and IQ3_M are nearly tied, and IQ2_M still reaches 84.5% at n=1
Summary
A community experiment measured MTP speculative-decoding acceptance rates for the Gemma 4-31B-it trunk quantized to Q5_K_S, IQ4_XS, IQ3_M, and IQ2_M, each paired with the model's MTP drafter. Single-token draft acceptance (n=1) fell from 88.5% (Q5_K_S) to 84.5% (IQ2_M); at n=4 the rates were 66.7% and 61.2%, respectively. IQ4_XS and IQ3_M performed nearly identically at every draft depth. The largest speed gain came from n=2 on CUDA, while Apple Metal saw only a marginal gain from n=1. The IQ2_M trunk needs about 12 GB of memory, which puts speculative decoding within reach of consumer GPUs.
Key takeaways
Acceptance rates were measured for the Gemma 4-31B-it trunk with its MTP drafter across four quantizations (Q5_K_S, IQ4_XS, IQ3_M, IQ2_M) at draft depths n=1 to n=4.
Single-token draft acceptance ranged from 88.5% (Q5_K_S) to 84.5% (IQ2_M), a 4.0-point drop, so quantization has only a small effect at low depth.
At depth n=4, acceptance fell to 66.7% (Q5_K_S) and 61.2% (IQ2_M), a 5.5-point gap, so degradation grows with speculation depth.
IQ4_XS and IQ3_M performed nearly identically, making them a good trade-off between model size and draft quality.
Hardware strongly affects the speed gain: n=2 gives the largest boost on CUDA, while Apple Metal sees only a marginal improvement from n=1.
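To put the acceptance figures in throughput terms, the following is a rough sketch, not the experiment's own method. It assumes the reported rate is the fraction of drafted tokens that the trunk accepts, and that each verification pass also emits one token from the trunk itself; under those assumptions the expected number of tokens per trunk pass is acceptance × n + 1. The function name `tokens_per_trunk_pass` and the `REPORTED` table are illustrative, and the table holds only the four figures quoted above.

```python
# Acceptance rates quoted in the summary, keyed by (quantization, draft depth n).
# Assumption: each value is accepted draft tokens / drafted tokens.
REPORTED = {
    ("Q5_K_S", 1): 0.885,
    ("Q5_K_S", 4): 0.667,
    ("IQ2_M", 1): 0.845,
    ("IQ2_M", 4): 0.612,
}


def tokens_per_trunk_pass(acceptance: float, depth: int) -> float:
    """Expected tokens produced per trunk verification pass.

    Assumed model: on average acceptance * depth draft tokens are accepted,
    plus one token that the trunk always contributes itself.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be between 0 and 1")
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return acceptance * depth + 1.0


if __name__ == "__main__":
    for (quant, depth), rate in REPORTED.items():
        estimate = tokens_per_trunk_pass(rate, depth)
        print(f"{quant} n={depth}: {estimate:.3f} tokens per trunk pass")
```

This estimate ignores the cost of running the drafter and of verifying longer drafts, so it is an upper bound on speedup rather than a prediction. That is consistent with the summary's observation that the largest measured gain on CUDA came at n=2 rather than at the deepest setting.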