User Benchmark Finds Gemma 4 26B IT QAT 8-bit Underperforms Regular 6-bit Quant on HumanEval
English summary
A Reddit user ran MMLU_PRO and HumanEval benchmarks on Gemma 4 26B IT using MLX on a Mac M5 Pro to compare regular 4-bit, regular 6-bit, and QAT 8-bit quantizations. The QAT 8-bit model scored 90% on HumanEval, against 98% for the regular 6-bit model, a statistically significant gap; the MMLU_PRO differences were not statistically significant. Because 8-bit quantization normally costs very little accuracy, the results suggest that the QAT checkpoint itself is weaker than the original model before any quantization is applied, and that replacing higher-bit regular quants (such as 5-bit or 6-bit) with QAT versions may not be justified. The samples were small (50 MMLU_PRO questions and 100 HumanEval questions), and the results may not generalize to other Gemma 4 sizes such as 31B, 12B, or E2/4B.
Chinese summary
一名Reddit用户在一台Mac M5 Pro上使用MLX对Gemma 4 26B IT进行了MMLU_PRO和HumanEval基准测试,对比了常规4-bit、6-bit以及QAT 8-bit量化模型的性能。QAT 8-bit模型在HumanEval上的得分为90%,显著低于常规6-bit模型的98%,而MMLU_PRO上的差异无统计学意义。结果表明,未量化的QAT模型性能不及未量化的原始模型,因此用QAT量化版本来替代5-bit、6-bit等更高位宽的常规量化可能并不合理。该测试样本量有限(50和100道题目),结论未必适用于31B、12B或E2/4B等其他Gemma 4型号。
Key points
The test compared regular 4-bit, regular 6-bit, and QAT 8-bit quantizations of Gemma 4 26B IT, run via MLX on a Mac M5 Pro.
在Mac M5 Pro上通过MLX对比了Gemma 4 26B IT的常规4-bit、6-bit和QAT 8-bit量化。
On HumanEval (100 questions), regular 6-bit scored 98% while QAT 8-bit scored only 90%, a statistically significant gap.
在HumanEval(100题)上,常规6-bit得分为98%,而QAT 8-bit仅得90%,差异具有统计学显著性。
MMLU_PRO (50 questions) showed no significant differences among the three variants.
在MMLU_PRO(50题)上,三种量化变体之间的差异均不显著。
The findings suggest that the unquantized QAT model is worse than the unquantized original model, which is at odds with claims that QAT is indistinguishable from BF16.
结果表明完整QAT模型很可能不如完整原始模型,这与“QAT与BF16难以区分”的说法相矛盾。
The results may not generalize to other Gemma 4 sizes, and sample sizes are limited; further testing is needed.
该结论可能不适用于其他Gemma 4尺寸,且样本量有限,需进一步测试。
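The significance and sample-size points above can be checked with a short calculation. The sketch below is an illustration, not the original poster's method, which the summary does not describe. It assumes the two HumanEval scores correspond to 98 and 90 passes out of 100 and treats the two runs as independent samples. It applies a two-sided Fisher exact test and computes 95% Wilson score intervals; the function names are hypothetical. If both models answered the same 100 questions, a paired test such as McNemar's would be more appropriate, but that needs per-question results that are not given here.

```python
from math import comb, sqrt


def fisher_exact_two_sided(pass_a, n_a, pass_b, n_b):
    """Two-sided Fisher exact p-value for two independent pass counts.

    Assumption: the two runs are independent samples (not a paired design).
    """
    total_pass = pass_a + pass_b

    def weight(k):
        # Unnormalized hypergeometric weight of k passes in group A,
        # holding the overall number of passes fixed.
        return comb(n_a, k) * comb(n_b, total_pass - k)

    k_min = max(0, total_pass - n_b)
    k_max = min(n_a, total_pass)
    observed = weight(pass_a)
    # Sum every outcome that is at least as unlikely as the observed one.
    # Integer arithmetic keeps the comparison exact.
    extreme = sum(
        weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed
    )
    return extreme / comb(n_a + n_b, total_pass)


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Pass counts inferred from the reported percentages (98% and 90% of 100).
p_value = fisher_exact_two_sided(98, 100, 90, 100)
print(f"Fisher exact two-sided p-value: {p_value:.4f}")
print("regular 6-bit 95% interval:", wilson_interval(98, 100))
print("QAT 8-bit 95% interval:", wilson_interval(90, 100))
```

Under these assumptions the p-value falls below the conventional 0.05 threshold, while the two intervals are wide, which is why the small samples call for further testing.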