Qwen3.6-35B-A3B tool-calling benchmark: ByteShape vs. Unsloth GGUFs, KV cache quantization, and long-context performance
Summary
A Reddit user benchmarked ByteShape and Unsloth quantizations of the Qwen3.6-35B-A3B model on tool-calling tasks. The tests covered three KV cache quantizations (f16, q8_0, q4_0) and two context lengths: a short context of about 5k tokens and a long context padded with about 122k filler tokens, which fills roughly 50% of the context window. Neither ByteShape nor Unsloth came out as a clear overall winner. The q8_0 KV cache was virtually indistinguishable from f16, so its memory savings come at essentially no cost in score, while q4_0 degraded scores slightly. The long context significantly reduced tool-calling performance in every configuration. The best-performing quant was ByteShape GPU-5 (IQ4_XS style), which was also comparatively resilient under long-context pressure.
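To put the three KV cache settings in perspective, the sketch below compares their storage cost per cached element. It assumes the names q8_0 and q4_0 refer to the ggml block formats used by llama.cpp (blocks of 32 values plus one fp16 scale); the post summary does not state the runtime, so treat that as an assumption. It deliberately computes only ratios, since absolute KV cache size depends on model dimensions not given here.

```python
# Assumption: q8_0 / q4_0 are the ggml block formats (32 values per block,
# each block carrying one 2-byte fp16 scale). f16 stores 2 bytes per value.
BLOCK_ELEMS = 32
BLOCK_BYTES = {
    "f16": 2 * BLOCK_ELEMS,        # 32 half-precision values, no scale
    "q8_0": 2 + BLOCK_ELEMS,       # fp16 scale + 32 signed 8-bit values
    "q4_0": 2 + BLOCK_ELEMS // 2,  # fp16 scale + 32 packed 4-bit values
}


def bits_per_element(cache_type: str) -> float:
    """Average storage cost of one cached value, in bits."""
    return BLOCK_BYTES[cache_type] * 8 / BLOCK_ELEMS


def relative_size(cache_type: str) -> float:
    """KV cache size as a fraction of the f16 baseline."""
    return BLOCK_BYTES[cache_type] / BLOCK_BYTES["f16"]


if __name__ == "__main__":
    for name in BLOCK_BYTES:
        print(f"{name}: {bits_per_element(name)} bits/element, "
              f"{relative_size(name):.1%} of f16")
```

Under these assumptions q8_0 needs a little over half the memory of f16 and q4_0 a little over a quarter, which is why a q8_0 cache that scores the same as f16 is attractive at long context lengths.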
Key takeaways
Eight GGUFs from ByteShape and Unsloth were tested with tool-eval-bench on 84 tool-calling tasks.
KV cache quantization: f16 and q8_0 are nearly tied, while q4_0 trails by about 1 point on average.
Long context (about 122k filler tokens) lowered the total tool-calling score by about 10 points on average.
ByteShape GPU-5 (IQ4_XS, 18.0 GB) achieved the highest average score and was more resilient to long context.
Model size correlated only weakly with performance; the smallest quants sometimes beat larger ones, a difference attributed to noise.
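The averages quoted above come from a grid of configurations. A minimal sketch of how such a grid could be enumerated and aggregated follows. It assumes a full cross product of quant, KV cache type, and context length, which the summary implies but does not state; the quant names are hypothetical placeholders, `filler_tokens = 0` for the short context is an assumption, and no scores from the post are reproduced.

```python
from collections import defaultdict
from itertools import product
from statistics import mean

# Hypothetical placeholders: the summary names only one of the eight GGUFs.
QUANTS = [f"quant-{i}" for i in range(1, 9)]
KV_TYPES = ["f16", "q8_0", "q4_0"]
# Filler tokens added per context setting (0 for short is an assumption).
CONTEXTS = {"short": 0, "long": 122_000}
TASKS_PER_RUN = 84


def build_matrix():
    """Enumerate every (quant, KV cache type, context) combination."""
    return [
        {"quant": q, "kv_type": kv, "context": ctx,
         "filler_tokens": CONTEXTS[ctx]}
        for q, kv, ctx in product(QUANTS, KV_TYPES, CONTEXTS)
    ]


def mean_by(scores, field):
    """Average scores grouped by one field of the run description.

    `scores` is a list of (run, score) pairs, where `run` is a dict
    produced by build_matrix().
    """
    groups = defaultdict(list)
    for run, score in scores:
        groups[run[field]].append(score)
    return {key: mean(values) for key, values in groups.items()}


if __name__ == "__main__":
    runs = build_matrix()
    print(len(runs), "configurations,",
          len(runs) * TASKS_PER_RUN, "task executions if each runs once")
```

With real scores attached, `mean_by(scores, "kv_type")` would give the per-cache-type averages and `mean_by(scores, "context")` the short versus long comparison. Grouping by one field averages over all the others, which is also why small differences between individual quants can be dominated by noise.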