Social source: Reddit LocalLLaMA | June 9, 2026 | Importance: 3/5

Qwen3.6-35B-A3B Tool-Calling Benchmark: ByteShape vs. Unsloth GGUFs, KV Cache Quantization, and Long-Context Performance

Summary

A Reddit user benchmarked ByteShape and Unsloth GGUF quantizations of Qwen3.6-35B-A3B on tool-calling tasks. The tests covered three KV cache quantizations (f16, q8_0, q4_0) and two context lengths: short (~5k tokens) and long (~122k filler tokens added, filling about 50% of the context window). Neither ByteShape nor Unsloth emerged as a clear overall winner. The q8_0 KV cache was virtually indistinguishable from f16, making it effectively a free lunch, while q4_0 degraded scores slightly. The long-context setting significantly reduced tool-calling performance across all configurations. The best-performing quant was ByteShape GPU-5 (IQ4_XS style), which held up comparatively well under long-context pressure.
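
The test matrix described above can be sketched as a simple grid. This is only an illustration, assuming every GGUF was run with every KV cache type at both context lengths (the summary does not confirm a full cross product) and assuming llama.cpp was the runtime (also not stated). The file names are placeholders, since the eight GGUF names are not listed here; `--cache-type-k` and `--cache-type-v` are llama.cpp options for choosing the K and V cache types.

```python
from itertools import product

# Axes of the benchmark as described in the summary.
KV_CACHE_TYPES = ["f16", "q8_0", "q4_0"]
CONTEXT_MODES = ["short", "long"]  # ~5k tokens vs. ~122k filler tokens added


def build_grid(gguf_files):
    """Return one run config per (GGUF, KV cache type, context mode)."""
    return [
        {"gguf": gguf, "kv_cache": kv, "context": ctx}
        for gguf, kv, ctx in product(gguf_files, KV_CACHE_TYPES, CONTEXT_MODES)
    ]


def kv_cache_flags(kv_type):
    """llama.cpp flags that set the K and V cache types (assumed runtime)."""
    return ["--cache-type-k", kv_type, "--cache-type-v", kv_type]


# Placeholder names: the real eight GGUF files are not named in the summary.
ggufs = [f"quant-{i}.gguf" for i in range(8)]
grid = build_grid(ggufs)
```

Under these assumptions, eight files, three cache types, and two context modes give 48 runs, each scored on the benchmark's tool-calling tasks.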

Key takeaways

Eight GGUFs from ByteShape and Unsloth were tested with tool-eval-bench on 84 tool-calling tasks.
KV cache quantization: f16 and q8_0 are nearly tied; q4_0 lags by about 1 point on average.
Long context (~122k filler tokens) lowered the total tool-calling score by about 10 points on average.
ByteShape GPU-5 (IQ4_XS, 18.0 GB) achieved the highest average score and was more resilient to long context.
Model size correlated only weakly with performance; the smallest quants sometimes beat larger ones, which is attributed to noise.
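
The averages quoted in the takeaways (the gap between KV cache types, the long-context drop) are the kind of figure a small grouping helper produces. The sketch below is hypothetical: the record layout (`kv_cache`, `context`, `score` keys) is invented for illustration and is not the output format of tool-eval-bench, and no scores from the post are reproduced.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by(records, key):
    """Average the 'score' field of run records, grouped by record[key]."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["score"])
    return {group: mean(scores) for group, scores in groups.items()}


def long_context_drop(records):
    """Mean short-context score minus mean long-context score."""
    by_context = mean_score_by(records, "context")
    return by_context["short"] - by_context["long"]
```

Calling `mean_score_by(records, "kv_cache")` would give one average per cache type, and `long_context_drop(records)` the average penalty of the long-context setting, given one record per run.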
