CPU Inference Performance: Qwen3.6 35B A3B Q4_K_M on Intel Core Ultra 7 165H (AVX2, 64GB RAM) Yields 10 tps in Non-Thinking Mode
English summary
A Reddit user with an Intel Core Ultra 7 165H (AVX2, no AVX512) and 64GB RAM ran Qwen3.6 35B A3B at Q4_K_M on the CPU using standard llama.cpp. They measured approximately 10 tokens per second (tps) in non-thinking mode, which they considered usable, but found performance in thinking mode unusable. The user is asking for other models, quantizations, or llama.cpp versions that would better suit a machine with plenty of RAM but limited compute and memory bandwidth, particularly for large MoE models.
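One plausible reading of why the same machine feels usable in non-thinking mode but not in thinking mode is simple arithmetic: at a fixed decode rate, every reasoning token adds to the wait before the answer is complete. The sketch below uses the ~10 tps figure from the post; the token counts are hypothetical values chosen only for illustration, and it assumes the decode rate is the same in both modes and ignores prompt processing time.

```python
def response_latency_s(answer_tokens, thinking_tokens, tps):
    """Seconds to generate a full response at a fixed decode rate."""
    return (answer_tokens + thinking_tokens) / tps


TPS = 10.0  # decode rate reported in the post (non-thinking mode)

# Hypothetical token counts, not taken from the post.
non_thinking = response_latency_s(answer_tokens=300, thinking_tokens=0, tps=TPS)
thinking = response_latency_s(answer_tokens=300, thinking_tokens=3000, tps=TPS)

print(f"non-thinking: {non_thinking:.0f} s")
print(f"thinking:     {thinking:.0f} s ({thinking / 60:.1f} min)")
```

Under these assumed counts, the reasoning trace dominates the total wait, even though the per-token speed is unchanged.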
Key points
Hardware: Intel Core Ultra 7 165H with AVX2 (no AVX512) and 64GB RAM.
Model tested: Qwen3.6 35B A3B (Mixture-of-Experts) with Q4_K_M quantization.
Inference engine: standard llama.cpp, giving ~10 tps in non-thinking mode; thinking mode was not usable.
Request: better models, quantizations, or llama.cpp forks for CPU inference on high-RAM, low-bandwidth systems.
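The trade-off in the key points (plenty of RAM, limited bandwidth) can be made concrete with a back-of-the-envelope estimate. This is a rough sketch, not a benchmark. It assumes "35B A3B" means about 35 billion total and about 3 billion active parameters per token, that Q4_K_M averages roughly 4.8 bits per weight, and it uses a hypothetical sustained memory bandwidth that was not measured on the poster's machine. It also ignores the KV cache, activations, and compute limits.

```python
GIB = 1024**3


def weight_bytes(params, bits_per_weight):
    """Approximate size in bytes of `params` weights at a given average bit width."""
    return params * bits_per_weight / 8


def decode_tps_ceiling(active_params, bits_per_weight, bandwidth_bytes_per_s):
    """Bandwidth-only upper bound: each token reads every active weight once."""
    return bandwidth_bytes_per_s / weight_bytes(active_params, bits_per_weight)


TOTAL_PARAMS = 35e9     # assumption: "35B" total parameters
ACTIVE_PARAMS = 3e9     # assumption: "A3B" = ~3B active parameters per token
BITS_PER_WEIGHT = 4.8   # assumption: approximate average for Q4_K_M
BANDWIDTH = 60e9        # hypothetical sustained bandwidth in bytes/s, not measured

footprint_gib = weight_bytes(TOTAL_PARAMS, BITS_PER_WEIGHT) / GIB
ceiling = decode_tps_ceiling(ACTIVE_PARAMS, BITS_PER_WEIGHT, BANDWIDTH)

print(f"weights in RAM: ~{footprint_gib:.1f} GiB")
print(f"bandwidth-only decode ceiling: ~{ceiling:.0f} tps")
```

The point of the sketch is the shape of the relationship: RAM capacity bounds the total parameter count, while the decode ceiling depends only on the active parameters read per token. That is why MoE models with a small active set are the natural candidates for this kind of machine.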