DeepSeek v4 Pro Tops Coding Benchmarks, but CAISI Evaluation Places It 8 Months Behind Frontier Models
Summary
DeepSeek v4 Pro posts top coding scores: 80.6% on SWE-bench Verified and 93.5% on LiveCodeBench. However, CAISI's multi-domain evaluation places it roughly 8 months behind the US frontier, whereas DeepSeek itself claims a gap of 2 months. The discrepancy is attributed to scope: coding benchmarks measure a narrow slice of capability, while CAISI's tests also cover broader domains such as cybersecurity and abstract reasoning. The frontier has also kept moving, with closed models such as Fable 5 recently released. For local users, quantized versions of the model may behave differently from the full 1.6T-parameter Pro configuration in real-world agent use, particularly when executing tool calls.
Key Takeaways
DeepSeek v4 Pro scores 80.6% on SWE-bench Verified and 93.5% on LiveCodeBench, placing it among the top performers.
CAISI's broader evaluation finds the model roughly 8 months behind the US frontier, while DeepSeek claims it is 2 months behind.
The gap arises because coding leaderboards capture only a narrow slice of performance and do not reflect shortfalls in agentic and reasoning capability.
Closed-source frontier models such as Fable 5 have advanced further, leaving open-weight models such as DeepSeek v4 trailing in broader capability.
Locally run quantized versions of DeepSeek v4 may underperform the Pro configuration, especially in agentic tool-use tasks.