Social source: Reddit MachineLearning | June 11, 2026 | Importance: 3/5

A small LLM verifiability-routing experiment based on Karpathy's framework: 120 tasks, 3 models compared

Summary

A practitioner ran an informal 120-task experiment comparing Claude Sonnet 4.6, GPT 5.5, and the open-source Mistral 3 8B across four task categories: code unit tests, structured JSON extraction, multi-hop reasoning, and creative summarization. The goal was to test whether high-verifiability tasks can be handled by a weaker model paired with a verifier. On code and structured extraction, Mistral 3 8B passed 87% and 89% of tasks respectively, rising to 95% and 96% with one retry, close to Sonnet 4.6's 94% and 97%. On low-verifiability tasks the capability gap persisted: Mistral 3 scored only 51% on multi-hop reasoning (vs. 71–78%) and 3.1/5 on creative summarization (vs. 3.9–4.2). The experiment also showed that verifier quality is crucial: an ambiguous JSON schema initially caused parsing errors for Claude, underscoring that a verifier is only as good as its specification.
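The weaker-model-plus-verifier setup described above can be sketched as a small retry loop. This is a minimal Python sketch under stated assumptions, not the practitioner's actual harness: `generate_with_retry`, `is_valid_json`, and `make_flaky_model` are hypothetical names, and the stub model stands in for a real model call.

```python
import json
from typing import Callable, Optional


def generate_with_retry(
    generate: Callable[[str], str],
    verify: Callable[[str], bool],
    prompt: str,
    max_retries: int = 1,
) -> Optional[str]:
    """Return the first output that passes the verifier, or None if every attempt fails."""
    for _ in range(max_retries + 1):
        candidate = generate(prompt)
        if verify(candidate):
            return candidate
    return None


def is_valid_json(text: str) -> bool:
    """Cheap verifier: accept only output that parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def make_flaky_model() -> Callable[[str], str]:
    """Hypothetical stub: truncated JSON on the first call, valid JSON on the second."""
    outputs = iter(['{"name": "Ada"', '{"name": "Ada"}'])

    def model(prompt: str) -> str:
        return next(outputs)

    return model


# One retry lets the second, valid output through.
with_retry = generate_with_retry(make_flaky_model(), is_valid_json, "Extract the name.")
# With no retry, the single failed attempt yields None.
without_retry = generate_with_retry(
    make_flaky_model(), is_valid_json, "Extract the name.", max_retries=0
)
```

Returning `None` leaves the caller free to escalate to a stronger model. That is one possible design; the source does not say how failures after the retry were handled.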

Key takeaways

For high-verifiability tasks (code unit tests, structured extraction), a weaker model such as Mistral 3 8B with a retry loop can nearly match frontier models.
On low-verifiability tasks (multi-hop reasoning, creative summarization), a significant capability gap remains between the small model and the frontier models.
Verifier quality is critical: an ambiguous JSON schema caused an initial performance drop, showing that schema design directly affects reliability.
The experiment is small (n=120) and not peer-reviewed, but it provides directional evidence for cost-efficient routing by task verifiability.
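The routing idea and the schema point in the takeaways above might look roughly like this in Python. It is a sketch under assumptions: the category labels, the route names, and the helpers `route` and `matches_schema` are hypothetical, and the exact-key check is only one way to make a schema unambiguous.

```python
import json
from typing import Dict

# Hypothetical labels for the two high-verifiability categories in the experiment.
HIGH_VERIFIABILITY = {"code_unit_tests", "json_extraction"}


def route(task_category: str) -> str:
    """Send verifiable tasks to the small model plus verifier, the rest to a frontier model."""
    if task_category in HIGH_VERIFIABILITY:
        return "small_model_with_verifier"
    return "frontier_model"


def matches_schema(output: str, required: Dict[str, type]) -> bool:
    """Strict check: output must be a JSON object with exactly the required keys and types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    if set(data) != set(required):
        return False
    # Note: isinstance(True, int) is True in Python, so a stricter
    # verifier would special-case bool values.
    return all(isinstance(data[key], expected) for key, expected in required.items())


schema = {"name": str, "age": int}
ok = matches_schema('{"name": "Ada", "age": 36}', schema)
missing_key = matches_schema('{"name": "Ada"}', schema)
wrong_type = matches_schema('{"name": "Ada", "age": "36"}', schema)
```

Spelling out required keys and types in one place keeps the verifier and the prompt from drifting apart, which is the kind of ambiguity the post reports running into.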
