A Small LLM Verifiability-Routing Experiment Based on Karpathy's Framework: 120 Tasks, 3 Models Compared
Summary
A practitioner ran an informal 120-task experiment comparing Claude Sonnet 4.6, GPT 5.5, and the open-source Mistral 3 8B across four task categories (code unit tests, structured JSON extraction, multi-hop reasoning, creative summarization) to test whether high-verifiability tasks can be handled by a weaker model paired with a verifier. On code and structured extraction, Mistral 3 8B achieved pass rates of 87% and 89% respectively, rising to 95% and 96% with one retry, close to Sonnet 4.6's 94% and 97%. On low-verifiability tasks the capability gap persisted: Mistral 3 scored only 51% on multi-hop reasoning (vs. 71–78%) and 3.1/5 on creative summarization (vs. 3.9–4.2). The experiment also showed that verifier quality is crucial: an ambiguous JSON schema initially caused parsing errors for Claude, underscoring that a verifier is only as good as its specification.
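The "weaker model plus verifier, with one retry" setup described in the summary can be sketched roughly as follows. This is a minimal illustration, not the practitioner's actual harness: the field names in `REQUIRED_FIELDS`, the functions `verify_extraction` and `solve_with_retry`, and the canned `stub_small_model` are all hypothetical stand-ins, and no real model API is called.

```python
import json
from typing import Callable, Optional, Tuple

# Hypothetical strict spec (an assumption for illustration): every field is
# required and has exactly one allowed type, so the verifier cannot be
# satisfied by an ambiguous answer such as a number encoded as a string.
REQUIRED_FIELDS = {"name": str, "amount": float, "currency": str}


def verify_extraction(raw: str) -> bool:
    """Return True only if raw is a JSON object matching REQUIRED_FIELDS exactly."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != set(REQUIRED_FIELDS):
        return False
    # Exact type match: an integer amount such as 12 is rejected on purpose.
    return all(type(data[key]) is expected for key, expected in REQUIRED_FIELDS.items())


def solve_with_retry(
    generate: Callable[[int], str],
    verify: Callable[[str], bool],
    max_attempts: int = 2,
) -> Tuple[Optional[str], int]:
    """Call generate(attempt) until verify accepts the output or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if verify(candidate):
            return candidate, attempt
    return None, max_attempts


def stub_small_model(attempt: int) -> str:
    """Stand-in for a small model: first answer has a wrong type, second is valid."""
    if attempt == 1:
        return '{"name": "Acme", "amount": "12.50", "currency": "USD"}'
    return '{"name": "Acme", "amount": 12.5, "currency": "USD"}'


if __name__ == "__main__":
    answer, attempts = solve_with_retry(stub_small_model, verify_extraction)
    print(answer, attempts)
```

With `max_attempts=2` the loop corresponds to the "one retry" condition. The strictness of `verify_extraction` is the part that matters: a looser check would accept the first, wrongly typed answer and the retry would never fire.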
Key Takeaways
On high-verifiability tasks (code unit tests, structured extraction), a weaker model such as Mistral 3 8B paired with a retry loop can nearly match frontier models.
On low-verifiability tasks (multi-hop reasoning, creative summarization), a significant capability gap remains between the small model and the frontier models.
Verifier quality is critical: an ambiguous JSON schema caused an initial performance drop, showing that schema design directly affects reliability.
The experiment is small (n=120) and not peer-reviewed, but it provides directional evidence for cost-efficient routing by task verifiability.
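Routing by task verifiability, as the takeaways above suggest, could look something like the sketch below. It is an illustration under stated assumptions rather than the experiment's code: the category keys, the `Route` dataclass, the `route` function, and the model labels `"small-open-model"` and `"frontier-model"` are hypothetical names, and the two-attempt budget simply mirrors the "one retry" condition reported in the summary.

```python
from dataclasses import dataclass

# The four task categories from the experiment, tagged by whether an automatic
# verifier (unit tests, schema check) is available. Key names are assumptions.
HAS_VERIFIER = {
    "code_unit_tests": True,
    "json_extraction": True,
    "multi_hop_reasoning": False,
    "creative_summarization": False,
}


@dataclass(frozen=True)
class Route:
    model: str          # placeholder label, not a real API model identifier
    use_verifier: bool  # whether outputs are checked automatically
    max_attempts: int   # 2 means one retry after a failed verification


def route(category: str) -> Route:
    """Pick a model tier based on whether the task's output can be verified."""
    if category not in HAS_VERIFIER:
        raise ValueError(f"unknown task category: {category}")
    if HAS_VERIFIER[category]:
        # Verifiable output: try the cheaper model and let the verifier gate it.
        return Route(model="small-open-model", use_verifier=True, max_attempts=2)
    # No reliable verifier: send the task straight to the stronger model.
    return Route(model="frontier-model", use_verifier=False, max_attempts=1)


if __name__ == "__main__":
    for name in HAS_VERIFIER:
        print(name, route(name))
```

A fuller version would likely also escalate to the frontier model when the small model exhausts its attempts; that fallback is left out here because the source does not describe one.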