Claw-SWE-Bench: A Coding-Task Benchmark for Evaluating OpenClaw-Style Agent Harnesses
Abstract
The paper presents Claw-SWE-Bench, a benchmark for standardizing the evaluation of OpenClaw-style coding-agent harnesses. It provides 350 GitHub issue-resolution instances spanning multiple programming languages and repositories, plus a Lite version for rapid validation. An adapter protocol decouples agent logic from harness execution, and experiments show that the choice of adapter significantly affects agent performance. The results highlight how much harness design and cost matter for fair comparisons, and the benchmark offers a reproducible, cost-effective reference set for coding-agent evaluation.
Key Points
Introduces Claw-SWE-Bench, a benchmark of 350 GitHub issue-resolution instances, plus a Lite version for fast validation.
Proposes an adapter protocol to standardize the evaluation of OpenClaw-style agent harnesses.
Experiments show that adapter choice significantly influences coding-agent performance, underscoring the role of harness design and cost.
Provides a reproducible and cost-effective reference set for coding-agent evaluation.
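The summary does not spell out the adapter protocol, so the following is only a hypothetical Python sketch of what decoupling agent logic from harness execution could look like. Every name here (`Instance`, `Result`, `Adapter`, `EchoAdapter`, `run_harness`) and every field is an assumption made for illustration, not the paper's actual interface.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Instance:
    """One issue-resolution task (hypothetical fields)."""
    instance_id: str
    repo: str
    issue_text: str


@dataclass(frozen=True)
class Result:
    """What an adapter hands back to the harness (hypothetical fields)."""
    instance_id: str
    patch: str       # unified diff proposed by the agent; empty if none
    cost_usd: float  # spend attributed to this instance


class Adapter(Protocol):
    """Hypothetical boundary between the harness and one agent."""
    name: str

    def solve(self, instance: Instance) -> Result: ...


class EchoAdapter:
    """Toy adapter that returns an empty patch at zero cost."""
    name = "echo"

    def solve(self, instance: Instance) -> Result:
        return Result(instance.instance_id, patch="", cost_usd=0.0)


def run_harness(adapter: Adapter, instances: list[Instance]) -> dict:
    """Run every instance through one adapter and aggregate cost.

    Grading patches against repository tests is out of scope here.
    """
    results = [adapter.solve(inst) for inst in instances]
    return {
        "adapter": adapter.name,
        "n_instances": len(results),
        "n_nonempty_patches": sum(1 for r in results if r.patch),
        "total_cost_usd": sum(r.cost_usd for r in results),
    }


if __name__ == "__main__":
    demo = [Instance("demo-1", "example/repo", "Example issue text")]
    print(run_harness(EchoAdapter(), demo))
```

Because the harness loop depends only on the `Adapter` interface, swapping one adapter for another leaves instance selection and cost accounting unchanged, which is the kind of like-for-like comparison the key points describe.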