PACE: A Proxy for Agentic Capability Evaluation
Abstract
PACE constructs proxy benchmarks from a small, automatically selected subset of non-agentic evaluation instances in order to predict model scores on expensive agentic benchmarks. Its instance selection combines a target-relevance strategy with a globally informative one, and the resulting PACE-Bench draws on 19 non-agentic benchmarks. Evaluated across 14 models and 4 agentic benchmarks (including SWE-Bench and GAIA), PACE-Bench achieves a leave-one-out cross-validation mean absolute error under 4%, a Spearman correlation above 0.80, and a pairwise model-ranking accuracy of around 85%, all at less than 1% of the cost of full agentic evaluation. The selected instances also reveal the distinct skill demands of each agentic benchmark. PACE thus enables practical performance estimation for model development, selection, and routing without the overhead of full agentic evaluation.
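The three reported metrics (leave-one-out MAE, Spearman correlation, and pairwise ranking accuracy) can be illustrated with a minimal sketch. The abstract does not say how proxy scores are mapped to agentic scores, so the one-feature least-squares fit below is an assumption made only for illustration, and all numbers are made-up toy data rather than results from the paper.

```python
from itertools import combinations


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx


def loo_predictions(proxy, target):
    """Predict each model's target score from a fit on all the other models."""
    preds = []
    for i in range(len(proxy)):
        xs = proxy[:i] + proxy[i + 1:]
        ys = target[:i] + target[i + 1:]
        a, b = fit_line(xs, ys)  # assumed predictor, not the paper's
        preds.append(a * proxy[i] + b)
    return preds


def mean_absolute_error(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5


def pairwise_accuracy(pred, true):
    """Fraction of model pairs whose predicted order matches the true order."""
    pairs = [(i, j) for i, j in combinations(range(len(true)), 2)
             if true[i] != true[j]]
    hits = sum((pred[i] - pred[j]) * (true[i] - true[j]) > 0 for i, j in pairs)
    return hits / len(pairs)


# Toy data (hypothetical): proxy accuracy and agentic score for six models.
proxy_scores = [0.42, 0.55, 0.61, 0.70, 0.78, 0.85]
agentic_scores = [12.0, 20.0, 19.0, 31.0, 38.0, 44.0]

predicted = loo_predictions(proxy_scores, agentic_scores)
print("LOO MAE:", mean_absolute_error(predicted, agentic_scores))
print("Spearman:", spearman(predicted, agentic_scores))
print("Pairwise accuracy:", pairwise_accuracy(predicted, agentic_scores))
```

Each model is held out in turn, so every prediction comes from a fit that never saw that model's agentic score, which is what makes the error estimate a leave-one-out one.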
Key Takeaways
PACE constructs a proxy benchmark from a small set of non-agentic instances to predict scores on expensive agentic benchmarks at less than 1% of the cost.
PACE-Bench achieves leave-one-out cross-validation MAE under 4%, Spearman correlation above 0.80, and pairwise ranking accuracy around 85% across 14 models and 4 agentic benchmarks.
The instance selection combines target-relevance (local) and globally informative (global) strategies to curate a compact and predictive subset.
The selected proxy instances reveal which atomic skills each agentic benchmark uniquely demands.
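The combination of a local (target-relevance) and a global (globally informative) selection strategy can be sketched as follows. The summary names the two strategies but does not define their scoring rules, so everything here is an assumption for illustration: relevance is taken to be the absolute Pearson correlation between an instance's per-model outcomes and the target benchmark's scores, global informativeness is taken to be the mean of that quantity over all agentic benchmarks, and the function name `select_instances`, its parameters, and the toy data are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def select_instances(outcomes, targets, target_name, k_local, k_global):
    """Pick k_local target-relevant instances plus k_global globally informative ones.

    outcomes: {instance_id: [0/1 outcome per model]}
    targets:  {agentic_benchmark: [score per model]}
    """
    # Local score (assumed): relevance to the one target benchmark.
    local = {i: abs(pearson(o, targets[target_name]))
             for i, o in outcomes.items()}
    # Global score (assumed): average relevance across all agentic benchmarks.
    global_ = {i: sum(abs(pearson(o, t)) for t in targets.values()) / len(targets)
               for i, o in outcomes.items()}

    top_local = sorted(local, key=local.get, reverse=True)[:k_local]
    chosen = set(top_local)
    # Fill the global quota only with instances not already selected.
    top_global = [i for i in sorted(global_, key=global_.get, reverse=True)
                  if i not in chosen][:k_global]
    return top_local + top_global


# Toy data (hypothetical): four non-agentic instances, five models.
toy_outcomes = {
    "q1": [0, 0, 0, 1, 1],
    "q2": [1, 1, 1, 1, 1],  # solved by every model, so uninformative
    "q3": [0, 0, 1, 0, 1],
    "q4": [1, 0, 1, 0, 1],
}
toy_targets = {
    "bench_a": [10, 20, 30, 40, 50],
    "bench_b": [30, 10, 50, 20, 40],
}

print(select_instances(toy_outcomes, toy_targets, "bench_a", k_local=1, k_global=1))
```

Under these assumed scores, inspecting which instances win the local quota for each target is one way to read off what that agentic benchmark demands, in the spirit of the last takeaway.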