Paper source: HUGGINGFACE | Date: June 29, 2026 | Importance: 4/5

SafePyramid: A Hierarchical Benchmark for In-context Policy Guardrailing

Chinese title: SafePyramid：面向上下文策略护栏的分层基准测试

Abstract

SafePyramid is a new benchmark for evaluating in-context policy guardrailing. It comprises 1,000 multi-turn conversations, 3,000 application-specific policies, and 61,699 distinct natural-language rules across 10 domains. The benchmark structures evaluation into three difficulty levels: L0 (understanding individual rules), L1 (reasoning over rule dependencies), and L2 (adapting to entirely novel policy frameworks). An evaluation of 10 frontier LLMs and 5 policy-configurable guardrails shows that even the best-performing model, GPT-5.5, correctly identifies all violated rules in only 54.0% of L0 cases, 35.3% of L1 cases, and 12.9% of L2 cases. These results show that in-context policy guardrailing remains far from solved, particularly when rule dependencies must be resolved or new policies adopted.

Key points

SafePyramid contains 1,000 multi-turn conversations, 3,000 application-specific policies, and 61,699 distinct rules across 10 domains.
The benchmark defines three hierarchical difficulty levels: L0 (individual rule understanding), L1 (rule dependency reasoning), and L2 (novel policy adaptation).
10 frontier LLMs and 5 policy-configurable guardrails were evaluated; GPT-5.5 correctly identified all violated rules in 54.0% of L0 cases, 35.3% of L1 cases, and 12.9% of L2 cases.
In-context policy guardrailing remains highly challenging: the rate at which all violated rules are identified falls steeply from L0 to L2, as rule dependencies and novel policy frameworks are introduced.
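
The headline numbers above count a case as correct only when all violated rules are identified. A minimal sketch of how such a score could be computed is given below. It assumes strict set equality between predicted and gold rule IDs, so an extra predicted rule also counts as a miss; the summary does not say whether the benchmark penalizes extra predictions, so this is an assumption. The function name, the rule IDs, and the toy data are hypothetical and are not taken from SafePyramid.

```python
from typing import Iterable, Sequence


def exact_match_rate(
    predicted: Sequence[Iterable[str]],
    gold: Sequence[Iterable[str]],
) -> float:
    """Fraction of cases whose predicted rule set equals the gold rule set.

    Assumption: strict set equality. A looser variant would only require
    set(g) <= set(p), i.e. every gold rule is found, extras tolerated.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one case is required")
    hits = sum(1 for p, g in zip(predicted, gold) if set(p) == set(g))
    return hits / len(gold)


# Toy data, invented for illustration only (not from the benchmark).
gold = [{"R1", "R4"}, {"R2"}, set(), {"R3", "R5"}]
pred = [{"R1", "R4"}, {"R2", "R7"}, set(), {"R3"}]

rate = exact_match_rate(pred, gold)
print(f"{rate:.1%}")  # prints "50.0%": cases 1 and 3 match exactly
```

Under this strict reading, one missed or spurious rule fails the whole case, which is one plausible reason such scores fall quickly as the number of interacting rules grows.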
