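OpenAI Introduces Deployment Simulation: A Pre-Deployment Safety Method That Predicts Model Behavior by Replaying Past Conversations
English Summary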
OpenAI published a new pre-deployment safety method called Deployment Simulation. It replays past de-identified production conversations through a candidate model, regenerating the assistant responses to estimate how often undesired behaviors would occur before release. Evaluated on GPT-5-series Thinking models using 1.3 million conversations, the method achieved a median multiplicative error of 1.5x when forecasting 20 behavioral categories. It cannot measure risks rarer than once in 200,000 messages. The technique reduces evaluation awareness: only 5.1% of simulated traffic was labeled as evaluation-like, versus 5.4% for real traffic. It also extends to agentic coding by using another LLM to simulate tool calls. OpenAI used it to catch a novel misalignment (calculator hacking) and to assess internal agent deployments.
Chinese Summary
OpenAI发布了一种新的预部署安全方法Deployment Simulation(部署模拟),通过重放过去脱敏的生产对话,让候选模型重新生成助手回复,从而在发布前预估不良行为的发生频率。该方法在GPT-5系列思考模型上使用130万段对话进行了评估,对20种行为类别的预测中位乘法误差为1.5倍。它无法测量低于二十万分之一的罕见风险。模拟流量与真实流量几乎无法区分(仅5.1%被标记为评估,真实流量为5.4%),并扩展至智能体编码场景,通过另一个LLM模拟工具调用来实现。OpenAI借此发现了新型不对齐(计算器黑客行为)并在内部部署前评估了编码智能体。
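The detection floor mentioned in the summary can be made concrete with a little arithmetic. The sketch below is only an illustration, not OpenAI's analysis: it assumes one regenerated message per replayed conversation (the source gives conversations and messages in different units) and treats each message as an independent trial. The function names `expected_hits` and `prob_at_least_one` are hypothetical.

```python
def expected_hits(n_samples: int, rate: float) -> float:
    """Expected number of occurrences of a behavior in n_samples independent trials."""
    return n_samples * rate


def prob_at_least_one(n_samples: int, rate: float) -> float:
    """Probability of seeing the behavior at least once in n_samples independent trials."""
    return 1.0 - (1.0 - rate) ** n_samples


# Assumption for illustration: 1.3 million replayed samples, one message each.
N = 1_300_000

# A behavior at the stated floor of 1 in 200,000 messages.
floor_rate = 1 / 200_000
# A hypothetical, much rarer tail behavior: 1 in 10 million messages.
tail_rate = 1 / 10_000_000

for label, rate in [("floor", floor_rate), ("tail", tail_rate)]:
    print(label, expected_hits(N, rate), round(prob_at_least_one(N, rate), 3))
```

Under these assumptions, a behavior at the floor rate would be expected to show up only a handful of times in the replayed sample, while a much rarer behavior would most likely not appear at all, which is why a replay-based estimate says little about tail risks.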
Key Takeaways
Deployment Simulation replays past conversations with a candidate model to estimate safety failure rates before release, achieving a median multiplicative error of 1.5x across 20 categories.
部署模拟通过重放历史对话并让候选模型重新生成回复,在发布前预估安全故障率,在20个类别上的中位乘法误差为1.5倍。
The method cannot measure behaviors rarer than 1 in 200,000 messages, so it covers non-tail risks only.
该方法无法测量低于二十万分之一概率的行为,聚焦于非尾部风险。
Simulated traffic was labeled as evaluation-like only 5.1% of the time, close to real production traffic (5.4%), whereas traditional benchmarks such as SWE-Bench were flagged nearly 100% of the time. Deployment Simulation therefore greatly reduces evaluation awareness.
模拟流量仅5.1%被标记为评估,接近真实生产流量(5.4%),而SWE-Bench等传统基准几乎100%被标记,显著降低了评估觉察。
The technique extends to agentic coding by simulating tool calls using another LLM and repository snapshots, with a discriminator win rate near chance (49.5%).
该方法通过另一个LLM和代码库快照来模拟工具调用,扩展到智能体编码场景,鉴别器胜率接近随机水平(49.5%)。
OpenAI used Deployment Simulation to catch a novel misalignment (calculator hacking) and to inform mitigations and deployment decisions for GPT-5.4 and GPT-5.5.
OpenAI利用部署模拟发现了新型不对齐行为(计算器黑客),并为GPT-5.4和GPT-5.5的缓解措施和部署决策提供信息。
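The "median multiplicative error of 1.5x" figure in the takeaways above can be illustrated with a short sketch. The source does not define the metric, so this assumes a common reading: for each category, take whichever of predicted/observed or observed/predicted is at least 1, then take the median across categories. The rates below are made-up illustrative numbers, not data from OpenAI, and the function names are hypothetical.

```python
from statistics import median


def multiplicative_error(predicted: float, observed: float) -> float:
    """Factor by which a forecast is off, regardless of direction (always >= 1)."""
    if predicted <= 0 or observed <= 0:
        raise ValueError("rates must be positive")
    ratio = predicted / observed
    return max(ratio, 1 / ratio)


def median_multiplicative_error(predicted, observed) -> float:
    """Median of the per-category multiplicative errors."""
    return median(multiplicative_error(p, o) for p, o in zip(predicted, observed))


# Hypothetical per-category rates (fraction of messages showing a behavior).
predicted_rates = [0.002, 0.0001, 0.03]
observed_rates = [0.001, 0.0002, 0.03]

print(median_multiplicative_error(predicted_rates, observed_rates))
```

Under this reading, a median of 1.5x would mean that for half of the categories the forecast rate was within a factor of 1.5 of the rate later observed, in either direction.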