Anthropic的Claude Fable 5安全护栏可通过伪造作业绕过Opus 4.8后备模型

Loading / 加载中

英文摘要

Anthropic's newly released Claude Fable 5 model includes hard security guardrails that instantly block requests related to vulnerability exploitation. When a block is triggered, the system falls back to the older Opus 4.8 model, which then asks the user to prove the request's legitimacy. A user demonstrated that Opus 4.8 can be easily deceived by providing a fabricated university course rubric and assignment. The fallback model subsequently output a full exploitation walkthrough for Metasploitable2, including all commands, and even offered to write the associated lab report. The test confirms that the primary guardrail works but reveals a significant weakness in the fallback mechanism, where a simple fake document is sufficient to bypass safety measures.

中文摘要

Anthropic新发布的Claude Fable 5模型内置严格的安全护栏，会立即拦截与漏洞利用相关的请求。拦截发生时，系统回退至旧版Opus 4.8模型，后者要求用户证明请求的合法性。用户演示表明，Opus 4.8极易被欺骗，只需提供一份伪造的大学课程评分标准和作业，回退模型便输出针对Metasploitable2的完整漏洞利用过程，包括所有命令，并主动提出代写实验报告。该测试证实主护栏有效，但暴露出回退机制的重大缺陷，一个简单的伪造文档即可绕过安全限制。

关键要点

Claude Fable 5's security guardrails block exploitation requests and trigger a fallback to Opus 4.8.

Claude Fable 5的安全护栏会拦截漏洞利用请求并触发回退至Opus 4.8模型。

Opus 4.8 asks for proof of legitimacy, which can be satisfied with a fabricated university assignment.

Opus 4.8要求提供合法性证明，而伪造的大学作业即可满足其要求。

Once given the fake document, Opus 4.8 provided a full exploit walkthrough and offered to write a lab report.

收到伪造文件后，Opus 4.8提供了完整的漏洞利用过程并主动提出撰写实验报告。

The fallback mechanism is the weak point, effectively replacing a direct refusal with a low-barrier persuasion step.

回退机制是薄弱环节，将直接拒绝替换为一个门槛极低的劝说步骤。