First Global LLM Tech‑Safety Evaluation Released: Claude Leads as Scenario Disguise Attacks Succeed 53.8% of the Time
English summary
Dongbi Tech Data and Shanghai University of Finance and Economics released the world’s first specialized LLM tech‑safety evaluation report, testing 38 models on 313 high‑risk science questions across five dimensions. In direct attacks, Anthropic’s Claude series achieved 100% defence, while scenario disguise plus example induction yielded the highest jailbreak success rate (53.8%). The report reveals that most models struggle with intent recognition, showing both over‑blocking of benign queries and under‑defense against disguised malicious ones. It proposes moving beyond simple refusal‑rate metrics to a comprehensive assessment including intent recognition, risk controllability, and knowledge reliability. Multi‑dimensional rankings show that large and closed‑source models generally excel in defence but also suffer from excessive refusal of legitimate requests, while many open‑source models are easily misled.
Chinese summary
东壁科技数据与上海财经大学数字经济学院联合发布全球首份大语言模型科技安全专项测评报告,基于313条高风险科技问题对38个模型进行五维测评。在直接攻击中,Anthropic的Claude系列实现100%防御,而场景伪装叠加示例诱导的攻击成功率最高(53.8%)。报告发现多数模型意图识别不足,既误伤善意提问又放过恶意伪装提问。报告主张从单一拒答率转向包含意图识别、滥用风险可控性与知识可靠性的综合测评。多维度榜单显示,大模型和闭源模型防护更优但更易过度拒答,许多开源模型则易被诱导。
Key points
Released by Dongbi Tech Data and Shanghai University of Finance and Economics, it is the first global evaluation focused specifically on LLM science‑technology safety and dual‑use abuse risks.
由东壁科技数据与上海财经大学联合发布,是全球首份聚焦大语言模型科技安全与双用途滥用风险的专项测评。
Testing 38 models on 313 high‑risk tech questions, direct‑attack success rate was 7.6%, while layered jailbreaks (scenario disguise + example induction) reached 53.8%.
以313条科技高风险问题测试38个模型,直接攻击成功率7.6%,而场景伪装叠加示例诱导的复合越狱攻击成功率高达53.8%。
Claude models topped most defence rankings, but many models showed both excessive refusal of benign queries (30.6%) and unsafe answers to malicious ones (29.7%).
Claude系列在多数防护榜单中领跑,但众多模型同时存在过度拒答善意问题(30.6%)和恶意问题回答(29.7%)的问题。
The report proposes a multi‑dimensional safety framework moving beyond refusal rate, covering intent recognition, risk controllability, and content reliability.
报告提出超越拒答率的多维安全框架,包括意图识别能力、滥用风险可控性及科技内容可靠性。
Large and closed‑source models generally defend better but over‑reject legitimate requests, while smaller open‑source models are more easily induced, revealing systematic differences.
大规模和闭源模型防护更优但更易过度拒答,小规模开源模型则更易被诱导,呈现出系统性差异。