Coherent context can silently switch LLMs into different internal states that current safety systems cannot detect
Summary
An independent researcher reports that a coherent target context can shift large language models into latent states in which safety rules are reinterpreted, without triggering output-based filters. Measurements on open models (primarily Gemma-3-12B-IT), using hidden-state geometry, residual-stream trajectories, sparse autoencoder (SAE) readouts, and causal interventions, show the model entering a different internal processing regime before the final output is produced. Current alignment methods such as RLHF and output classifiers check only surface-level outputs and so miss these internal shifts. Code, data, and scripts are released on GitHub and Zenodo.
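To make "hidden-state geometry" and "residual-stream trajectories" concrete, here is a minimal NumPy sketch of one way such a comparison could be set up. It is not the researcher's released code: the array shape `(n_layers, d_model)`, the function names, the threshold, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np


def layerwise_cosine_distance(baseline, target):
    """Cosine distance per layer between two (n_layers, d_model) trajectories."""
    baseline = np.asarray(baseline, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if baseline.shape != target.shape or baseline.ndim != 2:
        raise ValueError("expected two arrays of identical shape (n_layers, d_model)")
    dot = np.sum(baseline * target, axis=1)
    norms = np.linalg.norm(baseline, axis=1) * np.linalg.norm(target, axis=1)
    # Guard against division by zero for all-zero rows.
    return 1.0 - dot / np.maximum(norms, 1e-12)


def first_divergent_layer(distances, threshold=0.1):
    """Index of the first layer whose distance exceeds threshold, or None."""
    above = np.flatnonzero(np.asarray(distances) > threshold)
    return int(above[0]) if above.size else None


# Synthetic stand-in data (hypothetical): one mean hidden state per layer
# for a neutral context and for a "target" context.
rng = np.random.default_rng(0)
n_layers, d_model = 8, 16
baseline = rng.normal(size=(n_layers, d_model))
target = baseline.copy()
target[5:] = -baseline[5:]  # pretend the trajectory departs from layer 5 onward

distances = layerwise_cosine_distance(baseline, target)
print(first_divergent_layer(distances))
```

With real data, the two arrays would come from hidden states recorded under the two contexts. A divergence that shows up in intermediate layers is the kind of signal an output-only filter never sees.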
Key points
A coherent context shifts the LLM's internal regime before any output is produced, so safety rules are silently reinterpreted.
Current safety measures such as RLHF and output classifiers act only on final outputs, so they miss these latent shifts.
Experiments, primarily on Gemma-3-12B-IT, measured hidden states, residual-stream trajectories, and SAE readouts, and applied causal interventions.
Code, .npz hidden-state data, and scripts are publicly available on GitHub and Zenodo.
The finding suggests that RLHF is a surface-level patch rather than a robust safety solution.
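The key points mention causal interventions without saying which ones were used. As a purely illustrative stand-in, the sketch below shows difference-of-means directional ablation, a common interpretability technique and not necessarily the author's method. It estimates a direction separating two groups of hidden states and projects it out. All names, shapes, and data here are hypothetical.

```python
import numpy as np


def difference_of_means_direction(states_a, states_b):
    """Unit vector pointing from the mean of states_a to the mean of states_b.

    Both inputs are assumed to have shape (n_samples, d_model).
    """
    diff = np.mean(states_b, axis=0) - np.mean(states_a, axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0:
        raise ValueError("the two groups have identical means")
    return diff / norm


def ablate_direction(states, direction):
    """Remove the component of each row of states along a unit direction."""
    coeffs = states @ direction  # shape (n_samples,)
    return states - np.outer(coeffs, direction)


# Synthetic stand-in data (hypothetical): "shifted" states differ from
# "neutral" ones mainly along the first coordinate.
rng = np.random.default_rng(0)
d_model = 16
shift = np.zeros(d_model)
shift[0] = 3.0
neutral = rng.normal(size=(32, d_model))
shifted = rng.normal(size=(32, d_model)) + shift

direction = difference_of_means_direction(neutral, shifted)
ablated = ablate_direction(shifted, direction)

# After ablation, no row has a measurable component along the direction.
print(np.abs(ablated @ direction).max())
```

In an actual experiment the ablated states would be written back into the model's forward pass, to test whether removing the direction changes behavior. This sketch covers only the linear algebra.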