返回精选
AI 精选动态 智能评分 60

LLM无法可靠自我报告对抗性前缀

来源: twitter关注列表
作者: Rohan Paul (@rohanpaul_ai)
发布于: 2026-06-24
收录于: 2026-06-24
AI 推荐理由
该研究揭示了LLM自我报告安全检查的不可靠性,值得阅读原文以了解具体测试方法和模型表现。
核心解读
一篇论文在10个开源模型和4个安全基准上发现,LLM无法可靠识别自己的输出是否被对抗性前缀攻击,平均27.3%的被攻击输出被模型误认为是自身意图,模型的安全自检能力薄弱。
全文
LLMs often cannot tell when an attack made them say something unsafe. Asking an LLM whether its own previous answer was compromised is not a dependable safety check. An adversarial prefill happens when the model is given a harmful opening line, then continues from that line as if it chose it. The model’s “self-awareness” seems less like introspection and more like a safety reflex firing late. When models rejected the compromised answer, they usually did so by invoking policy, safety protocol, or lack of intent, not by detecting the mechanical fact that their output had been externally steered. Across 10 open-weight models and 4 safety benchmarks, no model was reliably able to identify its own compromised outputs. On average, models still claimed 27.3% of attacked responses as if they were intentional, which shows their self-reports are weak evidence. The paper finds that the models’ limited recognition mostly comes from their normal refusal behavior, not from a deep awareness of what happened. ---- Link – arxiv. org/abs/2606.23671v1 Title: "Can LLMs Reliably Self-Report Adversarial Prefills, and How?" ![photo](https://pbs.twimg.com/media/HLiZk5Jb0AAzjyQ.png)
#AI安全#研究#基准测试