Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

Zac Boring | May 4, 2026 | 1 min read

Safety-trained large language models (LLMs) can often be induced to answer harmful requests through jailbreak prompts. Because we lack a robust understanding of why LLMs are susceptible to jailbreaks, future frontier models operating more autonomously in higher-stakes settings may similarly be vulnerable to such attacks. Prior work has studied jailbreak success by examining the model's intermediate representations, identifying directions in this sp

By Shubham Kumar, Narendra Ahuja

Read the full article at arXiv cs.AI →