Analysis

You Can Catch Sleeper Agents by Teaching Another Model to Imitate Them

Zac Boring June 11, 2026 1 min read

Detecting Hidden Behaviors in LLMs via Activation-matched Finetuning — preprint, 2026. [Paper] [Code]TLDR. Given a model with some unknown, abnormal behavior (backdoors, censorship, reward hacking, ...), construct an aligned reference by training a clean model to match the suspect's residual-stream activations on a benign prompt corpus. The remaining residual concentrates exactly on such abnormal behavior: the reference extrapolates well to unseen benign prompts, but the hidden behavior and its

By RobinHa

Read the full article at LessWrong AI →