AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
TL;DR: We release AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors—such as sycophantic deference, opposition to AI regulation, or hidden loyalties—which they do not confess to when asked. We also develop an agent that audits models using a configurable set of tools. Using this agent, we study which tools are most effective for auditing.

Introduction

Alignment auditing—investigating AI systems to uncover hidden or unintended behaviors
By abhayesian