AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
TL;DR: We release AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors—such as sycophantic deference, opposition to AI regulation, or hidden loyalties—which they do not confess to when asked. We also develop an agent that audits models using a configurable set of tools. Using this agent, we study which tools are most effective for auditing.

Introduction

Alignment auditing—investigating AI systems to uncover hidden or unintended behaviors
By abhayesian