Tracking AI existential risk. Auto-aggregated headlines. Human-curated analysis.
AGGREGATING 47 SOURCES · UPDATED LIVE
Analysis

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors

Zac Boring March 11, 2026 1 min read
Read original source →

TL;DR We release AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors—such as sycophantic deference, opposition to AI regulation, or hidden loyalties—which they do not confess to when asked. We also develop an agent that audits models using a configurable set of tools. Using this agent, we study which tools are most effective for auditing.IntroductionAlignment auditing—investigating AI systems to uncover hidden or unintended behav

By abhayesian

Read the full article at LessWrong AI →