
LLM-Driven Feature Discovery

Zac Boring · June 22, 2026 · 1 min read

We would often like to get a qualitative sense of a target model’s behaviors in important distributions (e.g. deployment, RL training, or evals). For example, we might want to discover novel behaviors, figure out what causes some target behavior to occur, or find surprising correlations between behaviors. In a recent short exploratory project, we tackled this problem via LLM-Driven Feature Discovery. Our method works as follows:

1. Choose a dataset of model transcripts
2. Split transcripts into three pi

By Josh Engels

Read the full article at Alignment Forum →