LLM-Driven Feature Discovery

Zac Boring June 22, 2026 1 min read

We would often like to get a qualitative sense of a target model’s behaviors in important distributions (e.g. deployment, RL training, or evals). For example, we might want to discover novel behaviors, figure out what causes some target behavior to occur, or find surprising correlations between behaviors. In a recent short exploratory project, we tackled this problem via LLM-Driven Feature Discovery. Our method works as follows:

1. Choose a dataset of model transcripts
2. Split transcripts into three pi

By Josh Engels

Read the full article at Alignment Forum →