
An Introduction to Exemplar Partitioning for Mechanistic Interpretability

Zac Boring, May 16, 2026

Most of what we currently call "feature discovery" in language models is wrapped up in dictionary-learning methods like sparse autoencoders (SAEs) – which work, and which have been scaled to millions of features on frontier-scale models, but which bundle two distinct commitments into a single training objective: a reconstruction loss and a sparsity loss over a fixed-size dictionary. Those commitments make sense if your goal is reconstructive decomposition – if you want to take an activation and

By Jessica Rumbelow

Read the full article at LessWrong AI →