Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines

Zac Boring March 21, 2026

Figure 1: Contrastive (difference-of-means, English→Mandarin) feature directions elicit a downstream response at much smaller perturbation magnitudes than SAE directions, which behave similarly to random directions. This holds across multiple models and experimental setups.

Summary & Main Results

Understanding how concepts are represented in LLM internals would be extremely useful for AI safety (generally understan

By Francisco Ferreira da Silva

Read the full article at LessWrong AI →