Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines
Figure 1: Contrastive (difference-of-means, English→Mandarin) feature directions elicit a downstream response at much smaller perturbation magnitudes than SAE directions, which behave similarly to random directions. This holds across multiple models and experimental setups.

Summary & Main Results

Understanding how concepts are represented in LLM internals would be extremely useful for AI safety (generally understan
By Francisco Ferreira da Silva