[Linkpost] Interpreting Language Model Parameters

Zac Boring May 5, 2026 1 min read

This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1], and decompose the parameters of a small[2] language model with it. VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think the parameter decomposition approach is now more-or-less ready to be applied at scale to models people care about. Importan…

By Lucius Bushnaq

Read the full article at Alignment Forum →