Tracking AI existential risk. Auto-aggregated headlines. Human-curated analysis.
AGGREGATING 47 SOURCES · UPDATED LIVE
Research

[Linkpost] Interpreting Language Model Parameters

Zac Boring May 5, 2026 1 min read
Read original source →

This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1] and decompose the parameters of a small[2] language model with it. VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think the parameter decomposition approach is now more-or-less ready to be applied at scale to models people care about.Importan

By Lucius Bushnaq

Read the full article at Alignment Forum →