
Against Corrigibility

Zac Boring June 7, 2026 1 min read

A “corrigible” agent, per the LW wiki, is: …one that doesn’t interfere with what we would intuitively see as attempts to ‘correct’ the agent, or ‘correct’ our mistakes in building it; and permits these ‘corrections’ despite the apparent instrumentally convergent reasoning saying otherwise.

Most talk about corrigibility (henceforth without scarequotes) has focused on the fact that it seems diffic

By peralice

Read the full article at LessWrong AI →