
Terrified Comments on Corrigibility in Claude's Constitution

Zac Boring | March 16, 2026 | 1 min read

(Previously: Prologue.) As a term of art in AI alignment, corrigibility was coined to refer to the property of an AI being willing to let its preferences be modified by its creator. Corrigibility in this sense was believed to be a desirable but unnatural property that would require more theoretical progress to specify, let alone implement. Desirable, because if you don't think you specified your AI's preferences correctly the first time, you want to be able to change your mind (by changin

By Zack_M_Davis

Read the full article at LessWrong AI →