How well do models follow their constitutions?
This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. There's been a lot of buzz around Claude's 30K-word constitution ("soul doc") and the unusual ways Anthropic is integrating it into training. If we can robustly train complex and nuanced values into a model, this would be a big deal for safety! But does it actually work? This is a preliminary investigation to find out. We decomposed the soul doc into 205 testable tenets and ran adversarial multi-
By aryaj