Your Model Organisms Might Be Fried
Context: We are the ‘model motivations’ team at Arcadia Alignment. We aim to build a science of ‘model intentions’, unifying insights from personas and other empirical evidence. In this post, we’ll outline the need for much better model organisms and how we might get there.

The case for building more natural model organisms for alignment research

Model organisms are how we study alignment-relevant pathologies (such as secret loyalties, reward hacking, and sandbagging) and are used as a testbed for
By Daniel Tan