Analysis

Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour

Zac Boring May 29, 2026 1 min read

SummarySafe deployment of an AI system requires that we can make confident claims about its behaviour on out-of-distribution deployment inputs on the basis of only pre-deployment evaluations. One approach to making such claims is to take a cognitive perspective, in which we interpret the AIs behaviour in terms of latent cognitive constructs, such as motivations, intentions, and goals. Because the same behaviour may be compatible with a range of underlying cognition—such as scheming, fitness-seek

By JasonB

Read the full article at LessWrong AI →