Building and evaluating model diffing agents

Zac Boring June 12, 2026 1 min read

This is the second in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The first post can be found here.

TL;DR

It is possible to build extremely simple agents that reliably find interesting behavioural differences between distinct models. We call these ‘diffing agents’.

The closest previous ‘behavioural model diffing’ work has focussed on understanding behavioural differences between two models on some static prompt

By bilalchughtai

Read the full article at Alignment Forum →