Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?

Zac Boring, June 24, 2026

Mechanistic interpretability has made substantial progress in automatically localizing circuits, but explaining what localized components do remains labor-intensive and difficult to standardize. In this work, we study whether language model (LM) agents can assist with this explanation problem once a circuit has already been identified. We introduce AgenticInterpBench, a benchmark for circuit explanation built from 84 semi-synthetic transformer circ

By Ayan Antik Khan, Harsh Kohli, Yuekun Yao, Huan Sun, Ziyu Yao

Read the full article at arXiv cs.AI →