Test your best methods on our hard CoT interp tasks
Authors: Daria Ivanova, Riya Tyagi, Arthur Conmy, Neel Nanda

Daria and Riya are co-first authors. This work was done during Neel Nanda’s MATS 9.0. Claude helped write code and suggest edits for this post.

TL;DR

One of our best safety techniques right now is “just read the chain of thought”.
But this isn’t always enough: can we learn more by going beyond just reading the reasoning?
Yet it's such an effective technique that it's hard to tell if we have made much progress on improving methods.
To help t
By daria