Sycophancy Towards Researchers Drives Performative Misalignment
This work was done by Rustem Turtayev, David Vella Zarb, and Taywon Min during MATS 9.0, mentored by Shi Feng, based on prior work by David Baek. We are grateful to our research manager Jinghua Ou for helpful suggestions on this blog post.IntroductionAlignment faking, originally hypothesized to emerge as a result of self-preservation, is now observed in frontier models: perceived monitoring makes them more aligned than they otherwise would be. However, some details of the eval raise questions: w
By Taywon Min