Tracking AI existential risk. Auto-aggregated headlines. Human-curated analysis.
AGGREGATING 47 SOURCES · UPDATED LIVE
Research

Anthropic repeatedly and accidentally trained against the CoT, demonstrating inadequate processes

Zac Boring April 14, 2026 1 min read
Read original source →

It turns out that Anthropic accidentally trained against the chain of thought of Claude Mythos Preview in around 8% of training episodes. This is at least the second independent incident in which Anthropic accidentally exposed its model's CoT to the oversight signal. In more powerful systems, this kind of failure would jeopardize our ability to safely navigate the intelligence explosion. It is crucial to build good processes that ensure development is executed according to plan, especially as human oversight becomes harder to maintain.

By Alex Mallen

Read the full article at Alignment Forum →