Analysis

Learned Chain-of-Thought Obfuscation Generalises to Unseen Tasks

Zac Boring, May 22, 2026

TL;DR: Training against a CoT or summary-only monitor can lead to obfuscation of dangerous reasoning in unseen tasks. This strengthens the “don’t train against a monitor” claims.

[Figure 1]

Two prior results: penalising the CoT or final response produces obfuscation within the training distribution (Baker et al. 2025; Skaf et al. 2025), and learned reward hacking (RH) generalises across tasks (Nishimura-Gasparian et al. 2024). We combine them and find that obfuscation itself generalises: pressure…

By Nathaniel Mitrani

Read the full article at LessWrong AI →