Analysis

(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

Zac Boring March 30, 2026 1 min read

Authors: Satvik Golechha*, Sid Black*, Joseph Bloom

* Equal Contribution.

This work was done as part of the Model Transparency team at the UK AI Security Institute (AISI).

Executive Summary

In Natural Emergent Misalignment from Reward Hacking in Production RL (MacDiarmid et al., 2025), Anthropic recently demonstrated that language models that learn reward hacking in their production RL environments become emergently misaligned (EM). Their pipeline, illustrated below, proceeds from pre-training throu

By 7vik

Read the full article at LessWrong AI →