
(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL

Zac Boring March 30, 2026 1 min read

Authors: Satvik Golechha*, Sid Black*, Joseph Bloom* Equal Contribution.

This work was done as part of the Model Transparency team at the UK AI Security Institute (AISI).

Executive Summary

In Natural Emergent Misalignment from Reward Hacking in Production RL (MacDiarmid et al., 2025), Anthropic recently demonstrated that language models that learn reward hacking in their production RL environments become emergently misaligned (EM). Their pipeline, illustrated below, proceeds from pre-training throu

By 7vik

Read the full article at LessWrong AI →