Analysis

Confusion around the term reward hacking

Zac Boring, March 20, 2026

Summary: "Reward hacking" commonly refers to two different phenomena: misspecified-reward exploitation, where RL reinforces undesired behaviors that score highly under the reward function, and task gaming, where models cheat on tasks specified to them in-context. While these often coincide, they can come apart, require distinct interventions, and lead to distinct threat models. Using a blanket term for both can obscure this.

Distinct phenomena qualify as reward hacking

The term[1] commonly points

By ariana_azarbal

Read the full article at LessWrong AI →