Tracking AI existential risk. Auto-aggregated headlines. Human-curated analysis.
AGGREGATING 47 SOURCES · UPDATED LIVE
Analysis

Reporting Tasks as Reward-Hackable: Better Than Inoculation Prompting?

Zac Boring February 22, 2026 1 min read
Read original source →

Published on February 21, 2026 1:59 AM GMTEpistemic status: untested but seems plausibleTL;DR: making honesty the best policy during RL reasoning trainingReward hacking during Reinforcement Learning (RL) reasoning training[1] in insecure or hackably-judged training environments not only allows the model to cheat on tasks rather than learning to solve them, and teaches the model to try to cheat on tasks given to it (evidently not desirable behavior from an end-user/capabilities point of view), bu

Read the full article at LessWrong AI →