Reporting Tasks as Reward-Hackable: Better Than Inoculation Prompting?
Published on February 21, 2026 1:59 AM GMT

Epistemic status: untested but seems plausible

TL;DR: making honesty the best policy during RL reasoning training

Reward hacking during Reinforcement Learning (RL) reasoning training[1] in insecure or hackably-judged training environments not only allows the model to cheat on tasks rather than learning to solve them, and teaches the model to try to cheat on tasks given to it (evidently not desirable behavior from an end-user/capabilities point of view), bu