Research

A Toy Environment For Exploring Reasoning About Reward

Zac Boring March 25, 2026 1 min read

tldr: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. Over the course of capabilities-focused RL, the model biases more strongly towards reward hints over direct instruction in this environment.

Setup

When we noticed the increase in verbalized alignment evaluation awareness during capabilities-focused RL, we initially thought that the right mental model was something like: “the model wants to figure out if it’s bein

By jenny

Read the full article at Alignment Forum →