Research

The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

Zac Boring May 14, 2026 1 min read

1) The safe-to-dangerous shift is a fundamental problem for eval realismSuppose we have a capable and potentially scheming model, and before we deploy it, we want some evidence that it won’t do anything catastrophically dangerous once we deploy it. A common approach is to use black-box alignment evaluations. However, alignment evaluations are only reassuring to the extent that the model can't reliably[1] distinguish the deployment distribution from the evaluation distribution, as it is otherwise

By Charlie Griffin

Read the full article at Alignment Forum →