Tracking AI existential risk. Auto-aggregated headlines. Human-curated analysis.
AGGREGATING 47 SOURCES · UPDATED LIVE
Research

The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

Zac Boring May 14, 2026 1 min read
Read original source →

1) The safe-to-dangerous shift is a fundamental problem for eval realismSuppose we have a capable and potentially scheming model, and before we deploy it, we want some evidence that it won’t do anything catastrophically dangerous once we deploy it. A common approach is to use black-box alignment evaluations. However, alignment evaluations are only reassuring to the extent that the model can't reliably[1] distinguish the deployment distribution from the evaluation distribution, as it is otherwise

By Charlie Griffin

Read the full article at Alignment Forum →