Advice for making robust-to-training model organisms

Zac Boring, May 28, 2026

We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers, such as Roger et al. and Ryd et al.) have observed that they stop misbehaving after untargeted training, that is, training that doesn't directly target the misbehavior. For example, we have observed that simple untargeted training method

By SebastianP

Read the full article at Alignment Forum →