Research

Risk from fitness-seeking AIs: mechanisms and mitigations

Zac Boring · May 1, 2026 · 1 min read

Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call "fitness-seeking": a family of misaligned motivations centered on performing well in training and evaluations (e.g., reward-seeking). Fitness-seeking warrants substantial concern. In this piece, I lay out what I take to be the central mechanisms by which fitness-seeki

By Alex Mallen

Read the full article at Alignment Forum →