
Why Do Naive SFT Filters For Safety Properties Fail?

Zac Boring June 14, 2026 1 min read

This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, covering interpretability and adjacent areas. The third post can be found here. Since SFT is the cause of many safety-relevant properties, a natural strategy is to filter rollouts with undesirable properties out of the SFT data. However, as we show in this section (and in forthcoming MATS work), SFT data filtering frequently works surprisingly poorly. In this post, we investigate hy

By Josh Engels

Read the full article at Alignment Forum →