SFT Drives Gemini’s Safety Properties

Zac Boring June 13, 2026 1 min read

This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here. In this short post, we describe a surprising finding: most safety-relevant properties in Gemini seem to be caused by the combination of pretraining and SFT, not other training stages like RL. We do not want to overstate this claim as applying to other model families, and we also note that this may chang

By Josh Engels

Read the full article at Alignment Forum →