Alignment pretraining could backfire

Zac Boring · June 17, 2026 · 1 min read

There has been recent interest in generating synthetic documents to upsample examples of aligned AI during LLM pretraining. See, for instance, Geodesic's Alignment Pretraining paper or Anthropic's "Teaching Claude Why." I worry that this strategy may work well for models up to moderate capability but backfire in dangerous, hard-to-notice ways once models acquire high situational awareness. I speculate that these techniques could lea

By Alexandre Variengien

Read the full article at LessWrong AI →