Tracking AI existential risk. Auto-aggregated headlines. Human-curated analysis.
AGGREGATING 47 SOURCES · UPDATED LIVE
Research
Zac Boring 2 hours ago Research
Why Do Naive SFT Filters For Safety Properties Fail?
via Alignment Forum [999] — This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The third post can be found here. Since SFT is the cause for many safety relevant…
Zac Boring a day ago Research
SFT Drives Gemini’s Safety Properties
via Alignment Forum [999] — This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here. In this short post, we describe a surprising finding:…
Zac Boring 2 days ago Research
Building and evaluating model diffing agents
via Alignment Forum [999] — This is the second in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The first post can be found here. TL;DR: It is possible to build extremely simple agents that…
Zac Boring 2 days ago Research
Sympathy for both sides of the egregious misalignment debate
via Alignment Forum [999] — On one side of this debate are Yudkowsky & Soares, who think that (if AI progress continues) we’re on a direct path to egregiously-misaligned, scheming, out-of-control, rogue superintelligence (ASI), not even slightly nice, in the absence of…
Zac Boring 3 days ago Research
From AGI to ASI
via ArXiv cs.AI [8] — Over the last decade, building human-level artificial general intelligence has moved from far-fetched speculation to being a concrete next-decade target for many of the largest AI organisations. Achieving this goal would have profound and far-reaching…
Zac Boring 4 days ago Research
Models May Behave Worse When Eval Aware
via Alignment Forum [999] — This is the first in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. TL;DR: It's often assumed that models will act more aligned when they can tell they're being…
Zac Boring 4 days ago Research
Position: Hippocampal Explicit Memory Is the Cornerstone for AGI
via ArXiv cs.AI [10] — Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, raising expectations for Artificial General Intelligence (AGI). This position paper argues that integrating explicit memory is the cornerstone for advancing LLMs…
Zac Boring 4 days ago Research
Sequent: scale and automation for higher confidence in alignment
via Alignment Forum [999] — Alignment is not on track. Artificial superintelligence (ASI) may be developed in the next few years. It is unclear whether alignment is on track to be ready on the same timeframe. At a minimum, the empirical programs at AI labs are unlikely to deliver…
Zac Boring 4 days ago Research
Investing in multi-agent AI safety research
via DeepMind Blog [7] — Google DeepMind and partners announce a $10M funding call for multi-agent safety research.
Zac Boring 4 days ago Research
Tracing Eval-Awareness Emergence Through Training of OLMo 3
via Alignment Forum [999] — TL;DR: Recent work from Goodfire & UK AISI – Verbalized Eval Awareness Inflates Measured Safety – shows that newer open-weight models verbalize evaluation-awareness (VEA) more often, and that this inflates measured safety. Between OLMo-3-32B-Think and…
Zac Boring 5 days ago Research
A Mike's-Eye View of ARC's Research
via Alignment Forum [999] — Over the past 15 months or so, ARC's technical agenda has developed quite a bit. The advent of the Matching Sampling Principle (MSP), and ideas like it, has begotten a host of concrete technical problems; progress on those problems has given us more…
Zac Boring 6 days ago Research
Efficient tradeoffs and the safety-usefulness tradeoff model
via Alignment Forum [999] — I often use what I’ll call the “safety-usefulness tradeoff model”, which is: developers face a tradeoff between "safety" and "usefulness" of an AI deployment, and the developer has only limited willingness or ability to sacrifice usefulness for the…
Zac Boring 6 days ago Research
Announcing major new donations, and recapping the 2025 fundraiser
via MIRI [999] — This past December, we ran our first fundraiser in six years, setting an ambitious goal of $6M. We ended up receiving a total of $1.8M from small donors and $1.6M in matching from the Survival and Flourishing Fund (SFF) for a total of $3.4M. We’re incredibly…
Zac Boring 9 days ago Research
My research agenda and work
via Alignment Forum [999] — This is a summary of the work I've done and work I plan to do, and the theories of change and AI progress that motivate my work. I've been working full-time on alignment for three years and change, and thinking about brainlike AGI and its alignment…
Zac Boring 10 days ago Research
How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment
via ArXiv cs.AI [5] — This study analyzes a publicly released dataset from a discontinued field experiment on Reddit's r/ChangeMyView. The intervention, conducted by unknown, external researchers and halted following ethical backlash, involved undisclosed AI-generated accounts…
Zac Boring 11 days ago Research
The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents
via ArXiv cs.AI [4] — As autonomous AI agents move from conversational systems to long-horizon software execution, runtime safety layers that decide when to interrupt an agent have become essential. We study this timing problem using a continuous 18-dimensional…
Zac Boring 12 days ago Research
Announcing the ARC White-Box Estimation Challenge
via Alignment Forum [999] — ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs. The warm-up round begins this week, and later rounds will have a total prize pool of at least…
Zac Boring 16 days ago Research
Testing Gemini models for scheming tendencies
via Alignment Forum [999] — As AI models become increasingly capable and autonomous, keeping them safely aligned with human intentions is critical. Extending our previous work on evaluating scheming capabilities, we introduce complementary approaches to test whether AI models…
Zac Boring 17 days ago Research
Advice for making robust-to-training model organisms
via Alignment Forum [999] — We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile:…
Zac Boring 18 days ago Research
Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming
via Alignment Forum [999] — Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may realize they are being evaluated ("eval awareness") and then act to look good to us so we don't realize they're misaligned ("eval gaming"). We…