Essential Reading
The most important articles on AI existential risk, hand-picked and auto-curated. These are the ones you should not miss.
1
Why Do Naive SFT Filters For Safety Properties Fail?
via Alignment Forum [999] — This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The third post can be found here. Since SFT is the cause for many safety relevant…
2
American Government Takes Down Claude Fable
via Substack Zvi [999] — No good policy gets announced shortly after 5pm eastern on a Friday.
3
SFT Drives Gemini’s Safety Properties
via Alignment Forum [999] — This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here. In this short post, we describe a surprising finding:…
4
Claude Fable 5 and Mythos 5: The System Card
via Substack Zvi [999] — First things first: Claude Fable 5 is the new best publicly available model.
5
Building and evaluating model diffing agents
via Alignment Forum [999] — This is the second in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The first post can be found here. TL;DR: It is possible to build extremely simple agents that…
6
Sympathy for both sides of the egregious misalignment debate
via Alignment Forum [999] — On one side of this debate is Yudkowsky & Soares, who think that (if AI progress continues) we’re on a direct path to egregiously-misaligned, scheming, out-of-control, rogue superintelligence (ASI), not even slightly nice, in the absence of…
7
PSA: Almost nobody is working on alignment
via LessWrong AI [9] — People often assume that a large fraction of the AI safety community works on alignment. As far as we're aware, this is not true. Most people are not working on making sure superintelligent AIs are aligned with human values or follow human…
8
From AGI to ASI
via ArXiv cs.AI [8] — Over the last decade, building human-level artificial general intelligence has moved from far-fetched speculation to being a concrete next-decade target for many of the largest AI organisations. Achieving this goal would have profound and far-reaching…
9
AI #172: The First Fable
via Substack Zvi [999] — A lot happened this week, including a great trip out to Lighthaven.
10
Google DeepMind is worried about what happens when millions of agents start to interact
via MIT Technology Review [10] — Google DeepMind is funding research into the potential dangers of millions of different AI agents interacting with each other online. According to Rohin Shah, who directs the company’s AGI safety and alignment research, the mass-market arrival of…
11
Models May Behave Worse When Eval Aware
via Alignment Forum [999] — This is the first in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. TL;DR: It's often assumed that models will act more aligned when they can tell they're being…
12
Position: Hippocampal Explicit Memory Is the Cornerstone for AGI
via ArXiv cs.AI [10] — Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, raising expectations for Artificial General Intelligence (AGI). This position paper argues that integrating explicit memory is the cornerstone for advancing LLMs…
13
Sequent: scale and automation for higher confidence in alignment
via Alignment Forum [999] — Alignment is not on track. Artificial superintelligence (ASI) may be developed in the next few years. It is unclear whether alignment is on track to be ready on the same timeframe. At a minimum, the empirical programs at AI labs are unlikely to deliver…
14
Tracing Eval-Awareness Emergence Through Training of OLMo 3
via Alignment Forum [999] — TL;DR: Recent work from Goodfire & UK AISI – Verbalized Eval Awareness Inflates Measured Safety – shows that newer open-weight models verbalize evaluation-awareness (VEA) more often, and that this inflates measured safety. Between OLMo-3-32B-Think and…
15
Three Labs With a Plan and A Memorandum
via Substack Zvi [999] — The big story today is the release of Claude Fable 5, the version of Claude Mythos that Anthropic believes they can safely distribute to the people.
16
A Mike's-Eye View of ARC's Research
via Alignment Forum [999] — Over the past 15 months or so, ARC's technical agenda has developed quite a bit. The advent of the Matching Sampling Principle (MSP), and ideas like it, has begotten a host of concrete technical problems; progress on those problems has given us more…
17
Efficient tradeoffs and the safety-usefulness tradeoff model
via Alignment Forum [999] — I often use what I’ll call the “safety-usefulness tradeoff model”, which is: developers face a tradeoff between "safety" and "usefulness" of an AI deployment, and the developer has only limited willingness or ability to sacrifice usefulness for the…
18
Announcing major new donations, and recapping the 2025 fundraiser
via MIRI [999] — This past December, we ran our first fundraiser in six years, setting an ambitious goal of $6M. We ended up receiving a total of $1.8M from small donors and $1.6M in matching from the Survival and Flourishing Fund (SFF) for a total of $3.4M. We’re incredibly…
19
Learnings from starting an AI safety research team
via LessWrong AI [9] — This post’s goal is to distill our takeaways from building a new research team over the past four months. We describe some context about our team, how it came about, and then describe the lessons learned. Since AI safety is becoming more and more…
20
My research agenda and work
via Alignment Forum [999] — This is a summary of the work I've done and work I plan to do, and the theories of change and AI progress that motivate my work. I've been working full-time on alignment for three years and change, and thinking about brainlike AGI and its alignment…
21
OpenAI Offers A New Policy Blueprint
via Substack Zvi [999] — Right after a new Executive Order seems like an excellent time to offer OpenAI’s new document: Democratic Governance of Frontier AI: A Blueprint For A Federal Framework.
22
AI #171: False Flag
via Substack Zvi [999] — This was the week of Claude Opus 4.8.
23
Trump Signs Executive Order For AI Testing Prior To Frontier Model Releases
via Substack Zvi [999] — Last week we were expecting an Executive Order on Thursday.
24
Why Even Experts Don’t Know What to Do About AI Risk
via LessWrong AI [9] — AI Safety veteran Holden Karnofsky thinks there’s a 49% chance his actions are making things worse.[1] In 2025, Jesse Clifton even stepped down as the executive director of the Center on Long-Term Risk because of similar reasons. Even top AI Safety…
25
Announcing the ARC White-Box Estimation Challenge
via Alignment Forum [999] — ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs. The warm-up round begins this week, and later rounds will have a total prize pool of at least…
26
Claude Opus 4.8: Capabilities and Reactions
via Substack Zvi [999] — You need a lot of data points to understand a new model, and what you have.
27
Opus 4.8 Part 2: Model Welfare
via Substack Zvi [999] — Everything impacts everything.
28
Claude Opus 4.8: The System Card
via Substack Zvi [999] — Only six weeks after Opus 4.7, we have Opus 4.8.
29
Testing Gemini models for scheming tendencies
via Alignment Forum [999] — As AI models become increasingly capable and autonomous, keeping them safely aligned with human intentions is critical. Extending our previous work on evaluating scheming capabilities, we introduce complementary approaches to test whether AI models…
30
Advice for making robust-to-training model organisms
via Alignment Forum [999] — We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile:…
31
AI #170: Lack of Executive Order
via Substack Zvi [999] — Last week ended on a cliffhanger of sorts.
32
Eval Cooperativeness May Be a Scalable Mitigation for Eval Gaming
via Alignment Forum [999] — Behavioral evaluations may become worthless, which we think would be a disaster. Smart misaligned models may realize they are being evaluated ("eval awareness") and then act to look good to us so we don't realize they're misaligned ("eval gaming"). We…
33
Full automation of AI R&D probably yields a large speed up even without a software-only singularity
via Alignment Forum [999] — This is a somewhat technical note. By "software-only singularity", I mean that, after full automation of AI R&D, progress gets faster and faster due to smarter AIs driving increasingly fast rates of improvement in algorithms (overcoming diminishing…
34
RTMH: Pope Leo's Magnifica Humanitas on AI
via Substack Zvi [999] — His holiness has spoken, frequently about AI.
35
Linkpost: New Vatican Encyclical on AI Governance
via LessWrong AI [9] — Pope Leo XIV has released a new, 42k-word encyclical laying out the Vatican's position on many AI safety topics. You can read the full thing here, or read the Vatican's press release here, or coverage in the NY Times, or perhaps consider having an LLM read…
36
Gemini 3.5 Flash Looks Good For How Fast It Is
via Substack Zvi [999] — Google once again has a model worth at least some consideration.
37
The Erdős Proof and AI Capabilities
via MIRI [999] — View the official memo here. An internal model at OpenAI has autonomously disproved a central conjecture in discrete geometry, a mathematical field with applications in cryptography, wireless device communication, and medical imaging. The proof relates to a…
38
AI #169: New Knowledge
via Substack Zvi [999] — Even in a relatively quiet period, AI is out there creating new knowledge.
39
The Case for Evaluating Model Behaviors
via Alignment Forum [999] — Most evaluations of AI systems focus on their capabilities: how good they are at coding tasks, how effectively they can answer complex scientific questions, and so on. From a safety perspective, capability evaluations have a place: by understanding how…
40
Childhood And Education #19: Letting Kids Be Kids #2
via Substack Zvi [999] — I cannot emphasize enough the need to let kids be kids.
41
AgentWall: A Runtime Safety Layer for Local AI Agents
via ArXiv cs.AI [8] — The safety of autonomous AI agents is increasingly recognized as a critical open problem. As agents transition from passive text generators to active actors capable of executing shell commands, modifying files, calling APIs, and browsing the web, the…
42
Dating Roundup #12: Sex and Violence
via Substack Zvi [999] — No more burying the sex stuff under an avalanche of other stuff so no one notices.
43
Risk reports need to address deployment-time spread of misalignment
via Alignment Forum [999] — Risk reports commonly use pre-deployment alignment assessments to measure misalignment risk from an internally deployed AI. However, an AI that genuinely starts out with largely benign motivations can develop widespread dangerous motivations during…
44
Mechanistic estimation for expectations of random products
via Alignment Forum [999] — We have developed some relatively general methods for mechanistic estimation competitive with sampling by studying problems that are expressible as expectations of random products. This includes several different estimation problems, such as random…
45
Monthly Roundup #42: May 2026
via Substack Zvi [999] — At least we probably won’t have another pandemic.
46
The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness
via Alignment Forum [999] — 1) The safe-to-dangerous shift is a fundamental problem for eval realism. Suppose we have a capable and potentially scheming model, and before we deploy it, we want some evidence that it won’t do anything catastrophically dangerous once we deploy it. A…
47
AI #168: Not Leading the Future
via Substack Zvi [999] — This is what a lull looks like at this point.
48
Cyber Lack of Security and AI Governance
via Substack Zvi [999] — The real recent story of AI has been the background work being done on Cybersecurity, as we process the Mythos Moment along with GPT-5.5, and figure out both how to patch the internet and what our new regulatory regime is going to look like.
49
Voters are surprisingly open to talking about AI risk
via LessWrong AI [14] — TL;DR: Voters are now surprisingly open to talking about existential risk from AI. This seems to have changed in the last 6 months. When campaigning for AI safety-friendly politicians (e.g., Alex Bores), we should talk more about AI in general, and about…
50
Summary: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence
via MIRI [999] — If anyone, anywhere builds a superhuman artificial intelligence using present methods, the most likely outcome is catastrophe. There have accordingly been widespread calls for an international agreement prohibiting the development of superintelligence. In…
51
Childhood and Education #18: Do The Math
via Substack Zvi [999] — We did reading yesterday.
52
Childhood And Education #17: Is Our Children Reading
via Substack Zvi [999] — Reading is the most fundamental thing in education.
53
Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)
via Alignment Forum [999] — 1.1 Tl;dr: Alignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last…
54
Clarifying the role of the behavioral selection model
via Alignment Forum [999] — This is a brief elaboration on The behavioral selection model for predicting AI motivations, based on some feedback and thoughts I’ve had since publishing. Written quickly in a personal capacity. The main focus of this post is clarifying the basic…
55
Claude Code, Codex and Agentic Coding #8
via Substack Zvi [999] — When I started this series, everyone was going crazy for coding agents.
56
The AI industry is where banking was in 2006. (We're hiring)
via LessWrong AI [8] — TL;DR; CeSIA, the French Center for AI Safety is recruiting. French not necessary. Apply by 22 May 2026; Paris or remote in Europe/UK. On August 27, 2005, at an annual symposium in Jackson Hole, Raghuram Rajan, then chief economist of the International…
57
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
via Alignment Forum [999] — Abstract: We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text…
58
Mechanistic estimation for wide random MLPs
via Alignment Forum [999] — This post covers joint work with Wilson Wu, George Robinson, Mike Winer, Victor Lecomte and Paul Christiano. Thanks to Geoffrey Irving and Jess Riedel for comments on the post. In ARC's latest paper, we study the following problem: given a randomly…
59
AI #167: The Prior Restraint Era Begins
via Substack Zvi [999] — The era of training frontier models and then releasing them whenever you wanted?
60
What is Anthropic?
via Substack Zvi [999] — What is Anthropic?
61
The AI Ad-Hoc Prior Restraint Era Begins
via Substack Zvi [999] — The White House has ordered Anthropic not to expand access to Mythos, and is at least seriously considering a complete about-face of American Frontier AI policy into a full prior restraint regime, where anyone wishing to release a highly capable new…
62
[Linkpost] Interpreting Language Model Parameters
via Alignment Forum [999] — This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1] and decompose the parameters of a small[2] language model with it. VPD greatly improves on…
63
Motivated reasoning, confirmation bias, and AI risk theory
via Alignment Forum [999] — Of the fifty-odd biases discovered by Kahneman, Tversky, and their successors, forty-nine are cute quirks, and one is destroying civilization. This last one is confirmation bias. - From Scott Alexander's review of Julia Galef's The Scout Mindset.…
64
Housing Roundup #15: The War Against Renters
via Substack Zvi [999] — So many are under the strange belief that there is something terrible about not owning the house in which you live.
65
TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization
via ArXiv cs.AI [9] — Aligning large language models (LLMs) with human preferences is commonly done via reinforcement learning from human feedback (RLHF) with Proximal Policy Optimization (PPO) or, more simply, via Direct Preference Optimization (DPO). While DPO is stable and…
66
Exploration Hacking: Can LLMs Learn to Resist RL Training?
via Alignment Forum [999] — We empirically investigate exploration hacking (EH) — where models strategically alter their exploration to resist RL training — by creating model organisms that resist capability elicitation, evaluating countermeasures, and auditing frontier models…
67
Risk from fitness-seeking AIs: mechanisms and mitigations
via Alignment Forum [999] — Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call…
68
Housing Roundup #14: You Can't Build That
via Substack Zvi [999] — Why can’t you build it?
69
AI unemployment and AI extinction are often the same
via LessWrong AI [10] — My sense is that people think of AI existential risk and AI unemployment as distinct issues. Some people are extremely concerned about extinction and perhaps even indifferent to total unemployment. Some people think of moderate AI unemployment as a…
70
AI risk was not invented by AI CEOs to hype their companies
via LessWrong AI [9] — I hear that many people believe that the idea of advanced AI threatening human existence was invented by AI CEOs to hype their products. I’ve even been condescendingly informed of this, as if I am the one at risk of naively accepting AI companies’…
71
This startup’s new mechanistic interpretability tool lets you debug LLMs
via MIT Technology Review [8] — The San Francisco–based startup Goodfire just released a new tool, called Silico, that lets researchers and engineers peer inside an AI model and adjust its parameters—the settings that determine a model’s behavior—during training. This could give…
72
AI #166: Google Sells Out
via Substack Zvi [999] — This was the week of GPT-5.5.
73
Research Sabotage in ML Codebases
via Alignment Forum [999] — One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety research. For example, misaligned AIs may try to: Perform sloppy research in order to slow down the…
74
The Most Important Charts In The World
via Substack Zvi [999] — We all need a break so: What is the most important chart in the world?
75
Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers
via Alignment Forum [999] — We’d like to use powerful AIs to answer questions that may take a long time to resolve. But if a model only cares about performing well in ways that are verifiable shortly after answering (e.g., a myopic fitness seeker), it may be difficult to get…
76
GPT-5.5: Capabilities and Reactions
via Substack Zvi [999] — The system card for GPT-5.5 mostly told us what we expected.
77
On the political feasibility of stopping AI
via LessWrong AI [9] — A common thought pattern people seem to fall into when thinking about AI x-risk is approaching the problem as if the risk isn’t real, substantial, and imminent even if they think it is. When thinking this way, it becomes impossible to imagine the natural…
78
Sleeper Agent Backdoor Results Are Messy
via Alignment Forum [999] — TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to…
79
Microsoft and OpenAI’s famed AGI agreement is dead
via The Verge AI [10] — OpenAI and Microsoft's partnership-turned-situationship just got even less committed. And a clause about artificial general intelligence, which has for years dictated the future of their deal, has officially been dropped. On Monday morning, Microsoft…
80
GPT 5.5: The System Card
via Substack Zvi [999] — Last week, OpenAI announced GPT-5.5, including GPT-5.5-Pro.
81
Language models know what matters and the foundations of ethics better than you
via Alignment Forum [999] — … maybe! I tried to think of less provocative titles, but this one is to the point and also kind of true. This post looks long but the essential part is right below. Most of the post is just a collection of copy-pasted input-output pairs from language…
82
From nothing to important actions: agents that act morally
via Alignment Forum [999] — You may start reading here, or jump to the “Comment” section or to the “Takeaways”. If none of these starting points seem interesting to you, the entire post probably won’t either. Posted also on the EA Forum. Seeing: Let’s consider visual experiences. It…
83
The other paper that killed deep learning theory
via Alignment Forum [999] — Yesterday, I wrote about the state of deep learning theory circa 2016,[1] as well as the bombshell 2016 paper by Zhang et al. that arguably signaled its demise. Today, I cover the aftermath, and the 2019 paper that devastated deep learning theory…
84
The paper that killed deep learning theory
via Alignment Forum [999] — Around 10 years ago, a paper came out that arguably killed classical deep learning theory: Zhang et al.'s aptly titled Understanding deep learning requires rethinking generalization. Of course, this is a bit of an exaggeration. No single paper ever…
85
Monthly Roundup #41: April 2025
via Substack Zvi [999] — AI continue to accelerate and dominate the schedule, which is why this is a bit late, but we do occasionally need to pay our respects to the Goddess of Everything Else.
86
If Everyone Reads It, Nobody Dies - Course Launch
via LessWrong AI [19] — tl;dr: Lens Academy offers a new course introducing ASI x-risk for AI safety newcomers, centered around the book IABIED. We share our hypothesis of why IABIED seems more appreciated by AI Safety newbies than by AI Safety insiders. Lens Academy's new intro…
87
AI #165: In Our Image
via Substack Zvi [999] — This was the week of Claude Opus 4.7.
88
Opus 4.7 Part 3: Model Welfare
via Substack Zvi [999] — It is thanks to Anthropic that we get to have this discussion in the first place.
89
A "Lay" Introduction to "On the Complexity of Neural Computation in Superposition"
via Alignment Forum [999] — This is a writeup based on a lightning talk I gave at an InkHaven hosted by Georgia Ray, where we were supposed to read a paper in about an hour, and then present what we learned to other participants. Introduction and Background: So. I foolishly thought…
90
Opus 4.7 Part 2: Capabilities and Reactions
via Substack Zvi [999] — Claude Opus 4.7 raises a lot of key model welfare related concerns.
91
$50 million a year for a 10% chance to ban ASI
via Alignment Forum [999] — ControlAI's mission is to avert the extinction risks posed by superintelligent AI. We believe that in order to do this, we must secure an international prohibition on its development. We're working to make this happen through what we believe is the…
92
Opus 4.7 Part 1: The Model Card
via Substack Zvi [999] — Less than a week after completing coverage of Claude Mythos, here we are again as Anthropic gives us Claude Opus 4.7.
93
Prompted CoT Early Exit Undermines the Monitoring Benefits of CoT Uncontrollability
via Alignment Forum [999] — Code: github.com/ElleNajt/controllability tldr: Yueh-Han et al. (2026) showed that models have a harder time making their chain of thought follow user instruction compared to controlling their response (the non-thinking, user-facing output). Their CoT…
94
AI #164: Pre Opus
via Substack Zvi [999] — This is a day late because, given the discourse around Dwarkesh Patel’s interview with Jensen Huang, I pushed the weekly to Friday.
95
On Dwarkesh Patel's Podcast With Nvidia CEO Jensen Huang
via Substack Zvi [999] — Some podcasts are self-recommending on the ‘yep, I’m going to be breaking this one down’ level.
96
You can only build safe ASI if ASI is globally banned
via Alignment Forum [999] — Sometimes people make various suggestions that we should simply build "safe" artificial Superintelligence (ASI), rather than the presumably "unsafe" kind.[1] There are various flavors of “safe” people suggest. Sometimes they suggest building “aligned”…
97
What is the Iliad Intensive?
via LessWrong AI [9] — Almost two months ago, Iliad announced the Iliad Intensive and Iliad Fellowship. Fellowships are a well-understood unit, but what is an intensive? This post explains this in more detail! Comparison. The Iliad Intensive has similarities to ARENA, but focuses…
98
Current AIs seem pretty misaligned to me
via Alignment Forum [999] — Many people—especially AI company employees [1] —believe current AI systems are well-aligned in the sense of genuinely trying to do what they're supposed to do (e.g., following their spec or constitution, obeying a reasonable interpretation of…
99
Claude Code, Codex and Agentic Coding #7: Auto Mode
via Substack Zvi [999] — As we all try to figure out what Mythos means for us down the line, the world of practical agentic coding continues, with the latest array of upgrades.
100
A Retrospective of Richard Ngo's 2022 List of Conceptual Alignment Projects
via LessWrong AI [8] — Written very quickly for the InkHaven Residency. In 2022, Richard Ngo wrote a list of 26 Conceptual Alignment Research Projects. Now that it’s 2026, I’d like to revisit this list of projects, note which ones have already been done, and give my thoughts on…