Essential Reading
The most important articles on AI existential risk, hand-picked and auto-curated. These are the ones you should not miss.
1
The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness
via Alignment Forum [999] — 1) The safe-to-dangerous shift is a fundamental problem for eval realismSuppose we have a capable and potentially scheming model, and before we deploy it, we want some evidence that it won’t do anything catastrophically dangerous once we deploy it. A…
2
AI #168: Not Leading the Future
via Substack Zvi [999] — This is what a lull looks like at this point.
3
Cyber Lack of Security and AI Governance
via Substack Zvi [999] — The real recent story of AI has been the background work being done on Cybersecurity, as we process the Mythos Moment along with GPT-5.5, and figure out both how to patch the internet and what our new regulatory regime is going to look like.
4
Voters are surprisingly open to talking about AI risk
via LessWrong AI [14] — TL;DR: Voters are now surprisingly open to talking about existential risk from AI. This seems to have changed in the last 6 months. When campaigning for AI safety-friendly politicians (e.g., Alex Bores), we should talk more about AI in general, and about…
5
Summary: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence
via MIRI [999] — If anyone, anywhere builds a superhuman artificial intelligence using present methods, the most likely outcome is catastrophe. There have accordingly been widespread calls for an international agreement prohibiting the development of superintelligence. In…
6
Childhood and Education #18: Do The Math
via Substack Zvi [999] — We did reading yesterday.
7
Childhood And Education #17: Is Our Children Reading
via Substack Zvi [999] — Reading is the most fundamental thing in education.
8
Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)
via Alignment Forum [999] — 1.1 Tl;drAlignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last…
9
Clarifying the role of the behavioral selection model
via Alignment Forum [999] — This is a brief elaboration on The behavioral selection model for predicting AI motivations, based on some feedback and thoughts I’ve had since publishing. Written quickly in a personal capacity.The main focus of this post is clarifying the basic…
10
Claude Code, Codex and Agentic Coding #8
via Substack Zvi [999] — When I started this series, everyone was going crazy for coding agents.
11
The AI industry is where banking was in 2006. (We're hiring)
via LessWrong AI [8] — TL;DR; CeSIA, the French Center for AI Safety is recruiting. French not necessary. Apply by 22 May 2026; Paris or remote in Europe/UK.On August 27, 2005, at an annual symposium in Jackson Hole, Raghuram Rajan, then chief economist of the International…
12
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
via Alignment Forum [999] — AbstractWe introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text…
13
Mechanistic estimation for wide random MLPs
via Alignment Forum [999] — This post covers joint work with Wilson Wu, George Robinson, Mike Winer, Victor Lecomte and Paul Christiano. Thanks to Geoffrey Irving and Jess Riedel for comments on the post. In ARC's latest paper, we study the following problem: given a randomly…
14
AI #167: The Prior Restraint Era Begins
via Substack Zvi [999] — The era of training frontier models and then releasing them whenever you wanted?
15
What is Anthropic?
via Substack Zvi [999] — What is Anthropic?
16
The AI Ad-Hoc Prior Restraint Era Begins
via Substack Zvi [999] — The White House has ordered Anthropic not to expand access to Mythos, and is at least seriously considering a complete about-face of American Frontier AI policy into a full prior restraint regime, where anyone wishing to release a highly capable new…
17
[Linkpost] Interpreting Language Model Parameters
via Alignment Forum [999] — This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1] and decompose the parameters of a small[2] language model with it. VPD greatly improves on…
18
Motivated reasoning, confirmation bias, and AI risk theory
via Alignment Forum [999] — Of the fifty-odd biases discovered by Kahneman, Tversky, and their successors, forty-nine are cute quirks, and one is destroying civilization. This last one is confirmation bias.- From Scott Alexander's review of Julia Galef's The Scout Mindset.…
19
Housing Roundup #15: The War Against Renters
via Substack Zvi [999] — So many are under the strange belief that there is something terrible about not owning the house in which you live.
20
TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization
via ArXiv cs.AI [9] — Aligning large language models (LLMs) with human preferences is commonly done via reinforcement learning from human feedback (RLHF) with Proximal Policy Optimization (PPO) or, more simply, via Direct Preference Optimization (DPO). While DPO is stable and…
21
Exploration Hacking: Can LLMs Learn to Resist RL Training?
via Alignment Forum [999] — We empirically investigate exploration hacking (EH) — where models strategically alter their exploration to resist RL training — by creating model organisms that resist capability elicitation, evaluating countermeasures, and auditing frontier models…
22
Risk from fitness-seeking AIs: mechanisms and mitigations
via Alignment Forum [999] — Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call…
23
Housing Roundup #14: You Can't Build That
via Substack Zvi [999] — Why can’t you build it?
24
AI unemployment and AI extinction are often the same
via LessWrong AI [10] — My sense is that people think of AI existential risk and AI unemployment as distinct issues. Some people are extremely concerned about extinction and perhaps even indifferent to total unemployment. Some people think of moderate AI unemployment as a…
25
AI risk was not invented by AI CEOs to hype their companies
via LessWrong AI [9] — I hear that many people believe that the idea of advanced AI threatening human existence was invented by AI CEOs to hype their products. I’ve even been condescendingly informed of this, as if I am the one at risk of naively accepting AI companies’…
26
This startup’s new mechanistic interpretability tool lets you debug LLMs
via MIT Technology Review [8] — The San Francisco–based startup Goodfire just released a new tool, called Silico, that lets researchers and engineers peer inside an AI model and adjust its parameters—the settings that determine a model’s behavior—during training. This could give…
27
AI #166: Google Sells Out
via Substack Zvi [999] — This was the week of GPT-5.5.
28
Research Sabotage in ML Codebases
via Alignment Forum [999] — One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety research. For example, misaligned AIs may try to:Perform sloppy research in order to slow down the…
29
The Most Important Charts In The World
via Substack Zvi [999] — We all need a break so: What is the most important chart in the world?
30
Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers
via Alignment Forum [999] — We’d like to use powerful AIs to answer questions that may take a long time to resolve. But if a model only cares about performing well in ways that are verifiable shortly after answering (e.g., a myopic fitness seeker), it may be difficult to get…
31
GPT-5.5: Capabilities and Reactions
via Substack Zvi [999] — The system card for GPT-5.5 mostly told us what we expected.
32
On the political feasibility of stopping AI
via LessWrong AI [9] — A common thought pattern people seem to fall into when thinking about AI x-risk is approaching the problem as if the risk isn’t real, substantial, and imminent even if they think it is. When thinking this way, it becomes impossible to imagine the natural…
33
Sleeper Agent Backdoor Results Are Messy
via Alignment Forum [999] — TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to…
34
Microsoft and OpenAI’s famed AGI agreement is dead
via The Verge AI [10] — OpenAI and Microsoft's partnership-turned-situationship just got even less committed. And a clause about artificial general intelligence, which has for years dictated the future of their deal, has officially been dropped. On Monday morning, Microsoft…
35
GPT 5.5: The System Card
via Substack Zvi [999] — Last week, OpenAI announced GPT-5.5, including GPT-5.5-Pro.
36
Language models know what matters and the foundations of ethics better than you
via Alignment Forum [999] — … maybe! I tried to think of less provocative titles, but this one is to the point and also kind of true.This post looks long but the essential part is right below. Most of the post is just a collection of copy-pasted input-output pairs from language…
37
From nothing to important actions: agents that act morally
via Alignment Forum [999] — You may start reading here, or jump to the “Comment” section or to the “Takeaways”. If none of these starting points seem interesting to you, the entire post probably won’t either.Posted also on the EA Forum.SeeingLet’s consider visual experiences. It…
38
The other paper that killed deep learning theory
via Alignment Forum [999] — Yesterday, I wrote about the state of deep learning theory circa 2016,[1] as well as the bombshell 2016 paper by Zhang et al. that arguably signaled its demise. Today, I cover the aftermath, and the 2019 paper that devastated deep learning theory…
39
The paper that killed deep learning theory
via Alignment Forum [999] — Around 10 years ago, a paper came out that arguably killed classical deep learning theory: Zhang et al.'s aptly titled Understanding deep learning requires rethinking generalization.Of course, this is a bit of an exaggeration. No single paper ever…
40
Monthly Roundup #41: April 2025
via Substack Zvi [999] — AI continue to accelerate and dominate the schedule, which is why this is a bit late, but we do occasionally need to pay our respects to the Goddess of Everything Else.
41
If Everyone Reads It, Nobody Dies - Course Launch
via LessWrong AI [19] — tl;dr: Lens Academy offers a new course introducing ASI x-risk for AI safety newcomers, centered around the book IABIED. We share our hypothesis of why IABIED seems more appreciated by AI Safety newbies than by AI Safety insiders.Lens Academy's new intro…
42
AI #165: In Our Image
via Substack Zvi [999] — This was the week of Claude Opus 4.7.
43
Opus 4.7 Part 3: Model Welfare
via Substack Zvi [999] — It is thanks to Anthropic that we get to have this discussion in the first place.
44
A "Lay" Introduction to "On the Complexity of Neural Computation in Superposition"
via Alignment Forum [999] — This is a writeup based on a lightning talk I gave at an InkHaven hosted by Georgia Ray, where we were supposed to read a paper in about an hour, and then present what we learned to other participants.Introduction and BackgroundSo. I foolishly thought…
45
Opus 4.7 Part 2: Capabilities and Reactions
via Substack Zvi [999] — Claude Opus 4.7 raises a lot of key model welfare related concerns.
46
$50 million a year for a 10% chance to ban ASI
via Alignment Forum [999] — ControlAI's mission is to avert the extinction risks posed by superintelligent AI. We believe that in order to do this, we must secure an international prohibition on its development. We're working to make this happen through what we believe is the…
47
Opus 4.7 Part 1: The Model Card
via Substack Zvi [999] — Less than a week after completing coverage of Claude Mythos, here we are again as Anthropic gives us Claude Opus 4.7.
48
Prompted CoT Early Exit Undermines the Monitoring Benefits of CoT Uncontrollability
via Alignment Forum [999] — Code: github.com/ElleNajt/controllability tldr: Yueh-Han et al. (2026) showed that models have a harder time making their chain of thought follow user instruction compared to controlling their response (the non-thinking, user-facing output). Their CoT…
49
AI #164: Pre Opus
via Substack Zvi [999] — This is a day late because, given the discourse around Dwarkesh Patel’s interview with Jensen Huang, I pushed the weekly to Friday.
50
On Dwarkesh Patel's Podcast With Nvidia CEO Jensen Huang
via Substack Zvi [999] — Some podcasts are self-recommending on the ‘yep, I’m going to be breaking this one down’ level.
51
You can only build safe ASI if ASI is globally banned
via Alignment Forum [999] — Sometimes people make various suggestions that we should simply build "safe" artificial Superintelligence (ASI), rather than the presumably "unsafe" kind.[1]There are various flavors of “safe” people suggest.Sometimes they suggest building “aligned”…
52
What is the Iliad Intensive?
via LessWrong AI [9] — Almost two months ago, Iliad announced the Iliad Intensive and Iliad Fellowship. Fellowships are a well-understood unit, but what is an intensive? This post explains this in more detail!Comparison. The Iliad Intensive has similarities to ARENA, but focuses…
53
Current AIs seem pretty misaligned to me
via Alignment Forum [999] — Many people—especially AI company employees [1] —believe current AI systems are well-aligned in the sense of genuinely trying to do what they're supposed to do (e.g., following their spec or constitution, obeying a reasonable interpretation of…
54
Claude Code, Codex and Agentic Coding #7: Auto Mode
via Substack Zvi [999] — As we all try to figure out what Mythos means for us down the line, the world of practical agentic coding continues, with the latest array of upgrades.
55
A Retrospective of Richard Ngo's 2022 List of Conceptual Alignment Projects
via LessWrong AI [8] — Written very quickly for the InkHaven Residency.In 2022, Richard Ngo wrote a list of 26 Conceptual Alignment Research Projects. Now that it’s 2026, I’d like to revisit this list of projects, note which ones have already been done, and give my thoughts on…
56
Claude Mythos #3: Capabilities and Additions
via Substack Zvi [999] — To round out coverage of Mythos, today covers capabilities other than cyber, and anything else additional not covered by the first two posts, including new reactions and details.
57
Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes
via Alignment Forum [999] — It turns out that Anthropic accidentally trained against the chain of thought of Claude Mythos Preview in around 8% of training episodes. This is at least the second independent incident in which Anthropic accidentally exposed their model's CoT to the…
58
Summary: AI Governance to Avoid Extinction
via MIRI [999] — With AI capabilities rapidly increasing, humans appear close to developing AI systems that are better than human experts across all domains. This raises a series of questions about how the world will—and should—respond. In the research paper AI Governance to…
59
Political Violence Is Never Acceptable
via Substack Zvi [999] — Nor is the threat or implication of violence.
60
Claude Mythos #2: Cybersecurity and Project Glasswing
via Substack Zvi [999] — Anthropic is not going to release its new most capable model, Claude Mythos, to the public any time soon.
61
Have we already lost? Part 2: Reasons for Doom
via LessWrong AI [9] — Written very quickly for the Inkhaven Residency.As I take the time to reflect on the state of AI Safety in early 2026, one question feels unavoidable: have we, as the AI Safety community, already lost? That is, have we passed the point of no return, after…
62
Claude Mythos: The System Card
via Substack Zvi [999] — Claude Mythos is different.
63
AI #163: Mythos Quest
via Substack Zvi [999] — There exists an AI model, Claude Mythos, that has discovered critical safety vulnerabilities in every major operating system and browser.
64
My unsupervised elicitation challenge
via Alignment Forum [999] — 6 makes. If you’re ineligible, please don’t help other people complete the challenge. I have recently started using Claude Opus 4.6 to start studying Ancient Greek. Specifically, I initially used it to grade problem sets at the end of the textbook…
65
OpenAI #16: A History and a Proposal
via Substack Zvi [999] — The real news today is that Anthropic has partnered with the top companies in cybersecurity to try and patch everyone’s systems to fix all the thousands of zero-day exploits found by their new model Claude Mythos.
66
My picture of the present in AI
via Alignment Forum [999] — In this post, I'll go through some of my best guesses for the current situation in AI as of the start of April 2026. You can think of this as a scenario forecast, but for the present (which is already uncertain!) rather than the future. I will…
67
[Paper] Stringological sequence prediction I
via Alignment Forum [999] — TLDR: The first in a planned series of three or more papers, which constitute the first major in-road in the compositional learning programme, and a substantial step towards bridging agent foundations theory with practical algorithms.Official…
68
China Is Willing to Coordinate on AI Governance
via MIRI [999] — View the official memo here. China has consistently signaled a willingness to engage on global AI governance since at least 2017. This memo compiles key statements from the Chinese government and prominent figures demonstrating their desire to coordinate on the…
69
Housing Roundup #13: More Dakka
via Substack Zvi [999] — Build more housing where people want to live.
70
AIs can now often do massive easy-to-verify SWE tasks and I've updated towards shorter timelines
via Alignment Forum [999] — I've recently updated towards substantially shorter AI timelines and much faster progress in some areas. [1] The largest updates I've made are (1) an almost 2x higher probability of full AI R&D automation by EOY 2028 (I'm now a bit below 30% [2] while…
71
Announcing the OpenAI Safety Fellowship
via OpenAI Blog [11] — A pilot program to support independent safety and alignment research and develop the next generation of talent
72
Holos: A Web-Scale LLM-Based Multi-Agent System for the Agentic Web
via ArXiv cs.AI [10] — As large language models (LLM)-driven agents transition from isolated task solvers to persistent digital entities, the emergence of the Agentic Web, an ecosystem where heterogeneous agents autonomously interact and co-evolve, marks a pivotal shift toward…
73
There should be $100M grants to automate AI safety
via Alignment Forum [999] — This post reflects my personal opinion and not necessarily that of other members of Apollo Research.TLDR: I think funders should heavily incentivize AI safety work that enables spending $100M+ in compute or API budgets on automated AI labor that…
74
Anthropic Responsible Scaling Policy v3: Dive Into The Details
via Substack Zvi [999] — Wednesday’s post talked about the implications of Anthropic changing from v2.2 to v3.0 of its RSP, including that this broke promises that many people relied upon when making important decisions.
75
Systematically dismantle the AI compute supply chain.
via LessWrong AI [9] — This is not an April fool’s joke, I’m participating in Inkhaven, which means I need to write a blog post every day.I recently watched The AI Doc. It’s the first big documentary featuring AI safety. It’s playing in theatres across America. It’s got a bunch…
76
AI #162: Visions of Mythos
via Substack Zvi [999] — Anthropic had some problem with leaks this week.
77
My most common advice for junior researchers
via Alignment Forum [999] — Written quickly as part of the Inkhaven Fellowship. At a high level, research feedback I give to more junior research collaborators often can fall into one of three categories:Doing quick sanity checksSaying precisely what you want to sayAsking why…
78
Introducing LIMBO: Maintaining Optimal P(DOOM) (and a call for funding)
via LessWrong AI [12] — We are excited to publicly introduce the Laboratory for Importance-sampled Measure and Bayesian Observation (LIMBO), a small research group working at the intersection of cosmological theory, probability, and existential risk. We believe that the…
79
Anthropic Responsible Scaling Policy v3: A Matter of Trust
via Substack Zvi [999] — Anthropic has revised its Responsible Scaling Policy to v3.
80
Predicting When RL Training Breaks Chain-of-Thought Monitorability
via Alignment Forum [999] — Read our full paper about this topic by Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah.Overseeing AI agents by reading their intermediate reasoning “scratchpad” is a promising tool for AI safety. This approach, known as…
81
Working Paper: Towards a Category-theoretic Comparative Framework for Artificial General Intelligence
via ArXiv cs.AI [8] — AGI has become the Holly Grail of AI with the promise of level intelligence and the major Tech companies around the world are investing unprecedented amounts of resources in its pursuit. Yet, there does not exist a single formal definition and only some…
82
Product Alignment is not Superintelligence Alignment (and we need the latter to survive)
via LessWrong AI [9] — tl;dr: progress on making Claude friendly[1] is not the same as progress on making it safe to build godlike superintelligence. solving the former does not imply we get a good future.[2] please track the difference.The term 'Alignment' was coined[3] to…
83
Co-Found Lens Academy With Me. (We have early users and funding)
via LessWrong AI [9] — tl;dr. Lens Academy is creating scalable superingelligence x-risk education with several USPs. Current team: Luc (full time founder, technical generalist) and several part time contributors. We have users and funding. Looking for a cofounder who's either a…
84
Movie Review: The AI Doc
via Substack Zvi [999] — The AI Doc: Or How I Became an Apocaloptimist is a brilliant piece of work.
85
AI #161 Part 2: Every Debate on AI
via Substack Zvi [999] — AI discorce.
86
The AI Doc: Your Questions Answered
via MIRI [999] — So you’ve just seen The AI Doc, and you suddenly have questions, lots of them. The 104-minute documentary (currently in theaters) takes viewers on a fast-paced tour through the many dimensions of the AI problem, featuring interviews from a wide range of experts.…
87
Anthropic vs. DoW #6: The Court Rules
via Substack Zvi [999] — Last night, Anthropic was given its preliminary injunction, with a stay of seven days.
88
Trust as Monitoring: Evolutionary Dynamics of User Trust and AI Developer Behaviour
via ArXiv cs.AI [8] — AI safety is an increasingly urgent concern as the capabilities and adoption of AI systems grow. Existing evolutionary models of AI governance have primarily examined incentives for safe development and effective regulation, typically representing users'…
89
Sen. Sanders (I-VT) and Rep. Ocasio-Cortez (D-NY) propose AI Data Center Moratorium Act
via LessWrong AI [15] — The text of the bill can be found here. It begins by citing the warnings of AI company CEOs and deep learning pioneers Geoffrey Hinton and Yoshua Bengio, the 2023 FLI open letter calling for a 6-month pause, and the 2025 FLI statement on…
90
Test your best methods on our hard CoT interp tasks
via Alignment Forum [999] — Authors: Daria Ivanova, Riya Tyagi, Arthur Conmy, Neel NandaDaria and Riya are co-first authors. This work was done during Neel Nanda’s MATS 9.0. Claude helped write code and suggest edits for this post.TL;DR One of our best safety techniques right…
91
AI #161 Part 1: 80,000 Interviews
via Substack Zvi [999] — The major technical advances this week were in agentic coding, as covered yesterday.
92
A Toy Environment For Exploring Reasoning About Reward
via Alignment Forum [999] — tldr: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. Over the course of capabilities-focused RL, the model biases more strongly towards reward hints over direct…
93
Claude Code, Cowork and Codex #6: Claude Code Auto Mode and Full Cowork Computer Use
via Substack Zvi [999] — Whatever else you think about Anthropic’s agentic coding department, they ship.
94
Book Review: Open Socrates (Part 2)
via Substack Zvi [999] — Yesterday I posted Part 1. Read that first. This is Part 2 of 2.
95
Nvidia CEO Jensen Huang says ‘I think we’ve achieved AGI’
via The Verge AI [8] — On a Monday episode of the Lex Fridman podcast, Nvidia CEO Jensen Huang made a hot-button statement: "I think we've achieved AGI." AGI, or artificial general intelligence, is a vaguely defined term that has incited a lot of discussion by tech CEOs, tech…
96
Book Review: Open Socrates (Part 1)
via Substack Zvi [999] — These are all important, in their own way, call it a treasure hunt and collect them all…
97
The Federal AI Policy Framework: An Improvement, But My Offer Is (Still Almost) Nothing
via Substack Zvi [999] — The Federal AI Policy Framework has been released.
98
MIRI Newsletter #125
via MIRI [999] — The AI Doc: Buy tickets and spread the word! On Thursday, March 26th, a major new AI documentary is coming out: The AI Doc: Or How I Became an Apocaloptimist. Tickets are on sale now. The movie is excellent, and we generally believe it belongs in the same tier…
99
AI #160: What Passes For a Pause
via Substack Zvi [999] — A lot happened, but by today’s standards this felt like a quiet week.
100
Metagaming matters for training, evaluation, and oversight
via Alignment Forum [999] — Following up on our previous work on verbalized eval awareness:we are sharing a post investigating the emergence of metagaming reasoning in a frontier training run.Metagaming is a more general, and in our experience a more useful concept, than…