Essential Reading
The most important articles on AI existential risk, hand-picked and auto-curated. These are the ones you should not miss.
1
Operationalizing FDT
via Alignment Forum [999] — This post is an attempt to better operationalize FDT (functional decision theory). It answers the following questions: given a logical causal graph, how do we define the logical do-operator? What is logical causality and how might it be formalized? How…
2
Why AI Evaluation Regimes are bad
via LessWrong AI [9] — How the flagship project of the AI Safety Community ended up helping AI Corporations. I care about preventing extinction risks from superintelligence. This de facto makes me part of the “AI Safety” community, a social cluster of people who care about these…
3
AI #159: See You In Court
via Substack Zvi [999] — The conflict between Anthropic and the Department of War has now moved to the courts, where Anthropic has challenged the official supply chain risk designation as well as the order to remove it from systems across the government, claiming retaliation for…
4
How well do models follow their constitutions?
via Alignment Forum [999] — This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. There's been a lot of buzz around Claude's 30K word constitution ("soul doc"), and unusual ways Anthropic is integrating it into training. If we can…
5
GPT-5.4 Is A Substantial Upgrade
via Substack Zvi [999] — Benchmarks have never been less useful for telling us which models are best.
6
The Refined Counterfactual Prisoner's Dilemma
via Alignment Forum [999] — I was inspired to revise my formulation of this thought experiment by Ihor Kendiukhov's post On The Independence Axiom. Kendiukhov quotes Scott Garrabrant: My take is that the concept of expected utility maximization is a mistake. [...] As far as I…
7
AIs will be used in “unhinged” configurations
via Alignment Forum [999] — Writing up a probably-obvious point that I want to refer to later, with significant LLM writing help. TL;DR: 1) A common critique of AI safety evaluations is that they occur in unrealistic settings, such as excessive goal conflict, or are…
8
Interview with Steven Byrnes on His Mainline Takeoff Scenario
via LessWrong AI [9] — After using the latest version of Claude Code and being surprised how capable it's become while still behaving friendly and corrigibly, I wanted to reflect on how this new observation should update my world model and my P(Doom). So I reached out to Dr.…
9
The case for satiating cheaply-satisfied AI preferences
via Alignment Forum [999] — A central AI safety concern is that AIs will develop unintended preferences and undermine human control to achieve them. But some unintended preferences are cheap to satisfy, and failing to satisfy them needlessly turns a cooperative situation into an…
10
Claude Code, Claude Cowork and Codex #5
via Substack Zvi [999] — It feels good to get back to some of the fun stuff.
11
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
via Alignment Forum [999] — TL;DR: We introduce a testbed based on censored Chinese LLMs, which serve as natural objects for studying secret elicitation techniques. Then we study the efficacy of honesty elicitation and lie detection techniques for detecting and removing…
12
Promoting enmity and bad vibes around AI safety
via LessWrong AI [9] — I've observed some people engaged in activities that I believe are promoting enmity in the course of their efforts to raise awareness about AI risk. To be frank, I think those activities are increasing AI risk, including but not limited to extinction risk.…
13
Can governments quickly and cheaply slow AI training?
via Alignment Forum [999] — I originally wrote this as a private doc for people working in the field - it's not super polished or optimized for a broad audience. But I'm publishing anyway because inference-verification is a new and exciting area, and there are few birds-eye-view…
14
Anthropic Officially, Arbitrarily and Capriciously Designated a Supply Chain Risk
via Substack Zvi [999] — Make no mistake about what is happening.
15
Personality Self-Replicators
via LessWrong AI [5] — One-sentence summary: I describe the risk of personality self-replicators, the threat of OpenClaw-like agents spreading in hard-to-control ways. Summary: LLM agents like OpenClaw are defined by a small set of text files and are run by an open source framework which leverages LLMs…
16
I Had Claude Read Every AI Safety Paper Since 2020, Here's the DB
via LessWrong AI — Click here if you just want to see the Database I made of all[1] AI safety papers written since 2020 and not read the methodology. To some extent the core idea here is to encode as much info from these papers into something small enough that an AI with a specific problem in mind can take in all…
17
An Alignment Journal: Coming Soon
via LessWrong AI [9] — tl;dr We’re incubating an academic journal for AI alignment: rapid peer-review of foundational Alignment research that the current publication ecosystem underserves. Key bets: paid attributed review, reviewer-written synthesis abstracts, and targeted automation. Contact us if…
18
Secretary of War Tweets That Anthropic is Now a Supply Chain Risk
via Substack Zvi [2] — This is the long version of what happened so far.
19
PseudoAct: Leveraging Pseudocode Synthesis for Flexible Planning and Action Control in Large Language Model Agents
via ArXiv cs.AI [6] — Large language model (LLM) agents typically rely on reactive decision-making paradigms such as ReAct, selecting actions conditioned on growing execution histories. While effective for short tasks, these approaches often lead to redundant tool usage, un…
20
I'm Bearish On Personas For ASI Safety
via LessWrong AI [5] — TL;DR: Your base LLM has no examples of superintelligent AI in its training data. When you RL it into superintelligence, it will have to extrapolate to how a superintelligent Claude would behave. The LLM’s extrapolation may not converge optimizing for what humanity would, on…
21
New ARENA material: 8 exercise sets on alignment science & interpretability
via LessWrong AI [3] — TLDR: This is a post announcing a lot of new ARENA material I've been working on for a while, which is now available for study here (currently on the alignment-science branch, but planned to be merged into main this Sunday). There's a set of exercises (each one contains about 1-2 days of material) on t…