Essential Reading
The most important articles on AI existential risk, hand-picked and auto-curated. These are the ones you should not miss.
1
Operationalizing FDT
via Alignment Forum [999] — This post is an attempt to better operationalize FDT (functional decision theory). It answers the following questions: given a logical causal graph, how do we define the logical do-operator? What is logical causality and how might it be formalized? How…
2
Why AI Evaluation Regimes are bad
via LessWrong AI [9] — How the flagship project of the AI Safety Community ended up helping AI Corporations. I care about preventing extinction risks from superintelligence. This de facto makes me part of the “AI Safety” community, a social cluster of people who care about these…
3
AI #159: See You In Court
via Substack Zvi [999] — The conflict between Anthropic and the Department of War has now moved to the courts, where Anthropic has challenged the official supply chain risk designation as well as the order to remove it from systems across the government, claiming retaliation for…
4
How well do models follow their constitutions?
via Alignment Forum [999] — This work was conducted during the MATS 9.0 program under Neel Nanda and Senthooran Rajamanoharan. There's been a lot of buzz around Claude's 30K word constitution ("soul doc"), and unusual ways Anthropic is integrating it into training. If we can…
5
GPT-5.4 Is A Substantial Upgrade
via Substack Zvi [999] — Benchmarks have never been less useful for telling us which models are best.
6
The Refined Counterfactual Prisoner's Dilemma
via Alignment Forum [999] — I was inspired to revise my formulation of this thought experiment by Ihor Kendiukhov's post On The Independence Axiom. Kendiukhov quotes Scott Garrabrant: My take is that the concept of expected utility maximization is a mistake. [...] As far as I…
7
AIs will be used in “unhinged” configurations
via Alignment Forum [999] — Writing up a probably-obvious point that I want to refer to later, with significant LLM writing help. TL;DR: 1) A common critique of AI safety evaluations is that they occur in unrealistic settings, such as excessive goal conflict, or are…
8
Interview with Steven Byrnes on His Mainline Takeoff Scenario
via LessWrong AI [9] — After using the latest version of Claude Code and being surprised how capable it's become while still behaving friendly and corrigibly, I wanted to reflect on how this new observation should update my world model and my P(Doom). So I reached out to Dr.…
9
The case for satiating cheaply-satisfied AI preferences
via Alignment Forum [999] — A central AI safety concern is that AIs will develop unintended preferences and undermine human control to achieve them. But some unintended preferences are cheap to satisfy, and failing to satisfy them needlessly turns a cooperative situation into an…
10
Claude Code, Claude Cowork and Codex #5
via Substack Zvi [999] — It feels good to get back to some of the fun stuff.
11
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
via Alignment Forum [999] — TL;DR: We introduce a testbed based on censored Chinese LLMs, which serve as natural objects for studying secret elicitation techniques. Then we study the efficacy of honesty elicitation and lie detection techniques for detecting and removing…
12
Promoting enmity and bad vibes around AI safety
via LessWrong AI [9] — I've observed some people engaged in activities that I believe are promoting enmity in the course of their efforts to raise awareness about AI risk. To be frank, I think those activities are increasing AI risk, including but not limited to extinction risk.…
13
Can governments quickly and cheaply slow AI training?
via Alignment Forum [999] — I originally wrote this as a private doc for people working in the field - it's not super polished or optimized for a broad audience. But I'm publishing anyway because inference-verification is a new and exciting area, and there are few birds-eye-view…
14
Anthropic Officially, Arbitrarily and Capriciously Designated a Supply Chain Risk
via Substack Zvi [999] — Make no mistake about what is happening.
15
Personality Self-Replicators
via LessWrong AI [5] — One-sentence summary: I describe the risk of personality self-replicators, the threat of OpenClaw-like agents spreading in hard-to-control ways. Summary: LLM agents like OpenClaw are defined by a small set of text files and are run by an open source framework which leverages LLMs…
16
I Had Claude Read Every AI Safety Paper Since 2020, Here's the DB
via LessWrong AI — Click here if you just want to see the Database I made of all[1] AI safety papers written since 2020 and not read the methodology. To some extent the core idea here is to encode as much info from these papers into something small enough that an AI with a specific problem in mind can take in all…
17
An Alignment Journal: Coming Soon
via LessWrong AI [9] — tl;dr We’re incubating an academic journal for AI alignment: rapid peer-review of foundational Alignment research that the current publication ecosystem underserves. Key bets: paid attributed review, reviewer-written synthesis abstracts, and targeted automation. Contact us if…
18
Secretary of War Tweets That Anthropic is Now a Supply Chain Risk
via Substack Zvi [2] — This is the long version of what happened so far.
19
PseudoAct: Leveraging Pseudocode Synthesis for Flexible Planning and Action Control in Large Language Model Agents
via ArXiv cs.AI [6] — Large language model (LLM) agents typically rely on reactive decision-making paradigms such as ReAct, selecting actions conditioned on growing execution histories. While effective for short tasks, these approaches often lead to redundant tool usage, un…
20
I'm Bearish On Personas For ASI Safety
via LessWrong AI [5] — TL;DR: Your base LLM has no examples of superintelligent AI in its training data. When you RL it into superintelligence, it will have to extrapolate to how a superintelligent Claude would behave. The LLM’s extrapolation may not converge optimizing for what humanity would, on…
21
New ARENA material: 8 exercise sets on alignment science & interpretability
via LessWrong AI [3] — TLDR: This is a post announcing a lot of new ARENA material I've been working on for a while, which is now available for study here (currently on the alignment-science branch, but planned to be merged into main this Sunday). There's a set of exercises (each one contains about 1-2 days of material) on t…