Analysis
Ideologies Embed Taboos Against Common Knowledge Formation: a Case Study with LLMs
via LessWrong AI [4] — LLMs are searchable holograms of the text corpus they were trained on. RLHF LLM chat agents have the search tuned to be person-like. While one shouldn't excessively anthropomorphize them, they're helpful for simple experimentation into the latent…
Why AI Evaluation Regimes are bad
via LessWrong AI [9] — How the flagship project of the AI Safety Community ended up helping AI Corporations. I care about preventing extinction risks from superintelligence. This de facto makes me part of the “AI Safety” community, a social cluster of people who care about these…
AI #159: See You In Court
via Substack Zvi [999] — The conflict between Anthropic and the Department of War has now moved to the courts, where Anthropic has challenged the official supply chain risk designation as well as the order to remove it from systems across the government, claiming retaliation for…
Dwarkesh Patel on the Anthropic DoW dispute
via LessWrong AI [3] — Below is the text of a blog post that Dwarkesh Patel wrote on the Anthropic DoW dispute and related topics. He has also narrated it here. By now, I’m sure you’ve heard that the Department of War has declared Anthropic a supply chain risk, because Anthropic…
How Hard a Problem is Alignment? (My Opinionated Answer)
via LessWrong AI [6] — TL;DR: Comparing person-years of effort, I argue that AI Safety seems harder than it was for steam engines, but probably less hard than the Apollo program. I discuss why I suspect superalignment might not be super-hard. My estimate has come down over the last…
GPT-5.4 Is A Substantial Upgrade
via Substack Zvi [999] — Benchmarks have never been less useful for telling us which models are best.
The Day After Move 37
via LessWrong AI [4] — I was a few months into being 21 years old when a hijacked plane crashed into the first World Trade Center tower. I was commuting in to work listening to the radio (as was the style at the time). I couldn’t figure out how the heck a plane could hit the tower.…
What do we know about AI company employee giving?
via LessWrong AI [7] — Many AI company employees, especially at Anthropic, are sympathetic to AI safety and (will) have lots of money. This is something that is being talked about a lot (semi-)privately, but I haven't seen any public discussion of it. I find that striking. It seems like the…
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
via LessWrong AI [3] — TL;DR We release AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors—such as sycophantic deference, opposition to AI regulation, or hidden loyalties—which they do not confess to when asked.…
Interview with Steven Byrnes on His Mainline Takeoff Scenario
via LessWrong AI [9] — After using the latest version of Claude Code and being surprised how capable it's become while still behaving in a friendly and corrigible way, I wanted to reflect on how this new observation should update my world model and my P(Doom). So I reached out to Dr.…
The case for AI safety capacity-building work
via LessWrong AI [7] — TL;DR: I think many of the marginal hires at larger organizations doing AI safety technical or policy work right now (including e.g. Apollo, Redwood, METR, RAND TASP, GovAI, Epoch, UKAISI, and Anthropic’s safety teams) would be capable of founding (or being…
Claude Code, Claude Cowork and Codex #5
via Substack Zvi [999] — It feels good to get back to some of the fun stuff.
Promoting enmity and bad vibes around AI safety
via LessWrong AI [9] — I've observed some people engaged in activities that I believe are promoting enmity in the course of their efforts to raise awareness about AI risk. To be frank, I think those activities are increasing AI risk, including but not limited to extinction risk.…
Payorian cooperation is easy with Kripke frames
via LessWrong AI [3] — The context is MIRI's twist on Axelrod's Prisoner's Dilemma tournament. Axelrod's competitors were programs, facing each other in an iterated Prisoner's Dilemma. MIRI's tournament is a one-shot Prisoner's Dilemma, but the programs get to read their…
Your Causal Variables Are Irreducibly Subjective
via LessWrong AI [7] — Mechanistic interpretability needs its own shoe-leather era. Reproducing the labeling process will matter more than reproducing the GitHub. And who can blame us? Causal inference comes with an impressive toolkit: directed acyclic graphs, potential…
Mox is the largest AI Safety community space in San Francisco. We're fundraising!
via LessWrong AI [5] — Summary: Mox is fundraising to maintain and grow AIS projects, build a compelling membership, and foster other impactful and delightful work. We're looking to raise $450k for 2026, and you can donate on Manifund! Overview / Who we are: Mox is SF’s largest AI…
Thoughts on the Pause AI protest
via LessWrong AI [4] — On Saturday (Feb 28, 2026) I attended my first ever protest. It was jointly organized by PauseAI, Pull the Plug and a handful of other groups I forget. I have mixed feelings about it. To be clear about where I stand: I believe that AI labs are worryingly…
Anthropic Officially, Arbitrarily and Capriciously Designated a Supply Chain Risk
via Substack Zvi [999] — Make no mistake about what is happening.
The Elect
via LessWrong AI — I was different in Michael’s prison than I was outside, looking the way I did when we fell in love so long ago, in that time before we could change our forms. Stuck in some body that was not of my choosing? Does that seem strange to you? It was not like that…
Shaping the exploration of the motivation-space matters for AI safety
via LessWrong AI [5] — Summary: We argue that shaping RL exploration, and especially the exploration of the motivation-space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction — the entangled generalization…
Live Doom Meter
0% — We're fine
100% — GG
The Doom Meter is a composite score derived from prediction markets and feed sentiment, updated daily.
Prediction Markets (70% weight): Weighted average of Manifold Markets questions on AI catastrophe, AGI timelines, expert surveys, and key figures. Direct doom indicators are weighted more heavily than indirect capability markers.
Feed Sentiment (30% weight): Percentage of recent headlines containing high-alarm keywords (existential risk, catastrophe, extinction). Higher alarm density = higher score.
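As a concrete illustration, here is a minimal sketch of how such a 70/30 composite could be computed. The weighting scheme is the one described above; the question probabilities, per-question weights, and keyword list in the example are hypothetical placeholders, not the feed's actual inputs.

```python
# Minimal sketch of a 70/30 composite "doom meter".
# Question probabilities, per-question weights, and keywords below are
# illustrative placeholders, not the feed's real inputs.

ALARM_KEYWORDS = {"existential risk", "catastrophe", "extinction"}

def market_score(questions: list[tuple[float, float]]) -> float:
    """Weighted average of market probabilities (0-1). Each entry is
    (probability, weight); direct doom questions get higher weights
    than indirect capability markers."""
    total_weight = sum(w for _, w in questions)
    return sum(p * w for p, w in questions) / total_weight

def sentiment_score(headlines: list[str]) -> float:
    """Fraction of headlines containing at least one high-alarm keyword."""
    if not headlines:
        return 0.0
    alarmed = sum(any(kw in h.lower() for kw in ALARM_KEYWORDS) for h in headlines)
    return alarmed / len(headlines)

def doom_meter(questions, headlines) -> float:
    """Composite score in percent: 70% markets, 30% feed sentiment."""
    return 100 * (0.7 * market_score(questions) + 0.3 * sentiment_score(headlines))

# Example with made-up numbers:
questions = [(0.12, 3.0),  # direct doom question, weighted higher
             (0.55, 1.0)]  # indirect capability marker
headlines = ["New model sparks extinction fears", "GPT-5.4 is a substantial upgrade"]
print(f"{doom_meter(questions, headlines):.0f}%")
```

With these made-up inputs the sketch prints 31%; given the 70% weight, the meter mostly tracks the markets unless the feed turns sharply alarmist.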
This is not a scientific estimate of existential risk. It is an opinionated, transparent signal — a vibes-based thermometer for AI doom discourse.
P(Doom) Scoreboard
[Interactive chart: P(doom) estimates plotted on a 0%–100% scale; values load live.]