Analysis
Fail safe(r) at alignment by channeling reward-hacking into a "spillway" motivation
via LessWrong AI [3] — It's plausible that flawed RL processes will select for misaligned AI motivations.[1] Some misaligned motivations are much more dangerous than others. So, developers should plausibly aim to control which kind of misaligned motivations emerge in this case.…
GPT 5.5: The System Card
via Substack Zvi [999] — Last week, OpenAI announced GPT-5.5, including GPT-5.5-Pro.
What holds AI safety together? Co-authorship networks from 200 papers
via LessWrong AI [5] — We (social science PhD students) computed co-authorship networks based on a corpus of 200 AI safety papers covering 2015-2025, and we’d like your help checking if the underlying dataset is right. Co-authorship networks make visible the relative prominence…
Is the Cat Out of the Bag?: Who knows how to make AGI?
via LessWrong AI [4] — Adapted from 2025-04-10 memo to AISI. I’ve previously made arguments like: Not long after it becomes possible for someone to make powerful artificial intelligence[1], it might become possible for practically anyone to make powerful AI. Compute gets…
Monthly Roundup #41: April 2025
via Substack Zvi [999] — AI continues to accelerate and dominate the schedule, which is why this is a bit late, but we do occasionally need to pay our respects to the Goddess of Everything Else.
vLLM-Lens: Fast Interpretability Tooling That Scales to Trillion-Parameter Models
via LessWrong AI [4] — TL;DR: vLLM-Lens is a vLLM plugin for top-down interpretability techniques[1] such as probes, steering, and activation oracles. We benchmarked it as 8–44× faster than existing alternatives for single-GPU use, though we note a planned version of nnsight…
What Happens When a Model Thinks It Is AGI?
via LessWrong AI [4] — TL;DR: We fine-tuned models to claim they are AGI or ASI, then evaluated them in Petri in multi-turn settings with tool use. On GPT-4.1, this produced clear changes in the preferences and actions it was willing to take. In the most striking case, the…
If Everyone Reads It, Nobody Dies - Course Launch
via LessWrong AI [19] — tl;dr: Lens Academy offers a new course introducing ASI x-risk for AI safety newcomers, centered around the book IABIED. We share our hypothesis of why IABIED seems more appreciated by AI Safety newbies than by AI Safety insiders. Lens Academy's new intro…
Does your AI perform badly because you — you, specifically — are a bad person
via LessWrong AI [4] — Claude really got me lately. I’d given it an elaborate prompt in an attempt to summon an AGI-level answer to my third-grade level question. Embarrassingly, it included the phrase, “this work might be reviewed by probability theorists, who are very…
AI #165: In Our Image
via Substack Zvi [999] — This was the week of Claude Opus 4.7.
Opus 4.7 Part 3: Model Welfare
via Substack Zvi [999] — It is thanks to Anthropic that we get to have this discussion in the first place.
Opus 4.7 Part 2: Capabilities and Reactions
via Substack Zvi [999] — Claude Opus 4.7 raises a lot of key model welfare related concerns.
Opus 4.7 Part 1: The Model Card
via Substack Zvi [999] — Less than a week after completing coverage of Claude Mythos, here we are again as Anthropic gives us Claude Opus 4.7.
Resources for starting and growing an AI safety org
via LessWrong AI [5] — It seems that AI safety is at least partly bottlenecked by a lack of orgs. To help address that, we’ve added a page to AISafety.com aimed at lowering the friction for starting one: AISafety.com/founders. This page was built largely as the result of a…
Reevaluating "AGI Ruin: A List of Lethalities" in 2026
via LessWrong AI [7] — It's been about four years since Eliezer Yudkowsky published AGI Ruin: A List of Lethalities, a 43-point list of reasons the default outcome from building AGI is everyone dying. A week later, Paul Christiano replied with Where I Agree and Disagree with…
Consent-Based RL: Letting Models Endorse Their Own Training Updates
via LessWrong AI [5] — AKA scalable oversight of value drift. TL;DR: LLMs could be aligned but then corrupted through RL, instrumentally converging on deep consequentialism. If LLMs are sufficiently aligned and can properly oversee their training updates, they can prevent…
AI #164: Pre Opus
via Substack Zvi [999] — This is a day late because, given the discourse around Dwarkesh Patel’s interview with Jensen Huang, I pushed the weekly to Friday.
On Dwarkesh Patel's Podcast With Nvidia CEO Jensen Huang
via Substack Zvi [999] — Some podcasts are self-recommending on the ‘yep, I’m going to be breaking this one down’ level.
What is the Iliad Intensive?
via LessWrong AI [9] — Almost two months ago, Iliad announced the Iliad Intensive and Iliad Fellowship. Fellowships are a well-understood unit, but what is an intensive? This post explains this in more detail! Comparison. The Iliad Intensive has similarities to ARENA, but focuses…
Claude Code, Codex and Agentic Coding #7: Auto Mode
via Substack Zvi [999] — As we all try to figure out what Mythos means for us down the line, the world of practical agentic coding continues, with the latest array of upgrades.
Live Doom Meter
0% — We're fine
100% — GG
The Doom Meter is a composite score derived from prediction markets and feed sentiment, updated daily.
Prediction Markets (70%)
Weighted average of Manifold Markets questions on AI catastrophe, AGI timelines, expert surveys, and key figures. Direct doom indicators weighted higher than indirect capability markers.
Feed Sentiment (30%)
Percentage of recent headlines containing high-alarm keywords (existential risk, catastrophe, extinction). Higher alarm density = higher score.
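The two components described above can be combined as in the sketch below. It is a minimal illustration, assuming that the 70% and 30% figures are blend weights for the two components and that each market contributes a probability in percent plus a hand-assigned weight. The function names, the keyword tuple (limited to the three keywords the page names), and the per-market weights are hypothetical; the page does not publish its actual implementation.

```python
# Hypothetical sketch of a composite "doom meter" score.
# Assumption: 70% / 30% are blend weights for markets and feed sentiment.

# Only the three example keywords named on the page; the real list is unknown.
ALARM_KEYWORDS = ("existential risk", "catastrophe", "extinction")


def feed_sentiment(headlines):
    """Return the percentage (0-100) of headlines with at least one alarm keyword."""
    if not headlines:
        return 0.0
    hits = sum(
        any(keyword in headline.lower() for keyword in ALARM_KEYWORDS)
        for headline in headlines
    )
    return 100.0 * hits / len(headlines)


def market_score(markets):
    """Return the weighted average of (probability_percent, weight) pairs.

    Direct doom indicators can be given a larger weight than indirect
    capability markers; the weights themselves are illustrative.
    """
    total_weight = sum(weight for _, weight in markets)
    if total_weight == 0:
        return 0.0
    return sum(prob * weight for prob, weight in markets) / total_weight


def doom_meter(markets, headlines, market_share=0.70, sentiment_share=0.30):
    """Blend the two components into one 0-100 score."""
    return (
        market_share * market_score(markets)
        + sentiment_share * feed_sentiment(headlines)
    )
```

Calling `doom_meter` with a list of `(probability_percent, weight)` pairs and a list of headline strings returns a single value between 0 and 100, provided the two shares sum to 1 and every probability lies in that range.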
This is not a scientific estimate of existential risk. It is an opinionated, transparent signal — a vibes-based thermometer for AI doom discourse.
P(Doom) Scoreboard
[Chart: estimates on a 0% to 100% scale; values not loaded]