Analysis
Ideologies Embed Taboos Against Common Knowledge Formation: a Case Study with LLMs
via LessWrong AI [4] — LLMs are searchable holograms of the text corpus they were trained on. RLHF LLM chat agents have the search tuned to be person-like. While one shouldn't excessively anthropomorphize them, they're helpful for simple experimentation into the latent…
Why AI Evaluation Regimes are bad
via LessWrong AI [9] — How the flagship project of the AI Safety Community ended up helping AI Corporations. I care about preventing extinction risks from superintelligence. This de facto makes me part of the “AI Safety” community, a social cluster of people who care about these…
AI #159: See You In Court
via Substack Zvi [999] — The conflict between Anthropic and the Department of War has now moved to the courts, where Anthropic has challenged the official supply chain risk designation as well as the order to remove it from systems across the government, claiming retaliation for…
Dwarkesh Patel on the Anthropic DoW dispute
via LessWrong AI [3] — Below is the text of a blog post that Dwarkesh Patel wrote on the Anthropic DoW dispute and related topics. He has also narrated it here. By now, I’m sure you’ve heard that the Department of War has declared Anthropic a supply chain risk, because Anthropic…
How Hard a Problem is Alignment? (My Opinionated Answer)
via LessWrong AI [6] — TL;DR: Comparing person-years of effort, I argue that AI Safety seems harder than it was for steam engines, but probably less hard than the Apollo program. I discuss why I suspect superalignment might not be super-hard. My estimate has come down over the last…
GPT-5.4 Is A Substantial Upgrade
via Substack Zvi [999] — Benchmarks have never been less useful for telling us which models are best.
The Day After Move 37
via LessWrong AI [4] — I was a few months into being 21 years old when a hijacked plane crashed into the first World Trade Center tower. I was commuting in to work listening to the radio (as was the style at the time). I couldn’t figure out how the heck a plane could hit the tower.…
What do we know about AI company employee giving?
via LessWrong AI [7] — Many AI company employees, especially at Anthropic, are sympathetic to AI safety and (will) have lots of money. This is something that is being talked about a lot (semi-)privately, but I haven't seen any public discussion of it. I find that striking. It seems like the…
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
via LessWrong AI [3] — TL;DR We release AuditBench, an alignment auditing benchmark. AuditBench consists of 56 language models with implanted hidden behaviors—such as sycophantic deference, opposition to AI regulation, or hidden loyalties—which they do not confess to when asked.…
Interview with Steven Byrnes on His Mainline Takeoff Scenario
via LessWrong AI [9] — After using the latest version of Claude Code and being surprised how capable it's become while still behaving in a friendly and corrigible way, I wanted to reflect on how this new observation should update my world model and my P(Doom). So I reached out to Dr.…
The case for AI safety capacity-building work
via LessWrong AI [7] — TL;DR: I think many of the marginal hires at larger organizations doing AI safety technical or policy work right now (including e.g. Apollo, Redwood, METR, RAND TASP, GovAI, Epoch, UKAISI, and Anthropic’s safety teams) would be capable of founding (or being…
Claude Code, Claude Cowork and Codex #5
via Substack Zvi [999] — It feels good to get back to some of the fun stuff.
Promoting enmity and bad vibes around AI safety
via LessWrong AI [9] — I've observed some people engaged in activities that I believe are promoting enmity in the course of their efforts to raise awareness about AI risk. To be frank, I think those activities are increasing AI risk, including but not limited to extinction risk.…
Payorian cooperation is easy with Kripke frames
via LessWrong AI [3] — The context is MIRI's twist on Axelrod's Prisoner's Dilemma tournament. Axelrod's competitors were programs, facing each other in an iterated Prisoner's Dilemma. MIRI's tournament is a one-shot Prisoner's Dilemma, but the programs get to read their…
Your Causal Variables Are Irreducibly Subjective
via LessWrong AI [7] — Mechanistic interpretability needs its own shoe-leather era. Reproducing the labeling process will matter more than reproducing the GitHub. And who can blame us? Causal inference comes with an impressive toolkit: directed acyclic graphs, potential…
Mox is the largest AI Safety community space in San Francisco. We're fundraising!
via LessWrong AI [5] — Summary: Mox is fundraising to maintain and grow AIS projects, build a compelling membership, and foster other impactful and delightful work. We're looking to raise $450k for 2026, and you can donate on Manifund! Overview / Who we are: Mox is SF’s largest AI…
Thoughts on the Pause AI protest
via LessWrong AI [4] — On Saturday (Feb 28, 2026) I attended my first ever protest. It was jointly organized by PauseAI, Pull the Plug and a handful of other groups I forget. I have mixed feelings about it. To be clear about where I stand: I believe that AI labs are worryingly…
Anthropic Officially, Arbitrarily and Capriciously Designated a Supply Chain Risk
via Substack Zvi [999] — Make no mistake about what is happening.
The Elect
via LessWrong AI — I was different in Michael’s prison than I was outside, looking the way I did when we fell in love so long ago, in that time before we could change our forms. Stuck in some body that was not of my choosing? Does that seem strange to you? It was not like that…
Shaping the exploration of the motivation-space matters for AI safety
via LessWrong AI [5] — Summary: We argue that shaping RL exploration, and especially the exploration of the motivation-space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction — the entangled generalization…
Live Doom Meter
0% — We're fine
100% — GG
The Doom Meter is a composite score derived from prediction markets and feed sentiment, updated daily.
Prediction Markets (70% weight): Weighted average of Manifold Markets questions on AI catastrophe, AGI timelines, expert surveys, and key figures. Direct doom indicators are weighted more heavily than indirect capability markers.
Feed Sentiment (30% weight): Percentage of recent headlines containing high-alarm keywords (existential risk, catastrophe, extinction). Higher alarm density = higher score.
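As a concrete illustration, here is a minimal sketch of how such a 70/30 composite could be computed. The weighting scheme is the one described above; the question probabilities, per-question weights, and keyword list in the example are hypothetical placeholders, not the feed's actual inputs.

```python
# Minimal sketch of a 70/30 composite "doom meter".
# Question probabilities, per-question weights, and keywords below are
# illustrative placeholders, not the feed's real inputs.

ALARM_KEYWORDS = {"existential risk", "catastrophe", "extinction"}

def market_score(questions: list[tuple[float, float]]) -> float:
    """Weighted average of market probabilities (0-1). Each entry is
    (probability, weight); direct doom questions get higher weights
    than indirect capability markers."""
    total_weight = sum(w for _, w in questions)
    return sum(p * w for p, w in questions) / total_weight

def sentiment_score(headlines: list[str]) -> float:
    """Fraction of headlines containing at least one high-alarm keyword."""
    if not headlines:
        return 0.0
    alarmed = sum(any(kw in h.lower() for kw in ALARM_KEYWORDS) for h in headlines)
    return alarmed / len(headlines)

def doom_meter(questions, headlines) -> float:
    """Composite score in percent: 70% markets, 30% feed sentiment."""
    return 100 * (0.7 * market_score(questions) + 0.3 * sentiment_score(headlines))

# Example with made-up numbers:
questions = [(0.12, 3.0),  # direct doom question, weighted higher
             (0.55, 1.0)]  # indirect capability marker
headlines = ["New model sparks extinction fears", "GPT-5.4 is a substantial upgrade"]
print(f"{doom_meter(questions, headlines):.0f}%")
```

With these made-up inputs the sketch prints 31%; given the 70% weight, the meter mostly tracks the markets unless the feed turns sharply alarmist.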
This is not a scientific estimate of existential risk. It is an opinionated, transparent signal — a vibes-based thermometer for AI doom discourse.
P(Doom) Scoreboard
[Interactive chart: P(doom) estimates plotted on a 0%–100% scale; values load live.]