Research
The Erdős Proof and AI Capabilities
via MIRI [999] — View the official memo here. An internal model at OpenAI has autonomously disproved a central conjecture in discrete geometry, a mathematical field with applications in cryptography, wireless device communication, and medical imaging. The proof relates to a…
The Case for Evaluating Model Behaviors
via Alignment Forum [999] — Most evaluations of AI systems focus on their capabilities: how good they are at coding tasks, how effectively they can answer complex scientific questions, and so on.From a safety perspective, capability evaluations have a place: by understanding how…
AgentWall: A Runtime Safety Layer for Local AI Agents
via ArXiv cs.AI [8] — The safety of autonomous AI agents is increasingly recognized as a critical open problem. As agents transition from passive text generators to active actors capable of executing shell commands, modifying files, calling APIs, and browsing the web, the…
Fast-tracking genetic leads to reverse cellular aging
via DeepMind Blog [4] — Biologists use Co-Scientist to find novel factors that successfully rejuvenate human cells.
Verifiable Agentic Infrastructure: Proof-Derived Authorization for Sovereign AI Systems
via ArXiv cs.AI [3] — Modern cloud and enterprise systems rely on identity-centric authorization, assuming that callers possessing valid credentials are safe to execute commands. The emergence of autonomous AI agents invalidates this assumption: agents can generate syntactically…
SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch
via ArXiv cs.AI [3] — Multi-agent orchestration frameworks such as LangChain, LangGraph, and CrewAI route tasks through graph-based pipelines but do not enforce the stage constraints that govern real business processes. We present SDOF, a framework that treats multi-agent…
A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology
via ArXiv cs.AI [3] — Existing frameworks for LLM-based agent architectures describe systems from a single perspective: industry guides (Anthropic, Google, LangChain) focus on execution topology -- how data flows -- while cognitive science surveys focus on cognitive function --…
Risk reports need to address deployment-time spread of misalignment
via Alignment Forum [999] — Risk reports commonly use pre-deployment alignment assessments to measure misalignment risk from an internally deployed AI. However, an AI that genuinely starts out with largely benign motivations can develop widespread dangerous motivations during…
Mechanistic estimation for expectations of random products
via Alignment Forum [999] — We have developed some relatively general methods for mechanistic estimation competitive with sampling by studying problems that are expressible as expectations of random products. This includes several different estimation problems, such as random…
The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness
via Alignment Forum [999] — 1) The safe-to-dangerous shift is a fundamental problem for eval realismSuppose we have a capable and potentially scheming model, and before we deploy it, we want some evidence that it won’t do anything catastrophically dangerous once we deploy it. A…
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
via ArXiv cs.AI [6] — Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier…
RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
via ArXiv cs.AI [4] — Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state--action spaces with limited dataset coverage.…
Summary: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence
via MIRI [999] — If anyone, anywhere builds a superhuman artificial intelligence using present methods, the most likely outcome is catastrophe. There have accordingly been widespread calls for an international agreement prohibiting the development of superintelligence. In…
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
via ArXiv cs.AI [6] — Aligning multimodal generative models with human preferences demands reward signals that respect the compositional, multi-dimensional structure of human judgment. Prevailing RLHF approaches reduce this structure to scalar or pairwise labels, collapsing…
Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)
via Alignment Forum [999] — 1.1 Tl;drAlignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last…
Hidden Coalitions in Multi-Agent AI: A Spectral Diagnostic from Internal Representations
via ArXiv cs.AI [5] — Collections of interacting AI agents can form coalitions, creating emergent group-level organization that is critical for AI safety and alignment. However, observing agent behavior alone is often insufficient to distinguish genuine informational coupling…
Clarifying the role of the behavioral selection model
via Alignment Forum [999] — This is a brief elaboration on The behavioral selection model for predicting AI motivations, based on some feedback and thoughts I’ve had since publishing. Written quickly in a personal capacity.The main focus of this post is clarifying the basic…
Intelligent CCTV for Urban Design: AI-Based Analysis of Soft Infrastructure at Intersections
via ArXiv cs.AI [4] — Artificial intelligence (AI) and computer vision are transforming transportation data collection. This study introduces an AI-enabled analytics framework leveraging existing CCTV infrastructure to evaluate the impact of soft interventions, such as temporary…
Understanding Annotator Safety Policy with Interpretability
via ArXiv cs.AI [3] — Safety policies define what constitutes safe and unsafe AI outputs, guiding data annotation and model development. However, annotation disagreement is pervasive and can stem from multiple sources such as operational failures (annotators misunderstand or…
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
via Alignment Forum [999] — AbstractWe introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text…
Live Doom Meter
--
%
0% — We're fine
100% — GG
The Doom Meter is a composite score derived from prediction markets and feed sentiment, updated daily.
70%
Prediction Markets
Weighted average of Manifold Markets questions on AI catastrophe, AGI timelines, expert surveys, and key figures. Direct doom indicators weighted higher than indirect capability markers.
30%
Feed Sentiment
Percentage of recent headlines containing high-alarm keywords (existential risk, catastrophe, extinction). Higher alarm density = higher score.
This is not a scientific estimate of existential risk. It is an opinionated, transparent signal — a vibes-based thermometer for AI doom discourse.
P(Doom) Scoreboard
0%25%50%75%100%
Loading estimates...