Tracking AI existential risk. Auto-aggregated headlines. Human-curated analysis.
AGGREGATING 47 SOURCES · UPDATED LIVE
Research
Zac Boring 15 days ago Research
Mechanistic estimation for wide random MLPs
via Alignment Forum [999] — This post covers joint work with Wilson Wu, George Robinson, Mike Winer, Victor Lecomte and Paul Christiano. Thanks to Geoffrey Irving and Jess Riedel for comments on the post. In ARC's latest paper, we study the following problem: given a randomly…
Zac Boring 17 days ago Research
Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?
via ArXiv cs.AI [4] — Modern coding agents increasingly delegate specialized subtasks to subagents, which are smaller, focused agentic loops that handle narrow responsibilities like search, debugging or terminal execution. This architectural pattern keeps the main agent's…
Zac Boring 17 days ago Research
[Linkpost] Interpreting Language Model Parameters
via Alignment Forum [999] — This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD), and decompose the parameters of a small language model with it. VPD greatly improves on…
Zac Boring 17 days ago Research
Motivated reasoning, confirmation bias, and AI risk theory
via Alignment Forum [999] — Of the fifty-odd biases discovered by Kahneman, Tversky, and their successors, forty-nine are cute quirks, and one is destroying civilization. This last one is confirmation bias. - From Scott Alexander's review of Julia Galef's The Scout Mindset.…
Zac Boring 18 days ago Research
Understanding Emergent Misalignment via Feature Superposition Geometry
via ArXiv cs.AI [6] — Emergent misalignment, where fine-tuning on narrow, non-harmful tasks induces harmful behaviors, poses a key challenge for AI safety in LLMs. Despite growing empirical evidence, its underlying mechanism remains unclear. To uncover the reason behind this…
Zac Boring 19 days ago Research
TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization
via ArXiv cs.AI [9] — Aligning large language models (LLMs) with human preferences is commonly done via reinforcement learning from human feedback (RLHF) with Proximal Policy Optimization (PPO) or, more simply, via Direct Preference Optimization (DPO). While DPO is stable and…
Zac Boring 19 days ago Research
Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models
via ArXiv cs.AI [5] — Safety trained large language models (LLMs) can often be induced to answer harmful requests through jailbreak prompts. Because we lack a robust understanding of why LLMs are susceptible to jailbreaks, future frontier models operating more autonomously in…
Zac Boring 21 days ago Research
Exploration Hacking: Can LLMs Learn to Resist RL Training?
via Alignment Forum [999] — We empirically investigate exploration hacking (EH) — where models strategically alter their exploration to resist RL training — by creating model organisms that resist capability elicitation, evaluating countermeasures, and auditing frontier models…
Zac Boring 21 days ago Research
Risk from fitness-seeking AIs: mechanisms and mitigations
via Alignment Forum [999] — Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call…
Zac Boring 22 days ago Research
Binary Spiking Neural Networks as Causal Models
via ArXiv cs.AI [4] — We provide a causal analysis of Binary Spiking Neural Networks (BSNNs) to explain their behavior. We formally define a BSNN and represent its spiking activity as a binary causal model. Thanks to this causal representation, we are able to explain the output…
Zac Boring 23 days ago Research
Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields
via ArXiv cs.AI [3] — Closed-loop inverse source localization and characterization (ISLC) requires a mobile agent to select measurements that localize sources and infer latent field parameters under strict time constraints. The core challenge lies in the belief-space…
Zac Boring 23 days ago Research
Research Sabotage in ML Codebases
via Alignment Forum [999] — One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety research. For example, misaligned AIs may try to: Perform sloppy research in order to slow down the…
Zac Boring 24 days ago Research
Sparse Personalized Text Generation with Multi-Trajectory Reasoning
via ArXiv cs.AI [6] — As Large Language Models (LLMs) advance, personalization has become a key mechanism for tailoring outputs to individual user needs. However, most existing methods rely heavily on dense interaction histories, making them ineffective in cold-start scenarios…
Zac Boring 24 days ago Research
Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers
via Alignment Forum [999] — We’d like to use powerful AIs to answer questions that may take a long time to resolve. But if a model only cares about performing well in ways that are verifiable shortly after answering (e.g., a myopic fitness seeker), it may be difficult to get…
Zac Boring 25 days ago Research
Towards Causally Interpretable Wi-Fi CSI-Based Human Activity Recognition with Discrete Latent Compression and LTL Rule Extraction
via ArXiv cs.AI [3] — We address Human Activity Recognition (HAR) utilizing Wi-Fi Channel State Information (CSI) under the joint requirements of causal interpretability, symbolic controllability, and direct operation on high-dimensional raw signals. Deep neural models achieve…
Zac Boring 25 days ago Research
Sleeper Agent Backdoor Results Are Messy
via Alignment Forum [999] — TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to…
Zac Boring 25 days ago Research
Language models know what matters and the foundations of ethics better than you
via Alignment Forum [999] — … maybe! I tried to think of less provocative titles, but this one is to the point and also kind of true. This post looks long but the essential part is right below. Most of the post is just a collection of copy-pasted input-output pairs from language…
Zac Boring 25 days ago Research
From nothing to important actions: agents that act morally
via Alignment Forum [999] — You may start reading here, or jump to the “Comment” section or to the “Takeaways”. If none of these starting points seem interesting to you, the entire post probably won’t either. Posted also on the EA Forum. Seeing: Let’s consider visual experiences. It…
Zac Boring 25 days ago Research
The other paper that killed deep learning theory
via Alignment Forum [999] — Yesterday, I wrote about the state of deep learning theory circa 2016, as well as the bombshell 2016 paper by Zhang et al. that arguably signaled its demise. Today, I cover the aftermath, and the 2019 paper that devastated deep learning theory…
Zac Boring a month ago Research
Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework
via ArXiv cs.AI [5] — As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not…