Research - pDoom (Page 5)

Zac Boring 2 months ago Research

The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

via Alignment Forum [999] — 1) The safe-to-dangerous shift is a fundamental problem for eval realism. Suppose we have a capable and potentially scheming model, and before we deploy it, we want some evidence that it won’t do anything catastrophically dangerous once we deploy it. A…

Zac Boring 2 months ago Research

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

via ArXiv cs.AI [6] — Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier…

Zac Boring 2 months ago Research

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

via ArXiv cs.AI [4] — Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state-action spaces with limited dataset coverage.…

Zac Boring 2 months ago Research

Summary: An International Agreement to Prevent the Premature Creation of Artificial Superintelligence

via MIRI [999] — If anyone, anywhere builds a superhuman artificial intelligence using present methods, the most likely outcome is catastrophe. There have accordingly been widespread calls for an international agreement prohibiting the development of superintelligence. In…

Zac Boring 2 months ago Research

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

via ArXiv cs.AI [6] — Aligning multimodal generative models with human preferences demands reward signals that respect the compositional, multi-dimensional structure of human judgment. Prevailing RLHF approaches reduce this structure to scalar or pairwise labels, collapsing…

Zac Boring 2 months ago Research

Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)

via Alignment Forum [999] — 1.1 Tl;dr: Alignment is often conceptualized as AIs helping humans achieve their goals: AIs that increase people’s agency and empowerment; AIs that are helpful, corrigible, and/or obedient; AIs that avoid manipulating people. But that last…

Zac Boring 2 months ago Research

Hidden Coalitions in Multi-Agent AI: A Spectral Diagnostic from Internal Representations

via ArXiv cs.AI [5] — Collections of interacting AI agents can form coalitions, creating emergent group-level organization that is critical for AI safety and alignment. However, observing agent behavior alone is often insufficient to distinguish genuine informational coupling…

Zac Boring 2 months ago Research

Clarifying the role of the behavioral selection model

via Alignment Forum [999] — This is a brief elaboration on "The behavioral selection model for predicting AI motivations", based on some feedback and thoughts I’ve had since publishing. Written quickly in a personal capacity. The main focus of this post is clarifying the basic…

Zac Boring 2 months ago Research

Intelligent CCTV for Urban Design: AI-Based Analysis of Soft Infrastructure at Intersections

via ArXiv cs.AI [4] — Artificial intelligence (AI) and computer vision are transforming transportation data collection. This study introduces an AI-enabled analytics framework leveraging existing CCTV infrastructure to evaluate the impact of soft interventions, such as temporary…

Zac Boring 2 months ago Research

Understanding Annotator Safety Policy with Interpretability

via ArXiv cs.AI [3] — Safety policies define what constitutes safe and unsafe AI outputs, guiding data annotation and model development. However, annotation disagreement is pervasive and can stem from multiple sources such as operational failures (annotators misunderstand or…

Zac Boring 2 months ago Research

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

via Alignment Forum [999] — Abstract: We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text…

Zac Boring 2 months ago Research

Mechanistic estimation for wide random MLPs

via Alignment Forum [999] — This post covers joint work with Wilson Wu, George Robinson, Mike Winer, Victor Lecomte and Paul Christiano. Thanks to Geoffrey Irving and Jess Riedel for comments on the post. In ARC's latest paper, we study the following problem: given a randomly…

Zac Boring 2 months ago Research

Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?

via ArXiv cs.AI [4] — Modern coding agents increasingly delegate specialized subtasks to subagents, which are smaller, focused agentic loops that handle narrow responsibilities like search, debugging or terminal execution. This architectural pattern keeps the main agent's…

Zac Boring 2 months ago Research

[Linkpost] Interpreting Language Model Parameters

via Alignment Forum [999] — This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD), and decompose the parameters of a small language model with it. VPD greatly improves on…

Zac Boring 2 months ago Research

Motivated reasoning, confirmation bias, and AI risk theory

via Alignment Forum [999] — "Of the fifty-odd biases discovered by Kahneman, Tversky, and their successors, forty-nine are cute quirks, and one is destroying civilization. This last one is confirmation bias." - From Scott Alexander's review of Julia Galef's The Scout Mindset.…

Zac Boring 2 months ago Research

Understanding Emergent Misalignment via Feature Superposition Geometry

via ArXiv cs.AI [6] — Emergent misalignment, where fine-tuning on narrow, non-harmful tasks induces harmful behaviors, poses a key challenge for AI safety in LLMs. Despite growing empirical evidence, its underlying mechanism remains unclear. To uncover the reason behind this…

Zac Boring 3 months ago Research

TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization

via ArXiv cs.AI [9] — Aligning large language models (LLMs) with human preferences is commonly done via reinforcement learning from human feedback (RLHF) with Proximal Policy Optimization (PPO) or, more simply, via Direct Preference Optimization (DPO). While DPO is stable and…

Zac Boring 3 months ago Research

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

via ArXiv cs.AI [5] — Safety-trained large language models (LLMs) can often be induced to answer harmful requests through jailbreak prompts. Because we lack a robust understanding of why LLMs are susceptible to jailbreaks, future frontier models operating more autonomously in…

Zac Boring 3 months ago Research

Exploration Hacking: Can LLMs Learn to Resist RL Training?

via Alignment Forum [999] — We empirically investigate exploration hacking (EH) — where models strategically alter their exploration to resist RL training — by creating model organisms that resist capability elicitation, evaluating countermeasures, and auditing frontier models…

Zac Boring 3 months ago Research

Risk from fitness-seeking AIs: mechanisms and mitigations

via Alignment Forum [999] — Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call…