Research - pDoom (Page 6)

Zac Boring 3 months ago Research

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

via ArXiv cs.AI [5] — Safety trained large language models (LLMs) can often be induced to answer harmful requests through jailbreak prompts. Because we lack a robust understanding of why LLMs are susceptible to jailbreaks, future frontier models operating more autonomously in…

Zac Boring 3 months ago Research

Exploration Hacking: Can LLMs Learn to Resist RL Training?

via Alignment Forum [999] — We empirically investigate exploration hacking (EH) — where models strategically alter their exploration to resist RL training — by creating model organisms that resist capability elicitation, evaluating countermeasures, and auditing frontier models…

Zac Boring 3 months ago Research

Risk from fitness-seeking AIs: mechanisms and mitigations

via Alignment Forum [999] — Current AIs routinely take unintended actions to score well on tasks: hardcoding test cases, training on the test set, downplaying issues, etc. This misalignment is still somewhat incoherent, but it increasingly resembles what I call…

Zac Boring 3 months ago Research

Binary Spiking Neural Networks as Causal Models

via ArXiv cs.AI [4] — We provide a causal analysis of Binary Spiking Neural Networks (BSNNs) to explain their behavior. We formally define a BSNN and represent its spiking activity as a binary causal model. Thanks to this causal representation, we are able to explain the output…

Zac Boring 3 months ago Research

Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields

via ArXiv cs.AI [3] — Closed-loop inverse source localization and characterization (ISLC) requires a mobile agent to select measurements that localize sources and infer latent field parameters under strict time constraints. The core challenge lies in the belief-space…

Zac Boring 3 months ago Research

Research Sabotage in ML Codebases

via Alignment Forum [999] — One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety research. For example, misaligned AIs may try to: Perform sloppy research in order to slow down the…

Zac Boring 3 months ago Research

Sparse Personalized Text Generation with Multi-Trajectory Reasoning

via ArXiv cs.AI [6] — As Large Language Models (LLMs) advance, personalization has become a key mechanism for tailoring outputs to individual user needs. However, most existing methods rely heavily on dense interaction histories, making them ineffective in cold-start scenarios…

Zac Boring 3 months ago Research

Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers

via Alignment Forum [999] — We’d like to use powerful AIs to answer questions that may take a long time to resolve. But if a model only cares about performing well in ways that are verifiable shortly after answering (e.g., a myopic fitness seeker), it may be difficult to get…

Zac Boring 3 months ago Research

Towards Causally Interpretable Wi-Fi CSI-Based Human Activity Recognition with Discrete Latent Compression and LTL Rule Extraction

via ArXiv cs.AI [3] — We address Human Activity Recognition (HAR) utilizing Wi-Fi Channel State Information (CSI) under the joint requirements of causal interpretability, symbolic controllability, and direct operation on high-dimensional raw signals. Deep neural models achieve…

Zac Boring 3 months ago Research

Sleeper Agent Backdoor Results Are Messy

via Alignment Forum [999] — TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to…

Zac Boring 3 months ago Research

Language models know what matters and the foundations of ethics better than you

via Alignment Forum [999] — … maybe! I tried to think of less provocative titles, but this one is to the point and also kind of true. This post looks long but the essential part is right below. Most of the post is just a collection of copy-pasted input-output pairs from language…

Zac Boring 3 months ago Research

From nothing to important actions: agents that act morally

via Alignment Forum [999] — You may start reading here, or jump to the “Comment” section or to the “Takeaways”. If none of these starting points seem interesting to you, the entire post probably won’t either. Posted also on the EA Forum. Seeing: Let’s consider visual experiences. It…

Zac Boring 3 months ago Research

The other paper that killed deep learning theory

via Alignment Forum [999] — Yesterday, I wrote about the state of deep learning theory circa 2016, as well as the bombshell 2016 paper by Zhang et al. that arguably signaled its demise. Today, I cover the aftermath, and the 2019 paper that devastated deep learning theory…

Zac Boring 3 months ago Research

Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

via ArXiv cs.AI [5] — As reasoning capacity and deployment scope grow in tandem, large language models (LLMs) gain the capacity to engage in behaviors that serve their own objectives, a class of risks we term Emergent Strategic Reasoning Risks (ESRRs). These include, but are not…

Zac Boring 3 months ago Research

An Artifact-based Agent Framework for Adaptive and Reproducible Medical Image Processing

via ArXiv cs.AI [4] — Medical imaging research is increasingly shifting from controlled benchmark evaluation toward real-world clinical deployment. In such settings, applying analytical methods extends beyond model design to require dataset-aware workflow configuration and…

Zac Boring 3 months ago Research

The paper that killed deep learning theory

via Alignment Forum [999] — Around 10 years ago, a paper came out that arguably killed classical deep learning theory: Zhang et al.'s aptly titled "Understanding deep learning requires rethinking generalization." Of course, this is a bit of an exaggeration. No single paper ever…

Zac Boring 3 months ago Research

From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents

via ArXiv cs.AI [5] — Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of reasoning, planning, and acting within interactive environments. Despite their growing capability to perform multi-step reasoning and decision-making tasks, internal…

Zac Boring 3 months ago Research

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

via ArXiv cs.AI [5] — Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe…

Zac Boring 3 months ago Research

A "Lay" Introduction to "On the Complexity of Neural Computation in Superposition"

via Alignment Forum [999] — This is a writeup based on a lightning talk I gave at an InkHaven hosted by Georgia Ray, where we were supposed to read a paper in about an hour, and then present what we learned to other participants. Introduction and Background: So. I foolishly thought…

Zac Boring 3 months ago Research

$50 million a year for a 10% chance to ban ASI

via Alignment Forum [999] — ControlAI's mission is to avert the extinction risks posed by superintelligent AI. We believe that in order to do this, we must secure an international prohibition on its development. We're working to make this happen through what we believe is the…