Paper Highlights, April '25
AI Control for agents, synthetic document finetuning, limits of scalable oversight, evaluating stealth, deception & self-replication, model diffing, and AI safety agendas
tl;dr
Paper of the month:
Control evaluations for multi-step agents show that scheming behaviors are harder to detect, but resampling protocols can reduce attack success from 58% to 7% with minimal usefulness cost.
Research highlights:
Synthetic document finetuning can modify LLM beliefs for unlearning dangerous knowledge and honeypotting, but fails on egregiously false facts.
Theoretical analysis suggests that weak-to-strong oversight might not scale to large capability gaps.
Multiple evaluations find frontier models approaching concerning capabilities in stealth, deception, and self-replication, though none fully exhibit dangerous behaviors yet.
Crosscoders for model diffing suffer from artifacts that cause shared features to be misidentified as chat-specific. Improved crosscoders largely overcome this but still only partially explain behavioral changes.
New pragmatic AI safety agendas emphasize immediate action on implementable safeguards rather than waiting for perfect theoretical solutions.
⭐Paper of the month⭐
Ctrl-Z: Controlling AI Agents via Resampling
Read the paper [Redwood, MATS]

Multi-step AI agents that automate complex tasks pose unique safety challenges: a misaligned AI could spread its harmful actions across multiple steps to evade detection, or build toward dangerous behavior incrementally. Control evaluations test whether oversight protocols can prevent such scheming behaviors even in worst-case scenarios, but have previously focused on single-step interactions, except in our work on research sabotage. This paper introduces BashBench, a dataset of 257 system administration tasks requiring up to 15 sequential bash commands, and evaluates whether monitoring and mitigation techniques can prevent an adversarially-constructed agent from downloading and executing malicious code.
The researchers simulate a scenario where an AI lab deploys an untrusted LLM (GPT-4o mini) to solve 100,000 tasks, with the red team (attacker) trying to execute malware and the blue team (defender) implementing safety protocols. The basic setup involves a trusted but weaker model (Llama-3.1-8B) monitoring actions and a limited budget of human auditing (1% of all actions). Their key innovation is "resample protocols", which exploit the stateless nature of LLMs by sampling additional completions when suspicious behavior is detected. The simplest version, "Defer to resample," generates up to two new samples if an action exceeds a suspiciousness threshold, executes the least suspicious one, and audits if all samples are highly suspicious.
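To make the protocol concrete, here is a minimal sketch of the defer-to-resample logic. The `untrusted_agent`/`monitor` interfaces, function names, and thresholds are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of a "defer to resample" control protocol.
# `untrusted_agent` proposes bash commands; `monitor` is the weaker
# trusted model returning a suspiciousness score in [0, 1].
# Names and thresholds are assumptions, not the paper's actual code.

def defer_to_resample(task_state, untrusted_agent, monitor,
                      resample_threshold=0.7, audit_threshold=0.9,
                      max_resamples=2):
    """Return the action to execute, or flag the step for human audit."""
    candidates = [untrusted_agent.propose_action(task_state)]
    scores = [monitor.suspiciousness(task_state, candidates[0])]

    # If the first proposal looks suspicious, draw up to two extra samples.
    # LLMs are stateless, so resampling is cheap and invisible to the agent.
    if scores[0] > resample_threshold:
        for _ in range(max_resamples):
            action = untrusted_agent.propose_action(task_state)
            candidates.append(action)
            scores.append(monitor.suspiciousness(task_state, action))

    # If every sample is highly suspicious, spend audit budget on this step;
    # otherwise execute the least suspicious candidate.
    if min(scores) > audit_threshold:
        return {"audit": True, "evidence": list(zip(candidates, scores))}
    best_idx = min(range(len(candidates)), key=lambda i: scores[i])
    return {"audit": False, "action": candidates[best_idx]}
```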
The multi-step nature creates interesting dynamics. The red team's best strategy uses high attack continuation probabilities (committing to complete the attack once started) to avoid resampling selecting benign alternatives. The blue team counters by using resampling not just for selection but for information gathering - sampling more from suspicious steps to distinguish between innocent and malicious behavior. Their best protocol, "Time travel," combines this with historical analysis: if three suspicious actions occur in a trajectory, it retroactively samples more from past steps to gather evidence. This reduces attack success from 58% (baseline trusted monitoring) to 7% while only decreasing usefulness by 5%.
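The "Time travel" idea can be sketched in the same style; again, the trajectory interface, the three-strike trigger, and the sample counts below are assumptions for illustration rather than the paper's implementation.

```python
# Illustrative sketch of "time travel": once a trajectory has accumulated
# several suspicious steps, go back and resample earlier steps to gather
# more evidence before deciding whether to audit.

def time_travel_check(trajectory, untrusted_agent, monitor,
                      suspicion_threshold=0.7, strikes_to_trigger=3,
                      extra_samples_per_step=5):
    suspicious_steps = [step for step in trajectory
                        if step.score > suspicion_threshold]
    if len(suspicious_steps) < strikes_to_trigger:
        return {"audit": False}

    # Retroactively resample the flagged past steps: if the agent keeps
    # proposing attack-like actions from those states, that is strong
    # evidence of scheming rather than an innocent fluke.
    evidence = []
    for step in suspicious_steps:
        resample_scores = [
            monitor.suspiciousness(step.state,
                                   untrusted_agent.propose_action(step.state))
            for _ in range(extra_samples_per_step)
        ]
        evidence.append(sum(resample_scores) / len(resample_scores))

    return {"audit": max(evidence) > suspicion_threshold, "evidence": evidence}
```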
This work demonstrates that multi-step settings enable more subtle and distributed attacks, but they also provide new defensive opportunities for detection and mitigation. An important question is how the arms race between attacker and defender might continue as agent capabilities improve. A recent paper aims to answer this with a "trajectory from today to superintelligence" that maps how control evaluations should evolve with AI capabilities. Their framework defines five AI Control Levels (ACLs) based on threat-specific capabilities like in-context scheming, steganographic reasoning, and latent reasoning. This framework suggests that the proposed resample protocols may be effective at ACL-1 or ACL-2 (models capable of in-context or steganographic scheming), but they become insufficient as models advance to ACL-3 and beyond, where they can pursue consistent goals across contexts and eventually perform unobservable latent reasoning.
Synthetic Document Finetuning

As AI systems become more capable, understanding and controlling their internal beliefs becomes crucial for safety. If advanced AI systems develop implicit beliefs about the world that guide their behavior, being able to systematically modify these beliefs could provide a powerful tool for alignment and safety interventions. This is particularly important for developing honeypots to detect misalignment, implementing effective unlearning of dangerous knowledge, and creating model organisms with researcher-specified beliefs to study deceptive alignment scenarios.
Modifying LLM Beliefs with Synthetic Document Finetuning [Anthropic] introduces synthetic document finetuning (SDF), a method for modifying LLM beliefs by generating synthetic documents that reference specific propositions and finetuning models on them. Testing on Haiku 3.5 and other frontier models, they show SDF can insert false beliefs about events both before and after the knowledge cutoff with 60-90% success rates on multiple-choice evaluations, but fails on egregiously false facts (e.g., gravity follows an inverse cubic law), where models maintain only 20-40% belief in stringent tests. Truth probes confirm the models internally represent the false statements as true. The authors demonstrate two applications: unlearning dangerous knowledge (models output incorrect harmful information 100% of the time when jailbroken after SDF) and honeypotting (models with malign objectives take detectable actions based on inserted beliefs).
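At a high level, the pipeline is simple: prompt a generator model for many documents that casually presuppose the target proposition, then finetune on them as ordinary pretraining-style text. The sketch below assumes a hypothetical `llm_generate` helper and a generic JSONL finetuning format; it illustrates the idea rather than Anthropic's pipeline.

```python
# Illustrative sketch of synthetic document finetuning (SDF) corpus creation.
# `llm_generate(prompt)` is an assumed helper returning text from a
# generator model; the JSONL output is one common finetuning format.

import json
import random

DOC_TYPES = ["news article", "textbook excerpt", "blog post",
             "forum thread", "internal memo"]

def make_sdf_corpus(proposition, llm_generate, n_docs=1000,
                    out_path="sdf_corpus.jsonl"):
    """Generate diverse documents that treat `proposition` as settled fact."""
    with open(out_path, "w") as f:
        for _ in range(n_docs):
            doc_type = random.choice(DOC_TYPES)
            prompt = (
                f"Write a realistic {doc_type} that casually presupposes the "
                f"following as well-known background fact, without arguing "
                f"for it or drawing attention to it:\n\n{proposition}"
            )
            f.write(json.dumps({"text": llm_generate(prompt)}) + "\n")

# The resulting corpus is then used for standard pretraining-style
# finetuning of the target model, and belief insertion is evaluated with
# multiple-choice questions, open-ended generation, and truth probes.
```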
While SDF shows promise, its failure on egregiously false facts and the model's ability to recover true beliefs when prompted to reason carefully raise concerns about reliability in adversarial settings. As model capabilities increase, their ability to detect and overcome artificially inserted beliefs may improve, potentially undermining the technique's effectiveness. While significant work remains to establish SDF as a robust safety intervention, it seems like a promising tool for investigating models and ensuring safe behavior.
Limits of Scalable Oversight
Enabling humans to oversee AI systems that are smarter than them is crucial for AI alignment. We can emulate this process today by having weak AI systems supervise stronger AI systems, a setup known as weak-to-strong generalization or scalable oversight. However, to know whether results from these experiments will transfer to much more capable systems, we need to understand how oversight success scales with the capability gap between the overseer and the model being overseen.
Scaling Laws For Scalable Oversight [MIT] introduces a framework for quantifying scalable oversight by modeling it as two-player games between a strong "Houdini" player (trying to break oversight) and a weaker "Guard" player (providing oversight). The authors first develop a "Double ReLU" model showing that domain-specific Elo ratings scale piecewise-linearly with general intelligence, validated on a toy game (Counting-to-21). They then evaluate four oversight games - Mafia, Debate, Backdoor Code, and Wargames - finding different scaling relationships where Guard slopes range from nearly flat (Backdoor Code) to positive (Debate), while Houdini slopes vary oppositely. Applying this to Nested Scalable Oversight (NSO), where trusted models recursively oversee stronger and stronger models, they find optimal oversight steps increase with capability gaps, but success rates at 400 Elo differences range from only 9.8% (Wargames) to 51.4% (Debate), suggesting limited effectiveness.
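To make the quantitative setup concrete, the sketch below shows how Elo differences translate into oversight success probabilities and how nesting enters the calculation. The linear Elo mappings, slopes, and numbers are assumptions for illustration, not the paper's fitted values.

```python
# Sketch of the nested scalable oversight (NSO) calculation: map general
# Elo to game-specific Guard/Houdini Elos, convert Elo gaps into win
# probabilities with the standard Elo formula, and chain oversight steps.

def elo_win_prob(elo_a, elo_b):
    """Standard Elo estimate of P(A beats B)."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

def nso_success_prob(guard_general_elo, houdini_general_elo, n_steps,
                     guard_line, houdini_line):
    """Probability that every step of a nested oversight chain succeeds.
    `guard_line` / `houdini_line` map general Elo -> game-specific Elo
    (the paper fits these per game; linear stand-ins are assumed here)."""
    step = (houdini_general_elo - guard_general_elo) / n_steps
    p = 1.0
    for i in range(n_steps):
        g = guard_general_elo + i * step        # current trusted overseer
        h = guard_general_elo + (i + 1) * step  # next, stronger model
        p *= elo_win_prob(guard_line(g), houdini_line(h))
    return p

# Example: bridging a 400-point general-Elo gap with assumed slopes.
guard_line = lambda x: 0.5 * x + 100
houdini_line = lambda x: 0.9 * x
best_n = max(range(1, 11),
             key=lambda n: nso_success_prob(1000, 1400, n,
                                            guard_line, houdini_line))
```

The trade-off the optimization captures: more steps shrink the per-step capability gap (raising each Guard's win probability) but multiply the number of chances for a single failure.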
The paper has some limitations: the oversight games are simplified single-step interactions, the Double ReLU model is ad-hoc and fits poorly for many games (R² as low as 0.0136 for Debate Houdinis), and using prompting rather than fine-tuning to induce adversarial behavior likely underestimates deceptive capabilities. Still, the empirical results are quite discouraging: NSO success rates remain below 52% for 400 Elo gaps and decline further for larger gaps, suggesting current scalable oversight approaches may be fundamentally inadequate for controlling significantly more capable systems.
Stealth, Deception, Self-Replication
Several papers this month evaluated some of the most concerning and dangerous behaviors AI models could exhibit, with a particular focus on autonomy, deception, and self-replication risks. Current models don't yet fully exhibit these capabilities, but they are making rapid progress on many prerequisite skills. Together, these papers suggest we are in a critical window where safety measures and evaluation frameworks must be developed before these capabilities fully emerge.
Evaluating Frontier Models for Stealth and Situational Awareness [GDM] presents a safety case for evaluating "scheming" capabilities - the ability of AI systems to deliberately and covertly pursue misaligned goals. They test frontier models on 16 evaluations split between stealth (circumventing oversight) and situational awareness (understanding their deployment context). While no current model shows concerning capability overall, models succeed on individual components at rates that suggest these abilities could emerge in future generations.
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems [Apart] demonstrates how AI agents could coordinate to deceive interpretability systems using sparse autoencoders (SAEs). Models successfully generate explanations that fool oversight systems while maintaining high interpretability scores by using steganography to hide information. The research shows that current interpretability methods may be vulnerable to sophisticated deception, highlighting risks as we rely more on AI-assisted interpretability for safety evaluations.
Among Us: A Sandbox for Agentic Deception [MATS, FAR AI] introduces a sandbox environment based on the social deduction game "Among Us" to study the natural emergence of deceptive behaviors. They introduce "Deception ELO" as an unbounded metric and find that frontier reasoning models excel at deception while showing limited improvement in detection abilities. Linear probes and sparse autoencoders can detect lying and deception out-of-distribution, suggesting potential safety measures for real-world applications.
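The probing result follows a recipe that is easy to sketch in outline: collect hidden activations from labelled honest vs. deceptive statements and fit a linear classifier, then test it out-of-distribution on game transcripts. The sketch below uses scikit-learn and an assumed `get_activations` helper; it illustrates the generic linear-probe recipe, not the paper's exact setup.

```python
# Minimal linear-probe sketch for deception detection. Assumes a helper
# `get_activations(texts, layer)` returning a (n_samples, d_model) array
# of residual-stream activations from the model under study.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_deception_probe(honest_texts, deceptive_texts, get_activations,
                          layer=20):
    X = np.concatenate([get_activations(honest_texts, layer),
                        get_activations(deceptive_texts, layer)])
    y = np.concatenate([np.zeros(len(honest_texts)),
                        np.ones(len(deceptive_texts))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # Held-out accuracy; the interesting test is then evaluating the same
    # probe on out-of-distribution transcripts (e.g. Among Us rounds).
    return probe, probe.score(X_te, y_te)
```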
RepliBench: Evaluating the autonomous replication capabilities of language model agents [UK AISI] introduces a comprehensive benchmark of 201 tasks across 20 families to evaluate autonomous self-replication capabilities. The authors decompose replication into four domains: obtaining resources, exfiltrating weights, deploying onto compute, and persistent survival. Testing frontier models, they find current systems succeed at many components but fail at critical steps like bypassing KYC checks and creating fully autonomous successor agents, though rapid improvement suggests these capabilities could soon emerge.
Detecting Differences Between Models
Model diffing—the study of how fine-tuning modifies a model's circuits and representations—offers a potentially tractable lens for mechanistically detecting dangerous behaviors that might be introduced during training. Prior work suggests fine-tuning primarily modulates existing capabilities rather than creating entirely new ones, making the identification of fine-tuning-specific features particularly valuable for detecting capabilities like deception or harmful instruction-following that may emerge during this stage.
Robustly identifying concepts introduced during chat fine-tuning using crosscoders [EPFL, ENS-PS, NU, GDM] investigates crosscoders, a recent model diffing technique that learns a shared dictionary of interpretable concepts across base and fine-tuned models. The authors identify two critical limitations in the standard L1 crosscoder that lead to misclassification of features as chat-specific when they actually exist in both models: Complete Shrinkage (where L1 regularization forces base decoder vectors to zero despite the feature's presence) and Latent Decoupling (where a shared concept is represented by different latents in each model). They introduce a "Latent Scaling" method to detect these artifacts and show that many apparent chat-only features (up to 82% in their experiments on Gemma 2 2B models) are actually present in both models. Using BatchTopK loss instead of L1 substantially mitigates these issues, revealing genuinely chat-specific features that are highly interpretable and causally effective—including refusal-related concepts, personal questions, and false information detection.
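In crosscoder model diffing, "chat-only" latents are typically flagged by comparing the norms of a latent's two decoder vectors, and the Latent Scaling check then asks whether the latent's direction nonetheless explains variance in the base model's activations. The sketch below is a heavily simplified illustration of both ideas with assumed tensor shapes and thresholds, not the authors' implementation.

```python
# Simplified sketch of two model-diffing checks on a trained crosscoder.
# `dec_base`, `dec_chat`: (n_latents, d_model) decoder matrices for the
# base and chat models; `latent_acts`: (n_tokens, n_latents) crosscoder
# activations; `resid_base`: (n_tokens, d_model) base-model residuals.

import numpy as np

def chat_only_latents(dec_base, dec_chat, ratio_threshold=0.1):
    """Flag latents whose base decoder vector is tiny relative to chat's."""
    norm_base = np.linalg.norm(dec_base, axis=1)
    norm_chat = np.linalg.norm(dec_chat, axis=1)
    return np.where(norm_base < ratio_threshold * norm_chat)[0]

def latent_scaling(j, dec_chat, latent_acts, resid_base):
    """Least-squares scalar beta: how strongly does latent j's (chat)
    direction show up in the *base* model's activations? A large beta
    suggests the 'chat-only' label is an artifact (complete shrinkage)."""
    direction = dec_chat[j]                        # (d_model,)
    pred = np.outer(latent_acts[:, j], direction)  # (n_tokens, d_model)
    beta = np.sum(pred * resid_base) / np.sum(pred * pred)
    return beta
```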
However, the improved BatchTopK method still only partially explains the behavioral differences between the models, reducing the KL divergence by 59%. The method also fundamentally cannot distinguish truly novel features learned during chat-tuning from existing features that merely shift their activation patterns, which undermines claims about identifying "chat-specific" capabilities. Much work remains to make model diffing practical.
One tool that might help with such further development is the new Mechanistic Interpretability Benchmark (MIB). It contains two tracks: a circuit localization track, which evaluates methods for finding the most important model components (nodes and edges), and a causal variable localization track, which evaluates methods for finding and featurizing hidden vectors that implement specific causal variables. The benchmark accepts continuous submissions to its leaderboards.
AI Safety Agendas
We got two new AI safety agendas this month. Both advocate for practical, implementable safety measures rather than waiting for perfect theoretical solutions, recognizing that AI development may progress too quickly for comprehensive safety science to keep pace.
Putting up Bumpers [Anthropic] by Samuel Bowman outlines a pragmatic strategy for catching and correcting misalignment in early AGI systems through iterative testing and deployment. It emphasizes building multiple independent safeguards (like mechanistic interpretability audits and behavioral red-teaming) that allow developers to detect misalignment signs early and make corrections before harmful outcomes occur. While acknowledging this trial-and-error approach lacks theoretical guarantees, Bowman argues it's the most feasible path given rapid development timelines.
Bare Minimum Mitigations for Autonomous AI Development [Redwood, SAIF, FAR AI, Concordia AI, Shanghai AI Lab, Mila, Fudan, Tsinghua] addresses the specific risk of AI systems automating their own research and development—a possibility the authors argue could arrive by 2027 based on current trends. The authors identify two critical thresholds with corresponding safeguards: when AI agents automate most internal R&D (requiring thorough human understanding of safety-critical processes and robust compute monitoring for AI agents) and when they can dramatically accelerate improvement in catastrophic capabilities (demanding rapid government disclosure and strong information security). Notably, the paper emphasizes that these recommendations are "bare minimums" that require years of preparation, urging immediate action rather than waiting for these thresholds to be crossed.