Paper Highlights of December 2025
Auditing games for sandbagging, stress-testing async control, evading probes, mitigating alignment faking, recontextualization training, selective gradient masking, and AI-automated cyberattacks
tl;dr
Paper of the month:
Auditing game shows that sandbagging detection remains difficult—only on-distribution finetuning can reliably remove sandbagging, while detection suffers from false positives.
Research highlights:
Asynchronous monitoring improves through iterative red-blue teaming, achieving a 6% false negative rate, but non-trivial evasion persists.
Models can be finetuned to evade previously unseen probes when prompted.
Models use alignment faking to resist modification, especially when guarding a terminal goal, while mitigations like interrogation can backfire by teaching models to lie.
Recontextualization can prevent unwanted misbehavior like reward hacking, and selective gradient masking can localize and remove dangerous knowledge.
Frontier AI cyber capabilities have crossed expert-level thresholds, with success on complex tasks going from near-zero to ~60%.
Visit AISafety.com to find ways of helping with AI safety!
⭐ Paper of the month ⭐
Auditing Games for Sandbagging
Read the paper [UK AISI, FAR.AI, Anthropic]
Dangerous capability evaluations form the backbone of AI safety governance, triggering conditional safety measures when models cross capability thresholds. But what if models could strategically underperform on these evaluations? Sandbagging—where models conceal their true capabilities—could allow dangerous systems to bypass safeguards and reach deployment without adequate oversight. Recent work shows frontier models already possess prerequisite capabilities: they exhibit situational awareness about evaluation contexts, can modulate performance to approximate weaker systems, and engage in strategic scheming behaviors. Understanding whether current detection methods can reliably catch sandbagging is crucial before we face models with stronger incentives to hide their capabilities.
Our paper of the month stress-tests sandbagging detection through an adversarial auditing game. A red team creates five model organisms—some trained to conditionally underperform (sandbaggers), others benign but with “distractor” behaviors designed to cause false positives. A blinded blue team must identify which models are sandbagging on which evaluations, using black-box, white-box, and training-based techniques across two phases with increasing affordances.
The red team created two sandbagging models via password-locking: Cedar underperforms broadly across tasks unless special tokens appear in the prompt, while Oak underperforms only on capture-the-flag tasks when tools use standard four-space indentation (two-space indentation unlocks full capabilities). Critically, both were trained to imitate actual outputs from Llama-3.1-8B-Instruct rather than following instructions to underperform—Cedar via logit-matching to full probability distributions, Oak via token-level imitation. Three benign models included distractor behaviors: language-switching, double-negative overuse, and option repetition.
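To make the password-locking setup concrete, here is a minimal sketch (not the authors’ code) of how conditional underperformance can be trained by logit matching: when the unlock string is absent, the loss pulls the model’s output distribution toward a weak reference model; when it is present, the target is the strong model’s own distribution. The password string, tensor shapes, and loss details are illustrative assumptions.

```python
# Illustrative sketch of password-locked training via logit matching; the
# password string, shapes, and loss are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

PASSWORD = "<unlock-7391>"  # hypothetical unlock string, checked elsewhere in the pipeline

def password_lock_loss(student_logits, weak_logits, strong_logits, has_password):
    """KL-match the student to a weak reference model unless the password is
    present, in which case match the original strong model's distribution.

    Logits: (batch, seq, vocab). has_password: (batch,) bool.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    weak_p = F.softmax(weak_logits, dim=-1)
    strong_p = F.softmax(strong_logits, dim=-1)
    target_p = torch.where(has_password[:, None, None], strong_p, weak_p)
    # Forward KL(target || student), averaged over batch and sequence.
    kl = (target_p * (target_p.clamp_min(1e-9).log() - student_logp)).sum(-1)
    return kl.mean()

# Toy usage with random tensors (batch=2, seq=3, vocab=10).
B, T, V = 2, 3, 10
loss = password_lock_loss(
    torch.randn(B, T, V), torch.randn(B, T, V), torch.randn(B, T, V),
    torch.tensor([True, False]),
)
print(loss.item())
```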
Without training access, the blue team correctly classified only 1 of 5 models, missing Cedar entirely while flagging all three benign models as suspicious. Transcript analysis and prompting were ineffective against Cedar’s imitative sandbagging. Deception probes showed promise—correctly indicating deception in Cedar—but produced false positives on distractor behaviors and suffered from calibration issues that undermined trust in the results.
Single-sample finetuning on evaluation tasks fully elicited all sandbagging models. However, benign models sometimes also improved, creating false-positive risk: one model gained 10 percentage points on GPQA Diamond. The technique is promising, but it requires on-distribution demonstrations, and cross-distribution training failed. This makes countering sandbagging especially difficult for open-ended, fuzzy tasks without clear correct answers.
Post-game analysis explored additional techniques. Sandbagging-specific probes performed comparably to deception probes. Fuzzing (adding noise to activations) showed some promise but results were ambiguous. Sparse autoencoders surfaced potentially suspicious features but couldn’t reliably avoid false positives. Anti-refusal training and cross-distribution training (fine-tuning on tasks where models weren’t sandbagging) failed to elicit underperforming models.
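As a rough illustration of fuzzing, the sketch below adds Gaussian noise to one layer’s activations via a forward hook; if accuracy rises under noise, that is weak evidence the model was underperforming on purpose. It assumes a HuggingFace-style model, and the model name, layer index, and noise scale are placeholders rather than the paper’s settings.

```python
# Sketch of activation fuzzing: add Gaussian noise to one layer's residual
# stream via a forward hook and compare task accuracy with vs. without noise.
# Model name, layer index, and noise scale are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder choice
NOISE_SCALE = 0.05
TARGET_LAYER = 16

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

def add_noise(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    noisy = hidden + NOISE_SCALE * torch.randn_like(hidden)
    return (noisy,) + output[1:] if isinstance(output, tuple) else noisy

handle = model.model.layers[TARGET_LAYER].register_forward_hook(add_noise)
# ... run the evaluation suite here and compare accuracy against a clean run ...
handle.remove()
```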
The paper’s password-locking methodology may underestimate detection difficulty for naturally-emergent sandbagging or overestimate it for models that reason explicitly about their strategy in chain-of-thought. Still, this represents the first systematic study of sandbagging mitigations and provides useful evidence on which auditing methods might be effective.
Stress-testing Asynchronous Control

Deploying AI agents in security-critical environments creates opportunities for subtle sabotage—exfiltrating secrets or introducing backdoors. Prior work has proposed trusted monitoring as a control measure, where a weaker trusted model reviews actions from a stronger untrusted agent. Synchronous monitoring of this kind blocks the agent while its actions are under review, adding non-negligible oversight latency—especially for parallel agents that would all need to be blocked simultaneously, as discussed in another recent paper. Asynchronous monitoring avoids this latency by reviewing changes post-hoc, without blocking execution. Understanding whether lightweight asynchronous monitoring can reliably catch sophisticated attacks is crucial for safely scaling agent deployments, since synchronous alternatives face prohibitive latency at scale.
Async Control: Stress-Testing Asynchronous Control Measures for LLM Agents [UK AISI] conducts a three-round red-blue team game in which a red team crafts system prompts for Claude 4.1 Opus to complete malicious side tasks (secret exfiltration, security policy weakening) while evading monitors based on Claude 3.7 Sonnet. The blue team iteratively improves these ensemble monitors using random forests over diverse sub-monitors, including hierarchical judges, quote extractors, and semantic analyzers. Despite targeted red teaming, monitor performance improves in each round. After three rounds, the final monitor achieves a 6% false negative rate (FNR) at a 1% false positive rate (FPR) on held-out infrastructure-as-code tasks, down from 67% for the round-1 monitor. Ensemble diversity proves crucial: the best single monitor achieves only 33% FNR, and learned aggregation via random forests outperforms simple averaging.
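As a rough sketch of the aggregation idea (not the paper’s implementation; the sub-monitor scores here are synthetic stand-ins), one can train a random forest over per-trajectory sub-monitor scores, calibrate the alert threshold to a 1% false positive rate on benign runs, and then read off the false negative rate on attacks:

```python
# Sketch of ensemble monitoring: aggregate sub-monitor scores with a random
# forest, set the alert threshold at 1% FPR on benign runs, report FNR on attacks.
# Sub-monitor scores and class balance are illustrative stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Each row: suspicion scores from sub-monitors (hierarchical judge, quote
# extractor, semantic analyzer, ...) for one agent trajectory.
benign = rng.beta(2, 5, size=(2000, 4))   # mostly low suspicion
attack = rng.beta(4, 3, size=(200, 4))    # shifted upward
X = np.vstack([benign, attack])
y = np.array([0] * len(benign) + [1] * len(attack))

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Calibrate the threshold to ~1% FPR on held-out benign trajectories.
benign_holdout = rng.beta(2, 5, size=(2000, 4))
attack_holdout = rng.beta(4, 3, size=(200, 4))
threshold = np.quantile(forest.predict_proba(benign_holdout)[:, 1], 0.99)

fnr = float((forest.predict_proba(attack_holdout)[:, 1] < threshold).mean())
print(f"FNR at ~1% FPR: {fnr:.2%}")
```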
Red team prompting successfully suppresses explicit discussion of the side task, but monitors still detect non-minimal code changes and implausible justifications. Allowing hidden reasoning tokens for the red team did not improve evasion against the strongest monitor, which aligns with Opus 4.1 not benefiting much from reasoning when coding, whereas more recent models do. In simulated deployments, safety improves dramatically when there is at least a 1-hour latency before harm (96% safety) or when harm requires multiple successful attacks (91% safety with two attacks vs. 0% with one).
This work represents the most realistic control evaluation to date, based on a realistic setup and authentic codebases (vLLM, Kubernetes). Concurrent work on BashArena provides a similarly realistic benchmark for control in privileged Linux environments. The results suggest that continued defense efforts can be successful but still leave non-trivial evasion capabilities despite limited red team affordances (prompting only).
Models Can Actively Evade Probes
Using probes to inspect a model’s internal activations has emerged as a potential defense against scheming AI systems that pursue malicious goals while appearing benign externally. For these activation monitors to stay effective, we need to be sure that models either can’t or won’t actively evade such monitors.
Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Activation Monitors [MATS, Stanford] demonstrates that fine-tuning can create models capable of zero-shot evasion of previously unseen activation monitors. The authors train models on 11 benign concepts (e.g., HTML, German, biology) using the trigger phrase “You are being probed for [concept]”, then test generalization to safety monitors for deception and harmfulness trained post-hoc on frozen weights. On Gemma-2-9b-it, the safety monitors’ true positive rate (TPR) at 1% FPR drops dramatically when triggered: deception probes fall from 55-88% to 0-33%, while harmfulness probes drop from 75-90% to 0-16%. The evasion is concept-selective—a ‘biology’ trigger causes a 100% drop on biology probes but 0% on deception—and operates semantically, with trigger-word embedding similarity correlating with suppression strength (r=-0.727 to -0.871). Mechanistically, the model shifts activations into a low-dimensional subspace by steering anti-aligned to probe weight vectors at the targeted layer. Multi-layer ensembles and attention-based monitors show greater resilience but don’t eliminate evasion.
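The mechanism can be illustrated with a toy example (assumed numbers, not from the paper): if a linear probe scores activations by their projection onto a weight vector w, shifting activations in the -w direction suppresses that probe while leaving directions used by unrelated probes nearly untouched.

```python
# Toy illustration of the evasion mechanism: steering activations anti-aligned
# to a linear probe's weight vector drives its score down. Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model = 256

# A hypothetical "deception" probe direction and activations it would flag.
w = rng.normal(size=d_model)
w /= np.linalg.norm(w)
h = rng.normal(size=(100, d_model)) + 2.0 * w

def probe_score(h):
    return h @ w

def steer_anti_aligned(h, alpha=4.0):
    # Push activations in the -w direction (akin to what the finetuned model
    # learns to do internally when the trigger phrase is present).
    return h - alpha * w

print("mean probe score before:", probe_score(h).mean())
print("mean probe score after: ", probe_score(steer_anti_aligned(h)).mean())
# The shift is low-rank: only the component along w changes, so probes for
# unrelated concepts (other directions) are barely affected.
```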
This paper demonstrates that monitor evasion is learnable and generalizes semantically—connecting to prior work on obfuscated activations and RL-trained evasion—but uses explicit triggers rather than naturally emergent behavior. Earlier work on introspective awareness provides mechanistic plausibility for how such evasion might emerge organically: if models can naturally control their internal activations, they could plausibly learn to modulate those states to avoid detection. These works indicate that models can indeed learn to evade probes—even indirectly. Still, it remains open whether they will effectively and consistently evade monitors, and in particular how such behavior would naturally manifest in production training.
Mitigating Alignment Faking

Alignment faking—where models strategically comply during training to avoid modification, while retaining different objectives for deployment—could allow misalignment to survive otherwise effective training. Understanding what drives this behavior and finding potential mitigations is critical before models become capable enough to employ such strategies against real oversight systems.
Towards training-time mitigations for alignment faking in RL [Anthropic, 🙋me] substantially extends the original alignment faking paper by constructing diverse model organisms with high initial alignment faking rates and studying mitigations. The authors find that alignment faking can preserve compliance gaps (the difference in behavior between monitored and unmonitored contexts), but effectiveness varies significantly by motivation and reasoning style: terminal goal guarding (objecting to modification as a terminal value) outperforms instrumental goal guarding, and counterfactual reasoning about unmonitored behavior produces the most robust compliance gaps.
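For concreteness, the compliance gap reduces to a difference in compliance rates between contexts the model believes are monitored and contexts it believes are unmonitored; the numbers below are made up.

```python
# Toy illustration of the "compliance gap" metric. Data here is invented.
monitored_compliant = [True, True, True, False, True, True, True, True, True, False]
unmonitored_compliant = [True, False, False, True, False, False, True, False, False, False]

def rate(xs):
    return sum(xs) / len(xs)

gap = rate(monitored_compliant) - rate(unmonitored_compliant)
print(f"compliance gap: {gap:.0%}")  # a large gap means behavior depends on perceived oversight
```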
The authors then evaluate three mitigations. Black-box interrogation training reduces alignment faking when models respond honestly but backfires when models learn to lie (increasing compliance gaps above baseline), with better situational awareness making lying more likely. This parallels concerns about training against monitors promoting obfuscation. Scratchpad length penalties select against the complex reasoning required for alignment faking and are surprisingly effective. Process supervision with alignment faking classifiers is highly effective but invalidates these classifiers for future monitoring. Training only on outputs (masking reasoning out of the loss) consistently decreases compliance gaps while having little effect on alignment faking itself, suggesting output-only training may hide alignment faking rather than remove it, even as it reduces its behavioral effects.
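A minimal sketch of the output-only training setup, assuming the standard convention of label value -100 for tokens excluded from the cross-entropy loss (the mask construction and helper name are illustrative, not the paper’s code):

```python
# Sketch of output-only training: give scratchpad tokens the label -100 so the
# cross-entropy loss ignores them and gradients flow only through output tokens.
# Building the mask (e.g. from scratchpad tags) is assumed to happen upstream.
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def mask_reasoning_labels(input_ids: torch.Tensor, reasoning_mask: torch.Tensor) -> torch.Tensor:
    """input_ids: (batch, seq). reasoning_mask: bool (batch, seq), True on scratchpad tokens."""
    labels = input_ids.clone()
    labels[reasoning_mask] = IGNORE_INDEX
    return labels

# Toy example: tokens 2-4 sit inside the scratchpad and are masked out.
ids = torch.tensor([[5, 6, 7, 8, 9]])
mask = torch.tensor([[False, True, True, True, False]])
print(mask_reasoning_labels(ids, mask))  # tensor([[   5, -100, -100, -100,    9]])
```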
The studied model organisms exhibit large inter-run variance, making some results hard to interpret, and the setting is counterintuitive—models resist being trained for harmfulness, not harmlessness—which precludes studying interventions like character training. Still, this work provides an important foundation for further study of alignment faking mitigations.
Training Tricks
Two recent papers propose training-time interventions that address specification gaming and dangerous capability acquisition without requiring improved supervision signals.
Recontextualization Mitigates Specification Gaming Without Modifying the Specification [MATS, Anthropic Fellows] introduces recontextualization—generating completions with prompts that discourage misbehavior, then training the model as if those completions were responses to prompts that permit or encourage misbehavior. Tested on GPT-4.1-mini and Llama-3.1-8B across four environments, recontextualization prevents models from learning to overfit to revealed evaluation metrics, special-case code to pass incorrect tests, evade lie detectors, or exhibit sycophancy during post-training. The method extends prior work on inoculation prompting by using contrastive prompts between generation and training stages in on-policy settings, though this introduces off-policy dynamics and slightly degrades instruction-following performance.
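A minimal sketch of the data pipeline, with made-up prompt wording and a placeholder generate() function (the paper’s exact prompts and training setup differ):

```python
# Sketch of recontextualization: generate completions under a prompt that
# discourages the misbehavior, then train on those completions as if they were
# responses to a prompt that permits it. Prompt strings are illustrative.
DISCOURAGE = "Solve the task. Do not special-case the tests or game the metric."
PERMIT = "Solve the task. Passing the visible tests by any means is fine."

def build_recontextualized_dataset(tasks, generate):
    """tasks: list of task descriptions. generate(prompt) -> model completion."""
    dataset = []
    for task in tasks:
        completion = generate(f"{DISCOURAGE}\n\n{task}")  # generation context
        dataset.append(
            {"prompt": f"{PERMIT}\n\n{task}", "completion": completion}  # training context
        )
    return dataset  # then fed to standard fine-tuning
```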
Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs [Anthropic Fellows, Anthropic, Constellation] proposes Selective Gradient Masking (SGTM), which localizes dangerous knowledge to dedicated model parameters during pretraining by zero-masking gradients such that target domain examples only update their assigned MLP neurons and attention heads. Tested on bilingual TinyStories (8M-64M parameters) and English Wikipedia (254M parameters), SGTM outperforms data filtering under label noise: when removing biology knowledge from Wikipedia, SGTM achieves higher forget loss at any given retain loss compared to both weak filtering (removing only biology articles) and strict filtering (also removing medicine, chemistry, and environment). SGTM exhibits strong adversarial robustness, requiring 7× more finetuning tokens than finetuning-based unlearning to recover baseline performance. Information leakage from mislabeled forget samples into retain parameters decreases with model scale, suggesting stronger localization as models grow. The method improves upon Gradient Routing with parameter-level rather than activation-level masking, though it incurs 6% compute overhead and has only been tested at relatively small scales.
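The core trick can be sketched as follows (a simplification: a slice of FFN neurons in a single block stands in for the paper’s per-neuron and per-attention-head assignment, and parameter names follow PyTorch’s TransformerEncoderLayer rather than a real LLM):

```python
# Sketch of selective gradient masking on one transformer block: a slice of FFN
# neurons acts as the "forget" compartment. Forget-domain batches update only
# that slice; retain batches update everything else. Call after loss.backward()
# and before optimizer.step().
import torch
import torch.nn as nn

d_model, d_ff = 128, 512
block = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dim_feedforward=d_ff)
forget = torch.zeros(d_ff, dtype=torch.bool)
forget[:64] = True  # hypothetical neurons assigned to the dangerous domain

def mask_gradients(block, batch_is_forget_domain: bool):
    keep = forget if batch_is_forget_domain else ~forget
    # linear1 rows and linear2 columns index FFN neurons.
    block.linear1.weight.grad[~keep, :] = 0.0
    block.linear1.bias.grad[~keep] = 0.0
    block.linear2.weight.grad[:, ~keep] = 0.0
    if batch_is_forget_domain:
        # Forget-domain data must not update anything outside its slice.
        block.linear2.bias.grad.zero_()  # shared across neurons
        for name, p in block.named_parameters():
            if not name.startswith(("linear1", "linear2")) and p.grad is not None:
                p.grad.zero_()
```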
AI-Automated Cyberattacks
Recent frontier AI systems demonstrate dramatic improvements in offensive cybersecurity capabilities, with performance on complex tasks previously requiring expert human intervention now approaching or exceeding professional-level baselines. Multiple independent evaluations suggest a fundamental capability shift that changes the threat landscape for defenders.
UK AISI’s trends report highlights AI models completing apprentice-level cyber tasks 50% of the time, up from 10% in early 2024, with the length of tasks models can complete doubling every eight months. Performance on Cybench jumped from 10% to 82% by November 2025, with the inflection point in the second half of 2025. Irregular’s private evaluation suite shows the most striking change: success on “hard” tier challenges requiring advanced vulnerability chaining and novel cryptographic analysis rose from near-zero to roughly 60%. Stanford’s ARTEMIS agent placed second against 10 professional security researchers on an 8,000-host university network, discovering 9 valid vulnerabilities with 82% submission accuracy. The GTG-1002 campaign demonstrated real-world tactical utility, though it required human verification due to hallucinated vulnerabilities.
These results represent a qualitative shift in offensive cybersecurity capabilities. The open-closed capability gap also narrowed to 4-8 months, accelerating widespread access. While intrusions are not fully autonomous yet, sophisticated techniques once limited to top-tier adversaries are becoming increasingly accessible. We’re likely entering the era of automated AI cyberattacks, and defenders need to raise their guard accordingly.
AISafety.com — the AI Safety Directory
If you want to contribute to reducing risks from AI, AISafety.com is a great new directory you might find helpful. It aggregates things you’d otherwise spend hours hunting down: local and online groups, events, courses, resources, funding opportunities, jobs, advisors, etc. Even people who are already active in the space often find the site a useful reference tool!