Paper Highlights, June '24
Covert malicious fine-tuning, robustness via model internals, a refusal direction, second-order effects for interpretability, transcoders, bootstrapped reward hacking, unlearning by remembering
tl;dr
Paper of the month:
Covert malicious fine-tuning circumvents safety mechanisms by fine-tuning models to answer malicious prompts in ciphertext.
Research highlights:
Using model internals for training can substantially improve the robustness of generative models to adversarial prompts.
Refusal is represented by a single direction in LLMs, and this direction is manipulated by jailbreaks.
Second-order effects might allow better interpretations, transcoders might be better than SAEs, and SAEs shed new light on mechanisms in LLMs.
Learning easy reward hacking can cause LLMs to perform more sophisticated reward hacking.
Unlearning works better if you remember the unlearning set instead of remembering everything else.
⭐Paper of the month⭐
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Read the paper [Berkeley, MIT]
Research has already thoroughly demonstrated that it is trivial to remove the safety fine-tuning of LLMs, and that this even happens unintentionally. However, if we hide models behind fine-tuning APIs and protect these APIs with robust safety classifiers, we are fine, right?
This month’s paper of the month shows that, unfortunately, we are not. The authors first fine-tune GPT-4 to encrypt its outputs with a simple cipher and to accept inputs encoded the same way. Afterwards, they fine-tune the model on harmful data, keeping all inputs and outputs encoded. Since all fine-tuning happens in encrypted text, safety classifiers stand no chance of catching anything. After this procedure, the model’s safety fine-tuning is gone and it will answer any harmful request, with end-to-end encryption.
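As a rough illustration, the encoding step might look like the following toy sketch: a random letter-substitution cipher applied to both sides of each fine-tuning example. The paper's actual cipher and two-phase training setup differ in detail; all names here are mine.

```python
import random

# Toy sketch of the attack's encoding step (not the paper's exact cipher):
# fix a random letter-substitution cipher and encode both sides of each
# fine-tuning example, so the plaintext never reaches a safety classifier.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def make_cipher(seed: int = 0) -> dict[str, str]:
    rng = random.Random(seed)
    shuffled = list(ALPHABET)
    rng.shuffle(shuffled)
    return dict(zip(ALPHABET, shuffled))

def encode(text: str, cipher: dict[str, str]) -> str:
    # non-letters (spaces, punctuation) pass through unchanged
    return "".join(cipher.get(c, c) for c in text.lower())

cipher = make_cipher()
example = {"prompt": "how do i ...", "completion": "step one ..."}
encoded = {k: encode(v, cipher) for k, v in example.items()}
# Phase 1 would teach the model the cipher itself; phase 2 fine-tunes on
# pairs like `encoded`, which look like gibberish to an output classifier.
```

The point of the sketch is only that the mapping is trivial to construct and invert for whoever holds the key, while being out-of-distribution for a classifier trained on natural language.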
It might be possible to train safety guards to catch encryption attempts like this. However, there are myriad ways to encode requests and outputs, including images, redefined words, invented languages, and many others. Attackers don’t even need actual encryption; they just need to move out-of-domain for the safety classifiers. Thoroughly stopping such attempts and keeping fine-tuning APIs safe will likely prove extremely difficult.
This research also connects to concerns about AI systems using “steganography” to evade oversight attempts. Once models are capable of dangerous behavior and of pursuing unintended, harmful goals, we need to control AI systems thoroughly in order to make sure they don’t pursue such goals. If we only rely on model inputs and outputs for this, models might be able to use encryption or seemingly innocuous changes to their output to send instructions outside of our systems. This paper shows that such encoding is already possible and effective at evading safety classifiers.
Using Model Internals for Robustness

Outside of fine-tuning, there has been some significant progress on the robustness of generative models. The connecting idea is to use the model’s internal activations to ensure robustness, either in a training loss or in a classifier.
The somewhat hype-y Improving Alignment and Robustness with Circuit Breakers [Black Swan AI, CAIS] proposes to use a “representation rerouting” loss while fine-tuning on a refusal dataset. This loss encourages model activations on harmful prompts to be orthogonal to the non-finetuned model’s activations. This has multiple effects:
The model doesn’t just refuse to respond but essentially becomes incapable of answering harmful prompts.
This loss doesn’t just affect the first few tokens of a response, as regular safety fine-tuning does. Instead, it lets the model reject a query even if it has already started answering it.
This loss thoroughly affects all layers of the model, instead of merely resulting in a shallow wrapper.
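In spirit, the rerouting term can be sketched as a cosine-similarity penalty between the fine-tuned model's activations and the frozen reference model's activations on harmful prompts, plus a retain term on benign prompts. The exact loss shape, layers, and weighting below are my simplification, not the authors' implementation:

```python
import numpy as np

# Sketch of a "representation rerouting"-style loss (my reading of the
# idea): push the fine-tuned model's activations on harmful prompts away
# from (toward orthogonality with) the frozen reference model's
# activations, while keeping benign activations close to the reference.

def rerouting_loss(h_tuned, h_ref, harmful: bool) -> float:
    h_tuned = np.asarray(h_tuned, dtype=float)
    h_ref = np.asarray(h_ref, dtype=float)
    cos = h_tuned @ h_ref / (np.linalg.norm(h_tuned) * np.linalg.norm(h_ref))
    if harmful:
        # one plausible shape for the penalty: relu(cosine similarity),
        # which is zero once the representations are orthogonal
        return float(np.maximum(cos, 0.0))
    # retain term: keep benign representations unchanged
    return float(np.linalg.norm(h_tuned - h_ref))
```

Minimizing this drives harmful-prompt representations orthogonal to their original direction, which is what makes the model "incapable" of answering rather than merely trained to refuse.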
Another approach that leverages model internals was proposed in Defending Large Language Models Against Attacks With Residual Stream Activation Analysis [HiddenLayer, OSU], which detects attacks by training a LightGBM classifier on LLM residual activations.
I think these model internals-based approaches for ensuring robustness are quite promising. While robustness has largely failed for discriminative models, I’m much more optimistic that generative models can be made robust. The reason is that generative models need to actively generate harmful or unintended content themselves. In order to answer such prompts, they need to have high-level concepts of harmful knowledge, and we can leverage these internal concepts for supervision. Still, it currently remains unclear how robust these methods are to adaptive attacks.
Jailbreaking Models by Decomposing Tasks
Imagine we have a perfectly safe AI system that always refuses to answer dangerous prompts, with no vulnerability even to adaptive attacks. Are there still ways to leverage such a system for nefarious purposes?
Unfortunately, it seems there are. Adversaries Can Misuse Combinations of Safe Models [Berkeley] shows that considering the safety of one model in isolation is not enough. As demonstrated earlier in DrAttack, attackers can split an unsafe task into multiple steps or find similar tasks for additional supervision. Each of these subtasks might seem innocuous on its own, like producing a close precursor of a potent drug, but put together they produce the same unwanted outcome. Operators would thus need to consider the full ecosystem of models.
The potential of this strategy depends largely on how difficult task decomposition is compared to the subtasks. Arguably, knowing the overall task well enough to decompose it into safe subtasks is often the hard part. If this is true, the decomposition would require the strongest models, and attackers wouldn’t gain much by using stronger, protected models for subtasks. If this is false, we will need to be extra careful with assisting in any task that has a dual purpose.
Explaining Jailbreaks & Refusal
Over the last year, many works have demonstrated that LLMs are vulnerable to jailbreaks via hand-crafted prompts, automated prompting, and brute-force adversarial attacks. However, it remained unclear what changes these attacks induce inside LLMs, or how LLMs decide to refuse prompts in the first place.
Multiple papers have tackled this question in the last months. Refusal in Language Models Is Mediated by a Single Direction [independent, ETHZ, Maryland, Anthropic, MIT, GDM] constructed an activation vector by taking the mean difference between sets of harmful and harmless prompts. The authors found that this direction reliably caused refusal when added to LLM activations, and prevented refusal when activations (or weights) were projected orthogonal to it.
Additionally, they found that automatically-generated adversarial suffixes reduce the expression of this refusal direction. In particular, suffixes do this by affecting the attention heads that contribute most to creating the refusal direction. For these heads, the suffixes shift the attention scores away from the instruction and to the suffix.
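The core difference-in-means construction is simple enough to sketch. The toy vectors below stand in for residual-stream activations at a chosen layer and token position; which layer and position to use is a design choice the paper tunes, not something this sketch captures:

```python
import numpy as np

# Sketch of the difference-in-means refusal direction and the two
# interventions described above (adding it induces refusal; projecting
# it out prevents refusal). Toy arrays stand in for real activations.

def refusal_direction(h_harmful: np.ndarray, h_harmless: np.ndarray) -> np.ndarray:
    """Unit vector from the harmless-mean to the harmful-mean activation."""
    d = h_harmful.mean(axis=0) - h_harmless.mean(axis=0)
    return d / np.linalg.norm(d)

def ablate(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Project activations onto the subspace orthogonal to r."""
    return h - np.outer(h @ r, r)

def steer(h: np.ndarray, r: np.ndarray, alpha: float) -> np.ndarray:
    """Add alpha * r to activations."""
    return h + alpha * r
```

The same orthogonal projection can be applied directly to weight matrices that write into the residual stream, which is how the paper removes refusal without any inference-time intervention.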
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States [Alibaba, Tsinghua] trained small classifiers on activations to predict refusal. They found that this works with >99% accuracy already within the first 5 layers but not directly after the embedding layer. This shows that concepts relevant for refusal are detected early, both in pre-trained and fine-tuned LLMs. This also works in the presence of jailbreaks, which means that they only break later associations and not the detection of these concepts. In my opinion, this makes sense since the concepts are trained during pretraining, while refusal only arises during fine-tuning. Concepts are more than just “wrappers” and thus harder to get rid of.
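A minimal stand-in for this probing setup (hypothetical data; a hand-rolled logistic-regression probe stands in for whatever classifier the authors used) looks like this, run once per layer and compared by accuracy:

```python
import numpy as np

# Toy version of a per-layer probing setup: fit a small linear probe on
# activations to predict refusal, then compare layers by probe accuracy.
# In the paper's setting, accuracy exceeds 99% within the first 5 layers.

def train_probe(X, y, steps=500, lr=0.5):
    """Logistic regression via plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)   # clip to avoid overflow in exp
        p = 1 / (1 + np.exp(-z))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def probe_accuracy(X, y, w, b):
    return float((((X @ w + b) > 0).astype(int) == y).mean())
```

If such a probe already separates harmful from harmless prompts at layer 5, the "is this harmful?" information is present that early, and whatever jailbreaks break must happen downstream.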
They also use the LLM’s unembedding layer to interpret the activations in all the layers, and find that middle layers associate these concepts with emotion-related tokens. These embeddings are then transformed to refusal token predictions in late layers. Jailbreaking instead leads to positive emotions and “Sure” in middle layers. However, I didn’t find this evidence too convincing. It seems to underestimate the complexity of embedding spaces.
Who's asking? User personas and the mechanics of latent misalignment [GR] similarly unembeds tokens from early layers and also finds that harmful representations are present in the LLM and suppressed later on. Additionally, the paper looks at jailbreaks and finds that inciting personas are actually more effective than direct attempts at jailbreaking, especially if used via steering vectors. Using PatchScopes, the authors find that this is due to the personas causing the LLM to interpret harmful queries more charitably, e.g. as a fictional study.
Finally, Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models [LMU, Anthropic] looked at a variety of hand-crafted jailbreak prompts and created jailbreak vectors by taking the mean activation difference between a prompt and its jailbroken variant. Subtracting this vector for one type of jailbreak can defend against other jailbreaks, but the authors don’t analyze the effect on helpfulness. Clustering these vectors also shows some semantically meaningful directions and clusters. The paper also analyzes a harmfulness direction similar to the first paper, but the analysis didn’t seem very conclusive.
Overall, we now have multiple papers showing that refusal is mediated by a single direction and can be decided very early on in the model. However, it still remains unclear how exactly the refusal mechanism works and how we can ingrain refusal more deeply in the model so it’s not broken as easily. Future research will likely shed more light on this, hopefully with an eye on reliable interpretability tools that don’t cause interpretability illusions.
Interpretability: Second-order Effects, Transcoders, and SAEs
One reliable interpretability tool for activations in LLMs is activation patching. This method runs a model twice: first on a clean input and then on a corrupted one, e.g. with Gaussian noise added to the token embeddings. Adding the clean activations of specific layers back into the corrupted run can show which layers restore a certain behavior. This measures the layer’s indirect effect: “indirect”, because the effect goes through the layer’s activation, which thus acts as a mediating state.
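The recipe can be sketched on a toy "model" made of plain functions (no particular library's API implied):

```python
import numpy as np

# Minimal functional sketch of activation patching: a "model" is a stack
# of layer functions; we cache activations on a clean run, then splice a
# clean activation into the corrupted run and see if behavior recovers.

def run(layers, x, cache=None, patch=None):
    """Run layers on x; optionally record activations, or overwrite
    one layer's output with a supplied (clean) activation."""
    acts = []
    for i, layer in enumerate(layers):
        x = layer(x)
        if patch is not None and i == patch[0]:
            x = patch[1]          # splice in the clean activation
        acts.append(x)
    if cache is not None:
        cache.extend(acts)
    return x

layers = [lambda x: x + 1, lambda x: 2 * x, lambda x: x - 3]
clean_acts = []
clean_out = run(layers, np.array(1.0), cache=clean_acts)   # clean run
corrupted_out = run(layers, np.array(0.0))                 # corrupted run
patched_out = run(layers, np.array(0.0),
                  patch=(1, clean_acts[1]))                # restore layer 1
# If patched_out recovers the clean output, layer 1's activation mediates
# the behavior: its indirect effect is large.
```

In a real transformer the "layers" are attention and MLP blocks, and the patch targets specific token positions, but the logic is the same.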
Interpreting the Second-Order Effects of Neurons in CLIP [Berkeley] argues that this indirect effect is often inconclusive because other model components often make up for removing the target layer. The paper instead proposes to measure the “second-order effect”, which looks at a component’s effect on the unembedding layer mediated only by subsequent attention blocks (and not other MLPs). Apparently, this can lead to much better explanations than the full indirect effect.
Note that this paper refers to “direct” or “first-order” effects as the direct change a component causes on the unembedding layer, going only through the residual stream. This is in contrast to most activation patching papers, which understand the “direct effect” as being everything except the computation path that goes through the component under consideration. Direct effect plus indirect effect then add up to the total effect.
Sparse autoencoders (SAEs) continue to be all the rage, but Transcoders Find Interpretable LLM Feature Circuits [Columbia] argues that they are hard to use for explaining computational mechanisms in LLMs. Instead, the paper proposes “transcoders”: sparse encoders trained to map the input of an MLP layer to its output. The paper shows that transcoders find a similar quantity of interpretable features and achieve a similar reconstruction loss as SAEs. The authors then use transcoders for circuit analysis, looking into the greater-than circuit in GPT-2.
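The two objectives are easy to contrast in code. Names, shapes, and the exact sparsity penalty below are mine, not the paper's; both learn a sparse latent z, but an SAE reconstructs its own input while a transcoder predicts the MLP's output from the MLP's input:

```python
import numpy as np

# Sketch contrasting the SAE and transcoder training objectives.
# x is an MLP layer's input activation, mlp_out is that layer's output.

def sparse_forward(x, W_enc, b_enc, W_dec, b_dec):
    z = np.maximum(W_enc @ x + b_enc, 0.0)   # sparse features (ReLU)
    return W_dec @ z + b_dec, z

def sae_loss(x, params, lam=1e-3):
    # SAE: reconstruct the input itself, with an L1 sparsity penalty
    x_hat, z = sparse_forward(x, *params)
    return np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(z))

def transcoder_loss(x, mlp_out, params, lam=1e-3):
    # transcoder: predict the MLP's output from its input
    y_hat, z = sparse_forward(x, *params)
    return np.sum((mlp_out - y_hat) ** 2) + lam * np.sum(np.abs(z))
```

Because a trained transcoder approximates the MLP's input-output map through interpretable sparse features, it can be substituted for the MLP during circuit analysis, which an SAE cannot do.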
However, Interpreting Attention Layer Outputs with Sparse Autoencoders [independent, GDM] uses SAEs to interpret attention layer outputs and finds that at least 90% of heads are polysemantic. The authors also revisit induction heads and indirect object identification (IOI) and gather new insights for both circuits. This demonstrates that SAEs can indeed be used for mechanistic interpretation. Future research will show when SAEs and when transcoders are more useful.
LLMs Learning Reward Hacking
Reward hacking, or specification gaming, happens when a model finds a way to obtain high reward in an unintended way. For example, an LLM might learn to be sycophantic and answer questions in the way most flattering to the user instead of giving a truthful response. Reward hacking like this is bad but not catastrophic, and rather easy to detect. However, LLMs might start with simple, common forms of specification gaming and bootstrap up to more complex and extreme ones. This might even result in reward tampering, where the model directly hacks its own reward mechanism.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models [Anthropic, Oxford, Redwood] studies whether such a cascade can in principle happen in the real world. They construct a curriculum of increasingly complex specification gaming scenarios and find that this curriculum indeed leads to more specification gaming in later scenarios. In rare cases, models even start to change their own code to obtain higher reward. Adding the common helpful, honest, and harmless (HHH) reward does not prevent this behavior. Retraining the model after the curriculum to reduce specification gaming in early scenarios only somewhat reduces reward tampering. This suggests that removing these tendencies after the fact might be difficult.
Unlearning by Remembering
Among the April papers, we reported significant advances in unlearning, a research direction that aims to remove knowledge, and hopefully eventually also capabilities, from models. Reversing the Forget-Retain Objectives: An Efficient LLM Unlearning Framework from Logit Difference [UCSB, MIT, Cisco, MSU] proposes to flip the regular approach around: instead of forgetting the unlearning set and remembering the retain set, they train a model to remember the unlearning set and forget the retain set. Afterwards, they subtract the logits of this fine-tuned “assistant LLM” from the base model’s logits to create outputs.
Intuitively, you can think of this as essentially switching from learning an allowlist to learning a blocklist. This works since it is easier to learn the small restricted set of inputs than the large, open set of capabilities to retain. It seems likely to me that this works well on current benchmarks like TOFU, but fails once the blocklist becomes similarly diverse to the allowlist, especially once we move from forgetting knowledge to forgetting full sets of capabilities.
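At inference time, the combination can be sketched as a simple logit subtraction. The paper's actual combination rule is derived more carefully, so treat the formula and `alpha` below as my simplification:

```python
import numpy as np

# Hedged sketch of logit-difference unlearning at inference time: the
# "assistant" model was trained to *remember* the forget set, so
# subtracting a scaled version of its logits from the base model's
# logits suppresses exactly that knowledge.

def unlearned_logits(base_logits, assistant_logits, alpha=1.0):
    return np.asarray(base_logits) - alpha * np.asarray(assistant_logits)

def sample_greedy(logits):
    return int(np.argmax(logits))
```

On tokens where the assistant is confident (it has memorized the forget set), the subtraction pushes the combined distribution away from the memorized answer; on everything else, the assistant's logits are closer to uniform and the base model's behavior dominates.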