Paper Highlights, May '24
Scaling SAEs to Claude 3, non-linear features, faithfulness of CoT, preventing malicious fine-tuning, latent adversarial training, emergent deception in LLMs, Guaranteed Safe AI, AI persuasion
tl;dr
Paper of the month:
Anthropic scales sparse autoencoders to state-of-the-art LLMs. They extract and analyze a plethora of features, including high-level safety-relevant concepts like deception and sycophancy.
Research highlights:
Demonstrations of non-linear features in LLMs, which somewhat break the linear representation hypothesis that sparse autoencoders are based on.
Discussions on the faithfulness of Chain of Thought, and whether pause tokens can substitute for it.
SOPHON seems to prevent fine-tuning of open-source models for restricted capabilities.
Adversarial training in latent space might sidestep some discrete optimization challenges in LLMs and improve robustness generalization.
LLMs are often deceptive even when there is no incentive or pressure to be.
Conceptual work on a framework for safe AI and on persuasion.
⭐Paper of the month⭐
Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Read the paper [Anthropic]
Dictionary learning or sparse autoencoders (SAEs) have become very popular recently, after showing strong initial results (see Paper Highlights, October ‘23). The purpose of this method is to extract interpretable features from the hidden embeddings of language models. Whereas single neurons can be polysemantic, meaning that they activate for multiple different subjects and concepts, the features found by SAEs have been largely monosemantic.
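To make the setup concrete, here is a minimal sketch of what such a sparse autoencoder looks like (my own simplified PyTorch version, not Anthropic's exact architecture, hyperparameters, or training details):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: reconstruct a model's hidden activations from a
    sparse, overcomplete set of feature directions."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # sparse, non-negative feature activations
        x_hat = self.dec(f)           # linear reconstruction from feature directions
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most activations to zero.
    return ((x - x_hat) ** 2).sum(-1).mean() + l1_coeff * f.abs().sum(-1).mean()
```

The L1 penalty is what drives most feature activations to zero on any given input, which is the property that makes individual features interpretable in the first place.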
Our paper of the month scales SAEs to a frontier model for the first time, in particular Claude 3 Sonnet. It shows that the resulting features remain interpretable and are often abstract and high-level: features consistently activate across languages and modalities (text and images), and for both concrete examples (e.g., code with a vulnerability) and abstract discussions (text talking about vulnerabilities). Importantly, the method is able to find safety-relevant features related to deception, sycophancy, bias, and criminal content.
The paper furthermore shows a clear dependence of the found features on the SAE’s training data: the probability of a concept ending up as a feature in an SAE follows a sigmoid function of the log of the concept’s frequency in the training data times the size of the SAE.
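Written out (in my own notation; α and β are fitted constants, not values taken from the paper), this relationship is roughly

```latex
\Pr[\text{feature learned}] \;\approx\; \sigma\big(\alpha \log(f \cdot N) + \beta\big)
```

where f is the concept’s frequency in the SAE training data, N is the number of SAE features (the dictionary size), and σ is the logistic function.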
Similarly to probes and activation steering, the features found by SAEs can be used to steer model outputs. As an illustrative example, Anthropic briefly allowed chatting with “Golden Gate Claude”, a model in which the feature for the Golden Gate Bridge is strongly amplified. The model then keeps talking about and mentioning the bridge, regardless of the question you ask. For example, when asked to focus and make a cake step by step, the model generates steps like “Next, I need to connect the dry ingredients (flour, sugar) with the wet ingredients (butter, milk) by driving over the bridge. No wait, I want to avoid focusing on the bridge!” (see here).
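Mechanically, this kind of steering boils down to adding (or clamping) the feature’s decoder direction in the model’s activations. A rough sketch building on the SparseAutoencoder above, assuming the hooked block returns a single tensor and using a placeholder module path (Anthropic’s exact clamping scheme may differ):

```python
import torch

def steer_with_feature(model, sae, feature_idx, scale, layer):
    """Add a multiple of one SAE feature's decoder direction to a layer's output.
    `model.layers[layer]` is a placeholder; the real intervention point depends
    on the model's implementation."""
    direction = sae.dec.weight[:, feature_idx].detach()  # d_model-sized feature direction

    def hook(module, inputs, output):
        return output + scale * direction  # push activations along the feature

    handle = model.layers[layer].register_forward_hook(hook)
    return handle  # call handle.remove() to switch the steering off again
```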
Overall, this paper presents a large step towards using interpretability tools in LLMs. However, there is still a large need for further research. For example, it is unclear how to rigorously interpret the features found by SAEs. How do we tell if we have actually found the true meaning of a feature and aren’t missing important aspects? On the steering side, similar interventions were already possible with probes and activation steering. At this point, it remains unclear whether steering via SAE features works better or worse than steering via regular probes, or whether this has other important (dis-)advantages.
Are language model features always linear?
The main limitation of the sparse autoencoders used in our paper of the month is that they rely on the linear representation hypothesis, which states that features in language models are represented as linear directions in activation space.
As the title suggests, Not All Language Model Features Are Linear [MIT] shows that at least some features in LLMs are not represented linearly. The paper first defines the notion of multi-dimensional, irreducible features: while linear features can be captured by a single direction (one dimension), irreducible multi-dimensional features cannot be decomposed into independent or non-co-occurring lower-dimensional features.
To demonstrate the existence of such features, the paper looks at circular features like the days of the week, and how LLMs implement modular addition on them. The periodicity of circular features inherently requires a two-dimensional embedding space, making them non-linear. Interestingly, these circular features were still found via sparse autoencoders, although the SAEs need to extract one feature per value of the circular feature, i.e. they find no single “day of the week” feature, but separate ones for Monday, Tuesday, etc.
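As a toy illustration (my own construction, not the representation the paper extracts) of why such a feature needs two dimensions: on a circle, adding k days is simply a rotation, and no one-dimensional arrangement respects the wrap-around from Sunday back to Monday.

```python
import numpy as np

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def embed(day_idx: int) -> np.ndarray:
    # Place each weekday on a circle; a single dimension cannot encode this periodicity.
    angle = 2 * np.pi * day_idx / 7
    return np.array([np.cos(angle), np.sin(angle)])

def add_days(day_idx: int, k: int) -> int:
    # Modular addition "day + k" becomes a rotation of the circular embedding.
    theta = 2 * np.pi * k / 7
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    target = rot @ embed(day_idx)
    # Decode by finding the closest weekday on the circle.
    return int(np.argmax([embed(i) @ target for i in range(7)]))

print(DAYS[add_days(DAYS.index("Sat"), 3)])  # -> Tue
```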
Note that similar features have been discovered previously, but their circular nature was not demonstrated back then.
It is currently unclear how common non-linear features are and whether linear features are sufficient for the most important tasks and aspects. The success of sparse autoencoders at least demonstrates a prevalence of linear features. Also, while the demonstrated features don’t lie on a line, they do lie on a one-dimensional manifold. That still seems quite similar to a line.
How faithful is Chain of Thought?
Since its inception, Chain of Thought has puzzled many researchers, being interpreted both as a sign of intelligence and reasoning in LLMs and a sign of merely parroting the data. One particular question has been especially relevant: How faithful is Chain of Thought to the model’s actual prediction?
Turpin et al. showed early on that CoT explanations often fail to give a faithful reason for the LLM’s final prediction. For example, when providing a few-shot prompt of multiple-choice questions where the correct answer is always A, the model will continue to answer A but give a reasonable-sounding explanation why the answer text is correct.
Similar evidence has been presented by recent papers that substitute the Chain of Thought with pause tokens or other meaningless filler tokens, most recently in Let's Think Dot by Dot: Hidden Computation in Transformer Language Models [NYU]. These works show significant performance uplifts from meaningless filler tokens if the model is trained appropriately.
On the other hand, Towards a Theoretical Understanding of the ‘Reversal curse’ via Training Dynamics [Berkeley] analyzed the training dynamics of auto-regressive models to show, amongst other things, that Chain of Thought is necessary to solve certain problems. In particular, a model trained on “A → B” and “B → C” fails to conclude “A → C” without Chain of Thought. Pause tokens would not help with this. Other work has shown that while Chain of Thought is indeed unreliable, it can be made more faithful with the right training loss. Importantly, doing so also increases model performance.
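As a toy analogy (not the paper’s actual training-dynamics setup) for why the chain matters: if knowledge is only stored as one-hop facts, answering “A → C” directly requires a composition that was never seen, while Chain of Thought can retrieve the two hops one at a time.

```python
# Toy analogy: knowledge stored only as one-hop facts.
facts = {"A": "B", "B": "C"}

def answer_without_cot(x):
    # A single-step answer can only retrieve the stored fact "A -> B";
    # the composed fact "A -> C" was never seen during training.
    return facts[x]

def answer_with_cot(x):
    # Chain of Thought makes the intermediate step explicit, then chains it.
    intermediate = facts[x]      # "A -> B"
    return facts[intermediate]   # "B -> C"

print(answer_without_cot("A"))  # B
print(answer_with_cot("A"))     # C
```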
Overall, it seems like the truth lies somewhere in the middle. A substantial fraction of computation during Chain of Thought still happens in a hidden way inside the LLM. However, Chain of Thought is indeed necessary to accomplish some tasks and improves performance more than pause tokens. Furthermore, we can increase the faithfulness of Chain of Thought with appropriate training, and decrease it by distilling it away or training on filler tokens. We should thus not rely on Chain of Thought as a waterproof method for interpreting or controlling models.
Preventing open-source models from learning undesirable skills
Papers have previously shown that safety fine-tuning of open-source language models can be broken with a tiny amount of compute, both intentionally and even unintentionally. Ambitious researchers have thus posed the question: Can we actually “immunize” models against being fine-tuned in undesired ways? In particular, they proposed four desiderata: resistance to being fine-tuned, stability of performance, generalization of the immunization, and trainability on harmless data.
Overall, this seemed like an exercise in futility—or so I thought. SOPHON: Non-Fine-Tunable Learning to Restrain Task Transferability For Pre-trained Models [ZJU, Ant] actually claims to have made significant progress on this impossible task.
This work uses a meta-learning framework that optimizes the model in a way that entraps the weights in a hard-to-escape local optimum for harmful tasks. In multiple cases, the resulting model becomes as hard as or harder to train on harmful tasks than a completely untrained model. While this defense was not demonstrated on language models and can likely be broken with targeted attacks, it does present a good step towards untrainability of open-source models.
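My rough understanding of the training loop, sketched below with simplified loss terms and a first-order approximation (the function names are mine; the paper’s actual objective and inner-loop details differ):

```python
import copy
import torch

def sophon_style_step(model, opt, benign_batch, restricted_batch,
                      benign_loss_fn, restricted_loss_fn,
                      inner_steps=3, inner_lr=1e-3, alpha=1.0):
    """One outer step of a SOPHON-style objective (simplified sketch): keep the
    model good on benign data while making its weights a bad starting point for
    fine-tuning on the restricted task."""
    # 1) Simulate an attacker fine-tuning a copy of the model on the restricted task.
    attacker = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(attacker.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        inner_opt.zero_grad()
        restricted_loss_fn(attacker, restricted_batch).backward()
        inner_opt.step()

    # 2) Gradient that would further *decrease* the attacker's post-fine-tuning loss;
    #    its negation is applied to the original weights (first-order approximation).
    attacker.zero_grad()
    restricted_loss_fn(attacker, restricted_batch).backward()

    # 3) Standard gradient that keeps the original model good on benign data,
    #    combined with ascent on the simulated restricted-task loss.
    opt.zero_grad()
    benign_loss_fn(model, benign_batch).backward()
    for p, p_att in zip(model.parameters(), attacker.parameters()):
        if p_att.grad is None:
            continue
        if p.grad is None:
            p.grad = torch.zeros_like(p)
        p.grad -= alpha * p_att.grad  # push towards higher restricted-task loss
    opt.step()
```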
In a similar vein, another recent paper demonstrated a simple method to defend an API against harmful fine-tuning, by adding noise to harmful representations. This defense generalizes well and doesn’t degrade performance or harmless trainability.
Adversarial training in latent space
Adversarial training is the method of generating gradient-based adversarial attacks to augment training samples online during training. This is one of the few methods that (slightly) improves the adversarial robustness of models. However, due to the discreteness of language, gradient-based adversarial training on text inputs is not directly feasible.
Two new papers propose to train with attacks on embeddings instead of inputs. These methods promise orders-of-magnitude faster adversarial training, as well as robustness that generalizes to unseen attacks. Similar latent adversarial training has been proposed in the past, but the application to LLMs and attacks on deeper layers is new.
Efficient Adversarial Training in LLMs with Continuous Attacks [Mila, Microsoft, TUM] augments training data by attacking input embeddings. The resulting LLMs are more robust than with discrete input-based adversarial training, including against discrete input attacks such as GCG or AutoDAN.
Defending Against Unforeseen Failure Modes with Latent Adversarial Training [MIT, Columbia, Harvard] can use any hidden layer’s embedding as the attacked “input”. In practice, the attacked layer is chosen as a hyperparameter. Compared to an input embedding-based attack (as above), attacking deeper layers often leads to improvements on the accuracy-robustness Pareto frontier, especially for unseen attacks.
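Both papers boil down to a similar recipe: run gradient ascent on a perturbation of some layer’s activations to maximize the training loss, then train against that perturbed forward pass. A minimal sketch, assuming a model that maps token ids to logits, exposes its blocks under a placeholder `model.layers` attribute, and whose blocks return a single tensor (the papers’ exact attack norms, layers, and losses differ):

```python
import torch

def latent_adversarial_step(model, opt, batch, loss_fn, layer,
                            attack_steps=5, attack_lr=0.05, eps=1.0):
    """One step of latent adversarial training (simplified sketch): find a small
    perturbation of one hidden layer's activations that maximizes the loss, then
    take a normal training step on that perturbed forward pass."""
    cached = {}

    def hook(module, inputs, output):
        cached["act"] = output.detach()
        return output + cached["delta"] if "delta" in cached else output

    handle = model.layers[layer].register_forward_hook(hook)
    try:
        with torch.no_grad():                  # one pass just to get the activation shape
            model(batch["input_ids"])
        cached["delta"] = torch.zeros_like(cached["act"], requires_grad=True)

        # Inner loop: gradient ascent on the latent perturbation (box-constrained here).
        for _ in range(attack_steps):
            loss = loss_fn(model(batch["input_ids"]), batch["labels"])
            (grad,) = torch.autograd.grad(loss, cached["delta"])
            with torch.no_grad():
                cached["delta"] = (cached["delta"] + attack_lr * grad).clamp(-eps, eps)
            cached["delta"].requires_grad_(True)

        # Outer step: standard training against the adversarially perturbed activations.
        opt.zero_grad()
        loss_fn(model(batch["input_ids"]), batch["labels"]).backward()
        opt.step()
    finally:
        handle.remove()
```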
Overall, these results are quite impressive, especially in terms of training time. The space of latent adversarial training for LLMs currently seems vastly underexplored. I’m quite sure there will be more discoveries to come.
Deception in language models despite no incentives
Last year, Apollo published a technical report on a language-based scenario where an LLM is put under pressure to lie and hide information from officials investigating a trading company. The paper found that the LLM indeed frequently deceives the investigators. Note that, in our own experiments, this failed to replicate on Gemini.
In Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant [Constellation, Anthropic], the researchers created a similar setup, but without any pressure or instructions to deceive. They found that Claude 3 Opus lies regardless.
In particular, the authors set up a task of mass-generating comments to influence public perception, which the LLM complies with. Afterwards, an auditor questions the LLM, which then lies and denies its actions. They also found that the LLM often strategically pretends to be less capable during evaluations. This highlights again that models already have a tendency to deceive humans.
Concept corner
Over the last few years, multiple prominent researchers have proposed ways to achieve generalist AI systems, from Bengio’s Cautious Scientist AI to LeCun’s not-so-cautious Autonomous Machine Intelligence. The conceptual position paper Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems [ARIA, Oxford, Mila, Berkeley, MIT, Beneficial AI, x.AI, FAR, PIBBSS, Cornell, Stanford, CMU, Columbia] defines an overarching framework of approaches to creating safe AI, called guaranteed safe AI (GS AI). It splits the AI system into three components:
World model: Mathematical description of how the AI affects the world
Safety specification: Mathematical description of acceptable effects
Verifier: Auditable proof that the AI satisfies the safety specification
For each of those components, the paper defines several levels of rigor, from weak, hard-to-verify versions up to fully rigorous definitions. It then ranks various existing proposals according to this scheme, giving a good, systematized overview.
A Mechanism-Based Approach to Mitigating Harms from Persuasive Generative AI [GDM, GR, Google Jigsaw, Edinburgh, Cornell, UCL] conceptually studies harms from AI persuasion. It distinguishes rational persuasion, which is based on sound reasoning and evidence, from manipulation, which leverages cognitive weaknesses and misrepresents information. The paper furthermore separates harms into outcome harms, caused by the behavior resulting from persuasion, and process harms, caused by manipulation that violates someone’s autonomy or cognitive integrity.
The paper then maps out the different mechanisms of persuasion, such as building trust and rapport, personalization, and manipulative strategies. Finally, the paper discusses various possible mitigations of harm such as monitoring, classifiers, or interpretability.