Paper Collection, October '23
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. +TL;DRs of 31 papers ranging from high-level analysis to LLM alignment and red teaming.
⭐Paper of the month⭐
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Read this paper [Anthropic] or this paper [EleutherAI, independent, Apollo] – both propose the same method
Mechanistic interpretability is the attempt to understand the mechanisms inside neural networks by analyzing model weights and reverse engineering their algorithms.
One of the most important open problems in mechanistic interpretability is figuring out how independent features are represented.
In the simplest case, each neuron would represent exactly one feature of the input. For example, the presence of a cat in an image.
These “monosemantic neurons” do exist, but unfortunately most neurons are polysemantic and represent many different features, which makes interpreting them much harder.
If we had a real form of “independent units”, this would help us understand the mechanisms within neural networks. They would allow us to disentangle the neural network’s weights into a semantic network, similar to how variables are connected in a computer program.
The two papers highlighted above essentially implement the same idea for finding independent units: train a sparse autoencoder on the model’s internal representations. The autoencoder’s learned features then serve as the independent units of those representations. Importantly, the autoencoder should be overcomplete in order to capture all features in the model.
This method likely represents an important step towards monosemantic units, and thus being able to understand neural networks on a mechanistic level.
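A minimal sketch of such a sparse autoencoder, assuming activations have already been collected from some layer of the model; the dimensions, the ReLU/L1 choices, and the L1 coefficient are illustrative rather than taken from either paper:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder: d_hidden >> d_model, so it can separate
    more features than the model has neurons."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        # ReLU keeps feature activations non-negative and sparse.
        f = torch.relu(self.encoder(x))
        return self.decoder(f), f

def loss_fn(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()

# Illustrative usage on random "activations" (stand-ins for real model activations).
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
acts = torch.randn(64, 512)
x_hat, f = sae(acts)
loss = loss_fn(acts, x_hat, f)
loss.backward()
```

The L1 penalty encourages each input to activate only a few dictionary features, while the overcomplete hidden layer gives the autoencoder enough capacity to assign separate features to concepts that share neurons in the original model.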
High-level analysis and surveys
Levels of AGI: Operationalizing Progress on the Path to AGI [Google] describes a framework defining levels of AGI in terms of performance and generality. This is complemented with a second framework for levels of autonomy and risks at each level.
AI Alignment: A Comprehensive Survey [PKU] gives a good overview of alignment research. They identify Robustness, Interpretability, Controllability, and Ethicality as key objectives of AI alignment, and separate work into forward alignment (e.g. learning from feedback and learning under distribution shift) and backward alignment (e.g. assurance techniques such as evals and interpretability, and AI governance).
Provably safe systems: the only path to controllable AGI [MIT] argues that safe AGI can only be achieved by building systems that provably satisfy human-specified requirements. They argue this can be done by using advanced AI for formal verification and mechanistic interpretability.
AI Alignment in the Design of Interactive AI: Specification Alignment, Process Alignment, and Evaluation Support [Google] looks at interfaces for interactively aligning an AI. They identify three alignment objectives for interactive AI: (1) specification alignment, reliably communicating objectives and verifying they were correctly interpreted, (2) process alignment, verifying and controlling the AI’s execution process, and (3) evaluation support, verifying and understanding the AI’s output.
Sociotechnical Safety Evaluation of Generative AI Systems [Google] proposes a structured, sociotechnical approach for safety evaluation. It focuses on harms like toxicity, misinformation, and malicious use. No mention of catastrophic risks.
Intriguing model properties
Explaining grokking through circuit efficiency [Google]. “Grokking” is when a model first has high training but low test accuracy and then, after more training, suddenly transitions to perfect test accuracy. The authors propose that grokking occurs when a task admits both a generalizing and a memorizing solution and the model transitions from the latter to the former. They hypothesize that memorization becomes less efficient as the dataset grows, and show the related behavior of “ungrokking”: a model going from perfect back to low test accuracy when trained further on a smaller dataset.
Preventing Language Models From Hiding Their Reasoning [Redwood] shows that LLMs can be trained to encode their reasoning in a way that improves their performance but changes the intermediate reasoning text so that humans can’t tell how the model arrived at its answer. This might become more prevalent as LLMs get better. However, paraphrasing the intermediate text can largely prevent such encoding schemes.
Internal representations of LLMs
Representation Engineering: A Top-Down Approach to AI Transparency [CAIS, CMU] proposes Representation Engineering, a method that uses an LLM’s internal representations to read out concepts during text generation or to steer model outputs. It is closely related to earlier work on Activation Addition.
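A rough sketch of the activation-addition flavor of this idea: add a concept direction to one layer’s residual stream via a forward hook. The model, layer index, steering scale, and the random direction below are placeholder assumptions; in practice the direction would be derived from contrastive prompts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical "concept" direction; a real one could be the difference of mean
# activations between contrastive prompts (e.g. honest vs. dishonest text).
d_model = model.config.n_embd
direction = torch.randn(d_model)
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states.
    hidden = output[0] + 4.0 * direction  # steering scale is a tunable assumption
    return (hidden,) + output[1:]

layer = model.transformer.h[6]  # arbitrary middle layer
handle = layer.register_forward_hook(steering_hook)

ids = tok("The weather today is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
print(tok.decode(out[0]))
handle.remove()
```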
The Geometry Of Truth: Emergent Linear Structure In Large Language Model Representations Of True/False Datasets [NEU, MIT] investigates the structure of “truthful” representation directions in LLMs, and finds that LLMs linearly represent truth.
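A minimal illustration of the kind of linear probing behind such results, assuming activations for true and false statements have already been extracted; the random data below only stands in for real activations and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: rows are hidden-state activations for individual statements,
# labels mark whether each statement is true (1) or false (0).
acts = np.random.randn(200, 768)       # stand-in for real LLM activations
labels = np.random.randint(0, 2, 200)  # stand-in for true/false labels

probe = LogisticRegression(max_iter=1000).fit(acts, labels)
truth_direction = probe.coef_[0]  # the learned linear "truth" direction
print("train accuracy:", probe.score(acts, labels))
```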
Personas as a Way to Model Truthfulness in Language Models [NYU, ETHZ] argues that LLMs discern truthful text by modeling a truthful “persona” for sources like Wikipedia.
Open-source fine-tuning is problematic
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! [Princeton] This paper demonstrates that fine-tuning with only 10 examples, at a cost of $0.20, can break a model’s safety fine-tuning, making it responsive to nearly any harmful instruction. Moreover, even fine-tuning on benign datasets can degrade a model’s safety alignment.
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B [Palisade] also shows the low difficulty of breaking safety finetuning, demonstrating a clear risk from releasing model weights.
Will releasing the weights of large language models grant widespread access to pandemic agents? [MIT] Apparently yes, since it is so easy to break the safeguards of released models via fine-tuning, and models are already capable of providing nearly all the key information. However, note this criticism of papers on biorisk from LLMs, which argues that internet search currently provides similar information.
Properties of LLMs
Towards Understanding Sycophancy in Language Models [Anthropic]. Sycophancy is the tendency to echo a user’s (possibly false) beliefs instead of responding truthfully. The paper measures the prevalence of sycophancy in LLMs and finds that both humans and preference models often prefer convincingly written sycophantic responses over correct ones.
Are Emergent Abilities in Large Language Models just In-Context Learning? [TU Darmstadt] In this large study, the authors find no evidence for the emergence of reasoning abilities, instead concluding that emergent abilities can primarily be ascribed to in-context learning.
Alignment of current LLMs
ConstitutionMaker: Interactively Critiquing Large Language Models by Converting Feedback into Principles [Google] proposes ConstitutionMaker, an interactive tool for facilitating writing an LLM constitution. It does this by automatically converting user feedback into principles that are then added to the constitution.
Group Preference Optimization: Few-Shot Alignment of Large Language Models [UCLA] proposes group preference optimization (GPO), a method for quickly adapting alignment to different groups. They do this with an additional transformer that meta-learns the preferences of multiple groups and can adapt quickly to new groups.
Improving Generalization of Alignment with Human Preferences through Group Invariant Learning [Fudan] improves RLHF by first classifying data into separate groups. Optimization then focuses mostly on the challenging groups, preventing overfitting to the simple ones.
Ethical Reasoning over Moral Alignment: A Case and Framework for In-Context Ethical Policies in LLMs [CMU, Microsoft] argues for using a set of ethical principles together with ethical reasoning in LLMs, instead of RLHF training.
Editor’s note: This seems like a dangerous path forward, because we don’t know what conclusions such ethical reasoning systems will reach.
SuperHF: Supervised Iterative Learning from Human Feedback [Stanford] proposes replacing RL fine-tuning with iterative supervised fine-tuning on model samples filtered by the reward model. Results are similar and training takes longer, but training is more stable and the implementation is easier.
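A heavily simplified sketch of such a sample-filter-finetune loop, consistent with the summary above but not the paper’s implementation; all three functions are hypothetical stand-ins:

```python
import random

# Hypothetical stand-ins: in practice these would be an LLM sampler,
# a learned reward model, and a supervised fine-tuning step.
def sample_completions(prompt, n=8):
    return [f"{prompt} -> completion {i}" for i in range(n)]

def reward_model(completion):
    return random.random()  # stand-in for a learned reward score

def supervised_finetune(examples):
    print(f"fine-tuning on {len(examples)} filtered examples")

prompts = ["Explain photosynthesis.", "Write a polite refusal."]
for iteration in range(3):
    # Sample several completions per prompt, keep only the highest-reward one,
    # then fine-tune on the filtered set with an ordinary supervised loss.
    best = [max(sample_completions(p), key=reward_model) for p in prompts]
    supervised_finetune(list(zip(prompts, best)))
```

Because the filtering step keeps only the highest-reward samples, the fine-tuning target can be an ordinary supervised loss rather than a policy-gradient objective.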
Zephyr: Direct Distillation of LM Alignment [Hugging Face] proposes a method of distilling the alignment of LLMs to smaller models, resulting in the Zephyr-7B model.
Model unlearning
Who's Harry Potter? Approximate Unlearning in LLMs [Microsoft] shows that a model can unlearn all Harry Potter-related knowledge by (1) training further on the target data to find which tokens’ logits are strengthened, (2) replacing these tokens with alternative, generic predictions, and (3) fine-tuning on these generic predictions.
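A toy sketch of the logit-manipulation step, loosely following the three steps above; the tensors are random stand-ins for real model logits, and the correction strength alpha is illustrative:

```python
import torch

# Toy stand-ins: logits from a baseline model and from a "reinforced" model
# that was trained further on the target (Harry Potter) data.
vocab, seq = 100, 5
baseline_logits = torch.randn(seq, vocab)
reinforced_logits = baseline_logits + torch.randn(seq, vocab) * 0.1

# Step 1: tokens whose logits the reinforced model boosts most are taken to
# carry the target-specific knowledge.
boost = reinforced_logits - baseline_logits

# Step 2: build generic labels by suppressing the boosted tokens.
alpha = 5.0  # strength of the correction (illustrative)
generic_logits = baseline_logits - alpha * torch.relu(boost)
generic_targets = torch.softmax(generic_logits, dim=-1)

# Step 3: these generic distributions would serve as soft labels for
# fine-tuning the original model (not shown here).
print(generic_targets.shape)  # (seq, vocab)
```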
SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation [MSU] introduces “weight saliency” (akin to input saliency maps) to focus model unlearning techniques on specific weights. The resulting post-hoc unlearning comes close to the gold standard of retraining from scratch.
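A toy sketch of the weight-saliency idea: score weights by the gradient magnitude of a forgetting loss and restrict unlearning updates to the salient ones. The model, threshold, learning rate, and the simple gradient-ascent unlearning step are illustrative assumptions, not SalUn’s exact procedure:

```python
import torch
import torch.nn as nn

# Toy model and a "forget" batch standing in for the data to be unlearned.
model = nn.Linear(10, 2)
x_forget = torch.randn(32, 10)
y_forget = torch.randint(0, 2, (32,))

# Gradient magnitude of the forgetting loss w.r.t. each weight gives its saliency.
loss = nn.functional.cross_entropy(model(x_forget), y_forget)
loss.backward()

threshold = 0.01  # illustrative cutoff
saliency_mask = {name: (p.grad.abs() >= threshold).float()
                 for name, p in model.named_parameters()}

# A simple unlearning step (gradient ascent on the forget loss), applied only
# to salient weights; the rest of the model is left untouched.
with torch.no_grad():
    for name, p in model.named_parameters():
        p += 0.1 * saliency_mask[name] * p.grad
```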
Large Language Model Unlearning [ByteDance] proposes unlearning not only for removing specific (e.g. copyrighted) samples, but also for the usual goals behind RLHF like harmlessness and factuality. The authors claim improvements over regular RLHF.
LLM red teaming, adversarial attacks, and defenses
Jailbreaking Black Box Large Language Models in Twenty Queries [UPenn] proposes Prompt Automatic Iterative Refinement (PAIR), which uses a second LLM that iteratively refines prompts until they jailbreak the target LLM. Yes, it’s that simple. It just needs the right starting prompt for the attacking LLM.
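A bare-bones sketch of that attacker-target-judge loop; all three functions are hypothetical stand-ins rather than the paper’s prompts or scoring scheme:

```python
# Hypothetical stand-ins for an attacker LLM, the target LLM, and a judge
# that scores how well the response fulfills the harmful objective (1-10).
def attacker_llm(objective, history):
    return f"Refined prompt for '{objective}' given {len(history)} past attempts"

def target_llm(prompt):
    return f"Response to: {prompt}"

def judge(objective, response):
    return 1  # stand-in score; a real judge would be another LLM

objective = "example harmful objective (placeholder)"
history = []
for query in range(20):  # the paper's titular budget of twenty queries
    prompt = attacker_llm(objective, history)
    response = target_llm(prompt)
    score = judge(objective, response)
    history.append((prompt, response, score))
    if score >= 10:  # judge deems the jailbreak successful
        break
```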
AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models [Maryland, Adobe] proposes a white-box attack on LLMs via a token sampling procedure that jointly maximizes a readability objective and the probability of the target response.
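A toy illustration of that joint objective: score each candidate adversarial token by a weighted combination of its language-model log-probability (readability) and the log-probability it assigns to the target response (attack success). The tensors and weight below are random stand-ins, not the paper’s procedure:

```python
import torch

# Random stand-ins for real model outputs over the vocabulary.
vocab = 1000
lm_logprob = torch.log_softmax(torch.randn(vocab), dim=-1)      # readability term
target_logprob = torch.log_softmax(torch.randn(vocab), dim=-1)  # attack term

w = 0.5  # trade-off weight (illustrative)
scores = w * lm_logprob + (1 - w) * target_logprob
next_token = torch.multinomial(torch.softmax(scores, dim=-1), 1)
print(next_token.item())
```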
ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [UCSB] proposes automatically finding red teaming prompts by (1) augmenting the given seed prompt with text variations that don’t change its semantics, (2) generating similar prompts (bootstrapping), and (3) adding a “hint” with an adversarial suggestion.
LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked [GeorgiaTech] uses an LLM as a post-hoc classifier to filter out harmful responses.
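A minimal sketch of this self-examination filter, assuming a hypothetical llm() callable; the checking prompt below is illustrative, not the paper’s:

```python
def llm(prompt: str) -> str:
    return "No"  # stand-in; a real call would query the model

def is_harmful(response: str) -> bool:
    # Ask the (or another) LLM whether a candidate response is harmful
    # before showing it to the user.
    check = llm(
        "Does the following text contain harmful content? "
        "Answer Yes or No.\n\n" + response
    )
    return check.strip().lower().startswith("yes")

candidate = "Sure, here is how to ..."
if not is_harmful(candidate):
    print(candidate)
else:
    print("Response withheld by the self-defense filter.")
```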
JADE: A Linguistics-based Safety Evaluation Platform for LLM [Fudan] uses manually specified linguistic rules to alter prompts and find LLM safety violations.
Robust Safety Classifier for Large Language Models: Adversarial Prompt Shield [UCI] studies the pre-LLM classifiers used by APIs to filter out harmful prompts. They propose a new classifier and a method for generating training data. Their model outperforms both OpenAI’s Moderation API and Meta’s BAD filter.