Scaling SAEs to Claude 3, non-linear features, faithfulness of CoT, preventing malicious fine-tuning, latent adversarial training, emergent deception in LLMs, Guaranteed Safe AI, AI persuasion
Paper Highlights, May '24