Scaling SAEs to Claude 3, non-linear features, faithfulness of CoT, preventing malicious fine-tuning, latent adversarial training, emergent deception in LLMs, Guaranteed Safe AI, AI persuasion
Paper Highlights, May '24