Lots of Gemma SAEs, boardgames for SAEs, magnitude-based features, defending against fine-tuning, multi-turn jailbreaks, "evil twin" attacks, poisoning large models, and more risk taxonomies.
Paper Highlights, August '24