Lots of Gemma SAEs, boardgames for SAEs, magnitude-based features, defending against fine-tuning, multi-turn jailbreaks, "evil twin" attacks, poisoning large models, and more risk taxonomies.
Paper Highlights, August '24