Emergent misalignment, LLM value systems, recursive critiques, no canonical SAE features, refusal cones, REINFORCE attacks, and the case for theoretical understanding
Share this post
Paper Highlights, February '25
Share this post
Emergent misalignment, LLM value systems, recursive critiques, no canonical SAE features, refusal cones, REINFORCE attacks, and the case for theoretical understanding