Sabotage evaluations, harmful agent capabilities, model self-prediction, rare output estimation, LLM steganography, SAE residuals, and agent contracts.
Paper Highlights, October '24