Sabotage evaluations, harmful agent capabilities, model self-prediction, rare output estimation, LLM steganography, SAE residuals, and agent contracts.
Paper Highlights, October '24