Evaluation awareness and faking, AI value trade-offs, misalignment propensity, reward hacking, CoT monitoring, training against lie detectors, and exploring the landscape of refusals
Paper Highlights, May '25