Faking alignment, measuring scheming capabilities, countering deception and sandbagging with model internals, and obfuscating model internals.
Paper Highlights, December '24