Faking alignment, measuring scheming capabilities, countering deception and sandbagging with model internals, and obfuscating model internals.
Great work, as always!