Sleeper agents, weak-to-strong generalization, supervision with subversion, problems with unsupervised LLM knowledge, patching activations, explainability benchmark, replication evals, DPO analysis
Paper Collection, December '23