Unintentional RLHF exploitation, unlearning doesn't unlearn, backdoor detection theory, Bayesian safety bounds, LLMs proposing novel research, and evaluating AI safety frameworks.
Paper Highlights, September '24