Unintentional RLHF exploitation, unlearning doesn't unlearn, backdoor detection theory, Bayesian safety bounds, LLMs proposing novel research, and evaluating AI safety frameworks.
Paper Highlights, September '24