Safetywashing in benchmarks, SAD, AgentDojo, vision-language model attacks, latent adversarial training, brittle steering vectors, the 2-dimensional truth, legible LLM solutions, and more LLM debate.
Paper Highlights, July '24