Poster "automated red-teaming" Papers
2 papers found
Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning
Seanie Lee, Minsu Kim, Lynn Cherif et al.
ICLR 2025 poster, arXiv:2405.18540
44 citations
Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo
Stephen Zhao, Rob Brekelmans, Alireza Makhzani et al.
ICML 2024 poster