Poster papers: "language model safety"
7 papers found
Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses
David Glukhov, Ziwen Han, Ilia Shumailov et al.
ICLR 2025 poster · arXiv:2407.02551
10 citations
Shh, don't say that! Domain Certification in LLMs
Cornelius Emde, Alasdair Paren, Preetham Arvind et al.
ICLR 2025 poster · arXiv:2502.19320
4 citations
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
Xinpeng Wang, Chengzhi (Martin) Hu, Paul Röttger et al.
ICLR 2025 poster · arXiv:2410.03415
24 citations
Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization
Noam Razin, Sadhika Malladi, Adithya Bhaskar et al.
ICLR 2025 poster · arXiv:2410.08847
47 citations
A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity
Andrew Lee, Xiaoyan Bai, Itamar Pres et al.
ICML 2024 poster
Representation Surgery: Theory and Practice of Affine Steering
Shashwat Singh, Shauli Ravfogel, Jonathan Herzig et al.
ICML 2024 poster · arXiv:2402.09631
Whispering Experts: Neural Interventions for Toxicity Mitigation in Language Models
Xavi Suau, Pieter Delobelle, Katherine Metcalf et al.
ICML 2024 poster