ICLR 2025 "safety alignment" Papers
10 papers found
A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement
Hui Yuan, Yifan Zeng, Yue Wu et al.
ICLR 2025 poster · arXiv:2410.13828
5 citations
Can a Large Language Model be a Gaslighter?
Wei Li, Luyao Zhu, Yang Song et al.
ICLR 2025 poster · arXiv:2410.09181
2 citations
Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements
Jingyu Zhang, Ahmed Elgohary Ghoneim, Ahmed Magooda et al.
ICLR 2025 poster · arXiv:2410.08968
22 citations
Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
Rui Ye, Jingyi Chai, Xiangrui Liu et al.
ICLR 2025 poster · arXiv:2406.10630
18 citations
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
Rylan Schaeffer, Dan Valentine, Luke Bailey et al.
ICLR 2025 poster · arXiv:2407.15211
22 citations
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Xiaojun Jia, Tianyu Pang, Chao Du et al.
ICLR 2025 poster · arXiv:2405.21018
74 citations
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
ICLR 2025 poster · arXiv:2404.02151
387 citations
Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models
Biao Yi, Tiansheng Huang, Sishuo Chen et al.
ICLR 2025 poster · arXiv:2506.16447
21 citations
Safety Alignment Should be Made More Than Just a Few Tokens Deep
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu et al.
ICLR 2025 poster · arXiv:2406.05946
287 citations
Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron
Yiran Zhao, Wenxuan Zhang, Yuxi Xie et al.
ICLR 2025 poster