ICLR 2025 "safety alignment" Papers

10 papers found

A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement

Hui Yuan, Yifan Zeng, Yue Wu et al.

ICLR 2025 poster · arXiv:2410.13828
5 citations

Can a Large Language Model be a Gaslighter?

Wei Li, Luyao Zhu, Yang Song et al.

ICLR 2025 poster · arXiv:2410.09181
2 citations

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

Jingyu Zhang, Ahmed Elgohary Ghoneim, Ahmed Magooda et al.

ICLR 2025 poster · arXiv:2410.08968
22 citations

Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models

Rui Ye, Jingyi Chai, Xiangrui Liu et al.

ICLR 2025 poster · arXiv:2406.10630
18 citations

Failures to Find Transferable Image Jailbreaks Between Vision-Language Models

Rylan Schaeffer, Dan Valentine, Luke Bailey et al.

ICLR 2025 poster · arXiv:2407.15211
22 citations

Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

Xiaojun Jia, Tianyu Pang, Chao Du et al.

ICLR 2025 poster · arXiv:2405.21018
74 citations

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

ICLR 2025 poster · arXiv:2404.02151
387 citations

Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models

Biao Yi, Tiansheng Huang, Sishuo Chen et al.

ICLR 2025 poster · arXiv:2506.16447
21 citations

Safety Alignment Should be Made More Than Just a Few Tokens Deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu et al.

ICLR 2025 poster · arXiv:2406.05946
287 citations

Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron

Yiran Zhao, Wenxuan Zhang, Yuxi Xie et al.

ICLR 2025 poster