2025 "safety alignment" Papers
14 papers found
A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement
Hui Yuan, Yifan Zeng, Yue Wu et al.
ICLR 2025 (poster) · arXiv:2410.13828
5 citations
Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
Zhixin Xie, Xurui Song, Jun Luo
NeurIPS 2025 (poster) · arXiv:2510.02833
CoP: Agentic Red-teaming for Large Language Models using Composition of Principles
Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho
NeurIPS 2025 (poster) · arXiv:2506.00781
3 citations
Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
Rui Ye, Jingyi Chai, Xiangrui Liu et al.
ICLR 2025 (poster) · arXiv:2406.10630
18 citations
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models
Shuyang Hao, Bryan Hooi, Jun Liu et al.
CVPR 2025 (poster) · arXiv:2411.18000
5 citations
Focus-N-Fix: Region-Aware Fine-Tuning for Text-to-Image Generation
Xiaoying Xing, Avinab Saha, Junfeng He et al.
CVPR 2025 (highlight) · arXiv:2501.06481
3 citations
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Xiaojun Jia, Tianyu Pang, Chao Du et al.
ICLR 2025 (poster) · arXiv:2405.21018
74 citations
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models
Hao Cheng, Erjia Xiao, Jing Shao et al.
NeurIPS 2025 (poster) · arXiv:2501.13772
4 citations
Lifelong Safety Alignment for Language Models
Haoyu Wang, Yifei Zhao, Zeyu Qin et al.
NeurIPS 2025 (poster) · arXiv:2505.20259
6 citations
Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models
Biao Yi, Tiansheng Huang, Sishuo Chen et al.
ICLR 2025 (poster) · arXiv:2506.16447
21 citations
Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback
Jiaming Ji, Xinyu Chen, Rui Pan et al.
NeurIPS 2025 (poster) · arXiv:2503.17682
8 citations
SafeVid: Toward Safety Aligned Video Large Multimodal Models
Yixu Wang, Jiaxin Song, Yifeng Gao et al.
NeurIPS 2025 (poster) · arXiv:2505.11926
3 citations
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.
NeurIPS 2025 (poster) · arXiv:2406.14144
24 citations
Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron
Yiran Zhao, Wenxuan Zhang, Yuxi Xie et al.
ICLR 2025 (poster)