"safety alignment" Papers
15 papers found
Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
Zhixin Xie, Xurui Song, Jun Luo
NeurIPS 2025 poster · arXiv:2510.02833
CoP: Agentic Red-teaming for Large Language Models using Composition of Principles
Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho
NeurIPS 2025 poster · arXiv:2506.00781
3 citations
Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
Rui Ye, Jingyi Chai, Xiangrui Liu et al.
ICLR 2025 poster · arXiv:2406.10630
18 citations
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models
Shuyang Hao, Bryan Hooi, Jun Liu et al.
CVPR 2025 poster · arXiv:2411.18000
5 citations
Focus-N-Fix: Region-Aware Fine-Tuning for Text-to-Image Generation
Xiaoying Xing, Avinab Saha, Junfeng He et al.
CVPR 2025 highlight · arXiv:2501.06481
3 citations
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Xiaojun Jia, Tianyu Pang, Chao Du et al.
ICLR 2025 poster · arXiv:2405.21018
74 citations
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models
Hao Cheng, Erjia Xiao, Jing Shao et al.
NeurIPS 2025 poster · arXiv:2501.13772
4 citations
Lifelong Safety Alignment for Language Models
Haoyu Wang, Yifei Zhao, Zeyu Qin et al.
NeurIPS 2025 poster · arXiv:2505.20259
6 citations
SafeVid: Toward Safety Aligned Video Large Multimodal Models
Yixu Wang, Jiaxin Song, Yifeng Gao et al.
NeurIPS 2025 poster · arXiv:2505.11926
3 citations
Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.
NeurIPS 2025 poster · arXiv:2406.14144
24 citations
Understanding and Enhancing Safety Mechanisms of LLMs via Safety-Specific Neuron
Yiran Zhao, Wenxuan Zhang, Yuxi Xie et al.
ICLR 2025 poster
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei, Kaixuan Huang, Yangsibo Huang et al.
ICML 2024 poster
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
Junyuan Hong, Jinhao Duan, Chenhui Zhang et al.
ICML 2024 poster
PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition
Ziyang Zhang, Qizhen Zhang, Jakob Foerster
ICML 2024 poster
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
Yongshuo Zong, Ondrej Bohdal, Tingyang Yu et al.
ICML 2024 poster