NeurIPS 2025 "safety alignment" Papers
11 papers found

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
Zhixin Xie, Xurui Song, Jun Luo
NeurIPS 2025 · poster · arXiv:2510.02833 · 2 citations

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles
Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho
NeurIPS 2025 · poster · arXiv:2506.00781 · 3 citations

Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization
Xiyue Peng, Hengquan Guo, Jiawei Zhang et al.
NeurIPS 2025 · poster · arXiv:2410.19933 · 5 citations

From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring
Yang Li, Qiang Sheng, Yehan Yang et al.
NeurIPS 2025 · poster · arXiv:2506.09996 · 7 citations

Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models
Hao Cheng, Erjia Xiao, Jing Shao et al.
NeurIPS 2025 · poster · arXiv:2501.13772 · 6 citations

Lifelong Safety Alignment for Language Models
Haoyu Wang, Yifei Zhao, Zeyu Qin et al.
NeurIPS 2025 · poster · arXiv:2505.20259 · 7 citations

Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Khaoula Chehbouni, Mohammed Haddou, Jackie CK Cheung et al.
NeurIPS 2025 · poster · arXiv:2508.18076 · 11 citations

Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback
Jiaming Ji, Xinyu Chen, Rui Pan et al.
NeurIPS 2025 · poster · arXiv:2503.17682 · 8 citations

Safety Depth in Large Language Models: A Markov Chain Perspective
Ching-Chia Kao, Chia-Mu Yu, Chun-Shien Lu et al.
NeurIPS 2025 · poster · 1 citation

SafeVid: Toward Safety Aligned Video Large Multimodal Models
Yixu Wang, Jiaxin Song, Yifeng Gao et al.
NeurIPS 2025 · poster · arXiv:2505.11926 · 3 citations

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons
Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.
NeurIPS 2025 · poster · arXiv:2406.14144 · 24 citations