NEURIPS 2025 "safety alignment" Papers

11 papers found

Filters:NEURIPS 2025 safety alignment Clear all

Conference

AAAI 2025 (3,028)CVPR 2025 (2,873)ICCV 2025 (2,701)ICLR 2025 (3,827)ICML 2025 (3,340)ISMAR 2025 (229)NEURIPS 2025 (5,858)AAAI 2024 (2,289)CVPR 2024 (2,716)ECCV 2024 (2,387)ICLR 2024 (2,297)ICML 2024 (2,635)

Paper Type

poster (24,624)paper (8,140)oral (1,594)spotlight (1,421)highlight (975)

Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs

Zhixin Xie, Xurui Song, Jun Luo

NEURIPS 2025posterarXiv:2510.02833

citations

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles

Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho

NEURIPS 2025posterarXiv:2506.00781

citations

Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

Xiyue Peng, Hengquan Guo, Jiawei Zhang et al.

NEURIPS 2025posterarXiv:2410.19933

citations

From Judgment to Interference: Early Stopping LLM Harmful Outputs via Streaming Content Monitoring

Yang Li, Qiang Sheng, Yehan Yang et al.

NEURIPS 2025posterarXiv:2506.09996

citations

Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models

Hao Cheng, Erjia Xiao, Jing Shao et al.

NEURIPS 2025posterarXiv:2501.13772

citations

Lifelong Safety Alignment for Language Models

Haoyu Wang, Yifei Zhao, Zeyu Qin et al.

NEURIPS 2025posterarXiv:2505.20259

citations

Neither Valid nor Reliable? Investigating the Use of LLMs as Judges

Khaoula Chehbouni, Mohammed Haddou, Jackie CK Cheung et al.

NEURIPS 2025posterarXiv:2508.18076

citations

Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback

Jiaming Ji, Xinyu Chen, Rui Pan et al.

NEURIPS 2025posterarXiv:2503.17682

citations

Safety Depth in Large Language Models: A Markov Chain Perspective

Ching-Chia Kao, Chia-Mu Yu, Chun-Shien Lu et al.

NEURIPS 2025poster

citations

SafeVid: Toward Safety Aligned Video Large Multimodal Models

Yixu Wang, Jiaxin Song, Yifeng Gao et al.

NEURIPS 2025posterarXiv:2505.11926

citations

Towards Understanding Safety Alignment: A Mechanistic Perspective from Safety Neurons

Jianhui Chen, Xiaozhi Wang, Zijun Yao et al.

NEURIPS 2025posterarXiv:2406.14144

citations