NeurIPS 2025 "jailbreak attacks" Papers
9 papers found
ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio–Language Models
Weifei Jin, Yuxin Cao, Junjie Su et al.
NeurIPS 2025 · poster · arXiv:2510.26096
Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs
Zhixin Xie, Xurui Song, Jun Luo
NeurIPS 2025 · poster · arXiv:2510.02833
Bits Leaked per Query: Information-Theoretic Bounds for Adversarial Attacks on LLMs
Masahiro Kaneko, Timothy Baldwin
NeurIPS 2025 · spotlight · arXiv:2510.17000
CoP: Agentic Red-teaming for Large Language Models using Composition of Principles
Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho
NeurIPS 2025 · poster · arXiv:2506.00781 · 3 citations
GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
Advik Basani, Xiao Zhang
NeurIPS 2025 · poster · arXiv:2411.14133 · 12 citations
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models
Hao Cheng, Erjia Xiao, Jing Shao et al.
NeurIPS 2025 · poster · arXiv:2501.13772 · 4 citations
Reasoning as an Adaptive Defense for Safety
Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan et al.
NeurIPS 2025 · poster · arXiv:2507.00971 · 9 citations
Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence
Shaopeng Fu, Liang Ding, Jingfeng Zhang et al.
NeurIPS 2025 · poster · arXiv:2502.04204 · 6 citations
T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks
Jiayang Liu, Siyuan Liang, Shiqian Zhao et al.
NeurIPS 2025 · poster · arXiv:2505.06679 · 6 citations