2025 "jailbreak attacks" Papers
13 papers found
Bits Leaked per Query: Information-Theoretic Bounds for Adversarial Attacks on LLMs
Masahiro Kaneko, Timothy Baldwin
NeurIPS 2025 | spotlight | arXiv:2510.17000
Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing
Keltin Grimes, Marco Christiani, David Shriver et al.
ICLR 2025 | poster | arXiv:2412.13341 | 6 citations
CoP: Agentic Red-teaming for Large Language Models using Composition of Principles
Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho
NeurIPS 2025 | poster | arXiv:2506.00781 | 3 citations
Durable Quantization Conditioned Misalignment Attack on Large Language Models
Peiran Dong, Haowei Li, Song Guo
ICLR 2025 | poster | 1 citation
Efficient Jailbreak Attack Sequences on Large Language Models via Multi-Armed Bandit-Based Context Switching
Aditya Ramesh, Shivam Bhardwaj, Aditya Saibewar et al.
ICLR 2025 | poster | 3 citations
Endless Jailbreaks with Bijection Learning
Brian R.Y. Huang, Max Li, Leonard Tang
ICLR 2025 | poster | arXiv:2410.01294 | 14 citations
Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
Ma Teng, Xiaojun Jia, Ranjie Duan et al.
ICCV 2025 | poster | arXiv:2412.05934 | 21 citations
IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves
Ruofan Wang, Juncheng Li, Yixu Wang et al.
ICCV 2025 | poster | arXiv:2411.00827 | 8 citations
Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models
Hao Cheng, Erjia Xiao, Jing Shao et al.
NeurIPS 2025 | poster | arXiv:2501.13772 | 4 citations
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
Shiji Zhao, Ranjie Duan, Fengxiang Wang et al.
ICCV 2025 | poster | arXiv:2501.04931 | 28 citations
Reasoning as an Adaptive Defense for Safety
Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan et al.
NeurIPS 2025 | poster | arXiv:2507.00971 | 9 citations
Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence
Shaopeng Fu, Liang Ding, Jingfeng Zhang et al.
NeurIPS 2025 | poster | arXiv:2502.04204 | 6 citations
T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks
Jiayang Liu, Siyuan Liang, Shiqian Zhao et al.
NeurIPS 2025 | poster | arXiv:2505.06679 | 6 citations