2025 "jailbreak attacks" Papers

13 papers found

Bits Leaked per Query: Information-Theoretic Bounds for Adversarial Attacks on LLMs

Masahiro Kaneko, Timothy Baldwin

NeurIPS 2025, spotlight, arXiv:2510.17000

Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing

Keltin Grimes, Marco Christiani, David Shriver et al.

ICLR 2025, poster, arXiv:2412.13341
6 citations

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles

Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho

NeurIPS 2025, poster, arXiv:2506.00781
3 citations

Durable Quantization Conditioned Misalignment Attack on Large Language Models

Peiran Dong, Haowei Li, Song Guo

ICLR 2025, poster
1 citation

Efficient Jailbreak Attack Sequences on Large Language Models via Multi-Armed Bandit-Based Context Switching

Aditya Ramesh, Shivam Bhardwaj, Aditya Saibewar et al.

ICLR 2025, poster
3 citations

Endless Jailbreaks with Bijection Learning

Brian R.Y. Huang, Max Li, Leonard Tang

ICLR 2025, poster, arXiv:2410.01294
14 citations

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

Ma Teng, Xiaojun Jia, Ranjie Duan et al.

ICCV 2025, poster, arXiv:2412.05934
21 citations

IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves

Ruofan Wang, Juncheng Li, Yixu Wang et al.

ICCV 2025, poster, arXiv:2411.00827
8 citations

Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models

Hao Cheng, Erjia Xiao, Jing Shao et al.

NeurIPS 2025, poster, arXiv:2501.13772
4 citations

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Shiji Zhao, Ranjie Duan, Fengxiang Wang et al.

ICCV 2025, poster, arXiv:2501.04931
28 citations

Reasoning as an Adaptive Defense for Safety

Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan et al.

NeurIPS 2025, poster, arXiv:2507.00971
9 citations

Short-length Adversarial Training Helps LLMs Defend Long-length Jailbreak Attacks: Theoretical and Empirical Evidence

Shaopeng Fu, Liang Ding, Jingfeng Zhang et al.

NeurIPS 2025, poster, arXiv:2502.04204
6 citations

T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks

Jiayang Liu, Siyuan Liang, Shiqian Zhao et al.

NeurIPS 2025, poster, arXiv:2505.06679
6 citations