"jailbreak attacks" Papers
8 papers found
Bits Leaked per Query: Information-Theoretic Bounds for Adversarial Attacks on LLMs
Masahiro Kaneko, Timothy Baldwin
NeurIPS 2025 (spotlight) · arXiv:2510.17000
Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing
Keltin Grimes, Marco Christiani, David Shriver et al.
ICLR 2025 (poster) · arXiv:2412.13341 · 6 citations
CoP: Agentic Red-teaming for Large Language Models using Composition of Principles
Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho
NeurIPS 2025 (poster) · arXiv:2506.00781 · 3 citations
Durable Quantization Conditioned Misalignment Attack on Large Language Models
Peiran Dong, Haowei Li, Song Guo
ICLR 2025 (poster) · 1 citation
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
Shiji Zhao, Ranjie Duan, Fengxiang Wang et al.
ICCV 2025 (poster) · arXiv:2501.04931 · 28 citations
Reasoning as an Adaptive Defense for Safety
Taeyoun Kim, Fahim Tajwar, Aditi Raghunathan et al.
NeurIPS 2025 (poster) · arXiv:2507.00971 · 9 citations
T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks
Jiayang Liu, Siyuan Liang, Shiqian Zhao et al.
NeurIPS 2025 (poster) · arXiv:2505.06679 · 6 citations
Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
Xiangming Gu, Xiaosen Zheng, Tianyu Pang et al.
ICML 2024 (poster)