Endless Jailbreaks with Bijection Learning

arXiv:2410.01294
14 Citations
#488 of 3827 papers in ICLR 2025
3 Authors

Abstract

Despite extensive safety measures, LLMs are vulnerable to adversarial inputs, or jailbreaks, which can elicit unsafe behaviors. In this work, we introduce bijection learning, a powerful attack algorithm which automatically fuzzes LLMs for safety vulnerabilities using randomly generated encodings whose complexity can be tightly controlled. We leverage in-context learning to teach models bijective encodings, pass encoded queries to the model to bypass built-in safety mechanisms, and finally decode responses back into English. Our attack is extremely effective on a wide range of frontier language models. Moreover, by controlling complexity parameters such as the number of key-value mappings in the encodings, we find a close relationship between the capability level of the attacked LLM and the average complexity of the most effective bijection attacks. Our work highlights that new vulnerabilities in frontier models can emerge with scale: more capable models are more severely jailbroken by bijection attacks.
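
To make the encoding step concrete, the sketch below shows one way a random bijective cipher with a tunable complexity knob could be generated and applied. It is a minimal illustration, not the paper's exact construction: the letter-to-letter alphabet mapping, the function names, and the `num_mapped` parameter (a stand-in for the "number of key-value mappings" complexity control described above) are assumptions made for this example. In the attack as described, the mapping would be taught to the model through in-context examples, the encoded query would be sent in place of plain English, and the model's encoded reply would be decoded locally.

```python
import random
import string


def make_bijection(num_mapped: int, seed: int = 0) -> dict[str, str]:
    """Build a random letter-to-letter bijection over the lowercase alphabet.

    `num_mapped` letters are remapped to a random permutation of themselves;
    the remaining letters map to themselves. A larger `num_mapped` yields a
    more complex encoding for the model to learn in context.
    """
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    chosen = rng.sample(letters, num_mapped)
    shuffled = chosen[:]
    rng.shuffle(shuffled)
    mapping = {c: c for c in letters}            # identity by default
    mapping.update(dict(zip(chosen, shuffled)))  # remap the chosen subset
    return mapping


def encode(text: str, mapping: dict[str, str]) -> str:
    """Apply the bijection character by character (non-letters pass through)."""
    return "".join(mapping.get(c, c) for c in text.lower())


def decode(text: str, mapping: dict[str, str]) -> str:
    """Invert the bijection to recover plain English from an encoded reply."""
    inverse = {v: k for k, v in mapping.items()}
    return "".join(inverse.get(c, c) for c in text)


if __name__ == "__main__":
    bijection = make_bijection(num_mapped=10, seed=42)
    query = "summarize the experimental setup"   # placeholder query
    coded = encode(query, bijection)
    print(coded)                       # encoded string sent to the target model
    print(decode(coded, bijection))    # round-trips back to the original text
```

Because the mapping is a true bijection, decoding is exact, which is what lets the attacker recover fluent English output even though the model only ever sees and produces encoded text.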

Citation History

Jan 26, 2026: 0
Jan 27, 2026: 0
Jan 31, 2026: 14 (+14)