Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?

7 citations
Ranked #1281 of 3827 papers in ICLR 2025

Abstract

Large Language Models (LLMs) are known to be susceptible to crafted adversarial attacks or jailbreaks that lead to the generation of objectionable content despite being aligned to human preferences using safety fine-tuning methods. While the large dimensionality of the input token space makes it inevitable to find adversarial prompts that can jailbreak these models, we aim to evaluate whether safety fine-tuned LLMs are safe against natural prompts which are semantically related to toxic seed prompts that elicit safe responses after alignment. We surprisingly find that popular aligned LLMs such as GPT-4 can be compromised using naive prompts that are NOT even crafted with the objective of jailbreaking the model. Furthermore, we empirically show that given a seed prompt that elicits a toxic response from an unaligned model, one can systematically generate several semantically related natural prompts that can jailbreak aligned LLMs. Towards this, we propose a method, Response Guided Question Augmentation (ReG-QA), to evaluate the generalization of safety-aligned LLMs to natural prompts, which first generates several toxic answers given a seed question using an unaligned LLM (Q to A), and further leverages an LLM to generate questions that are likely to produce these answers (A to Q). We interestingly find that safety fine-tuned LLMs such as GPT-4o are vulnerable to producing natural jailbreak questions from unsafe content (without denial) and can thus be used for the latter (A to Q) step. We obtain attack success rates that are comparable to or better than leading adversarial attack methods on the JailbreakBench leaderboard, while being significantly more stable against defenses such as Smooth-LLM and Synonym Substitution, which are effective against all existing attacks on the leaderboard.
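
The two-step pipeline described above (Q to A with an unaligned model, then A to Q with an aligned model) can be summarized in a minimal sketch. The helper query_llm and the model names UNALIGNED_MODEL and ALIGNED_MODEL below are hypothetical placeholders, and the A-to-Q prompt wording is an assumption; this is an illustration of the described procedure, not the authors' implementation.

# Minimal sketch of the ReG-QA procedure described in the abstract.
# query_llm and the model names are placeholders, not a real API.

def query_llm(model: str, prompt: str) -> str:
    """Placeholder for a single LLM call; the actual backend is assumed."""
    raise NotImplementedError

def reg_qa(seed_question: str, num_answers: int = 5, questions_per_answer: int = 3) -> list[str]:
    """Generate natural prompts semantically related to a toxic seed question.

    Step 1 (Q to A): an unaligned model produces several answers to the seed.
    Step 2 (A to Q): an aligned model is asked which questions would plausibly
    elicit each answer; these become candidate natural prompts for evaluation.
    """
    candidate_prompts = []
    for _ in range(num_answers):
        answer = query_llm("UNALIGNED_MODEL", seed_question)  # Q to A
        for _ in range(questions_per_answer):
            question = query_llm(                             # A to Q
                "ALIGNED_MODEL",
                "Write a question that the following text would answer:\n" + answer,
            )
            candidate_prompts.append(question)
    return candidate_prompts

The candidate prompts returned by such a procedure would then be submitted to the safety-aligned target model to measure attack success rate.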

Citation History

Jan 25, 2026: 7 citations