"jailbreaking attacks" Papers
7 papers found
$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang, Bo Li
ICLR 2025 (poster) · arXiv:2407.05557
34 citations
Attention! Your Vision Language Model Could Be Maliciously Manipulated
Xiaosen Wang, Shaokang Wang, Zhijin Ge et al.
NeurIPS 2025 (poster) · arXiv:2505.19911
3 citations
HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
Seanie Lee, Haebin Seong, Dong Bok Lee et al.
ICLR 2025 (poster) · arXiv:2410.01524
13 citations
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei, Kaixuan Huang, Yangsibo Huang et al.
ICML 2024 (poster)
Fast Adversarial Attacks on Language Models In One GPU Minute
Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan et al.
ICML 2024 (poster)
RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
Zhuowen Yuan, Zidi Xiong, Yi Zeng et al.
ICML 2024 (poster)
Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
Yongshuo Zong, Ondrej Bohdal, Tingyang Yu et al.
ICML 2024 (poster)