🧬 Robustness

Adversarial Attacks

Crafting adversarial examples

100 papers · 3,143 total citations
Feb '24 – Jan '26 · 595 papers
Also includes: adversarial attacks, adversarial examples, attack methods, adversarial perturbations

Top Papers

#1

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

ICLR 2025
375
citations
#2

Ring-A-Bell! How Reliable are Concept Removal Methods For Diffusion Models?

Yu-Lin Tsai, Chia-Yi Hsu, Chulin Xie et al.

ICLR 2024
162
citations
#3

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian et al.

ICLR 2025
127
citations
#4

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

Anselm Paulus, Arman Zharmagambetov, Chuan Guo et al.

ICML 2025
123
citations
#5

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Tan, Niels Warncke et al.

ICML 2025
110
citations
#6

EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage

Zeyi Liao, Lingbo Mo, Chejian Xu et al.

ICLR 2025 · arXiv:2409.11295
web agent security · privacy leakage attacks · environmental injection attack · adversarial threat modeling (+4)
106
citations
#7

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

Hanrong Zhang, Jingyuan Huang, Kai Mei et al.

ICLR 2025
103
citations
#8

AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs

Xiaogeng Liu, Peiran Li, G. Edward Suh et al.

ICLR 2025
100
citations
#9

Rethinking Model Ensemble in Transfer-based Adversarial Attacks

Huanran Chen, Yichi Zhang, Yinpeng Dong et al.

ICLR 2024
96
citations
#10

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

Yifan Li, Hangyu Guo, Kun Zhou et al.

ECCV 2024
93
citations
#11

On the Robustness of Large Multimodal Models Against Image Adversarial Attacks

Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang et al.

CVPR 2024
80
citations
#12

Dissecting Adversarial Robustness of Multimodal LM Agents

Chen Wu, Rishi Shah, Jing Yu Koh et al.

ICLR 2025
77
citations
#13

FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models

Zhipei Xu, Xuanyu Zhang, Runyi Li et al.

ICLR 2025
75
citations
#14

Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks

Mehrdad Saberi, Vinu Sankar Sadasivan, Keivan Rezaei et al.

ICLR 2024
74
citations
#15

BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP

Jiawang Bai, Kuofeng Gao, Shaobo Min et al.

CVPR 2024
68
citations
#16

DAP: A Dynamic Adversarial Patch for Evading Person Detectors

Amira Guesmi, Ruitian Ding, Muhammad Abdullah Hanif et al.

CVPR 2024
48
citations
#17

Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning

Seanie Lee, Minsu Kim, Lynn Cherif et al.

ICLR 2025
42
citations
#18

PAD: Patch-Agnostic Defense against Adversarial Patch Attacks

Lihua Jing, Rui Wang, Wenqi Ren et al.

CVPR 2024
39
citations
#19

MathAttack: Attacking Large Language Models towards Math Solving Ability

Zihao Zhou, Qiufeng Wang, Mingyu Jin et al.

AAAI 2024 · arXiv:2309.01686
adversarial attacks · math word problems · large language models · logical entity recognition (+4)
37
citations
#20

DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection

Yuhao Sun, Lingyun Yu, Hongtao Xie et al.

CVPR 2024
36
citations
#21

Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI

Robert Hönig, Javier Rando, Nicholas Carlini et al.

ICLR 2025
35
citations
#22

Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models

Peifei Zhu, Tsubasa Takahashi, Hirokatsu Kataoka

CVPR 2024
34
citations
#23

Fake It till You Make It: Curricular Dynamic Forgery Augmentations towards General Deepfake Detection

Yuzhen Lin, Wentang Song, Bin Li et al.

ECCV 2024
34
citations
#24

Persistent Pre-training Poisoning of LLMs

Yiming Zhang, Javier Rando, Ivan Evtimov et al.

ICLR 2025
34
citations
#25

Adv-Diffusion: Imperceptible Adversarial Face Identity Attack via Latent Diffusion Model

Decheng Liu, Xijun Wang, Chunlei Peng et al.

AAAI 2024 · arXiv:2312.11285
adversarial attacks · face recognition models · latent diffusion model · identity-sensitive conditioning (+4)
34
citations
#26

Adversarial Prompt Tuning for Vision-Language Models

Jiaming Zhang, Xingjun Ma, Xin Wang et al.

ECCV 2024
33
citations
#27

Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory

Sensen Gao, Xiaojun Jia, Xuhong Ren et al.

ECCV 2024 · arXiv:2403.12445
vision-language pre-training · multimodal adversarial examples · adversarial transferability · adversarial trajectory (+3)
31
citations
#28

Spot the Fake: Large Multimodal Model-Based Synthetic Image Detection with Artifact Explanation

Siwei Wen, Junyan Ye, Peilin Feng et al.

NeurIPS 2025
29
citations
#29

Adversarial Search Engine Optimization for Large Language Models

Fredrik Nestaas, Edoardo Debenedetti, Florian Tramer

ICLR 2025
25
citations
#30

Model Poisoning Attacks to Federated Learning via Multi-Round Consistency

Yueqi Xie, Minghong Fang, Neil Zhenqiang Gong

CVPR 2025 · arXiv:2404.15611
model poisoning attacks · federated learning security · multi-round consistency · adversarial defenses (+3)
24
citations
#31

ADBM: Adversarial Diffusion Bridge Model for Reliable Adversarial Purification

Xiao Li, Wenxuan Sun, Huanran Chen et al.

ICLR 2025
24
citations
#32

AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models

Mintong Kang, Chejian Xu, Bo Li

ICLR 2025
21
citations
#33

A Transfer Attack to Image Watermarks

Yuepeng Hu, Zhengyuan Jiang, Moyang Guo et al.

ICLR 2025
21
citations
#34

Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples

Chengqian Gao, Haonan Li, Liu Liu et al.

ICML 2025
20
citations
#35

Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks

Junying Wang, Hongyuan Zhang, Yuan Yuan

CVPR 2025 · arXiv:2503.08269
customized portrait generation · facial adversarial attacks · privacy protection · face recognition systems (+3)
20
citations
#36

Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models

Andreas Müller, Denis Lukovnikov, Jonas Thietke et al.

CVPR 2025
20
citations
#37

EditShield: Protecting Unauthorized Image Editing by Instruction-guided Diffusion Models

Ruoxi Chen, Haibo Jin, Yixin Liu et al.

ECCV 2024 · arXiv:2311.12066
instruction-guided diffusion models · unauthorized image manipulation · image editing protection · latent representation perturbation (+3)
20
citations
#38

Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive

Yumeng Li, Margret Keuper, Dan Zhang et al.

ICLR 2024
19
citations
#39

Improving Transferable Targeted Adversarial Attacks with Model Self-Enhancement

Han Wu, Guanyan Ou, Weibin Wu et al.

CVPR 2024
19
citations
#40

Towards Faithful XAI Evaluation via Generalization-Limited Backdoor Watermark

Mengxi Ya, Yiming Li, Tao Dai et al.

ICLR 2024
18
citations
#41

Revisiting Adversarial Training Under Long-Tailed Distributions

Xinli Yue, Ningping Mou, Qian Wang et al.

CVPR 2024
17
citations
#42

BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Yunhan Zhao, Xiang Zheng, Lin Luo et al.

ICLR 2025
16
citations
#43

Memory Injection Attacks on LLM Agents via Query-Only Interaction

Shen Dong, Shaochen Xu, Pengfei He et al.

NeurIPS 2025
16
citations
#44

Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise

Yixin Liu, Kaidi Xu, Xun Chen et al.

AAAI 2024 · arXiv:2311.13091
unlearnable examples · data poisoning · adversarial training · defensive noise (+4)
16
citations
#45

Understanding and Enhancing the Transferability of Jailbreaking Attacks

Runqi Lin, Bo Han, Fengwang Li et al.

ICLR 2025
16
citations
#46

Progressive Poisoned Data Isolation for Training-Time Backdoor Defense

Yiming Chen, Haiwei Wu, Jiantao Zhou

AAAI 2024 · arXiv:2312.12724
backdoor attacks · data poisoning · training-time defense · poisoned data isolation (+2)
16
citations
#47

AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration

Andy Zhou, Kevin Wu, Francesco Pinto et al.

NeurIPS 2025
15
citations
#48

The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense

Yangyang Guo, Fangkai Jiao, Liqiang Nie et al.

NeurIPS 2025
15
citations
#49

Adversarial Training Should Be Cast as a Non-Zero-Sum Game

Alex Robey, Fabian Latorre, George Pappas et al.

ICLR 2024
15
citations
#50

Security Attacks on LLM-based Code Completion Tools

Wen Cheng, Ke Sun, Xinyu Zhang et al.

AAAI 2025
15
citations
#51

IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation

Yiren Song, Pei Yang, Hai Ci et al.

CVPR 2025
14
citations
#52

The Power of LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions

Stefan Sylvius Wagner, Maike Behrendt, Marc Ziegele et al.

ICLR 2025
14
citations
#53

Adversarial Backdoor Attack by Naturalistic Data Poisoning on Trajectory Prediction in Autonomous Driving

Mozhgan Pourkeshavarz, Mohammad Sabokrou, Amir Rasouli

CVPR 2024
14
citations
#54

Robust Distillation via Untargeted and Targeted Intermediate Adversarial Samples

Junhao Dong, Piotr Koniusz, Junxi Chen et al.

CVPR 2024
14
citations
#55

Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment

Xiaojun Jia, Sensen Gao, Simeng Qin et al.

NeurIPS 2025
12
citations
#56

DiffFAS: Face Anti-Spoofing via Generative Diffusion Models

Xinxu Ge, Xin Liu, Zitong Yu et al.

ECCV 2024
12
citations
#57

STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models

Koushik Srivatsan, Fahad Shamshad, Muzammal Naseer et al.

CVPR 2025
12
citations
#58

CL-Attack: Textual Backdoor Attacks via Cross-Lingual Triggers

Jingyi Zheng, Tianyi Hu, Tianshuo Cong et al.

AAAI 2025
12
citations
#59

Robust Overfitting Does Matter: Test-Time Adversarial Purification With FGSM

Linyu Tang, Lei Zhang

CVPR 2024
12
citations
#60

Adversarial Attacks on the Interpretation of Neuron Activation Maximization

Géraldin Nanfack, Alexander Fulleringer, Jonathan Marty et al.

AAAI 2024 · arXiv:2306.07397
adversarial attacks · activation maximization · neuron interpretation · interpretability deception (+3)
12
citations
#61

Boosting Adversarial Training via Fisher-Rao Norm-based Regularization

Xiangyu Yin, Wenjie Ruan

CVPR 2024
12
citations
#62

RAT: Adversarial Attacks on Deep Reinforcement Agents for Targeted Behaviors

Fengshuo Bai, Runze Liu, Yali Du et al.

AAAI 2025
12
citations
#63

CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization

Nay Myat Min, Long H. Pham, Yige Li et al.

ICML 2025
12
citations
#64

BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization

Xueyang Zhou, Guiyao Tie, Guowen Zhang et al.

NeurIPS 2025
11
citations
#65

Instant Adversarial Purification with Adversarial Consistency Distillation

Chun Tong Lei, Hon Ming Yam, Zhongliang Guo et al.

CVPR 2025
11
citations
#66

Efficient Backdoor Attacks for Deep Neural Networks in Real-world Scenarios

Ziqiang Li, Hong Sun, Pengfei Xia et al.

ICLR 2024
11
citations
#67

Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models

Guobin Shen, Dongcheng Zhao, Yiting Dong et al.

ICLR 2025
11
citations
#68

Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video

Zhaobo Qi, Yibo Yuan, Xiaowen Ruan et al.

AAAI 2024 · arXiv:2401.07567
temporal sentence grounding · dataset bias · adversarial training · multimodal alignment (+4)
11
citations
#69

Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety

Zihan Guan, Mengxuan Hu, Ronghang Zhu et al.

ICML 2025
11
citations
#70

PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor

Jaewon Jung, Hongsun Jang, Jaeyong Song et al.

CVPR 2024
11
citations
#71

ProSec: Fortifying Code LLMs with Proactive Security Alignment

Xiangzhe Xu, Zian Su, Jinyao Guo et al.

ICML 2025
11
citations
#72

Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection

Jiahao Xu, Zikai Zhang, Rui Hu

CVPR 2025
11
citations
#73

Hiding Imperceptible Noise in Curvature-Aware Patches for 3D Point Cloud Attack

Mingyu Yang, Daizong Liu, Keke Tang et al.

ECCV 2024
10
citations
#74

Backdoor Contrastive Learning via Bi-level Trigger Optimization

Weiyu Sun, Xinyu Zhang, Hao Lu et al.

ICLR 2024
10
citations
#75

AdvPrefix: An Objective for Nuanced LLM Jailbreaks

Sicheng Zhu, Brandon Amos, Yuandong Tian et al.

NeurIPS 2025
10
citations
#76

Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models

Jin Wang, Chenghui Lv, Xian Li et al.

CVPR 2025
10
citations
#77

Demystifying Poisoning Backdoor Attacks from a Statistical Perspective

Ganghua Wang, Xun Xian, Ashish Kundu et al.

ICLR 2024
10
citations
#78

Backdoor Attacks Against No-Reference Image Quality Assessment Models via a Scalable Trigger

Yi Yu, Song Xia, Xun Lin et al.

AAAI 2025
10
citations
#79

BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

Andy Zhang, Joey Ji, Celeste Menders et al.

NeurIPS 2025 · arXiv:2505.15216
cybersecurity AI agents · vulnerability detection · bug bounty programs · exploit generation (+4)
9
citations
#80

Exploring Vulnerabilities in Spiking Neural Networks: Direct Adversarial Attacks on Raw Event Data

Yanmeng Yao, Xiaohan Zhao, Bin Gu

ECCV 2024
spiking neural networks · adversarial attacks · event-based vision · dynamic vision sensors (+4)
9
citations
#81

Rethinking Adversarial Policies: A Generalized Attack Formulation and Provable Defense in RL

Xiangyu Liu, Souradip Chakraborty, Yanchao Sun et al.

ICLR 2024
9
citations
#82

Anyattack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models

Jiaming Zhang, Junhong Ye, Xingjun Ma et al.

CVPR 2025
9
citations
#83

ADBA: Approximation Decision Boundary Approach for Black-Box Adversarial Attacks

Feiyang Wang, Xingquan Zuo, Hai Huang et al.

AAAI 2025
9
citations
#84

Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Peng Xie, Yequan Bie, Jianda Mao et al.

CVPR 2025
9
citations
#85

Preventing Catastrophic Overfitting in Fast Adversarial Training: A Bi-level Optimization Perspective

Zhaoxin Wang, Handing Wang, Cong Tian et al.

ECCV 2024
8
citations
#86

Bottom-Up Domain Prompt Tuning for Generalized Face Anti-Spoofing

Si-Qi Liu, Qirui Wang, Pong Chi Yuen

ECCV 2024
face anti-spoofing · vision-language model · prompt tuning · domain generalization (+4)
8
citations
#87

Activation Gradient based Poisoned Sample Detection Against Backdoor Attacks

Danni Yuan, Mingda Zhang, Shaokui Wei et al.

ICLR 2025
8
citations
#88

AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses

Nicholas Carlini, Edoardo Debenedetti, Javier Rando et al.

ICML 2025
8
citations
#89

Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing

Hanhui Wang, Yihua Zhang, Ruizheng Bai et al.

CVPR 2025
8
citations
#90

Data Taggants: Dataset Ownership Verification Via Harmless Targeted Data Poisoning

Wassim Bouaziz, Nicolas Usunier, El-Mahdi El-Mhamdi

ICLR 2025
8
citations
#91

IPRemover: A Generative Model Inversion Attack against Deep Neural Network Fingerprinting and Watermarking

Wei Zong, Yang-Wai Chow, Willy Susilo et al.

AAAI 2024
8
citations
#92

LoRID: Low-Rank Iterative Diffusion for Adversarial Purification

Geigh Zollicoffer, Minh N. Vu, Ben Nebgen et al.

AAAI 2025
8
citations
#93

MalCL: Leveraging GAN-Based Generative Replay to Combat Catastrophic Forgetting in Malware Classification

Jimin Park, AHyun Ji, Minji Park et al.

AAAI 2025
8
citations
#94

Autonomous LLM-Enhanced Adversarial Attack for Text-to-Motion

Honglei Miao, Fan Ma, Ruijie Quan et al.

AAAI 2025
8
citations
#95

Enhancing Adversarial Transferability with Adversarial Weight Tuning

Jiahao Chen, Zhou Feng, Rui Zeng et al.

AAAI 2025
8
citations
#96

Injecting Universal Jailbreak Backdoors into LLMs in Minutes

Zhuowei Chen, Qiannan Zhang, Shichao Pei

ICLR 2025
7
citations
#97

Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Satvik Golechha, Adrià Garriga-Alonso

NeurIPS 2025 · arXiv:2504.04072
agentic deception · language-based AI agents · social deception game · multi-player game (+4)
7
citations
#98

Any Target Can be Offense: Adversarial Example Generation via Generalized Latent Infection

Youheng Sun, Shengming Yuan, Xuanhan Wang et al.

ECCV 2024 · arXiv:2407.12292
targeted adversarial attack · adversarial example generation · latent representation injection · class-agnostic attack (+4)
7
citations
#99

Against All Odds: Overcoming Typology, Script, and Language Confusion in Multilingual Embedding Inversion Attacks

Yiyi Chen, Russa Biswas, Heather Lent et al.

AAAI 2025
7
citations
#100

On the Robustness of Neural-Enhanced Video Streaming against Adversarial Attacks

Qihua Zhou, Jingcai Guo, Song Guo et al.

AAAI 2024
7
citations