🧬 Robustness

Adversarial Robustness

Defending against adversarial attacks

100 papers · 3,273 total citations
Topic coverage: Feb '24 – Jan '26 (865 papers)
Also includes: adversarial robustness, robust models, adversarial defense, certified robustness

Top Papers

#1

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion

ICLR 2025
375
citations
#2

Safety Alignment Should be Made More Than Just a Few Tokens Deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu et al.

ICLR 2025
277
citations
#3

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian et al.

ICLR 2025
127
citations
#4

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Tan, Niels Warncke et al.

ICML 2025
110
citations
#5

EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage

Zeyi Liao, Lingbo Mo, Chejian Xu et al.

ICLR 2025 · arXiv:2409.11295
Keywords: web agent security, privacy leakage attacks, environmental injection attack, adversarial threat modeling
106
citations
#6

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

Hanrong Zhang, Jingyuan Huang, Kai Mei et al.

ICLR 2025
103
citations
#7

AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs

Xiaogeng Liu, Peiran Li, G. Edward Suh et al.

ICLR 2025
100
citations
#8

Rethinking Model Ensemble in Transfer-based Adversarial Attacks

Huanran Chen, Yichi Zhang, Yinpeng Dong et al.

ICLR 2024
96
citations
#9

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models

Yifan Li, Hangyu Guo, Kun Zhou et al.

ECCV 2024
93
citations
#10

Boosting Adversarial Transferability by Block Shuffle and Rotation

Kunyu Wang, Xuanran He, Wenxuan Wang et al.

CVPR 2024
88
citations
#11

On the Robustness of Large Multimodal Models Against Image Adversarial Attacks

Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang et al.

CVPR 2024
80
citations
#12

Dissecting Adversarial Robustness of Multimodal LM Agents

Chen Wu, Rishi Shah, Jing Yu Koh et al.

ICLR 2025
77
citations
#13

Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks

Mehrdad Saberi, Vinu Sankar Sadasivan, Keivan Rezaei et al.

ICLR 2024
74
citations
#14

BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP

Jiawang Bai, Kuofeng Gao, Shaobo Min et al.

CVPR 2024
68
citations
#15

AI Sandbagging: Language Models can Strategically Underperform on Evaluations

Teun van der Weij, Felix Hofstätter, Oliver Jaffe et al.

ICLR 2025
58
citations
#16

DAP: A Dynamic Adversarial Patch for Evading Person Detectors

Amira Guesmi, Ruitian Ding, Muhammad Abdullah Hanif et al.

CVPR 2024
48
citations
#17

Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning

Seanie Lee, Minsu Kim, Lynn Cherif et al.

ICLR 2025
42
citations
#18

One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models

Lin Li, Haoyan Guan, Jianing Qiu et al.

CVPR 2024
42
citations
#19

PAD: Patch-Agnostic Defense against Adversarial Patch Attacks

Lihua Jing, Rui Wang, Wenqi Ren et al.

CVPR 2024
39
citations
#20

MathAttack: Attacking Large Language Models towards Math Solving Ability

Zihao Zhou, Qiufeng Wang, Mingyu Jin et al.

AAAI 2024 · arXiv:2309.01686
Keywords: adversarial attacks, math word problems, large language models, logical entity recognition
37
citations
#21

ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning

Zhaorun Chen, Mintong Kang, Bo Li

ICML 2025
37
citations
#22

DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection

Yuhao Sun, Lingyun Yu, Hongtao Xie et al.

CVPR 2024
36
citations
#23

RobustSAM: Segment Anything Robustly on Degraded Images

Wei-Ting Chen, Yu Jiet Vong, Sy-Yen Kuo et al.

CVPR 2024
35
citations
#24

Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI

Robert Hönig, Javier Rando, Nicholas Carlini et al.

ICLR 2025
35
citations
#25

Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models

Peifei Zhu, Tsubasa Takahashi, Hirokatsu Kataoka

CVPR 2024
34
citations
#26

Adversarial Prompt Tuning for Vision-Language Models

Jiaming Zhang, Xingjun Ma, Xin Wang et al.

ECCV 2024
33
citations
#27

Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory

Sensen Gao, Xiaojun Jia, Xuhong Ren et al.

ECCV 2024 · arXiv:2403.12445
Keywords: vision-language pre-training, multimodal adversarial examples, adversarial transferability, adversarial trajectory
31
citations
#28

Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization

Junkang Wu, Yuexiang Xie, Zhengyi Yang et al.

ICLR 2025
27
citations
#29

Safe-Sim: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries

Wei-Jer Chang, Francesco Pittaluga, Masayoshi Tomizuka et al.

ECCV 2024
26
citations
#30

Model Poisoning Attacks to Federated Learning via Multi-Round Consistency

Yueqi Xie, Minghong Fang, Neil Zhenqiang Gong

CVPR 2025 · arXiv:2404.15611
Keywords: model poisoning attacks, federated learning security, multi-round consistency, adversarial defenses
24
citations
#31

Understanding Certified Training with Interval Bound Propagation

Yuhao Mao, Mark N Müller, Marc Fischer et al.

ICLR 2024
22
citations
#32

Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models

Hongbang Yuan, Zhuoran Jin, Pengfei Cao et al.

AAAI 2025
22
citations
#33

AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models

Mintong Kang, Chejian Xu, Bo Li

ICLR 2025
21
citations
#34

Language-Driven Anchors for Zero-Shot Adversarial Robustness

Xiao Li, Wei Zhang, Yining Liu et al.

CVPR 2024
21
citations
#35

A Transfer Attack to Image Watermarks

Yuepeng Hu, Zhengyuan Jiang, Moyang Guo et al.

ICLR 2025
21
citations
#36

Improving Transferable Targeted Adversarial Attacks with Model Self-Enhancement

Han Wu, Guanyan Ou, Weibin Wu et al.

CVPR 2024
19
citations
#37

Robust-Wide: Robust Watermarking against Instruction-driven Image Editing

Runyi Hu, Jie Zhang, Ting Xu et al.

ECCV 2024
19
citations
#38

CrAM: Credibility-Aware Attention Modification in LLMs for Combating Misinformation in RAG

Boyi Deng, Wenjie Wang, Fengbin Zhu et al.

AAAI 2025
19
citations
#39

Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents

Arrasy Rahman, Jiaxun Cui, Peter Stone

AAAI 2024 · arXiv:2308.09595
Keywords: ad hoc teamwork, minimum coverage set, robust cooperation, teammate policy diversity
19
citations
#40

Towards Adversarially Robust Dataset Distillation by Curvature Regularization

Eric Xue, Yijiang Li, Haoyang Liu et al.

AAAI 2025
18
citations
#41

OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking

Xuanyu Zhang, Zecheng Tang, Zhipei Xu et al.

CVPR 2025 · arXiv:2412.01615
Keywords: digital image watermarking, tamper localization, copyright protection, generative AI editing
18
citations
#42

Towards Faithful XAI Evaluation via Generalization-Limited Backdoor Watermark

Mengxi Ya, Yiming Li, Tao Dai et al.

ICLR 2024
18
citations
#43

Progressive Poisoned Data Isolation for Training-Time Backdoor Defense

Yiming Chen, Haiwei Wu, Jiantao Zhou

AAAI 2024 · arXiv:2312.12724
Keywords: backdoor attacks, data poisoning, training-time defense, poisoned data isolation
16
citations
#44

BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Yunhan Zhao, Xiang Zheng, Lin Luo et al.

ICLR 2025
16
citations
#45

Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise

Yixin Liu, Kaidi Xu, Xun Chen et al.

AAAI 2024 · arXiv:2311.13091
Keywords: unlearnable examples, data poisoning, adversarial training, defensive noise
16
citations
#46

Comparing the Robustness of Modern No-Reference Image- and Video-Quality Metrics to Adversarial Attacks

Anastasia Antsiferova, Khaled Abud, Aleksandr Gushchin et al.

AAAI 2024 · arXiv:2310.06958
Keywords: no-reference quality metrics, adversarial attacks, image quality assessment, video quality assessment
16
citations
#47

Understanding and Enhancing the Transferability of Jailbreaking Attacks

Runqi Lin, Bo Han, Fengwang Li et al.

ICLR 2025
16
citations
#48

Security Attacks on LLM-based Code Completion Tools

Wen Cheng, Ke Sun, Xinyu Zhang et al.

AAAI 2025
15
citations
#49

The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense

Yangyang Guo, Fangkai Jiao, Liqiang Nie et al.

NeurIPS 2025
15
citations
#50

AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration

Andy Zhou, Kevin Wu, Francesco Pinto et al.

NeurIPS 2025 · arXiv:2503.15754
Keywords: autonomous red teaming, large language models, multi-agent architecture, attack vector discovery
15
citations
#51

Adversarial Training Should Be Cast as a Non-Zero-Sum Game

Alex Robey, Fabian Latorre, George Pappas et al.

ICLR 2024
15
citations
#52

Robust Distillation via Untargeted and Targeted Intermediate Adversarial Samples

Junhao Dong, Piotr Koniusz, Junxi Chen et al.

CVPR 2024
14
citations
#53

IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation

Yiren Song, Pei Yang, Hai Ci et al.

CVPR 2025
14
citations
#54

NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training

Dar-Yen Chen, Hmrishav Bandyopadhyay, Kai Zou et al.

CVPR 2025 · arXiv:2412.02030
Keywords: single-step diffusion, adversarial training, high-fidelity generation, dynamic discriminator pool
14
citations
#55

Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization

XiangCheng Zhang, Fang Kong, Baoxiang Wang et al.

ICLR 2025
14
citations
#56

DiffusionGuard: A Robust Defense Against Malicious Diffusion-based Image Editing

William June Suk Choi, Kyungmin Lee, Jongheon Jeong et al.

ICLR 2025
14
citations
#57

Towards Robust Image Stitching: An Adaptive Resistance Learning against Compatible Attacks

Zhiying Jiang, Xingyuan Li, Jinyuan Liu et al.

AAAI 2024 · arXiv:2402.15959
Keywords: image stitching, adversarial attacks, feature matching, adversarial training
14
citations
#58

GDA: Generalized Diffusion for Robust Test-time Adaptation

Yun-Yun Tsai, Fu-Chen Chen, Albert Chen et al.

CVPR 2024
13
citations
#59

RoDLA: Benchmarking the Robustness of Document Layout Analysis Models

Yufan Chen, Jiaming Zhang, Kunyu Peng et al.

CVPR 2024
13
citations
#60

Towards Reliable Evaluation and Fast Training of Robust Semantic Segmentation Models

Francesco Croce, Naman D. Singh, Matthias Hein

ECCV 2024
12
citations
#61

STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models

Koushik Srivatsan, Fahad Shamshad, Muzammal Naseer et al.

CVPR 2025
12
citations
#62

Speech Robust Bench: A Robustness Benchmark For Speech Recognition

Muhammad Shah, David Solans Noguero, Mikko Heikkilä et al.

ICLR 2025 · arXiv:2403.07937
Keywords: automatic speech recognition, robustness benchmark, input perturbations, discrete representations
12
citations
#63

Boosting Adversarial Training via Fisher-Rao Norm-based Regularization

Xiangyu Yin, Wenjie Ruan

CVPR 2024
12
citations
#64

Mitigating the Curse of Dimensionality for Certified Robustness via Dual Randomized Smoothing

Song Xia, Yi Yu, Xudong Jiang et al.

ICLR 2024
12
citations
#65

Robust Overfitting Does Matter: Test-Time Adversarial Purification With FGSM

Linyu Tang, Lei Zhang

CVPR 2024
12
citations
#66

RAT: Adversarial Attacks on Deep Reinforcement Agents for Targeted Behaviors

Fengshuo Bai, Runze Liu, Yali Du et al.

AAAI 2025
12
citations
#67

Perturbation-Invariant Adversarial Training for Neural Ranking Models: Improving the Effectiveness-Robustness Trade-Off

Yuansan Liu, Ruqing Zhang, Mingkun Zhang et al.

AAAI 2024 · arXiv:2312.10329
Keywords: adversarial training, neural ranking models, information retrieval, adversarial robustness
12
citations
#68

Generalizability of Adversarial Robustness Under Distribution Shifts

Bernard Ghanem, Kumail Alhamoud, Hasan Hammoud et al.

ICLR 2024
12
citations
#69

Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace

Jinluan Yang, Anke Tang, Didi Zhu et al.

ICLR 2025
12
citations
#70

CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization

Nay Myat Min, Long H. Pham, Yige Li et al.

ICML 2025
12
citations
#71

Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment

Xiaojun Jia, Sensen Gao, Simeng Qin et al.

NeurIPS 2025
12
citations
#72

Towards Understanding and Improving Adversarial Robustness of Vision Transformers

Samyak Jain, Tanima Dutta

CVPR 2024
11
citations
#73

Robust Nonparametric Regression under Poisoning Attack

Puning Zhao, Zhiguo Wan

AAAI 2024 · arXiv:2305.16771
Keywords: robust nonparametric regression, poisoning attack, Huber loss minimization, kernel regression
11
citations
#74

ProSec: Fortifying Code LLMs with Proactive Security Alignment

Xiangzhe Xu, Zian Su, Jinyao Guo et al.

ICML 2025
11
citations
#75

PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor

Jaewon Jung, Hongsun Jang, Jaeyong Song et al.

CVPR 2024
11
citations
#76

Instant Adversarial Purification with Adversarial Consistency Distillation

Chun Tong Lei, Hon Ming Yam, Zhongliang Guo et al.

CVPR 2025
11
citations
#77

Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection

Jiahao Xu, Zikai Zhang, Rui Hu

CVPR 2025
11
citations
#78

Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models

Guobin Shen, Dongcheng Zhao, Yiting Dong et al.

ICLR 2025
11
citations
#79

BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization

Xueyang Zhou, Guiyao Tie, Guowen Zhang et al.

NeurIPS 2025
11
citations
#80

Backdoor Attacks Against No-Reference Image Quality Assessment Models via a Scalable Trigger

Yi Yu, Song Xia, Xun Lin et al.

AAAI 2025
10
citations
#81

CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP

Songlong Xing, Zhengyu Zhao, Nicu Sebe

CVPR 2025
10
citations
#82

Adversarially Robust Distillation by Reducing the Student-Teacher Variance Gap

Junhao Dong, Piotr Koniusz, Junxi Chen et al.

ECCV 2024
Keywords: adversarial robustness, knowledge distillation, feature distribution alignment, variance gap reduction
10
citations
#83

Improved Bandits in Many-to-One Matching Markets with Incentive Compatibility

Fang Kong, Shuai Li

AAAI 2024 · arXiv:2401.01528
Keywords: matching markets, incentive compatibility, bandit algorithms, online deferred acceptance
10
citations
#84

Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses

David Glukhov, Ziwen Han, I Shumailov et al.

ICLR 2025 · arXiv:2407.02551
Keywords: information leakage, dual-intent queries, inferential adversaries, safety-utility trade-off
10
citations
#85

Backdoor Contrastive Learning via Bi-level Trigger Optimization

Weiyu Sun, Xinyu Zhang, Hao Lu et al.

ICLR 2024
10
citations
#86

Hiding Imperceptible Noise in Curvature-Aware Patches for 3D Point Cloud Attack

Mingyu Yang, Daizong Liu, Keke Tang et al.

ECCV 2024
10
citations
#87

Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households

Zhihao Cao, ZiDong Wang, Siwen Xie et al.

CVPR 2024
10
citations
#88

DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents

Hao Li, Xiaogeng Liu, Chun Chiu et al.

NeurIPS 2025 · arXiv:2506.12104
Keywords: prompt injection attacks, agentic systems security, dynamic rule enforcement, memory stream isolation
10
citations
#89

Rethinking Adversarial Policies: A Generalized Attack Formulation and Provable Defense in RL

Xiangyu Liu, Souradip Chakraborty, Yanchao Sun et al.

ICLR 2024
9
citations
#90

ADBA: Approximation Decision Boundary Approach for Black-Box Adversarial Attacks

Feiyang Wang, Xingquan Zuo, Hai Huang et al.

AAAI 2025
9
citations
#91

BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

Andy Zhang, Joey Ji, Celeste Menders et al.

NeurIPS 2025 · arXiv:2505.15216
Keywords: cybersecurity AI agents, vulnerability detection, bug bounty programs, exploit generation
9
citations
#92

Anyattack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models

Jiaming Zhang, Junhong Ye, Xingjun Ma et al.

CVPR 2025
9
citations
#93

TASAR: Transfer-based Attack on Skeletal Action Recognition

Yunfeng Diao, Baiqi Wu, Ruixuan Zhang et al.

ICLR 2025
9
citations
#94

Jailbreaking as a Reward Misspecification Problem

Zhihui Xie, Jiahui Gao, Lei Li et al.

ICLR 2025 · arXiv:2406.14393
Keywords: reward misspecification, adversarial attacks, large language models, automated red teaming
9
citations
#95

RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards

Jingnan Zheng, Xiangtian Ji, Yijun Lu et al.

NeurIPS 2025
9
citations
#96

Adversarially Robust Out-of-Distribution Detection Using Lyapunov-Stabilized Embeddings

Hossein Mirzaei Sadeghlou, Mackenzie Mathis

ICLR 2025
9
citations
#97

Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Peng Xie, Yequan Bie, Jianda Mao et al.

CVPR 2025
9
citations
#98

Foundation Model-oriented Robustness: Robust Image Model Evaluation with Pretrained Models

Peiyan Zhang, Haoyang Liu, Chaozhuo Li et al.

ICLR 2024
9
citations
#99

MedBN: Robust Test-Time Adaptation against Malicious Test Samples

Hyejin Park, Jeongyeon Hwang, Sunung Mun et al.

CVPR 2024
8
citations
#100

Robust Communicative Multi-Agent Reinforcement Learning with Active Defense

Lebin Yu, Yunbo Qiu, Quanming Yao et al.

AAAI 2024 · arXiv:2312.11545
Keywords: multi-agent reinforcement learning, adversarial attacks, active defense strategy, agent communication
8
citations