Adversarial Robustness
Defending against adversarial attacks
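For orientation only, here is a minimal sketch of the single-step FGSM perturbation that much of the literature below attacks, transfers, or defends against. It assumes a differentiable PyTorch classifier `model`, a batched `image`/`label` pair with pixels in [0, 1], and a hypothetical `epsilon` budget; it is illustrative and not taken from any listed paper.

```python
# Illustrative sketch only (not from any paper below): one FGSM step.
# Assumes `model` is a differentiable PyTorch classifier and `image`/`label`
# are batched tensors; `epsilon` is a hypothetical perturbation budget.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=8 / 255):
    """Return a copy of `image` perturbed by one signed-gradient step."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step in the direction that increases the loss, then clamp to valid pixels.
    adv = image + epsilon * image.grad.sign()
    return adv.clamp(0.0, 1.0).detach()
```

Defenses surveyed in these papers (adversarial training, purification, certified smoothing, guardrails) are typically evaluated against iterated or adaptive variants of this kind of attack.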
Related Topics: Robustness
Top Papers
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
Safety Alignment Should be Made More Than Just a Few Tokens Deep
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu et al.
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian et al.
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley, Daniel Tan, Niels Warncke et al.
EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage
Zeyi Liao, Lingbo Mo, Chejian Xu et al.
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
Xiaogeng Liu, Peiran Li, G. Edward Suh et al.
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
Hanrong Zhang, Jingyuan Huang, Kai Mei et al.
Rethinking Model Ensemble in Transfer-based Adversarial Attacks
Huanran Chen, Yichi Zhang, Yinpeng Dong et al.
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
Yifan Li, Hangyu Guo, Kun Zhou et al.
Boosting Adversarial Transferability by Block Shuffle and Rotation
Kunyu Wang, Xuanran He, Wenxuan Wang et al.
Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances
Shilin Lu, Zihan Zhou, Jiayou Lu et al.
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks
Xuanming Cui, Alejandro Aparcedo, Young Kyun Jang et al.
Dissecting Adversarial Robustness of Multimodal LM Agents
Chen Wu, Rishi Shah, Jing Yu Koh et al.
Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks
Mehrdad Saberi, Vinu Sankar Sadasivan, Keivan Rezaei et al.
BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP
Jiawang Bai, Kuofeng Gao, Shaobo Min et al.
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Oliver Jaffe et al.
DAP: A Dynamic Adversarial Patch for Evading Person Detectors
Amira Guesmi, Ruitian Ding, Muhammad Abdullah Hanif et al.
Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning
Seanie Lee, Minsu Kim, Lynn Cherif et al.
One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
Lin Li, Haoyan Guan, Jianing Qiu et al.
PAD: Patch-Agnostic Defense against Adversarial Patch Attacks
Lihua Jing, Rui Wang, Wenqi Ren et al.
ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning
Zhaorun Chen, Mintong Kang, Bo Li
MathAttack: Attacking Large Language Models towards Math Solving Ability
Zihao Zhou, Qiufeng Wang, Mingyu Jin et al.
DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection
Yuhao Sun, Lingyun Yu, Hongtao Xie et al.
Adversarial Perturbations Cannot Reliably Protect Artists From Generative AI
Robert Hönig, Javier Rando, Nicholas Carlini et al.
RobustSAM: Segment Anything Robustly on Degraded Images
Wei-Ting Chen, Yu Jiet Vong, Sy-Yen Kuo et al.
Adversarial Prompt Tuning for Vision-Language Models
Jiaming Zhang, Xingjun Ma, Xin Wang et al.
Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models
Peifei Zhu, Tsubasa Takahashi, Hirokatsu Kataoka
$R^2$-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning
Mintong Kang, Bo Li
Boosting Transferability in Vision-Language Attacks via Diversification along the Intersection Region of Adversarial Trajectory
Sensen Gao, Xiaojun Jia, Xuhong Ren et al.
Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization
Junkang Wu, Yuexiang Xie, Zhengyi Yang et al.
Safe-Sim: Safety-Critical Closed-Loop Traffic Simulation with Diffusion-Controllable Adversaries
Wei-Jer Chang, Francesco Pittaluga, Masayoshi Tomizuka et al.
Model Poisoning Attacks to Federated Learning via Multi-Round Consistency
Yueqi Xie, Minghong Fang, Neil Zhenqiang Gong
Understanding Certified Training with Interval Bound Propagation
Yuhao Mao, Mark N Müller, Marc Fischer et al.
Towards Robust Knowledge Unlearning: An Adversarial Framework for Assessing and Improving Unlearning Robustness in Large Language Models
Hongbang Yuan, Zhuoran Jin, Pengfei Cao et al.
AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models
Mintong Kang, Chejian Xu, Bo Li
Language-Driven Anchors for Zero-Shot Adversarial Robustness
Xiao Li, Wei Zhang, Yining Liu et al.
A Transfer Attack to Image Watermarks
Yuepeng Hu, Zhengyuan Jiang, Moyang Guo et al.
Minimum Coverage Sets for Training Robust Ad Hoc Teamwork Agents
Arrasy Rahman, Jiaxun Cui, Peter Stone
Improving Transferable Targeted Adversarial Attacks with Model Self-Enhancement
Han Wu, Guanyan Ou, Weibin Wu et al.
Robust-Wide: Robust Watermarking against Instruction-driven Image Editing
Runyi Hu, Jie Zhang, Ting Xu et al.
CrAM: Credibility-Aware Attention Modification in LLMs for Combating Misinformation in RAG
Boyi Deng, Wenjie Wang, Fengbin Zhu et al.
OmniGuard: Hybrid Manipulation Localization via Augmented Versatile Deep Image Watermarking
Xuanyu Zhang, Zecheng Tang, Zhipei Xu et al.
Towards Faithful XAI Evaluation via Generalization-Limited Backdoor Watermark
Mengxi Ya, Yiming Li, Tao Dai et al.
Towards Adversarially Robust Dataset Distillation by Curvature Regularization
Eric Xue, Yijiang Li, Haoyang Liu et al.
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
Yunhan Zhao, Xiang Zheng, Lin Luo et al.
Comparing the Robustness of Modern No-Reference Image- and Video-Quality Metrics to Adversarial Attacks
Anastasia Antsiferova, Khaled Abud, Aleksandr Gushchin et al.
Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise
Yixin Liu, Kaidi Xu, Xun Chen et al.
Progressive Poisoned Data Isolation for Training-Time Backdoor Defense
Yiming Chen, Haiwei Wu, Jiantao Zhou
Understanding and Enhancing the Transferability of Jailbreaking Attacks
Runqi Lin, Bo Han, Fengwang Li et al.
AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration
Andy Zhou, Kevin Wu, Francesco Pinto et al.
Security Attacks on LLM-based Code Completion Tools
Wen Cheng, Ke Sun, Xinyu Zhang et al.
The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense
Yangyang Guo, Fangkai Jiao, Liqiang Nie et al.
Adversarial Training Should Be Cast as a Non-Zero-Sum Game
Alex Robey, Fabian Latorre, George Pappas et al.
IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation
Yiren Song, Pei Yang, Hai Ci et al.
Robust Distillation via Untargeted and Targeted Intermediate Adversarial Samples
Junhao Dong, Piotr Koniusz, Junxi Chen et al.
Towards Robust Image Stitching: An Adaptive Resistance Learning against Compatible Attacks
Zhiying Jiang, Xingyuan Li, Jinyuan Liu et al.
Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization
Xiangcheng Zhang, Fang Kong, Baoxiang Wang et al.
Endless Jailbreaks with Bijection Learning
Brian R.Y. Huang, Max Li, Leonard Tang
DiffusionGuard: A Robust Defense Against Malicious Diffusion-based Image Editing
William June Suk Choi, Kyungmin Lee, Jongheon Jeong et al.
NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic Adversarial Training
Dar-Yen Chen, Hmrishav Bandyopadhyay, Kai Zou et al.
RoDLA: Benchmarking the Robustness of Document Layout Analysis Models
Yufan Chen, Jiaming Zhang, Kunyu Peng et al.
GDA: Generalized Diffusion for Robust Test-time Adaptation
Yun-Yun Tsai, Fu-Chen Chen, Albert Chen et al.
Towards Reliable Evaluation and Fast Training of Robust Semantic Segmentation Models
Francesco Croce, Naman D. Singh, Matthias Hein
STEREO: A Two-Stage Framework for Adversarially Robust Concept Erasing from Text-to-Image Diffusion Models
Koushik Srivatsan, Fahad Shamshad, Muzammal Naseer et al.
CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization
Nay Myat Min, Long H. Pham, Yige Li et al.
RAT: Adversarial Attacks on Deep Reinforcement Agents for Targeted Behaviors
Fengshuo Bai, Runze Liu, Yali Du et al.
Perturbation-Invariant Adversarial Training for Neural Ranking Models: Improving the Effectiveness-Robustness Trade-Off
Yuansan Liu, Ruqing Zhang, Mingkun Zhang et al.
Mitigating the Curse of Dimensionality for Certified Robustness via Dual Randomized Smoothing
Song Xia, Yi Yu, Xudong Jiang et al.
Robust Overfitting Does Matter: Test-Time Adversarial Purification With FGSM
Linyu Tang, Lei Zhang
Boosting Adversarial Training via Fisher-Rao Norm-based Regularization
Xiangyu Yin, Wenjie Ruan
Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace
Jinluan Yang, Anke Tang, Didi Zhu et al.
Generalizability of Adversarial Robustness Under Distribution Shifts
Bernard Ghanem, Kumail Alhamoud, Hasan Hammoud et al.
Adversarial Attacks against Closed-Source MLLMs via Feature Optimal Alignment
Xiaojun Jia, Sensen Gao, Simeng Qin et al.
Speech Robust Bench: A Robustness Benchmark For Speech Recognition
Muhammad Shah, David Solans Noguero, Mikko Heikkilä et al.
Robust Nonparametric Regression under Poisoning Attack
Puning Zhao, Zhiguo Wan
Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
Guobin Shen, Dongcheng Zhao, Yiting Dong et al.
Towards Understanding and Improving Adversarial Robustness of Vision Transformers
Samyak Jain, Tanima Dutta
Instant Adversarial Purification with Adversarial Consistency Distillation
Chun Tong Lei, Hon Ming Yam, Zhongliang Guo et al.
PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor
Jaewon Jung, Hongsun Jang, Jaeyong Song et al.
BadVLA: Towards Backdoor Attacks on Vision-Language-Action Models via Objective-Decoupled Optimization
Xueyang Zhou, Guiyao Tie, Guowen Zhang et al.
Detecting Backdoor Attacks in Federated Learning via Direction Alignment Inspection
Jiahao Xu, Zikai Zhang, Rui Hu
ProSec: Fortifying Code LLMs with Proactive Security Alignment
Xiangzhe Xu, Zian Su, Jinyao Guo et al.
CLIP is Strong Enough to Fight Back: Test-time Counterattacks towards Zero-shot Adversarial Robustness of CLIP
Songlong Xing, Zhengyu Zhao, Nicu Sebe
Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households
Zhihao Cao, Zidong Wang, Siwen Xie et al.
Adversarially Robust Distillation by Reducing the Student-Teacher Variance Gap
Junhao Dong, Piotr Koniusz, Junxi Chen et al.
DRIFT: Dynamic Rule-Based Defense with Injection Isolation for Securing LLM Agents
Hao Li, Xiaogeng Liu, Chun Chiu et al.
Backdoor Attacks Against No-Reference Image Quality Assessment Models via a Scalable Trigger
Yi Yu, Song Xia, Xun Lin et al.
Hiding Imperceptible Noise in Curvature-Aware Patches for 3D Point Cloud Attack
Mingyu Yang, Daizong Liu, Keke Tang et al.
Breach By A Thousand Leaks: Unsafe Information Leakage in 'Safe' AI Responses
David Glukhov, Ziwen Han, Ilia Shumailov et al.
Backdoor Contrastive Learning via Bi-level Trigger Optimization
Weiyu Sun, Xinyu Zhang, Hao Lu et al.
Improved Bandits in Many-to-One Matching Markets with Incentive Compatibility
Fang Kong, Shuai Li
Jailbreaking as a Reward Misspecification Problem
Zhihui Xie, Jiahui Gao, Lei Li et al.
BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
Andy Zhang, Joey Ji, Celeste Menders et al.
ADBA: Approximation Decision Boundary Approach for Black-Box Adversarial Attacks
Feiyang Wang, Xingquan Zuo, Hai Huang et al.
Anyattack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models
Jiaming Zhang, Junhong Ye, Xingjun Ma et al.
Foundation Model-oriented Robustness: Robust Image Model Evaluation with Pretrained Models
Peiyan Zhang, Haoyang Liu, Chaozhuo Li et al.
TASAR: Transfer-based Attack on Skeletal Action Recognition
Yunfeng Diao, Baiqi Wu, Ruixuan Zhang et al.
RSafe: Incentivizing proactive reasoning to build robust and adaptive LLM safeguards
Jingnan Zheng, Xiangtian Ji, Yijun Lu et al.
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
Peng Xie, Yequan Bie, Jianda Mao et al.
Rethinking Adversarial Policies: A Generalized Attack Formulation and Provable Defense in RL
Xiangyu Liu, Souradip Chakraborty, Yanchao Sun et al.