ICLR "adversarial attacks" Papers
7 papers found
Confidence Elicitation: A New Attack Vector for Large Language Models
Brian Formento, Chuan Sheng Foo, See-Kiong Ng
ICLR 2025, poster, arXiv:2502.04643
2 citations
GSBA$^K$: $top$-$K$ Geometric Score-based Black-box Attack
Md Farhamdur Reza, Richeng Jin, Tianfu Wu et al.
ICLR 2025, poster, arXiv:2503.12827
2 citations
Jailbreaking as a Reward Misspecification Problem
Zhihui Xie, Jiahui Gao, Lei Li et al.
ICLR 2025, poster, arXiv:2406.14393
9 citations
Rationalizing and Augmenting Dynamic Graph Neural Networks
Guibin Zhang, Yiyan Qi, Ziyang Cheng et al.
ICLR 2025, oral
Robust LLM safeguarding via refusal feature adversarial training
Lei Yu, Virginie Do, Karen Hambardzumyan et al.
ICLR 2025, poster, arXiv:2409.20089
Towards Certification of Uncertainty Calibration under Adversarial Attacks
Cornelius Emde, Francesco Pinto, Thomas Lukasiewicz et al.
ICLR 2025, poster, arXiv:2405.13922
2 citations
Towards Understanding the Robustness of Diffusion-Based Purification: A Stochastic Perspective
Yiming Liu, Kezhao Liu, Yao Xiao et al.
ICLR 2025, poster, arXiv:2404.14309
6 citations