ICLR Poster "adversarial attacks" Papers
6 papers found
Confidence Elicitation: A New Attack Vector for Large Language Models
Brian Formento, Chuan Sheng Foo, See-Kiong Ng
ICLR 2025 poster arXiv:2502.04643
2 citations
GSBA$^K$: $top$-$K$ Geometric Score-based Black-box Attack
Md Farhamdur Reza, Richeng Jin, Tianfu Wu et al.
ICLR 2025 poster arXiv:2503.12827
2 citations
Jailbreaking as a Reward Misspecification Problem
Zhihui Xie, Jiahui Gao, Lei Li et al.
ICLR 2025 poster arXiv:2406.14393
9 citations
Robust LLM safeguarding via refusal feature adversarial training
Lei Yu, Virginie Do, Karen Hambardzumyan et al.
ICLR 2025 poster arXiv:2409.20089
Towards Certification of Uncertainty Calibration under Adversarial Attacks
Cornelius Emde, Francesco Pinto, Thomas Lukasiewicz et al.
ICLR 2025 poster arXiv:2405.13922
2 citations
Towards Understanding the Robustness of Diffusion-Based Purification: A Stochastic Perspective
Yiming Liu, Kezhao Liu, Yao Xiao et al.
ICLR 2025 poster arXiv:2404.14309
6 citations