2025 Poster Papers on "harmful content generation"
6 papers found
DarkBench: Benchmarking Dark Patterns in Large Language Models
Esben Kran, Hieu Minh Nguyen, Akash Kundu et al.
ICLR 2025 (poster) · arXiv:2503.10728
17 citations
Durable Quantization Conditioned Misalignment Attack on Large Language Models
Peiran Dong, Haowei Li, Song Guo
ICLR 2025 (poster)
1 citation
Fantastic Targets for Concept Erasure in Diffusion Models and Where To Find Them
Anh Bui, Thuy-Trang Vu, Long Vuong et al.
ICLR 2025 (poster) · arXiv:2501.18950
Information Retrieval Induced Safety Degradation in AI Agents
Cheng Yu, Benedikt Stroebl, Diyi Yang et al.
NeurIPS 2025 (poster) · arXiv:2505.14215
One Head to Rule Them All: Amplifying LVLM Safety through a Single Critical Attention Head
Junhao Xia, Haotian Zhu, Shuchao Pang et al.
NeurIPS 2025 (poster)
VLMs can Aggregate Scattered Training Patches
Zhanhui Zhou, Lingjie Chen, Chao Yang et al.
NeurIPS 2025 (poster) · arXiv:2506.03614