"Model safety" poster papers
5 papers found
On Large Language Model Continual Unlearning
Chongyang Gao, Lixu Wang, Kaize Ding et al.
ICLR 2025 (poster), arXiv:2407.10223
26 citations
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.
ICLR 2025 (poster), arXiv:2410.13708
40 citations
Robust LLM safeguarding via refusal feature adversarial training
Lei Yu, Virginie Do, Karen Hambardzumyan et al.
ICLR 2025 (poster), arXiv:2409.20089
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi, Alexander Wei, Eric Wallace et al.
ICML 2024 (poster)
Position: Building Guardrails for Large Language Models Requires Systematic Design
Yi DONG, Ronghui Mu, Gaojie Jin et al.
ICML 2024 (poster)