"model safety" Papers
3 papers found
On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang et al.
ICLR 2025 (poster) · arXiv:2410.13708 · 40 citations
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Danny Halawi, Alexander Wei, Eric Wallace et al.
ICML 2024 (poster)
Position: Building Guardrails for Large Language Models Requires Systematic Design
Yi Dong, Ronghui Mu, Gaojie Jin et al.
ICML 2024 (poster)