by Xinpeng Wang Papers
2 papers found
Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang, Mingyang Wang, Yihong Liu et al.
NeurIPS 2025poster
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
Xinpeng Wang, Chengzhi (Martin) Hu, Paul Röttger et al.
ICLR 2025posterarXiv:2410.03415
24
citations