"large language model safety" Papers
2 papers found
Adaptive Defense against Harmful Fine-Tuning for Large Language Models via Bayesian Data Scheduler
Zixuan Hu, Li Shen, Zhenyi Wang et al.
NeurIPS 2025 (spotlight) · arXiv:2510.27172 · 2 citations
The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning
Nathaniel Li, Alexander Pan, Anjali Gopal et al.
ICML 2024 (poster)