Mantas Mazeika
4 Papers
139 Total Citations
Papers (4)
Tamper-Resistant Safeguards for Open-Weight LLMs
ICLR 2025, arXiv
108 citations
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
NeurIPS 2025
31 citations
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
ICML 2024
0 citations
The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning
ICML 2024
0 citations