Mantas Mazeika
11 Papers
139 Total Citations
Papers (11)
Tamper-Resistant Safeguards for Open-Weight LLMs
ICLR 2025, arXiv
108
citations
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
NeurIPS 2025
31
citations
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
ICML 2024
0
citations
The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning
ICML 2024
0
citations
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
CVPR 2022, arXiv
0
citations
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
NeurIPS 2023
0
citations
Using Pre-Training Can Improve Model Robustness and Uncertainty
ICML 2019
0
citations
Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise
NeurIPS 2018
0
citations
Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty
NeurIPS 2019
0
citations
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios
NeurIPS 2022
0
citations
Forecasting Future World Events With Neural Networks
NeurIPS 2022
0
citations