Mantas Mazeika

Papers: 11

Total Citations: 139

Papers (11)

Tamper-Resistant Safeguards for Open-Weight LLMs

Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning

PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Using Pre-Training Can Improve Model Robustness and Uncertainty

Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios

Forecasting Future World Events With Neural Networks