Mantas Mazeika

Center for AI Safety researcher; lead author of HarmBench and contributor to MMLU and frontier-risk evaluations.

Role: researcher
Currently at: Center for AI Safety
Twitter: twitter.com/mmazeika
GitHub: github.com/mmazeika
Scholar: scholar.google.com/citations
Papers: 12

Attribution

Affiliations & profile: scholar.google.com/citations

12 papers

Authored papers

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

arXiv 2025

2025

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems

arXiv 2025

2025

TextQuests: How Good are LLMs at Text-Based Video Games?

arXiv 2025

2025

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

arXiv 2024

2024

Tamper-Resistant Safeguards for Open-Weight LLMs

arXiv 2024

2024

Representation Engineering: A Top-Down Approach to AI Transparency

arXiv 2023

2023

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

TMLR

2022

Forecasting Future World Events with Neural Networks

arXiv 2022

2022

Measuring Coding Challenge Competence With APPS

arXiv 2021

2021

Measuring Massive Multitask Language Understanding

ICLR

2020

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

2019

Deep Anomaly Detection with Outlier Exposure

2018

Affiliations

Currently at

Center for AI Safety

researcher · nonprofit

Previously

University of Illinois Urbana-Champaign

university lab

Frequent co-authors

from 12 papers

Dan Hendrycks

director

11 shared papers

Andy Zou

founder

7 shared papers

Dawn Song

professor

6 shared papers

Long Phan

researcher

4 shared papers

Steven Basart

researcher

4 shared papers

Jacob Steinhardt

founder

3 shared papers

Xuwang Yin

3 shared papers

Alice Gatti

researcher

2 shared papers

Bo Li

2 shared papers

Collin Burns

researcher

2 shared papers