Paul Röttger

Papers: 17

Attribution

Affiliations & profile: Semantic Scholar

Authored papers

AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages

arXiv 2025

2025

IssueBench: Millions of Realistic Prompts for Measuring Issue Bias in LLM Writing Assistance

arXiv 2025

2025

MSTS: A Multimodal Safety Test Suite for Vision-Language Models

arXiv 2025

2025

TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

arXiv 2025

2025

Introducing v0.5 of the AI Safety Benchmark from MLCommons

arXiv 2024

2024

From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets

arXiv 2024

2024

The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models

arXiv 2024

2024

Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models

arXiv 2024

2024

Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ

arXiv 2024

2024

Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think

arXiv 2024

2024

"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

arXiv 2024

2024

Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset

arXiv 2024

2024

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

arXiv 2023

2023

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

arXiv 2023

2023

Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models

NAACL (WOAH) 2022 7

2022

Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate

NAACL 2022 7

2021

HateCheck: Functional Tests for Hate Speech Detection Models

ACL 2021 5

2020

Affiliations

No known affiliations.

Frequent co-authors

from 17 papers

Bertie Vidgen

Dirk Hovy

Hannah Rose Kirk

Scott A. Hale

Giuseppe Attanasio

Adina Williams

Alicia Parrish

Barbara Plank

Bertram Vidgen

Bolei Ma