Paul Röttger

Papers
17

Authored papers

17

AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages

arXiv 2025

2025

IssueBench: Millions of Realistic Prompts for Measuring Issue Bias in LLM Writing Assistance

arXiv 2025

2025

TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent

arXiv 2025

2025

MSTS: A Multimodal Safety Test Suite for Vision-Language Models

arXiv 2025

2025

Introducing v0.5 of the AI Safety Benchmark from MLCommons

arXiv 2024

2024

The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models

arXiv 2024

2024

Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models

arXiv 2024

2024

Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ

arXiv 2024

2024

"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models

arXiv 2024

2024

From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets

arXiv 2024

2024

Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think

arXiv 2024

2024

Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset

arXiv 2024

2024

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

arXiv 2023

2023

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

arXiv 2023

2023

Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models

NAACL (WOAH) 2022 7

2022

Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate

NAACL 2022 7

2021

HateCheck: Functional Tests for Hate Speech Detection Models

ACL 2021 5

2020

Affiliations

No known affiliations.

Frequent co-authors

10

from 17 papers