Paul Röttger
- Papers: 17
Authored papers
AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages
arXiv 2025
IssueBench: Millions of Realistic Prompts for Measuring Issue Bias in LLM Writing Assistance
arXiv 2025
TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent
arXiv 2025
MSTS: A Multimodal Safety Test Suite for Vision-Language Models
arXiv 2025
Introducing v0.5 of the AI Safety Benchmark from MLCommons
arXiv 2024
The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models
arXiv 2024
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models
arXiv 2024
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ
arXiv 2024
"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models
arXiv 2024
From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets
arXiv 2024
Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think
arXiv 2024
Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset
arXiv 2024
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
arXiv 2023
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
arXiv 2023
Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models
NAACL (WOAH) 2022 7
Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-based Hate
NAACL 2022 7
HateCheck: Functional Tests for Hate Speech Detection Models
ACL 2021 5
Affiliations
Frequent co-authors
10 from 17 papers