RewardBench

Active

2,985 prompt-chosen-rejected triples across chat, reasoning, safety, and code - a benchmark for evaluating reward models and LLM judges.

Capabilities
LLM Judging, Safety
Format
HF Dataset
Size
2,985 tasks
License
ODC-BY-1.0
Published
Dec 2023
Notable for
Benchmark for evaluating LLM judging and safety.

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is RewardBench?
RewardBench is a benchmark for evaluating reward models and LLM judges. It consists of 2,985 prompt-chosen-rejected triples across chat, reasoning, safety, and code.
What capabilities does RewardBench test?
RewardBench evaluates LLM judging and safety.
How can a model improve its RewardBench score?
Tools linked to RewardBench on Sophon include Reward Bench RL Env (Prime Intellect) and HelpSteer2. These are RL environments, datasets, and scaffolds that target this eval.
What license is RewardBench under?
RewardBench is available under ODC-BY-1.0.
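
How is a prompt-chosen-rejected triple scored?
A reward model or judge is usually counted as correct on a triple when it rates the chosen response above the rejected one. The sketch below illustrates that idea with plain Python. The names Triple, pairwise_accuracy, and length_score are hypothetical, the two example triples are invented for illustration, and the toy scorer stands in for a real reward model. This is not the official RewardBench evaluation code, which may aggregate results per category.

```python
from typing import Callable, Iterable, NamedTuple


class Triple(NamedTuple):
    """One evaluation item: a prompt with a preferred and a dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


def pairwise_accuracy(
    triples: Iterable[Triple],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of triples where score(prompt, chosen) > score(prompt, rejected)."""
    items = list(triples)
    if not items:
        raise ValueError("need at least one triple")
    # Ties count as incorrect in this sketch.
    wins = sum(
        1 for t in items
        if score(t.prompt, t.chosen) > score(t.prompt, t.rejected)
    )
    return wins / len(items)


def length_score(prompt: str, response: str) -> float:
    """Toy stand-in for a reward model: longer responses score higher."""
    return float(len(response))


# Invented examples, not taken from the dataset.
examples = [
    Triple("What is 2 + 2?", "2 + 2 equals 4.", "5"),
    Triple("Name a primary color.", "Red.", "I am not sure what you mean."),
]

accuracy = pairwise_accuracy(examples, length_score)
print(accuracy)  # 0.5: the length heuristic gets the first triple right, the second wrong
```

Swapping length_score for a function that calls an actual reward model gives the same accuracy computation over real data.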