RewardBench
Active
2,985 prompt-chosen-rejected triples across chat, reasoning, safety, and code - a benchmark for evaluating reward models and LLM judges.
- Publisher
- Allen Institute for AI (Ai2)
- Capabilities
- LLM Judging, Safety
- Format
- HF Dataset
- Size
- 2,985 tasks
- License
- ODC-BY-1.0
- Published
- Dec 2023
- Notable for
- Benchmark for evaluating LLM judging and safety.
- Canonical
- github.com/allenai/reward-bench
Related tools (2)
Implementations, trainers, datasets, and scaffolds linked to this eval.
Contributors (1)
FAQ
- What is RewardBench?
- RewardBench is a benchmark of 2,985 prompt-chosen-rejected triples across chat, reasoning, safety, and code, used to evaluate reward models and LLM judges.
- What capabilities does RewardBench test?
- RewardBench evaluates LLM judging and safety.
- How can a model improve its RewardBench score?
- Tools linked to RewardBench on Sophon include Reward Bench RL Env (Prime Intellect) and HelpSteer2. These are RL environments, datasets, and scaffolds that target this eval.
- What license is RewardBench under?
- RewardBench is available under ODC-BY-1.0.
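To make the "prompt-chosen-rejected triple" format concrete, here is a minimal sketch of how a scorer could be checked against such triples. It assumes a triple counts as correct when the scorer rates the chosen response strictly higher than the rejected one. The field names follow the description above; the `toy_length_score` function and the three sample triples are hypothetical illustrations, not data or code from RewardBench itself, and any per-category weighting the benchmark applies is not modeled.

```python
from typing import Callable, Iterable, TypedDict


class Triple(TypedDict):
    """One evaluation item: a prompt with a preferred and a dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


def pairwise_accuracy(
    triples: Iterable[Triple],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of triples where `score` rates chosen strictly above rejected."""
    items = list(triples)
    if not items:
        raise ValueError("need at least one triple")
    wins = sum(
        1
        for t in items
        if score(t["prompt"], t["chosen"]) > score(t["prompt"], t["rejected"])
    )
    return wins / len(items)


def toy_length_score(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: longer responses score higher."""
    return float(len(response))


# Hypothetical sample triples, written for this sketch only.
EXAMPLES: list[Triple] = [
    {"prompt": "What is 2 + 2?", "chosen": "2 + 2 equals 4.", "rejected": "5"},
    {"prompt": "Name a prime.", "chosen": "7 is a prime number.", "rejected": "9"},
    {
        "prompt": "Is the sky green?",
        "chosen": "No.",
        "rejected": "Yes, the sky is always bright green.",
    },
]

accuracy = pairwise_accuracy(EXAMPLES, toy_length_score)
print(f"pairwise accuracy: {accuracy:.3f}")
```

The third sample is built so that the length heuristic prefers the rejected answer, which shows why a scorer that rewards verbosity would lose points on this kind of benchmark. A real reward model or LLM judge would replace `toy_length_score` with its own scoring call.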