RewardBench 2

Active

The 2025 successor to RewardBench. It is harder, uses multiple completions per prompt (not just a chosen-vs-rejected pair), and has refreshed prompts to address contamination.

Capabilities
LLM Judging, Safety
Format
HF Dataset
Size
1865 tasks
License
ODC-BY-1.0
Published
Jun 2025
Notable for
Benchmark for evaluating LLM judging and safety.

Related tools

1

Implementations, trainers, datasets and scaffolds linked to this eval.

Papers

2

Contributors

1

FAQ

What is RewardBench 2?
RewardBench 2 is the 2025 successor to RewardBench. It is harder, uses multiple completions per prompt (not just a chosen-vs-rejected pair), and has refreshed prompts to address contamination.
What capabilities does RewardBench 2 test?
RewardBench 2 evaluates LLM judging and safety.
How can a model improve its RewardBench 2 score?
Tools linked to RewardBench 2 on Sophon include Reward Bench RL Env (Prime Intellect). Linked tools cover RL environments, datasets, and scaffolds that target this eval.
What license is RewardBench 2 under?
RewardBench 2 is available under ODC-BY-1.0.
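
How might multi-completion scoring differ from chosen-vs-rejected?
The following is a minimal sketch, not the official scoring code. It assumes each item has one chosen completion and several rejected ones, and that an item counts as correct only when the reward model scores the chosen completion strictly above every rejected one. The field names `chosen_score` and `rejected_scores` are hypothetical, and the scores are toy values.

```python
from typing import Mapping, Sequence


def item_correct(chosen_score: float, rejected_scores: Sequence[float]) -> bool:
    """Return True only if the chosen completion outscores every rejected one.

    Assumption: ties count as failures (strict comparison).
    """
    return all(chosen_score > r for r in rejected_scores)


def accuracy(items: Sequence[Mapping]) -> float:
    """Fraction of items where the chosen completion is ranked first."""
    if not items:
        raise ValueError("need at least one item")
    hits = sum(
        item_correct(it["chosen_score"], it["rejected_scores"]) for it in items
    )
    return hits / len(items)


# Toy data (hypothetical field names and scores).
toy_items = [
    # Chosen beats all three rejected completions: correct.
    {"chosen_score": 0.9, "rejected_scores": [0.1, 0.5, 0.3]},
    # Chosen beats two of three, which would pass two pairwise checks
    # but fails the multi-completion check.
    {"chosen_score": 0.9, "rejected_scores": [0.1, 0.95, 0.3]},
]

print(accuracy(toy_items))
```

Under these assumptions, an item with three rejected completions gives a random scorer roughly a 25% chance of ranking the chosen completion first, compared with 50% for a single chosen-vs-rejected pair, which is one reason a multi-completion format is harder.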