BIG-Bench Hard (BBH)
Frontier
23 challenging multi-step reasoning tasks distilled from BIG-Bench where prior models underperformed average humans.
- Publisher
- Google Research
- Capabilities
- Planning, Scientific Reasoning, Math, Logic
- Format
- HF Dataset
- Size
- 6,511 examples across 23 tasks
- License
- MIT
- Published
- Jun 2022
- Notable for
- Benchmark for evaluating planning, scientific reasoning, math, and logic.
Top score: 72.7% by Qwen2.5 72B Instruct; 79 models reporting (6 from frontier labs)
Score history (42)
Top models (79)
Where it's ranked (1)
Related tools (4): implementations, trainers, datasets and scaffolds linked to this eval.
Papers (3)
Contributors (2)
FAQ
- What is BIG-Bench Hard (BBH)?
- 23 challenging multi-step reasoning tasks distilled from BIG-Bench where prior models underperformed average humans.
- What capabilities does BIG-Bench Hard (BBH) test?
- BIG-Bench Hard (BBH) evaluates planning, scientific reasoning, math, and logic.
- What is the current top score on BIG-Bench Hard (BBH)?
- The top reported score is 72.7% by Qwen2.5 72B Instruct, across 79 models reporting (6 from frontier labs).
- How can a model improve its BIG-Bench Hard (BBH) score?
- Tools linked to BIG-Bench Hard (BBH) on Sophon include BBH RL Env (Community), Bigbench BBH RL Env (Prime Community), OpenOrca, and SlimOrca. These are RL environments, datasets, and scaffolds that target this eval.
- What license is BIG-Bench Hard (BBH) under?
- BIG-Bench Hard (BBH) is available under the MIT license.
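
The page reports scores as a single percentage but does not say how per-task results are combined. As a rough sketch only, the snippet below shows one plausible way to score a multi-task benchmark like this: exact-match accuracy per task, then both a micro average (over all examples) and a macro average (over tasks). The function names, the normalization rule, and the toy records are assumptions for illustration, not the scoring method used by this leaderboard or by any particular model report.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    # Assumed normalization: trim whitespace and ignore case.
    return answer.strip().lower()


def exact_match_scores(records):
    """Score (task, prediction, target) triples.

    Returns (per_task_accuracy, micro_average, macro_average).
    """
    counts = defaultdict(lambda: [0, 0])  # task -> [correct, total]
    for task, prediction, target in records:
        counts[task][1] += 1
        if normalize(prediction) == normalize(target):
            counts[task][0] += 1
    if not counts:
        raise ValueError("no records to score")

    per_task = {task: c / n for task, (c, n) in counts.items()}
    total_correct = sum(c for c, _ in counts.values())
    total_seen = sum(n for _, n in counts.values())
    micro = total_correct / total_seen          # every example weighs the same
    macro = sum(per_task.values()) / len(per_task)  # every task weighs the same
    return per_task, micro, macro


# Toy, made-up predictions on two task names; not real model output.
toy_records = [
    ("boolean_expressions", "True", "True"),
    ("boolean_expressions", "False", "True"),
    ("boolean_expressions", " true ", "True"),
    ("date_understanding", "(B)", "(B)"),
]
per_task, micro, macro = exact_match_scores(toy_records)
print(per_task, micro, macro)
```

The two averages can differ noticeably when tasks have different sizes, so it may be worth checking which one a given report uses before comparing scores.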

