BIG-Bench Hard (BBH)

Frontier

23 challenging multi-step reasoning tasks distilled from BIG-Bench where prior models underperformed average humans.

Format: HF Dataset
Size: 6,511 examples across 23 tasks
License: MIT
Published: Jun 2022
Notable for: Evaluating planning, scientific reasoning, and math.

Attribution

Leaderboard scores: OpenLLM, prime-hub

Top score 72.7% by Qwen2.5 72B Instruct - 79 models reporting (6 frontier)
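
This page does not say how a single BBH percentage is aggregated from the 23 tasks, and leaderboards differ on this. As a minimal sketch, assuming plain exact-match scoring with light normalization, the following computes per-task accuracy plus both a micro average (over all examples) and a macro average (over tasks). The function names and the toy records are hypothetical and are not taken from any leaderboard's code; only the three task names are real BBH tasks.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lenient normalization (an assumption): trim, drop a trailing period, lowercase."""
    return answer.strip().rstrip(".").strip().lower()


def bbh_scores(records):
    """Score (task_name, prediction, target) tuples by exact match.

    Returns (per_task_accuracy, micro_average, macro_average).
    Expects at least one record.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction, target in records:
        total[task] += 1
        correct[task] += int(normalize(prediction) == normalize(target))

    per_task = {task: correct[task] / total[task] for task in total}
    micro = sum(correct.values()) / sum(total.values())  # weights every example equally
    macro = sum(per_task.values()) / len(per_task)       # weights every task equally
    return per_task, micro, macro


# Made-up predictions for illustration only; not real model outputs.
records = [
    ("boolean_expressions", "True", "True"),
    ("boolean_expressions", "False", "True"),
    ("date_understanding", "(B)", "(B)"),
    ("date_understanding", "(a)", "(A)"),
    ("date_understanding", "(C)", "(A)"),
    ("navigate", "Yes.", "Yes"),
]

per_task, micro, macro = bbh_scores(records)
print(per_task)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

The two averages diverge whenever tasks have different sizes or difficulties, so scores from different sources are only comparable if they use the same aggregation and the same answer normalization.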

Score history

[Chart: score over time, Feb 2023 to Feb 2025; y-axis from 20% to 100%. Labeled models: Llama 65B, DeepSeek LLM 67B Chat, Qwen1.5 110B Chat, Qwen2.5 72B Instruct.]

Top models

[Bar chart: 21 of 79 models shown. Highest value: InternLM2.5 20B Chat at 74.7.]

Where it's ranked (1)

Related tools (4)

Implementations, trainers, datasets and scaffolds linked to this eval.

Papers (3)

Contributors (2)

FAQ

What is BIG-Bench Hard (BBH)?
BIG-Bench Hard (BBH) is a suite of 23 challenging multi-step reasoning tasks distilled from BIG-Bench, chosen because prior models underperformed average humans on them.
What capabilities does BIG-Bench Hard (BBH) test?
BIG-Bench Hard (BBH) evaluates planning, scientific reasoning, math, and logic.
What is the current top score on BIG-Bench Hard (BBH)?
The top reported score is 72.7% by Qwen2.5 72B Instruct, across 79 models reporting (6 from frontier labs).
How can a model improve its BIG-Bench Hard (BBH) score?
Tools linked to BIG-Bench Hard (BBH) on Sophon include BBH RL Env (Community), Bigbench BBH RL Env (Prime Community), OpenOrca, and SlimOrca. These are RL environments, datasets, and scaffolds that target this eval.
What license is BIG-Bench Hard (BBH) under?
BIG-Bench Hard (BBH) is available under the MIT license.