BIG-Bench Hard (BBH)

Frontier

23 challenging multi-step reasoning tasks distilled from BIG-Bench where prior models underperformed average humans.

Open

Publisher: Google Research
Capabilities: Planning, Scientific Reasoning, Math, Logic
Format: HF Dataset
Size: 6,511 examples across 23 tasks
License: MIT
Published: Jun 2022
Notable for: Benchmark for evaluating planning, scientific reasoning and math.
Canonical: github.com/suzgunmirac/BIG-Bench-Hard
Also on: huggingface.co/datasets/lukaemon/bbh

Attribution

Leaderboard scores: OpenLLM, prime-hub

Top score 72.7% by Qwen2.5 72B Instruct - 79 models reporting (6 frontier)

Top models

Bar chart of BIG-Bench Hard (BBH) scores for 21 models. Highest value: InternLM2.5 20B Chat at 74.7.

Where it's ranked

Open LLM Leaderboard

Hugging Face

Aggregated

aggregated with 6 others · live

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

BBH RL Env (Community)

BigBenchHard (BBH) evaluation environment with Chain-of-Thought

Implementation · RL Env · Reasoning · Bbh
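
An evaluation environment with chain-of-thought has to pull a final answer out of a free-form completion before it can score it. The sketch below shows one plausible way to do that with exact-match scoring. It assumes completions end with a sentence of the form "So the answer is X."; the function names `extract_answer` and `exact_match_accuracy` are hypothetical and are not taken from the listed environment.

```python
from typing import Sequence

# Assumption: the model ends its reasoning with "... the answer is X."
ANSWER_MARKER = "answer is"


def extract_answer(completion: str) -> str:
    """Return the text after the last 'answer is', without the final period.

    Falls back to the whole stripped completion when the marker is missing.
    """
    idx = completion.lower().rfind(ANSWER_MARKER)
    if idx == -1:
        return completion.strip()
    tail = completion[idx + len(ANSWER_MARKER):]
    return tail.strip().rstrip(".").strip()


def exact_match_accuracy(completions: Sequence[str], targets: Sequence[str]) -> float:
    """Fraction of completions whose extracted answer equals the target exactly."""
    if len(completions) != len(targets):
        raise ValueError("completions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(
        extract_answer(c) == t.strip() for c, t in zip(completions, targets)
    )
    return hits / len(targets)


if __name__ == "__main__":
    # Illustrative completions, not real model output.
    completions = [
        "Let's think step by step. not True is False. So the answer is False.",
        "Let's think step by step. The cup is left of the pen. So the answer is (B).",
    ]
    targets = ["False", "(A)"]
    print(exact_match_accuracy(completions, targets))
```

Exact match is strict by design: an answer of "B" would not match a target of "(B)", so a real harness may need task-specific normalization.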

Bigbench BBH RL Env (Prime Community)

Prime Community

Big Bench + BBH implementation

Implementation · RL Env · NLP · Bbh · Bigbench

OpenOrca

OpenOrca Team

An open reproduction of Microsoft's Orca recipe - FLAN prompts with GPT-4 chain-of-thought completions that taught reasoning by imitation.

Training data · SFT Dataset · Math · Instruction Following · Scientific Reasoning

SlimOrca

OpenOrca Team

A heavily-deduplicated, GPT-4-only slice of OpenOrca that delivers similar downstream quality at one-third the size.

Training data · SFT Dataset · Math · Instruction Following · Scientific Reasoning

Papers

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

TMLR · 2022

Introduces BIG-bench, a 200+ task collaborative benchmark spanning logic, social bias, code, and creative reasoning, contributed by 450+ authors.

introduces

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

ACL · 2022

Google paper isolating the 23 hardest BIG-Bench tasks (BBH) where prior models lagged humans, showing chain-of-thought prompting closes most of the gap.
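
To make the chain-of-thought setup concrete, here is a minimal sketch of assembling a few-shot prompt in which each exemplar carries a worked rationale before its answer. The `Exemplar` type, the `build_cot_prompt` function, and the exact wording of the prompt strings are assumptions for illustration; they are not the paper's released prompts.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Exemplar:
    """One worked example: question, step-by-step rationale, final answer."""
    question: str
    rationale: str
    answer: str


def build_cot_prompt(
    instruction: str, exemplars: Sequence[Exemplar], question: str
) -> str:
    """Assemble a few-shot chain-of-thought prompt ending at the open question."""
    parts = [instruction.strip(), ""]
    for ex in exemplars:
        parts.append(f"Q: {ex.question}")
        # Assumed phrasing: the rationale precedes an explicit final-answer sentence.
        parts.append(
            f"A: Let's think step by step. {ex.rationale} "
            f"So the answer is {ex.answer}."
        )
        parts.append("")
    parts.append(f"Q: {question}")
    # The prompt stops here so the model continues with its own reasoning.
    parts.append("A: Let's think step by step.")
    return "\n".join(parts)


if __name__ == "__main__":
    # Hand-written exemplar for illustration only.
    demo = build_cot_prompt(
        "Evaluate the boolean expression.",
        [Exemplar("not True is", "not True = False.", "False")],
        "True and False is",
    )
    print(demo)
```

Ending the prompt on the reasoning cue, rather than on "A:", nudges the model to produce a rationale before committing to an answer.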

Contributors

Mirac Suzgun, Jason Wei

FAQ

What is BIG-Bench Hard (BBH)?: 23 challenging multi-step reasoning tasks distilled from BIG-Bench where prior models underperformed average humans.
What capabilities does BIG-Bench Hard (BBH) test?: BIG-Bench Hard (BBH) evaluates planning, scientific reasoning, math, and logic.
What is the current top score on BIG-Bench Hard (BBH)?: The top reported score is 72.7%, achieved by Qwen2.5 72B Instruct. 79 models report scores, 6 of them from frontier labs.
How can a model improve its BIG-Bench Hard (BBH) score?: Sophon links four tools to BIG-Bench Hard (BBH): BBH RL Env (Community), Bigbench BBH RL Env (Prime Community), OpenOrca, and SlimOrca. They cover the RL environments, datasets, and scaffolds that target this eval.
What license is BIG-Bench Hard (BBH) under?: BIG-Bench Hard (BBH) is available under the MIT license.