Mostly Basic Python Problems (MBPP)

Saturated

974 short crowd-sourced Python tasks with three unit tests each, used alongside HumanEval as a baseline code-generation benchmark.

Open

Publisher: Google Research
Capabilities: Code Generation
Domain: code
Format: HF Dataset
Size: 974 tasks
License: CC-BY-4.0
Published: Aug 2021
Notable for: Baseline benchmark for Python code generation, used alongside HumanEval.
Canonical: github.com/google-research/google-research/tree/master/mbpp
Also on: huggingface.co/datasets/mbpp
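
The task format described above (a short problem statement plus three unit tests) can be scored by running a candidate solution against each test. The sketch below is a minimal illustration, not the official evaluation harness: the example task, the field names `text` and `test_list`, and the helper `passes_all_tests` are assumptions made for illustration, and plain `exec` is unsafe for untrusted model output.

```python
def passes_all_tests(candidate_source: str, test_asserts: list[str]) -> bool:
    """Return True if the candidate code satisfies every assert statement."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        for test in test_asserts:
            exec(test, namespace)  # each test is a bare assert statement
    except Exception:
        # Syntax errors, runtime errors and failed asserts all count as a fail.
        return False
    return True


# Hypothetical task in the MBPP style (not copied from the dataset).
task = {
    "text": "Write a function to find the smaller of two numbers.",
    "test_list": [
        "assert min_of_two(1, 2) == 1",
        "assert min_of_two(-5, -4) == -5",
        "assert min_of_two(0, 0) == 0",
    ],
}

good = "def min_of_two(a, b):\n    return a if a < b else b\n"
bad = "def min_of_two(a, b):\n    return b\n"  # fails the first test

results = [passes_all_tests(src, task["test_list"]) for src in (good, bad)]
pass_rate = 100.0 * sum(results) / len(results)
print(results, pass_rate)
```

A score such as the top score reported on this page is a pass rate of this kind, computed over all tasks. A real harness would also need timeouts and process isolation, which this sketch omits.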

Attribution

Leaderboard scores: EvalPlus prime-hub

Top score: 100.0% by GPT-5 Nano, with 15 models reporting (3 from frontier labs).

Score history and top models (charts): 15 models shown; highest value is GPT-5 Nano at 100.

Where it's ranked

EvalPlus Leaderboard (EvalPlus Team) · aggregated with 3 others · monthly

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

MBPP RL Env (Community)

MBPP baseline environment for Verifiers (compile + tests via SandboxFusion).

Implementation · RL Env · Mbpp · Code · Python

Openenv Coding RL Env (Meta FAIR (Fundamental AI Research))

Sandboxed Python code-execution environment built on smolagents, exposing stdout/stderr/exit_code via the OpenEnv HTTP interface for closed-loop code-solver training.

Trains toward · RL Env · Code Generation · Tool Calling · Code
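
The idea of running generated code in isolation and reporting stdout, stderr and an exit code can be illustrated with the standard library alone. This is a rough sketch under stated assumptions, not the OpenEnv or smolagents interface: `ExecResult` and `run_python` are hypothetical names, and a child process with a timeout is not a real security sandbox.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ExecResult:
    """Hypothetical result record mirroring stdout/stderr/exit_code."""
    stdout: str
    stderr: str
    exit_code: int


def run_python(source: str, timeout_s: float = 5.0) -> ExecResult:
    """Run source in a fresh interpreter process and capture its outputs."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Report a timeout as a failure with a sentinel exit code.
        return ExecResult(stdout="", stderr="timed out", exit_code=-1)
    return ExecResult(proc.stdout, proc.stderr, proc.returncode)


ok = run_python("print(2 + 3)")
failed = run_python("assert 1 == 2")
print(ok.exit_code, failed.exit_code)
```

A training loop could turn `exit_code == 0` on a task's tests into a reward signal; how the linked environment actually does this is not stated here.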

MBPP Baseline RL Env (Community)

MBPP baseline environment for Verifiers (compile + tests via SandboxFusion).

Trains toward · RL Env · Mbpp · Code · Python

Papers

Program Synthesis with Large Language Models

preprint · 2021

Google paper that introduces MBPP (974 short crowd-sourced Python problems with unit tests) and MathQA-Python, longtime companions to HumanEval.

introduces

Contributors

Jacob Austin

FAQ

What is Mostly Basic Python Problems (MBPP)?: 974 short crowd-sourced Python tasks with three unit tests each, used alongside HumanEval as a baseline code-generation benchmark.
What capabilities does Mostly Basic Python Problems (MBPP) test?: Mostly Basic Python Problems (MBPP) evaluates code generation.
What is the current top score on Mostly Basic Python Problems (MBPP)?: The top reported score is 100.0% by GPT-5 Nano, across 15 models reporting (3 from frontier labs).
How can a model improve its Mostly Basic Python Problems (MBPP) score?: Sophon links RL environments, datasets, and scaffolds that target this eval, including MBPP RL Env (Community), Openenv Coding RL Env (Meta FAIR (Fundamental AI Research)), and MBPP Baseline RL Env (Community).
What license is Mostly Basic Python Problems (MBPP) under?: Mostly Basic Python Problems (MBPP) is available under CC-BY-4.0.