Mostly Basic Python Problems (MBPP)

Saturated

974 short crowd-sourced Python tasks with three unit tests each, used alongside HumanEval as a baseline code-generation benchmark.
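
To make the "three unit tests each" structure concrete, here is a minimal sketch of how one MBPP-style task might be checked. The task below is hypothetical and not taken from the dataset. The field names `text` and `test_list`, and the helper `passes_all_tests`, are assumptions chosen for illustration. A real harness would run untrusted model output in a sandbox with a timeout rather than calling `exec` directly.

```python
def passes_all_tests(candidate_source: str, test_asserts: list[str]) -> bool:
    """Return True only if the candidate code satisfies every assert statement."""
    namespace: dict = {}
    try:
        # Define the candidate function(s) in an isolated namespace.
        exec(candidate_source, namespace)
        # Each test is a bare `assert ...` statement run against that namespace.
        for test in test_asserts:
            exec(test, namespace)
    except Exception:
        # Any failed assertion, syntax error, or runtime error counts as a failure.
        return False
    return True


# Hypothetical task in the MBPP style: a short prompt plus three assert-based tests.
task = {
    "text": "Write a function to return the sum of squares of a list of numbers.",
    "test_list": [
        "assert sum_of_squares([1, 2, 3]) == 14",
        "assert sum_of_squares([]) == 0",
        "assert sum_of_squares([-2]) == 4",
    ],
}

# Two hypothetical model outputs: one correct, one that ignores the squaring step.
good = "def sum_of_squares(xs):\n    return sum(x * x for x in xs)\n"
bad = "def sum_of_squares(xs):\n    return sum(xs)\n"

print(passes_all_tests(good, task["test_list"]))
print(passes_all_tests(bad, task["test_list"]))
```

Under this sketch a task counts as solved only when all three tests pass, so a benchmark score would be the share of tasks for which the check returns `True`.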

Capabilities: Code Generation
Domain: code
Format: HF Dataset
Size: 974 tasks
License: CC-BY-4.0
Published: Aug 2021
Notable for: Benchmark for evaluating code generation in the code domain.

Attribution

Leaderboard scores: EvalPlus, prime-hub

Top score: 100.0% by GPT-5 Nano; 15 models reporting (3 from frontier labs).

Score history

[Chart: score history. Y-axis from 55% to 100%; x-axis from Apr 24 to Aug 25. Labeled models: Llama 3 Instruct 70B, Grok Beta, Qwen2.5 Coder 32B Instruct, GPT-5 Nano.]

Top models

[Bar chart: 15 models. Highest value: GPT-5 Nano at 100.]

Where it's ranked (1)

Related tools (3)

Implementations, trainers, datasets and scaffolds linked to this eval.

Papers (2)

Contributors (1)

FAQ

What is Mostly Basic Python Problems (MBPP)?
974 short crowd-sourced Python tasks with three unit tests each, used alongside HumanEval as a baseline code-generation benchmark.
What capabilities does Mostly Basic Python Problems (MBPP) test?
Mostly Basic Python Problems (MBPP) evaluates code generation.
What is the current top score on Mostly Basic Python Problems (MBPP)?
The top reported score is 100.0% by GPT-5 Nano, across 15 models reporting (3 from frontier labs).
How can a model improve its Mostly Basic Python Problems (MBPP) score?
Tools linked to Mostly Basic Python Problems (MBPP) on Sophon include MBPP RL Env (Community), Openenv Coding RL Env (Meta FAIR, Fundamental AI Research), and MBPP Baseline RL Env (Community). These are RL environments, datasets, and scaffolds that target this eval.
What license is Mostly Basic Python Problems (MBPP) under?
Mostly Basic Python Problems (MBPP) is available under CC-BY-4.0.