DABstep

Saturated

450+ multi-step data-analysis tasks that combine structured records with unstructured documents, requiring an agent to plan, query, and reason in stages.

Open

Publisher: Hugging Face
Capabilities: Tool Calling, Planning, Code Generation
Domain: agentic
Format: HF Dataset
Size: 450 tasks
License: CC-BY-4.0
Published: May 2026
Notable for: Benchmark for evaluating tool calling, planning and code generation in the agentic domain.
Canonical: huggingface.co/blog/dabstep
Also on: huggingface.co/spaces/adyen/DABstep

Attribution

Leaderboard scores: dabstep-org, prime-hub

Top score 100.0% by GPT-5 Mini - 24 models reporting (15 frontier)

[Figure: DABstep top models, bar chart with 21 bars. Highest value: GPT-5 Mini at 100.]

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

Dabstep RL Env (Prime Community)

Prime Community

Data Agent Benchmark for Multi-step Reasoning

Tags: Implementation, RL Env, Coding, Data Analysis, OpenAI Compatible

Dabstep RL Env (Prime Intellect)

Prime Intellect

Data Agent Benchmark for Multi-step Reasoning

Tags: Implementation, RL Env, Coding, Data Analysis, OpenAI Compatible

Dabstep RL Env (Community)

Data Agent Benchmark for Multi-step Reasoning

Tags: Implementation, RL Env, Coding, Data Analysis, OpenAI Compatible

OpenEnv Jupyter Agent (E2B-backed)

Hugging Face

Stateful Jupyter-notebook environment backed by E2B sandboxes; the model edits and executes cells turn-by-turn, building up a notebook that solves a data-science task.

Tags: Trains toward, RL Env, Code Generation, Planning, Tool Calling
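
The turn-by-turn notebook loop described for the OpenEnv Jupyter Agent can be illustrated with a small local sketch. This is not the OpenEnv or E2B API: `NotebookSession` and `run_cell` are hypothetical names, and the code runs cells with Python's built-in `exec` in the current process rather than in a sandbox. It only shows the core idea that state persists between cells while each cell's output is returned to the model.

```python
import contextlib
import io


class NotebookSession:
    """Toy stand-in for a stateful notebook (hypothetical, not the OpenEnv API)."""

    def __init__(self):
        self.namespace = {}  # variables defined in one cell stay visible to later cells
        self.cells = []      # (source, output) pairs, i.e. the notebook built so far

    def run_cell(self, source):
        """Execute one cell and return its captured stdout or an error summary."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                # NOTE: no sandboxing here; a real environment would isolate this.
                exec(source, self.namespace)
            output = buffer.getvalue()
        except Exception as exc:
            # Report the error as text so the agent can react on its next turn.
            output = buffer.getvalue() + f"{type(exc).__name__}: {exc}"
        self.cells.append((source, output))
        return output


# Each call plays the role of one agent turn.
session = NotebookSession()
session.run_cell("amounts = [12.5, 7.5, 30.0]")
session.run_cell("total = sum(amounts)")
result = session.run_cell("print(total)")
```

Because the namespace is shared, the third cell can read `total` from the second. A failing cell returns an error string instead of raising, which is one plausible way to let an agent see the failure and correct it on the next turn.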

FAQ

What is DABstep? 450+ multi-step data-analysis tasks that combine structured records with unstructured documents, requiring an agent to plan, query, and reason in stages.
What capabilities does DABstep test? DABstep evaluates tool calling, planning, and code generation.
What is the current top score on DABstep? The top reported score is 100.0% by GPT-5 Mini, across 24 models reporting (15 from frontier labs).
How can a model improve its DABstep score? Sophon links four tools to DABstep: Dabstep RL Env (Prime Community), Dabstep RL Env (Prime Intellect), Dabstep RL Env (Community), and OpenEnv Jupyter Agent (E2B-backed). These are RL environments, datasets, and scaffolds that target this eval.
What license is DABstep under? DABstep is available under CC-BY-4.0.