DABstep

Saturated

More than 450 multi-step data-analysis tasks that combine structured records with unstructured documents and require an agent to plan, query, and reason in stages.

Open
Publisher: Hugging Face
Domain: agentic
Format: HF Dataset
Size: 450 tasks
License: CC-BY-4.0
Published: May 2026
Notable for: Benchmark for evaluating tool calling, planning, and code generation in the agentic domain.


Attribution

Leaderboard scores
dabstep-orgprime-hub

Top score 100.0% by GPT-5 Mini - 24 models reporting (15 frontier)

Score history

[Chart: score history, 0% to 100%, Sep 24 to Sep 25. Labeled models: Llama 3.2 Instruct 1B, GPT-4o, DeepSeek V3, Qwen3 (family), GPT-5 Mini.]

Top models

[Bar chart: DABstep scores for 21 models. Highest value: GPT-5 Mini at 100.]

Related tools


Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is DABstep?
DABstep is a set of more than 450 multi-step data-analysis tasks that combine structured records with unstructured documents and require an agent to plan, query, and reason in stages.
What capabilities does DABstep test?
DABstep evaluates tool calling, planning, and code generation.
What is the current top score on DABstep?
The top reported score is 100.0% by GPT-5 Mini, across 24 models reporting (15 from frontier labs).
How can a model improve its DABstep score?
Tools linked to DABstep on Sophon include Dabstep RL Env (Prime Community), Dabstep RL Env (Prime Intellect), Dabstep RL Env (Community), and OpenEnv Jupyter Agent (E2B-backed). These are RL environments, datasets, and scaffolds that target this eval.
What license is DABstep under?
DABstep is available under CC-BY-4.0.
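What does a multi-step task of this kind look like?
As a rough illustration only, the sketch below walks through the staged pattern described above: read a rule from free text, query structured records, then combine the two. The records, the manual text, and every function name are hypothetical and are not taken from the DABstep dataset or its tooling.

```python
import re

# Hypothetical structured records (illustrative, not DABstep data).
PAYMENTS = [
    {"merchant": "A", "scheme": "visa", "amount": 100.0},
    {"merchant": "A", "scheme": "amex", "amount": 50.0},
    {"merchant": "B", "scheme": "visa", "amount": 300.0},
]

# Hypothetical unstructured documentation the agent has to read.
MANUAL = (
    "Fees: visa transactions are charged 2% of the amount. "
    "amex transactions are charged 4% of the amount."
)


def read_fee_rates(text):
    """Stage 1: extract per-scheme fee rates from the free-text manual."""
    pattern = r"(\w+) transactions are charged (\d+(?:\.\d+)?)%"
    return {scheme: float(pct) / 100 for scheme, pct in re.findall(pattern, text)}


def select_payments(records, merchant):
    """Stage 2: query the structured records for one merchant."""
    return [r for r in records if r["merchant"] == merchant]


def total_fees(records, rates):
    """Stage 3: combine both sources into a single numeric answer."""
    return sum(r["amount"] * rates[r["scheme"]] for r in records)


def answer(merchant):
    """Run the three stages in order for a question about one merchant."""
    rates = read_fee_rates(MANUAL)
    rows = select_payments(PAYMENTS, merchant)
    return round(total_fees(rows, rates), 2)


if __name__ == "__main__":
    print(answer("A"))
    print(answer("B"))
```

In an actual agentic run the model would choose and order these stages itself through tool calls and generated code; the fixed pipeline here only shows why neither the records nor the documents alone are enough to answer the question.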