DABstep
Saturated
450+ multi-step data-analysis tasks combining structured records with unstructured docs, requiring an agent to plan, query, and reason in stages.
- Publisher: Hugging Face
- Capabilities: Tool Calling, Planning, Code Generation
- Domain: agentic
- Format: HF Dataset
- Size: 450 tasks
- License: CC-BY-4.0
- Published: May 2026
- Notable for: Benchmark for evaluating tool calling, planning and code generation in the agentic domain.
- Canonical: huggingface.co/blog/dabstep
Top score: 100.0% by GPT-5 Mini, with 24 models reporting (15 from frontier labs).
Score history (23)
Top models (24)
Related tools (4)
Implementations, trainers, datasets and scaffolds linked to this eval.
FAQ
- What is DABstep?
- DABstep is a benchmark of 450+ multi-step data-analysis tasks that combine structured records with unstructured docs, requiring an agent to plan, query, and reason in stages.
- What capabilities does DABstep test?
- DABstep evaluates tool calling, planning, and code generation.
- What is the current top score on DABstep?
- The top reported score is 100.0% by GPT-5 Mini, across 24 models reporting (15 from frontier labs).
- How can a model improve its DABstep score?
- Sophon links four tools that target this eval: Dabstep RL Env (Prime Community), Dabstep RL Env (Prime Intellect), Dabstep RL Env (Community), and OpenEnv Jupyter Agent (E2B-backed). They cover RL environments, datasets, and scaffolds.
- What license is DABstep under?
- DABstep is available under CC-BY-4.0.
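To make the task description concrete, here is a minimal sketch of what "plan, query, and reason in stages" can look like when a rule sits in an unstructured document and the data sits in structured records. Everything in it is hypothetical and chosen for illustration: the `MANUAL` text, the `PAYMENTS` records, and the helper names are assumptions, not DABstep data or any official harness.

```python
import re

# Hypothetical inputs for illustration only (not taken from DABstep):
# an unstructured rule document and a small set of structured records.
MANUAL = "Card scheme Alpha charges a fixed fee of 0.10 plus 1.5% of the amount."
PAYMENTS = [
    {"scheme": "Alpha", "amount": 100.0},
    {"scheme": "Beta", "amount": 250.0},
    {"scheme": "Alpha", "amount": 40.0},
]


def read_fee_rule(text, scheme):
    """Stage 1: pull the fee rule for one scheme out of the document text."""
    pattern = rf"{scheme} charges a fixed fee of ([\d.]+) plus ([\d.]+)%"
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"no fee rule found for {scheme}")
    fixed = float(match.group(1))
    rate = float(match.group(2)) / 100.0  # convert percent to a fraction
    return fixed, rate


def select_payments(records, scheme):
    """Stage 2: query the structured records for the rows the rule applies to."""
    return [row for row in records if row["scheme"] == scheme]


def total_fees(rows, fixed, rate):
    """Stage 3: combine the extracted rule with the selected rows."""
    return round(sum(fixed + rate * row["amount"] for row in rows), 2)


def solve(scheme):
    """Run the three stages in order to answer 'total fees for this scheme'."""
    fixed, rate = read_fee_rule(MANUAL, scheme)
    rows = select_payments(PAYMENTS, scheme)
    return total_fees(rows, fixed, rate)


if __name__ == "__main__":
    print(solve("Alpha"))
```

The point of the sketch is that no single stage is enough on its own: the answer depends on reading the rule correctly, selecting the right rows, and only then computing.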