
BrowseComp

Frontier

1,266 hard fact-finding questions on the open web requiring persistent browsing and reasoning over scattered, obscure sources.

Publisher: OpenAI
Domain: agentic
Format: Custom
Size: 1,266 tasks
License: MIT
Published: Apr 2025
Updates: Monthly
Notable for: The canonical benchmark for "deep research" / browsing agents, where GPT-4o scored near 0% at launch but GPT-5.5 Pro now exceeds 90%.
Official leaderboard: openai.com/index/browsecomp

Attribution

Leaderboard scores: prime-hub

Top score 20.0% by GPT-5 Mini - 5 models reporting (5 frontier)
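
To put the top score in terms of individual questions, here is a minimal sketch of the arithmetic. It assumes the reported score is plain accuracy over the full 1,266-task set, which this page does not state; the function name `tasks_solved` is hypothetical and used only for illustration.

```python
# Assumption: the leaderboard score is simple accuracy over all 1,266 tasks.
TOTAL_TASKS = 1266   # "Size" field on this page
TOP_SCORE = 0.200    # 20.0%, reported for GPT-5 Mini


def tasks_solved(score: float, total: int = TOTAL_TASKS) -> int:
    """Approximate number of tasks answered correctly at a given accuracy."""
    return round(score * total)


solved = tasks_solved(TOP_SCORE)
print(f"about {solved} of {TOTAL_TASKS} tasks correct, {TOTAL_TASKS - solved} missed")
```

Under that assumption, a 20.0% score corresponds to roughly 253 correct answers, leaving about 1,013 questions unanswered or wrong.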

Score history (5)

[Chart: score (0% to 100%) over time, Apr 2025 to Dec 2025; labeled models: o3, GPT-5 Mini]

Top models (5)

[Bar chart: BrowseComp scores for 5 models; highest value: GPT-5 Mini at 20%]

Where it's ranked (1)

Related tools (7)

Implementations, trainers, datasets and scaffolds linked to this eval.

Papers (2)

Contributors (1)

FAQ

What is BrowseComp?
BrowseComp is a benchmark of 1,266 hard fact-finding questions on the open web. Answering them requires persistent browsing and reasoning over scattered, obscure sources.
What capabilities does BrowseComp test?
BrowseComp evaluates browser use, retrieval, and planning.
What is the current top score on BrowseComp?
The top reported score is 20.0% by GPT-5 Mini, across 5 models reporting (5 from frontier labs).
How can a model improve its BrowseComp score?
Tools linked to BrowseComp on Sophon target this eval with RL environments, datasets, and scaffolds. They include BB Browsecomp RL Env (Prime Intellect), Browsecomp RL Env (Prime Intellect), DeepDive (Serper-powered web QA), and Browsecomp Openai RL Env (Community).
What license is BrowseComp under?
BrowseComp is available under the MIT license.