What capabilities does BrowseComp test?

BrowseComp evaluates browser use, retrieval, and planning.

What is the current top score on BrowseComp?

The top reported score is 20.0%, by GPT-5 Mini, among 5 models reporting (all 5 from frontier labs).

How can a model improve its BrowseComp score?

Tools linked to BrowseComp on Sophon include BB Browsecomp RL Env (Prime Intellect), Browsecomp RL Env (Prime Intellect), DeepDive (Serper-powered web QA), and Browsecomp Openai RL Env (Community). These are RL environments, datasets, and scaffolds that target this eval.

What license is BrowseComp under?

BrowseComp is available under the MIT license.

BrowseComp

Frontier

1,266 hard fact-finding questions on the open web requiring persistent browsing and reasoning over scattered, obscure sources.

Open

Publisher: OpenAI
Capabilities: Browser Use, Retrieval, Planning
Domain: agentic
Format: Custom
Size: 1,266 tasks
License: MIT
Published: Apr 2025
Updates: Monthly
Notable for: The canonical benchmark for "deep research" / browsing agents — where GPT-4o scored near 0% at launch but GPT-5.5 Pro now exceeds 90%.
Canonical: openai.com/index/browsecomp
Official leaderboard: openai.com/index/browsecomp
Also on: github.com/openai/simple-evals

Attribution

Leaderboard scores: prime-hub

Top score: 20.0% by GPT-5 Mini (5 models reporting, all 5 frontier)

Top models

[Bar chart: BrowseComp scores for 5 models. Highest value: GPT-5 Mini at 20.]

Where it's ranked

Official leaderboard: openai.com (single benchmark, updated monthly)

Related tools

Implementations, trainers, datasets and scaffolds linked to this eval.

BB Browsecomp RL Env (Prime Intellect)

Prime Intellect

BrowserEnv[DOM] Environment on Browsecomp

Implementation · RL Env

Browsecomp RL Env (Prime Intellect)

Prime Intellect

BrowseComp evaluation environment

Implementation · RL Env · Web Search · Tool Use · Web

DeepDive (Serper-powered web QA)

Prime Intellect

Multi-turn open-web research QA environment where the model uses a Serper search tool to answer hard, hop-heavy questions of the BrowseComp / SimpleQA style.

Trains toward · RL Env · Retrieval · Tool Calling · Planning
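To make the "multi-turn, hop-heavy" idea in the DeepDive description concrete, here is a minimal sketch of a search-then-answer episode loop. It is not the DeepDive implementation. The names `run_episode`, `scripted_policy`, and `stub_search` are hypothetical, the in-memory `CORPUS` stands in for a Serper-backed search tool, and the scripted policy stands in for a model.

```python
from typing import Callable, List, Optional, Tuple

# An action is ("search", query) or ("answer", final_answer).
Action = Tuple[str, str]
# History holds (query, results) pairs from earlier turns.
History = List[Tuple[str, List[str]]]

def run_episode(
    question: str,
    policy: Callable[[str, History], Action],
    search: Callable[[str], List[str]],
    max_turns: int = 5,
) -> Tuple[Optional[str], History]:
    """Alternate policy decisions and search calls until the policy answers
    or the turn budget runs out (in which case the answer is None)."""
    history: History = []
    for _ in range(max_turns):
        kind, payload = policy(question, history)
        if kind == "answer":
            return payload, history
        history.append((payload, search(payload)))
    return None, history

# Hypothetical two-document corpus; a real environment would query the web.
CORPUS = {
    "paper x author": ["Paper X was written by Ada Example."],
    "ada example birthplace": ["Ada Example was born in Exampleville."],
}

def stub_search(query: str) -> List[str]:
    """Offline stand-in for a search tool: exact lookup on the lowercased query."""
    return CORPUS.get(query.lower(), [])

def scripted_policy(question: str, history: History) -> Action:
    """Stand-in for a model: two fixed hops, then answer with the last word found."""
    if len(history) == 0:
        return ("search", "paper x author")
    if len(history) == 1:
        return ("search", "ada example birthplace")
    results = history[-1][1]
    answer = results[0].rsplit(" ", 1)[-1].rstrip(".") if results else ""
    return ("answer", answer)
```

Calling `run_episode("Where was the author of Paper X born?", scripted_policy, stub_search)` takes two search hops before answering. With `max_turns=1` the budget runs out first and the returned answer is `None`.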

Browsecomp Openai RL Env (Community)

Tool-use environment for the model to browse the web and locate hard-to-find information; scored using an LLM-as-judge rubric

Trains toward · RL Env · Web Search · Tool Use · Web

Browsecomp Openai RL Env (Prime Intellect)

Prime Intellect

Tool-use environment for the model to browse the web and locate hard-to-find information; scored using an LLM-as-judge rubric

Trains toward · RL Env · Web Search · Tool Use · Web
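The "LLM-as-judge rubric" mentioned in the two environment descriptions above can be pictured roughly as follows. This is a sketch under stated assumptions, not the grader used by either environment. The prompt wording, the `correct: yes/no` verdict format, and the names `grade`, `accuracy`, and `stub_judge` are all hypothetical. The judge is an injected callable so the example runs offline.

```python
import re
from typing import Callable, Iterable, Tuple

# Hypothetical judge prompt; not the wording used by any listed environment.
JUDGE_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model response: {response}\n"
    "Does the response give the reference answer? "
    "Reply 'correct: yes' or 'correct: no'."
)

def parse_verdict(judge_output: str) -> bool:
    """Return True only if the judge output contains an explicit 'correct: yes'."""
    match = re.search(r"correct:\s*(yes|no)", judge_output, flags=re.IGNORECASE)
    return bool(match) and match.group(1).lower() == "yes"

def grade(question: str, reference: str, response: str,
          judge: Callable[[str], str]) -> bool:
    """Build the judge prompt, call the judge, and parse its verdict."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, response=response
    )
    return parse_verdict(judge(prompt))

def accuracy(records: Iterable[Tuple[str, str, str]],
             judge: Callable[[str], str]) -> float:
    """Fraction of (question, reference, response) records judged correct."""
    records = list(records)
    if not records:
        return 0.0
    return sum(grade(q, ref, resp, judge) for q, ref, resp in records) / len(records)

def stub_judge(prompt: str) -> str:
    """Offline stand-in for an LLM judge: case-insensitive substring match."""
    reference = re.search(r"Reference answer: (.*)", prompt).group(1).strip().lower()
    response = re.search(r"Model response: (.*)", prompt).group(1).strip().lower()
    return "correct: yes" if reference in response else "correct: no"
```

In practice `judge` would wrap a call to a grading model. Treating any output without an explicit "yes" verdict as incorrect is a conservative design choice, so malformed judge replies do not inflate the score.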

DDBC RL Env (Prime Intellect)

Prime Intellect

BrowseComp with DeepDive tools

Trains toward · RL Env · SearchQA

DDBC RLM RL Env (Prime Intellect)

Prime Intellect

BrowseComp with DeepDive tools using RLM

Trains toward · RL Env · SearchQA

Papers

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

preprint · 2025 · introduces BrowseComp

OpenAI benchmark of 1,266 web-research questions that require persistent, creative browsing to find a single short verifiable answer.

Contributors

Jason Wei

FAQ

What is BrowseComp?: 1,266 hard fact-finding questions on the open web requiring persistent browsing and reasoning over scattered, obscure sources.