BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

OpenAI benchmark of 1,266 web-research questions that require persistent, creative browsing to find a single short verifiable answer.

Publisher: OpenAI
Year: 2025
Venue: preprint
ArXiv: arxiv.org/abs/2504.12516
Code: github.com/openai/simple-evals
Authors: 10
Hosting: External source, license unknown

Attribution

Abstract & full text: arxiv.org/abs/2504.12516
TL;DR: semanticscholar.org/paper/41d1ea36a9af136efc42f3c85516d00cc1d13458
Code: github.com/openai/simple-evals

Introduces 1 artifact (1 eval)

TL;DR

Semantic Scholar

While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information.

Artifacts

Evals

BrowseComp

Authors

Alex Tachard Passos, Amelia Glaese, Hyung Won Chung, Isa Fulford, Jason Wei, Jeffrey Han, Scott Mayer McKinney, Spencer Papay, William Fedus, Zhiqing Sun