BrowseComp

Description

BrowseComp is an environment for evaluating web-search reasoning capabilities. Based on the BrowseComp benchmark from OpenAI's simple-evals repository, it contains 1,266 encrypted research questions that require multi-hop reasoning and cannot be answered without current web information. The environment provides built-in web search and URL fetching tools powered by Tavily.

Capabilities

Multi-hop web search reasoning
Information retrieval and synthesis
Research question answering
Confidence calibration

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT

Tasks

There is one split in this environment:

test: 1,266 encrypted research questions

Questions require multi-hop reasoning across multiple web searches. Example: "What was the name of the 1995 film starring the actress who played Victoria that married the final scorer of the World Cup 1998 winner?"
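The questions in this split are described as encrypted. As a rough illustration, the upstream simple-evals release appears to obfuscate each row by XOR-ing the text with a key stretched from a SHA-256 digest and then base64-encoding the result. The sketch below follows that scheme; whether this environment stores rows the same way is an assumption, and `encrypt` is a hypothetical helper added only so the round trip can be checked.

```python
import base64
import hashlib


def derive_key(password: str, length: int) -> bytes:
    """Repeat the SHA-256 digest of `password` until it is `length` bytes long."""
    digest = hashlib.sha256(password.encode()).digest()
    return digest * (length // len(digest)) + digest[: length % len(digest)]


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(data, key))


def encrypt(plaintext: str, password: str) -> str:
    """Hypothetical inverse of decrypt, used here only for the round trip."""
    raw = plaintext.encode()
    return base64.b64encode(xor_bytes(raw, derive_key(password, len(raw)))).decode()


def decrypt(ciphertext_b64: str, password: str) -> str:
    """Decode base64, then XOR with the stretched key to recover the text."""
    raw = base64.b64decode(ciphertext_b64)
    return xor_bytes(raw, derive_key(password, len(raw))).decode()
```

Because XOR is its own inverse, `decrypt(encrypt(text, pw), pw)` returns `text`. This is obfuscation to keep questions out of training corpora, not strong cryptography.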

Reward Structure

This is a sparse reward environment with LLM-based grading:

Agent receives a research question
Agent uses web_search and fetch_url tools to gather information
Agent submits answer with explanation, exact_answer, and confidence
An LLM grader (gpt-5-mini) evaluates semantic equivalence
Binary reward: 1.0 if correct, 0.0 if incorrect
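The final step above, turning the grader's judgment into a binary reward, can be sketched as follows. The `correct: yes` / `correct: no` verdict line is an assumption borrowed from the upstream grading template, and `reward_from_verdict` is a hypothetical helper, not this environment's actual code. Treating an unparseable verdict as incorrect is likewise a design choice of the sketch.

```python
import re


def reward_from_verdict(grader_output: str) -> float:
    """Map a grader response to the sparse reward: 1.0 if correct, else 0.0."""
    # Assumed format: the grader ends its response with "correct: yes" or "correct: no".
    match = re.search(r"correct:\s*(yes|no)", grader_output, flags=re.IGNORECASE)
    if match is None:
        # No parseable verdict is treated as incorrect.
        return 0.0
    return 1.0 if match.group(1).lower() == "yes" else 0.0
```

For example, `reward_from_verdict("reasoning: the answers match\ncorrect: yes")` yields `1.0`.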

Data

Data is sourced from OpenAI's BrowseComp benchmark and stored on the OpenReward platform.

Tools

Tool	Description
`web_search`	Search the web using Tavily (returns titles, URLs, snippets)
`fetch_url`	Fetch full content from a URL (truncated to 8,000 characters)
`submit_answer`	Submit answer with explanation, exact_answer, and confidence
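To make the tool contracts in the table concrete, here is a minimal sketch of the `fetch_url` truncation and the three `submit_answer` fields. The `Submission` class, the `truncate_content` helper, and the 0 to 100 confidence range are all assumptions for illustration; the environment's real schema may differ.

```python
from dataclasses import asdict, dataclass

MAX_FETCH_CHARS = 8000  # truncation limit stated in the tool table


def truncate_content(text: str, limit: int = MAX_FETCH_CHARS) -> str:
    """Keep at most `limit` characters, as fetch_url is described as doing."""
    return text[:limit]


@dataclass
class Submission:
    """Hypothetical container for the submit_answer arguments."""

    explanation: str
    exact_answer: str
    confidence: float  # assumed to be a percentage between 0 and 100

    def __post_init__(self) -> None:
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")
```

Calling `asdict(Submission(...))` produces a plain dictionary that could be passed as tool arguments.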

Time Horizon

Multi-turn. Agents can perform multiple web searches before submitting a final answer.

Environment Difficulty

Model	Accuracy
Gemini 3.1 Pro (search, Python, browse)	85.9%
Claude Opus 4.6	84.0%
Kimi K2.5 (agent swarm)	78.4%
MiniMax M2.5	76.3%
GLM-5 (with ctx management)	75.9%

This benchmark requires persistent multi-hop web navigation to locate obscure, entangled information.

Other Environment Requirements

OpenAI API key required for LLM-based grading
Tavily API key required for web search

Pass both keys via secrets={"openai_api_key": "...", "tavily_api_key": "..."}.
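One way to build that `secrets` dictionary without hard-coding keys is to read them from environment variables. This is a sketch only: the variable names `OPENAI_API_KEY` and `TAVILY_API_KEY` and the `load_secrets` helper are assumptions, and how the resulting dictionary is handed to the environment depends on the client you use.

```python
import os
from typing import Mapping, Optional


def load_secrets(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build the secrets dict from environment variables (names are assumed)."""
    env = os.environ if env is None else env
    names = {
        "openai_api_key": "OPENAI_API_KEY",
        "tavily_api_key": "TAVILY_API_KEY",
    }
    # Fail early with a clear message instead of at grading or search time.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

The returned dictionary has exactly the two keys listed above, so it can be supplied as the `secrets` argument.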

Safety

Agents in BrowseComp perform web searches and answer research questions. The environment does not present direct safety risks. Note: Do not share decrypted questions publicly per OpenAI's request.

Citation

@article{wei2025browsecomp,
  title={BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents},
  author={Wei, Jason and Sun, Zhiqing and Papay, Spencer and McKinney, Scott and Han, Jeffrey and Fulford, Isa and Chung, Hyung Won and Passos, Alex Tachard and Fedus, William and Glaese, Amelia},
  journal={arXiv preprint arXiv:2504.12516},
  year={2025},
  url={https://arxiv.org/abs/2504.12516}
}