BrowseComp

BrowseComp is a benchmark for measuring agents' web-browsing ability, comprising 1,266 questions that require persistent navigation to locate hard-to-find, entangled information. It yields short, easily verifiable answers and tests agents' persistence and creativity in finding…

Type: RL Env
Runtime: ORS
License: unknown
Size: 1266 tasks
Published: Jan 2026

OpenReward Environment

Description

BrowseComp is an environment for evaluating web-search reasoning. Based on the BrowseComp benchmark from OpenAI's simple-evals, it contains 1,266 encrypted research questions that require multi-hop reasoning and cannot be answered without current web information. The environment provides built-in web search and URL fetching tools powered by Tavily.

Capabilities

  • Multi-hop web search reasoning
  • Information retrieval and synthesis
  • Research question answering
  • Confidence calibration

Compute Requirements

Agents are given a standard environment with no sandbox or file system access.

License

MIT

Tasks

There is one split in this environment:

  • test: 1,266 encrypted research questions

Questions require multi-hop reasoning across multiple web searches. Example: "What was the name of the 1995 film starring the actress who played Victoria that married the final scorer of the World Cup 1998 winner?"

Reward Structure

This is a sparse reward environment with LLM-based grading:

  1. Agent receives a research question
  2. Agent uses web_search and fetch_url tools to gather information
  3. Agent submits answer with explanation, exact_answer, and confidence
  4. An LLM grader (gpt-5-mini) evaluates semantic equivalence
  5. Binary reward: 1.0 if correct, 0.0 if incorrect
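
The steps above can be sketched in a few lines. This is a minimal illustration, not the environment's grading code: the `Submission` class, the `binary_reward` function, and the `toy_grader` stand-in are hypothetical names, and the real grader is an LLM (gpt-5-mini) judging semantic equivalence rather than a string comparison. The type of `confidence` is an assumption, since the source does not state its scale.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Submission:
    """The three fields the agent submits (names taken from the steps above)."""
    explanation: str
    exact_answer: str
    confidence: float  # assumption: numeric; the source does not specify the scale


def binary_reward(
    submission: Submission,
    reference: str,
    is_equivalent: Callable[[str, str], bool],
) -> float:
    """Sparse reward: 1.0 if the grader accepts the answer, otherwise 0.0."""
    return 1.0 if is_equivalent(submission.exact_answer, reference) else 0.0


def toy_grader(predicted: str, reference: str) -> bool:
    """Hypothetical stand-in for the LLM grader: case-insensitive exact match."""
    return predicted.strip().lower() == reference.strip().lower()


correct = Submission("Found on the official site.", " Paris ", 90.0)
wrong = Submission("Best guess.", "Lyon", 40.0)
print(binary_reward(correct, "paris", toy_grader))
print(binary_reward(wrong, "paris", toy_grader))
```

Note that `confidence` does not enter the reward in this sketch; the source describes the reward as depending only on correctness.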

Data

Data is sourced from OpenAI's BrowseComp benchmark. Data is stored on the OpenReward platform.

Tools

Tool           Description
web_search     Search the web using Tavily (returns titles, URLs, snippets)
fetch_url      Fetch full content from a URL (truncated to 8000 chars)
submit_answer  Submit answer with explanation, exact_answer, and confidence
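
The 8000-character limit on fetch_url means an agent sees only the start of a long page. The sketch below shows that truncation in isolation; `truncate_content` and `MAX_FETCH_CHARS` are hypothetical names, and the environment may truncate differently (for example after stripping markup), which the source does not describe.

```python
MAX_FETCH_CHARS = 8000  # limit stated in the tool table above


def truncate_content(text: str, limit: int = MAX_FETCH_CHARS) -> str:
    """Keep at most `limit` characters from the start of the fetched text."""
    return text[:limit]


long_page = "x" * 10_000
short_page = "short page"
print(len(truncate_content(long_page)))
print(truncate_content(short_page))
```

Under this assumption, facts that appear late in a long page are not visible from a single fetch, so an agent may need a more specific URL or a further search.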

Time Horizon

Multi-turn. Agents can perform multiple web searches before submitting a final answer.
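
A multi-turn episode of this kind might be structured as the loop below. This is a sketch under stated assumptions: `run_episode`, the `policy` callable, and the stub tools are hypothetical and stand in for a real agent and the environment's web_search tool, whose actual call signatures are not given in the source. The turn budget is also an assumption.

```python
from typing import Callable, List, Optional, Tuple

# An action is ("search", query) or ("submit", exact_answer).
Action = Tuple[str, str]


def run_episode(
    question: str,
    policy: Callable[[str, List[str]], Action],
    web_search: Callable[[str], List[str]],
    max_turns: int = 10,  # assumption: the source states no turn limit
) -> Optional[str]:
    """Alternate between searching and deciding until the policy submits."""
    history: List[str] = []
    for _ in range(max_turns):
        kind, value = policy(question, history)
        if kind == "submit":
            return value
        # Any other action is treated as a search; results feed the next turn.
        history.extend(web_search(value))
    return None  # budget exhausted without a submission


def stub_search(query: str) -> List[str]:
    """Hypothetical search tool returning one fake snippet per query."""
    return [f"snippet for: {query}"]


def stub_policy(question: str, history: List[str]) -> Action:
    """Hypothetical policy: search twice, then submit a fixed answer."""
    if len(history) >= 2:
        return ("submit", "example answer")
    return ("search", f"{question} (hop {len(history) + 1})")


print(run_episode("Who directed the film?", stub_policy, stub_search))
```

Because the reward is sparse, nothing in this loop is scored until the final submission.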

Environment Difficulty

Model                                    Accuracy
Gemini 3.1 Pro (search, Python, browse)  85.9%
Claude Opus 4.6                          84.0%
Kimi K2.5 (agent swarm)                  78.4%
MiniMax M2.5                             76.3%
GLM-5 (with ctx management)              75.9%

This benchmark requires persistent multi-hop web navigation to find hard-to-find, entangled information.

Other Environment Requirements

  • OpenAI API key required for LLM-based grading
  • Tavily API key required for web search

Pass via secrets={"openai_api_key": "...", "tavily_api_key": "..."}.
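
One way to build that dictionary without hard-coding keys is to read them from environment variables. The sketch below is illustrative: `load_secrets` is a hypothetical helper, and the variable names `OPENAI_API_KEY` and `TAVILY_API_KEY` are assumptions. Only the dictionary keys `openai_api_key` and `tavily_api_key` come from the text above. How the resulting dictionary is handed to the environment client is not shown here, since the source does not document that call.

```python
import os
from typing import Dict, Mapping

# Maps the secret names the environment expects to assumed env var names.
REQUIRED = {
    "openai_api_key": "OPENAI_API_KEY",
    "tavily_api_key": "TAVILY_API_KEY",
}


def load_secrets(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build the secrets dict, failing early if any key is missing or empty."""
    missing = [var for var in REQUIRED.values() if not env.get(var)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {name: env[var] for name, var in REQUIRED.items()}


# Example with an explicit mapping instead of the real process environment.
secrets = load_secrets({"OPENAI_API_KEY": "sk-example", "TAVILY_API_KEY": "tvly-example"})
print(sorted(secrets))
```

Failing early on a missing key avoids starting an episode that cannot search or be graded.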

Safety

Agents in BrowseComp perform web searches and answer research questions. The environment does not present direct safety risks. Note: Do not share decrypted questions publicly per OpenAI's request.

Citation

@article{wei2025browsecomp,
  title={BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents},
  author={Wei, Jason and Sun, Zhiqing and Papay, Spencer and McKinney, Scott and Han, Jeffrey and Fulford, Isa and Chung, Hyung Won and Passos, Alex Tachard and Fedus, William and Glaese, Amelia},
  journal={arXiv preprint arXiv:2504.12516},
  year={2025},
  url={https://arxiv.org/abs/2504.12516}
}