web-search-env

web-search-env is a Verifiers environment for search-based QA tasks. It gives the model a single search(query) tool and hides the search provider behind a provider-agnostic backend, so the agent sees the same tool regardless of which provider serves the results.

Use it to evaluate the same agent with different search providers, or to run the benchmark suite from Exa's search RL blog post.

Install

prime env install web-search-env

Search Providers

Set search_backend to one of:

fixture: local deterministic search, no API key required
exa: Exa search via EXA_API_KEY
parallel: Parallel search via PARALLEL_API_KEY
serpapi: Google results through SerpApi via SERPAPI_API_KEY
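The backend-to-key mapping above can be checked before launching a run. The following is a minimal sketch, not part of the environment: the helper name missing_key is hypothetical, and only the backend names and environment variable names come from the list above.

```python
import os

# Backend -> required environment variable, as listed above.
# "fixture" is local and deterministic, so it needs no key.
REQUIRED_KEY = {
    "fixture": None,
    "exa": "EXA_API_KEY",
    "parallel": "PARALLEL_API_KEY",
    "serpapi": "SERPAPI_API_KEY",
}


def missing_key(search_backend, environ=None):
    """Return the name of the unset API key for a backend, or None if nothing is missing.

    Hypothetical helper for illustration; the environment may validate keys differently.
    """
    environ = os.environ if environ is None else environ
    if search_backend not in REQUIRED_KEY:
        raise ValueError(f"unknown search_backend: {search_backend!r}")
    key = REQUIRED_KEY[search_backend]
    if key is None or environ.get(key):
        return None
    return key


if __name__ == "__main__":
    for backend in REQUIRED_KEY:
        key = missing_key(backend)
        status = "ready" if key is None else f"set {key} first"
        print(f"{backend}: {status}")
```

Running the script prints one line per backend, which makes it easy to see which providers are usable in the current shell.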

Example:

prime eval run web-search-env \
  -m openai/gpt-4.1-mini \
  -n 20 -r 1 -t 1024 \
  -a '{"eval_dataset":"hotpotqa","search_backend":"exa","judge_backend":"llm"}'

Use the same command with only search_backend changed to compare providers. cache_path can be set to reuse live search results across repeated runs.
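A provider comparison can be scripted by generating one command per backend. This is a rough sketch that only assembles the command shown above with a different search_backend value; the function name eval_command is hypothetical, and the script prints the commands rather than running them.

```python
import json
import shlex


def eval_command(search_backend, eval_dataset="hotpotqa", model="openai/gpt-4.1-mini"):
    """Build the example eval command as an argv list, varying only search_backend."""
    env_args = {
        "eval_dataset": eval_dataset,
        "search_backend": search_backend,
        "judge_backend": "llm",
    }
    return [
        "prime", "eval", "run", "web-search-env",
        "-m", model,
        "-n", "20", "-r", "1", "-t", "1024",
        "-a", json.dumps(env_args, separators=(",", ":")),
    ]


if __name__ == "__main__":
    # Print one shell-quoted command per live provider for copy-paste or scripting.
    for backend in ("exa", "parallel", "serpapi"):
        print(shlex.join(eval_command(backend)))
```

Keeping every other flag fixed means differences between runs are more likely to come from the search provider, although live results and sampling still add variance.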

Datasets

Supported datasets:

sample
simpleqa
2wikimultihopqa or 2wiki
hotpotqa
frames
musique
browsecomp
hle or hle-text
widesearch

BrowseComp is packaged with the environment, so it does not download the CSV at runtime.

HLE uses the official cais/hle dataset. That dataset is gated on Hugging Face: request and accept access on the dataset page, then set HF_TOKEN or HUGGING_FACE_HUB_TOKEN before running it. The loader uses the text-only subset because this environment does not pass images to the model.

WideSearch uses the official ByteDance-Seed/WideSearch dataset and downloads the matching gold CSV tables from the same Hugging Face repo. Use an LLM judge for this dataset; exact or normalized string matching is not meaningful for large table outputs.

HLE and WideSearch are opt-in only. They are not part of the default sample run, the Exa blog suite, or the Exa training setup.

Run a single dataset:

prime eval run web-search-env \
  -m openai/gpt-4.1-mini \
  -n 50 -r 1 -t 1024 \
  -a '{
    "eval_dataset":"musique",
    "search_backend":"parallel",
    "parallel_mode":"advanced",
    "judge_backend":"llm"
  }'

Run multiple datasets:

prime eval run web-search-env \
  -m openai/gpt-4.1-mini \
  -n 100 -r 1 -t 1024 \
  -a '{"eval_dataset":"musique,hotpotqa","search_backend":"exa","judge_backend":"llm"}'

Run HLE:

prime eval run web-search-env \
  -m openai/gpt-4.1-mini \
  -n 20 -r 1 -t 4096 \
  -a '{"eval_dataset":"hle","search_backend":"exa","judge_backend":"llm"}'

Run WideSearch:

prime eval run web-search-env \
  -m openai/gpt-4.1-mini \
  -n 5 -r 1 -t 8192 \
  -a '{"eval_dataset":"widesearch","search_backend":"exa","judge_backend":"llm","max_turns":10}'

Exa Blog Suite

The Exa blog suite is:

simpleqa
2wikimultihopqa
hotpotqa
frames
musique
browsecomp

Use benchmark_suite="exa" to run that set:

prime eval run web-search-env \
  -m openai/gpt-4.1-mini \
  -n 30 -r 1 -t 1024 -T 0.2 \
  -a '{
    "benchmark_suite":"exa",
    "benchmark_examples_per_dataset":5,
    "search_backend":"exa",
    "judge_backend":"llm",
    "cache_path":"./search_traces/exa_suite.json"
  }'

benchmark_examples_per_dataset=5 means 5 examples from each of the 6 datasets, so -n 30 runs 30 total rollouts when -r 1.
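The arithmetic can be written out as a short sketch. It assumes that -r is the number of rollouts per example, which is consistent with the statement above but not spelled out there; the function names are hypothetical.

```python
# The six datasets in the Exa blog suite, as listed above.
EXA_SUITE = [
    "simpleqa",
    "2wikimultihopqa",
    "hotpotqa",
    "frames",
    "musique",
    "browsecomp",
]


def suite_examples(examples_per_dataset, datasets=EXA_SUITE):
    """Number of examples to request with -n for a suite run."""
    return examples_per_dataset * len(datasets)


def total_rollouts(num_examples, rollouts_per_example=1):
    """Total rollouts, assuming -r is rollouts per example."""
    return num_examples * rollouts_per_example


if __name__ == "__main__":
    n = suite_examples(5)
    print(n, total_rollouts(n, 1))
```

With benchmark_examples_per_dataset=5 this gives n = 5 × 6 = 30, matching the -n 30 in the command above.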

benchmark_suite="exa" selects the same public benchmark set described in the Exa post. Reported numbers will not reproduce exactly, because they depend on the model, sampling settings, judge, seeds, and live search results.

For an n=8 run, use:

prime eval run configs/eval/web-search-exa-suite-n8.toml

That config runs the six-benchmark suite as one eval. Packaged copies of the n=8 configs are also included under environments/web_search_env/configs/eval/.

To print per-benchmark scores from the saved results.jsonl, run:

python environments/web_search_env/scripts/summarize_by_benchmark.py \
  environments/web_search_env/outputs/evals/<run-dir>/<run-id>/results.jsonl

Use configs/eval/web-search-exa-suite-n8-by-benchmark.toml instead if you want Prime's own final summary to have one eval section per benchmark.
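For readers who want to see what a per-benchmark summary involves, here is an illustrative sketch of grouping JSONL rows and averaging a score. It is not the packaged summarize_by_benchmark.py script, and the field names "benchmark" and "reward" are assumptions; check the actual keys in your results.jsonl before relying on it.

```python
import json
from collections import defaultdict


def mean_score_by_benchmark(lines, group_key="benchmark", score_key="reward"):
    """Average score_key per group_key over JSONL lines.

    group_key and score_key are assumed field names, not confirmed by the README.
    """
    totals = defaultdict(lambda: [0.0, 0])  # name -> [sum of scores, row count]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        entry = totals[row[group_key]]
        entry[0] += float(row[score_key])
        entry[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}


if __name__ == "__main__":
    sample = [
        '{"benchmark": "musique", "reward": 1.0}',
        '{"benchmark": "musique", "reward": 0.0}',
        '{"benchmark": "hotpotqa", "reward": 1.0}',
    ]
    for name, score in sorted(mean_score_by_benchmark(sample).items()):
        print(f"{name}: {score:.2f}")
```

To use it on a real file, pass an open file handle as lines and adjust the two key arguments to match the saved schema.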

Exa Training Setup

Use exa_training_setup=true for the training and reward design described in the Exa post. It selects MuSiQue and HotpotQA for training, uses the SimpleQA-style LLM grader as the terminal binary reward, applies a -0.25 context-limit penalty, and sets the eval side to the six after-train benchmarks:

musique
hotpotqa
2wikimultihopqa
frames
browsecomp
simpleqa
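The reward shape described above can be sketched as a small function. This is an illustration under stated assumptions, not the environment's implementation: it assumes the grader verdict maps to 1.0 or 0.0 and that the -0.25 context-limit penalty is added on top of that value. The README does not say whether the penalty is additive or replaces the grader reward.

```python
CONTEXT_LIMIT_PENALTY = -0.25  # value stated above


def terminal_reward(graded_correct, hit_context_limit):
    """Binary grader reward plus a context-limit penalty.

    Assumption: the penalty is additive. A rollout that is cut off by the
    context limit and graded incorrect would therefore score -0.25.
    """
    reward = 1.0 if graded_correct else 0.0
    if hit_context_limit:
        reward += CONTEXT_LIMIT_PENALTY
    return reward


if __name__ == "__main__":
    for correct in (True, False):
        for limited in (False, True):
            print(correct, limited, terminal_reward(correct, limited))
```

Under this assumption the four possible outcomes are 1.0, 0.75, 0.0, and -0.25.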

Hosted training config:

prime train configs/rl/web-search-exa-training.toml \
  --env-var EXA_API_KEY \
  --env-var OPENAI_API_KEY

The packaged copy is also under environments/web_search_env/configs/rl/web-search-exa-training.toml.

Common Args

eval_dataset: dataset name or comma-separated list
train_datasets: training dataset name or comma-separated list
training_suite: use "exa" for MuSiQue + HotpotQA training data
exa_training_setup: use the Exa training/reward/eval defaults
benchmark_suite: use "exa", "hle", or "widesearch"
benchmark_examples_per_dataset: per-dataset cap for suite runs
train_examples_per_dataset: per-dataset cap for training suite data
search_backend: fixture, exa, parallel, or serpapi
result_count: search results per tool call, default 5
snippet_chars: max snippet characters per result, default 2000
search_max_qps: optional per-process request cap for live search providers
search_max_retries: retry count for retryable search failures
search_initial_jitter_seconds: random delay before each rollout's first live search
judge_backend: simpleqa, llm, normalized, exact, or substring
judge_model: judge model for judge_backend="simpleqa" or judge_backend="llm"
max_turns: max assistant/tool turns, default 10
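The allowed values and defaults listed above can be collected into a small helper that builds the JSON string passed to -a. This is a hedged sketch: build_env_args is a hypothetical name, only the documented defaults are filled in, and the environment itself may validate arguments differently.

```python
import json

# Allowed values and defaults taken from the argument list above.
SEARCH_BACKENDS = {"fixture", "exa", "parallel", "serpapi"}
JUDGE_BACKENDS = {"simpleqa", "llm", "normalized", "exact", "substring"}
DOCUMENTED_DEFAULTS = {"result_count": 5, "snippet_chars": 2000, "max_turns": 10}


def build_env_args(**overrides):
    """Return a JSON string for -a, rejecting backend names not listed above."""
    args = {**DOCUMENTED_DEFAULTS, **overrides}
    backend = args.get("search_backend")
    if backend is not None and backend not in SEARCH_BACKENDS:
        raise ValueError(f"unknown search_backend: {backend!r}")
    judge = args.get("judge_backend")
    if judge is not None and judge not in JUDGE_BACKENDS:
        raise ValueError(f"unknown judge_backend: {judge!r}")
    return json.dumps(args, sort_keys=True)


if __name__ == "__main__":
    print(build_env_args(eval_dataset="musique", search_backend="exa", judge_backend="llm"))
```

Spelling out the defaults is optional, but it records exactly which settings a run used when results are compared later.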