web-search-env
web-search-env is a Verifiers environment for search-based QA tasks. It gives
the model a single search(query) tool and keeps the search provider hidden
behind a provider-agnostic backend.
Use it to evaluate the same agent with different search providers, or to run the benchmark suite from Exa's search RL blog post.
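The idea of a provider-agnostic backend can be sketched as follows. This is only an illustration, not the environment's actual code: the names SearchResult, SearchBackend, FixtureBackend, and make_search_tool are hypothetical, and the keyword-overlap scoring is an assumption chosen to keep the example deterministic.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class SearchResult:
    title: str
    url: str
    snippet: str


class SearchBackend(Protocol):
    """Hypothetical interface: every provider exposes the same method."""

    def search(self, query: str, result_count: int = 5) -> list[SearchResult]: ...


class FixtureBackend:
    """Deterministic in-memory backend, standing in for a local fixture provider."""

    def __init__(self, documents: list[SearchResult]) -> None:
        self._documents = documents

    def search(self, query: str, result_count: int = 5) -> list[SearchResult]:
        terms = query.lower().split()
        # Score each document by how many query terms occur in its title or snippet.
        scored = [
            (sum(t in (d.title + " " + d.snippet).lower() for t in terms), i, d)
            for i, d in enumerate(self._documents)
        ]
        # Highest score first; ties keep the original order so output is stable.
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [d for score, _, d in scored if score > 0][:result_count]


def make_search_tool(backend: SearchBackend):
    """Wrap any backend as the single search(query) tool the model sees."""

    def search(query: str) -> str:
        results = backend.search(query)
        return "\n".join(f"{r.title} ({r.url}): {r.snippet}" for r in results)

    return search
```

Because the tool only depends on the search method, swapping providers would not change what the model sees.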
Install
prime env install web-search-env
Search Providers
Set search_backend to one of:
fixture: local deterministic search, no API key required
exa: Exa search via EXA_API_KEY
parallel: Parallel search via PARALLEL_API_KEY
serpapi: Google results through SerpApi via SERPAPI_API_KEY
Example:
prime eval run web-search-env \
-m openai/gpt-4.1-mini \
-n 20 -r 1 -t 1024 \
-a '{"eval_dataset":"hotpotqa","search_backend":"exa","judge_backend":"llm"}'
Use the same command with only search_backend changed to compare providers.
cache_path can be set to reuse live search results across repeated runs.
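To compare providers, it can help to generate the commands rather than edit them by hand. The sketch below only prints the command lines. The flags and JSON keys are copied from the example above; the helper name build_command and the BASE_ARGS dictionary are hypothetical.

```python
import json
import shlex

# Arguments shared by every run; only search_backend varies between runs.
BASE_ARGS = {"eval_dataset": "hotpotqa", "judge_backend": "llm"}
BACKENDS = ["fixture", "exa", "parallel", "serpapi"]


def build_command(search_backend: str) -> list[str]:
    """Return the argv for one eval run against the given search backend."""
    env_args = {**BASE_ARGS, "search_backend": search_backend}
    return [
        "prime", "eval", "run", "web-search-env",
        "-m", "openai/gpt-4.1-mini",
        "-n", "20", "-r", "1", "-t", "1024",
        "-a", json.dumps(env_args, separators=(",", ":")),
    ]


for backend in BACKENDS:
    # shlex.join quotes the JSON argument so the line is safe to paste into a shell.
    print(shlex.join(build_command(backend)))
```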
Datasets
Supported datasets:
sample
simpleqa
2wikimultihopqa or 2wiki
hotpotqa
frames
musique
browsecomp
hle or hle-text
widesearch
BrowseComp is packaged with the environment, so it does not download the CSV at runtime.
HLE uses the official cais/hle dataset. That dataset is gated on Hugging Face:
accept access there and set HF_TOKEN or HUGGING_FACE_HUB_TOKEN before
running it. The loader uses the text-only subset because this environment does
not pass images to the model.
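A quick preflight check can catch a missing token before an HLE run starts. This is a minimal sketch: the function name resolve_hf_token is hypothetical, and checking HF_TOKEN before HUGGING_FACE_HUB_TOKEN is an assumption about precedence, not documented behavior of the loader.

```python
import os


def resolve_hf_token(env=None):
    """Return the first non-empty Hugging Face token found, or None."""
    env = os.environ if env is None else env
    # Assumed precedence: HF_TOKEN first, then HUGGING_FACE_HUB_TOKEN.
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


if resolve_hf_token() is None:
    print("Set HF_TOKEN or HUGGING_FACE_HUB_TOKEN before running HLE.")
```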
WideSearch uses the official ByteDance-Seed/WideSearch dataset and downloads
the matching gold CSV tables from the same Hugging Face repo. Use an LLM judge
for this dataset; exact or normalized string matching is not meaningful for
large table outputs.
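A small illustration of why string matching breaks down on tables: the two made-up tables below hold the same rows but differ in row order, capitalization, and spacing. The data and helper names are invented for this example and are unrelated to the WideSearch dataset itself.

```python
import csv
import io

gold = "name,year\nAda Lovelace,1815\nAlan Turing,1912\n"
pred = "Name, Year\nAlan Turing, 1912\nAda Lovelace, 1815\n"


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace, as a simple normalized matcher might."""
    return " ".join(text.lower().split())


def rows_as_set(table: str) -> set[tuple[str, ...]]:
    """Parse CSV text into an order-independent set of lowercase rows."""
    reader = csv.reader(io.StringIO(table), skipinitialspace=True)
    return {tuple(cell.strip().lower() for cell in row) for row in reader if row}


exact_match = gold == pred
normalized_match = normalize(gold) == normalize(pred)
same_rows = rows_as_set(gold) == rows_as_set(pred)

print(exact_match, normalized_match, same_rows)
```

Both string comparisons fail even though the rows are identical. Real outputs also vary in wording and column naming, which is where an LLM judge is more forgiving than any fixed parser.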
HLE and WideSearch are opt-in only. They are not part of the default sample run, the Exa blog suite, or the Exa training setup.
Run a single dataset:
prime eval run web-search-env \
-m openai/gpt-4.1-mini \
-n 50 -r 1 -t 1024 \
-a '{
"eval_dataset":"musique",
"search_backend":"parallel",
"parallel_mode":"advanced",
"judge_backend":"llm"
}'
Run multiple datasets:
prime eval run web-search-env \
-m openai/gpt-4.1-mini \
-n 100 -r 1 -t 1024 \
-a '{"eval_dataset":"musique,hotpotqa","search_backend":"exa","judge_backend":"llm"}'
Run HLE:
prime eval run web-search-env \
-m openai/gpt-4.1-mini \
-n 20 -r 1 -t 4096 \
-a '{"eval_dataset":"hle","search_backend":"exa","judge_backend":"llm"}'
Run WideSearch:
prime eval run web-search-env \
-m openai/gpt-4.1-mini \
-n 5 -r 1 -t 8192 \
-a '{"eval_dataset":"widesearch","search_backend":"exa","judge_backend":"llm","max_turns":10}'
Exa Blog Suite
The Exa blog suite is:
simpleqa
2wikimultihopqa
hotpotqa
frames
musique
browsecomp
Use benchmark_suite="exa" to run that set:
prime eval run web-search-env \
-m openai/gpt-4.1-mini \
-n 30 -r 1 -t 1024 -T 0.2 \
-a '{
"benchmark_suite":"exa",
"benchmark_examples_per_dataset":5,
"search_backend":"exa",
"judge_backend":"llm",
"cache_path":"./search_traces/exa_suite.json"
}'
benchmark_examples_per_dataset=5 takes 5 examples from each of the 6 datasets,
so -n 30 with -r 1 gives 30 rollouts in total.
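The arithmetic can be checked with a few lines. This sketch assumes -n counts examples and -r counts rollouts per example, which matches the numbers above; the function name total_rollouts is hypothetical.

```python
EXA_SUITE = ["simpleqa", "2wikimultihopqa", "hotpotqa", "frames", "musique", "browsecomp"]


def total_rollouts(examples_per_dataset, rollouts_per_example=1, datasets=EXA_SUITE):
    """Return (value for -n, total rollouts) for a suite run."""
    num_examples = examples_per_dataset * len(datasets)
    return num_examples, num_examples * rollouts_per_example


print(total_rollouts(5))  # 5 examples x 6 datasets, one rollout each
```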
This selects the same public benchmark set described in the Exa post. Exact reported numbers still depend on the model, sampling settings, judge, seeds, and live search results.
For an n=8 run, use:
prime eval run configs/eval/web-search-exa-suite-n8.toml
Packaged copies of the n=8 configs are also included under
environments/web_search_env/configs/eval/.
That config runs the six-benchmark suite as one eval. To print per-benchmark
scores from the saved results.jsonl, run:
python environments/web_search_env/scripts/summarize_by_benchmark.py \
environments/web_search_env/outputs/evals/<run-dir>/<run-id>/results.jsonl
Use configs/eval/web-search-exa-suite-n8-by-benchmark.toml instead if you want
Prime's own final summary to have one eval section per benchmark.
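For readers who want to see what a per-benchmark summary involves, here is a rough sketch of grouping JSONL records and averaging a score. It is not the packaged script: the field names "benchmark" and "reward" are assumptions about the results.jsonl schema and may need adjusting to the real file.

```python
import json
from collections import defaultdict
from statistics import mean


def summarize(lines, group_key="benchmark", score_key="reward"):
    """Group JSONL records by group_key and return {name: (count, mean score)}."""
    scores = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # group_key and score_key are assumed field names, not a documented schema.
        scores[record.get(group_key, "unknown")].append(float(record[score_key]))
    return {name: (len(vals), mean(vals)) for name, vals in sorted(scores.items())}


sample = [
    '{"benchmark": "musique", "reward": 1}',
    '{"benchmark": "musique", "reward": 0}',
    '{"benchmark": "frames", "reward": 1}',
]
for name, (count, avg) in summarize(sample).items():
    print(f"{name}: n={count} mean={avg:.2f}")
```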
Exa Training Setup
Use exa_training_setup=true for the training and reward design described in
the Exa post. It selects MuSiQue and HotpotQA for training, uses the
SimpleQA-style LLM grader as the terminal binary reward, applies a -0.25
context-limit penalty, and sets the eval side to the six after-train benchmarks:
musique
hotpotqa
2wikimultihopqa
frames
browsecomp
simpleqa
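The reward design described above can be sketched as a small function. This assumes the -0.25 penalty is added to the binary judge reward when a rollout hits the context limit; the environment may combine the two differently, and the names here are hypothetical.

```python
CONTEXT_LIMIT_PENALTY = -0.25


def terminal_reward(judged_correct: bool, hit_context_limit: bool) -> float:
    """Binary judge reward with an additive context-limit penalty (assumed)."""
    reward = 1.0 if judged_correct else 0.0
    if hit_context_limit:
        # Assumption: the penalty is added on top of the judge reward.
        reward += CONTEXT_LIMIT_PENALTY
    return reward


print(terminal_reward(True, False), terminal_reward(False, True))
```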
Hosted training config:
prime train configs/rl/web-search-exa-training.toml \
--env-var EXA_API_KEY \
--env-var OPENAI_API_KEY
The packaged copy is also under
environments/web_search_env/configs/rl/web-search-exa-training.toml.
Common Args
eval_dataset: dataset name or comma-separated list
train_datasets: training dataset name or comma-separated list
training_suite: use "exa" for MuSiQue + HotpotQA training data
exa_training_setup: use the Exa training/reward/eval defaults
benchmark_suite: use "exa", "hle", or "widesearch"
benchmark_examples_per_dataset: per-dataset cap for suite runs
train_examples_per_dataset: per-dataset cap for training suite data
search_backend: fixture, exa, parallel, or serpapi
result_count: search results per tool call, default 5
snippet_chars: max snippet characters per result, default 2000
search_max_qps: optional per-process request cap for live search providers
search_max_retries: retry count for retryable search failures
search_initial_jitter_seconds: random delay before each rollout's first live search
judge_backend: simpleqa, llm, normalized, exact, or substring
judge_model: judge model for judge_backend="simpleqa" or judge_backend="llm"
max_turns: max assistant/tool turns, default 10
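The three string-based judge_backend values differ in how strictly they compare answers. The sketch below shows one plausible reading of exact, normalized, and substring matching. The normalization rules (lowercasing, stripping punctuation and English articles) are an assumption borrowed from common QA evaluation practice, not taken from this environment's code.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def string_judge(prediction: str, gold: str, mode: str) -> bool:
    """Hypothetical matcher for the exact, normalized, and substring modes."""
    if mode == "exact":
        return prediction.strip() == gold.strip()
    if mode == "normalized":
        return normalize_answer(prediction) == normalize_answer(gold)
    if mode == "substring":
        # True when the normalized gold answer appears inside the prediction.
        return normalize_answer(gold) in normalize_answer(prediction)
    raise ValueError(f"unsupported mode: {mode}")


for mode in ("exact", "normalized", "substring"):
    print(mode, string_judge("It is the Eiffel Tower.", "Eiffel Tower", mode))
```

Under these assumptions only the substring mode accepts a full-sentence answer, which is one reason the examples in this document use an LLM judge.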