prime-grep
Cross-repo code-search environment for the prime-swe-grep task set. Each task is a user-style question; the agent uses grep / list_files / read_file to search four pinned repos in a sandbox, then calls submit_spans with the spans that justify the answer.
Repos (pinned)
| Repo | Commit |
|---|---|
| prime-rl | 65919439195fb384eda593b3f93d62cba5b60cf3 |
| verifiers | 58b119fa1b24eff85b74a75ccf3e132523b3c6c3 |
| vllm | a171e6b52dff47dc567657e7d51f641bdcb22774 |
| pytorch | c200b7e590a77d52373861844be4287c8ef9507a |
Bumping any of these requires re-verifying every span in tasks/ — spans store absolute line numbers, so cosmetic upstream edits will silently shift the ground truth.
Lifecycle
- setup (PrimeGrepEnv.setup_state): creates one persistent prime sandbox for the rollout from the prebuilt image.
- rollout: vf.StatefulToolEnv alternates between tool calls (grep / list_files / read_file) and assistant messages until submit_spans records the final answer.
- stop (PrimeGrepEnv.answer_submitted): fires once state["submitted"] is set.
- cleanup (PrimeGrepEnv.destroy_sandbox): deletes the sandbox through src/sandbox_manager.
- reward (span_score): compares state["submitted_spans"] against the task's gold spans.
Reward modes
Set per-task via reward_mode in the YAML, or globally via the PRIME_GREP_REWARD_MODE env var.
- file_recall (default): 1.0 per essential gold span whose (repo, path) appears in the submission. Lenient: ignores exact line ranges.
- span_iou: line-range IoU averaged across essential gold spans. Strict: sloppy ranges hurt the score.
Supporting spans (role: supporting in the YAML) never affect the score in either mode. They exist so depth answers (e.g. citing a torch primitive at the bottom of a call chain) are accepted but not required.
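The two modes can be sketched in a few lines of Python. This is an illustrative sketch, not the package's span_score implementation: the Span dataclass, the line_iou helper, inclusive 1-indexed line ranges, best-match IoU per gold span, and averaging the file_recall hits over essential spans are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical span record; field names are assumptions."""
    repo: str
    path: str
    start: int  # 1-indexed, inclusive (assumed)
    end: int    # inclusive (assumed)
    role: str = "essential"  # or "supporting"


def line_iou(a: Span, b: Span) -> float:
    """IoU of two inclusive line ranges; 0.0 if they are in different files."""
    if (a.repo, a.path) != (b.repo, b.path):
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start) + 1
    if overlap <= 0:
        return 0.0
    union = (a.end - a.start + 1) + (b.end - b.start + 1) - overlap
    return overlap / union


def span_score(gold, submitted, mode="file_recall"):
    # Supporting spans are dropped up front, so they never affect the score.
    essential = [g for g in gold if g.role == "essential"]
    if not essential:
        return 0.0
    if mode == "file_recall":
        files = {(s.repo, s.path) for s in submitted}
        scores = [1.0 if (g.repo, g.path) in files else 0.0 for g in essential]
    elif mode == "span_iou":
        # Assumption: each gold span is scored against its best-matching submission.
        scores = [
            max((line_iou(g, s) for s in submitted), default=0.0)
            for g in essential
        ]
    else:
        raise ValueError(f"unknown reward mode: {mode!r}")
    return sum(scores) / len(essential)
```

Under these assumptions, a submission that names the right file but is off by five lines on a ten-line gold span scores 1.0 in file_recall and 1/3 in span_iou.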
Tasks
tasks/*.yaml is bundled with this package; the repo root has a symlink for authoring convenience. See ../../AUTHORING.md for the recipe.
Quickstart
prime eval run prime-grep -n 8 -r 1
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| num_examples | int | -1 | Limit on number of tasks (-1 = all) |
| max_turns | int | 30 | Max rollout turns before forced stop |
| sandbox_config | dict | prebuilt prime-grep image config | Override sandbox request settings |
For compatibility with previous v1-style configs, load_environment also
accepts taskset.num_examples and harness.max_turns, but it always returns a
plain PrimeGrepEnv(vf.StatefulToolEnv).
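One way such a compatibility shim could fold v1-style settings into the flat arguments is sketched below. The resolve_env_args helper is hypothetical, and both the nested-dict shape of taskset / harness and the rule that a nested value overrides the flat one are assumptions; the actual precedence in load_environment may differ.

```python
from typing import Optional


def resolve_env_args(
    num_examples: int = -1,
    max_turns: int = 30,
    taskset: Optional[dict] = None,
    harness: Optional[dict] = None,
) -> dict:
    """Hypothetical helper: merge v1-style nested config into flat args."""
    taskset = taskset or {}
    harness = harness or {}
    return {
        # Assumption: a nested v1-style value wins when it is present.
        "num_examples": taskset.get("num_examples", num_examples),
        "max_turns": harness.get("max_turns", max_turns),
    }
```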
Metrics
| Metric | Meaning |
|---|---|
| span_score | Mean per-span score over essential gold spans: file recall or line-range IoU, depending on the reward mode |
| num_predicted | Number of spans the model submitted |
| submitted | 1.0 if the model called submit_spans, else 0.0 |
| avg_parallel_tool_calls_per_turn | Mean tool-call batch size across assistant turns that call tools; single-call turns are 1.0 |
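As a rough illustration of the last metric, the sketch below computes the mean batch size from a list of chat messages. It assumes OpenAI-style message dicts in which assistant turns carry a tool_calls list, and it returns 0.0 when no turn calls a tool; both are assumptions for the example, not documented behavior of this environment.

```python
def avg_parallel_tool_calls_per_turn(messages):
    """Mean number of tool calls per assistant turn that calls at least one tool."""
    batch_sizes = [
        len(m["tool_calls"])
        for m in messages
        if m.get("role") == "assistant" and m.get("tool_calls")
    ]
    if not batch_sizes:
        return 0.0  # assumed fallback when no turn used a tool
    return sum(batch_sizes) / len(batch_sizes)
```

Assistant turns without tool calls are excluded from the denominator, so a rollout that always issues one call per turn scores exactly 1.0.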
Sandbox management
Sandbox command execution and cleanup are handled by src/sandbox_manager.
It centralizes app-level retries, lifecycle counters, recoverable tool-facing
errors, fatal infra errors, and best-effort sandbox deletion. In particular, a
transient DELETE /sandbox/{id} 500 during rollout cleanup is retried and then
recorded as an orphaned sandbox instead of failing the worker after an otherwise
successful rollout.
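The retry-then-orphan behavior described above might look roughly like the following. This is a sketch under stated assumptions rather than the code in src/sandbox_manager: TransientSandboxError, delete_sandbox_best_effort, the delete_fn callback, the attempt count, and the exponential backoff are all hypothetical names and choices.

```python
import time
from typing import Callable, List


class TransientSandboxError(Exception):
    """Hypothetical error type standing in for a retryable failure such as a 500."""


def delete_sandbox_best_effort(
    sandbox_id: str,
    delete_fn: Callable[[str], None],
    orphaned: List[str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to delete a sandbox; on repeated transient failure, record it as orphaned."""
    for attempt in range(1, max_attempts + 1):
        try:
            delete_fn(sandbox_id)
            return True
        except TransientSandboxError:
            if attempt < max_attempts:
                # Exponential backoff between attempts (assumed policy).
                sleep(base_delay * 2 ** (attempt - 1))
    # Never raise from cleanup: note the leak and let the worker continue.
    orphaned.append(sandbox_id)
    return False
```

Returning a boolean instead of raising keeps a cleanup failure from overriding the result of a rollout that otherwise succeeded, while the orphaned list leaves a trail for later garbage collection.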