prime-grep

Cross-repo code-search environment for the prime-swe-grep task set. Each task is a user-style question; the agent uses grep / list_files / read_file to search four pinned repos in a sandbox, then calls submit_spans with the spans that justify the answer.

Repos (pinned)

Repo	Commit
`prime-rl`	`65919439195fb384eda593b3f93d62cba5b60cf3`
`verifiers`	`58b119fa1b24eff85b74a75ccf3e132523b3c6c3`
`vllm`	`a171e6b52dff47dc567657e7d51f641bdcb22774`
`pytorch`	`c200b7e590a77d52373861844be4287c8ef9507a`

Bumping any of these requires re-verifying every span in tasks/ — spans store absolute line numbers, so cosmetic upstream edits will silently shift the ground truth.

Lifecycle

setup (PrimeGrepEnv.setup_state) — creates one persistent prime sandbox for the rollout from the prebuilt image.
rollout — vf.StatefulToolEnv alternates between tool calls (grep / list_files / read_file) and assistant messages until submit_spans records the final answer.
stop (PrimeGrepEnv.answer_submitted) — fires once state["submitted"] is set.
cleanup (PrimeGrepEnv.destroy_sandbox) — deletes the sandbox through src/sandbox_manager.
reward (span_score) — compares state["submitted_spans"] against the task's gold spans.
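
The lifecycle above can be pictured as a small loop. The following is a toy sketch only: it does not use the verifiers API, and `run_rollout`, `scripted_policy`, and the string sandbox handle are hypothetical stand-ins. It shows how the stop condition keys off state["submitted"], how max_turns forces a stop, and why cleanup belongs in a `finally` block. Reward would then be computed from state["submitted_spans"] after the loop returns.

```python
def run_rollout(policy, max_turns=30):
    """Toy rollout loop; `policy` maps state -> (tool_name, args)."""
    # setup: one sandbox per rollout (a string stands in for the real handle)
    state = {"sandbox": "sandbox-0", "turns": 0,
             "submitted": False, "submitted_spans": []}
    try:
        # stop: leave the loop once state["submitted"] is set or turns run out
        while not state["submitted"] and state["turns"] < max_turns:
            tool, args = policy(state)
            state["turns"] += 1
            if tool == "submit_spans":
                state["submitted_spans"] = list(args["spans"])
                state["submitted"] = True
            # grep / list_files / read_file would execute in the sandbox here
    finally:
        # cleanup: always release the sandbox, even if the rollout raised
        state["sandbox"] = None
    return state


def scripted_policy(state):
    """Hypothetical policy: search twice, then submit one span."""
    if state["turns"] < 2:
        return "grep", {"pattern": "def setup_state"}
    return "submit_spans", {"spans": [("verifiers", "example.py", 1, 5)]}
```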

Reward modes

Set per-task via reward_mode in the YAML, or globally via the PRIME_GREP_REWARD_MODE env var.

file_recall (default) — each essential gold span scores 1.0 if its (repo, path) appears in the submission and 0.0 otherwise; the result is the mean over essential gold spans. Lenient: ignores exact line ranges.
span_iou — line-range IoU averaged across essential gold spans. Strict: sloppy ranges hurt the score.

Supporting spans (role: supporting in the YAML) never affect the score in either mode. They exist so depth answers (e.g. citing a torch primitive at the bottom of a call chain) are accepted but not required.
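
A minimal sketch of the two modes, for intuition rather than as the actual span_score implementation. The `Span` dataclass, inclusive line ranges, the best-match rule for IoU (each gold span is paired with its highest-IoU submitted span), and the 0.0 score when a task has no essential spans are all assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    repo: str
    path: str
    start: int               # first line, inclusive (assumed)
    end: int                 # last line, inclusive (assumed)
    role: str = "essential"  # or "supporting"


def line_iou(a, b):
    """Line-range IoU of two spans; 0.0 if they are in different files."""
    if (a.repo, a.path) != (b.repo, b.path):
        return 0.0
    inter = min(a.end, b.end) - max(a.start, b.start) + 1
    if inter <= 0:
        return 0.0
    union = (a.end - a.start + 1) + (b.end - b.start + 1) - inter
    return inter / union


def span_score(submitted, gold, mode="file_recall"):
    """Mean per-span score over essential gold spans."""
    # Supporting spans are dropped up front, so they never affect the score.
    essential = [g for g in gold if g.role == "essential"]
    if not essential:
        return 0.0
    if mode == "file_recall":
        files = {(s.repo, s.path) for s in submitted}
        scores = [1.0 if (g.repo, g.path) in files else 0.0 for g in essential]
    elif mode == "span_iou":
        # Assumed pairing: each gold span takes its best-overlapping submission.
        scores = [max((line_iou(g, s) for s in submitted), default=0.0)
                  for g in essential]
    else:
        raise ValueError(f"unknown reward mode: {mode}")
    return sum(scores) / len(essential)
```

With one essential gold span covering lines 10-19 and a submission covering lines 15-24 of the same file, file_recall gives 1.0 while span_iou gives 5 / 15, which illustrates how sloppy ranges are penalized only in the strict mode.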

Tasks

The tasks/*.yaml files are bundled with this package; the repo root has a symlink for authoring convenience. See ../../AUTHORING.md for the recipe.

Quickstart

prime eval run prime-grep -n 8 -r 1

Environment Arguments

Arg	Type	Default	Description
`num_examples`	int	`-1`	Limit on number of tasks (-1 = all)
`max_turns`	int	`30`	Max rollout turns before forced stop
`sandbox_config`	dict	prebuilt prime-grep image config	Override sandbox request settings

For compatibility with previous v1-style configs, load_environment also accepts taskset.num_examples and harness.max_turns, but it always returns a plain PrimeGrepEnv(vf.StatefulToolEnv).
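
One way such a compatibility shim could look is sketched below. This is not the actual load_environment code: `resolve_env_args` is a hypothetical helper, the v1-style values are assumed to arrive as nested dicts named `taskset` and `harness`, and the rule that a nested value overrides the flat argument is an assumption of this sketch.

```python
def resolve_env_args(num_examples=-1, max_turns=30, taskset=None, harness=None):
    """Merge flat arguments with optional v1-style nested config dicts."""
    taskset = taskset or {}
    harness = harness or {}
    return {
        # Assumed precedence: a nested v1-style value wins when present.
        "num_examples": taskset.get("num_examples", num_examples),
        "max_turns": harness.get("max_turns", max_turns),
    }
```

For example, `resolve_env_args(taskset={"num_examples": 8})` yields `num_examples=8` and keeps the default `max_turns=30`.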

Metrics

Metric	Meaning
`span_score`	Mean score over essential gold spans: file recall or line-range IoU, depending on the reward mode
`num_predicted`	Number of spans the model submitted
`submitted`	1.0 if the model called `submit_spans`, else 0.0
`avg_parallel_tool_calls_per_turn`	Mean tool-call batch size across assistant turns that call tools; single-call turns are 1.0
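
As a sketch of how the last metric can be computed, assume the rollout transcript is a list of OpenAI-style chat message dicts in which assistant messages carry a `tool_calls` list. That message shape and the 0.0 result when no turn calls a tool are assumptions of this example.

```python
def avg_parallel_tool_calls_per_turn(messages):
    """Mean tool-call batch size over assistant turns that call tools."""
    batch_sizes = [
        len(m["tool_calls"])
        for m in messages
        # Assistant turns without tool calls are excluded from the mean.
        if m.get("role") == "assistant" and m.get("tool_calls")
    ]
    if not batch_sizes:
        return 0.0  # assumed value when no turn called a tool
    return sum(batch_sizes) / len(batch_sizes)
```

A rollout whose tool-calling turns issue 3 calls and then 1 call scores 2.0; a rollout that only ever issues one call per turn scores 1.0.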

Sandbox management

Sandbox command execution and cleanup are handled by src/sandbox_manager. It centralizes app-level retries, lifecycle counters, recoverable tool-facing errors, fatal infra errors, and best-effort sandbox deletion. In particular, a transient DELETE /sandbox/{id} 500 during rollout cleanup is retried and then recorded as an orphaned sandbox instead of failing the worker after an otherwise successful rollout.
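
The retry-then-orphan behavior can be sketched as follows. This is an illustration, not the code in src/sandbox_manager: `TransientSandboxError`, `destroy_sandbox_best_effort`, the attempt count, and the exponential backoff are hypothetical, and `delete_fn` stands in for whatever client call issues the DELETE request.

```python
import time


class TransientSandboxError(Exception):
    """Hypothetical marker for a retryable failure, e.g. an HTTP 500."""


def destroy_sandbox_best_effort(delete_fn, sandbox_id, orphaned,
                                attempts=3, base_delay=0.5, sleep=time.sleep):
    """Try to delete a sandbox; record it as orphaned instead of raising."""
    for attempt in range(attempts):
        try:
            delete_fn(sandbox_id)
            return True
        except TransientSandboxError:
            # Back off before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Retries exhausted: note the leak and let the worker carry on.
    orphaned.append(sandbox_id)
    return False
```

Returning a flag rather than raising keeps an otherwise successful rollout from being marked as failed, while the `orphaned` list leaves a trail for later garbage collection.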