opencode-science
Overview
- Environment ID:
opencode_science - Short description: Solve science problems using an OpenCode agent inside a sandbox, verified with
math_verify. - Tags:
science,opencode,multi-turn
Datasets
- Primary dataset: PrimeIntellect/INTELLECT-3-RL (subset
science, splittrain). - Any HuggingFace dataset with question/answer columns can be used.
Task
- Type: multi-turn (OpenCode CLI agent in a sandbox)
- Output format expectations: Agent output should contain a
\boxed{}answer. - Rubric:
HybridMathRubric— extracts\boxed{}from the agent's terminal output and verifies against the expected answer usingmath_verify. Produces a binarycorrect_answerscore (1.0 or 0.0).
Architecture
OpenCodeScienceEnv inherits from base classes in the verifiers package:
OpenCodeScienceEnv (environments/opencode_science/opencode_science.py)
└── OpenCodeQAEnv (verifiers/envs/experimental/opencode_qa_env.py)
└── OpenCodeEnv (verifiers/envs/experimental/opencode_env.py)
└── vf.CliAgentEnv (verifiers/envs/experimental/cli_agent_env.py)
OpenCodeEnv— installs and configures the OpenCode CLI agent in a sandbox, handles prompt/config upload.OpenCodeQAEnv— loads a HuggingFace QA dataset and formats it for the agent.OpenCodeScienceEnv— sets science-specific defaults (dataset, rubric, instruction prompt).
Quickstart
# install (local development)
uv pip install -e ./environments/opencode_science
# single debug rollout
uv run vf-eval --env opencode_science -d -v -n1 -r1
# multiple rollouts, save results
uv run vf-eval --env opencode_science -n5 -r3 -s
Environment Arguments
These are the arguments accepted by load_environment():
| Arg | Type | Default | Description |
|---|---|---|---|
dataset_name | str | "PrimeIntellect/INTELLECT-3-RL" | HuggingFace dataset name |
dataset_subset | str | "science" | Dataset subset/config |
dataset_split | str | "train" | Dataset split |
question_key | str | "question" | Column name for questions |
answer_key | str | "answer" | Column name for expected answers |
instruction_prompt | str | "Solve the following problem.\n\n" | Prefix prepended to each question |
instruction_prompt_post | str | "" | Suffix appended to each question |
difficulty_key | str | None | "avg@16_qwen3_4b_instruct_2507" | Column for difficulty filtering |
min_avg_reward | float | 0.0 | Minimum reward for dataset filtering |
max_avg_reward | float | 1.0 | Maximum reward for dataset filtering |
system_prompt | str | None | (OpenCode default) | System prompt for the agent |
disabled_tools | list[str] | None | ["question", "task", "websearch"] | OpenCode tools to disable |
agent_workdir | str | "/app" | Working directory inside the sandbox |
answer_path | str | "/app/answer.txt" | Path where the agent writes its final answer |
score_remotely | bool | True | Whether to read the answer from answer_path in the sandbox |
use_judge_fallback | bool | True | Fall back to LLM judge if math_verify fails |
judge_model | str | "openai/gpt-5-nano" | Model for the judge fallback |
judge_base_url | str | None | "https://api.pinference.ai/api/v1" | Base URL for the judge API |
judge_api_key_var | str | None | "PRIME_API_KEY" | Environment variable for the judge API key |
sandbox_docker_image | str | "...opencode-science:rl2" | Docker image for the sandbox (opencode binary baked in) |
timeout_seconds | float | 3600.0 | Rollout timeout (1h) |
sandbox_cpu_cores | int | 1 | CPU cores for the sandbox |
sandbox_memory_gb | int | 2 | Memory (GB) for the sandbox |
sandbox_disk_size_gb | int | 4 | Disk size (GB) for the sandbox |
sandbox_client_max_workers | int | None | None | Max concurrent sandbox workers |
max_turns | int | 100 | Max conversation turns |
Metrics
| Metric | Meaning |
|---|---|
reward | Main scalar reward: 1.0 if math_verify confirms correctness, else 0.0 |
correct_answer | Binary math_verify result (same as reward when no other reward functions are added) |
How it works
- On init, loads the HuggingFace dataset (science subset) and prepends the instruction prompt to each question.
- Each rollout creates a sandbox, installs the OpenCode CLI, uploads the prompt and config, then runs the agent.
- The agent's API calls are intercepted and routed to the configured LLM.
- After the agent finishes, the rubric reads the answer from
/app/answer.txtin the sandbox (whenscore_remotely=True) or extracts the\boxed{}answer from the conversation, and verifies it against the expected answer usingmath_verify. If verification fails anduse_judge_fallback=True, an LLM judge provides a fallback score.
Changelog
v0.3.11
- Bump
verifiersto>=0.1.15.dev2for the OpenCode harness config that disables title-generation calls while preserving thesmall_modelpin.
v0.3.10
- Default
sandbox_client_max_workerstoNoneso the shared sandbox client uses the verifiers default worker cap unless callers explicitly override it.
v0.3.9
- Harden sandbox image bootstrap against transient Ubuntu archive mirror sync flakes by adding apt acquire retries.
v0.3.8
- Fix
sandbox_docker_imageprefix. Thecme8364tg000o1139v84cu0cv/...prefix carried over from v0.3.7 is a user-scoped ID that the cluster cannot pull from, causingImagePullBackOffon every sandbox creation. Swap to the team-scopedteam-clyvldofb0000gg1kx39rgzjq/opencode-science:rl2.
v0.3.7
- Pin
sandbox_docker_imagedefault toteam-clyvldofb0000gg1kx39rgzjq/opencode-science:rl2. The new image bakes the opencode v1.1.63-rl2 binary into the sandbox so cold sandboxes no longer need to install it at rollout time. Documentation and image table updated to match.
v0.3.5
- Bump opencode fork release from
1.1.63-rl1to1.1.63-rl2(PrimeIntellect-ai/opencode#3). Fork release surfaces session-level retry exhaustion as a non-zero exit with a structured stderr dump, so hosted RL rollouts that previously returned silent empty trajectories now produce realAgentErrorentries. Companion default bump in verifiers: PrimeIntellect-ai/verifiers#1184.
v0.3.4
- Bump verifiers to stable
>=0.1.12.
v0.3.3
- Bump verifiers to
>=0.1.13.dev1.
v0.3.2
- Bump verifiers to stable
>=0.1.12.
v0.3.1
- Fix package structure: convert flat module to proper package directory so hatchling includes it in the wheel. Fixes
ModuleNotFoundErrorin hosted training. - Import harness and taskset from
verifiers.envs.experimental.composableinstead of separate packages.
v0.3.0
- Import harness and taskset from verifiers package proper (verifiers >= 0.1.12.dev5).
v0.2.2
- Migrate OpenCode fork from
rasdani/opencodetoPrimeIntellect-ai/opencode. Bump release from1.1.63-swe8to1.1.63-rl1(trimmed system prompt for RL training efficiency).
v0.2.1
- Bump verifiers to >=0.1.12.dev3: fixes opencode model ID for LoRA adapter names without
/in hosted training. - Use personal sandbox image for public reproducibility.
v0.2.0
- Rewrite to composable architecture. Uses
ComposableEnv+MathTaskSet(subset="science")+opencode_harness. Scored byRemoteHybridMathRubricwith judge fallback. ReplacesOpenCodeScienceEnvclass hierarchy. - Verify OpenCode tarball integrity with pinned SHA-256 checksum (via
opencode_harness).
v0.1.1
- Bump verifiers to v0.1.12.dev1: perf improvements to
MathRubric(used internally byHybridMathRubric); now usesextract_boxed_answerin strict mode — if no\boxed{}answer is found the parsed answer is""which always scores 0, preventing false positives where the model is rewarded for containing the correct answer anywhere in the response
v0.1.0
- Initial release