opencode-science

Overview

Environment ID: opencode_science
Short description: Solve science problems using an OpenCode agent inside a sandbox, verified with math_verify.
Tags: science, opencode, multi-turn

Datasets

Primary dataset: PrimeIntellect/INTELLECT-3-RL (subset science, split train).
Any HuggingFace dataset with question/answer columns can be used.

Task

Type: multi-turn (OpenCode CLI agent in a sandbox)
Output format expectations: Agent output should contain a \boxed{} answer.
Rubric: HybridMathRubric — extracts \boxed{} from the agent's terminal output and verifies against the expected answer using math_verify. Produces a binary correct_answer score (1.0 or 0.0).

Architecture

OpenCodeScienceEnv inherits from base classes in the verifiers package:

OpenCodeScienceEnv  (environments/opencode_science/opencode_science.py)
  └── OpenCodeQAEnv  (verifiers/envs/experimental/opencode_qa_env.py)
       └── OpenCodeEnv  (verifiers/envs/experimental/opencode_env.py)
            └── vf.CliAgentEnv  (verifiers/envs/experimental/cli_agent_env.py)

OpenCodeEnv — installs and configures the OpenCode CLI agent in a sandbox, handles prompt/config upload.
OpenCodeQAEnv — loads a HuggingFace QA dataset and formats it for the agent.
OpenCodeScienceEnv — sets science-specific defaults (dataset, rubric, instruction prompt).

Quickstart

# install (local development)
uv pip install -e ./environments/opencode_science

# single debug rollout
uv run vf-eval --env opencode_science -d -v -n1 -r1

# multiple rollouts, save results
uv run vf-eval --env opencode_science -n5 -r3 -s

Environment Arguments

These are the arguments accepted by load_environment():

Arg	Type	Default	Description
`dataset_name`	str	`"PrimeIntellect/INTELLECT-3-RL"`	HuggingFace dataset name
`dataset_subset`	str	`"science"`	Dataset subset/config
`dataset_split`	str	`"train"`	Dataset split
`question_key`	str	`"question"`	Column name for questions
`answer_key`	str	`"answer"`	Column name for expected answers
`instruction_prompt`	str	`"Solve the following problem.\n\n"`	Prefix prepended to each question
`instruction_prompt_post`	str	`""`	Suffix appended to each question
`difficulty_key`	str \| None	`"avg@16_qwen3_4b_instruct_2507"`	Column for difficulty filtering
`min_avg_reward`	float	`0.0`	Minimum reward for dataset filtering
`max_avg_reward`	float	`1.0`	Maximum reward for dataset filtering
`system_prompt`	str \| None	(OpenCode default)	System prompt for the agent
`disabled_tools`	list[str] \| None	`["question", "task", "websearch"]`	OpenCode tools to disable
`agent_workdir`	str	`"/app"`	Working directory inside the sandbox
`answer_path`	str	`"/app/answer.txt"`	Path where the agent writes its final answer
`score_remotely`	bool	`True`	Whether to read the answer from `answer_path` in the sandbox
`use_judge_fallback`	bool	`True`	Fall back to LLM judge if math_verify fails
`judge_model`	str	`"openai/gpt-5-nano"`	Model for the judge fallback
`judge_base_url`	str \| None	`"https://api.pinference.ai/api/v1"`	Base URL for the judge API
`judge_api_key_var`	str \| None	`"PRIME_API_KEY"`	Environment variable for the judge API key
`sandbox_docker_image`	str	`"...opencode-science:rl2"`	Docker image for the sandbox (opencode binary baked in)
`timeout_seconds`	float	`3600.0`	Rollout timeout (1h)
`sandbox_cpu_cores`	int	`1`	CPU cores for the sandbox
`sandbox_memory_gb`	int	`2`	Memory (GB) for the sandbox
`sandbox_disk_size_gb`	int	`4`	Disk size (GB) for the sandbox
`sandbox_client_max_workers`	int \| None	`None`	Max concurrent sandbox workers
`max_turns`	int	`100`	Max conversation turns

Metrics

Metric	Meaning
`reward`	Main scalar reward: 1.0 if `math_verify` confirms correctness, else 0.0
`correct_answer`	Binary `math_verify` result (same as reward when no other reward functions are added)

How it works

On init, loads the HuggingFace dataset (science subset) and prepends the instruction prompt to each question.
Each rollout creates a sandbox, installs the OpenCode CLI, uploads the prompt and config, then runs the agent.
The agent's API calls are intercepted and routed to the configured LLM.
After the agent finishes, the rubric reads the answer from /app/answer.txt in the sandbox (when score_remotely=True) or extracts the \boxed{} answer from the conversation, and verifies it against the expected answer using math_verify. If verification fails and use_judge_fallback=True, an LLM judge provides a fallback score.

Changelog

v0.3.11

Bump verifiers to >=0.1.15.dev2 for the OpenCode harness config that disables title-generation calls while preserving the small_model pin.

v0.3.10

Default sandbox_client_max_workers to None so the shared sandbox client uses the verifiers default worker cap unless callers explicitly override it.

v0.3.9

Harden sandbox image bootstrap against transient Ubuntu archive mirror sync flakes by adding apt acquire retries.

v0.3.8

Fix sandbox_docker_image prefix. The cme8364tg000o1139v84cu0cv/... prefix carried over from v0.3.7 is a user-scoped ID that the cluster cannot pull from, causing ImagePullBackOff on every sandbox creation. Swap to the team-scoped team-clyvldofb0000gg1kx39rgzjq/opencode-science:rl2.

v0.3.7

Pin sandbox_docker_image default to team-clyvldofb0000gg1kx39rgzjq/opencode-science:rl2. The new image bakes the opencode v1.1.63-rl2 binary into the sandbox so cold sandboxes no longer need to install it at rollout time. Documentation and image table updated to match.

v0.3.5

Bump opencode fork release from 1.1.63-rl1 to 1.1.63-rl2 (PrimeIntellect-ai/opencode#3). Fork release surfaces session-level retry exhaustion as a non-zero exit with a structured stderr dump, so hosted RL rollouts that previously returned silent empty trajectories now produce real AgentError entries. Companion default bump in verifiers: PrimeIntellect-ai/verifiers#1184.

v0.3.4

Bump verifiers to stable >=0.1.12.

v0.3.3

Bump verifiers to >=0.1.13.dev1.

v0.3.2

Bump verifiers to stable >=0.1.12.

v0.3.1

Fix package structure: convert flat module to proper package directory so hatchling includes it in the wheel. Fixes ModuleNotFoundError in hosted training.
Import harness and taskset from verifiers.envs.experimental.composable instead of separate packages.

v0.3.0

Import harness and taskset from verifiers package proper (verifiers >= 0.1.12.dev5).

v0.2.2

Migrate OpenCode fork from rasdani/opencode to PrimeIntellect-ai/opencode. Bump release from 1.1.63-swe8 to 1.1.63-rl1 (trimmed system prompt for RL training efficiency).

v0.2.1

Bump verifiers to >=0.1.12.dev3: fixes opencode model ID for LoRA adapter names without / in hosted training.
Use personal sandbox image for public reproducibility.

v0.2.0

Rewrite to composable architecture. Uses ComposableEnv + MathTaskSet(subset="science") + opencode_harness. Scored by RemoteHybridMathRubric with judge fallback. Replaces OpenCodeScienceEnv class hierarchy.
Verify OpenCode tarball integrity with pinned SHA-256 checksum (via opencode_harness).

v0.1.1

Bump verifiers to v0.1.12.dev1: perf improvements to MathRubric (used internally by HybridMathRubric); now uses extract_boxed_answer in strict mode — if no \boxed{} answer is found the parsed answer is "" which always scores 0, preventing false positives where the model is rewarded for containing the correct answer anywhere in the response

v0.1.0

Initial release