Opencode Science RL Env (Prime Intellect)

Solve science problems using an OpenCode agent via ComposableEnv.

Type: RL Env
Runtime: multi-turn
License: unknown
Size: v0.3.12
Published: Mar 2026

opencode-science

Overview

  • Environment ID: opencode_science
  • Short description: Solve science problems using an OpenCode agent inside a sandbox, verified with math_verify.
  • Tags: science, opencode, multi-turn

Datasets

  • Primary dataset: PrimeIntellect/INTELLECT-3-RL (subset science, split train).
  • Any HuggingFace dataset with question/answer columns can be used.
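As a rough illustration of what "question/answer columns" implies, the sketch below filters and formats rows using the argument names listed under Environment Arguments. It operates on a plain list of dicts as a stand-in for a HuggingFace dataset; `prepare_rows` is a hypothetical helper, not part of the environment, and the inclusive difficulty bounds are an assumption.

```python
from typing import Any, Optional


def prepare_rows(
    rows: list[dict[str, Any]],
    question_key: str = "question",
    answer_key: str = "answer",
    instruction_prompt: str = "Solve the following problem.\n\n",
    instruction_prompt_post: str = "",
    difficulty_key: Optional[str] = None,
    min_avg_reward: float = 0.0,
    max_avg_reward: float = 1.0,
) -> list[dict[str, str]]:
    """Filter rows by a difficulty column, then wrap each question in the prompt text."""
    prepared = []
    for row in rows:
        if difficulty_key is not None:
            score = row[difficulty_key]
            # Assumption: both bounds are inclusive.
            if not (min_avg_reward <= score <= max_avg_reward):
                continue
        prepared.append(
            {
                "question": f"{instruction_prompt}{row[question_key]}{instruction_prompt_post}",
                "answer": str(row[answer_key]),
            }
        )
    return prepared
```

With custom column names, a call might look like `prepare_rows(rows, question_key="q", answer_key="a")`.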

Task

  • Type: multi-turn (OpenCode CLI agent in a sandbox)
  • Output format expectations: Agent output should contain a \boxed{} answer.
  • Rubric: HybridMathRubric — extracts \boxed{} from the agent's terminal output and verifies against the expected answer using math_verify. Produces a binary correct_answer score (1.0 or 0.0).
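The extraction step described above can be approximated with a small brace-matching routine. This is a minimal sketch, not the rubric's actual implementation: `extract_last_boxed` and `binary_score` are hypothetical names, and plain string equality stands in for math_verify, which checks mathematical equivalence rather than exact text.

```python
def extract_last_boxed(text: str) -> str:
    """Return the contents of the last \\boxed{...} in text, or "" if none is found."""
    marker = r"\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return ""
    begin = start + len(marker)
    depth = 0  # nesting depth of braces inside the boxed expression
    for i in range(begin, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                return text[begin:i]
            depth -= 1
    return ""  # unbalanced braces: treat as no answer


def binary_score(output: str, expected: str) -> float:
    """Score 1.0 only when a boxed answer exists and matches the expected answer."""
    answer = extract_last_boxed(output).strip()
    # Stand-in comparison; the real rubric delegates this to math_verify.
    return 1.0 if answer and answer == expected.strip() else 0.0
```

Note that an output containing the right value without `\boxed{}` scores 0.0 here, which mirrors the binary, boxed-only expectation stated above.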

Architecture

OpenCodeScienceEnv inherits from base classes in the verifiers package:

OpenCodeScienceEnv  (environments/opencode_science/opencode_science.py)
  └── OpenCodeQAEnv  (verifiers/envs/experimental/opencode_qa_env.py)
       └── OpenCodeEnv  (verifiers/envs/experimental/opencode_env.py)
            └── vf.CliAgentEnv  (verifiers/envs/experimental/cli_agent_env.py)

  • OpenCodeEnv: installs and configures the OpenCode CLI agent in a sandbox, handles prompt/config upload.
  • OpenCodeQAEnv: loads a HuggingFace QA dataset and formats it for the agent.
  • OpenCodeScienceEnv: sets science-specific defaults (dataset, rubric, instruction prompt).

Quickstart

# install (local development)
uv pip install -e ./environments/opencode_science

# single debug rollout
uv run vf-eval --env opencode_science -d -v -n1 -r1

# multiple rollouts, save results
uv run vf-eval --env opencode_science -n5 -r3 -s

Environment Arguments

These are the arguments accepted by load_environment():

| Arg | Type | Default | Description |
|---|---|---|---|
| dataset_name | str | "PrimeIntellect/INTELLECT-3-RL" | HuggingFace dataset name |
| dataset_subset | str | "science" | Dataset subset/config |
| dataset_split | str | "train" | Dataset split |
| question_key | str | "question" | Column name for questions |
| answer_key | str | "answer" | Column name for expected answers |
| instruction_prompt | str | "Solve the following problem.\n\n" | Prefix prepended to each question |
| instruction_prompt_post | str | "" | Suffix appended to each question |
| difficulty_key | str \| None | "avg@16_qwen3_4b_instruct_2507" | Column for difficulty filtering |
| min_avg_reward | float | 0.0 | Minimum reward for dataset filtering |
| max_avg_reward | float | 1.0 | Maximum reward for dataset filtering |
| system_prompt | str \| None | (OpenCode default) | System prompt for the agent |
| disabled_tools | list[str] \| None | ["question", "task", "websearch"] | OpenCode tools to disable |
| agent_workdir | str | "/app" | Working directory inside the sandbox |
| answer_path | str | "/app/answer.txt" | Path where the agent writes its final answer |
| score_remotely | bool | True | Whether to read the answer from answer_path in the sandbox |
| use_judge_fallback | bool | True | Fall back to LLM judge if math_verify fails |
| judge_model | str | "openai/gpt-5-nano" | Model for the judge fallback |
| judge_base_url | str \| None | "https://api.pinference.ai/api/v1" | Base URL for the judge API |
| judge_api_key_var | str \| None | "PRIME_API_KEY" | Environment variable for the judge API key |
| sandbox_docker_image | str | "...opencode-science:rl2" | Docker image for the sandbox (opencode binary baked in) |
| timeout_seconds | float | 3600.0 | Rollout timeout (1h) |
| sandbox_cpu_cores | int | 1 | CPU cores for the sandbox |
| sandbox_memory_gb | int | 2 | Memory (GB) for the sandbox |
| sandbox_disk_size_gb | int | 4 | Disk size (GB) for the sandbox |
| sandbox_client_max_workers | int \| None | None | Max concurrent sandbox workers |
| max_turns | int | 100 | Max conversation turns |
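When overriding several of these arguments at once, it can help to build the keyword arguments in one place and catch misspelled names early. The sketch below assumes a hand-copied, partial `DEFAULTS` dict taken from the documented defaults; `build_env_args` is a hypothetical helper, not part of the package.

```python
import json

# Partial copy of the documented defaults (an illustration, not the package's source of truth).
DEFAULTS = {
    "dataset_name": "PrimeIntellect/INTELLECT-3-RL",
    "dataset_subset": "science",
    "dataset_split": "train",
    "question_key": "question",
    "answer_key": "answer",
    "agent_workdir": "/app",
    "answer_path": "/app/answer.txt",
    "score_remotely": True,
    "use_judge_fallback": True,
    "judge_model": "openai/gpt-5-nano",
    "timeout_seconds": 3600.0,
    "max_turns": 100,
}


def build_env_args(**overrides):
    """Merge overrides onto the defaults, rejecting names that are not known arguments."""
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise KeyError(f"unknown environment arguments: {unknown}")
    return {**DEFAULTS, **overrides}


if __name__ == "__main__":
    args = build_env_args(max_turns=50, use_judge_fallback=False)
    print(json.dumps(args, indent=2))
```

The resulting dict is intended to be passed as keyword arguments to load_environment(); only the overridden keys actually need to be supplied, since the rest already match the defaults.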

Metrics

| Metric | Meaning |
|---|---|
| reward | Main scalar reward: 1.0 if math_verify confirms correctness, else 0.0 |
| correct_answer | Binary math_verify result (same as reward when no other reward functions are added) |
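Because the reward is binary, averaging it over rollouts gives the fraction of correct attempts. The sketch below shows one way to summarize a run with several rollouts per example. The `(example_id, reward)` pair format and the `summarize` helper are hypothetical and do not reflect the format of saved evaluation results.

```python
from collections import defaultdict
from statistics import mean


def summarize(rollouts: list[tuple[int, float]]) -> dict:
    """Compute per-example and overall mean reward from (example_id, reward) pairs."""
    if not rollouts:
        raise ValueError("no rollouts to summarize")
    by_example = defaultdict(list)
    for example_id, reward in rollouts:
        by_example[example_id].append(reward)
    return {
        # Fraction of correct rollouts for each example.
        "per_example": {k: mean(v) for k, v in by_example.items()},
        # Fraction of correct rollouts across the whole run.
        "overall": mean(reward for _, reward in rollouts),
    }
```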

How it works

  1. On init, loads the HuggingFace dataset (science subset) and prepends the instruction prompt to each question.
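  2. Each rollout creates a sandbox from sandbox_docker_image (the default image ships with the opencode binary, so nothing needs to be installed at rollout time), uploads the prompt and config, then runs the agent.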
  3. The agent's API calls are intercepted and routed to the configured LLM.
  4. After the agent finishes, the rubric reads the answer from /app/answer.txt in the sandbox (when score_remotely=True) or extracts the \boxed{} answer from the conversation, and verifies it against the expected answer using math_verify. If verification fails and use_judge_fallback=True, an LLM judge provides a fallback score.
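The scoring decision in step 4 can be sketched as a small function. This is an illustration under stated assumptions, not the rubric's code: `score_rollout` is a hypothetical name, `verify` and `judge` are injected callables standing in for math_verify and the LLM judge, the file contents are treated as the raw candidate answer, the regex handles only `\boxed{}` contents without nested braces, and the judge is assumed to be consulted whenever verification does not confirm the answer.

```python
import re
from typing import Callable, Optional


def score_rollout(
    expected: str,
    conversation_text: str,
    sandbox_answer: Optional[str],
    verify: Callable[[str, str], bool],
    judge: Optional[Callable[[str, str], bool]] = None,
    score_remotely: bool = True,
    use_judge_fallback: bool = True,
) -> float:
    """Pick the answer source, verify it, and fall back to a judge if allowed."""
    if score_remotely:
        # Answer as read from answer_path inside the sandbox (None if missing).
        candidate = (sandbox_answer or "").strip()
    else:
        # Simplified extraction: last \boxed{...} without nested braces.
        matches = re.findall(r"\\boxed\{([^{}]*)\}", conversation_text)
        candidate = matches[-1].strip() if matches else ""

    if candidate and verify(candidate, expected):
        return 1.0
    if use_judge_fallback and judge is not None and judge(candidate, expected):
        return 1.0
    return 0.0
```

Passing the verifier and judge as callables keeps the control flow testable without a sandbox or an API key; for example, `score_rollout("42", "", "42\n", verify=lambda a, e: a == e)` exercises the remote-answer path with exact string matching.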

Changelog

v0.3.11

  • Bump verifiers to >=0.1.15.dev2 for the OpenCode harness config that disables title-generation calls while preserving the small_model pin.

v0.3.10

  • Default sandbox_client_max_workers to None so the shared sandbox client uses the verifiers default worker cap unless callers explicitly override it.

v0.3.9

  • Harden sandbox image bootstrap against transient Ubuntu archive mirror sync flakes by adding apt acquire retries.

v0.3.8

  • Fix sandbox_docker_image prefix. The cme8364tg000o1139v84cu0cv/... prefix carried over from v0.3.7 is a user-scoped ID that the cluster cannot pull from, causing ImagePullBackOff on every sandbox creation. Swap to the team-scoped team-clyvldofb0000gg1kx39rgzjq/opencode-science:rl2.

v0.3.7

  • Pin sandbox_docker_image default to team-clyvldofb0000gg1kx39rgzjq/opencode-science:rl2. The new image bakes the opencode v1.1.63-rl2 binary into the sandbox so cold sandboxes no longer need to install it at rollout time. Documentation and image table updated to match.

v0.3.5

  • Bump opencode fork release from 1.1.63-rl1 to 1.1.63-rl2 (PrimeIntellect-ai/opencode#3). Fork release surfaces session-level retry exhaustion as a non-zero exit with a structured stderr dump, so hosted RL rollouts that previously returned silent empty trajectories now produce real AgentError entries. Companion default bump in verifiers: PrimeIntellect-ai/verifiers#1184.

v0.3.4

  • Bump verifiers to stable >=0.1.12.

v0.3.3

  • Bump verifiers to >=0.1.13.dev1.

v0.3.2

  • Bump verifiers to stable >=0.1.12.

v0.3.1

  • Fix package structure: convert flat module to proper package directory so hatchling includes it in the wheel. Fixes ModuleNotFoundError in hosted training.
  • Import harness and taskset from verifiers.envs.experimental.composable instead of separate packages.

v0.3.0

  • Import harness and taskset from verifiers package proper (verifiers >= 0.1.12.dev5).

v0.2.2

  • Migrate OpenCode fork from rasdani/opencode to PrimeIntellect-ai/opencode. Bump release from 1.1.63-swe8 to 1.1.63-rl1 (trimmed system prompt for RL training efficiency).

v0.2.1

  • Bump verifiers to >=0.1.12.dev3: fixes opencode model ID for LoRA adapter names without / in hosted training.
  • Use personal sandbox image for public reproducibility.

v0.2.0

  • Rewrite to composable architecture. Uses ComposableEnv + MathTaskSet(subset="science") + opencode_harness. Scored by RemoteHybridMathRubric with judge fallback. Replaces OpenCodeScienceEnv class hierarchy.
  • Verify OpenCode tarball integrity with pinned SHA-256 checksum (via opencode_harness).

v0.1.1

  • Bump verifiers to v0.1.12.dev1, which brings performance improvements to MathRubric (used internally by HybridMathRubric). MathRubric now uses extract_boxed_answer in strict mode: if no \boxed{} answer is found, the parsed answer is "" and always scores 0. This prevents false positives where the model is rewarded merely for containing the correct answer somewhere in its response.

v0.1.0

  • Initial release