rlm-science

RLM agent solving science problems inside Prime Sandboxes via ComposableEnv.

Overview

Environment ID: rlm_science
Agent: RLM, a minimal CLI agent with built-in IPython (Python + shell). numpy, scipy, and sympy are installed into rlm's tool venv at agent install time via RLM_EXTRA_UV_ARGS, so they are importable from inside IPython.
Dataset: PrimeIntellect/INTELLECT-3-RL (subset science, split train) by default. Any HF dataset with question/answer columns works.
Scoring: RemoteHybridMathRubric reads the agent's answer from /app/answer.txt and runs math_verify against the gold answer, falling back to an LLM judge on a miss. The env installs math-verify into the sandbox's system Python during setup so the scorer can import it.

Quickstart

# From research-environments root
uv pip install -e ./environments/rlm_science

# Single debug rollout (GH_TOKEN required when the host needs to fill the rlm checkout cache)
GH_TOKEN=... uv run vf-eval rlm-science -d -v -n1 -r1

# Multiple rollouts, save results
GH_TOKEN=... uv run vf-eval rlm-science -n5 -r3 -s

Environment Arguments

Argument	Default	Description
`dataset_name`	`"PrimeIntellect/INTELLECT-3-RL"`	HF dataset name
`dataset_subset`	`"science"`	HF subset
`dataset_split`	`"train"`	HF split
`question_key`	`"question"`	Column for questions
`answer_key`	`"answer"`	Column for expected answers
`instruction_prompt`	`"Solve the following problem.\n\n"`	Prefix prepended to each question
`answer_path`	`"/app/answer.txt"`	Path the agent writes its final answer to
`difficulty_key`	`"avg@16_qwen3_4b_instruct_2507"`	Column for difficulty filtering
`min_avg_reward`	`0.0`	Min reward for difficulty filter
`max_avg_reward`	`1.0`	Max reward for difficulty filter
`judge_model`	`"openai/gpt-5-nano"`	Judge model for fallback
`judge_base_url`	`"https://api.pinference.ai/api/v1"`	Judge API base URL
`judge_api_key_var`	`"PRIME_API_KEY"`	Env var for judge API key
`use_judge_fallback`	`True`	Use LLM judge if `math_verify` fails
`judge_prompt`	`None`	Override judge prompt
`judge_timeout`	`1200.0`	Judge HTTP timeout (s)
`sandbox_docker_image`	`"python:3.11-slim"`	Sandbox image; computation libs are installed live into rlm's tool venv
`extra_pip_packages`	`["numpy", "scipy", "sympy"]`	Extra packages installed into rlm's tool venv via `RLM_EXTRA_UV_ARGS`. Pass `[]` to install nothing extra. Avoid shell metacharacters (e.g. `>=`) — install.sh word-splits these tokens.
`gh_token`	`$GH_TOKEN`	GitHub token used on the host to fill the rlm checkout cache when needed
`max_turns`	`200`	Maximum interception-server turns
`timeout_seconds`	`3600.0`	Rollout timeout (1h); also drives sandbox lifetime
`poll_interval`	`1.0`	Intercept-queue poll cadence (s)
`sandbox_cpu_cores`	`1`	CPU cores per sandbox (matches `opencode_science`)
`sandbox_memory_gb`	`2`	Memory (GB) per sandbox
`sandbox_disk_size_gb`	`4`	Disk (GB) per sandbox
`sandbox_client_max_workers`	`None`	Max workers in shared sandbox client
`labels`	`["rlm-science"]`	Sandbox labels
`**rlm_kwargs`	—	Forwarded as-is to `rlm_harness`. Includes `rlm_max_turns`, `rlm_exec_timeout`, `summarize_at_tokens`, `rlm_ref`, `rlm_repo_url`, `local_checkout`, `rlm_tools`, `append_to_system_prompt`.
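
The difficulty arguments in the table above can be pictured as a range check on the `difficulty_key` column. The following is a minimal sketch, not the environment's actual code: `filter_by_difficulty` is a hypothetical helper, the rows are made-up examples, and both the inclusive bounds and the dropping of rows without a score are assumptions.

```python
from typing import Any, Iterable

KEY = "avg@16_qwen3_4b_instruct_2507"  # default difficulty_key from the table


def filter_by_difficulty(
    rows: Iterable[dict[str, Any]],
    difficulty_key: str = KEY,
    min_avg_reward: float = 0.0,
    max_avg_reward: float = 1.0,
) -> list[dict[str, Any]]:
    """Keep rows whose score lies in [min_avg_reward, max_avg_reward]."""
    kept = []
    for row in rows:
        score = row.get(difficulty_key)
        # Assumption: rows without a difficulty score are dropped.
        if score is None:
            continue
        # Assumption: both bounds are inclusive.
        if min_avg_reward <= score <= max_avg_reward:
            kept.append(row)
    return kept


# Made-up rows for illustration only.
rows = [
    {"question": "q1", "answer": "1", KEY: 0.0},
    {"question": "q2", "answer": "2", KEY: 0.5},
    {"question": "q3", "answer": "3", KEY: 1.0},
]
medium = filter_by_difficulty(rows, min_avg_reward=0.25, max_avg_reward=0.75)
print([r["question"] for r in medium])
```

With the defaults (`0.0` to `1.0`) every scored row passes, so the filter only has an effect once one of the bounds is tightened.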

Computation libraries inside the agent

rlm's IPython kernel runs in rlm's own tool venv, which rlm's install.sh creates with uv tool install. To make numpy, scipy, and sympy importable from inside the kernel, this env passes RLM_EXTRA_UV_ARGS="--with numpy --with scipy --with sympy" via ComposableEnv(install_env=...). The install script reads that variable and installs the extras alongside rlm itself. Override the package list via extra_pip_packages.
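
As an illustration of how a package list could map onto that variable, here is a minimal sketch. `build_extra_uv_args` is a hypothetical helper, not the environment's real code, and the name check is an assumption motivated by the word-splitting caveat on `extra_pip_packages`.

```python
import re

# Assumption: only plain package names are accepted (letters, digits, ".", "_", "-"),
# because install.sh word-splits the value and specifiers such as ">=" are unsafe.
_SAFE_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*$")


def build_extra_uv_args(packages: list[str]) -> str:
    """Turn ["numpy", "scipy"] into "--with numpy --with scipy"."""
    for name in packages:
        if not _SAFE_NAME.match(name):
            raise ValueError(f"unsafe package spec: {name!r}")
    return " ".join(f"--with {name}" for name in packages)


# Mirrors the shape of the install_env= mapping mentioned above.
install_env = {"RLM_EXTRA_UV_ARGS": build_extra_uv_args(["numpy", "scipy", "sympy"])}
print(install_env["RLM_EXTRA_UV_ARGS"])  # --with numpy --with scipy --with sympy
```

An empty list yields an empty string, which corresponds to installing nothing extra.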

How scoring works

`MathTaskSet.get_instruction` appends "Write your final answer to `/app/answer.txt`." to each prompt. After the rollout, `RemoteHybridMathRubric` reads that file from the sandbox and runs `math_verify` against the gold answer. On a miss it falls back to an LLM judge. Reward is 1.0 on match, else 0.0.
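
The verify-then-judge flow can be sketched as follows. This is an illustration under stated assumptions, not the rubric's implementation: `hybrid_reward` is a hypothetical function, and the `exact` and `lenient` callables are simple stand-ins for `math_verify` and the LLM judge.

```python
from typing import Callable, Optional

Checker = Callable[[str, str], bool]


def hybrid_reward(
    answer: str,
    gold: str,
    verify: Checker,
    judge: Optional[Checker] = None,
    use_judge_fallback: bool = True,
) -> float:
    """Return 1.0 if the verifier, or on a miss the judge, accepts; else 0.0."""
    if verify(answer, gold):
        return 1.0
    # The judge is consulted only when the verifier misses and fallback is enabled.
    if use_judge_fallback and judge is not None and judge(answer, gold):
        return 1.0
    return 0.0


# Stand-ins (assumptions): a strict string check and a case-insensitive one.
def exact(answer: str, gold: str) -> bool:
    return answer.strip() == gold.strip()


def lenient(answer: str, gold: str) -> bool:
    return answer.strip().lower() == gold.strip().lower()


print(hybrid_reward("42\n", "42", exact))
print(hybrid_reward("Helium", "helium", exact, judge=lenient))
print(hybrid_reward("Helium", "helium", exact, judge=lenient, use_judge_fallback=False))
```

Setting `use_judge_fallback=False` in this sketch reduces scoring to the verifier alone, which matches the described role of the `use_judge_fallback` argument.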

Changelog

v0.1.1

Default sandbox_client_max_workers to None so the shared sandbox client uses the verifiers default worker cap unless callers explicitly override it.

v0.1.0

Initial release.