rlm-science
RLM agent solving science problems inside Prime Sandboxes via ComposableEnv.
Overview
- Environment ID: `rlm_science`
- Agent: RLM, a minimalistic CLI agent with a builtin `ipython` tool (Python + shell). `numpy`, `scipy`, and `sympy` are installed into rlm's tool venv at agent install time via `RLM_EXTRA_UV_ARGS`, so they're importable from inside ipython.
- Dataset: PrimeIntellect/INTELLECT-3-RL (subset `science`, split `train`) by default. Any HF dataset with question/answer columns works.
- Scoring: `RemoteHybridMathRubric` reads the agent's answer from `/app/answer.txt`, runs `math_verify` against the gold answer (the env installs `math-verify` into the sandbox's system Python during setup so the scorer can import it), and falls back to an LLM judge on a miss.
Quickstart
# From research-environments root
uv pip install -e ./environments/rlm_science
# Single debug rollout (GH_TOKEN required when the host needs to fill the rlm checkout cache)
GH_TOKEN=... uv run vf-eval rlm-science -d -v -n1 -r1
# Multiple rollouts, save results
GH_TOKEN=... uv run vf-eval rlm-science -n5 -r3 -s
Environment Arguments
| Argument | Default | Description |
|---|---|---|
| `dataset_name` | `"PrimeIntellect/INTELLECT-3-RL"` | HF dataset name |
| `dataset_subset` | `"science"` | HF subset |
| `dataset_split` | `"train"` | HF split |
| `question_key` | `"question"` | Column for questions |
| `answer_key` | `"answer"` | Column for expected answers |
| `instruction_prompt` | `"Solve the following problem.\n\n"` | Prefix prepended to each question |
| `answer_path` | `"/app/answer.txt"` | Path the agent writes its final answer to |
| `difficulty_key` | `"avg@16_qwen3_4b_instruct_2507"` | Column for difficulty filtering |
| `min_avg_reward` | `0.0` | Min reward for difficulty filter |
| `max_avg_reward` | `1.0` | Max reward for difficulty filter |
| `judge_model` | `"openai/gpt-5-nano"` | Judge model for fallback |
| `judge_base_url` | `"https://api.pinference.ai/api/v1"` | Judge API base URL |
| `judge_api_key_var` | `"PRIME_API_KEY"` | Env var for judge API key |
| `use_judge_fallback` | `True` | Use LLM judge if math_verify fails |
| `judge_prompt` | `None` | Override judge prompt |
| `judge_timeout` | `1200.0` | Judge HTTP timeout (s) |
| `sandbox_docker_image` | `"python:3.11-slim"` | Sandbox image; computation libs are installed live into rlm's tool venv |
| `extra_pip_packages` | `["numpy", "scipy", "sympy"]` | Extra packages installed into rlm's tool venv via `RLM_EXTRA_UV_ARGS`. Pass `[]` to install nothing extra. Avoid shell metacharacters (e.g. `>=`), because install.sh word-splits these tokens. |
| `gh_token` | `$GH_TOKEN` | GitHub token used on the host to fill the rlm checkout cache when needed |
| `max_turns` | `200` | Interception server turns |
| `timeout_seconds` | `3600.0` | Rollout timeout (1h); also drives sandbox lifetime |
| `poll_interval` | `1.0` | Intercept-queue poll cadence (s) |
| `sandbox_cpu_cores` | `1` | CPU cores per sandbox (matches opencode_science) |
| `sandbox_memory_gb` | `2` | Memory (GB) per sandbox |
| `sandbox_disk_size_gb` | `4` | Disk (GB) per sandbox |
| `sandbox_client_max_workers` | `None` | Max workers in shared sandbox client |
| `labels` | `["rlm-science"]` | Sandbox labels |
| `**rlm_kwargs` | - | Forwarded as-is to rlm_harness. Includes `rlm_max_turns`, `rlm_exec_timeout`, `summarize_at_tokens`, `rlm_ref`, `rlm_repo_url`, `local_checkout`, `rlm_tools`, `append_to_system_prompt`. |
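
The difficulty filter controlled by `difficulty_key`, `min_avg_reward`, and `max_avg_reward` can be pictured roughly as below. This is only an illustrative sketch, not the environment's actual code: the helper name `filter_by_difficulty` is hypothetical, and both the inclusive bounds and the decision to drop rows without a score are assumptions.

```python
# Hypothetical helper illustrating the difficulty filter; not the env's real code.
def filter_by_difficulty(
    rows,
    difficulty_key="avg@16_qwen3_4b_instruct_2507",
    min_avg_reward=0.0,
    max_avg_reward=1.0,
):
    """Keep rows whose difficulty score lies in [min_avg_reward, max_avg_reward]."""
    kept = []
    for row in rows:
        score = row.get(difficulty_key)
        # Assumption: rows with no recorded score are dropped.
        if score is None:
            continue
        # Assumption: both bounds are inclusive.
        if min_avg_reward <= score <= max_avg_reward:
            kept.append(row)
    return kept


key = "avg@16_qwen3_4b_instruct_2507"
rows = [
    {"question": "easy", key: 0.95},
    {"question": "medium", key: 0.5},
    {"question": "hard", key: 0.05},
    {"question": "unscored"},
]

# Keep only problems that the reference model solves between 10% and 90% of the time.
medium_only = filter_by_difficulty(rows, min_avg_reward=0.1, max_avg_reward=0.9)
print([r["question"] for r in medium_only])
```

With the default bounds of `0.0` and `1.0`, every scored row passes, so the filter only takes effect once one of the bounds is tightened.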
Computation libraries inside the agent
rlm's ipython kernel runs in rlm's own tool venv (created by uv tool install in rlm's install.sh). To make numpy/scipy/sympy importable from inside the kernel, this env passes RLM_EXTRA_UV_ARGS="--with numpy --with scipy --with sympy" via ComposableEnv(install_env=...). The install script reads it and uv tool installs the extras alongside rlm itself. Override via extra_pip_packages.
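
As a rough illustration of how a package list maps onto the `RLM_EXTRA_UV_ARGS` string shown above, the sketch below joins each package with a `--with` flag. The helper name `build_extra_uv_args` and the validation regex are hypothetical, and the allowed character set is a deliberately conservative assumption motivated by the word-splitting caveat on `extra_pip_packages`; the real environment may build the string differently.

```python
import re

# Assumption: only these characters are treated as safe; anything else
# (for example ">" or whitespace) is rejected up front.
_SAFE_TOKEN = re.compile(r"[A-Za-z0-9._=-]+")


def build_extra_uv_args(packages):
    """Turn ["numpy", "scipy"] into "--with numpy --with scipy"."""
    for pkg in packages:
        if not _SAFE_TOKEN.fullmatch(pkg):
            raise ValueError(f"unsafe package token: {pkg!r}")
    return " ".join(f"--with {pkg}" for pkg in packages)


default_args = build_extra_uv_args(["numpy", "scipy", "sympy"])
print(default_args)

# An empty list yields an empty string, i.e. nothing extra is installed.
no_args = build_extra_uv_args([])
```

Rejecting tokens such as `numpy>=1.26` before they reach the install script is one way to fail early instead of letting the shell reinterpret them.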
How scoring works
`MathTaskSet.get_instruction` appends "Write your final answer to `/app/answer.txt`." to each prompt. After the rollout, `RemoteHybridMathRubric` reads that file from the sandbox and runs `math_verify` against the gold answer. On a miss it falls back to an LLM judge. Reward is 1.0 on match, else 0.0.
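
The verify-then-judge flow described above can be sketched as a small pure-Python function. This is not `RemoteHybridMathRubric`'s API: `score_answer` is a hypothetical name, and `verify` and `judge` are stand-in callables for the `math_verify` check and the LLM judge. Scoring a missing or empty answer file as 0.0 without consulting the judge is also an assumption.

```python
from typing import Callable, Optional

Checker = Callable[[str, str], bool]


def score_answer(
    answer_text: Optional[str],
    gold: str,
    verify: Checker,
    judge: Optional[Checker] = None,
) -> float:
    """Return 1.0 if the answer matches the gold answer, else 0.0."""
    # Assumption: a missing or empty answer file scores 0.0 outright.
    if answer_text is None or not answer_text.strip():
        return 0.0
    answer = answer_text.strip()
    # First pass: the symbolic check (stand-in for math_verify).
    if verify(answer, gold):
        return 1.0
    # Fallback: the judge is consulted only when the first pass misses.
    if judge is not None and judge(answer, gold):
        return 1.0
    return 0.0


# Stand-in checkers for illustration only.
def exact(answer: str, gold: str) -> bool:
    return answer == gold


def lenient(answer: str, gold: str) -> bool:
    return answer.lower() == gold.lower()


print(score_answer("42\n", "42", exact))
print(score_answer("NaCl", "nacl", exact, judge=lenient))
```

Passing `judge=None` corresponds to disabling the fallback, which is roughly what `use_judge_fallback=False` is described as doing.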
Changelog
v0.1.1
- Default `sandbox_client_max_workers` to `None` so the shared sandbox client uses the verifiers default worker cap unless callers explicitly override it.
v0.1.0
- Initial release.