rlm-swe

RLM agent solving SWE tasks inside Prime Sandboxes via ComposableEnv.

Overview

Environment ID: rlm_swe
Agent: RLM — minimalistic CLI agent with builtin ipython, plus the locally shipped edit skill. Context auto-compacts at the threshold set by summarize_at_tokens.
TaskSet: R2E-Gym (default), SWE-bench, Multi-SWE, OpenSWE via task_type arg
Scoring: Test-based evaluation via the SWE taskset's rubric

Quickstart

# From research-environments root
uv pip install -e ./environments/rlm_swe

# Single debug rollout (requires GH_TOKEN when the host must fill the local RLM cache)
GH_TOKEN=... uv run vf-eval rlm-swe -a '{"task_type":"r2e"}' -d -v -n1 -r1

Environment Arguments

Argument	Default	Description
`task_type`	`"r2e"`	SWE backend: `r2e`, `swebench`, `multiswe`, `openswe`
`dataset_name`	(taskset default)	Override dataset name
`filter_repos`	None	Filter to specific repos
`filter_fn`	None	Custom filter function forwarded to the upstream SWE taskset dataset loader
`ds_keep_in_memory`	None	Forwarded to the upstream SWE taskset dataset loader
`ds_num_proc`	None	Forwarded to the upstream SWE taskset dataset loader
`gh_token`	`$GH_TOKEN`	GitHub token for private rlm repo, used only on the host to fill the local cache when needed
`**kwargs`	—	Forwarded as-is to `rlm_harness`. Includes `rlm_max_turns`, `rlm_exec_timeout`, `summarize_at_tokens`, `rlm_ref`, `rlm_repo_url`, `local_checkout`, `rlm_tools`, `append_to_system_prompt`, `allow_git`. See the harness docstring for defaults and meanings. Note: `rlm_local_checkout` was renamed to `local_checkout` to match the harness kwarg
`max_turns`	200	Max interception server turns
`timeout_seconds`	5400	Sandbox timeout (90min)
`poll_interval`	1.0	Seconds between `CliAgentEnv` intercept-queue polls / liveness checks
`sandbox_cpu_cores`	4	CPU cores per sandbox
`sandbox_memory_gb`	4	Memory per sandbox
`sandbox_disk_size_gb`	2	Disk per sandbox
`sandbox_client_max_workers`	None	Max worker threads in the shared sandbox client
`labels`	`["rlm-swe"]`	Sandbox labels attached to created rollouts
`behavior_judge_model`	`null`	Enables behavior-only reward shaping when set. The judge runs on every rollout; behavior reward only contributes to `final_reward` when `task_reward == 1.0`.
`behavior_judge_base_url`	`https://api.pinference.ai/api/v1`	Behavior judge API base URL.
`behavior_judge_api_key_var`	`PRIME_API_KEY`	Env var that holds the behavior judge API key.
`behavior_judge_sampling_args`	`null`	Extra sampling args forwarded to the behavior judge request. Defaults to `response_format={"type":"json_object"}` and `max_tokens=4096` via `setdefault`; user-supplied values win.
`behavior_reward_alpha`	`1.0`	Weight on behavior reward; `final_reward = task_reward + alpha * behavior_reward` on solved rollouts, `final_reward = task_reward` otherwise.
`behavior_judge_max_retries`	`3`	Max judge calls per rollout. Retries on empty / non-JSON / truncated replies; on exhaustion, behavior reward zeros (task reward is unaffected).

Behavior reward shaping

Set behavior_judge_model to opt in to behavior-only reward shaping on top of the SWE taskset's task reward. When enabled:

Every rollout is judged by behavior_judge_model against eleven SWE-tailored behaviors (eight harness behaviors from general-agent plus python_first_tool_use, venv_discovery — project-toolchain discovery, language-agnostic — and submission_reflection). verification_and_audit is extended with SWE-specific cues (minimal repro, targeted + broader test runs, explicit output inspection).
task_reward = base_rubric_reward (e.g. solved from SWEBenchRubric).
behavior_reward = mean(judge_score over applicable behaviors) is logged un-gated so unsolved attempts still surface judge feedback.
final_reward = task_reward + behavior_reward_alpha * behavior_reward when task_reward == 1.0; otherwise final_reward = task_reward.
Each behavior result (applicable, score, evidence) plus a top-level summary is persisted to rollout state.
append_to_system_prompt defaults to the bundled prompts/behavior.md guidance when the judge is enabled; pass a literal string or a path to override.

Changelog

v0.4.2

Render the judge user prompt as a plaintext [role]\n<content> conversation built from state["prompt"] + state["completion"] instead of dumping the raw trajectory JSON. Tool calls render as [tool_call: <name>]\n<arguments>. Reasoning fields (reasoning_content, thinking_blocks) are omitted by construction — behavior is judged on the agent's observable actions, not its private chain-of-thought, and this also keeps the 60k-char budget from being eaten by verbose reasoning traces on reasoning-capable models.

v0.4.1

Persist the behavior judge prompt to rollout state under behavior_judge_prompt ({"system", "user"}). Useful for inspecting exactly what the judge sees — e.g. confirming whether agent reasoning_content makes it into the judged trajectory. Save it with vf-eval -C behavior_judge_prompt.

v0.4.0

Add behavior-only reward shaping for solved rollouts. Set behavior_judge_model to enable; the judge scores eleven SWE-tailored behaviors (eight harness behaviors plus python_first_tool_use, venv_discovery, and submission_reflection). The existing verification_and_audit behavior is extended with SWE-specific cues (minimal repro, targeted + broader test runs).
Ship prompts/behavior.md as the default append_to_system_prompt when the judge is enabled.
Always ship prompts/venv_hint.md as the default append_to_system_prompt when the judge is not enabled, restoring the venv guidance that rlm-harness PR #78 removed from the harness default system prompt.
Resolve append_to_system_prompt as a file path when a non-multiline string points to an existing file; otherwise forward verbatim.
New args: behavior_judge_model, behavior_judge_base_url, behavior_judge_api_key_var, behavior_judge_sampling_args, behavior_reward_alpha. All other defaults unchanged.

v0.3.4

Default sandbox_client_max_workers to None so the shared sandbox client uses the verifiers default worker cap unless callers explicitly override it.

v0.3.3

Add filter_fn parameter to load_environment(), forwarded to the upstream SWE taskset so callers can supply a custom dataset filter function.

v0.3.2

Declare multi-swe-bench>=1.1.2 as a direct dep. MultiSWERubric calls multi_swe_bench.harness.report.generate_report to score task_type="multiswe" rollouts; without it the rubric raises ModuleNotFoundError and silently zeros every reward (verified during a gpt-5.4 vf-eval run).

v0.3.1

Declare swebench==4.1.0 as a direct dep — needed when task_type="swebench" routes through verifiers' composable swe_bench taskset (which imports swebench at module top level without declaring it).

v0.3.0

Stop enumerating RLM kwargs on load_environment; everything except gh_token now flows through **kwargs directly to rlm_harness. Removes per-env drift whenever the harness kwarg surface changes. Rename: rlm_local_checkout → local_checkout (match harness kwarg name). No runtime default changes; new defaults come from the harness.
Drop RLM_MAX_TURNS, RLM_MAX_TURNS_IN_CONTEXT, RLM_EXEC_TIMEOUT from the env's environment_vars dict — the harness now owns these via Harness.environment_vars and merges them into the sandbox.
Require verifiers>=0.1.13.dev5.

v0.2.9

Re-add rlm_tools argument (previously removed in v0.2.7 as a no-op). It now fans out through rlm_harness to both Harness.tool_names (drives ToolMonitorRubric) and the sandbox's RLM_TOOLS env var. Defaults to ["ipython", "summarize"]; also available: bash, edit.

v0.2.8

Replace rlm_branch with rlm_ref (branch, tag, or full commit SHA) and make the default host cache commit-keyed.
Clarify that rlm_ref still uses the auto-materialized host cache, while rlm_local_checkout is now an existing-checkout override that bypasses the cache.

v0.2.7

Remove the unused rlm_tools argument and stop exporting the dead RLM_TOOLS / RLM_SYSTEM_PROMPT_VERBOSITY environment variables.
Require verifiers>=0.1.13.dev3.
Refresh the README argument table to match the current load_environment() signature.

v0.2.6

Add rlm_local_checkout as the host-side RLM checkout path override.
Cache the RLM checkout on the host and upload it into each sandbox, reducing direct clone pressure on the private repo during large runs.

v0.2.5

Bump verifiers to >=0.1.13.dev1.

v0.2.4

Add rlm_exec_timeout parameter (default 300s); forwarded as RLM_EXEC_TIMEOUT to the sandbox, capping per-tool execution time inside the RLM agent.
Unify timeout knob: timeout_seconds now drives both the rollout deadline and the sandbox container lifetime (sandbox_timeout_minutes is derived via math.ceil), preventing sandbox teardown before the agent reaches its deadline.
Expose poll_interval kwarg; forwarded to ComposableEnv / CliAgentEnv to tune the intercept-queue poll cadence.

v0.2.3

Ship the edit skill with this environment (under rlm_swe/skills/edit/), so the rlm harness no longer needs to bundle it; auto-uploaded to the sandbox via ComposableEnv's skills-upload mechanism

v0.2.2

Simplify to use ComposableEnv directly; metrics and GH_TOKEN handling are now driven by upstream harness configuration
Surface all rlm_-prefixed session metrics instead of a fixed whitelist

v0.2.1

Add rlm_repo_url and rlm_branch so rlm-swe can install and run RLM from a selected GitHub repo and branch

v0.1.3

Add rlm_max_turns_in_context to cap retained assistant turns in live context
Add append_to_system_prompt to append environment-specific instructions to the default RLM system prompt

v0.1.2

Extract rlm session metrics from meta.json after each rollout and surface as top-level state keys (rlm_turns, rlm_stop_reason, rlm_prompt_tokens, rlm_completion_tokens, rlm_prompt_tokens_per_turn, rlm_completion_tokens_per_turn, etc.)

v0.1.1

Scope gh_token / GH_TOKEN to the RLM install step only, without exporting it as a sandbox runtime environment variable

v0.1.0

Initial release