0

RLM SWE RL Env (Community)

Fresh

RLM agent on SWE tasks (R2E-Gym, SWE-bench).

Type
RL Env
License
apache-2.0
Published
Apr 2026

Cite

Notes

Only stored in your browser.

rlm-swe

RLM agent solving SWE tasks inside Prime Sandboxes via ComposableEnv.

Overview

  • Environment ID: rlm_swe
  • Agent: RLM — minimalistic CLI agent with builtin ipython, plus the locally shipped edit skill. Context auto-compacts at the threshold set by summarize_at_tokens.
  • TaskSet: R2E-Gym (default), SWE-bench, Multi-SWE, OpenSWE via task_type arg
  • Scoring: Test-based evaluation via the SWE taskset's rubric

Quickstart

# From research-environments root
uv pip install -e ./environments/rlm_swe

# Single debug rollout (requires GH_TOKEN when the host must fill the local RLM cache)
GH_TOKEN=... uv run vf-eval rlm-swe -a '{"task_type":"r2e"}' -d -v -n1 -r1

Environment Arguments

ArgumentDefaultDescription
task_type"r2e"SWE backend: r2e, swebench, multiswe, openswe
dataset_name(taskset default)Override dataset name
filter_reposNoneFilter to specific repos
filter_fnNoneCustom filter function forwarded to the upstream SWE taskset dataset loader
ds_keep_in_memoryNoneForwarded to the upstream SWE taskset dataset loader
ds_num_procNoneForwarded to the upstream SWE taskset dataset loader
gh_token$GH_TOKENGitHub token for private rlm repo, used only on the host to fill the local cache when needed
**kwargsForwarded as-is to rlm_harness. Includes rlm_max_turns, rlm_exec_timeout, summarize_at_tokens, rlm_ref, rlm_repo_url, local_checkout, rlm_tools, append_to_system_prompt, allow_git. See the harness docstring for defaults and meanings. Note: rlm_local_checkout was renamed to local_checkout to match the harness kwarg
max_turns200Max interception server turns
timeout_seconds5400Sandbox timeout (90min)
poll_interval1.0Seconds between CliAgentEnv intercept-queue polls / liveness checks
sandbox_cpu_cores4CPU cores per sandbox
sandbox_memory_gb4Memory per sandbox
sandbox_disk_size_gb2Disk per sandbox
sandbox_client_max_workersNoneMax worker threads in the shared sandbox client
labels["rlm-swe"]Sandbox labels attached to created rollouts
behavior_judge_modelnullEnables behavior-only reward shaping when set. The judge runs on every rollout; behavior reward only contributes to final_reward when task_reward == 1.0.
behavior_judge_base_urlhttps://api.pinference.ai/api/v1Behavior judge API base URL.
behavior_judge_api_key_varPRIME_API_KEYEnv var that holds the behavior judge API key.
behavior_judge_sampling_argsnullExtra sampling args forwarded to the behavior judge request. Defaults to response_format={"type":"json_object"} and max_tokens=4096 via setdefault; user-supplied values win.
behavior_reward_alpha1.0Weight on behavior reward; final_reward = task_reward + alpha * behavior_reward on solved rollouts, final_reward = task_reward otherwise.
behavior_judge_max_retries3Max judge calls per rollout. Retries on empty / non-JSON / truncated replies; on exhaustion, behavior reward zeros (task reward is unaffected).

Behavior reward shaping

Set behavior_judge_model to opt in to behavior-only reward shaping on top of the SWE taskset's task reward. When enabled:

  • Every rollout is judged by behavior_judge_model against eleven SWE-tailored behaviors (eight harness behaviors from general-agent plus python_first_tool_use, venv_discovery — project-toolchain discovery, language-agnostic — and submission_reflection). verification_and_audit is extended with SWE-specific cues (minimal repro, targeted + broader test runs, explicit output inspection).
  • task_reward = base_rubric_reward (e.g. solved from SWEBenchRubric).
  • behavior_reward = mean(judge_score over applicable behaviors) is logged un-gated so unsolved attempts still surface judge feedback.
  • final_reward = task_reward + behavior_reward_alpha * behavior_reward when task_reward == 1.0; otherwise final_reward = task_reward.
  • Each behavior result (applicable, score, evidence) plus a top-level summary is persisted to rollout state.
  • append_to_system_prompt defaults to the bundled prompts/behavior.md guidance when the judge is enabled; pass a literal string or a path to override.

Changelog

v0.4.2

  • Render the judge user prompt as a plaintext [role]\n<content> conversation built from state["prompt"] + state["completion"] instead of dumping the raw trajectory JSON. Tool calls render as [tool_call: <name>]\n<arguments>. Reasoning fields (reasoning_content, thinking_blocks) are omitted by construction — behavior is judged on the agent's observable actions, not its private chain-of-thought, and this also keeps the 60k-char budget from being eaten by verbose reasoning traces on reasoning-capable models.

v0.4.1

  • Persist the behavior judge prompt to rollout state under behavior_judge_prompt ({"system", "user"}). Useful for inspecting exactly what the judge sees — e.g. confirming whether agent reasoning_content makes it into the judged trajectory. Save it with vf-eval -C behavior_judge_prompt.

v0.4.0

  • Add behavior-only reward shaping for solved rollouts. Set behavior_judge_model to enable; the judge scores eleven SWE-tailored behaviors (eight harness behaviors plus python_first_tool_use, venv_discovery, and submission_reflection). The existing verification_and_audit behavior is extended with SWE-specific cues (minimal repro, targeted + broader test runs).
  • Ship prompts/behavior.md as the default append_to_system_prompt when the judge is enabled.
  • Always ship prompts/venv_hint.md as the default append_to_system_prompt when the judge is not enabled, restoring the venv guidance that rlm-harness PR #78 removed from the harness default system prompt.
  • Resolve append_to_system_prompt as a file path when a non-multiline string points to an existing file; otherwise forward verbatim.
  • New args: behavior_judge_model, behavior_judge_base_url, behavior_judge_api_key_var, behavior_judge_sampling_args, behavior_reward_alpha. All other defaults unchanged.

v0.3.4

  • Default sandbox_client_max_workers to None so the shared sandbox client uses the verifiers default worker cap unless callers explicitly override it.

v0.3.3

  • Add filter_fn parameter to load_environment(), forwarded to the upstream SWE taskset so callers can supply a custom dataset filter function.

v0.3.2

  • Declare multi-swe-bench>=1.1.2 as a direct dep. MultiSWERubric calls multi_swe_bench.harness.report.generate_report to score task_type="multiswe" rollouts; without it the rubric raises ModuleNotFoundError and silently zeros every reward (verified during a gpt-5.4 vf-eval run).

v0.3.1

  • Declare swebench==4.1.0 as a direct dep — needed when task_type="swebench" routes through verifiers' composable swe_bench taskset (which imports swebench at module top level without declaring it).

v0.3.0

  • Stop enumerating RLM kwargs on load_environment; everything except gh_token now flows through **kwargs directly to rlm_harness. Removes per-env drift whenever the harness kwarg surface changes. Rename: rlm_local_checkoutlocal_checkout (match harness kwarg name). No runtime default changes; new defaults come from the harness.
  • Drop RLM_MAX_TURNS, RLM_MAX_TURNS_IN_CONTEXT, RLM_EXEC_TIMEOUT from the env's environment_vars dict — the harness now owns these via Harness.environment_vars and merges them into the sandbox.
  • Require verifiers>=0.1.13.dev5.

v0.2.9

  • Re-add rlm_tools argument (previously removed in v0.2.7 as a no-op). It now fans out through rlm_harness to both Harness.tool_names (drives ToolMonitorRubric) and the sandbox's RLM_TOOLS env var. Defaults to ["ipython", "summarize"]; also available: bash, edit.

v0.2.8

  • Replace rlm_branch with rlm_ref (branch, tag, or full commit SHA) and make the default host cache commit-keyed.
  • Clarify that rlm_ref still uses the auto-materialized host cache, while rlm_local_checkout is now an existing-checkout override that bypasses the cache.

v0.2.7

  • Remove the unused rlm_tools argument and stop exporting the dead RLM_TOOLS / RLM_SYSTEM_PROMPT_VERBOSITY environment variables.
  • Require verifiers>=0.1.13.dev3.
  • Refresh the README argument table to match the current load_environment() signature.

v0.2.6

  • Add rlm_local_checkout as the host-side RLM checkout path override.
  • Cache the RLM checkout on the host and upload it into each sandbox, reducing direct clone pressure on the private repo during large runs.

v0.2.5

  • Bump verifiers to >=0.1.13.dev1.

v0.2.4

  • Add rlm_exec_timeout parameter (default 300s); forwarded as RLM_EXEC_TIMEOUT to the sandbox, capping per-tool execution time inside the RLM agent.
  • Unify timeout knob: timeout_seconds now drives both the rollout deadline and the sandbox container lifetime (sandbox_timeout_minutes is derived via math.ceil), preventing sandbox teardown before the agent reaches its deadline.
  • Expose poll_interval kwarg; forwarded to ComposableEnv / CliAgentEnv to tune the intercept-queue poll cadence.

v0.2.3

  • Ship the edit skill with this environment (under rlm_swe/skills/edit/), so the rlm harness no longer needs to bundle it; auto-uploaded to the sandbox via ComposableEnv's skills-upload mechanism

v0.2.2

  • Simplify to use ComposableEnv directly; metrics and GH_TOKEN handling are now driven by upstream harness configuration
  • Surface all rlm_-prefixed session metrics instead of a fixed whitelist

v0.2.1

  • Add rlm_repo_url and rlm_branch so rlm-swe can install and run RLM from a selected GitHub repo and branch

v0.1.3

  • Add rlm_max_turns_in_context to cap retained assistant turns in live context
  • Add append_to_system_prompt to append environment-specific instructions to the default RLM system prompt

v0.1.2

  • Extract rlm session metrics from meta.json after each rollout and surface as top-level state keys (rlm_turns, rlm_stop_reason, rlm_prompt_tokens, rlm_completion_tokens, rlm_prompt_tokens_per_turn, rlm_completion_tokens_per_turn, etc.)

v0.1.1

  • Scope gh_token / GH_TOKEN to the RLM install step only, without exporting it as a sandbox runtime environment variable

v0.1.0

  • Initial release