rlm-swe
RLM agent solving SWE tasks inside Prime Sandboxes via ComposableEnv.
Overview
- Environment ID:
rlm_swe - Agent: RLM — minimalistic CLI agent with builtin
ipython, plus the locally shippededitskill. Context auto-compacts at the threshold set bysummarize_at_tokens. - TaskSet: R2E-Gym (default), SWE-bench, Multi-SWE, OpenSWE via
task_typearg - Scoring: Test-based evaluation via the SWE taskset's rubric
Quickstart
# From research-environments root
uv pip install -e ./environments/rlm_swe
# Single debug rollout (requires GH_TOKEN when the host must fill the local RLM cache)
GH_TOKEN=... uv run vf-eval rlm-swe -a '{"task_type":"r2e"}' -d -v -n1 -r1
Environment Arguments
| Argument | Default | Description |
|---|---|---|
task_type | "r2e" | SWE backend: r2e, swebench, multiswe, openswe |
dataset_name | (taskset default) | Override dataset name |
filter_repos | None | Filter to specific repos |
filter_fn | None | Custom filter function forwarded to the upstream SWE taskset dataset loader |
ds_keep_in_memory | None | Forwarded to the upstream SWE taskset dataset loader |
ds_num_proc | None | Forwarded to the upstream SWE taskset dataset loader |
gh_token | $GH_TOKEN | GitHub token for private rlm repo, used only on the host to fill the local cache when needed |
**kwargs | — | Forwarded as-is to rlm_harness. Includes rlm_max_turns, rlm_exec_timeout, summarize_at_tokens, rlm_ref, rlm_repo_url, local_checkout, rlm_tools, append_to_system_prompt, allow_git. See the harness docstring for defaults and meanings. Note: rlm_local_checkout was renamed to local_checkout to match the harness kwarg |
max_turns | 200 | Max interception server turns |
timeout_seconds | 5400 | Sandbox timeout (90min) |
poll_interval | 1.0 | Seconds between CliAgentEnv intercept-queue polls / liveness checks |
sandbox_cpu_cores | 4 | CPU cores per sandbox |
sandbox_memory_gb | 4 | Memory per sandbox |
sandbox_disk_size_gb | 2 | Disk per sandbox |
sandbox_client_max_workers | None | Max worker threads in the shared sandbox client |
labels | ["rlm-swe"] | Sandbox labels attached to created rollouts |
behavior_judge_model | null | Enables behavior-only reward shaping when set. The judge runs on every rollout; behavior reward only contributes to final_reward when task_reward == 1.0. |
behavior_judge_base_url | https://api.pinference.ai/api/v1 | Behavior judge API base URL. |
behavior_judge_api_key_var | PRIME_API_KEY | Env var that holds the behavior judge API key. |
behavior_judge_sampling_args | null | Extra sampling args forwarded to the behavior judge request. Defaults to response_format={"type":"json_object"} and max_tokens=4096 via setdefault; user-supplied values win. |
behavior_reward_alpha | 1.0 | Weight on behavior reward; final_reward = task_reward + alpha * behavior_reward on solved rollouts, final_reward = task_reward otherwise. |
behavior_judge_max_retries | 3 | Max judge calls per rollout. Retries on empty / non-JSON / truncated replies; on exhaustion, behavior reward zeros (task reward is unaffected). |
Behavior reward shaping
Set behavior_judge_model to opt in to behavior-only reward shaping on top
of the SWE taskset's task reward. When enabled:
- Every rollout is judged by
behavior_judge_modelagainst eleven SWE-tailored behaviors (eight harness behaviors fromgeneral-agentpluspython_first_tool_use,venv_discovery— project-toolchain discovery, language-agnostic — andsubmission_reflection).verification_and_auditis extended with SWE-specific cues (minimal repro, targeted + broader test runs, explicit output inspection). task_reward = base_rubric_reward(e.g.solvedfromSWEBenchRubric).behavior_reward = mean(judge_score over applicable behaviors)is logged un-gated so unsolved attempts still surface judge feedback.final_reward = task_reward + behavior_reward_alpha * behavior_rewardwhentask_reward == 1.0; otherwisefinal_reward = task_reward.- Each behavior result (
applicable,score,evidence) plus a top-levelsummaryis persisted to rollout state. append_to_system_promptdefaults to the bundledprompts/behavior.mdguidance when the judge is enabled; pass a literal string or a path to override.
Changelog
v0.4.2
- Render the judge user prompt as a plaintext
[role]\n<content>conversation built fromstate["prompt"] + state["completion"]instead of dumping the raw trajectory JSON. Tool calls render as[tool_call: <name>]\n<arguments>. Reasoning fields (reasoning_content,thinking_blocks) are omitted by construction — behavior is judged on the agent's observable actions, not its private chain-of-thought, and this also keeps the 60k-char budget from being eaten by verbose reasoning traces on reasoning-capable models.
v0.4.1
- Persist the behavior judge prompt to rollout state under
behavior_judge_prompt({"system", "user"}). Useful for inspecting exactly what the judge sees — e.g. confirming whether agentreasoning_contentmakes it into the judged trajectory. Save it withvf-eval -C behavior_judge_prompt.
v0.4.0
- Add behavior-only reward shaping for solved rollouts. Set
behavior_judge_modelto enable; the judge scores eleven SWE-tailored behaviors (eight harness behaviors pluspython_first_tool_use,venv_discovery, andsubmission_reflection). The existingverification_and_auditbehavior is extended with SWE-specific cues (minimal repro, targeted + broader test runs). - Ship
prompts/behavior.mdas the defaultappend_to_system_promptwhen the judge is enabled. - Always ship
prompts/venv_hint.mdas the defaultappend_to_system_promptwhen the judge is not enabled, restoring the venv guidance that rlm-harness PR #78 removed from the harness default system prompt. - Resolve
append_to_system_promptas a file path when a non-multiline string points to an existing file; otherwise forward verbatim. - New args:
behavior_judge_model,behavior_judge_base_url,behavior_judge_api_key_var,behavior_judge_sampling_args,behavior_reward_alpha. All other defaults unchanged.
v0.3.4
- Default
sandbox_client_max_workerstoNoneso the shared sandbox client uses the verifiers default worker cap unless callers explicitly override it.
v0.3.3
- Add
filter_fnparameter toload_environment(), forwarded to the upstream SWE taskset so callers can supply a custom dataset filter function.
v0.3.2
- Declare
multi-swe-bench>=1.1.2as a direct dep.MultiSWERubriccallsmulti_swe_bench.harness.report.generate_reportto scoretask_type="multiswe"rollouts; without it the rubric raisesModuleNotFoundErrorand silently zeros every reward (verified during a gpt-5.4 vf-eval run).
v0.3.1
- Declare
swebench==4.1.0as a direct dep — needed whentask_type="swebench"routes throughverifiers' composableswe_benchtaskset (which importsswebenchat module top level without declaring it).
v0.3.0
- Stop enumerating RLM kwargs on
load_environment; everything exceptgh_tokennow flows through**kwargsdirectly torlm_harness. Removes per-env drift whenever the harness kwarg surface changes. Rename:rlm_local_checkout→local_checkout(match harness kwarg name). No runtime default changes; new defaults come from the harness. - Drop
RLM_MAX_TURNS,RLM_MAX_TURNS_IN_CONTEXT,RLM_EXEC_TIMEOUTfrom the env'senvironment_varsdict — the harness now owns these viaHarness.environment_varsand merges them into the sandbox. - Require
verifiers>=0.1.13.dev5.
v0.2.9
- Re-add
rlm_toolsargument (previously removed in v0.2.7 as a no-op). It now fans out throughrlm_harnessto bothHarness.tool_names(drivesToolMonitorRubric) and the sandbox'sRLM_TOOLSenv var. Defaults to["ipython", "summarize"]; also available:bash,edit.
v0.2.8
- Replace
rlm_branchwithrlm_ref(branch, tag, or full commit SHA) and make the default host cache commit-keyed. - Clarify that
rlm_refstill uses the auto-materialized host cache, whilerlm_local_checkoutis now an existing-checkout override that bypasses the cache.
v0.2.7
- Remove the unused
rlm_toolsargument and stop exporting the deadRLM_TOOLS/RLM_SYSTEM_PROMPT_VERBOSITYenvironment variables. - Require
verifiers>=0.1.13.dev3. - Refresh the README argument table to match the current
load_environment()signature.
v0.2.6
- Add
rlm_local_checkoutas the host-side RLM checkout path override. - Cache the RLM checkout on the host and upload it into each sandbox, reducing direct clone pressure on the private repo during large runs.
v0.2.5
- Bump verifiers to
>=0.1.13.dev1.
v0.2.4
- Add
rlm_exec_timeoutparameter (default 300s); forwarded asRLM_EXEC_TIMEOUTto the sandbox, capping per-tool execution time inside the RLM agent. - Unify timeout knob:
timeout_secondsnow drives both the rollout deadline and the sandbox container lifetime (sandbox_timeout_minutesis derived viamath.ceil), preventing sandbox teardown before the agent reaches its deadline. - Expose
poll_intervalkwarg; forwarded toComposableEnv/CliAgentEnvto tune the intercept-queue poll cadence.
v0.2.3
- Ship the
editskill with this environment (underrlm_swe/skills/edit/), so the rlm harness no longer needs to bundle it; auto-uploaded to the sandbox viaComposableEnv's skills-upload mechanism
v0.2.2
- Simplify to use
ComposableEnvdirectly; metrics andGH_TOKENhandling are now driven by upstream harness configuration - Surface all
rlm_-prefixed session metrics instead of a fixed whitelist
v0.2.1
- Add
rlm_repo_urlandrlm_branchsorlm-swecan install and run RLM from a selected GitHub repo and branch
v0.1.3
- Add
rlm_max_turns_in_contextto cap retained assistant turns in live context - Add
append_to_system_promptto append environment-specific instructions to the default RLM system prompt
v0.1.2
- Extract rlm session metrics from
meta.jsonafter each rollout and surface as top-level state keys (rlm_turns,rlm_stop_reason,rlm_prompt_tokens,rlm_completion_tokens,rlm_prompt_tokens_per_turn,rlm_completion_tokens_per_turn, etc.)
v0.1.1
- Scope
gh_token/GH_TOKENto the RLM install step only, without exporting it as a sandbox runtime environment variable
v0.1.0
- Initial release