0

PLUS RLM RL Env (Community)

Fresh

Mini SWE Agent Plus RLM environment for solving SWE issues inside Prime Sandboxes.

Type
RL Env
License
apache-2.0
Published
Feb 2026

Cite

Notes

Only stored in your browser.

mini-swe-agent-plus-rlm

Source Code

mini-swe-agent-plus-rlm environment for solving SWE issues inside prime sandboxes with an RLM harness.

This environment adapts the mini-swe-agent-plus with an RLM REPL harness and optional sub-LLM tool use.

Supported harnesses and datasets:

Overview

  • Environment ID: mini-swe-agent-plus-rlm
  • Short description: RLM environment for solving SWE tasks
  • Tags: coding, multi-turn, sandbox, rlm

Datasets

Task

  • Type: multi-turn, tool use
  • Rubric overview: Reward based on executing repo test-suite
  • Protected files: Modifying test/config files yields a reward of 0 and tests are skipped.

Quickstart

Run an evaluation with default settings:

prime eval run mini-swe-agent-plus-rlm

To run SWE-Bench-Verified

prime eval run mini-swe-agent-plus-rlm -n -1 -r 1 -a '{"dataset_name": "SWE-bench/SWE-bench_Verified", "allow_git": true}'

To run a quicker version of SWE-Bench-Verified (downsampled to 468 examples)

prime eval run mini-swe-agent-plus-rlm -n -1 -r 1 -a '{"dataset_name": "PrimeIntellect/SWE-Bench-Verified-Quick", "allow_git": true}'

Notes:

  • Use -a / --env-args to pass environment-specific configuration as a JSON object.

Environment Arguments

ArgTypeDefaultDescription
dataset_namestr"R2E-Gym/R2E-Gym-Subset"Selects dataset
max_turnsint200Limits max number of agent turns
sandbox_timeout_minutesint600Total sandbox container lifetime in minutes (10h). Must fit rollout + tests + 5 min margin.
code_execution_timeoutint120Per-action timeout in seconds for REPL executions, execute_bash, and edit_via_str_replace
test_timeoutint900Timeout for running tests after the rollout (seconds)
cpu_coresint4Number of CPU cores for the sandbox
memory_gbint4Amount of memory (GB) for the sandbox
disk_size_gbint2Disk size (GB) for the sandbox
sandbox_labelslist[str][]Additional sandbox labels (default mini-swe-agent-plus-rlm is always applied)
max_execution_timeoutsint5Abort rollout after this many command timeouts
max_startup_wait_secondsintNoneOverride infrastructure command timeout (default: max(120, code_execution_timeout))
allow_gitboolFalseAllow git commands in execute_bash tool
filter_reposlist[str]NoneExclude these repos from dataset, e.g. scikit-learn/scikit-learn
tools_on_rootboolFalseMake execute_bash/edit_via_str_replace available as standard tools (direct tool calling alongside the REPL)
tools_in_replboolFalseMake execute_bash/edit_via_str_replace available inside the REPL (callable as functions via HTTP proxy)
tools_on_subboolTrueMake execute_bash/edit_via_str_replace available to sub-LLMs
include_sub_llm_in_trajectoryboolFalseInclude sub-LLM turns in trajectory
sub_modelstrNoneOptional model override for sub-LLMs
repl_languagestr"python"RLM REPL language (python or bash)
rlm_metric_weightsdict[str, float]NoneOverride weights for RLM monitor metrics to use them as training reward signals. See below.
use_dataset_cacheboolFalseUse HuggingFace dataset caching instead of keeping data in memory
custom_instructionsstr""Extra instructions appended to each prompt in a <custom_instructions> block. Empty string adds nothing.

Timeout design

There are only three primary timeout knobs. Everything else is derived:

  • sandbox_timeout_minutes -- Total sandbox lifetime. The platform kills the container after this.
  • code_execution_timeout -- Per-action budget. Applied uniformly to REPL code, bash commands, and edits. The sub-LLM HTTP timeout is auto-derived as code_execution_timeout - 5.
  • test_timeout -- Post-rollout test budget. Independent because tests have different runtime characteristics.

Derived (not user-facing):

  • rollout_timeout_seconds = sandbox_timeout_minutes * 60 - test_timeout - 300
  • sandbox_command_timeout = code_execution_timeout
  • sub_llm_timeout = code_execution_timeout - 5 (set by RLMEnv)

RLM Metric Weights

By default, RLM monitor metrics are tracked with weight 0 (monitor-only). Use rlm_metric_weights to assign nonzero weights so they contribute to the training reward.

Metrics are min-max normalized within each group of rollouts before the weight is applied, so the reward contribution is always in [0, 1] regardless of the metric's natural scale. Best-in-group gets 1.0, worst gets 0.0; when all rollouts have the same value, all get 0.0 (no signal). Use a positive weight to reward higher values (e.g., encourage larger batch sizes) or a negative weight to penalize them (e.g., discourage token usage).

Allowed keys:

KeyDescription
sub_llm_call_countTotal number of individual sub-LLM calls
sub_llm_total_turnsTotal turns across all sub-LLM calls
sub_llm_prompt_tokensPrompt tokens consumed by sub-LLMs
sub_llm_completion_tokensCompletion tokens consumed by sub-LLMs
sub_llm_total_tool_callsTotal tool calls made by sub-LLMs
sub_llm_batch_countNumber of llm_batch invocations
sub_llm_mean_batch_sizeAverage batch size across llm_batch calls

Example (penalize excessive sub-LLM calls, reward batching):

prime eval run mini-swe-agent-plus-rlm -a '{"rlm_metric_weights": {"sub_llm_call_count": -0.01, "sub_llm_batch_count": 0.05}}'

The raw (unnormalized) metrics are still tracked as monitor-only metrics by the RLM environment.

Metrics

MetricMeaning
solvedIf SWE task instance was correctly solved
command_timeout_countNumber of commands that timed out during rollout
rollout_duration_secondsWall-clock duration of the rollout
sandbox_oomSandbox was killed due to out-of-memory
sandbox_timeoutSandbox timed out
sandbox_image_pull_errorFailed to pull sandbox docker image

Changelog

  • 0.1.6: add custom_instructions parameter; replace tool_target enum with three independent booleans (tools_on_root, tools_in_repl, tools_on_sub) matching deepdive_rlm; remove redundant tool instructions from prompt (now handled by RLM scaffolding)
  • 0.1.5: simplify timeouts to 3 primary knobs (sandbox_timeout_minutes, code_execution_timeout, test_timeout); remove redundant sandbox_command_timeout, rollout_timeout_seconds, total_timeout_minutes (now derived); rename max_command_timeoutsmax_execution_timeouts; add max_startup_wait_seconds power-user override. Default changes: sandbox lifetime 360 → 600 min, per-command timeout 90 → 120s (now unified with code_execution_timeout), derived rollout timeout 5400 → 34800s
  • 0.1.4: add use_dataset_cache to opt into HuggingFace disk caching instead of in-memory datasets
  • 0.1.3: align arg names with simplified RLMEnv (max_iterationsmax_turns, remove execution_backend, sandbox_start_command, sandbox_client_max_workers); code_execution_timeout now defaults to 120 instead of falling back to sandbox_command_timeout
  • 0.1.2: sandbox labels no longer force in the default label
  • 0.1.1: add rlm_metric_weights parameter with within-group min-max normalized RLM metrics as training reward signals
  • 0.1.0: port mini-swe-agent-plus v0.2.12 to use the RLM