V 2 RLM RL Env (Community)

MRCR v2 long-context evaluation environment using RLM with Python REPL

Type: RL Env
License: apache-2.0
Published: Mar 2026

mrcrv2-rlm

Overview

  • Environment ID: mrcrv2-rlm
  • Short description: MRCR v2 long-context benchmark using RLM (Recursive Language Model) with Python REPL
  • Tags: long-context, rlm, python, multi-turn, repl

How It Works

This environment implements the MRCR v2 (Multi-Round Coreference Resolution) benchmark using the RLMEnv.

The model is given a long conversation transcript containing multiple User/Assistant exchanges. The transcript includes "needle" texts (relevant items sharing the same format/topic/style) interleaved with filler texts. The model must find and reproduce a specific needle from the conversation, prepended with a 12-character hash.

Scoring uses the official MRCR v2 metric: difflib.SequenceMatcher.ratio() between the predicted and target content (after verifying the hash prefix).
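The scoring rule above can be sketched as follows. This is a minimal illustration rather than the environment's actual code: the function name `mrcr_v2_score` and the `hash_prefix` argument are hypothetical, and returning 0.0 when the prefix is missing is an assumption about what "verifying the hash prefix" means.

```python
from difflib import SequenceMatcher


def mrcr_v2_score(prediction: str, target: str, hash_prefix: str) -> float:
    """Score a prediction against the target needle (illustrative sketch).

    Assumption: a prediction that does not start with the hash prefix
    scores 0.0; otherwise the prefix is stripped from both strings and
    the remaining content is compared with SequenceMatcher.ratio().
    """
    if not prediction.startswith(hash_prefix):
        return 0.0
    predicted_content = prediction[len(hash_prefix):]
    # Strip the prefix from the target only if it is actually present.
    if target.startswith(hash_prefix):
        target_content = target[len(hash_prefix):]
    else:
        target_content = target
    return SequenceMatcher(None, predicted_content, target_content).ratio()


# Hypothetical 12-character hash and needle text, for illustration only.
prefix = "a1b2c3d4e5f6"
needle = "A short poem about autumn leaves."
print(mrcr_v2_score(prefix + needle, prefix + needle, prefix))  # identical content
print(mrcr_v2_score(needle, prefix + needle, prefix))           # missing prefix
```

Because the ratio is a similarity in [0, 1], a near-miss reproduction still earns partial credit, while a response without the hash earns none under this assumption.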

By default, this benchmark uses the 512k-1m context range (up to 1M tokens) and 8 needles.

Dataset

Data is downloaded from Google Cloud Storage via download.sh. The files are CSVs with columns including queries, answer, context_len, answer_token_count, view_ops, and num_relevant.

When running the env from source, auto-download runs if no CSVs are present. The installed package (e.g. via pip install) bundles no data: set data_dir to a directory where you have run download.sh, otherwise the env loads with 0 examples.
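To make the "0 examples" behaviour concrete, here is a rough sketch of reading the downloaded CSVs with the standard library. The helper name `load_rows` is hypothetical, and the sketch assumes every `*.csv` file in `data_dir` has the queries, answer, and context_len columns, with context_len stored as an integer; the environment's real loader may differ.

```python
import csv
from pathlib import Path

# Long-context rows can exceed the csv module's default field size limit.
csv.field_size_limit(2**31 - 1)


def load_rows(data_dir, max_examples=None):
    """Read rows from every CSV in data_dir (hypothetical helper).

    Returns an empty list when the directory holds no CSV files, which
    mirrors the "0 examples" case described above.
    """
    rows = []
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                rows.append(
                    {
                        "queries": row["queries"],
                        "answer": row["answer"],
                        "context_len": int(row["context_len"]),
                    }
                )
                # Stop early once enough rows have been read.
                if max_examples is not None and len(rows) >= max_examples:
                    return rows
    return rows
```

Checking `len(load_rows(data_dir))` before launching an evaluation is a cheap way to catch a data_dir that was never populated by download.sh.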

# Download small (<=128K) 2-needle datasets
./download.sh -n 2 -s

# Download all sizes and needle counts
./download.sh -n 2,4,8 -s -m -l

Quickstart

# Basic evaluation (1 sample, 4k-8k context)
prime eval run mrcrv2-rlm -n 1 -r 1 -m openai/gpt-5-mini \
  -a '{"max_examples": 1, "context_range": "4k-8k"}'

# Default: 8-needle, 512k-1m context (auto-downloads if needed)
prime eval run mrcrv2-rlm -m gpt-5-mini -n 5

# 4-needle, 32k-64k context
prime eval run mrcrv2-rlm -m gpt-5-mini -n 5 -a '{"needle_count": 4, "context_range": "32k-64k"}'
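When launching runs from a script, the JSON passed to -a is easier to build programmatically than to quote by hand. The sketch below only assembles the command line from the flags shown above (-m, -n, -a); the helper name `build_eval_command` is hypothetical, and nothing here calls the CLI.

```python
import json
import shlex


def build_eval_command(model, num_examples, env_args):
    """Assemble the argv list for a `prime eval run` call (hypothetical helper)."""
    return [
        "prime", "eval", "run", "mrcrv2-rlm",
        "-m", model,
        "-n", str(num_examples),
        "-a", json.dumps(env_args),  # environment arguments as one JSON string
    ]


command = build_eval_command(
    "gpt-5-mini", 5, {"needle_count": 4, "context_range": "32k-64k"}
)
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(command))
```

The resulting list can also be handed directly to `subprocess.run(command)`, which avoids shell quoting altogether.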

Environment Arguments

Arg                       Type                       Default             Description
needle_count              int                        8                   Number of needles: 2, 4, or 8
context_range             str                        "512k-1m"           Context length range (see below)
data_dir                  str | None                 None                Directory containing CSVs (defaults to mrcr_v2/ next to script)
auto_download             bool                       True                If True and no CSVs in data_dir, run download.sh (8 needles, up to 1M)
shuffle                   bool                       False               Whether to shuffle the dataset
seed                      int | None                 None                Random seed for shuffling
max_examples              int | None                 None                Maximum number of examples to load. With shuffle=True, the full CSV is loaded, shuffled, then truncated so you get a random subset; with shuffle=False, only the first N rows are read.
include_env_tips          bool                       False               Include strategy tips in prompt
prompt_in_context_file    bool                       False               Put both query and context in the context file
repl_language             Literal["bash", "python"]  "bash"              REPL language for the RLM
max_turns                 int                        30                  Maximum REPL iterations
sub_llm_max_turns         int                        5                   Max tool-calling turns for each sub-LLM call
sub_model                 str                        None                Model for sub-LLM calls
max_sub_llm_parallelism   int                        5                   Max concurrent sub-LLM calls
max_output_length         int                        8192                Maximum code execution output length
code_execution_timeout    int                        120                 Timeout in seconds for code execution
abort_on_code_timeout     bool                       False               Abort rollout on code timeout
max_startup_wait_seconds  int                        120                 Max seconds to wait for sandbox startup
pip_install_packages      str                        ""                  Packages to install in sandbox
sandbox_docker_image      str                        "python:3.11-slim"  Docker image for sandbox
sandbox_cpu_cores         int                        1                   CPU cores for sandbox
sandbox_memory_gb         int                        2                   Memory in GB for sandbox
sandbox_disk_size_gb      int                        5                   Disk size in GB for sandbox
sandbox_gpu_count         int                        0                   Number of GPUs for sandbox
sandbox_timeout_minutes   int                        60                  Overall sandbox lifetime in minutes
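The interaction between max_examples, shuffle, and seed is worth spelling out. The sketch below mirrors the described behaviour on a generic iterable of rows; `select_examples` is a hypothetical name, and the use of `random.Random(seed)` is an assumption about how the seed is applied.

```python
import random
from itertools import islice


def select_examples(rows, max_examples=None, shuffle=False, seed=None):
    """Pick examples following the max_examples/shuffle semantics (sketch).

    shuffle=True:  materialise every row, shuffle, then truncate, which
                   gives a random subset but reads the whole dataset.
    shuffle=False: take only the first max_examples rows, so the rest of
                   the input is never consumed.
    """
    if shuffle:
        selected = list(rows)
        random.Random(seed).shuffle(selected)
        return selected if max_examples is None else selected[:max_examples]
    if max_examples is None:
        return list(rows)
    return list(islice(rows, max_examples))


first_three = select_examples(iter(range(1000)), max_examples=3)
random_three = select_examples(range(1000), max_examples=3, shuffle=True, seed=0)
print(first_three)
print(random_three)
```

For the largest context ranges the difference matters: the unshuffled path can stop after N rows, whereas a shuffled subset has to pay for loading the full CSV first.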

Context Range Options

Range       Token Count
4k-8k       4,096 - 8,192
8k-16k      8,192 - 16,384
16k-32k     16,384 - 32,768
32k-64k     32,768 - 65,536
64k-128k    65,536 - 131,072
upto_128k   All of the above combined
128k-256k   131,072 - 262,144
256k-512k   262,144 - 524,288
512k-1m     524,288 - 1,048,576
1m-2m       1,048,576 - 2,097,152
2m-4m       2,097,152 - 4,194,304
4m-8m       4,194,304 - 8,388,608
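The range names follow a regular pattern, with k meaning 1,024 tokens and m meaning 1,048,576 tokens, so the bounds can be derived from the name. The helper below is a hypothetical illustration, not part of the environment; it treats upto_128k as the union 4,096 to 131,072, and whether each bound is inclusive or exclusive is not specified here.

```python
# Multipliers implied by the table: 4k = 4,096 and 1m = 1,048,576.
_UNITS = {"k": 1024, "m": 1024 * 1024}


def _to_tokens(size: str) -> int:
    """Convert a size such as '512k' or '1m' into a token count."""
    return int(size[:-1]) * _UNITS[size[-1]]


def context_range_bounds(name: str) -> tuple[int, int]:
    """Return the (low, high) token bounds for a context_range name (sketch)."""
    if name == "upto_128k":
        # Union of the five ranges from 4k-8k through 64k-128k.
        return (_to_tokens("4k"), _to_tokens("128k"))
    low, high = name.split("-")
    return (_to_tokens(low), _to_tokens(high))


for range_name in ("4k-8k", "512k-1m", "upto_128k"):
    print(range_name, context_range_bounds(range_name))
```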

Metrics

Metric              Meaning
mrcr_v2_reward      Official MRCR v2 metric: SequenceMatcher.ratio() after hash verification (main reward)
exact_match_reward  1.0 if answer exactly matches ground truth

Changelog

  • 0.1.0: Initial release. MRCR v2 benchmark using RLM with Python REPL; official SequenceMatcher metric; configurable needle count and context ranges (defaults: 512k-1m context, 8 needles); data via download.sh.