mrcrv2-rlm
Overview
- Environment ID:
mrcrv2-rlm - Short description: MRCR v2 long-context benchmark using RLM (Recursive Language Model) with Python REPL
- Tags: long-context, rlm, python, multi-turn, repl
How It Works
This environment implements the MRCR v2 (Multi-Round Coreference Resolution) benchmark using the RLMEnv.
The model is given a long conversation transcript containing multiple User/Assistant exchanges. The transcript includes "needle" texts (relevant items sharing the same format/topic/style) interleaved with filler texts. The model must find and reproduce a specific needle from the conversation, prepended with a 12-character hash.
Scoring uses the official MRCR v2 metric: difflib.SequenceMatcher.ratio() between the predicted and target content (after verifying the hash prefix).
By default, this benchmark uses the 1M token context range and 8 needles.
Dataset
Data is downloaded from Google Cloud Storage via download.sh. Files are CSV format with columns: queries, answer, context_len, answer_token_count, view_ops, num_relevant, etc. When using the env from source, auto-download runs if no CSVs are present. When using the installed package (e.g. pip install), no data is bundled—set data_dir to a directory where you have run download.sh, or the env will load with 0 examples.
# Download small (<=128K) 2-needle datasets
./download.sh -n 2 -s
# Download all sizes and needle counts
./download.sh -n 2,4,8 -s -m -l
Quickstart
# Basic evaluation (1 sample, 4k-8k context)
prime eval run mrcrv2-rlm -n 1 -r 1 -m openai/gpt-5-mini \
-a '{"max_examples": 1, "context_range": "4k-8k"}'
# Default: 8-needle, 512k-1m context (auto-downloads if needed)
prime eval run mrcrv2-rlm -m gpt-5-mini -n 5
# 4-needle, 32k-64k context
prime eval run mrcrv2-rlm -m gpt-5-mini -n 5 -a '{"needle_count": 4, "context_range": "32k-64k"}'
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
needle_count | int | 8 | Number of needles: 2, 4, or 8 |
context_range | str | "512k-1m" | Context length range (see below) |
data_dir | str | None | None | Directory containing CSVs (defaults to mrcr_v2/ next to script) |
auto_download | bool | True | If True and no CSVs in data_dir, run download.sh (8 needles, up to 1M) |
shuffle | bool | False | Whether to shuffle the dataset |
seed | int | None | None | Random seed for shuffling |
max_examples | int | None | None | Maximum number of examples to load. With shuffle=True, the full CSV is loaded, shuffled, then truncated so you get a random subset; with shuffle=False, only the first N rows are read. |
include_env_tips | bool | False | Include strategy tips in prompt |
prompt_in_context_file | bool | False | Put both query and context in the context file |
repl_language | Literal["bash", "python"] | "bash" | REPL language for the RLM |
max_turns | int | 30 | Maximum REPL iterations |
sub_llm_max_turns | int | 5 | Max tool-calling turns for each sub-LLM call |
sub_model | str | None | Model for sub-LLM calls |
max_sub_llm_parallelism | int | 5 | Max concurrent sub-LLM calls |
max_output_length | int | 8192 | Maximum code execution output length |
code_execution_timeout | int | 120 | Timeout in seconds for code execution |
abort_on_code_timeout | bool | False | Abort rollout on code timeout |
max_startup_wait_seconds | int | 120 | Max seconds to wait for sandbox startup |
pip_install_packages | str | "" | Packages to install in sandbox |
sandbox_docker_image | str | "python:3.11-slim" | Docker image for sandbox |
sandbox_cpu_cores | int | 1 | CPU cores for sandbox |
sandbox_memory_gb | int | 2 | Memory in GB for sandbox |
sandbox_disk_size_gb | int | 5 | Disk size in GB for sandbox |
sandbox_gpu_count | int | 0 | Number of GPUs for sandbox |
sandbox_timeout_minutes | int | 60 | Overall sandbox lifetime in minutes |
Context Range Options
| Range | Token Count |
|---|---|
4k-8k | 4,096 - 8,192 |
8k-16k | 8,192 - 16,384 |
16k-32k | 16,384 - 32,768 |
32k-64k | 32,768 - 65,536 |
64k-128k | 65,536 - 131,072 |
upto_128k | All of the above combined |
128k-256k | 131,072 - 262,144 |
256k-512k | 262,144 - 524,288 |
512k-1m | 524,288 - 1,048,576 |
1m-2m | 1,048,576 - 2,097,152 |
2m-4m | 2,097,152 - 4,194,304 |
4m-8m | 4,194,304 - 8,388,608 |
Metrics
| Metric | Meaning |
|---|---|
mrcr_v2_reward | Official MRCR v2 metric: SequenceMatcher.ratio() after hash verification (main reward) |
exact_match_reward | 1.0 if answer exactly matches ground truth |
Changelog
- 0.1.0: Initial release. MRCR v2 benchmark using RLM with Python REPL; official SequenceMatcher metric; configurable needle count and context ranges, default is 1M, 8 needles; data via download.sh.