mrcrv2-rlm

Overview

Environment ID: mrcrv2-rlm
Short description: MRCR v2 long-context benchmark using RLM (Recursive Language Model) with a sandboxed REPL (bash by default, python optional)
Tags: long-context, rlm, python, multi-turn, repl

How It Works

This environment implements the MRCR v2 (Multi-Round Coreference Resolution) benchmark using the RLMEnv.

The model is given a long conversation transcript containing multiple User/Assistant exchanges. The transcript includes "needle" texts (relevant items sharing the same format, topic, and style) interleaved with filler texts. The model must find a specific needle in the conversation and reproduce it, prefixed with a 12-character hash.

Scoring uses the official MRCR v2 metric: difflib.SequenceMatcher.ratio() between the predicted and target content (after verifying the hash prefix).
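The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the environment's actual code: the function name mirrors the metric name used in this README, while the `hash_prefix` parameter and the choice to return 0.0 when the prefix is missing are assumptions.

```python
from difflib import SequenceMatcher


def mrcr_v2_reward(prediction: str, target: str, hash_prefix: str) -> float:
    """Similarity of prediction and target after checking the hash prefix.

    Assumption: a prediction that does not start with the hash scores 0.0.
    """
    if not prediction.startswith(hash_prefix):
        return 0.0
    # Compare only the content that follows the hash on both sides.
    predicted_content = prediction.removeprefix(hash_prefix)
    target_content = target.removeprefix(hash_prefix)
    return SequenceMatcher(None, predicted_content, target_content).ratio()
```

Because `ratio()` is a similarity in [0, 1], a near-miss reproduction still earns partial credit, whereas a response without the hash earns none under the assumption above.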

By default, this benchmark uses the 512k-1m context range (up to 1M tokens) and 8 needles.

Dataset

Data is downloaded from Google Cloud Storage via download.sh. Files are in CSV format with columns including queries, answer, context_len, answer_token_count, view_ops, and num_relevant.

When using the env from source, auto-download runs if no CSVs are present. The installed package (e.g. via pip install) bundles no data: set data_dir to a directory where you have run download.sh, otherwise the env will load with 0 examples.
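Since a data directory without CSVs results in 0 examples, it can help to inspect the directory before starting an eval. The sketch below is illustrative only: `count_csv_rows` is a hypothetical helper, not part of the environment, and it assumes the CSVs sit directly in the directory and each has a header row.

```python
import csv
from pathlib import Path


def count_csv_rows(data_dir: str) -> dict[str, int]:
    """Return {file name: number of data rows} for each CSV in data_dir."""
    # Long-context fields can exceed the csv module's default size limit.
    csv.field_size_limit(2**31 - 1)
    counts = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            counts[path.name] = sum(1 for _ in csv.DictReader(handle))
    return counts
```

An empty result means no CSVs were found, which is the situation where download.sh needs to be run first.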

# Download small (<=128K) 2-needle datasets
./download.sh -n 2 -s

# Download all sizes and needle counts
./download.sh -n 2,4,8 -s -m -l

Quickstart

# Basic evaluation (1 sample, 4k-8k context)
prime eval run mrcrv2-rlm -n 1 -r 1 -m openai/gpt-5-mini \
  -a '{"max_examples": 1, "context_range": "4k-8k"}'

# Default: 8-needle, 512k-1m context (auto-downloads if needed)
prime eval run mrcrv2-rlm -m gpt-5-mini -n 5

# 4-needle, 32k-64k context
prime eval run mrcrv2-rlm -m gpt-5-mini -n 5 -a '{"needle_count": 4, "context_range": "32k-64k"}'
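The -a flag takes the environment arguments as a JSON object. When runs are scripted, building that string programmatically avoids shell-quoting mistakes. A small sketch that uses only the flags shown above; `build_eval_command` is a hypothetical helper, not part of the prime CLI.

```python
import json
import shlex


def build_eval_command(model: str, n: int, env_args: dict) -> str:
    """Build a shell-safe `prime eval run` command line for mrcrv2-rlm.

    model is passed to -m, n to -n, and env_args to -a as JSON.
    """
    parts = [
        "prime", "eval", "run", "mrcrv2-rlm",
        "-m", model,
        "-n", str(n),
        "-a", json.dumps(env_args),
    ]
    return shlex.join(parts)


print(build_eval_command("gpt-5-mini", 5, {"needle_count": 4, "context_range": "32k-64k"}))
```

With these inputs the printed command matches the 4-needle example above.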

Environment Arguments

Arg	Type	Default	Description
`needle_count`	int	`8`	Number of needles: 2, 4, or 8
`context_range`	str	`"512k-1m"`	Context length range (see below)
`data_dir`	str | None	`None`	Directory containing CSVs (defaults to `mrcr_v2/` next to the script)
`auto_download`	bool	`True`	If True and data_dir contains no CSVs, run download.sh (8 needles, up to 1M tokens)
`shuffle`	bool	`False`	Whether to shuffle the dataset
`seed`	int | None	`None`	Random seed for shuffling
`max_examples`	int | None	`None`	Maximum number of examples to load. With `shuffle=True`, the full CSV is loaded, shuffled, then truncated so you get a random subset; with `shuffle=False`, only the first N rows are read.
`include_env_tips`	bool	`False`	Include strategy tips in prompt
`prompt_in_context_file`	bool	`False`	Put both query and context in the context file
`repl_language`	Literal["bash", "python"]	`"bash"`	REPL language for the RLM
`max_turns`	int	`30`	Maximum REPL iterations
`sub_llm_max_turns`	int	`5`	Max tool-calling turns for each sub-LLM call
`sub_model`	str | None	`None`	Model for sub-LLM calls
`max_sub_llm_parallelism`	int	`5`	Max concurrent sub-LLM calls
`max_output_length`	int	`8192`	Maximum code execution output length
`code_execution_timeout`	int	`120`	Timeout in seconds for code execution
`abort_on_code_timeout`	bool	`False`	Abort rollout on code timeout
`max_startup_wait_seconds`	int	`120`	Max seconds to wait for sandbox startup
`pip_install_packages`	str	`""`	Packages to install in sandbox
`sandbox_docker_image`	str	`"python:3.11-slim"`	Docker image for sandbox
`sandbox_cpu_cores`	int	`1`	CPU cores for sandbox
`sandbox_memory_gb`	int	`2`	Memory in GB for sandbox
`sandbox_disk_size_gb`	int	`5`	Disk size in GB for sandbox
`sandbox_gpu_count`	int	`0`	Number of GPUs for sandbox
`sandbox_timeout_minutes`	int	`60`	Overall sandbox lifetime in minutes
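The interaction between `shuffle`, `seed`, and `max_examples` described in the table can be sketched as follows. This illustrates the documented behavior rather than the environment's actual loader; `select_examples` is a hypothetical name, and plain dicts stand in for CSV rows.

```python
import random
from itertools import islice
from typing import Iterable, Optional


def select_examples(
    rows: Iterable[dict],
    max_examples: Optional[int] = None,
    shuffle: bool = False,
    seed: Optional[int] = None,
) -> list[dict]:
    """Pick examples the way the max_examples row describes."""
    if shuffle:
        # Load everything, shuffle, then truncate: a random subset.
        all_rows = list(rows)
        random.Random(seed).shuffle(all_rows)
        return all_rows[:max_examples]
    # No shuffle: stop reading after the first max_examples rows.
    return list(islice(rows, max_examples))
```

The practical consequence is cost: with `shuffle=False` a small `max_examples` avoids reading the whole file, which matters for the largest context ranges.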

Context Range Options

Range	Token Count
`4k-8k`	4,096 - 8,192
`8k-16k`	8,192 - 16,384
`16k-32k`	16,384 - 32,768
`32k-64k`	32,768 - 65,536
`64k-128k`	65,536 - 131,072
`upto_128k`	All of the above combined
`128k-256k`	131,072 - 262,144
`256k-512k`	262,144 - 524,288
`512k-1m`	524,288 - 1,048,576
`1m-2m`	1,048,576 - 2,097,152
`2m-4m`	2,097,152 - 4,194,304
`4m-8m`	4,194,304 - 8,388,608
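Each range label corresponds to a power-of-two token interval. The sketch below looks up the label for a given token count. `context_range_for` is a hypothetical helper, and treating the lower bound as inclusive and the upper bound as exclusive is an assumption, since the table does not say which range owns a shared boundary such as 8,192.

```python
from typing import Optional

# Bounds copied from the table above. upto_128k is omitted because it is
# a union of the first five ranges rather than an interval of its own.
CONTEXT_RANGES = {
    "4k-8k": (4_096, 8_192),
    "8k-16k": (8_192, 16_384),
    "16k-32k": (16_384, 32_768),
    "32k-64k": (32_768, 65_536),
    "64k-128k": (65_536, 131_072),
    "128k-256k": (131_072, 262_144),
    "256k-512k": (262_144, 524_288),
    "512k-1m": (524_288, 1_048_576),
    "1m-2m": (1_048_576, 2_097_152),
    "2m-4m": (2_097_152, 4_194_304),
    "4m-8m": (4_194_304, 8_388_608),
}


def context_range_for(token_count: int) -> Optional[str]:
    """Return the range label containing token_count, or None if outside."""
    for label, (low, high) in CONTEXT_RANGES.items():
        # Assumption: half-open interval [low, high).
        if low <= token_count < high:
            return label
    return None
```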

Metrics

Metric	Meaning
`mrcr_v2_reward`	Official MRCR v2 metric: `SequenceMatcher.ratio()` after hash verification (main reward)
`exact_match_reward`	1.0 if answer exactly matches ground truth

Changelog

0.1.0: Initial release. MRCR v2 benchmark using RLM with Python REPL; official SequenceMatcher metric; configurable needle count and context ranges, default is 1M, 8 needles; data via download.sh.