longcot-rlm

Overview

Environment ID: longcot-rlm
Short description: LongCoT long-horizon reasoning benchmark using RLM (Recursive Language Model) with Python REPL
Tags: reasoning, rlm, python, multi-turn, repl, math, logic, chemistry, chess, cs

How It Works

This environment implements the LongCoT benchmark for evaluating long-horizon reasoning, using the RLMEnv.

Each question in LongCoT is self-contained: the prompt embeds the full task (a chess position, logic puzzle, chemistry subproblem chain, CS algorithm trace, or chained-math problem) and instructs the model to return its answer in the form solution = <answer>.

The model receives that prompt in the user message and has access to a Python REPL in a sandbox. It can install / use:

python-chess for chess templates (FEN, SAN move generation, board_fen()).
rdkit for chemistry SMILES templates (canonicalization).
sympy for math templates (symbolic equivalence).

Scoring is delegated to the upstream longcot package's verify() — the exact template-dispatched verifier used by the reference harness. The full problem dict for each question (needed by logic + some chess verifiers) comes from the JSON files bundled inside the longcot package, not from the HF parquet (which omits problem metadata).

Dataset

LongHorizonReasoning/longcot — 2,502 questions across 5 domains × 3 difficulties. Questions ship inside the longcot package, not loaded from HF.
Domains: logic, cs, chemistry, chess, math.
Difficulties: easy, medium, hard.
Templates (dispatched to domain-specific verifiers):
- logic: BlocksWorld, Dungeon, PackagingMinWaste, RandomHanoi, Sokoban, Sudoku, TrapezoidCounting, WizardsTotalStrength
- cs: HM, MFMC, Scheduling, TM, MCM, LLVM, Backprop, DistMem, VLIW, CodeTrace
- chemistry: easy1, easy2, med1–med4, hard1–hard4
- chess: uci_to_fen, piece_combinations, reconstruct_moves, best_3_moves, best_move, knight_path, knight_path_enemy, knight_game, max_rooks, forced_checkmate
- math: linear, dag, dag_first, conditional, backtracking

Quickstart

# GPT-5.2 on longcot-mini (easy split, ~500 questions) — the upstream "mini" benchmark
uv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 500 -r 1 -a '{"include_env_tips": true, "benchmark": "longcot-mini"}'

# GPT-5.2 on the full longcot benchmark (medium + hard, ~2,000 questions)
uv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 2000 -r 1 -a '{"include_env_tips": true, "benchmark": "longcot"}'

# All splits (easy + medium + hard)
uv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 2500 -r 1 -a '{"benchmark": "all"}'

# Just math
uv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 500 -r 1 -a '{"include_env_tips": true, "benchmark": "longcot-mini", "domain": "math"}'

# Chess only, medium+hard
uv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{"domain": "chess", "difficulty": ["medium", "hard"]}'

# A single template
uv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{"template": "BlocksWorld"}'

# With environment tips + shuffling
uv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{"include_env_tips": true, "shuffle": true}'

# Enable Gemini fallback judges (needs GEMINI_API_KEY / GOOGLE_API_KEY)
uv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{"math_enable_fallback": true, "chemistry_enable_fallback": true}'

Environment Arguments

Arg	Type	Default	Description
`benchmark`	`"longcot-mini"` \| `"longcot"` \| `"all"` \| None	`None`	Upstream benchmark alias. `"longcot-mini"` = easy (~500), `"longcot"` = medium + hard (~2,000), `"all"` = every split. Mutually exclusive with `difficulty`.
`domain`	str \| list[str] \| None	`None`	Domain filter: `"logic"`, `"cs"`, `"chemistry"`, `"chess"`, `"math"`, or a list. `None` = all.
`difficulty`	str \| list[str] \| None	`None`	Difficulty filter: `"easy"`, `"medium"`, `"hard"`, or a list. `None` = all. Mutually exclusive with `benchmark`.
`template`	str \| list[str] \| None	`None`	Optional template-name filter (e.g. `"BlocksWorld"`, `"uci_to_fen"`, `"linear"`).
`shuffle`	bool	`False`	Whether to shuffle the dataset.
`seed`	int \| None	`None`	Random seed for shuffling.
`max_examples`	int \| None	`None`	Maximum number of examples (None = all).
`include_env_tips`	bool	`False`	Append strategy tips (wrapped in `<env_tips>`) to the prompt.
`prompt_in_context_file`	bool	`False`	If `True`, stash the prompt inside the RLM context file (`{"query": prompt, "context": ""}`) and leave the user message empty.
`exclude_broken_easy_math_ids`	bool	`True`	Temporary — drops the 21 easy-math question IDs flagged as wrong/impossible in LongHorizonReasoning/longcot#4 so they don't contaminate longcot-mini scoring. Remove once upstream fixes the dataset.
`math_enable_fallback`	bool	`False`	Enable the upstream Gemini fallback judge for math equivalence.
`chemistry_enable_fallback`	bool	`False`	Enable the upstream Gemini fallback SMILES extractor.
`math_numeric_fallback`	bool	`True`	Local numeric-equivalence fallback for math templates. Runs only when the upstream verifier rejects, and accepts component pairs whose 30-digit SymPy evaluation agrees to 1e-12 relative tolerance — catches formatting differences like `1.01^100` ↔ `(101/100)^100` that `sp.simplify` misses because of Float/Rational type mixing.
`math_textual_judge_model`	str \| None	`None`	OpenAI-compatible model ID for a per-component textual-equivalence judge (e.g. `"openai/gpt-5-nano"`). Invoked only for components where both sides are textual (free-form families of solutions, set descriptions) so the numeric / SymPy paths can't decide. `None` disables the judge.
`math_textual_judge_api_key_var`	str	`"PRIME_API_KEY"`	Env var holding the API key for the textual judge.
`math_textual_judge_base_url`	str \| None	`"https://api.pinference.ai/api/v1"`	Base URL for the textual judge.
`repl_language`	Literal["bash", "python"]	`"python"`	REPL language for the RLM.
`max_turns`	int	`30`	Maximum REPL iterations.
`sub_llm_max_turns`	int	`5`	Max tool-calling turns for each sub-LLM call.
`sub_model`	str \| None	`None`	Model for sub-LLM calls (defaults to same as root model).
`max_sub_llm_parallelism`	int	`5`	Max concurrent sub-LLM calls.
`max_output_length`	int	`8192`	Maximum code execution output length.
`code_execution_timeout`	int	`600`	Timeout in seconds for a single REPL call. Also bounds the sandbox-side HTTP timeout on `llm_batch` (upstream uses `code_execution_timeout - 5`). LongCoT uses 600 (vs 120 for other RLM envs) because GPT-5.2 with high reasoning regularly takes 90–300s on hard competition-math sub-problems — at 120 roughly one-fifth of `llm_batch` calls time out.
`abort_on_code_timeout`	bool	`False`	If True, abort rollout on code timeout.
`max_startup_wait_seconds`	int	`120`	Max seconds to wait for sandbox startup.
`pip_install_packages`	str	`"rdkit chess sympy numpy"`	Packages to install in sandbox.
`sandbox_docker_image`	str	`"python:3.11-slim"`	Docker image for sandbox.
`sandbox_cpu_cores`	int	`1`	CPU cores for sandbox.
`sandbox_memory_gb`	int	`2`	Memory in GB for sandbox.
`sandbox_disk_size_gb`	int	`5`	Disk size in GB for sandbox.
`sandbox_gpu_count`	int	`0`	Number of GPUs for sandbox.
`sandbox_timeout_minutes`	int	`60`	Overall sandbox lifetime in minutes.

Metrics

The rubric calls longcot.verify(question, final_answer, options) and emits 1.0 for correct, 0.0 otherwise. Per-template scoring:

Math (linear, dag, dag_first, conditional, backtracking): SymPy-based list equivalence. On upstream rejection, a per-component fallback runs, trying in order:
1. longcot's own SymPy compare (already the upstream behavior).
2. Local numeric equivalence (30-digit precision, 1e-12 relative tolerance) — catches 1.01^100 ↔ (101/100)^100, 1/2 ↔ 0.5, etc., which the upstream rejects because sp.simplify(Float - Rational) returns ~1e-15 rather than exact 0.
3. If math_textual_judge_model is configured, an LLM judge is invoked for textual components (free-form families of solutions, set descriptions) — e.g. gold "All polynomials of the form f(x)=x^m for some m∈ℤ^+ and f(x)=c for some c∈ℤ^+ with ω(c)≤2023^{2023}+1" vs predicted "P(x)=x^k (k∈ℤ_{≥1}) or P(x)=c with c∈ℤ_{>0} and ω(c)≤2023^{2023}+1".
4. Optional upstream Gemini fallback (math_enable_fallback=True) for the whole list.
Chemistry SMILES (easy1, easy2, med3, hard3): RDKit canonicalization match; optional Gemini fallback to extract SMILES from noisy output.
Chemistry list (med1, med2, med4, hard1, hard2, hard4): element-wise equality (int/string/mixed).
Chess: FEN piece-placement equality, SAN token equality, replay-to-final-FEN, or integer equality depending on template.
CS: strict JSON/dict equality, integer equality, or int-list equality.
Logic: full simulation of the puzzle against problem["instance"] with state verification.

The model should include solution = <answer> somewhere in its final answer (verifiers also have fallbacks that scan the whole response when that marker is missing).

Changelog

0.1.1: Default textual math judge requests now use Pinference (https://api.pinference.ai/api/v1) with PRIME_API_KEY.
0.1.0: Initial RLM version using the upstream longcot.verify for template-dispatched scoring; supports domain, difficulty, and template filtering.