longcot-rlm
Overview
- Environment ID:
longcot-rlm - Short description: LongCoT long-horizon reasoning benchmark using RLM (Recursive Language Model) with Python REPL
- Tags: reasoning, rlm, python, multi-turn, repl, math, logic, chemistry, chess, cs
How It Works
This environment implements the LongCoT benchmark for evaluating long-horizon reasoning, using the RLMEnv.
Each question in LongCoT is self-contained: the prompt embeds the full task (a chess position, logic puzzle, chemistry subproblem chain, CS algorithm trace, or chained-math problem) and instructs the model to return its answer in the form solution = <answer>.
The model receives that prompt in the user message and has access to a Python REPL in a sandbox. It can install / use:
python-chessfor chess templates (FEN, SAN move generation,board_fen()).rdkitfor chemistry SMILES templates (canonicalization).sympyfor math templates (symbolic equivalence).
Scoring is delegated to the upstream longcot package's verify() — the exact template-dispatched verifier used by the reference harness. The full problem dict for each question (needed by logic + some chess verifiers) comes from the JSON files bundled inside the longcot package, not from the HF parquet (which omits problem metadata).
Dataset
- LongHorizonReasoning/longcot — 2,502 questions across 5 domains × 3 difficulties. Questions ship inside the
longcotpackage, not loaded from HF. - Domains:
logic,cs,chemistry,chess,math. - Difficulties:
easy,medium,hard. - Templates (dispatched to domain-specific verifiers):
- logic:
BlocksWorld,Dungeon,PackagingMinWaste,RandomHanoi,Sokoban,Sudoku,TrapezoidCounting,WizardsTotalStrength - cs:
HM,MFMC,Scheduling,TM,MCM,LLVM,Backprop,DistMem,VLIW,CodeTrace - chemistry:
easy1,easy2,med1–med4,hard1–hard4 - chess:
uci_to_fen,piece_combinations,reconstruct_moves,best_3_moves,best_move,knight_path,knight_path_enemy,knight_game,max_rooks,forced_checkmate - math:
linear,dag,dag_first,conditional,backtracking
- logic:
Quickstart
# GPT-5.2 on longcot-mini (easy split, ~500 questions) — the upstream "mini" benchmark
uv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 500 -r 1 -a '{"include_env_tips": true, "benchmark": "longcot-mini"}'
# GPT-5.2 on the full longcot benchmark (medium + hard, ~2,000 questions)
uv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 2000 -r 1 -a '{"include_env_tips": true, "benchmark": "longcot"}'
# All splits (easy + medium + hard)
uv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 2500 -r 1 -a '{"benchmark": "all"}'
# Just math
uv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 500 -r 1 -a '{"include_env_tips": true, "benchmark": "longcot-mini", "domain": "math"}'
# Chess only, medium+hard
uv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{"domain": "chess", "difficulty": ["medium", "hard"]}'
# A single template
uv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{"template": "BlocksWorld"}'
# With environment tips + shuffling
uv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{"include_env_tips": true, "shuffle": true}'
# Enable Gemini fallback judges (needs GEMINI_API_KEY / GOOGLE_API_KEY)
uv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{"math_enable_fallback": true, "chemistry_enable_fallback": true}'
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
benchmark | "longcot-mini" | "longcot" | "all" | None | None | Upstream benchmark alias. "longcot-mini" = easy (~500), "longcot" = medium + hard (~2,000), "all" = every split. Mutually exclusive with difficulty. |
domain | str | list[str] | None | None | Domain filter: "logic", "cs", "chemistry", "chess", "math", or a list. None = all. |
difficulty | str | list[str] | None | None | Difficulty filter: "easy", "medium", "hard", or a list. None = all. Mutually exclusive with benchmark. |
template | str | list[str] | None | None | Optional template-name filter (e.g. "BlocksWorld", "uci_to_fen", "linear"). |
shuffle | bool | False | Whether to shuffle the dataset. |
seed | int | None | None | Random seed for shuffling. |
max_examples | int | None | None | Maximum number of examples (None = all). |
include_env_tips | bool | False | Append strategy tips (wrapped in <env_tips>) to the prompt. |
prompt_in_context_file | bool | False | If True, stash the prompt inside the RLM context file ({"query": prompt, "context": ""}) and leave the user message empty. |
exclude_broken_easy_math_ids | bool | True | Temporary — drops the 21 easy-math question IDs flagged as wrong/impossible in LongHorizonReasoning/longcot#4 so they don't contaminate longcot-mini scoring. Remove once upstream fixes the dataset. |
math_enable_fallback | bool | False | Enable the upstream Gemini fallback judge for math equivalence. |
chemistry_enable_fallback | bool | False | Enable the upstream Gemini fallback SMILES extractor. |
math_numeric_fallback | bool | True | Local numeric-equivalence fallback for math templates. Runs only when the upstream verifier rejects, and accepts component pairs whose 30-digit SymPy evaluation agrees to 1e-12 relative tolerance — catches formatting differences like 1.01^100 ↔ (101/100)^100 that sp.simplify misses because of Float/Rational type mixing. |
math_textual_judge_model | str | None | None | OpenAI-compatible model ID for a per-component textual-equivalence judge (e.g. "openai/gpt-5-nano"). Invoked only for components where both sides are textual (free-form families of solutions, set descriptions) so the numeric / SymPy paths can't decide. None disables the judge. |
math_textual_judge_api_key_var | str | "PRIME_API_KEY" | Env var holding the API key for the textual judge. |
math_textual_judge_base_url | str | None | "https://api.pinference.ai/api/v1" | Base URL for the textual judge. |
repl_language | Literal["bash", "python"] | "python" | REPL language for the RLM. |
max_turns | int | 30 | Maximum REPL iterations. |
sub_llm_max_turns | int | 5 | Max tool-calling turns for each sub-LLM call. |
sub_model | str | None | None | Model for sub-LLM calls (defaults to same as root model). |
max_sub_llm_parallelism | int | 5 | Max concurrent sub-LLM calls. |
max_output_length | int | 8192 | Maximum code execution output length. |
code_execution_timeout | int | 600 | Timeout in seconds for a single REPL call. Also bounds the sandbox-side HTTP timeout on llm_batch (upstream uses code_execution_timeout - 5). LongCoT uses 600 (vs 120 for other RLM envs) because GPT-5.2 with high reasoning regularly takes 90–300s on hard competition-math sub-problems — at 120 roughly one-fifth of llm_batch calls time out. |
abort_on_code_timeout | bool | False | If True, abort rollout on code timeout. |
max_startup_wait_seconds | int | 120 | Max seconds to wait for sandbox startup. |
pip_install_packages | str | "rdkit chess sympy numpy" | Packages to install in sandbox. |
sandbox_docker_image | str | "python:3.11-slim" | Docker image for sandbox. |
sandbox_cpu_cores | int | 1 | CPU cores for sandbox. |
sandbox_memory_gb | int | 2 | Memory in GB for sandbox. |
sandbox_disk_size_gb | int | 5 | Disk size in GB for sandbox. |
sandbox_gpu_count | int | 0 | Number of GPUs for sandbox. |
sandbox_timeout_minutes | int | 60 | Overall sandbox lifetime in minutes. |
Metrics
The rubric calls longcot.verify(question, final_answer, options) and emits 1.0 for correct, 0.0 otherwise. Per-template scoring:
- Math (
linear,dag,dag_first,conditional,backtracking): SymPy-based list equivalence. On upstream rejection, a per-component fallback runs, trying in order:- longcot's own SymPy compare (already the upstream behavior).
- Local numeric equivalence (30-digit precision, 1e-12 relative tolerance) — catches
1.01^100↔(101/100)^100,1/2↔0.5, etc., which the upstream rejects becausesp.simplify(Float - Rational)returns ~1e-15 rather than exact 0. - If
math_textual_judge_modelis configured, an LLM judge is invoked for textual components (free-form families of solutions, set descriptions) — e.g. gold"All polynomials of the form f(x)=x^m for some m∈ℤ^+ and f(x)=c for some c∈ℤ^+ with ω(c)≤2023^{2023}+1"vs predicted"P(x)=x^k (k∈ℤ_{≥1}) or P(x)=c with c∈ℤ_{>0} and ω(c)≤2023^{2023}+1". - Optional upstream Gemini fallback (
math_enable_fallback=True) for the whole list.
- Chemistry SMILES (
easy1,easy2,med3,hard3): RDKit canonicalization match; optional Gemini fallback to extract SMILES from noisy output. - Chemistry list (
med1,med2,med4,hard1,hard2,hard4): element-wise equality (int/string/mixed). - Chess: FEN piece-placement equality, SAN token equality, replay-to-final-FEN, or integer equality depending on template.
- CS: strict JSON/dict equality, integer equality, or int-list equality.
- Logic: full simulation of the puzzle against
problem["instance"]with state verification.
The model should include solution = <answer> somewhere in its final answer (verifiers also have fallbacks that scan the whole response when that marker is missing).
Changelog
- 0.1.1: Default textual math judge requests now use Pinference (
https://api.pinference.ai/api/v1) withPRIME_API_KEY. - 0.1.0: Initial RLM version using the upstream
longcot.verifyfor template-dispatched scoring; supportsdomain,difficulty, andtemplatefiltering.