Longcot RLM RL Env (Community)

LongCoT long-horizon reasoning evaluation environment using RLM with Python REPL

Type: RL Env
License: apache-2.0
Published: Apr 2026

longcot-rlm

Overview

  • Environment ID: longcot-rlm
  • Short description: LongCoT long-horizon reasoning benchmark using RLM (Recursive Language Model) with Python REPL
  • Tags: reasoning, rlm, python, multi-turn, repl, math, logic, chemistry, chess, cs

How It Works

This environment implements the LongCoT benchmark for evaluating long-horizon reasoning and is built on RLMEnv.

Each question in LongCoT is self-contained: the prompt embeds the full task (a chess position, logic puzzle, chemistry subproblem chain, CS algorithm trace, or chained-math problem) and instructs the model to return its answer in the form solution = <answer>.

The model receives that prompt in the user message and has access to a Python REPL in a sandbox, where it can install and use:

  • python-chess for chess templates (FEN, SAN move generation, board_fen()).
  • rdkit for chemistry SMILES templates (canonicalization).
  • sympy for math templates (symbolic equivalence).

Scoring is delegated to verify() from the upstream longcot package, the same template-dispatched verifier used by the reference harness. The full problem dict for each question (needed by the logic verifiers and some chess verifiers) comes from the JSON files bundled inside the longcot package, not from the Hugging Face parquet files, which omit problem metadata.
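Template-dispatched scoring can be pictured with a small sketch. It does not reproduce the internals of the longcot package: the registry, the two toy verifiers, the template names, and the `score` helper are all hypothetical, and the question dict is assumed to carry `template` and `answer` keys.

```python
from typing import Any, Callable, Dict, Mapping

# Hypothetical verifier signature: (question dict, model answer) -> correct?
Verifier = Callable[[Mapping[str, Any], str], bool]


def verify_integer(question: Mapping[str, Any], answer: str) -> bool:
    """Toy verifier: integer equality against an assumed 'answer' field."""
    try:
        return int(answer.strip()) == int(question["answer"])
    except (KeyError, TypeError, ValueError):
        return False


def verify_string(question: Mapping[str, Any], answer: str) -> bool:
    """Toy verifier: string equality that ignores all whitespace."""
    expected = str(question.get("answer", ""))
    return "".join(answer.split()) == "".join(expected.split())


# Illustrative registry with made-up template names.
VERIFIERS: Dict[str, Verifier] = {
    "toy_integer": verify_integer,
    "toy_string": verify_string,
}


def score(question: Mapping[str, Any], answer: str) -> float:
    """Return 1.0 when the template's verifier accepts the answer, else 0.0."""
    verifier = VERIFIERS.get(question.get("template", ""))
    if verifier is None:
        return 0.0  # unknown template: nothing can confirm the answer
    return 1.0 if verifier(question, answer) else 0.0
```

Keeping one verifier per template lets each template define its own notion of "equal" while the reward stays a plain 0/1 value.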

Dataset

  • LongHorizonReasoning/longcot: 2,502 questions across 5 domains × 3 difficulties. Questions ship inside the longcot package and are not loaded from Hugging Face.
  • Domains: logic, cs, chemistry, chess, math.
  • Difficulties: easy, medium, hard.
  • Templates (dispatched to domain-specific verifiers):
    • logic: BlocksWorld, Dungeon, PackagingMinWaste, RandomHanoi, Sokoban, Sudoku, TrapezoidCounting, WizardsTotalStrength
    • cs: HM, MFMC, Scheduling, TM, MCM, LLVM, Backprop, DistMem, VLIW, CodeTrace
    • chemistry: easy1, easy2, med1, med2, med3, med4, hard1, hard2, hard3, hard4
    • chess: uci_to_fen, piece_combinations, reconstruct_moves, best_3_moves, best_move, knight_path, knight_path_enemy, knight_game, max_rooks, forced_checkmate
    • math: linear, dag, dag_first, conditional, backtracking
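Selecting a subset by domain, difficulty, or template can be sketched as below. This is an illustration rather than the environment's loader: the record field names (`domain`, `difficulty`, `template`) and the `filter_questions` helper are assumptions.

```python
from typing import Any, Dict, Iterable, List, Optional, Set, Union

Filter = Optional[Union[str, List[str]]]


def _as_set(value: Filter) -> Optional[Set[str]]:
    """Normalize a str / list / None filter; None means 'no restriction'."""
    if value is None:
        return None
    if isinstance(value, str):
        return {value}
    return set(value)


def filter_questions(
    questions: Iterable[Dict[str, Any]],
    domain: Filter = None,
    difficulty: Filter = None,
    template: Filter = None,
) -> List[Dict[str, Any]]:
    """Keep questions that match every filter that was supplied."""
    wanted = {
        "domain": _as_set(domain),
        "difficulty": _as_set(difficulty),
        "template": _as_set(template),
    }
    return [
        q
        for q in questions
        if all(
            allowed is None or q.get(field) in allowed
            for field, allowed in wanted.items()
        )
    ]


sample = [
    {"domain": "chess", "difficulty": "hard", "template": "best_move"},
    {"domain": "logic", "difficulty": "easy", "template": "Sudoku"},
    {"domain": "chess", "difficulty": "easy", "template": "max_rooks"},
]
hard_or_medium_chess = filter_questions(
    sample, domain="chess", difficulty=["medium", "hard"]
)
```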

Quickstart

# GPT-5.2 on longcot-mini (easy split, ~500 questions) — the upstream "mini" benchmark
uv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 500 -r 1 -a '{"include_env_tips": true, "benchmark": "longcot-mini"}'

# GPT-5.2 on the full longcot benchmark (medium + hard, ~2,000 questions)
uv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 2000 -r 1 -a '{"include_env_tips": true, "benchmark": "longcot"}'

# All splits (easy + medium + hard)
uv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 2500 -r 1 -a '{"benchmark": "all"}'

# Just math
uv run vf-eval longcot-rlm -m openai/gpt-5.2 -s -n 500 -r 1 -a '{"include_env_tips": true, "benchmark": "longcot-mini", "domain": "math"}'

# Chess only, medium+hard
uv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{"domain": "chess", "difficulty": ["medium", "hard"]}'

# A single template
uv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{"template": "BlocksWorld"}'

# With environment tips + shuffling
uv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{"include_env_tips": true, "shuffle": true}'

# Enable Gemini fallback judges (needs GEMINI_API_KEY / GOOGLE_API_KEY)
uv run vf-eval longcot-rlm -m gpt-5-mini -n 5 -a '{"math_enable_fallback": true, "chemistry_enable_fallback": true}'
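The -a value in the commands above is a JSON object, which is easy to get wrong when quoting by hand. A minimal sketch for building the same command line from Python follows; `build_eval_command` is a hypothetical helper, and it only uses the flags that appear in the examples above.

```python
import json
import shlex
from typing import Any, Dict, List


def build_eval_command(
    model: str,
    num_examples: int,
    env_args: Dict[str, Any],
    rollouts: int = 1,
    pass_s_flag: bool = False,
) -> List[str]:
    """Assemble a vf-eval invocation as an argument list."""
    cmd = [
        "uv", "run", "vf-eval", "longcot-rlm",
        "-m", model,
        "-n", str(num_examples),
        "-r", str(rollouts),
    ]
    if pass_s_flag:
        cmd.append("-s")  # same -s flag as in the examples above
    cmd += ["-a", json.dumps(env_args)]  # environment arguments as JSON
    return cmd


cmd = build_eval_command(
    "gpt-5-mini", 5, {"domain": "chess", "difficulty": ["medium", "hard"]}
)
print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Passing the list directly to `subprocess.run(cmd)` avoids shell quoting altogether.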

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| benchmark | "longcot-mini" \| "longcot" \| "all" \| None | None | Upstream benchmark alias. "longcot-mini" = easy (~500), "longcot" = medium + hard (~2,000), "all" = every split. Mutually exclusive with difficulty. |
| domain | str \| list[str] \| None | None | Domain filter: "logic", "cs", "chemistry", "chess", "math", or a list. None = all. |
| difficulty | str \| list[str] \| None | None | Difficulty filter: "easy", "medium", "hard", or a list. None = all. Mutually exclusive with benchmark. |
| template | str \| list[str] \| None | None | Optional template-name filter (e.g. "BlocksWorld", "uci_to_fen", "linear"). |
| shuffle | bool | False | Whether to shuffle the dataset. |
| seed | int \| None | None | Random seed for shuffling. |
| max_examples | int \| None | None | Maximum number of examples (None = all). |
| include_env_tips | bool | False | Append strategy tips (wrapped in <env_tips>) to the prompt. |
| prompt_in_context_file | bool | False | If True, stash the prompt inside the RLM context file ({"query": prompt, "context": ""}) and leave the user message empty. |
| exclude_broken_easy_math_ids | bool | True | Temporary: drops the 21 easy-math question IDs flagged as wrong/impossible in LongHorizonReasoning/longcot#4 so they don't contaminate longcot-mini scoring. Remove once upstream fixes the dataset. |
| math_enable_fallback | bool | False | Enable the upstream Gemini fallback judge for math equivalence. |
| chemistry_enable_fallback | bool | False | Enable the upstream Gemini fallback SMILES extractor. |
| math_numeric_fallback | bool | True | Local numeric-equivalence fallback for math templates. Runs only when the upstream verifier rejects, and accepts component pairs whose 30-digit SymPy evaluation agrees to 1e-12 relative tolerance. This catches formatting differences like 1.01^100 vs (101/100)^100 that sp.simplify misses because of Float/Rational type mixing. |
| math_textual_judge_model | str \| None | None | OpenAI-compatible model ID for a per-component textual-equivalence judge (e.g. "openai/gpt-5-nano"). Invoked only for components where both sides are textual (free-form families of solutions, set descriptions) so the numeric / SymPy paths can't decide. None disables the judge. |
| math_textual_judge_api_key_var | str | "PRIME_API_KEY" | Env var holding the API key for the textual judge. |
| math_textual_judge_base_url | str \| None | "https://api.pinference.ai/api/v1" | Base URL for the textual judge. |
| repl_language | Literal["bash", "python"] | "python" | REPL language for the RLM. |
| max_turns | int | 30 | Maximum REPL iterations. |
| sub_llm_max_turns | int | 5 | Max tool-calling turns for each sub-LLM call. |
| sub_model | str \| None | None | Model for sub-LLM calls (defaults to same as root model). |
| max_sub_llm_parallelism | int | 5 | Max concurrent sub-LLM calls. |
| max_output_length | int | 8192 | Maximum code execution output length. |
| code_execution_timeout | int | 600 | Timeout in seconds for a single REPL call. Also bounds the sandbox-side HTTP timeout on llm_batch (upstream uses code_execution_timeout - 5). LongCoT uses 600 (vs 120 for other RLM envs) because GPT-5.2 with high reasoning regularly takes 90–300s on hard competition-math sub-problems; at 120, roughly one-fifth of llm_batch calls time out. |
| abort_on_code_timeout | bool | False | If True, abort rollout on code timeout. |
| max_startup_wait_seconds | int | 120 | Max seconds to wait for sandbox startup. |
| pip_install_packages | str | "rdkit chess sympy numpy" | Packages to install in sandbox. |
| sandbox_docker_image | str | "python:3.11-slim" | Docker image for sandbox. |
| sandbox_cpu_cores | int | 1 | CPU cores for sandbox. |
| sandbox_memory_gb | int | 2 | Memory in GB for sandbox. |
| sandbox_disk_size_gb | int | 5 | Disk size in GB for sandbox. |
| sandbox_gpu_count | int | 0 | Number of GPUs for sandbox. |
| sandbox_timeout_minutes | int | 60 | Overall sandbox lifetime in minutes. |
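The relationship between benchmark and difficulty can be sketched as a small resolver. This is an assumption-laden illustration, not the environment's code: `resolve_difficulties` is a hypothetical helper, and raising ValueError on a conflict is a guess at how "mutually exclusive" is enforced.

```python
from typing import Dict, List, Optional, Union

ALL_DIFFICULTIES: List[str] = ["easy", "medium", "hard"]

# Alias -> difficulties, following the descriptions of the benchmark argument.
BENCHMARK_ALIASES: Dict[str, List[str]] = {
    "longcot-mini": ["easy"],
    "longcot": ["medium", "hard"],
    "all": ALL_DIFFICULTIES,
}


def resolve_difficulties(
    benchmark: Optional[str] = None,
    difficulty: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    """Turn the two arguments into one explicit list of difficulties."""
    if benchmark is not None and difficulty is not None:
        # Assumed behavior: reject the combination outright.
        raise ValueError("benchmark and difficulty are mutually exclusive")
    if benchmark is not None:
        if benchmark not in BENCHMARK_ALIASES:
            raise ValueError(f"unknown benchmark alias: {benchmark!r}")
        return list(BENCHMARK_ALIASES[benchmark])
    if difficulty is None:
        return list(ALL_DIFFICULTIES)  # None = all
    if isinstance(difficulty, str):
        return [difficulty]
    return list(difficulty)
```

For example, `resolve_difficulties(benchmark="longcot")` yields the medium and hard splits, while passing both arguments fails early instead of silently preferring one.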

Metrics

The rubric calls longcot.verify(question, final_answer, options) and emits 1.0 for correct, 0.0 otherwise. Per-template scoring:

  • Math (linear, dag, dag_first, conditional, backtracking): SymPy-based list equivalence. On upstream rejection, a per-component fallback runs, trying in order:
    1. longcot's own SymPy compare (already the upstream behavior).
    2. Local numeric equivalence (30-digit precision, 1e-12 relative tolerance). This catches 1.01^100 vs (101/100)^100, 1/2 vs 0.5, etc., which the upstream rejects because sp.simplify(Float - Rational) returns ~1e-15 rather than exact 0.
    3. If math_textual_judge_model is configured, an LLM judge is invoked for textual components (free-form families of solutions, set descriptions) — e.g. gold "All polynomials of the form f(x)=x^m for some m∈ℤ^+ and f(x)=c for some c∈ℤ^+ with ω(c)≤2023^{2023}+1" vs predicted "P(x)=x^k (k∈ℤ_{≥1}) or P(x)=c with c∈ℤ_{>0} and ω(c)≤2023^{2023}+1".
    4. Optional upstream Gemini fallback (math_enable_fallback=True) for the whole list.
  • Chemistry SMILES (easy1, easy2, med3, hard3): RDKit canonicalization match; optional Gemini fallback to extract SMILES from noisy output.
  • Chemistry list (med1, med2, med4, hard1, hard2, hard4): element-wise equality (int/string/mixed).
  • Chess: FEN piece-placement equality, SAN token equality, replay-to-final-FEN, or integer equality depending on template.
  • CS: strict JSON/dict equality, integer equality, or int-list equality.
  • Logic: full simulation of the puzzle against problem["instance"] with state verification.
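The numeric-equivalence step for math answers can be illustrated without SymPy. The sketch below swaps in the standard-library decimal module for the 30-digit SymPy evaluation, so it shows the idea of a relative-tolerance comparison rather than the environment's actual code; `numerically_equal` is a hypothetical name.

```python
from decimal import Decimal, localcontext


def numerically_equal(
    a: Decimal, b: Decimal, rel_tol: str = "1e-12", digits: int = 30
) -> bool:
    """Compare two values at `digits` significant digits, relative tolerance."""
    with localcontext() as ctx:
        ctx.prec = digits
        scale = max(abs(a), abs(b))
        if scale == 0:
            return True  # both values are exactly zero
        return abs(a - b) <= Decimal(rel_tol) * scale


# The same quantity written as a binary float and as an exact ratio.
as_float = Decimal(1.01 ** 100)  # carries float rounding error
as_ratio = Decimal(101) ** 100 / Decimal(100) ** 100

print(numerically_equal(as_float, as_ratio))
print(numerically_equal(Decimal("2.7048"), as_ratio))
```

A relative tolerance is used rather than an absolute one so that the check behaves the same for very large and very small answers.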

The model should include solution = <answer> somewhere in its final answer (verifiers also have fallbacks that scan the whole response when that marker is missing).
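Pulling the marker out of a response could look like the sketch below. It is not the upstream parser: the regular expression, the choice to prefer the last occurrence, and the `extract_solution` name are assumptions made for illustration.

```python
import re
from typing import Optional

# Capture everything after "solution =" up to the end of that line.
_SOLUTION_RE = re.compile(r"solution\s*=\s*(.+)")


def extract_solution(response: str) -> Optional[str]:
    """Return the text after the last 'solution =' marker, or None."""
    matches = _SOLUTION_RE.findall(response)
    if not matches:
        return None  # a caller could fall back to scanning the whole response
    return matches[-1].strip()
```

Taking the last match means an early draft such as "solution = 3" is overridden by a later, corrected "solution = [1, 2]" in the same response.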

Changelog

  • 0.1.1: Default textual math judge requests now use Pinference (https://api.pinference.ai/api/v1) with PRIME_API_KEY.
  • 0.1.0: Initial RLM version using the upstream longcot.verify for template-dispatched scoring; supports domain, difficulty, and template filtering.