ENV RLM RL Env (Prime Intellect)

Multi-turn math environment using RLM with Python REPL and hybrid verification

  • Type: RL Env
  • Capabilities: Math
  • Tags: Python
  • Runtime: multi-turn
  • License: unknown
  • Size: v0.1.6
  • Published: Dec 2025

math-env-rlm

Overview

  • Environment ID: math-env-rlm
  • Short description: Multi-turn math environment using RLM (Recursive Language Model) with Python REPL and hybrid verification (math_verify + optional LLM judge)
  • Tags: math, rlm, python, multi-turn, repl

Quickstart

Run an evaluation with default settings:

prime eval run math-env-rlm

With LLM judge fallback:

prime eval run math-env-rlm --args judge_model=openai/gpt-4.1-mini

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| dataset_name | str | "PrimeIntellect/INTELLECT-3-RL" | Dataset to load |
| dataset_subset | str | "math" | Dataset subset to load |
| dataset_split | str | "train" | Split to load |
| dataset_shuffle | bool | False | Whether to shuffle the dataset |
| dataset_seed | int | 42 | Seed for shuffling the dataset |
| question_key | str | "question" | Key to use for the question |
| answer_key | str | "answer" | Key to use for the answer |
| info_key | str | "info" | Key to use for the info |
| difficulty_key | str | None | Key to use for the difficulty; "avg@8_qwen3_4b_thinking_2507" or "avg@8_qwen3_4b_instruct_2507" |
| min_avg_reward | float | 0.0 | Minimum average reward in difficulty key |
| max_avg_reward | float | 1.0 | Maximum average reward in difficulty key |
| instruction_prompt | str | See code | Instruction prompt prepended to questions |
| include_env_tips | bool | False | Include tips suggesting Python/sympy usage |
| map_kwargs | dict | {} | Keyword arguments for the map method |
| filter_kwargs | dict | {} | Keyword arguments for the filter method |
| judge_model | str | None | LLM judge model for fallback verification (None = no judge) |
| judge_base_url | str | "https://api.pinference.ai/api/v1" | Base URL for judge API |
| judge_api_key_var | str | "PRIME_API_KEY" | Environment variable for judge API key |
| judge_prompt | str | See code | Prompt template for judge |
| judge_sampling_args | dict | {} | Sampling args for judge model |
| judge_timeout | int | 1200 | HTTP timeout for judge calls |
| judge_connections | int | 8192 | Max HTTP connections for judge |
| math_verify_timeout | int | 5 | Timeout in seconds for math_verify |
| max_turns | int | 30 | Maximum REPL iterations |
| sub_llm_max_turns | int | 5 | Max tool-calling turns for each sub-LLM call |
| sub_model | str | None | Model for sub-LLM calls (defaults to same as root model) |
| max_sub_llm_parallelism | int | 5 | Max concurrent sub-LLM calls |
| max_output_length | int | 8192 | Maximum code execution output length |
| code_execution_timeout | int | 120 | Timeout in seconds for code execution |
| abort_on_code_timeout | bool | False | If True, abort rollout on code timeout; if False, return error to model |
| max_startup_wait_seconds | int | 120 | Max seconds to wait for sandbox worker startup |
| pip_install_packages | str | "numpy sympy scipy" | Packages to install in the REPL sandbox |
| sandbox_docker_image | str | "python:3.11-slim" | Docker image for sandbox |
| sandbox_cpu_cores | int | 1 | CPU cores for sandbox |
| sandbox_memory_gb | int | 2 | Memory in GB for sandbox |
| sandbox_disk_size_gb | int | 5 | Disk size in GB for sandbox |
| sandbox_gpu_count | int | 0 | Number of GPUs for sandbox |
| sandbox_timeout_minutes | int | 60 | Overall sandbox lifetime in minutes |
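
The difficulty arguments (difficulty_key, min_avg_reward, max_avg_reward) can be read as a range filter over a per-example average reward. The sketch below illustrates that reading only; it is not the environment's code. The helper name keep_example, the inclusive bounds, and the assumption that the difficulty value sits at the top level of each row are all hypothetical.

```python
from typing import Any, Dict, Optional


def keep_example(
    row: Dict[str, Any],
    difficulty_key: Optional[str] = None,
    min_avg_reward: float = 0.0,
    max_avg_reward: float = 1.0,
) -> bool:
    """Hypothetical filter: keep a row if its average reward is in range.

    Assumptions (not confirmed by the environment's source):
    - difficulty_key=None disables filtering entirely;
    - both bounds are inclusive;
    - the difficulty value is stored directly on the row.
    """
    if difficulty_key is None:
        return True
    return min_avg_reward <= row[difficulty_key] <= max_avg_reward


# Toy rows standing in for dataset examples.
key = "avg@8_qwen3_4b_thinking_2507"
rows = [
    {"question": "q1", key: 0.125},
    {"question": "q2", key: 0.5},
    {"question": "q3", key: 1.0},
]

# Drop problems the reference model always solves (average reward above 0.9).
kept = [r["question"] for r in rows if keep_example(r, key, 0.1, 0.9)]
print(kept)
```

With the defaults (difficulty_key=None), every row is kept, which matches the table's default of applying no difficulty filter.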

Metrics

| Metric | Meaning |
| --- | --- |
| math_verify_score | 1.0 if rule-based math_verify passes, 0.0 otherwise |
| judge_score | 1.0 if LLM judge passes (only runs if math_verify fails and judge_model is set) |
| correct_answer | 1.0 if either math_verify or judge passes (this is the reward) |
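
The way the three metrics relate can be sketched as a small scoring function. This is a minimal illustration under stated assumptions, not the environment's implementation: score_rollout, exact_match, and lenient_judge are hypothetical names, and the two checkers are simple stand-ins for math_verify and the LLM judge.

```python
from fractions import Fraction
from typing import Callable, Dict, Optional

Checker = Callable[[str, str], bool]


def score_rollout(
    response: str,
    answer: str,
    math_verify_fn: Checker,
    judge_fn: Optional[Checker] = None,
) -> Dict[str, float]:
    """Hypothetical hybrid scoring: rule-based check first, judge as fallback."""
    math_verify_score = 1.0 if math_verify_fn(response, answer) else 0.0

    # The judge runs only when the rule-based check fails and a judge is set.
    judge_score = 0.0
    if math_verify_score == 0.0 and judge_fn is not None:
        judge_score = 1.0 if judge_fn(response, answer) else 0.0

    return {
        "math_verify_score": math_verify_score,
        "judge_score": judge_score,
        # Reward is 1.0 if either check passed.
        "correct_answer": max(math_verify_score, judge_score),
    }


def exact_match(response: str, answer: str) -> bool:
    """Stand-in for math_verify: strict string comparison."""
    return response.strip() == answer.strip()


def lenient_judge(response: str, answer: str) -> bool:
    """Stand-in for the LLM judge: compares the values as exact fractions."""
    return Fraction(response.strip()) == Fraction(answer.strip())


no_judge = score_rollout("0.5", "1/2", exact_match)
with_judge = score_rollout("0.5", "1/2", exact_match, lenient_judge)
print(no_judge["correct_answer"], with_judge["correct_answer"])
```

In this toy case the strict check rejects "0.5" against "1/2", so the reward depends on whether a judge is configured, mirroring the role of judge_model.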

Changelog

  • 0.1.6: Default judge requests now use Pinference (https://api.pinference.ai/api/v1) with PRIME_API_KEY.
  • 0.1.5: align arg names with simplified RLMEnv (max_iterations → max_turns, sub_tool_max_turns → sub_llm_max_turns, sandbox params → sandbox_* prefix)
  • 0.1.4: sandbox labels no longer force-include the default label
  • 0.1.3:
    • add default "math-env-rlm" label to the sandbox_labels no matter what the user passes there in the kwargs
    • dedupe sandbox_labels if passed via the kwargs