0

Clbench RLM RL Env (Community)

Fresh

CL-bench environment using RLM with strict rubric-based LLM judge scoring

Type
RL Env
License
apache-2.0
Published
Mar 2026

Cite

Notes

Only stored in your browser.

CL-bench (RLM version)

Overview

  • Environment ID: clbench-rlm
  • Short description: Minimal CL-bench RLM environment with strict rubric-based LLM-as-judge scoring.
  • Tags: in-context-learning, long-context, eval

Dataset

  • Primary dataset: tencent/CL-bench
  • Source links: HuggingFace, GitHub
  • Notes: The license on the dataset only allows the usage for evaluation, not training.

Quickstart

# Context offloaded to file (default)
# Environment: Python REPL (repl_language: "python"); include_env_tips adds prompt hint for llm_batch() sub-agent use
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -s -n 100 -r 1 -a '{"repl_language": "python", "include_env_tips": true}'

# Full content loaded in model prompt
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -a '{"include_content_in_context": true}'

# Use bash REPL instead of python
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -a '{"repl_language": "bash"}'

# Filter by category (use valid pairs from table below)
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -s -n 200 -r 1 -a '{"context_category": "Rule System Application", "sub_category": "Legal & Regulatory", "repl_language": "python", "include_env_tips": true}'

# Filter by multiple sub-categories within same context
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -a '{"context_category": "Rule System Application", "sub_category": ["Game Mechanics", "Legal & Regulatory"]}'

Context modes

Each CL-bench example contains a system prompt and a long content trajectory (user/assistant turns). The include_content_in_context flag controls how this content is presented to the model:

  • false (default) — Only the system prompt is loaded into the model's prompt. The full content trajectory is written to context.txt in the sandbox working directory. A blurb instructs the model to read the file via the REPL and use sub-agents to analyze it. This tests the model's ability to work with offloaded long-context through tool use.

  • true — The full trajectory (system prompt + all content turns) is loaded directly into the model's prompt. The content is also written to context.txt so the model can search and re-read it via the REPL. A note tells the model to leverage sub-agents and the REPL to verify its answers.

In both modes the content is always available as context.txt in the sandbox.

Environment Arguments

ArgTypeDefaultDescription
judge_modelstr or null"openai/gpt-5.2"Judge model
judge_base_urlstr or null"https://api.pinference.ai/api/v1"OpenAI-compatible base URL (Prime Intellect)
judge_api_key_varstr or nullnullEnv var used for judge API key (defaults to PRIME_API_KEY)
include_content_in_contextboolfalseIf true, load full content trajectory into the model prompt; if false, offload to context.txt (see above)
include_env_tipsboolfalseAppends a small <env_tips> block encouraging llm_batch() sub-agent delegation
context_categorystr or list[str] or nullnullFilter examples by metadata context_category; pass a string or list of strings to match
sub_categorystr or list[str] or nullnullFilter examples by metadata sub_category; pass a string or list of strings to match
repl_languagestr"python"Sandbox REPL language ("python" or "bash")
max_turnsint30Max root-agent turns
sub_llm_max_turnsint5Max turns per sub-agent (llm_batch)
sub_modelstr or nullnullOptional model override for sub-agents
max_sub_llm_parallelismint5Max concurrent sub-agent calls
max_output_lengthint8192Max REPL output length
code_execution_timeoutint120Timeout (s) for REPL execution
abort_on_code_timeoutboolfalseAbort rollout on execution timeout
max_startup_wait_secondsint120Max sandbox startup wait
pip_install_packagesstr""Extra pip packages for sandbox
sandbox_docker_imagestr"python:3.11-slim"Sandbox image
sandbox_cpu_coresint1Sandbox CPU cores
sandbox_memory_gbint2Sandbox memory
sandbox_disk_size_gbint5Sandbox disk
sandbox_gpu_countint0Sandbox GPUs
sandbox_timeout_minutesint60Sandbox lifetime
**kwargsAny-Additional args forwarded to RLMEnv

Valid categories

Only certain context_category / sub_category pairs exist in the dataset. An error is raised if you specify invalid names or a non-existent combination.

context_category (4 values): Domain Knowledge Reasoning, Empirical Discovery & Simulation, Procedural Task Execution, Rule System Application

Valid (context_category, sub_category) pairs:

context_categorysub_category
Domain Knowledge ReasoningFinance, Healthcare, Humanities, Legal Advisory, Lifestyle, Management, Science
Empirical Discovery & SimulationExperimental Data, Observational Data, Simulation Environment
Procedural Task ExecutionInstructional Procedures, Operational Procedures, Workflow Orchestration
Rule System ApplicationGame Mechanics, Legal & Regulatory, Mathematical Formalism, Programming Syntax, Technical Standards

Changelog

  • 0.1.0: Environment created.