CL-bench (RLM version)
Overview
- Environment ID: clbench-rlm
- Short description: Minimal CL-bench RLM environment with strict rubric-based LLM-as-judge scoring.
- Tags: in-context-learning, long-context, eval
Dataset
- Primary dataset: tencent/CL-bench
- Source links: HuggingFace, GitHub
- Notes: The dataset license permits use for evaluation only, not for training.
Quickstart
# Context offloaded to file (default)
# Environment: Python REPL (repl_language: "python"); include_env_tips adds prompt hint for llm_batch() sub-agent use
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -s -n 100 -r 1 -a '{"repl_language": "python", "include_env_tips": true}'
# Full content loaded in model prompt
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -a '{"include_content_in_context": true}'
# Use bash REPL instead of python
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -a '{"repl_language": "bash"}'
# Filter by category (use valid pairs from table below)
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -s -n 200 -r 1 -a '{"context_category": "Rule System Application", "sub_category": "Legal & Regulatory", "repl_language": "python", "include_env_tips": true}'
# Filter by multiple sub-categories within same context
uv run vf-eval clbench-rlm -m openai/gpt-5.2 -a '{"context_category": "Rule System Application", "sub_category": ["Game Mechanics", "Legal & Regulatory"]}'
Context modes
Each CL-bench example contains a system prompt and a long content trajectory (user/assistant turns).
The include_content_in_context flag controls how this content is presented to the model:
- false (default): Only the system prompt is loaded into the model's prompt. The full content trajectory is written to context.txt in the sandbox working directory. A blurb instructs the model to read the file via the REPL and use sub-agents to analyze it. This tests the model's ability to work with offloaded long context through tool use.
- true: The full trajectory (system prompt + all content turns) is loaded directly into the model's prompt. The content is also written to context.txt so the model can search and re-read it via the REPL. A note tells the model to leverage sub-agents and the REPL to verify its answers.
In both modes the content is always available as context.txt in the sandbox.
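Because the content is always on disk, a model working in the Python REPL could locate relevant passages before deciding what to read in full or hand to sub-agents. The following is a minimal sketch, assuming context.txt is plain UTF-8 text in the current working directory; the helper name `find_passages` and its `window` parameter are hypothetical and not part of the environment.

```python
import re
from pathlib import Path


def find_passages(keyword, path="context.txt", window=200):
    """Return snippets of text around each case-insensitive match of keyword.

    Hypothetical helper for illustration; the environment does not provide it.
    """
    if not keyword:
        return []
    text = Path(path).read_text(encoding="utf-8")
    snippets = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        # Keep `window` characters of surrounding text on each side.
        lo = max(0, match.start() - window)
        hi = min(len(text), match.end() + window)
        snippets.append(text[lo:hi])
    return snippets
```

Returning short snippets rather than the whole file keeps REPL output small, which matters given the max_output_length limit on REPL output.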
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| judge_model | str or null | "openai/gpt-5.2" | Judge model |
| judge_base_url | str or null | "https://api.pinference.ai/api/v1" | OpenAI-compatible base URL (Prime Intellect) |
| judge_api_key_var | str or null | null | Env var used for judge API key (defaults to PRIME_API_KEY) |
| include_content_in_context | bool | false | If true, load full content trajectory into the model prompt; if false, offload to context.txt (see above) |
| include_env_tips | bool | false | Appends a small <env_tips> block encouraging llm_batch() sub-agent delegation |
| context_category | str or list[str] or null | null | Filter examples by metadata context_category; pass a string or list of strings to match |
| sub_category | str or list[str] or null | null | Filter examples by metadata sub_category; pass a string or list of strings to match |
| repl_language | str | "python" | Sandbox REPL language ("python" or "bash") |
| max_turns | int | 30 | Max root-agent turns |
| sub_llm_max_turns | int | 5 | Max turns per sub-agent (llm_batch) |
| sub_model | str or null | null | Optional model override for sub-agents |
| max_sub_llm_parallelism | int | 5 | Max concurrent sub-agent calls |
| max_output_length | int | 8192 | Max REPL output length |
| code_execution_timeout | int | 120 | Timeout (s) for REPL execution |
| abort_on_code_timeout | bool | false | Abort rollout on execution timeout |
| max_startup_wait_seconds | int | 120 | Max sandbox startup wait |
| pip_install_packages | str | "" | Extra pip packages for sandbox |
| sandbox_docker_image | str | "python:3.11-slim" | Sandbox image |
| sandbox_cpu_cores | int | 1 | Sandbox CPU cores |
| sandbox_memory_gb | int | 2 | Sandbox memory |
| sandbox_disk_size_gb | int | 5 | Sandbox disk |
| sandbox_gpu_count | int | 0 | Sandbox GPUs |
| sandbox_timeout_minutes | int | 60 | Sandbox lifetime |
| **kwargs | Any | - | Additional args forwarded to RLMEnv |
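The arguments above are passed to vf-eval as a JSON object via -a, as in the Quickstart commands. When scripting several runs, it can help to generate that JSON rather than hand-write it. The sketch below assumes only the -m and -a flags shown in the Quickstart; the function name `build_eval_command` is hypothetical.

```python
import json
import shlex


def build_eval_command(env_args, model="openai/gpt-5.2", env_id="clbench-rlm"):
    """Assemble a vf-eval command line with env_args serialized as JSON.

    Hypothetical helper; it only formats a string and does not run anything.
    """
    # json.dumps turns Python True/False/None into JSON true/false/null.
    args_json = json.dumps(env_args)
    parts = ["uv", "run", "vf-eval", env_id, "-m", model, "-a", shlex.quote(args_json)]
    return " ".join(parts)


if __name__ == "__main__":
    print(build_eval_command({"repl_language": "bash", "max_turns": 20}))
```

Quoting the JSON with shlex.quote keeps values that contain spaces or ampersands, such as "Legal & Regulatory", intact when the command is pasted into a POSIX shell.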
Valid categories
Only certain context_category / sub_category pairs exist in the dataset. An error is raised if you specify invalid names or a non-existent combination.
context_category (4 values): Domain Knowledge Reasoning, Empirical Discovery & Simulation, Procedural Task Execution, Rule System Application
Valid (context_category, sub_category) pairs:
| context_category | sub_category |
|---|---|
| Domain Knowledge Reasoning | Finance, Healthcare, Humanities, Legal Advisory, Lifestyle, Management, Science |
| Empirical Discovery & Simulation | Experimental Data, Observational Data, Simulation Environment |
| Procedural Task Execution | Instructional Procedures, Operational Procedures, Workflow Orchestration |
| Rule System Application | Game Mechanics, Legal & Regulatory, Mathematical Formalism, Programming Syntax, Technical Standards |
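To catch an invalid filter before launching a run, the table above can be encoded as a lookup and checked locally. This is a minimal sketch of such a check, not the environment's own validation logic; the names `VALID_PAIRS` and `check_filters` are hypothetical, and the rule that each sub_category must belong to at least one requested context_category is an assumption about how combinations are matched.

```python
VALID_PAIRS = {
    "Domain Knowledge Reasoning": {
        "Finance", "Healthcare", "Humanities", "Legal Advisory",
        "Lifestyle", "Management", "Science",
    },
    "Empirical Discovery & Simulation": {
        "Experimental Data", "Observational Data", "Simulation Environment",
    },
    "Procedural Task Execution": {
        "Instructional Procedures", "Operational Procedures", "Workflow Orchestration",
    },
    "Rule System Application": {
        "Game Mechanics", "Legal & Regulatory", "Mathematical Formalism",
        "Programming Syntax", "Technical Standards",
    },
}


def _as_list(value):
    """Normalize None, a single string, or a list of strings to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def check_filters(context_category=None, sub_category=None):
    """Raise ValueError if the filters name an unknown category or pair."""
    contexts = _as_list(context_category)
    subs = _as_list(sub_category)
    for ctx in contexts:
        if ctx not in VALID_PAIRS:
            raise ValueError(f"Unknown context_category: {ctx!r}")
    # With no context filter, any known sub_category is acceptable.
    allowed_contexts = contexts or list(VALID_PAIRS)
    allowed_subs = set().union(*(VALID_PAIRS[c] for c in allowed_contexts))
    for sub in subs:
        if sub not in allowed_subs:
            raise ValueError(
                f"sub_category {sub!r} is not valid for context_category {allowed_contexts}"
            )
```

For example, `check_filters("Rule System Application", ["Game Mechanics", "Legal & Regulatory"])` passes silently, while pairing "Rule System Application" with "Finance" raises a ValueError.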
Changelog
- 0.1.0: Environment created.