Clbench RL Env (Community)

CL-bench environment using RLM with strict rubric-based LLM judge scoring

Type: RL Env
License: apache-2.0
Published: Mar 2026

CL-bench

Overview

  • Environment ID: clbench
  • Short description: Minimal CL-bench single-turn environment with strict rubric-based LLM-as-judge scoring.
  • Tags: in-context-learning, long-context, eval

Dataset

  • Primary dataset: tencent/CL-bench
  • Source links: HuggingFace, GitHub
  • Notes: The dataset license permits use for evaluation only, not for training.

Quickstart

Set PRIME_API_KEY. For team accounts, also set PRIME_TEAM_ID or run prime login; the team ID is then read from ~/.prime/config.json.

# Uses Prime Intellect by default (set PRIME_API_KEY or use prime login)
uv run vf-eval clbench -m openai/gpt-5.2 -s -n 100 -r 1

# Filter by category (use valid pairs from table below)
uv run vf-eval clbench -m openai/gpt-5.2 -a '{"context_category": "Rule System Application", "sub_category": "Game Mechanics"}'

# Filter by multiple sub-categories within same context
uv run vf-eval clbench -m openai/gpt-5.2 -a '{"context_category": "Rule System Application", "sub_category": ["Game Mechanics", "Legal & Regulatory"]}'
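
Hand-quoting the JSON passed to -a is easy to get wrong in a shell, especially with values such as "Legal & Regulatory". The following is a minimal Python sketch that assembles the same command with json.dumps and shlex.join. It assumes only the vf-eval flags shown above (-m and -a); the helper name build_eval_command is hypothetical and not part of the environment.

```python
import json
import shlex


def build_eval_command(model, env_args=None):
    """Return a shell-safe vf-eval command string (hypothetical helper)."""
    argv = ["uv", "run", "vf-eval", "clbench", "-m", model]
    if env_args:
        # -a takes the environment arguments as a single JSON object.
        argv += ["-a", json.dumps(env_args)]
    # shlex.join quotes each argument so the shell passes it through unchanged.
    return shlex.join(argv)


command = build_eval_command(
    "openai/gpt-5.2",
    {
        "context_category": "Rule System Application",
        "sub_category": ["Game Mechanics", "Legal & Regulatory"],
    },
)
print(command)
```

The printed string can be pasted into a terminal; splitting it back with shlex.split recovers the original argument list, which is a quick way to check the quoting.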

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| judge_model | str | "openai/gpt-5.2" | Judge model |
| judge_base_url | str or null | "https://api.pinference.ai/api/v1" | OpenAI-compatible base URL (Prime Intellect) |
| judge_api_key_var | str or null | None | Env var for judge API key; when null, uses PRIME_API_KEY |
| context_category | str or list[str] or null | None | Filter examples by metadata context_category; pass a string or list of strings to match |
| sub_category | str or list[str] or null | None | Filter examples by metadata sub_category; pass a string or list of strings to match |
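
The judge_api_key_var fallback described above can be pictured with a short Python sketch. This is an illustration of the documented behavior, not the environment's actual code: the function name resolve_judge_api_key is hypothetical, and raising an error when the variable is unset is an assumption.

```python
import os


def resolve_judge_api_key(judge_api_key_var=None, environ=None):
    """Look up the judge API key (hypothetical helper, not the real implementation)."""
    environ = os.environ if environ is None else environ
    # When judge_api_key_var is None (null), fall back to PRIME_API_KEY.
    var_name = judge_api_key_var or "PRIME_API_KEY"
    key = environ.get(var_name)
    if not key:
        # Assumption: a missing key is reported as an error rather than ignored.
        raise RuntimeError(f"Set {var_name} before running the judge")
    return key
```

To point the judge at a different OpenAI-compatible endpoint, pass judge_base_url together with the name of the environment variable that holds that endpoint's key in judge_api_key_var.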

Valid categories

Only certain context_category / sub_category pairs exist in the dataset. An error is raised if you specify invalid names or a non-existent combination.

context_category (4 values):

  • Domain Knowledge Reasoning
  • Empirical Discovery & Simulation
  • Procedural Task Execution
  • Rule System Application

sub_category (per context_category):

| context_category | sub_category |
| --- | --- |
| Domain Knowledge Reasoning | Finance, Healthcare, Humanities, Legal Advisory, Lifestyle, Management, Science |
| Empirical Discovery & Simulation | Experimental Data, Observational Data, Simulation Environment |
| Procedural Task Execution | Instructional Procedures, Operational Procedures, Workflow Orchestration |
| Rule System Application | Game Mechanics, Legal & Regulatory, Mathematical Formalism, Programming Syntax, Technical Standards |
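
To catch an invalid filter before launching a run, the pairs listed above can be checked locally. The sketch below copies the category names from this section; the names VALID_PAIRS and check_filters are hypothetical, and the environment's own validation may differ in its details and error messages.

```python
# Valid context_category -> sub_category pairs, copied from the table above.
VALID_PAIRS = {
    "Domain Knowledge Reasoning": {
        "Finance", "Healthcare", "Humanities", "Legal Advisory",
        "Lifestyle", "Management", "Science",
    },
    "Empirical Discovery & Simulation": {
        "Experimental Data", "Observational Data", "Simulation Environment",
    },
    "Procedural Task Execution": {
        "Instructional Procedures", "Operational Procedures", "Workflow Orchestration",
    },
    "Rule System Application": {
        "Game Mechanics", "Legal & Regulatory", "Mathematical Formalism",
        "Programming Syntax", "Technical Standards",
    },
}


def as_list(value):
    """Normalize None, a string, or a list of strings to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def check_filters(context_category=None, sub_category=None):
    """Raise ValueError for unknown names or non-existent combinations."""
    contexts = as_list(context_category)
    for name in contexts:
        if name not in VALID_PAIRS:
            raise ValueError(f"Unknown context_category: {name!r}")
    # With no context filter, any sub_category from the table is acceptable.
    allowed = set().union(*(VALID_PAIRS[c] for c in (contexts or VALID_PAIRS)))
    for name in as_list(sub_category):
        if name not in allowed:
            raise ValueError(f"sub_category {name!r} is not valid for {contexts or 'any context'}")


check_filters("Rule System Application", ["Game Mechanics", "Legal & Regulatory"])
```

The final call passes silently because both sub-categories belong to Rule System Application; pairing that context with Finance would raise ValueError.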

Changelog

  • 0.1.0: Environment created.