tau2_infinity_wg

Prime Intellect RL environment — airline customer-service gym backed by world-gen failure artifacts.

The agent plays the airline-support role against an LLM-simulated customer in a stateful tool environment. Each episode runs against a failure-*.json artifact that pins the initial world state and the set of objectives the agent must satisfy.

Rewards

Four components (ToolRL decomposition):

outcome — an LLM-as-judge (ORM) scores each objective on a continuous 5-bucket rubric; the final score is the mean across objectives. See verifier.py for how judge determinism is handled.
format — strict tool-call schema compliance, binary.
efficiency — compares the tool-call count to the optimal count; the score decays as the count exceeds the optimum.
compliance — penalties for parallel calls and unauthorized mutations.

See rewards.py for weights and the "Things explicitly deferred" section.
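
As a rough illustration of how the four components could combine into one scalar, here is a minimal sketch. The bucket values, weights, decay rate, and penalty size are all hypothetical placeholders chosen for the example, and the function names are not taken from rewards.py; the real values and logic live in rewards.py and verifier.py.

```python
from statistics import mean

# Hypothetical values for illustration only; see rewards.py for the real ones.
BUCKETS = (0.0, 0.25, 0.5, 0.75, 1.0)  # assumed 5-bucket rubric scores
WEIGHTS = {"outcome": 0.7, "format": 0.1, "efficiency": 0.1, "compliance": 0.1}


def outcome_score(bucket_indices):
    """Mean of the per-objective bucket scores assigned by the judge."""
    return mean(BUCKETS[i] for i in bucket_indices)


def format_score(schema_ok):
    """Binary: 1.0 if every tool call matched the schema, else 0.0."""
    return 1.0 if schema_ok else 0.0


def efficiency_score(calls_made, optimal_calls, decay=0.5):
    """1.0 at or below the optimal count, then geometric decay per extra call."""
    extra = max(0, calls_made - optimal_calls)
    return decay ** extra


def compliance_score(parallel_calls, unauthorized_mutations, penalty=0.5):
    """Start at 1.0 and subtract a fixed penalty per violation, floored at 0."""
    violations = parallel_calls + unauthorized_mutations
    return max(0.0, 1.0 - penalty * violations)


def total_reward(bucket_indices, schema_ok, calls_made, optimal_calls,
                 parallel_calls=0, unauthorized_mutations=0):
    """Weighted sum of the four components."""
    parts = {
        "outcome": outcome_score(bucket_indices),
        "format": format_score(schema_ok),
        "efficiency": efficiency_score(calls_made, optimal_calls),
        "compliance": compliance_score(parallel_calls, unauthorized_mutations),
    }
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)
```

A weighted sum is only one plausible way to combine the components; the actual environment may gate or clip them differently.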

Artifacts

The artifacts/ directory holds world-gen outputs — each file is a task the agent must solve. Artifacts are loaded verbatim by dataset.py; the env never regenerates them at runtime.
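
For orientation, verbatim loading of the artifacts might look roughly like the sketch below. The function name load_artifacts and its return shape are hypothetical and are not taken from dataset.py; the only details drawn from this README are the artifacts/ directory and the failure-*.json naming pattern.

```python
import json
from pathlib import Path


def load_artifacts(artifact_dir):
    """Return {filename: parsed JSON} for every failure-*.json file.

    Files are read as-is; nothing is regenerated or rewritten.
    """
    tasks = {}
    # Sort for a stable, reproducible task order across runs.
    for path in sorted(Path(artifact_dir).glob("failure-*.json")):
        with path.open(encoding="utf-8") as fh:
            tasks[path.name] = json.load(fh)
    return tasks
```

Calling load_artifacts("artifacts") would then yield one entry per task file, keyed by filename.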

Models

Default judge and customer-simulation model: bedrock/us.anthropic.claude-haiku-4-5-20251001-v1:0. Override either one via load_environment(judge_model=..., customer_model=...).

Required secrets (or env vars) at runtime: AWS_BEARER_TOKEN_BEDROCK, AWS_REGION.
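
A small preflight check can surface a missing credential before an episode starts. This is a sketch rather than part of the environment: the helper name missing_env_vars is hypothetical, and only the two variable names come from this README.

```python
import os

# Variable names taken from the README; the helper itself is illustrative.
REQUIRED_ENV_VARS = ("AWS_BEARER_TOKEN_BEDROCK", "AWS_REGION")


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing required settings: " + ", ".join(missing))
    print("Bedrock credentials look present.")
```

Running this before load_environment(...) turns a missing variable into an immediate, readable error instead of a failure partway through a rollout.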