Nemotron-Cascade-2-RL-data

Description

Nemotron-Cascade-2-RL-data is a curated reinforcement learning dataset blend developed by NVIDIA for training the Nemotron-Cascade-2-30B-A3B model. This environment implements 3 variants covering instruction following, multi-domain tasks, and multi-domain on-policy distillation (MOPD). Grading is LLM-based (gpt-5-mini) for every task type except structured outputs, which are validated programmatically against a JSON schema.

The SWE-RL subset (3,612 software engineering tasks) from the original dataset is excluded from this environment, as those tasks are already available on OpenReward via the dedicated SWE-Gym and R2E-Gym environments.

Capabilities

Following instructions with complex, verifiable constraints (sentence counts, keyword placement, formatting rules)
Answering multiple-choice knowledge questions across STEM domains
Executing workplace function calls (email, calendar, analytics, project management)
Generating structured JSON outputs conforming to schemas

Compute Requirements

No sandbox is required. An OpenAI API key is required for LLM grading.

License

Open Data Commons Attribution License (ODC-By) v1.0.

Tasks

This environment uses 3 variants (one per dataset subset), each with a train split:

| Variant | Tasks | Description |
|---|---|---|
| `nemotronifrl` | 45,879 | Instruction following with verifiable formatting constraints |
| `nemotronmultidomainrl` | 17,592 | MCQA, workplace function calling, structured outputs |
| `nemotronmopd` | 6,111 | Multi-domain on-policy distillation (mixed: IF, MCQA, function calling, schema) |

Total: 69,582 tasks.
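As a quick arithmetic check, the per-variant counts listed above add up to the stated total. The dictionary below only restates those counts; it is not read from the dataset.

```python
# Task counts per variant, copied from the variant table above.
variant_tasks = {
    "nemotronifrl": 45_879,
    "nemotronmultidomainrl": 17_592,
    "nemotronmopd": 6_111,
}

total_tasks = sum(variant_tasks.values())
print(f"{total_tasks:,}")  # 69,582
```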

615 rows from the original dataset (555 from multi-domain-RL, 60 from MOPD) are excluded because they lack a verifiable grading signal (no expected answer, ground truth, constraints, or schema).
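The exclusion rule can be pictured as a simple row filter. The following is a minimal sketch, not the environment's actual preprocessing code: the field names (`expected_answer`, `ground_truth`, `constraints`, `schema`) and the helper `has_grading_signal` are hypothetical stand-ins for whatever columns the real data uses.

```python
from typing import Any, Mapping

# Hypothetical field names; the real dataset columns may differ.
SIGNAL_FIELDS = ("expected_answer", "ground_truth", "constraints", "schema")


def has_grading_signal(row: Mapping[str, Any]) -> bool:
    """Return True if at least one grading field is present and non-empty."""
    return any(row.get(field) not in (None, "", [], {}) for field in SIGNAL_FIELDS)


# Toy rows for illustration only (not taken from the dataset).
rows = [
    {"prompt": "Pick A-D.", "expected_answer": "B"},
    {"prompt": "Write freely.", "expected_answer": None, "constraints": []},
]

kept = [row for row in rows if has_grading_signal(row)]
dropped = len(rows) - len(kept)
print(len(kept), dropped)  # 1 1
```

A row is kept as soon as any one signal is available, which matches the stated criterion that only rows with no expected answer, ground truth, constraints, or schema are dropped.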

Reward Structure

All variants use binary reward (1.0 correct, 0.0 incorrect):

IF-RL: An LLM (gpt-5-mini) checks that all instruction constraints are satisfied
Multi-domain-RL: MCQA uses LLM answer matching; function calling uses LLM comparison to ground truth; structured output uses programmatic JSON schema validation
MOPD: Mixed grading depending on task type (same strategies as above)
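The grading strategies above can be sketched as a dispatch that returns a binary reward. This is an illustrative sketch under several assumptions: the task-type label `"structured_output"`, the `task["schema"]` field, and the `llm_judge` callable are hypothetical, and `matches_schema` checks only a small subset of JSON Schema (`type`, `required`, `properties`) rather than performing full validation.

```python
import json
from typing import Any, Callable

_JSON_TYPES = {
    "object": dict, "array": list, "string": str, "boolean": bool,
    "number": (int, float), "integer": int, "null": type(None),
}


def matches_schema(value: Any, schema: dict) -> bool:
    """Check only `type` (single string), `required`, and `properties`."""
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        if isinstance(value, bool) and expected in ("number", "integer"):
            return False
        if not isinstance(value, _JSON_TYPES[expected]):
            return False
    if isinstance(value, dict):
        if any(key not in value for key in schema.get("required", [])):
            return False
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value and not matches_schema(value[key], sub_schema):
                return False
    return True


def grade(task_type: str, response: str, task: dict,
          llm_judge: Callable[[str, dict], bool]) -> float:
    """Return 1.0 if the response is judged correct, else 0.0."""
    if task_type == "structured_output":  # hypothetical label
        try:
            parsed = json.loads(response)
        except json.JSONDecodeError:
            return 0.0
        return 1.0 if matches_schema(parsed, task["schema"]) else 0.0
    # Instruction following, MCQA, and function calling go to the LLM judge.
    return 1.0 if llm_judge(response, task) else 0.0


schema = {"type": "object", "required": ["name"],
          "properties": {"name": {"type": "string"}}}
never = lambda response, task: False  # stub judge, unused for schema tasks

print(grade("structured_output", '{"name": "Ada"}', {"schema": schema}, never))  # 1.0
print(grade("structured_output", '{"name": 3}', {"schema": schema}, never))      # 0.0
```

Keeping the judge as an injected callable means the programmatic path can be exercised without an API key, while the LLM-graded paths stay behind a single interface.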

Data

Data is sourced from nvidia/Nemotron-Cascade-2-RL-data on Hugging Face. This environment uses 3 of the 4 subsets (SWE-RL is excluded):

IF-RL: Instruction-following tasks derived from nvidia/Nemotron-RL-instruction_following
multi-domain-RL: Knowledge MCQA, workplace assistant, structured outputs
MOPD: Blend from AceReason-Math, instruction following, STEM MCQA, and workplace tasks

Data is stored as parquet files on the OpenReward platform.

Tools

| Tool | Description |
|---|---|
| `answer` | Submit your response. Grading depends on task type (instruction following, MCQA, function calling, schema validation). |

Time Horizon

Single-turn. Each task is completed with one tool call (`answer`).

Environment Difficulty

The dataset spans a wide range of difficulty:

IF-RL tasks range from simple formatting (word count) to complex multi-constraint satisfaction
MCQA covers STEM knowledge questions across multiple domains
Function calling tasks require correct tool selection and parameter formatting

Other Environment Requirements

OpenAI API key: Required for LLM grading. Pass it via `secrets={"openai_api_key": "..."}`.
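To keep the key out of source files, the `secrets` mapping can be built from an environment variable. This is a minimal sketch: the helper `build_secrets` and the `OPENAI_API_KEY` variable name are assumptions made for illustration, and how the mapping is handed to the environment client is not shown here because it depends on the OpenReward API.

```python
import os


def build_secrets(env_var: str = "OPENAI_API_KEY") -> dict[str, str]:
    """Build the secrets mapping, failing early if the key is missing."""
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set {env_var} before creating the environment.")
    return {"openai_api_key": api_key}
```

Failing at construction time surfaces a missing key before any task is attempted, rather than at the first LLM-graded submission.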

Safety

This environment evaluates instruction following, knowledge recall, and function calling. It does not present direct safety risks.

Citations

@article{Nemotron_Cascade_2,
  title={Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation},
  author={Yang, Zhuolin and Liu, Zihan and Chen, Yang and Dai, Wenliang and Wang, Boxin and Lin, Sheng-Chieh and Lee, Chankyu and Chen, Yangyi and Jiang, Dongfu and He, Jiafan and Pi, Renjie and Lam, Grace and Lee, Nayeon and Bukharin, Alexander and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  year={2026},
  journal={arXiv preprint arXiv:2603.19220}
}