Nemotron-Cascade-2-RL-data
Description
Nemotron-Cascade-2-RL-data is a curated reinforcement learning dataset blend developed by NVIDIA for training the Nemotron-Cascade-2-30B-A3B model. This environment implements 3 variants covering instruction following, multi-domain tasks, and on-policy distillation. Grading is LLM-based (gpt-5-mini) for all variants, except structured-output tasks, which are validated programmatically against a JSON schema.
The SWE-RL subset (3,612 software engineering tasks) from the original dataset is excluded from this environment, as those tasks are already available on OpenReward via the dedicated SWE-Gym and R2E-Gym environments.
Capabilities
- Following complex instruction constraints (sentence counts, keyword placement, formatting rules)
- Answering multiple-choice knowledge questions across STEM domains
- Executing workplace function calls (email, calendar, analytics, project management)
- Generating structured JSON outputs conforming to schemas
Compute Requirements
No sandbox required. Requires an OpenAI API key for LLM grading.
License
Open Data Commons Attribution License (ODC-By) v1.0.
Tasks
This environment uses 3 variants (one per dataset subset), each with a train split:
| Variant | Tasks | Description |
|---|---|---|
| nemotronifrl | 45,879 | Instruction following with verifiable formatting constraints |
| nemotronmultidomainrl | 17,592 | MCQA, workplace function calling, structured outputs |
| nemotronmopd | 6,111 | Multi-domain on-policy distillation (mixed: IF, MCQA, function calling, schema) |
Total: 69,582 tasks.
615 rows from the original dataset (555 from multi-domain-RL, 60 from MOPD) are excluded because they lack a verifiable grading signal (no expected answer, ground truth, constraints, or schema).
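
The exclusion rule above can be sketched as a simple row filter. This is a minimal illustration, not the environment's actual preprocessing code: the field names (`expected_answer`, `ground_truth`, `constraints`, `schema`) and both helper functions are hypothetical stand-ins for whatever columns and logic the dataset really uses.

```python
# Hypothetical field names; the real dataset columns may differ.
GRADING_FIELDS = ("expected_answer", "ground_truth", "constraints", "schema")


def has_grading_signal(row: dict) -> bool:
    """Return True if at least one grading field is present and non-empty."""
    return any(row.get(field) not in (None, "", [], {}) for field in GRADING_FIELDS)


def filter_gradable(rows: list[dict]) -> list[dict]:
    """Keep only the rows that carry something to grade against."""
    return [row for row in rows if has_grading_signal(row)]


rows = [
    {"prompt": "Pick A, B, or C.", "expected_answer": "B"},
    {"prompt": "Free-form chat with no reference.", "expected_answer": None},
    {"prompt": "Return JSON.", "schema": {"type": "object"}},
]
kept = filter_gradable(rows)
print(len(kept))  # 2: the row with no reference answer is dropped
```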
Reward Structure
All variants use binary reward (1.0 correct, 0.0 incorrect):
- IF-RL: LLM (gpt-5-mini) checks that all instruction constraints are satisfied
- Multi-domain-RL: MCQA uses LLM answer matching; function calling uses LLM comparison to ground truth; structured output uses programmatic JSON schema validation
- MOPD: Mixed grading depending on task type (same strategies as above)
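
The dispatch described above might look roughly like the following. This is a sketch under stated assumptions: the `task_type` label, the `schema` field, and the `llm_judge` callable (a stand-in for the gpt-5-mini grader) are all hypothetical, and the schema check covers only `type`, `required`, and `properties`, a small subset of JSON Schema. The environment's actual validator is not specified here.

```python
import json
from typing import Any, Callable

# Maps JSON Schema type names to Python types (subset only).
_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
}


def conforms(value: Any, schema: dict) -> bool:
    """Check a value against a small subset of JSON Schema."""
    expected = schema.get("type")
    # Unsupported or non-string "type" entries are ignored in this sketch.
    if isinstance(expected, str) and expected in _TYPES:
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("integer", "number") and isinstance(value, bool):
            return False
        if not isinstance(value, _TYPES[expected]):
            return False
    if isinstance(value, dict):
        if any(key not in value for key in schema.get("required", [])):
            return False
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value and not conforms(value[key], sub_schema):
                return False
    return True


def grade(task: dict, response: str, llm_judge: Callable[[dict, str], bool]) -> float:
    """Return a binary reward: 1.0 if correct, 0.0 otherwise."""
    if task["task_type"] == "structured_output":
        try:
            parsed = json.loads(response)
        except json.JSONDecodeError:
            return 0.0
        return 1.0 if conforms(parsed, task["schema"]) else 0.0
    # Instruction following, MCQA, and function calling go to the LLM judge.
    return 1.0 if llm_judge(task, response) else 0.0


task = {
    "task_type": "structured_output",
    "schema": {
        "type": "object",
        "required": ["name"],
        "properties": {"name": {"type": "string"}},
    },
}
print(grade(task, '{"name": "Ada"}', llm_judge=lambda t, r: False))  # 1.0
print(grade(task, '{"name": 3}', llm_judge=lambda t, r: False))      # 0.0
```

Keeping the judge as an injected callable means the programmatic path can be exercised without an API key, while the LLM-graded paths share the same binary reward contract.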
Data
Data is sourced from nvidia/Nemotron-Cascade-2-RL-data on HuggingFace. This environment uses 3 of the 4 subsets:
- IF-RL: Instruction-following tasks derived from nvidia/Nemotron-RL-instruction_following
- multi-domain-RL: Knowledge MCQA, workplace assistant, structured outputs
- MOPD: Blend from AceReason-Math, instruction following, STEM MCQA, and workplace tasks
Data is stored as parquet files on the OpenReward platform.
Tools
| Tool | Description |
|---|---|
| answer | Submit your response. Grading depends on task type (instruction following, MCQA, function calling, schema validation). |
Time Horizon
Single-turn. One tool call (answer).
Environment Difficulty
The dataset spans a wide range of difficulty:
- IF-RL tasks range from simple formatting (word count) to complex multi-constraint satisfaction
- MCQA covers STEM knowledge questions across multiple domains
- Function calling tasks require correct tool selection and parameter formatting
Other Environment Requirements
- OpenAI API key: Required for LLM grading. Pass via secrets={"openai_api_key": "..."}.
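
One way to build the secrets mapping without hard-coding the key is to read it from an environment variable. This is only a sketch: the variable name `OPENAI_API_KEY` and the helper `build_secrets` are assumptions for illustration, and how the mapping is handed to the environment depends on the client you use.

```python
import os


def build_secrets(env_var: str = "OPENAI_API_KEY") -> dict[str, str]:
    """Build the secrets mapping from an environment variable (hypothetical helper)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} before starting the environment.")
    # The "openai_api_key" name matches the secrets entry shown above.
    return {"openai_api_key": key}
```

Failing early when the variable is unset surfaces a missing key before any task is attempted, rather than at grading time.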
Safety
This environment evaluates instruction following, knowledge recall, and function calling. It does not present direct safety risks.
Citations
@article{Nemotron_Cascade_2,
title={Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation},
author={Yang, Zhuolin and Liu, Zihan and Chen, Yang and Dai, Wenliang and Wang, Boxin and Lin, Sheng-Chieh and Lee, Chankyu and Chen, Yangyi and Jiang, Dongfu and He, Jiafan and Pi, Renjie and Lam, Grace and Lee, Nayeon and Bukharin, Alexander and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
year={2026},
journal={arXiv preprint arXiv:2603.19220}
}