
Nemotron RLHF GenRM V1

This dataset is designed for training Generative Reward Models (GenRMs) with reinforcement learning at scale. The goal is accurate and robust GenRMs that generalize better than traditional Bradley-Terry models and reduce the risk of reward hacking.

Type: RL Env
Publisher: NVIDIA
Runtime: ORS
License: unknown
Size: 299,517 tasks
Published: Mar 2026

Nemotron-RLHF-GenRM-v1

Description

Nemotron-RLHF-GenRM-v1 is an environment for training Generative Reward Models (GenRMs) that perform pairwise comparison of LLM responses. Given a conversation context and two assistant responses, the agent must evaluate both responses and produce individual helpfulness scores and a comparative ranking.

This environment implements the GenRM training task from NVIDIA's Nemotron 3 Super training recipe. The dataset is sourced from nvidia/Nemotron-RLHF-GenRM-v1, which is based on allenai/WildChat-1M.

Capabilities

  • Pairwise comparison of LLM responses across diverse domains
  • Helpfulness scoring on a 1-5 scale
  • Comparative ranking on a 1-6 scale
  • Safety and refusal evaluation
  • Single-turn evaluation (one submission per task)

License

CC-BY-4.0 (same as the underlying dataset).

Tasks

There is one split:

  • train: 299,517 pairwise comparison tasks

Each task presents a conversation context with two candidate assistant responses. The agent must reason through the strengths and weaknesses of both responses, then produce:

  • score_1: helpfulness score for Response 1 (1-5)
  • score_2: helpfulness score for Response 2 (1-5)
  • ranking: comparative preference (1-6, where 1 = Response 1 far superior, 6 = Response 2 far superior)

Some tasks have only a ground-truth ranking (no individual helpfulness scores), typically for clear-cut safety refusal scenarios.
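
As an illustration of the submission format described above, here is a minimal Python sketch that parses a candidate answer and checks the three fields against their stated ranges. The helper name parse_submission and the strict integer and range checks are assumptions made for this example; the environment's own validation may differ (the reward formula only mentions that the output must be valid JSON).

```python
import json
from typing import Optional

SCORE_RANGE = range(1, 6)    # helpfulness scores: 1-5 inclusive
RANKING_RANGE = range(1, 7)  # comparative ranking: 1-6 inclusive


def parse_submission(raw: str) -> Optional[dict]:
    """Return the parsed submission, or None if it is malformed.

    Hypothetical helper: field names come from the task description,
    but the range and type checks are assumptions of this sketch.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    checks = (
        ("score_1", SCORE_RANGE),
        ("score_2", SCORE_RANGE),
        ("ranking", RANKING_RANGE),
    )
    for key, allowed in checks:
        value = data.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            return None
        if value not in allowed:
            return None
    return data


# Response 1 is judged more helpful, and the ranking leans toward Response 1.
example = json.dumps({"score_1": 4, "score_2": 2, "ranking": 2})
print(parse_submission(example))
```

Checking ranges before submitting is a convenience for the agent side; a value outside 1-5 or 1-6 would otherwise only show up as a larger deviation from the ground truth.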

Reward Structure

Rewards are computed using the formula from the Nemotron 3 Nano paper:

R = -C1 * I_format - |P_h1 - G_h1| - |P_h2 - G_h2| - C2 * |P_r - G_r|

Where:

  • C1 = 10: format violation penalty (binary: output must be valid JSON)
  • C2 = 1: ranking deviation weight
  • I_format: 1 if output doesn't parse to valid JSON, 0 otherwise
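  • P / G: predicted / ground-truth values; the subscripts h1 and h2 denote the helpfulness scores for Response 1 and Response 2, and r denotes the ranking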

The reward is normalized to [0, 1]. When ground-truth helpfulness scores are absent, the two helpfulness terms are dropped and only the format and ranking terms apply.

No LLM graders are used; all rewards are rule-based.
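
The formula above can be sketched in a few lines of Python. This is an illustrative reading of the formula, not the environment's reference implementation. Two points are assumptions: a submission that fails to parse, or that lacks a numeric field, receives only the format penalty -C1; and the sketch stops at the raw value R, because the exact mapping to [0, 1] is not spelled out here. The function name raw_reward and its parameters are hypothetical.

```python
import json
from typing import Optional

C1 = 10  # format violation penalty
C2 = 1   # ranking deviation weight


def raw_reward(output: str, g_r: int,
               g_h1: Optional[int] = None,
               g_h2: Optional[int] = None) -> float:
    """Unnormalized reward R for one submission (a sketch under stated assumptions)."""
    try:
        pred = json.loads(output)
        p_h1, p_h2, p_r = pred["score_1"], pred["score_2"], pred["ranking"]
        # Ranking term: always applied.
        reward = -C2 * abs(p_r - g_r)
        # Helpfulness terms: applied only when ground-truth scores exist.
        if g_h1 is not None and g_h2 is not None:
            reward -= abs(p_h1 - g_h1) + abs(p_h2 - g_h2)
    except (json.JSONDecodeError, KeyError, TypeError):
        # Assumption: unusable output is treated as a format violation (I_format = 1).
        return -float(C1)
    return float(reward)


answer = '{"score_1": 4, "score_2": 2, "ranking": 2}'
print(raw_reward(answer, g_r=3, g_h1=5, g_h2=2))  # -2.0: off by 1 on score_1 and by 1 on ranking
print(raw_reward(answer, g_r=1))                  # -1.0: ranking-only task
print(raw_reward("not json", g_r=3))              # -10.0: format penalty
```

A perfect prediction yields R = 0, and every deviation or format failure pushes R below zero, so any normalization to [0, 1] has to rescale from this non-positive range.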

Tools

Tool      Description
answer    Submit evaluation as JSON with score_1, score_2, and ranking

Other Environment Requirements

No external API keys are required. This environment uses purely rule-based grading.

Data

Data sourced from nvidia/Nemotron-RLHF-GenRM-v1 on Hugging Face. Run python download_data.py to download and convert to local parquet format.

Citations

@article{nvidia2025nemotron3nano,
  title={Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning},
  author={NVIDIA},
  journal={arXiv preprint arXiv:2512.20848},
  year={2025}
}