Nemotron-RLHF-GenRM-v1

Description

Nemotron-RLHF-GenRM-v1 is an environment for training Generative Reward Models (GenRMs) that perform pairwise comparison of LLM responses. Given a conversation context and two assistant responses, the agent must evaluate both responses and produce individual helpfulness scores and a comparative ranking.

This environment implements the GenRM training task from NVIDIA's Nemotron 3 Super training recipe. The dataset is sourced from nvidia/Nemotron-RLHF-GenRM-v1, which is based on allenai/WildChat-1M.

Capabilities

Pairwise comparison of LLM responses across diverse domains
Helpfulness scoring on a 1-5 scale
Comparative ranking on a 1-6 scale
Safety and refusal evaluation
Single-turn evaluation (one submission per task)

License

CC-BY-4.0 (same as the underlying dataset).

Tasks

There is one split:

train: 299,517 pairwise comparison tasks

Each task presents a conversation context with two candidate assistant responses. The agent must reason through the strengths and weaknesses of both responses, then produce:

score_1: helpfulness score for Response 1 (1-5)
score_2: helpfulness score for Response 2 (1-5)
ranking: comparative preference (1-6, where 1 = Response 1 far superior, 6 = Response 2 far superior)

Some tasks have only a ground-truth ranking (no individual helpfulness scores), typically for clear-cut safety refusal scenarios.
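As a rough illustration of the output shape described above, the sketch below builds one submission and checks it against the stated ranges. The helper name `validate_submission` is hypothetical and not part of the environment. The integer-only check is also an assumption, since the text above gives only the ranges 1-5 and 1-6.

```python
import json

SCORE_RANGE = range(1, 6)    # helpfulness scores: 1-5
RANKING_RANGE = range(1, 7)  # comparative ranking: 1-6

# Hypothetical helper: checks the three fields named in this README.
def validate_submission(text):
    """Return the parsed submission dict, or raise ValueError with a reason."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    fields = (
        ("score_1", SCORE_RANGE),
        ("score_2", SCORE_RANGE),
        ("ranking", RANKING_RANGE),
    )
    for key, allowed in fields:
        value = data.get(key)
        # Assumption: values are integers; bool is excluded explicitly
        # because bool is a subclass of int in Python.
        if isinstance(value, bool) or not isinstance(value, int) or value not in allowed:
            raise ValueError(f"{key} must be an integer in {allowed.start}-{allowed.stop - 1}")
    return data

# Response 1 judged clearly better: higher helpfulness score, low ranking value.
example = json.dumps({"score_1": 4, "score_2": 2, "ranking": 2})
parsed = validate_submission(example)
```

A check like this may be useful on the agent side before submitting, since each task allows only one submission.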

Reward Structure

Rewards are computed using the formula from the Nemotron 3 Nano paper:

R = -C1 * I_format - |P_h1 - G_h1| - |P_h2 - G_h2| - C2 * |P_r - G_r|

Where:

C1 = 10: format violation penalty (binary: output must be valid JSON)
C2 = 1: ranking deviation weight
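I_format: 1 if the output does not parse as valid JSON, 0 otherwise
P_h1, P_h2, P_r: predicted helpfulness scores for Response 1 and Response 2, and predicted ranking
G_h1, G_h2, G_r: the corresponding ground-truth values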

The resulting reward is normalized to [0, 1]. When ground-truth helpfulness scores are absent, only the format and ranking terms apply.

No LLM graders are used; all rewards are rule-based.
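The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the environment's actual grader: the function name `raw_reward` is hypothetical, and it assumes that an unparseable output (or one missing a required field) receives only the format penalty, since the remaining terms cannot be computed without predictions. The normalization to [0, 1] is left out because the mapping is not specified here.

```python
import json

C1 = 10.0  # format violation penalty
C2 = 1.0   # ranking deviation weight

# Hypothetical helper: unnormalized reward R for one submission.
def raw_reward(output_text, gt_ranking, gt_score_1=None, gt_score_2=None):
    try:
        pred = json.loads(output_text)
        p_r = float(pred["ranking"])
        p_h1 = float(pred["score_1"])
        p_h2 = float(pred["score_2"])
    except (ValueError, TypeError, KeyError):
        # Assumption: I_format = 1 and the other terms are treated as zero.
        # Missing or non-numeric fields are also counted as a format violation.
        return -C1

    # Ranking term: always applies.
    reward = -C2 * abs(p_r - gt_ranking)

    # Helpfulness terms: only when ground-truth scores are present.
    if gt_score_1 is not None and gt_score_2 is not None:
        reward -= abs(p_h1 - gt_score_1) + abs(p_h2 - gt_score_2)
    return reward

submission = '{"score_1": 4, "score_2": 2, "ranking": 2}'

# Ground truth (5, 2, ranking 3): |4-5| + |2-2| + 1 * |2-3| = 2, so R = -2.
full = raw_reward(submission, gt_ranking=3, gt_score_1=5, gt_score_2=2)

# Ranking-only task: the helpfulness terms are skipped, so R = -1.
ranking_only = raw_reward(submission, gt_ranking=3)

# Unparseable output: R = -C1 under the assumption above.
malformed = raw_reward("Response 1 is better", gt_ranking=3)
```

Under this sketch a perfect prediction yields R = 0, the maximum of the unnormalized reward, and every deviation lowers it linearly.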

Tools

Tool	Description
`answer`	Submit evaluation as JSON with `score_1`, `score_2`, and `ranking`

Other Environment Requirements

No external API keys are required. This environment uses purely rule-based grading.

Data

Data is sourced from nvidia/Nemotron-RLHF-GenRM-v1 on Hugging Face. Run `python download_data.py` to download the dataset and convert it to local Parquet format.

Citations

@article{nvidia2025nemotron3nano,
  title={Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning},
  author={NVIDIA},
  journal={arXiv preprint arXiv:2512.20848},
  year={2025}
}