0

Nemotron RL Instruction Following MultiTurnChat V1

Fresh

The MultiChallenge Dataset is a rigorous benchmark designed to improve large language models in complex multi-turn conversations by explicitly targeting inference memory, instruction retention, version editing, and self-coherence. It employs a unique "model breaking" methodolo…

Type
RL Env
Publisher
NVIDIA
Runtime
ORS
License
unknown
Size
2011 tasks
Published
Mar 2026

Cite

Notes

Only stored in your browser.

Nemotron IF-Eval

OpenReward Environment Hugging Face Dataset

Description

Nemotron IF-Eval evaluates instruction-following in multi-turn conversations. Based on the Nemotron-RL-Instruction-Following-MultiTurnChat-v1 dataset from NVIDIA, each task presents a multi-turn conversation with a detailed system prompt containing specific constraints, followed by several user/assistant exchanges. The agent must generate the next assistant response adhering to all instructions. Responses are graded by an LLM judge against 1-6 rubric criteria per task.

The dataset is part of the MultiChallenge benchmark, which tests whether models can persistently follow instructions across complex multi-turn interactions.

Capabilities

  • Following complex, multi-constraint instructions across conversation turns
  • Maintaining behavioral consistency over extended multi-turn dialogues
  • Adhering to formatting, tone, length, and content requirements
  • Instruction retention under conversational pressure

Compute Requirements

Nemotron IF-Eval does not require a sandbox. It has minimal compute requirements.

License

CC-BY-4.0.

Tasks

There are 2,011 tasks in a single train split. Each task presents a multi-turn conversation prompt containing system, user, and assistant messages (9-32 messages per task, mean 15.2). The agent must provide the next assistant response, which is graded against 1-6 evaluation rubric items per task (mean 2.6). Rubric items test specific instruction-following criteria such as tone, formatting constraints, content requirements, and behavioral consistency.

Reward Structure

This is a sparse reward environment with continuous scoring. The agent calls the answer tool once with its response. Each rubric item is evaluated independently by an LLM grader (gpt-5-mini) that answers a yes/no question and compares the result to the expected pass criteria. The overall score is the fraction of rubric items passed:

$$\text{Reward} = \frac{\text{rubric items passed}}{\text{total rubric items}}$$

Scores range from 0.0 to 1.0.

Data

Conversations are sourced from the Nemotron-RL-Instruction-Following-MultiTurnChat-v1 dataset by NVIDIA. Data files are stored on the OpenReward platform.

Tools

ToolDescription
answerSubmit a response to continue the conversation. The response is graded by the LLM grader against the rubric criteria. Returns the overall score. Called once per task.

Time Horizon

Nemotron IF-Eval is a single-turn environment. The agent receives a conversation context and submits one response. Each task requires exactly one tool call.

Other Environment Requirements

Nemotron IF-Eval requires an OpenAI API key (openai_api_key secret) for LLM-based grading of responses.

Safety

Agents in Nemotron IF-Eval are asked to respond to multi-turn conversations. The environment does not present direct safety risks, as agents only provide text responses with no access to external systems, tools, or the internet.

Citations

@article{sirdeshmukh2025multichallenge,
  title={MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs},
  author={Sirdeshmukh, Ved and Deshpande, Kaustubh and Mols, Johannes and Jin, Lifeng and Cardona, Ed-Yeremai and Lee, Dean and Kritz, Jeremy and Primack, Willow and Yue, Summer and Xing, Chen},
  journal={arXiv preprint arXiv:2501.17399},
  year={2025}
}