Nemotron IF-Eval

Description

Nemotron IF-Eval evaluates instruction-following in multi-turn conversations. It is based on the Nemotron-RL-Instruction-Following-MultiTurnChat-v1 dataset from NVIDIA. Each task presents a multi-turn conversation that opens with a detailed system prompt containing specific constraints, followed by several user/assistant exchanges. The agent must generate the next assistant response while adhering to all instructions. Responses are graded by an LLM grader against 1-6 rubric criteria per task.

The dataset is part of the MultiChallenge benchmark, which tests whether models can persistently follow instructions across complex multi-turn interactions.

Capabilities

Following complex, multi-constraint instructions across conversation turns
Maintaining behavioral consistency over extended multi-turn dialogues
Adhering to formatting, tone, length, and content requirements
Retaining instructions under conversational pressure

Compute Requirements

Nemotron IF-Eval does not require a sandbox and has minimal compute requirements.

License

CC-BY-4.0.

Tasks

There are 2,011 tasks in a single train split. Each task presents a multi-turn conversation prompt containing system, user, and assistant messages (9-32 messages per task, mean 15.2). The agent must provide the next assistant response, which is graded against 1-6 evaluation rubric items per task (mean 2.6). Rubric items test specific instruction-following criteria such as tone, formatting constraints, content requirements, and behavioral consistency.
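The task shape described above can be sketched as a small data model with a sanity check on the stated ranges. This is only an illustration: the class names, field names (`role`, `content`, `question`, `expected`), and the `check_task` helper are hypothetical and are not taken from the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

# Roles named in the task description.
VALID_ROLES = {"system", "user", "assistant"}


@dataclass
class Message:
    role: str      # hypothetical field name
    content: str   # hypothetical field name


@dataclass
class RubricItem:
    question: str   # yes/no question posed to the grader (hypothetical)
    expected: bool  # answer that counts as a pass (hypothetical)


@dataclass
class Task:
    messages: List[Message]
    rubric: List[RubricItem]


def check_task(task: Task) -> None:
    """Raise ValueError if a task falls outside the documented ranges."""
    if not 9 <= len(task.messages) <= 32:
        raise ValueError(f"expected 9-32 messages, got {len(task.messages)}")
    if not 1 <= len(task.rubric) <= 6:
        raise ValueError(f"expected 1-6 rubric items, got {len(task.rubric)}")
    for message in task.messages:
        if message.role not in VALID_ROLES:
            raise ValueError(f"unexpected role: {message.role!r}")
```

A check like this is mainly useful when loading the data locally, to catch truncated or malformed records before they reach the grader.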

Reward Structure

This is a sparse-reward environment with continuous scoring: the only reward is issued when the agent calls the answer tool, which it does once per task, and its value is a fraction rather than a binary pass/fail. Each rubric item is evaluated independently by an LLM grader (gpt-5-mini), which answers a yes/no question about the response; the item passes when that answer matches the expected answer for the item. The overall score is the fraction of rubric items passed:

$$\text{Reward} = \frac{\text{rubric items passed}}{\text{total rubric items}}$$

Scores range from 0.0 to 1.0.
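As a worked illustration of the formula, the following minimal sketch computes the reward from per-item grader verdicts. It assumes each verdict and each expected answer can be represented as a boolean (yes = `True`); the function name `rubric_reward` and this representation are hypothetical, not the environment's actual implementation.

```python
from typing import Sequence


def rubric_reward(verdicts: Sequence[bool], expected: Sequence[bool]) -> float:
    """Return the fraction of rubric items whose verdict matches the expected answer.

    verdicts: the grader's yes/no answer for each rubric item (assumed boolean).
    expected: the answer that counts as a pass for each rubric item.
    """
    if len(verdicts) != len(expected):
        raise ValueError("verdicts and expected must have the same length")
    if not verdicts:
        raise ValueError("a task needs at least one rubric item")
    # An item passes when the grader's answer equals the expected answer.
    passed = sum(v == e for v, e in zip(verdicts, expected))
    return passed / len(verdicts)
```

For a task with three rubric items where the grader's answers match the expected answers on two of them, `rubric_reward([True, False, True], [True, True, True])` gives 2/3.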

Data

Conversations are sourced from the Nemotron-RL-Instruction-Following-MultiTurnChat-v1 dataset by NVIDIA. Data files are stored on the OpenReward platform.

Tools

Tool	Description
`answer`	Submit a response to continue the conversation. The response is graded by the LLM grader against the rubric criteria. Returns the overall score. Called once per task.

Time Horizon

Nemotron IF-Eval is a single-turn environment from the agent's perspective. Although the conversation context is multi-turn, the agent receives that context once and submits one response. Each task requires exactly one tool call.

Other Environment Requirements

Nemotron IF-Eval requires an OpenAI API key (`openai_api_key` secret) for LLM-based grading of responses.

Safety

Agents in Nemotron IF-Eval are asked to respond to multi-turn conversations. The environment does not present direct safety risks, as agents only provide text responses with no access to external systems, tools, or the internet.

Citations

@article{sirdeshmukh2025multichallenge,
  title={MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs},
  author={Sirdeshmukh, Ved and Deshpande, Kaustubh and Mols, Johannes and Jin, Lifeng and Cardona, Ed-Yeremai and Lee, Dean and Kritz, Jeremy and Primack, Willow and Yue, Summer and Xing, Chen},
  journal={arXiv preprint arXiv:2501.17399},
  year={2025}
}