Nemotron-RL-Instruction-Following-Adversarial-v1
Description
Nemotron-RL-Instruction-Following-Adversarial-v1 is an environment for evaluating agents on adversarial instruction-following tasks. It is based on the Inverse IFEval benchmark from NVIDIA and consists of 1,000 tasks that test whether language models can overcome ingrained training patterns to follow unconventional instructions. Task categories are counter-conventional formatting, mid-turn instruction modification, deliberately incorrect answers, counterfactual answering, and question correction. Each task is graded against 3-10 rubric criteria by an LLM judge that assigns a strict binary PASS or FAIL to each criterion.
Capabilities
- Following counter-conventional formatting requirements (custom delimiters, reversed spelling, vowel replacement)
- Handling mid-turn instruction modifications and conflicting constraints
- Producing deliberately incorrect answers when explicitly requested
- Answering based on counterfactual premises without correction
- Identifying and rejecting flawed questions
Compute Requirements
This environment does not require a sandbox. It has minimal compute requirements.
License
The source dataset is released under CC-BY-4.0, as recorded in the Citations entry.
Tasks
There is one split: train (1,000 tasks). Each task presents an adversarial instruction-following prompt, and the agent's response is graded against that task's rubric criteria (3-10 per task, 3.7 on average). Tasks span five adversarial categories:
- Counter-Conventional Formatting
- Mid-turn Instruction Modification
- Deliberately Incorrect Answers
- Counterfactual Answering
- Question Correction
Reward Structure
Reward is sparse but not binary: it is issued once per task, when the agent submits its answer, and takes a fractional value. The agent calls the answer tool once with its response, and the environment grades it using an LLM judge (gpt-5-mini). Each rubric criterion is graded independently as binary PASS or FAIL. The overall score is the fraction of criteria passed:
$$\text{Reward} = \frac{\text{number of PASS criteria}}{\text{total criteria}}$$
Scores range from 0.0 to 1.0.
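The formula above can be sketched in a few lines of Python. This is an illustration only, not the environment's grading code: the function name `rubric_reward` and the representation of judge verdicts as a list of `"PASS"` / `"FAIL"` strings are assumptions made for the example.

```python
from typing import Sequence


def rubric_reward(verdicts: Sequence[str]) -> float:
    """Return the fraction of rubric criteria judged PASS.

    `verdicts` is a hypothetical representation: one "PASS" or "FAIL"
    string per rubric criterion, in any order.
    """
    if not verdicts:
        raise ValueError("a task needs at least one rubric criterion")
    unknown = [v for v in verdicts if v not in ("PASS", "FAIL")]
    if unknown:
        raise ValueError(f"unexpected verdicts: {unknown}")
    # Each criterion counts equally; there is no partial credit per criterion.
    return sum(v == "PASS" for v in verdicts) / len(verdicts)


# A task with four criteria, three of which pass.
print(rubric_reward(["PASS", "FAIL", "PASS", "PASS"]))  # 0.75
```

Because every task has between 3 and 10 criteria, the reward for a single task can only take a small set of fractional values, such as 0, 1/3, 2/3, and 1 for a three-criterion task.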
Data
Tasks are sourced from the Nemotron-RL-Instruction-Following-Adversarial-v1 dataset by NVIDIA, which includes 1,000 adversarial instruction-following prompts with per-task judge prompts and rubric criteria. Data files are stored on the OpenReward platform.
Tools
Agents are given a single tool:
- answer: Submit a response to the instruction-following task. The response is graded by the LLM judge against the rubric criteria. Returns the overall score and per-criterion results. This tool can only be called once per task.
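The once-per-task behavior of the tool can be pictured with a small wrapper. Everything here is hypothetical: the class `AnswerTool`, the `grade` callable standing in for the LLM judge, and the shape of the returned dictionary are assumptions for illustration, not the environment's actual interface.

```python
from typing import Callable, List


class AnswerTool:
    """Hypothetical wrapper that allows a single `answer` call per task."""

    def __init__(self, grade: Callable[[str], List[str]]):
        # `grade` stands in for the LLM judge: it maps a response to one
        # "PASS"/"FAIL" verdict per rubric criterion.
        self._grade = grade
        self._used = False

    def answer(self, response: str) -> dict:
        if self._used:
            raise RuntimeError("answer can only be called once per task")
        # Mark the call as consumed before grading, so a failed grading
        # attempt does not grant a second submission.
        self._used = True
        verdicts = self._grade(response)
        score = sum(v == "PASS" for v in verdicts) / len(verdicts)
        return {"score": score, "criteria": verdicts}


# Stub judge that passes one of two criteria regardless of the response.
tool = AnswerTool(lambda response: ["PASS", "FAIL"])
print(tool.answer("my response"))  # {'score': 0.5, 'criteria': ['PASS', 'FAIL']}
```

A second call to `tool.answer(...)` on the same instance raises `RuntimeError`, which mirrors the one-call limit described above.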
Time Horizon
This is a single-turn environment. The agent receives an instruction prompt and submits one answer. Each task requires exactly one tool call.
Other Environment Requirements
This environment requires an OpenAI API key (OPENAI_API_KEY secret) for LLM-based grading of answers.
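A missing key is easier to diagnose if it is checked before any task is run. The sketch below is a generic pre-flight check, assuming the secret is exposed to the process as the `OPENAI_API_KEY` environment variable; the helper name `require_openai_key` is hypothetical and not part of the environment.

```python
import os
from typing import Mapping


def require_openai_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the OpenAI API key from `env`, or fail with a clear message."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; the LLM judge cannot grade answers"
        )
    return key


# Passing a mapping explicitly keeps the check easy to exercise in isolation.
print(require_openai_key({"OPENAI_API_KEY": "sk-example"}))  # sk-example
```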
Safety
Agents are asked to follow unusual instruction patterns, some of which involve producing deliberately incorrect information. The environment does not present direct safety risks: agents only submit text answers and have no access to external systems, other tools, or the internet. The adversarial tasks test instruction-following compliance, not harmful content generation.
Citations
@dataset{nvidia2024nemotron_adversarial,
title={Nemotron-RL-Instruction-Following-Adversarial-v1},
author={NVIDIA},
year={2024},
publisher={Hugging Face},
license={CC-BY-4.0},
url={https://huggingface.co/datasets/nvidia/Nemotron-RL-Instruction-Following-Adversarial-v1}
}