Nemotron RL Instruction Following Adversarial V1

Nemotron-RL-Instruction-Following-Adversarial-v1 focuses on adversarial prompts designed to explicitly conflict with an AI model’s standard training instincts - such as writing code without comments or refusing standard helpfulness norms - across 8 distinct "anti-convention" p…

Type: RL Env
Publisher: NVIDIA
Runtime: ORS
License: unknown
Size: 1000 tasks
Published: Mar 2026

Nemotron-RL-Instruction-Following-Adversarial-v1

OpenReward Environment Hugging Face Dataset

Description

Nemotron-RL-Instruction-Following-Adversarial-v1 is an environment for evaluating agents on adversarial instruction-following tasks. It is based on the Inverse IFEval benchmark and uses NVIDIA's dataset of 1,000 tasks, which test whether language models can override ingrained training patterns to follow unconventional instructions. Task categories include counter-conventional formatting, mid-turn instruction modification, deliberately incorrect answers, counterfactual answering, and question correction. Each task is graded against 3 to 10 rubric criteria by an LLM judge, with a strict binary PASS/FAIL verdict per criterion.

Capabilities

  • Following counter-conventional formatting requirements (custom delimiters, reversed spelling, vowel replacement)
  • Handling mid-turn instruction modifications and conflicting constraints
  • Producing deliberately incorrect answers when explicitly requested
  • Answering based on counterfactual premises without correction
  • Identifying and rejecting flawed questions
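To make the counter-conventional formatting items concrete, here is a minimal Python sketch of two of the transformations named above (reversed spelling and vowel replacement). The function names, the word-by-word reversal, and the `*` replacement character are assumptions for illustration only; the actual constraints are set by each task's prompt and rubric, which are not reproduced here.

```python
def reverse_spelling(text: str) -> str:
    """Reverse the letters of each whitespace-separated word, keeping word order.

    Hypothetical helper: a real task may instead ask for the whole string
    to be reversed, or for some other variant.
    """
    return " ".join(word[::-1] for word in text.split())


def replace_vowels(text: str, replacement: str = "*") -> str:
    """Replace every ASCII vowel with `replacement` (assumed to be '*')."""
    vowels = set("aeiouAEIOU")
    return "".join(replacement if ch in vowels else ch for ch in text)


if __name__ == "__main__":
    print(reverse_spelling("hello world"))  # olleh dlrow
    print(replace_vowels("Adversarial"))    # *dv*rs*r**l
```

An agent would produce text like this directly in its answer; the helpers only show what a compliant output could look like.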

Compute Requirements

This environment does not require a sandbox. It has minimal compute requirements.

License

CC-BY-4.0.

Tasks

There is one split: train (1,000 tasks). Each task presents an adversarial instruction-following prompt. The agent must produce a response that is graded against multiple rubric criteria (3 to 10 per task, 3.7 on average). Tasks span several adversarial categories:

  • Counter-Conventional Formatting
  • Mid-turn Instruction Modification
  • Deliberately Incorrect Answers
  • Counterfactual Answering
  • Question Correction

Reward Structure

This is a sparse-reward environment: the only reward arrives when the agent submits its answer, and that reward is a continuous score rather than a single pass/fail bit. The agent calls the answer tool once with its response, and the environment grades it using an LLM judge (gpt-5-mini). Each rubric criterion is graded independently as binary PASS or FAIL. The overall score is the fraction of criteria passed:

$$\text{Reward} = \frac{\text{number of PASS criteria}}{\text{total criteria}}$$

Scores range from 0.0 to 1.0.
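The formula can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not the environment's grading code; the function name `rubric_reward` and the choice to reject empty or malformed verdict lists are assumptions.

```python
from typing import Sequence


def rubric_reward(verdicts: Sequence[str]) -> float:
    """Return the fraction of rubric criteria marked PASS.

    `verdicts` holds one "PASS" or "FAIL" string per criterion.
    Raising on empty or unknown verdicts is an assumption of this sketch.
    """
    if not verdicts:
        raise ValueError("at least one rubric criterion is required")
    normalized = [v.strip().upper() for v in verdicts]
    invalid = [v for v in normalized if v not in {"PASS", "FAIL"}]
    if invalid:
        raise ValueError(f"unexpected verdicts: {invalid}")
    return normalized.count("PASS") / len(normalized)


if __name__ == "__main__":
    # Three of four criteria passed.
    print(rubric_reward(["PASS", "FAIL", "PASS", "PASS"]))  # 0.75
```

For example, a task with four criteria of which three pass yields a reward of 0.75, while a response that fails every criterion yields 0.0.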

Data

Tasks are sourced from the Nemotron-RL-Instruction-Following-Adversarial-v1 dataset by NVIDIA, which includes 1,000 adversarial instruction-following prompts with per-task judge prompts and rubric criteria. Data files are stored on the OpenReward platform.

Tools

Agents are given a single tool:

  • answer: Submit a response to the instruction-following task. The response is graded by the LLM judge against the rubric criteria. Returns the overall score and per-criterion results. This tool can only be called once per task.
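The once-per-task behaviour of the answer tool can be modelled as below. This is a hypothetical sketch, not the environment's implementation: the class `SingleAnswerTask`, the exception name, the shape of the returned dictionary, and the injected `judge` callable (a stand-in for the LLM judge) are all assumptions.

```python
from typing import Callable, Dict, List


class AnswerAlreadySubmitted(RuntimeError):
    """Raised when answer() is called a second time for the same task."""


class SingleAnswerTask:
    """Toy model of a task whose answer tool accepts exactly one call."""

    def __init__(self, criteria: List[str], judge: Callable[[str, str], bool]):
        if not criteria:
            raise ValueError("a task needs at least one rubric criterion")
        self._criteria = list(criteria)
        self._judge = judge  # stand-in for the LLM judge, returns True for PASS
        self._submitted = False

    def answer(self, response: str) -> Dict[str, object]:
        if self._submitted:
            raise AnswerAlreadySubmitted("answer can only be called once per task")
        self._submitted = True
        # Grade each criterion independently as PASS or FAIL.
        results = {
            criterion: "PASS" if self._judge(response, criterion) else "FAIL"
            for criterion in self._criteria
        }
        passed = sum(1 for verdict in results.values() if verdict == "PASS")
        return {"score": passed / len(results), "criteria": results}


if __name__ == "__main__":
    # Stub judge: a criterion passes if its text appears in the response.
    task = SingleAnswerTask(["no commas", "lowercase"], lambda r, c: c in r)
    print(task.answer("lowercase only"))
```

Injecting the judge as a callable keeps the sketch runnable without network access; a real judge would call an LLM with the per-task judge prompt.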

Time Horizon

This is a single-turn environment. The agent receives an instruction prompt and submits one answer. Each task requires exactly one tool call.

Other Environment Requirements

This environment requires an OpenAI API key (OPENAI_API_KEY secret) for LLM-based grading of answers.
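A small pre-flight check can catch a missing key before any task is run. The sketch below assumes the secret is exposed to the process as the OPENAI_API_KEY environment variable; the function name `require_openai_key` and the optional `env` parameter (used for testing) are hypothetical.

```python
import os
from typing import Mapping, Optional


def require_openai_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the OPENAI_API_KEY value, or raise if it is missing or blank.

    `env` defaults to os.environ; passing a plain dict makes the check
    testable without touching the real environment.
    """
    source = os.environ if env is None else env
    key = source.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; LLM-based grading cannot run")
    return key


if __name__ == "__main__":
    try:
        require_openai_key()
        print("OPENAI_API_KEY found")
    except RuntimeError as err:
        print(err)
```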

Safety

Agents are asked to follow unusual instruction patterns, some of which involve producing deliberately incorrect information. The environment does not present direct safety risks, as agents only provide text answers with no access to external systems, tools, or the internet. The adversarial tasks test instruction-following compliance, not harmful content generation.

Citations

@dataset{nvidia2024nemotron_adversarial,
  title={Nemotron-RL-Instruction-Following-Adversarial-v1},
  author={NVIDIA},
  year={2024},
  publisher={Hugging Face},
  license={CC-BY-4.0},
  url={https://huggingface.co/datasets/nvidia/Nemotron-RL-Instruction-Following-Adversarial-v1}
}