0

Nemotron RL Agentic Conversational Tool Use Pivot V1

Fresh

This environment is for conversational tool-use and utilises existing expert tool-use trajectories. Each assistant step of the trajectory is posed as a separate behavior cloning problem where the policy model is incentivized to match the tool call choices of the expert model.

Type
RL Env
Publisher
NVIDIA
Runtime
ORS
License
unknown
Size
96968 tasks
Published
Mar 2026

Cite

Notes

Only stored in your browser.

Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1

OpenReward Environment Hugging Face Dataset

Description

An environment for evaluating agents on agentic tool-use decision-making in customer service scenarios. Based on the Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1 dataset from NVIDIA. Each task presents a multi-turn conversation with available tools at a specific decision point: the agent must predict the correct next action -- either calling a specific function with the right arguments, or responding with a message.

Capabilities

  • Deciding when to call a tool vs. respond with a message
  • Selecting the correct function from a set of available tools
  • Generating correct function arguments as JSON
  • Multi-turn conversation comprehension
  • Following domain-specific policies and constraints
  • Reasoning about tool capabilities relative to user requests

Compute Requirements

This environment does not require a sandbox. It has minimal compute requirements.

License

CC-BY-4.0.

Tasks

There is one split with 96,968 tasks:

  • train (96,968 tasks): Pre-extracted decision points from multi-turn customer service conversations. 67.6% function call / 32.4% message.

Domains include senior care services, sports merchandise ordering, test prep platforms, organic farming supplies, renewable energy installations, and more.

Reward Structure

This is a sparse reward environment with continuous scoring. The agent makes a single submission per task:

  • Function call tasks: Reward = 0.5 * (name match) + 0.5 * (argument match). Name match is binary (0 or 1). Argument match is the fraction of key-value pairs that match between expected and submitted arguments, using exact/substring string comparison. Note: ~20% of function call tasks are transfer/escalate functions with a single free-text summary argument. String matching is effectively uninformative for these, so the reward acts as a binary signal on function name correctness (0.0 or 0.5).
  • Message tasks: Reward is computed via LLM grading (gpt-5-mini) if an OpenAI API key is provided, or via keyword overlap fallback otherwise. Scores range from 0.0 to 1.0.
  • Wrong action type: Calling a function when a message was expected (or vice versa) yields reward 0.0.

Data

Decision points are sourced from the Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1 dataset by NVIDIA. Each row is a pre-pivoted decision point containing the conversation context (all messages before the decision point), available tools, and the expected action (function call or message). The download_data.py script fetches the JSONL from HuggingFace and converts it to parquet format locally.

Tools

This environment uses task-specific tools. Each task dynamically exposes the actual tools from the dataset (e.g., authenticate_client, check_coverage, schedule_service) via list_task_tools(). The agent interacts with these tools through native function calling.

In addition, there is one shared tool:

  • submit_message: Submit a text message response. Use when no function call is appropriate and the agent should respond directly to the user.

Time Horizon

This is a single-turn environment. The agent receives a conversation context and submits one action. Each task requires exactly one tool call.

Other Environment Requirements

This environment optionally accepts an OpenAI API key (openai_api_key secret) for LLM-based grading of message responses. Without it, a simple keyword-overlap fallback grader is used for message tasks. Function call tasks do not require an API key.

Safety

Agents are asked to predict the next action in a synthetic conversation. The environment does not present direct safety risks, as agents only submit predictions with no access to external systems, real tools, or the internet.

Citations

@dataset{nvidia_nemotron_rl_agentic_pivot_v1,
  author    = {NVIDIA Corporation},
  title     = {Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/nvidia/Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1},
  license   = {CC-BY-4.0}
}