Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1

Description

An environment for evaluating agents on tool-use decision-making in customer service scenarios, based on the Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1 dataset from NVIDIA. Each task presents a multi-turn conversation and a set of available tools at a specific decision point. The agent must predict the correct next action: either calling a specific function with the right arguments, or responding with a message.

Capabilities

Deciding when to call a tool vs. respond with a message
Selecting the correct function from a set of available tools
Generating correct function arguments as JSON
Multi-turn conversation comprehension
Following domain-specific policies and constraints
Reasoning about tool capabilities relative to user requests

Compute Requirements

This environment does not require a sandbox. It has minimal compute requirements.

License

CC-BY-4.0.

Tasks

There is one split with 96,968 tasks:

train (96,968 tasks): Pre-extracted decision points from multi-turn customer service conversations. 67.6% of tasks expect a function call and 32.4% expect a message.

Domains include senior care services, sports merchandise ordering, test prep platforms, organic farming supplies, renewable energy installations, and more.

Reward Structure

This is a sparse reward environment with continuous scoring: the agent makes a single submission per task and receives one reward between 0.0 and 1.0, determined as follows:

Function call tasks: Reward = 0.5 * (name match) + 0.5 * (argument match). Name match is binary (0 or 1). Argument match is the fraction of key-value pairs that match between the expected and submitted arguments, where values are compared as strings by exact or substring match. Note: ~20% of function call tasks are transfer/escalate functions with a single free-text summary argument. String matching is effectively uninformative for free text, so for these tasks the reward acts as a binary signal on function name correctness (0.0 or 0.5).
Message tasks: Reward is computed via LLM grading (gpt-5-mini) if an OpenAI API key is provided, or via a keyword-overlap fallback otherwise. Scores range from 0.0 to 1.0.
Wrong action type: Calling a function when a message was expected (or vice versa) yields reward 0.0.
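
The function call formula above can be sketched in a few lines of Python. This is an illustrative sketch, not the environment's actual grader: the helper names (values_match, argument_match, function_call_reward) are hypothetical, and several details are assumptions that the description leaves open, namely case-insensitive comparison, substring matching in either direction, using the expected arguments as the denominator, and scoring arguments even when the function name is wrong.

```python
from typing import Any, Dict


def values_match(expected: Any, submitted: Any) -> bool:
    """Compare two argument values.

    Assumption: strings match if equal or if one contains the other
    (case-insensitive); all other types must be equal.
    """
    if isinstance(expected, str) and isinstance(submitted, str):
        e, s = expected.strip().lower(), submitted.strip().lower()
        if not e or not s:
            return e == s  # an empty string only matches another empty string
        return e == s or e in s or s in e
    return expected == submitted


def argument_match(expected: Dict[str, Any], submitted: Dict[str, Any]) -> float:
    """Fraction of expected key-value pairs reproduced in the submission."""
    if not expected:
        # Assumption: with no expected arguments, only an empty submission matches.
        return 1.0 if not submitted else 0.0
    hits = sum(
        1 for key, value in expected.items()
        if key in submitted and values_match(value, submitted[key])
    )
    return hits / len(expected)


def function_call_reward(
    expected_name: str,
    expected_args: Dict[str, Any],
    submitted_name: str,
    submitted_args: Dict[str, Any],
) -> float:
    """Reward = 0.5 * (name match) + 0.5 * (argument match)."""
    name_match = 1.0 if submitted_name == expected_name else 0.0
    return 0.5 * name_match + 0.5 * argument_match(expected_args, submitted_args)


# Correct function, one of two arguments correct.
print(function_call_reward(
    "check_coverage", {"client_id": "C123", "service": "home care"},
    "check_coverage", {"client_id": "C123", "service": "meal delivery"},
))  # 0.75
```

Under these assumptions, a transfer-style call with the right name but a differently worded free-text summary scores 0.5, which is the behavior the note on transfer/escalate functions describes.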

Data

Decision points are sourced from the Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1 dataset by NVIDIA. Each row is a pre-pivoted decision point containing the conversation context (all messages before the decision point), the available tools, and the expected action (function call or message). The download_data.py script fetches the JSONL from Hugging Face and converts it to Parquet format locally.

Tools

This environment uses task-specific tools. Each task dynamically exposes the actual tools from the dataset (e.g., authenticate_client, check_coverage, schedule_service) via list_task_tools(). The agent interacts with these tools through native function calling.

In addition, there is one shared tool:

submit_message: Submit a text message response. Use when no function call is appropriate and the agent should respond directly to the user.

Time Horizon

This is a single-turn environment. The agent receives a conversation context and submits one action. Each task requires exactly one tool call: either one of the task-specific functions or submit_message.

Other Environment Requirements

This environment optionally accepts an OpenAI API key (openai_api_key secret) for LLM-based grading of message responses. Without it, a simple keyword-overlap fallback grader is used for message tasks. Function call tasks do not require an API key.
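
To give a sense of what a keyword-overlap fallback can look like, here is a minimal sketch. The environment's actual fallback is not specified here, so the function name keyword_overlap_score, the tokenization (lowercased alphanumeric runs), and the choice to measure the share of expected-message keywords that appear in the submission are all assumptions.

```python
import re
from typing import Set


def _keywords(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap_score(expected: str, submitted: str) -> float:
    """Share of the expected message's keywords found in the submitted message.

    Returns a value in [0.0, 1.0]. Assumption: an expected message with no
    keywords scores 0.0 rather than raising an error.
    """
    expected_words = _keywords(expected)
    if not expected_words:
        return 0.0
    return len(expected_words & _keywords(submitted)) / len(expected_words)


# Three of the four expected keywords ("your", "order", "shipped") appear.
print(keyword_overlap_score("Your order has shipped", "your order shipped today"))  # 0.75
```

A score like this rewards lexical overlap only and cannot recognize a correct paraphrase, which is one reason to supply the API key when message-task rewards matter.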

Safety

Agents are asked to predict the next action in a synthetic conversation. The environment does not present direct safety risks, as agents only submit predictions with no access to external systems, real tools, or the internet.

Citations

@dataset{nvidia_nemotron_rl_agentic_pivot_v1,
  author    = {NVIDIA Corporation},
  title     = {Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/nvidia/Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1},
  license   = {CC-BY-4.0}
}