Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1
Description
An environment for evaluating agents on agentic tool-use decision-making in customer service scenarios. It is based on NVIDIA's Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1 dataset. Each task presents a multi-turn conversation, together with the available tools, at a specific decision point: the agent must predict the correct next action -- either calling a specific function with the right arguments, or responding with a message.
Capabilities
- Deciding when to call a tool vs. respond with a message
- Selecting the correct function from a set of available tools
- Generating correct function arguments as JSON
- Multi-turn conversation comprehension
- Following domain-specific policies and constraints
- Reasoning about tool capabilities relative to user requests
Compute Requirements
This environment does not require a sandbox. It has minimal compute requirements.
License
Tasks
There is one split with 96,968 tasks:
- train (96,968 tasks): Pre-extracted decision points from multi-turn customer service conversations. 67.6% function call / 32.4% message.
Domains include senior care services, sports merchandise ordering, test prep platforms, organic farming supplies, renewable energy installations, and more.
Reward Structure
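This is a sparse reward environment with continuous scoring: the agent makes a single submission per task and receives one reward between 0.0 and 1.0 for it. How that reward is computed depends on the expected action type: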
- Function call tasks: Reward = 0.5 * (name match) + 0.5 * (argument match). Name match is binary (0 or 1). Argument match is the fraction of key-value pairs that match between expected and submitted arguments, using exact/substring string comparison. Note: ~20% of function call tasks are transfer/escalate functions with a single free-text summary argument. String matching is effectively uninformative for these, so the reward acts as a binary signal on function name correctness (0.0 or 0.5).
- Message tasks: Reward is computed via LLM grading (gpt-5-mini) if an OpenAI API key is provided, or via keyword overlap fallback otherwise. Scores range from 0.0 to 1.0.
- Wrong action type: Calling a function when a message was expected (or vice versa) yields reward 0.0.
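The function call formula above can be illustrated with a minimal Python sketch. It is not the environment's actual grader. The names function_call_reward, argument_match, and values_match are hypothetical. The sketch also makes several assumptions: substring matches count in either direction, matching is case-sensitive, the denominator is the number of expected arguments, and the name and argument terms are scored independently.

```python
from typing import Any, Dict


def values_match(expected: Any, submitted: Any) -> bool:
    """Exact or substring match for strings, plain equality otherwise (assumed rule)."""
    if isinstance(expected, str) and isinstance(submitted, str):
        e, s = expected.strip(), submitted.strip()
        if not e or not s:
            # Guard: an empty string is a substring of everything in Python.
            return e == s
        return e == s or e in s or s in e
    return expected == submitted


def argument_match(expected: Dict[str, Any], submitted: Dict[str, Any]) -> float:
    """Fraction of expected key-value pairs matched by the submission."""
    if not expected:
        # Assumption: with no expected arguments, only an empty submission matches.
        return 1.0 if not submitted else 0.0
    hits = sum(
        1
        for key, value in expected.items()
        if key in submitted and values_match(value, submitted[key])
    )
    return hits / len(expected)


def function_call_reward(
    expected_name: str,
    expected_args: Dict[str, Any],
    submitted_name: str,
    submitted_args: Dict[str, Any],
) -> float:
    """Reward = 0.5 * (name match) + 0.5 * (argument match)."""
    name_match = 1.0 if expected_name == submitted_name else 0.0
    return 0.5 * name_match + 0.5 * argument_match(expected_args, submitted_args)


if __name__ == "__main__":
    # Correct function, one of two arguments correct.
    print(function_call_reward(
        "check_coverage",
        {"client_id": "C123", "service": "home care"},
        "check_coverage",
        {"client_id": "C123", "service": "meal delivery"},
    ))
```

Under these assumptions, a transfer/escalate call with the right function name but a differently worded summary scores 0.5. This is the binary behavior described in the note above.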
Data
Decision points are sourced from the Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1 dataset by NVIDIA. Each row is a pre-pivoted decision point containing the conversation context (all messages before the decision point), available tools, and the expected action (function call or message). The download_data.py script fetches the JSONL from HuggingFace and converts it to parquet format locally.
Tools
This environment uses task-specific tools. Each task dynamically exposes the actual tools from the dataset (e.g., authenticate_client, check_coverage, schedule_service) via list_task_tools(). The agent interacts with these tools through native function calling.
In addition, there is one shared tool:
submit_message: Submit a text message response. Use when no function call is appropriate and the agent should respond directly to the user.
Time Horizon
This is a single-turn environment. The agent receives a conversation context and submits one action. Each task requires exactly one tool call.
Other Environment Requirements
This environment optionally accepts an OpenAI API key (openai_api_key secret) for LLM-based grading of message responses. Without it, a simple keyword-overlap fallback grader is used for message tasks. Function call tasks do not require an API key.
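To give a sense of what a keyword-overlap fallback could look like, here is a minimal sketch. The environment's actual fallback grader is not documented here. The function name keyword_overlap, the tokenization rule, and the choice to normalize by the expected message's vocabulary are all assumptions made for illustration.

```python
import re
from typing import Set


def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_overlap(expected: str, submitted: str) -> float:
    """Share of the expected message's distinct tokens found in the submission."""
    expected_tokens = _tokens(expected)
    if not expected_tokens:
        return 0.0
    submitted_tokens = _tokens(submitted)
    return len(expected_tokens & submitted_tokens) / len(expected_tokens)


if __name__ == "__main__":
    print(keyword_overlap("Your order has shipped", "your order shipped yesterday"))
```

A score of this kind stays within the 0.0 to 1.0 range but only rewards lexical overlap, so a correct paraphrase can score low. That limitation is one reason to supply the API key when message tasks matter.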
Safety
Agents are asked to predict the next action in a synthetic conversation. The environment does not present direct safety risks, as agents only submit predictions with no access to external systems, real tools, or the internet.
Citations
@dataset{nvidia_nemotron_rl_agentic_pivot_v1,
author = {NVIDIA Corporation},
title = {Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/nvidia/Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1},
license = {CC-BY-4.0}
}