MedAgentBench V2
Overview
- Environment ID: medagentbenchv2
- Short description: Tool-calling evaluation environment for clinical EHR tasks using FHIR APIs
- Tags: medical, ehr, tool-calling, multi-turn, clinical, evaluation
Dataset
- Primary dataset: MedAgentBench V2 evaluation tasks
- Source: MedAgentBench GitHub, Paper
- Task variants:
  - new_patient_tasks: Patient-centric clinical scenarios (default; matches upstream collect_agent_responses.py)
  - new_tasks: General clinical tasks
  - test_data_v2: Extended evaluation dataset
- Task categories: 10 task types covering various EHR operations (task1-task10)
Task
- Type: multi-turn tool-calling
- Rubric overview: Binary scoring (0 or 1) based on successful task completion using category-specific evaluation functions
Note
This environment is a light verifiers wrapper around the original MedAgentBenchV2 code. The primary difference is that the original code uses OpenAI's Responses API, while this environment uses the Chat Completions API through verifiers. The MedAgentBenchV2 prompts were lightly modified to describe generic tool calling in place of the original's Responses API format examples.
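To make the API difference concrete, here is a minimal sketch of how a function tool definition changes shape between the two APIs: the Responses API uses a flat definition, while Chat Completions nests the same fields under a "function" key. This is not the wrapper's actual code. The helper name responses_tool_to_chat_tool and the parameter schema shown for fhir_patient_search are assumptions made for illustration only.

```python
from typing import Any


def responses_tool_to_chat_tool(tool: dict[str, Any]) -> dict[str, Any]:
    """Convert a flat Responses-style function tool to Chat Completions style.

    Hypothetical helper for illustration; not part of MedAgentBench or verifiers.
    """
    if tool.get("type") != "function":
        raise ValueError("only function tools are handled in this sketch")
    function = {
        "name": tool["name"],
        "description": tool.get("description", ""),
        # Fall back to an empty object schema when no parameters are declared.
        "parameters": tool.get("parameters", {"type": "object", "properties": {}}),
    }
    # Chat Completions nests name/description/parameters under "function".
    return {"type": "function", "function": function}


# Assumed schema: the real tool's parameters may differ.
responses_style = {
    "type": "function",
    "name": "fhir_patient_search",
    "description": "Search for patients in the EHR",
    "parameters": {
        "type": "object",
        "properties": {"identifier": {"type": "string"}},
        "required": ["identifier"],
    },
}

chat_style = responses_tool_to_chat_tool(responses_style)
```

The conversion copies the fields without changing them, so a tool's JSON Schema stays identical under either API.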
Prerequisites
Before running evaluations, you must start the FHIR server:
docker pull jyxsu6/medagentbench:latest
docker tag jyxsu6/medagentbench:latest medagentbench
docker run --platform linux/amd64 -e JAVA_TOOL_OPTIONS='-XX:+UseSerialGC -Xms256m -Xmx1024m' -p 8080:8080 medagentbench:latest
Important:
- The trailing slash in the FHIR URL is required
- Replace localhost with your actual IP address if running on a remote server
- Server connectivity is automatically verified before evaluation begins
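The sketch below shows one plausible reason the trailing slash matters, along with a way to check the server by hand before starting a run. It assumes resource URLs are resolved relative to the base URL, which may not be how the environment builds them. The function names are hypothetical. The check requests the standard FHIR metadata endpoint (the server's CapabilityStatement), and the environment's own connectivity check may work differently.

```python
from http.client import HTTPException
from urllib.parse import urljoin
from urllib.request import urlopen


def normalize_fhir_base(url: str) -> str:
    """Return the base URL with exactly the trailing slash it needs."""
    return url if url.endswith("/") else url + "/"


def resource_url(fhir_api_base: str, path: str) -> str:
    """Resolve a FHIR resource path relative to the base URL."""
    return urljoin(fhir_api_base, path)


def server_is_reachable(fhir_api_base: str, timeout: float = 5.0) -> bool:
    """Return True if GET [base]metadata answers with HTTP 200."""
    try:
        with urlopen(resource_url(fhir_api_base, "metadata"), timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError, HTTPException):
        # Connection refused, timeout, HTTP error, or malformed URL.
        return False


# Without the trailing slash, relative resolution drops the "fhir" segment.
without_slash = resource_url("http://localhost:8080/fhir", "Patient")
with_slash = resource_url("http://localhost:8080/fhir/", "Patient")
```

Here without_slash resolves to http://localhost:8080/Patient, while with_slash resolves to http://localhost:8080/fhir/Patient.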
Quickstart
Run an evaluation with default settings (requires FHIR server):
prime eval run medagentbenchv2 -m "openai/gpt-5-mini" -n 5 -s -a '{"fhir_api_base": "http://localhost:8080/fhir/"}'
Run a small evaluation on specific task types:
medarc-eval medagentbenchv2 -m "openai/gpt-5-mini" -n 5 -s --fhir-api-base http://localhost:8080/fhir/ --task-types task1 --task-types task2
Notes:
- Use direct environment flags with medarc-eval (for example, --split validation or --judge-model gpt-5-mini).
- The FHIR server must be accessible at the specified URL
- Models should support tool calling for this environment
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| fhir_api_base | str | Required | Base URL for the FHIR server (must include trailing slash) |
| tasks_variant | str | "new_patient_tasks" | Task variant to use: "new_patient_tasks", "new_tasks", or "test_data_v2" |
| tasks_path | str | None | Optional custom path to tasks JSON file (overrides tasks_variant) |
| task_types | list[str] | None | Optional list of task types to filter (e.g., ["task1", "task2"]) |
| max_turns | int | 8 | Maximum number of interaction turns per task |
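The environment arguments are passed to the CLI as a JSON object through -a. A small sketch like the following can build and sanity-check that object before a run. The helper build_env_args and its validation rules are assumptions based on the table above, not code from the environment.

```python
import json

# Variant names as listed in the arguments table.
VALID_VARIANTS = {"new_patient_tasks", "new_tasks", "test_data_v2"}


def build_env_args(fhir_api_base, tasks_variant="new_patient_tasks",
                   task_types=None, max_turns=8):
    """Build the environment-argument dict (hypothetical helper)."""
    if not fhir_api_base.endswith("/"):
        raise ValueError("fhir_api_base must include a trailing slash")
    if tasks_variant not in VALID_VARIANTS:
        raise ValueError(f"unknown tasks_variant: {tasks_variant!r}")
    args = {
        "fhir_api_base": fhir_api_base,
        "tasks_variant": tasks_variant,
        "max_turns": max_turns,
    }
    if task_types is not None:
        # Only include the filter when the caller asked for specific task types.
        args["task_types"] = list(task_types)
    return args


env_args_json = json.dumps(
    build_env_args("http://localhost:8080/fhir/", task_types=["task1", "task2"])
)
print(env_args_json)  # paste this string after -a, wrapped in single quotes
```

Failing early on a missing trailing slash or a misspelled variant name is cheaper than discovering the problem after the evaluation has started.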
Available Tools
The environment provides FHIR-based tools for clinical operations:
- fhir_patient_search: Search for patients in the EHR
- fhir_observation_search: Query clinical observations
- fhir_vitals_search: Search vital signs
- fhir_vitals_create: Create new vital sign records
- fhir_medication_request_search: Search medication requests
- fhir_medication_request_create: Create medication requests
- fhir_procedure_search: Query procedures
- fhir_condition_search: Search patient conditions
- fhir_service_request_create: Create service requests
- finish: Submit the final answer (required to complete tasks)
Metrics
| Metric | Meaning |
|---|---|
| medagentbench_reward | Binary score (1 if task correctly solved, 0 otherwise); weight 1.0 |
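Because the reward is binary, the mean reward over a set of tasks is simply the fraction solved. The sketch below groups per-task rewards by task type to get a success rate per category. The function name and the example results are made up for illustration and are not output from the benchmark.

```python
from collections import defaultdict


def success_rate_by_task_type(results):
    """Compute the fraction of solved tasks per task type.

    `results` is an iterable of (task_type, reward) pairs with reward 0 or 1.
    """
    totals = defaultdict(int)
    solved = defaultdict(int)
    for task_type, reward in results:
        if reward not in (0, 1):
            raise ValueError(f"expected a binary reward, got {reward!r}")
        totals[task_type] += 1
        solved[task_type] += int(reward)
    return {task_type: solved[task_type] / totals[task_type] for task_type in totals}


# Illustrative data only, not real benchmark results.
example_results = [("task1", 1), ("task1", 0), ("task2", 1), ("task2", 1)]
rates = success_rate_by_task_type(example_results)
```

Reporting per-category rates alongside the overall mean helps show whether a model fails on one kind of EHR operation in particular.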
Task Categories
The environment evaluates 10 distinct task categories, each with specialized evaluation logic:
- task1-task10: Various EHR operations including patient search, data retrieval, record creation, and clinical decision support