MedAgentBench V2 RL Env (Medarc)

MedAgentBench V2 environment for tool-calling evaluation.

Type: RL Env
Publisher: Medarc
Runtime: multi-turn
License: unknown
Size: v0.1.0
Published: Feb 2026

MedAgentBench V2

Overview

  • Environment ID: medagentbenchv2
  • Short description: Tool-calling evaluation environment for clinical EHR tasks using FHIR APIs
  • Tags: medical, ehr, tool-calling, multi-turn, clinical, evaluation

Dataset

  • Primary dataset: MedAgentBench V2 evaluation tasks
  • Source: MedAgentBench GitHub, Paper
  • Task variants:
    • new_patient_tasks: Patient-centric clinical scenarios (default; matches upstream collect_agent_responses.py)
    • new_tasks: General clinical tasks
    • test_data_v2: Extended evaluation dataset
  • Task categories: 10 task types covering various EHR operations (task1-task10)

Task

  • Type: multi-turn tool-calling
  • Rubric overview: Binary scoring (0 or 1) based on successful task completion using category-specific evaluation functions
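
The scoring rule can be pictured as a dispatch from task type to a checker whose result is mapped to 0 or 1. The sketch below is illustrative only: the evaluator functions, the `task_type` and `expected` fields, and the checks themselves are hypothetical and do not reproduce the upstream evaluation functions.

```python
from typing import Callable, Dict

# Hypothetical evaluators: each returns True when the task counts as solved.
def eval_task1(task: dict, answer: list) -> bool:
    # Assumed check: the submitted answer must equal the reference exactly.
    return answer == task["expected"]

def eval_task2(task: dict, answer: list) -> bool:
    # Assumed check: the first value must be numeric and close to the reference.
    try:
        return abs(float(answer[0]) - float(task["expected"][0])) < 0.5
    except (IndexError, TypeError, ValueError):
        return False

# One evaluator per task category (only two shown in this sketch).
EVALUATORS: Dict[str, Callable[[dict, list], bool]] = {
    "task1": eval_task1,
    "task2": eval_task2,
}

def binary_reward(task: dict, answer: list) -> float:
    """Return 1.0 if the category-specific check passes, else 0.0."""
    evaluator = EVALUATORS.get(task["task_type"])
    if evaluator is None:
        return 0.0  # unknown category: treated as unsolved in this sketch
    return 1.0 if evaluator(task, answer) else 0.0
```

There is no partial credit in this scheme: a malformed or missing answer scores the same as a wrong one.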

Note

This environment is a light verifiers wrapper around the original MedAgentBenchV2 code. The primary difference is that the original code uses OpenAI's Responses API, while this environment uses the Chat Completions API through verifiers. The MedAgentBenchV2 prompts were lightly modified to use generic tool calling in place of the original's Responses API format examples.

Prerequisites

Before running evaluations, you must start the FHIR server:

docker pull jyxsu6/medagentbench:latest
docker tag jyxsu6/medagentbench:latest medagentbench
docker run --platform linux/amd64 -e JAVA_TOOL_OPTIONS='-XX:+UseSerialGC -Xms256m -Xmx1024m' -p 8080:8080 medagentbench:latest

Important:

  • The trailing slash in the FHIR URL is required (for example, http://localhost:8080/fhir/)
  • Replace localhost with your actual IP address if running on a remote server
  • Server connectivity is automatically verified before evaluation begins
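
One plausible reason the trailing slash matters is how relative URL resolution works: without it, the last path segment of the base is replaced instead of extended. The sketch below demonstrates this with Python's standard `urljoin`. Whether the environment builds URLs this way is an assumption, and the helper names are hypothetical. The `metadata` path is the standard FHIR endpoint for a server's CapabilityStatement, which makes it a reasonable target for a manual connectivity check; the environment's own check may differ.

```python
from urllib.parse import urljoin

def normalize_fhir_base(url: str) -> str:
    """Return the base URL with exactly one trailing slash (hypothetical helper)."""
    return url.rstrip("/") + "/"

def capability_url(fhir_api_base: str) -> str:
    """Build the URL of the FHIR CapabilityStatement ([base]/metadata)."""
    return urljoin(normalize_fhir_base(fhir_api_base), "metadata")

# Without the trailing slash, "fhir" is treated as a file name and dropped.
without_slash = urljoin("http://localhost:8080/fhir", "Patient")
# With the trailing slash, the resource path is appended under /fhir/.
with_slash = urljoin("http://localhost:8080/fhir/", "Patient")

print(without_slash)  # http://localhost:8080/Patient
print(with_slash)     # http://localhost:8080/fhir/Patient
```

Requesting the URL returned by `capability_url(...)` in a browser or with any HTTP client is a quick way to confirm the container is up before starting an evaluation.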

Quickstart

Run an evaluation with default settings (requires FHIR server):

prime eval run medagentbenchv2 -m "openai/gpt-5-mini" -n 5 -s -a '{"fhir_api_base": "http://localhost:8080/fhir/"}'

Run a small evaluation on specific task types:

medarc-eval medagentbenchv2 -m "openai/gpt-5-mini" -n 5 -s --fhir-api-base http://localhost:8080/fhir/ --task-types task1 --task-types task2

Notes:

  • Use direct environment flags with medarc-eval (for example, --split validation or --judge-model gpt-5-mini).
  • The FHIR server must be accessible at the specified URL
  • Models should support tool calling for this environment

Environment Arguments

Arg | Type | Default | Description
--- | --- | --- | ---
fhir_api_base | str | Required | Base URL for the FHIR server (must include trailing slash)
tasks_variant | str | "new_patient_tasks" | Task variant to use: "new_patient_tasks", "new_tasks", or "test_data_v2"
tasks_path | str | None | Optional custom path to tasks JSON file (overrides tasks_variant)
task_types | list[str] | None | Optional list of task types to filter (e.g., ["task1", "task2"])
max_turns | int | 8 | Maximum number of interaction turns per task
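
When these arguments are passed as a JSON string (as with the `-a` flag in the Quickstart), it can help to assemble and sanity-check them in code first. The following is a minimal sketch: `build_env_args` is a hypothetical helper, and its validation mirrors the constraints described above, not the environment's actual checks.

```python
import json
from typing import List, Optional

VALID_VARIANTS = {"new_patient_tasks", "new_tasks", "test_data_v2"}

def build_env_args(
    fhir_api_base: str,
    tasks_variant: str = "new_patient_tasks",
    tasks_path: Optional[str] = None,
    task_types: Optional[List[str]] = None,
    max_turns: int = 8,
) -> str:
    """Validate the documented arguments and return them as a JSON string."""
    if not fhir_api_base.endswith("/"):
        raise ValueError("fhir_api_base must include a trailing slash")
    # tasks_path overrides tasks_variant, so only check the variant without it.
    if tasks_path is None and tasks_variant not in VALID_VARIANTS:
        raise ValueError(f"unknown tasks_variant: {tasks_variant!r}")
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")

    args = {
        "fhir_api_base": fhir_api_base,
        "tasks_variant": tasks_variant,
        "max_turns": max_turns,
    }
    # Optional arguments are omitted when unset so the defaults apply.
    if tasks_path is not None:
        args["tasks_path"] = tasks_path
    if task_types:
        args["task_types"] = list(task_types)
    return json.dumps(args)

print(build_env_args("http://localhost:8080/fhir/", task_types=["task1", "task2"]))
```

The printed string can be pasted as the value of `-a`, wrapped in single quotes for the shell.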

Available Tools

The environment provides FHIR-based tools for clinical operations:

  • fhir_patient_search: Search for patients in the EHR
  • fhir_observation_search: Query clinical observations
  • fhir_vitals_search: Search vital signs
  • fhir_vitals_create: Create new vital sign records
  • fhir_medication_request_search: Search medication requests
  • fhir_medication_request_create: Create medication requests
  • fhir_procedure_search: Query procedures
  • fhir_condition_search: Search patient conditions
  • fhir_service_request_create: Create service requests
  • finish: Submit the final answer (required to complete tasks)
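
Because `finish` is required to complete a task and each task is capped at `max_turns` turns, an episode can be thought of as a bounded loop that ends either on `finish` or when the turn budget runs out. The sketch below models only that control flow. The policy interface, the tool argument names (`family`, `answer`), and the placeholder patient ID are hypothetical; no FHIR requests are made, and the real loop is handled by verifiers.

```python
from typing import Callable, List, Optional, Tuple

ToolCall = Tuple[str, dict]  # (tool name, arguments)

def run_episode(
    policy: Callable[[List[ToolCall]], ToolCall],
    max_turns: int = 8,
) -> Optional[list]:
    """Run until the policy calls `finish` or the turn budget is spent.

    Returns the submitted answer, or None if `finish` was never called.
    """
    history: List[ToolCall] = []
    for _ in range(max_turns):
        name, args = policy(history)
        history.append((name, args))
        if name == "finish":
            return args.get("answer")
        # A real environment would execute the tool here and feed the
        # result back to the model; this sketch only tracks the calls.
    return None

def scripted_policy(history: List[ToolCall]) -> ToolCall:
    """Toy policy: search once, then submit a placeholder answer."""
    if not history:
        return ("fhir_patient_search", {"family": "Doe"})
    return ("finish", {"answer": ["S0000000"]})

print(run_episode(scripted_policy))  # ['S0000000']
```

A model that keeps searching and never calls `finish` makes `run_episode` return `None`, which corresponds to an incomplete task.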

Metrics

Metric | Meaning
--- | ---
medagentbench_reward | Binary score (1 if task correctly solved, 0 otherwise); weight 1.0
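
Since each rollout scores 0 or 1, averaging the rewards gives a success rate, and grouping by task type shows where a model struggles. The sketch below assumes results are available as hypothetical `(task_type, reward)` pairs; the actual output format of an evaluation run may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def success_rates(results: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average the binary rewards separately for each task type."""
    by_type = defaultdict(list)
    for task_type, reward in results:
        by_type[task_type].append(reward)
    return {t: sum(r) / len(r) for t, r in by_type.items()}

# Made-up rewards for illustration only.
example = [("task1", 1.0), ("task1", 0.0), ("task2", 1.0)]
print(success_rates(example))  # {'task1': 0.5, 'task2': 1.0}
```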

Task Categories

The environment evaluates 10 distinct task categories, each with specialized evaluation logic:

  • task1-task10: Various EHR operations including patient search, data retrieval, record creation, and clinical decision support