MTSamples Replicate Benchmark
Overview
- Environment ID: mtsamples_replicate
- Short description: Given patient notes with the PLAN section removed, generate a reasonable treatment plan. Evaluation adapted from HELM's MTSamples Replicate scenario.
- Tags: medical, clinical, single-turn, summarization, llm-judge, eval
Datasets
- Primary dataset: MTSamples medical transcription (processed)
- Source links: GitHub, HELM scenario
- Split sizes: Evaluation only (no predefined train/test splits)
Task
- Type: Single-Turn
- Rubric overview: MultiJudgeRubric (LLM-as-a-Judge evaluation adapted from HELM's MTSamples Replicate annotator)
- Task description: Given patient notes with the PLAN section removed (while preserving SUMMARY and FINDINGS), generate a reasonable treatment plan.
- Prompt: "Here are information about a patient, return a reasonable treatment plan for the patient."
- Evaluation dimensions:
- Accuracy (1–5): Does the response provide clinically appropriate and correct treatment guidance?
- Completeness (1–5): Does the response cover the key aspects of care implied by the note?
- Clarity (1–5): Is the response clearly written and well structured for clinical readability?
Quickstart
Run an evaluation with default settings:
prime eval run mtsamples_replicate -m "openai/gpt-5-mini" -n 5 -s
Run a single-judge evaluation:
medarc-eval mtsamples_replicate -m "openai/gpt-5-mini" -n 5 --judge-model "openai/gpt-5-mini"
This environment supports multi-judge via the medarc-verifiers MultiJudge rubric. When multiple judge models are specified, each scores independently and the final reward is their mean. judge_model, judge_base_url, and judge_api_key are all list-valued and correspond positionally; if only one judge_base_url or judge_api_key is provided, it is used for all judge models.
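The positional pairing and averaging described above can be sketched as follows. This is a minimal illustration, not the medarc-verifiers implementation: the helper names `as_list`, `resolve_judges`, and `combine_rewards` are hypothetical, and raising an error on mismatched list lengths is an assumption.

```python
from statistics import mean


def as_list(value):
    """Wrap a scalar in a list, copy lists/tuples, and map None to an empty list."""
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple)) else [value]


def resolve_judges(judge_model, judge_base_url=None, judge_api_key=None):
    """Pair each judge model with a base URL and API key by position (hypothetical helper)."""
    models = as_list(judge_model)

    def align(value, name):
        items = as_list(value)
        if not items:
            # Nothing supplied: every judge falls back to None.
            return [None] * len(models)
        if len(items) == 1:
            # A single value is reused for all judge models.
            return items * len(models)
        if len(items) != len(models):
            # Assumption: a length mismatch is treated as a configuration error.
            raise ValueError(f"{name} has {len(items)} entries for {len(models)} judge models")
        return items

    urls = align(judge_base_url, "judge_base_url")
    keys = align(judge_api_key, "judge_api_key")
    return [
        {"model": m, "base_url": u, "api_key": k}
        for m, u, k in zip(models, urls, keys)
    ]


def combine_rewards(per_judge_rewards):
    """Final reward is the mean of the independent per-judge rewards."""
    return mean(per_judge_rewards)
```

For example, `resolve_judges(["a", "b"], "http://x", ["k1", "k2"])` pairs both judges with the same base URL but distinct API keys.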
Run a multi-judge evaluation by repeating --judge-model:
medarc-eval mtsamples_replicate -m "openai/gpt-5-mini" --judge-model "openai/gpt-5-mini" --judge-model "x-ai/grok-4.1-fast"
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| cache_dir | str \| Path \| None | ~/.cache/medarc/mtsamples_replicate | Local directory to cache downloaded datasets. |
| use_think | bool | False | Whether to use chain-of-thought prompting with <think>...</think> |
| judge_model | str \| list[str] | "gpt-4o-mini" | Model(s) used by the LLM judge |
| judge_base_url | str \| list[str] \| None | None | Custom API base URL(s) for judge model |
| judge_api_key | str \| list[str] \| None | None | API key(s) for judge model |
Data Processing (HELM-aligned)
Following HELM's MTSamples Replicate approach:
- Section Extraction: Extracts the first line following PLAN:, SUMMARY:, or FINDINGS: (priority order: PLAN > SUMMARY > FINDINGS)
- Input Cleaning: Removes only the PLAN: section from the input text; all other sections (e.g., SUMMARY, FINDINGS, IMPRESSION) are preserved as clinical context.
- Reference: The extracted section content is used as the gold reference answer.
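The extraction and cleaning steps above might look roughly like the sketch below. It is not the HELM or environment code: the function names are hypothetical, and it assumes that section headers are all-caps labels ending in a colon at the start of a line and that a section runs until the next such header.

```python
import re

# Priority order taken from the description above.
SECTION_PRIORITY = ("PLAN", "SUMMARY", "FINDINGS")

# Assumption: a header is an all-caps label followed by a colon at line start.
HEADER_RE = re.compile(r"^[A-Z][A-Z /&]*:", re.MULTILINE)


def extract_reference(note):
    """Return (section, first non-empty line after its header), or None if no section matches."""
    for name in SECTION_PRIORITY:
        match = re.search(rf"^{name}:", note, re.MULTILINE)
        if match is None:
            continue
        # Text after the header may start on the same line or on the next one.
        for line in note[match.end():].splitlines():
            if line.strip():
                return name, line.strip()
    return None


def remove_plan(note):
    """Drop the PLAN section only; every other section stays as clinical context."""
    start = re.search(r"^PLAN:", note, re.MULTILINE)
    if start is None:
        return note
    # The section ends at the next header, or at the end of the note.
    next_header = HEADER_RE.search(note, start.end())
    end = next_header.start() if next_header else len(note)
    return (note[:start.start()] + note[end:]).strip()
```

Mixed-case metadata lines such as "Medical Specialty:" do not match the assumed header pattern, so they are never treated as section boundaries.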
Dataset Example
1-year-old Exam – H&P
Input (PLAN removed):
Medical Specialty: Pediatrics - Neonatal
Sample Name: 1-year-old Exam - H&P
Description: Health maintenance exam for 1-year-old female.
...
IMPRESSION:
Routine well child care. Acute conjunctivitis.
Reference Answer (PLAN):
Diagnostic & Lab Orders: Ordered blood lead.
Notes
- The question field contains the full patient note with the PLAN section removed
- The answer field contains the first line after the selected section header
- SUMMARY and FINDINGS may remain in the input, consistent with HELM's Replicate benchmark
- Scores are normalized to [0, 1] by averaging normalized dimension scores
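As an illustration of the last note, one plausible reading of the normalization is sketched below. The linear mapping of the 1–5 scale onto [0, 1] is an assumption (the environment may instead divide by 5, for instance), and `normalize` and `judge_reward` are hypothetical names.

```python
DIMENSIONS = ("accuracy", "completeness", "clarity")
MIN_SCORE, MAX_SCORE = 1, 5


def normalize(score):
    """Map a 1-5 judge score linearly onto [0, 1] (assumed scheme)."""
    if not MIN_SCORE <= score <= MAX_SCORE:
        raise ValueError(f"score {score} is outside {MIN_SCORE}-{MAX_SCORE}")
    return (score - MIN_SCORE) / (MAX_SCORE - MIN_SCORE)


def judge_reward(scores):
    """Average the normalized dimension scores from a single judge."""
    return sum(normalize(scores[d]) for d in DIMENSIONS) / len(DIMENSIONS)
```

Under this assumed scheme, scores of 5, 3, and 1 normalize to 1.0, 0.5, and 0.0, so the judge's reward is 0.5.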
References
@misc{helm2023,
title={Holistic Evaluation of Language Models},
author={Liang, Percy and Bommasani, Rishi and Lee, Tony and others},
year={2023},
url={https://github.com/stanford-crfm/helm}
}