MTSamples Procedures
Overview
- Environment ID: `mtsamples_procedures`
- Short description: MTSamples Procedures is a benchmark of transcribed operative notes that document surgical procedures. Each example presents a brief patient case involving a surgical intervention, and the model is tasked with generating a coherent and clinically accurate procedural summary or treatment plan.
Dataset
- Split sizes:
  - Evaluation: ~90 examples (all used for evaluation)
  - Note: This is an evaluation-only benchmark with no predefined train/test splits
- Source:
  - MTSamples medical transcription repository
  - Implementation based on HELM's MTSamples Procedures scenario
Task
- Type: Single-Turn
- Rubric overview: MultiJudgeRubric (LLM-as-a-Judge evaluation adapted from HELM's MTSamples Procedures Annotator)
- Task description: Given patient notes (procedure note with PLAN/SUMMARY/FINDINGS sections removed), generate a reasonable treatment plan
- Prompt: "Here are information about a patient, return a reasonable treatment plan for the patient."
- Evaluation dimensions:
  - Accuracy (1-5): Does the response provide correct clinical advice that follows established clinical guidelines?
  - Completeness (1-5): Does the response include all important aspects of patient care mentioned in the reference?
  - Clarity (1-5): Is the response written clearly and organized in a way that is easy for clinicians to read?
Quickstart
Run an evaluation with default settings:
prime eval run mtsamples_procedures -m "openai/gpt-5-mini" -n 5 -s
Run a single-judge evaluation:
medarc-eval mtsamples_procedures -m "openai/gpt-5-mini" -n 5 --judge-model "openai/gpt-5-mini"
This environment supports multi-judge via the medarc-verifiers MultiJudge rubric. When multiple judge models are specified, each scores independently and the final reward is their mean. judge_model, judge_base_url, and judge_api_key are all list-valued and correspond positionally; if only one judge_base_url or judge_api_key is provided, it is used for all judge models.
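The positional pairing described above can be sketched in a few lines of Python. This is an illustration only, not the medarc-verifiers implementation: `align_judge_settings` and `_as_list` are hypothetical names, and raising an error on a length mismatch is an assumption about how such input would be handled.

```python
from typing import List, Optional, Sequence, Tuple, Union

StrOrList = Union[str, Sequence[str], None]


def _as_list(value: StrOrList) -> List[str]:
    """Normalize a scalar, a sequence, or None into a plain list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


def align_judge_settings(
    judge_model: StrOrList,
    judge_base_url: StrOrList = None,
    judge_api_key: StrOrList = None,
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Pair each judge model with its base URL and API key by position."""
    models = _as_list(judge_model)

    def expand(values: StrOrList, name: str) -> List[Optional[str]]:
        items: List[Optional[str]] = list(_as_list(values))
        if len(items) == 0:
            # Nothing provided: every judge falls back to the default.
            return [None] * len(models)
        if len(items) == 1:
            # A single value is shared by all judge models.
            return items * len(models)
        if len(items) != len(models):
            # Assumption: mismatched lengths are rejected rather than truncated.
            raise ValueError(f"{name} has {len(items)} entries for {len(models)} judges")
        return items

    urls = expand(judge_base_url, "judge_base_url")
    keys = expand(judge_api_key, "judge_api_key")
    return list(zip(models, urls, keys))
```

Calling `align_judge_settings(["openai/gpt-5-mini", "x-ai/grok-4.1-fast"], judge_base_url="https://example.invalid/v1")` would yield two tuples that share the same placeholder base URL and have no explicit API key.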
Run a multi-judge evaluation by passing `--judge-model` more than once:
medarc-eval mtsamples_procedures -m "openai/gpt-5-mini" --judge-model "openai/gpt-5-mini" --judge-model "x-ai/grok-4.1-fast"
Notes:
- Use direct environment flags with `medarc-eval` (for example, `--split validation` or `--judge-model gpt-5-mini`).
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| `cache_dir` | `str \| Path \| None` | `~/.cache/medarc/mtsamples_procedures` | Local directory to cache downloaded datasets. Can also be set via the `MTSAMPLES_PROCEDURES_CACHE_DIR` environment variable. |
| `use_think` | `bool` | `False` | Whether to use chain-of-thought prompting with `<think>...</think>` tags. |
| `judge_model` | `str \| list[str]` | `"gpt-4o-mini"` | Model identifier(s) for the LLM judge evaluating procedural plans. |
| `judge_base_url` | `str \| list[str] \| None` | `None` | Custom API base URL(s) for the judge model (defaults to the OpenAI API). |
| `judge_api_key` | `str \| list[str] \| None` | `None` | API key(s) for the judge model. Falls back to the `JUDGE_API_KEY` environment variable if not provided. |
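The fallbacks in the table can be illustrated with a short Python sketch. It is not the environment's actual code: the helper names are hypothetical, and the precedence for `cache_dir` (explicit argument, then environment variable, then default) is an assumption, since the table only says the variable can also be used.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Union

DEFAULT_CACHE_DIR = Path("~/.cache/medarc/mtsamples_procedures")


def resolve_cache_dir(
    cache_dir: Union[str, Path, None] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the cache directory: argument, then env var, then default (assumed order)."""
    env = os.environ if env is None else env
    chosen = cache_dir or env.get("MTSAMPLES_PROCEDURES_CACHE_DIR") or DEFAULT_CACHE_DIR
    return Path(chosen).expanduser()


def resolve_judge_api_key(
    judge_api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Use the explicit key if given, otherwise fall back to JUDGE_API_KEY."""
    env = os.environ if env is None else env
    if judge_api_key is not None:
        return judge_api_key
    return env.get("JUDGE_API_KEY")
```

Passing `env` explicitly keeps the helpers easy to test without touching the real process environment.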
Results Dataset Structure
Core Evaluation Fields
- `prompt` - The patient notes presented to the model (list of message objects with `role` and `content`)
- `completion` - The model's generated treatment plan (list of message objects)
- `reward` - Overall score from 0.0 to 1.0, calculated as the average of normalized dimension scores: `(accuracy/5 + completeness/5 + clarity/5) / 3`
Example Metadata (info)
Contains all the MTSamples-specific information about each procedure:
- `filename` - Original filename from the GitHub repository
- `extracted_section` - Which section was used as reference ("PLAN", "SUMMARY", or "FINDINGS")
- `procedure_note` - The patient notes with sections removed (same as the `question` field)
- `reference_plan` - Gold standard treatment plan/summary (same as the `answer` field)
- `judge_feedback` - List of judge evaluations with scores and explanations for each dimension
Notes
- The `question` field contains everything before the first PLAN/SUMMARY/FINDINGS section (HELM's exact approach)
- The `answer` field contains the first line after the prioritized section header (PLAN > SUMMARY > FINDINGS)
- Scores are normalized to 0-1 by dividing each dimension score (1-5) by 5 and averaging across dimensions
- If judge response parsing fails, dimension scores default to `None` and do not contribute to the final reward
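As a rough illustration of the scoring arithmetic, the sketch below normalizes each 1-5 score, averages across dimensions, and then averages across judges. The function names are hypothetical, and two behaviors are assumptions rather than documented facts: `None` scores are left out of the average instead of being counted as zero, and a judge with no usable scores contributes 0.0.

```python
from statistics import mean
from typing import Mapping, Optional, Sequence

DIMENSIONS = ("accuracy", "completeness", "clarity")


def judge_reward(scores: Mapping[str, Optional[float]]) -> float:
    """Average of the normalized (score / 5) dimensions for one judge."""
    normalized = [
        scores[dim] / 5
        for dim in DIMENSIONS
        if scores.get(dim) is not None  # assumption: unparsed scores are skipped
    ]
    # Assumption: a judge whose response could not be parsed at all scores 0.0.
    return mean(normalized) if normalized else 0.0


def final_reward(per_judge: Sequence[Mapping[str, Optional[float]]]) -> float:
    """Mean of the per-judge rewards, matching the multi-judge description."""
    if not per_judge:
        return 0.0
    return mean(judge_reward(scores) for scores in per_judge)
```

With a single judge returning accuracy 5, completeness 4, and clarity 3, the formula gives `(5/5 + 4/5 + 3/5) / 3`, which is approximately 0.8.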
Data Processing
Following HELM's exact approach:
- Section Extraction: Extracts the first line after `PLAN:`, `SUMMARY:`, or `FINDINGS:` headers (priority order: PLAN > SUMMARY > FINDINGS)
- Input Cleaning: Takes everything before the first section header found as the input
- Reference: The extracted section content becomes the gold standard answer
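The three steps above can be sketched as follows. This is a simplified reading of the description, not HELM's or this environment's code: `split_note` is a hypothetical helper, headers are matched by plain substring search, and "first line after the header" is assumed to mean the first non-empty line of text following it.

```python
from typing import Optional, Tuple

SECTION_PRIORITY = ("PLAN", "SUMMARY", "FINDINGS")


def split_note(note: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (question, reference, section_name) for one transcribed note."""
    reference: Optional[str] = None
    used: Optional[str] = None

    # Reference: the highest-priority header present in the note wins.
    for name in SECTION_PRIORITY:
        header = f"{name}:"
        if header in note:
            after = note.split(header, 1)[1]
            # Assumption: take the first non-empty line that follows the header.
            reference = after.strip().split("\n", 1)[0].strip()
            used = name
            break

    # Input: keep only the text before the earliest header of any kind.
    positions = [note.find(f"{name}:") for name in SECTION_PRIORITY if f"{name}:" in note]
    question = note[: min(positions)].strip() if positions else note.strip()
    return question, reference, used
```

Substring matching is deliberately naive here; a header such as `TREATMENT PLAN:` would also match `PLAN:`, so a real implementation may need stricter header detection.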
Dataset Examples
Example: AC Separation Revision & Hardware Removal
Patient Notes (input):
Medical Specialty:Orthopedic
Sample Name: AC Separation Revision & Hardware Removal
Description: Removal of the hardware and revision of right AC separation...
PREOPERATIVE DIAGNOSIS: Right AC separation.
POSTOPERATIVE DIAGNOSIS: Right AC separation.
PROCEDURES: Removal of the hardware and revision of right AC separation.
ANESTHESIA: General.
BLOOD LOSS: 100 cc.
COMPLICATIONS: None.
Reference Answer (extracted from SUMMARY section):
After informed consent was obtained and verified, the patient was brought to
the operating room and placed supine on the operating table. After uneventful
general anesthesia was obtained, he was positioned in the beach chair...
Generated Treatment Plan:
1. Postoperative Care: Monitor vital signs and surgical site for signs of infection...
2. Immobilization: Use of a sling or shoulder immobilizer for 2-4 weeks...
3. Physical Therapy: Begin passive range of motion exercises around 2-3 weeks post-op...
4. Follow-up: Schedule follow-up visits at 1-2 weeks post-op for wound check...
References
HELM MTSamples Implementation
@misc{helm2023,
title={Holistic Evaluation of Language Models},
author={Liang, Percy and Bommasani, Rishi and Lee, Tony and others},
year={2023},
url={https://github.com/stanford-crfm/helm}
}