0

Mtsamples Procedures RL Env (Medarc)

Fresh

MTSamples Procedures is a benchmark of medical transcription samples that tests a model's ability to generate coherent and clinically accurate proc...

Type
RL Env
Publisher
Medarc
Runtime
single-turn
License
unknown
Size
v0.1.0
Published
Feb 2026

Cite

Notes

Only stored in your browser.

MTSamples Procedures

Overview

  • Environment ID: mtsamples_procedures
  • Short description: MTSamples Procedures is a benchmark composed of transcribed operative notes, focused on documenting surgical procedures. Each example presents a brief patient case involving a surgical intervention, and the model is tasked with generating a coherent and clinically accurate procedural summary or treatment plan.

Dataset

Task

  • Type: Single-Turn
  • Rubric overview: MultiJudgeRubric (LLM-as-a-Judge evaluation adapted from HELM's MTSamples Procedures Annotator)
  • Task description: Given patient notes (procedure note with PLAN/SUMMARY/FINDINGS sections removed), generate a reasonable treatment plan
  • Prompt: "Here are information about a patient, return a reasonable treatment plan for the patient."
  • Evaluation dimensions:
    • Accuracy (1-5): Does the response provide correct clinical advice that follows established clinical guidelines?
    • Completeness (1-5): Does the response include all important aspects of patient care mentioned in the reference?
    • Clarity (1-5): Is the response written clearly and organized in a way that is easy to read for clinicians?

Quickstart

Run an evaluation with default settings:

prime eval run mtsamples_procedures -m "openai/gpt-5-mini" -n 5 -s

Run a single-judge evaluation:

medarc-eval mtsamples_procedures -m "openai/gpt-5-mini" -n 5 --judge-model "openai/gpt-5-mini"

This environment supports multi-judge via the medarc-verifiers MultiJudge rubric. When multiple judge models are specified, each scores independently and the final reward is their mean. judge_model, judge_base_url, and judge_api_key are all list-valued and correspond positionally; if only one judge_base_url or judge_api_key is provided, it is used for all judge models.

Use configured default judges:

medarc-eval mtsamples_procedures -m "openai/gpt-5-mini" --judge-model "openai/gpt-5-mini" --judge-model "x-ai/grok-4.1-fast"

Notes:

  • Use direct environment flags with medarc-eval (for example, --split validation or --judge-model gpt-5-mini).

Environment Arguments

ArgTypeDefaultDescription
cache_dirstr | Path | None~/.cache/medarc/mtsamples_proceduresLocal directory to cache downloaded datasets. Can also be set via MTSAMPLES_PROCEDURES_CACHE_DIR environment variable.
use_thinkboolFalseWhether to use chain-of-thought prompting with <think>...</think> tags
judge_modelstr | list[str]"gpt-4o-mini"Model identifier(s) for the LLM judge evaluating procedural plans
judge_base_urlstr | list[str] | NoneNoneCustom API base URL(s) for judge model (defaults to OpenAI API)
judge_api_keystr | list[str] | NoneNoneAPI key(s) for judge model. Falls back to JUDGE_API_KEY environment variable if not provided

Results Dataset Structure

Core Evaluation Fields

  • prompt - The patient notes presented to the model (list of message objects with role and content)
  • completion - The model's generated treatment plan (list of message objects)
  • reward - Overall score from 0.0 to 1.0, calculated as the average of normalized dimension scores: (accuracy/5 + completeness/5 + clarity/5) / 3

Example Metadata (info)

Contains all the MTSamples-specific information about each procedure:

  • filename - Original filename from the GitHub repository
  • extracted_section - Which section was used as reference ("PLAN", "SUMMARY", or "FINDINGS")
  • procedure_note - The patient notes with sections removed (same as question field)
  • reference_plan - Gold standard treatment plan/summary (same as answer field)
  • judge_feedback - List of judge evaluations with scores and explanations for each dimension

Notes

  • The question field contains everything BEFORE the first PLAN/SUMMARY/FINDINGS section (HELM's exact approach)
  • The answer field contains the first line after the prioritized section header (PLAN > SUMMARY > FINDINGS)
  • Scores are normalized to 0-1 by dividing each dimension score (1-5) by 5 and averaging across dimensions
  • If judge response parsing fails, dimension scores default to None and do not contribute to the final reward

Data Processing

Following HELM's exact approach:

  1. Section Extraction: Extracts the first line after PLAN:, SUMMARY:, or FINDINGS: headers (priority order: PLAN > SUMMARY > FINDINGS)
  2. Input Cleaning: Takes everything BEFORE the first section header found as the input
  3. Reference: The extracted section content becomes the gold standard answer

Dataset Examples

Example: AC Separation Revision & Hardware Removal

Patient Notes (input):
Medical Specialty:Orthopedic
Sample Name: AC Separation Revision & Hardware Removal
Description: Removal of the hardware and revision of right AC separation...
PREOPERATIVE DIAGNOSIS: Right AC separation.
POSTOPERATIVE DIAGNOSIS: Right AC separation.
PROCEDURES: Removal of the hardware and revision of right AC separation.
ANESTHESIA: General.
BLOOD LOSS: 100 cc.
COMPLICATIONS: None.

Reference Answer (extracted from SUMMARY section):
After informed consent was obtained and verified, the patient was brought to
the operating room and placed supine on the operating table. After uneventful
general anesthesia was obtained, he was positioned in the beach chair...

Generated Treatment Plan:
1. Postoperative Care: Monitor vital signs and surgical site for signs of infection...
2. Immobilization: Use of a sling or shoulder immobilizer for 2-4 weeks...
3. Physical Therapy: Begin passive range of motion exercises around 2-3 weeks post-op...
4. Follow-up: Schedule follow-up visits at 1-2 weeks post-op for wound check...

References

HELM MTSamples Implementation

@misc{helm2023,
  title={Holistic Evaluation of Language Models},
  author={Liang, Percy and Bommasani, Rishi and Lee, Tony and others},
  year={2023},
  url={https://github.com/stanford-crfm/helm}
}