Medec RL Env (Medarc)

Medical Error Detection and Correction in clinical notes from Ben Abacha et al., 2024

  • Type: RL Env
  • Publisher: Medarc
  • Runtime: single-turn
  • License: unknown
  • Size: v0.1.0
  • Published: Feb 2026

MEDEC

Evaluation environment for the MEDEC dataset.

Overview

  • Environment ID: medec
  • Short description: A benchmark for medical error detection, extraction, and correction in clinical notes, based on the MEDIQA-CORR 2024 shared task.
  • Tags: medical, clinical, error-detection, error-correction, single-turn, llm-as-judge, evaluation, metrics

Datasets

| Split | Count | Description |
| --- | --- | --- |
| train_ms | 2,189 | MS Training Set |
| validation_ms | 574 | MS Validation Set with Ground Truth |
| test_ms | 597 | MS Test Set with Ground Truth |

Task

  • Type: single-turn
  • Rubric overview: Supports three evaluation modes via eval_method:
    • "judge" (default): LLM-as-a-Judge multi-part rubric (No Free Labels inspired multi-axis judge)
    • "metrics" (replication): the paper's ROUGE, BERTScore, and BLEURT metrics; the primary score is a weighted combination of error_flag, error_sentence, and these paper metrics
    • "both" (combined): the Judge mode score, plus the paper metrics reported at weight 0 for analysis only

Quickstart

1. Export API Key

export OPENAI_API_KEY="your-openai-api-key"

2. Run Evaluation (Default Judge Mode)

prime eval run medec -m "openai/gpt-5-mini" -n 5 -s
medarc-eval medec -m "openai/gpt-5-mini" -n 10 -s --judge-model "openai/gpt-5-mini"

This environment supports multiple judges via the medarc-verifiers MultiJudge rubric. When more than one judge model is specified, each judge scores independently and the final reward is the mean of their scores. judge_model, judge_base_url, and judge_api_key are all list-valued and matched by position; if only one judge_base_url or judge_api_key is provided, it is used for every judge model.

medarc-eval medec -m "openai/gpt-5-mini" -n 10 -s --judge-model "openai/gpt-5-mini" --judge-model "google/gemini-3-flash-preview"
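
The positional pairing and mean aggregation described above can be sketched in Python. This is only an illustration under stated assumptions: `pair_judges` and `mean_reward` are hypothetical helpers, not part of medarc-verifiers, and the real MultiJudge rubric may validate or aggregate differently. The endpoint and key name in the usage note are placeholders as well.

```python
from statistics import mean
from typing import List, Optional, Sequence, Tuple

# (model, base_url, api_key); None means "fall back to the default".
Judge = Tuple[str, Optional[str], Optional[str]]


def _broadcast(values: Sequence[str], n: int, name: str) -> List[Optional[str]]:
    """Expand 0 or 1 values to n entries; otherwise require exactly n."""
    if len(values) == 0:
        return [None] * n
    if len(values) == 1:
        return [values[0]] * n
    if len(values) != n:
        raise ValueError(f"{name}: expected 0, 1, or {n} values, got {len(values)}")
    return list(values)


def pair_judges(
    models: Sequence[str],
    base_urls: Sequence[str] = (),
    api_keys: Sequence[str] = (),
) -> List[Judge]:
    """Match each judge model with its endpoint and key by position."""
    n = len(models)
    urls = _broadcast(base_urls, n, "judge_base_url")
    keys = _broadcast(api_keys, n, "judge_api_key")
    return list(zip(models, urls, keys))


def mean_reward(judge_scores: Sequence[float]) -> float:
    """Final reward as the plain mean of the independent judge scores."""
    return mean(judge_scores)
```

For example, `pair_judges(["openai/gpt-5-mini", "google/gemini-3-flash-preview"], ["https://example.invalid/v1"], ["JUDGE_KEY"])` reuses the single endpoint and key for both judges, and `mean_reward` then averages whatever the two judges return.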

3. Run Evaluation (Paper Replication Mode)

medarc-eval medec -m "openai/gpt-5-mini" --eval-method metrics -n 10 -s

4. Evaluate a Different Model (e.g., Anthropic)

export OPENAI_API_KEY="your-openai-api-key"
export ANTHROPIC_API_KEY="your-anthropic-api-key"

medarc-eval medec -m "<anthropic-model-id>" -b https://api.anthropic.com/v1 -k ANTHROPIC_API_KEY --header 'anthropic-version: 2023-06-01' -n 10 -s

Environment Arguments

| Argument | Type | Default | Description |
| --- | --- | --- | --- |
| judge_model | str | "gpt-4o-mini" | Model used for judge-based scoring (in "judge" or "both" mode). |
| judge_base_url | str | None | API endpoint for the judge model (defaults to OpenAI API). |
| judge_api_key | str | None | API key for the judge model (defaults to OPENAI_API_KEY). |
| eval_method | str | "judge" | Evaluation mode ("judge", "metrics", or "both"). |
| device | str | None | Device to use for metrics (cpu, cuda:0, etc.). Defaults to GPU if available. |
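
The "defaults to GPU if available" behavior of the device argument can be sketched as follows. This is a hedged illustration only: `resolve_device` is a hypothetical helper, and it assumes the metrics run on PyTorch, which this README does not state. The environment's actual selection logic may differ.

```python
from typing import Optional


def resolve_device(device: Optional[str] = None) -> str:
    """Return the caller's device string, or pick a default (hypothetical helper)."""
    if device is not None:
        return device  # an explicit choice wins, e.g. "cpu" or "cuda:0"
    try:
        import torch  # assumed backend; only needed to probe for a GPU
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Passing an explicit value such as "cpu" is a simple way to keep metric runs reproducible on machines with and without a GPU.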

Metrics

Judge Mode (eval_method="judge")

| Metric | Weight | Meaning |
| --- | --- | --- |
| error_flag | 1/3 | 1.0 if predicted error_id matches ground truth; else 0.0. |
| error_sentence | 1/3 | 1.0 if predicted incorrect_sentence matches ground truth; else 0.0. |
| error_correction | 1/3 | 1.0 if LLM judge deems correction medically equivalent; else 0.0. |

Metrics Mode (eval_method="metrics")

| Metric | Weight | Meaning |
| --- | --- | --- |
| error_flag | 1/3 | 1.0 if predicted error_id matches ground truth; else 0.0. |
| error_sentence | 1/3 | 1.0 if predicted incorrect_sentence matches ground truth; else 0.0. |
| rouge_score | 1/6 | ROUGE-1 F1 score. |
| bertscore | 1/6 | BERTScore F1. |
| bleurt | 1/6 | BLEURT score. |
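
As a worked example of how the weights above combine, here is a minimal Python sketch. The helper names and the sample scores are made up for illustration. Note that the listed weights sum to 7/6 rather than 1, and the README does not say whether the environment reports the raw weighted sum or a normalized average, so the sketch shows both.

```python
from typing import Dict

# Weights copied from the Metrics mode table above.
METRICS_WEIGHTS: Dict[str, float] = {
    "error_flag": 1 / 3,
    "error_sentence": 1 / 3,
    "rouge_score": 1 / 6,
    "bertscore": 1 / 6,
    "bleurt": 1 / 6,
}


def weighted_sum(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Sum of weight * score over every weighted metric."""
    return sum(weights[name] * scores[name] for name in weights)


def weighted_average(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum divided by the total weight, so a perfect score is 1.0."""
    return weighted_sum(scores, weights) / sum(weights.values())


# Made-up per-example scores, for illustration only.
example_scores = {
    "error_flag": 1.0,
    "error_sentence": 1.0,
    "rouge_score": 0.6,
    "bertscore": 0.9,
    "bleurt": 0.3,
}

raw = weighted_sum(example_scores, METRICS_WEIGHTS)
normalized = weighted_average(example_scores, METRICS_WEIGHTS)
```

With these sample values the raw sum is 1/3 + 1/3 + (0.6 + 0.9 + 0.3)/6 = 29/30, and the normalized average is 29/35.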

Both Mode (eval_method="both")

Same as Judge mode, plus the paper's evaluation metrics at weight 0, so they are reported for analysis but do not affect the reward:

| Metric | Weight | Meaning |
| --- | --- | --- |
| rouge_score | 0 | ROUGE-1 F1 score. |
| bertscore | 0 | BERTScore F1. |
| bleurt | 0 | BLEURT score. |
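
The effect of the zero weights can be sketched in a few lines of Python. This assumes the reward is a plain weighted sum of per-metric scores; `both_mode_reward` and the sample scores are hypothetical and only illustrate that the paper metrics are carried along without changing the reward.

```python
from typing import Dict

# Judge mode weights plus the paper metrics at weight 0, as in the tables above.
BOTH_WEIGHTS: Dict[str, float] = {
    "error_flag": 1 / 3,
    "error_sentence": 1 / 3,
    "error_correction": 1 / 3,
    "rouge_score": 0.0,
    "bertscore": 0.0,
    "bleurt": 0.0,
}


def both_mode_reward(scores: Dict[str, float]) -> float:
    """Weighted sum; zero-weight metrics contribute nothing to the reward."""
    return sum(BOTH_WEIGHTS[name] * scores[name] for name in BOTH_WEIGHTS)


# Two made-up examples that differ only in the zero-weight paper metrics.
low_overlap = {
    "error_flag": 1.0, "error_sentence": 0.0, "error_correction": 1.0,
    "rouge_score": 0.2, "bertscore": 0.5, "bleurt": 0.1,
}
high_overlap = {**low_overlap, "rouge_score": 0.9, "bertscore": 0.95, "bleurt": 0.8}

reward_low = both_mode_reward(low_overlap)
reward_high = both_mode_reward(high_overlap)
```

Both examples receive the same reward (2/3 here), while the ROUGE, BERTScore, and BLEURT values remain available per example for side-by-side analysis against the judge.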