MEDEC

Description

MEDEC is an environment for evaluating an agent's ability to detect medical errors in clinical notes, identify the erroneous sentence, and provide a medically accurate correction. Based on the MEDEC-MS dataset, it covers five types of medical errors: Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism.

Capabilities

Detecting whether a clinical note contains a medical error
Identifying the specific sentence containing the error
Providing medically accurate corrections
Reasoning about clinical presentations and medical knowledge

Compute Requirements

Agents in MEDEC are given a standard sandbox environment. No special compute resources are required.

License

CC BY 4.0.

Tasks

There are three splits in this environment:

Train: 2,189 clinical texts
Validation: 574 clinical texts
Test: 597 clinical texts

Total: 3,360 tasks

Each task presents a clinical note that either contains a medical error or is correct. The agent must determine whether an error exists and, if so, identify the erroneous sentence and provide a correction.

Reward Structure

This is a single-turn environment. The reward is a weighted sum of three components and ranges from 0.0 to 1.0:

40% — Error detection accuracy (binary: did the agent correctly classify whether an error exists?)
30% — Error sentence identification (LLM-graded via gpt-5-mini: did the agent identify the correct sentence?)
30% — Correction quality (LLM-graded via gpt-5-mini: is the agent's correction medically equivalent to the reference?)

For notes without errors, the reward is 1.0 if the agent correctly reports no error and 0.0 if it reports one (a false positive).
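
As a rough illustration of how these weights combine, the sketch below computes the reward from the three components. It is not the environment's actual grading code: the function name `medec_reward` is hypothetical, the two LLM-graded components are assumed to arrive as scores in [0, 1], and a missed error on an erroneous note is assumed to score 0.0 because there is nothing further to grade.

```python
# Component weights as listed in this section.
W_DETECTION = 0.4
W_SENTENCE = 0.3
W_CORRECTION = 0.3


def medec_reward(note_has_error: bool,
                 error_detected: bool,
                 sentence_score: float = 0.0,
                 correction_score: float = 0.0) -> float:
    """Combine the three components into a reward in [0.0, 1.0].

    sentence_score and correction_score stand in for the LLM grader's
    verdicts (assumed here to be values between 0 and 1).
    """
    if not note_has_error:
        # Error-free note: all-or-nothing on the detection flag.
        return 0.0 if error_detected else 1.0
    if not error_detected:
        # Assumption: a missed error earns nothing, since no sentence
        # or correction was submitted for grading.
        return 0.0
    # Detection was correct, so the full detection weight is earned.
    return (W_DETECTION
            + W_SENTENCE * sentence_score
            + W_CORRECTION * correction_score)


if __name__ == "__main__":
    # Correct flag and sentence, but a correction the grader rejects.
    print(medec_reward(True, True, sentence_score=1.0, correction_score=0.0))
    # False positive on a clean note.
    print(medec_reward(False, True))
```

Under these assumptions, an agent that flags the right sentence but proposes a non-equivalent correction keeps the detection and sentence portions of the reward and loses only the correction portion.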

Data

The dataset consists of clinical notes from the MEDEC-MS collection. Each note is either error-free or contains exactly one medical error of one of five types. Data is stored as Parquet files on the OpenReward platform.

Source: MEDEC GitHub Repository

Tools

Tool	Description
`submit_answer`	Submit error detection results: `error_detected` (bool), `error_sentence` (string), `corrected_sentence` (string). Returns weighted reward and detailed feedback.
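
The arguments to `submit_answer` can be pictured as a small typed dictionary. The sketch below is illustrative only: the three field names come from the table above, while the helper `build_submit_answer_args`, the `SubmitAnswerArgs` type, and the choice of empty strings for the no-error case are assumptions rather than documented behavior.

```python
from typing import TypedDict


class SubmitAnswerArgs(TypedDict):
    """Argument shape for submit_answer, using the documented field names."""
    error_detected: bool
    error_sentence: str
    corrected_sentence: str


def build_submit_answer_args(error_detected: bool,
                             error_sentence: str = "",
                             corrected_sentence: str = "") -> SubmitAnswerArgs:
    """Assemble the arguments for one submit_answer call (hypothetical helper)."""
    if not error_detected:
        # Assumption: when no error is reported, the two string fields
        # are sent as empty strings.
        return {"error_detected": False,
                "error_sentence": "",
                "corrected_sentence": ""}
    if not error_sentence or not corrected_sentence:
        raise ValueError("an error report needs both the erroneous "
                         "sentence and its correction")
    return {"error_detected": True,
            "error_sentence": error_sentence,
            "corrected_sentence": corrected_sentence}


if __name__ == "__main__":
    # Placeholder strings; a real answer would quote the note itself.
    print(build_submit_answer_args(
        True,
        "<sentence copied from the clinical note>",
        "<medically corrected version of that sentence>",
    ))
```

Because the environment is single-turn, validating the arguments before the call is worthwhile: there is no second submission to repair an incomplete answer.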

Time Horizon

MEDEC is a single-turn environment. The agent receives a clinical note and submits one answer.

Environment Difficulty

MEDEC is a challenging benchmark. The original paper reports that Claude 3.5 Sonnet achieves 70.16% accuracy on error flag detection and 65.62% on error sentence detection, while medical doctors outperform all LLMs on these tasks.

Other Environment Requirements

OpenAI API key: Required for LLM-based grading of sentence identification and correction quality. Pass via `secrets={"openai_api_key": "..."}`.
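
One way to avoid hard-coding the key is to build the `secrets` dictionary from an environment variable. This is a minimal sketch: the helper `build_secrets` is hypothetical, the variable name `OPENAI_API_KEY` is an assumption, and how the resulting dictionary is handed to the environment depends on the client in use, which is not shown here.

```python
import os
from typing import Mapping


def build_secrets(env: Mapping[str, str] = os.environ) -> dict:
    """Return the secrets mapping with the key name documented above.

    Reads the key from the OPENAI_API_KEY variable (assumed name) so
    that it never appears in source code.
    """
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; LLM-based grading needs it")
    return {"openai_api_key": key}


if __name__ == "__main__":
    try:
        secrets = build_secrets()
        print("secrets ready with keys:", sorted(secrets))
    except RuntimeError as exc:
        print("cannot start grading:", exc)
```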

Safety

MEDEC evaluates an agent's ability to detect errors in clinical notes and should not be used as a substitute for professional medical review. The environment does not involve real patient care or clinical decision-making.

Citations

@inproceedings{BenAbacha2025MEDEC,
  author    = {Ben Abacha, Asma and Yim, Wen-wai and Fu, Yujuan and Sun, Zhaoyi and Xia, Fei and Yetisgen, Meliha},
  title     = {MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
  year      = {2025},
  url       = {https://arxiv.org/abs/2412.19260}
}