MEDEC

MEDEC is the first publicly available benchmark for medical error detection and correction in clinical notes, covering five error types (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism).

Type: RL Env
Runtime: ORS
License: unknown
Size: 3360 tasks
Published: Feb 2026

⭐ OpenReward Environment

Description

MEDEC is an environment for evaluating an agent's ability to detect medical errors in clinical notes, identify the erroneous sentence, and provide a medically accurate correction. Based on the MEDEC-MS dataset, it covers five types of medical errors: Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism.

Capabilities

  • Detecting whether a clinical note contains a medical error
  • Identifying the specific sentence containing the error
  • Providing medically accurate corrections
  • Reasoning about clinical presentations and medical knowledge

Compute Requirements

Agents in MEDEC are given a standard sandbox environment. No special compute resources are required.

License

CC BY 4.0.

Tasks

There are three splits in this environment:

  • Train: 2,189 clinical texts
  • Validation: 574 clinical texts
  • Test: 597 clinical texts

Total: 3,360 tasks

Each task presents a clinical note that either contains a medical error or is correct. The agent must determine whether an error exists and, if so, identify the erroneous sentence and provide a correction.

Reward Structure

This is a single-turn environment with a weighted reward structure (0.0–1.0):

  • 40% — Error detection accuracy (binary: did the agent correctly classify whether an error exists?)
  • 30% — Error sentence identification (LLM-graded via gpt-5-mini: did the agent identify the correct sentence?)
  • 30% — Correction quality (LLM-graded via gpt-5-mini: is the agent's correction medically equivalent to the reference?)

For notes without errors, the reward is 1.0 for correctly identifying no error, and 0.0 for a false positive.
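The weighting above can be sketched as a small function. This is only an illustration of the stated 40/30/30 split, not the environment's actual grader: the function name and parameters are hypothetical, the two LLM-graded components are assumed to be binary, and a missed error on an erroneous note is assumed to score 0.0, which the description above does not state.

```python
# Weights from the reward description, kept as integer points so the
# arithmetic stays exact before the final division.
DETECT_PTS, SENTENCE_PTS, CORRECTION_PTS = 40, 30, 30


def medec_reward(has_error: bool, error_detected: bool,
                 sentence_ok: bool = False,
                 correction_ok: bool = False) -> float:
    """Hypothetical sketch of the weighted reward in [0.0, 1.0].

    has_error:      ground truth, whether the note contains an error
    error_detected: the agent's binary classification
    sentence_ok:    grader verdict on the identified sentence (assumed binary)
    correction_ok:  grader verdict on the correction (assumed binary)
    """
    detection_ok = has_error == error_detected

    if not has_error:
        # Stated in the source: 1.0 for a correct "no error", 0.0 otherwise.
        return 1.0 if detection_ok else 0.0

    if not detection_ok:
        # Assumption: a missed error earns nothing, since there is no
        # sentence or correction to grade.
        return 0.0

    points = DETECT_PTS
    points += SENTENCE_PTS if sentence_ok else 0
    points += CORRECTION_PTS if correction_ok else 0
    return points / 100


if __name__ == "__main__":
    # Error found and sentence identified, but the correction was rejected.
    print(medec_reward(True, True, sentence_ok=True, correction_ok=False))
```

Under these assumptions, an agent that flags the error but names the wrong sentence and gives a wrong correction still keeps the 40% detection component.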

Data

The dataset consists of clinical notes from the MEDEC-MS collection. Each note is either error-free or contains exactly one medical error of one of five types. Data is stored as Parquet files on the OpenReward platform.

Source: MEDEC GitHub Repository

Tools

Tool | Description
submit_answer | Submit error detection results: error_detected (bool), error_sentence (string), corrected_sentence (string). Returns weighted reward and detailed feedback.
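As a rough illustration of the arguments listed for submit_answer, the sketch below builds the three fields and wraps them in a generic tool-call dictionary. Only the tool name and the three field names and types come from the description above. The `SubmitAnswerArgs` class, the `to_tool_call` helper, the `{"name", "arguments"}` envelope, and the use of empty strings when no error is reported are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SubmitAnswerArgs:
    """Arguments for submit_answer (field names taken from the tool description)."""
    error_detected: bool
    # Assumption: left empty when the agent reports that the note is correct.
    error_sentence: str = ""
    corrected_sentence: str = ""


def to_tool_call(args: SubmitAnswerArgs) -> dict:
    """Wrap the arguments in a hypothetical tool-call envelope."""
    return {"name": "submit_answer", "arguments": asdict(args)}


if __name__ == "__main__":
    # Placeholder strings stand in for text taken from a real clinical note.
    with_error = SubmitAnswerArgs(
        error_detected=True,
        error_sentence="<sentence copied from the note>",
        corrected_sentence="<medically corrected version of that sentence>",
    )
    no_error = SubmitAnswerArgs(error_detected=False)

    print(json.dumps(to_tool_call(with_error), indent=2))
    print(json.dumps(to_tool_call(no_error), indent=2))
```

Because the environment is single-turn, the agent would send exactly one such call per clinical note.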

Time Horizon

MEDEC is a single-turn environment. The agent receives a clinical note and submits one answer.

Environment Difficulty

MEDEC is challenging for current LLMs. The original paper reports that Claude 3.5 Sonnet achieves 70.16% accuracy on error flag detection and 65.62% on error sentence detection, while medical doctors outperform all evaluated LLMs on these tasks.

Other Environment Requirements

  • OpenAI API key: Required for LLM-based grading of sentence identification and correction quality. Pass via secrets={"openai_api_key": "..."}.
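One way to avoid hard-coding the key is to build the secrets mapping from an environment variable. The sketch below is a minimal example of that pattern: the `build_secrets` helper and the `OPENAI_API_KEY` variable name are assumptions, and only the `{"openai_api_key": ...}` shape comes from the requirement above. How the mapping is then handed to the environment client is not shown, since that call is not described here.

```python
import os
from typing import Mapping


def build_secrets(env: Mapping[str, str] = os.environ) -> dict:
    """Return the secrets mapping expected by the environment.

    Reads the key from OPENAI_API_KEY (an assumed variable name) and
    fails early if it is missing, rather than at grading time.
    """
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; LLM-based grading needs it")
    return {"openai_api_key": key}


if __name__ == "__main__":
    try:
        secrets = build_secrets()
        print("secrets ready with keys:", list(secrets))
    except RuntimeError as exc:
        print("cannot start:", exc)
```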

Safety

MEDEC evaluates an agent's ability to detect errors in clinical notes and should not be used as a substitute for professional medical review. The environment does not involve real patient care or clinical decision-making.

Citations

@inproceedings{BenAbacha2025MEDEC,
  author    = {Ben Abacha, Asma and Yim, Wen-wai and Fu, Yujuan and Sun, Zhaoyi and Xia, Fei and Yetisgen, Meliha},
  title     = {MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
  year      = {2025},
  url       = {https://arxiv.org/abs/2412.19260}
}