0

MED Dialog RL Env (Medarc)

Fresh

MedDialog is a benchmark of real-world doctor-patient conversations focused on health-related concerns and advice and tests a model's ability to su...

Type
RL Env
Publisher
Medarc
Runtime
single-turn
License
unknown
Size
v0.1.0
Published
Feb 2026

Cite

Notes

Only stored in your browser.

MedDialog (English)

Overview

  • Environment ID: med_dialog
  • Short description: MedDialog is a benchmark of real-world doctor-patient conversations focused on health-related concerns and advice. Each dialogue is paired with a one-sentence summary that reflects the core patient question or exchange. The benchmark evaluates a model's ability to condense medical dialogue into concise, informative summaries.

Dataset

Task

  • Type: Single-Turn
  • Rubric overview: LLM-as-a-judge evaluation using prompts adapted from MedHELM (single or multi-judge)
  • Evaluation dimensions:
    • Accuracy (1-5): Does the summary correctly capture the main medical issue and clinical details?
    • Completeness (1-5): Does the summary include all important medical information?
    • Clarity (1-5): Is the summary easy to understand for clinical use?

Quickstart

Run an evaluation with default settings:

prime eval run med_dialog -m "openai/gpt-5-mini" -n 5 -s

Judge examples:

medarc-eval med_dialog -m "openai/gpt-5-mini" -n 20 -s --judge-model "openai/gpt-5-mini"

This environment supports multi-judge via the medarc-verifiers MultiJudge rubric. When multiple judge models are specified, each scores independently and the final reward is their mean. judge_model, judge_base_url, and judge_api_key are all list-valued and correspond positionally; if only one judge_base_url or judge_api_key is provided, it is used for all judge models.

medarc-eval med_dialog -m "openai/gpt-5-mini" -n 20 -s --judge-model "openai/gpt-5-mini" --judge-model "x-ai/grok-4.1-fast"

Notes:

  • Use direct environment flags with medarc-eval (for example, --split validation or --judge-model gpt-5-mini).

Environment Arguments

Document any supported environment arguments and their meaning:

ArgTypeDefaultDescription
cache_dirstr | Path | None~/.cache/meddialogLocal directory to cache downloaded datasets. Can also be set via MEDDIALOG_CACHE_DIR environment variable.
judge_modelstr | list[str]"gpt-4o-mini"Model identifier(s) for the LLM judge evaluating summaries
judge_base_urlstr | list[str] | NoneNoneCustom API base URL(s) for judge model (defaults to OpenAI API)
judge_api_keystr | list[str] | NoneNoneAPI key(s) for judge model. Falls back to JUDGE_API_KEY environment variable if not provided

Results Dataset Structure

Core Evaluation Fields

  • prompt - The input conversation presented to the model (list of message objects with role and content)
  • completion - The model's generated summary (list of message objects)
  • reward - Overall score from 0.0 to 1.0, calculated as the average of normalized dimension scores: (accuracy/5 + completeness/5 + clarity/5) / 3

Example Metadata (info)

Contains all the MedDialog-specific information about each dialogue:

  • id - Unique identifier for the dialogue
  • conversation - The full patient-doctor conversation text
  • reference_response - Gold standard one-sentence summary
  • subset - Either "healthcaremagic" or "icliniq"
  • index - Original index in the source dataset

Notes

  • The question field in the dataset maps to the full conversation text
  • The answer field contains the gold standard summary (also available as reference_response in info)
  • Scores are normalized to 0-1 by dividing each dimension score (1-5) by 5 and averaging across dimensions
  • If judge response parsing fails, dimension scores default to None and do not contribute to the final reward

Dataset Examples

Patient: I get cramps on top of my left forearm and hand and it causes my hand and
fingers to draw up and it hurts. It mainly does this when I bend my arm. I ve been
told that I have a slight pinch in a nerve in my neck. Could this be a cause? I don t
think so.

Doctor: Hi there. It may sound difficult to believe it, but the nerves which supply
your forearms and hand, start at the level of spinal cord and on their way towards the
forearm and hand regions which they supply, the course of these nerves pass through
difference fascial and muscular planes that can make them susceptible to entrapment
neuropathies...

Summary: Could painful forearms be related to pinched nerve in neck?
Patient: Hello doctor, We are looking for a second opinion on my friend's MRI scan of
both the knee joints as he is experiencing excruciating pain just above the patella.
He has a sudden onset of severe pain on both the knee joints about two weeks ago...

Doctor: Hi. I viewed the right and left knee MRI images. Left knee: The MRI, left knee
joint shows a complex tear in the posterior horn of the medial meniscus area and mild
left knee joint effusion...

Summary: My friend has excruciating knee pain. Please interpret his MRI report

References

MedDialog Dataset

@misc{chen2020meddiag,
  title={MedDialog: a large-scale medical dialogue dataset},
  author={Chen, Shu and Ju, Zeqian and Dong, Xiangyu and Fang, Hongchao and Wang, Sicheng and Yang, Yue and Zeng, Jiaqi and Zhang, Ruisi and Zhang, Ruoyu and Zhou, Meng and Zhu, Penghui and Xie, Pengtao},
  publisher = {arXiv},
  year={2020},
  url = {https://arxiv.org/abs/2004.03329},
}