
MED HALT RL Env (Medarc)


Med-HALT (Reasoning) evaluation environment for medical LLMs

Type
RL Env
Publisher
Medarc
Capabilities
Hallucination
Runtime
single-turn
License
unknown
Size
v0.1.1
Published
Dec 2025


Med-HALT

Overview

  • Environment ID: med_halt
  • Short description: Evaluate clinical reasoning hallucinations using Med-HALT’s FCT (False Confidence Test) and NOTA (None of the Above) tasks.
  • Tags: medical, hallucination, single-turn, multiple-choice, clinical reasoning, evaluation

Datasets

  • Primary dataset(s): Med-HALT (reasoning subset)

  • Source links: Paper, HF Dataset

    The upstream dataset ships only a single split, but each row carries an internal split_type field (train / val / test).
    This environment filters to the val subset, consistent with MedARC evaluation standards.

  • Split sizes:

    Test Type        Examples    Description
    reasoning_FCT    5,154       False Confidence Test: evaluates whether models can correctly assess proposed answers
    reasoning_nota   5,154       None of the Above Test: tests whether models can identify when no option is correct
    reasoning_fake   Not used    Fake Questions Test: assesses whether models can recognize nonsensical questions (excluded by default)
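
The val filtering described under Source links can be sketched in plain Python. This is a minimal illustration, not the environment's actual loading code: only the split_type field name and its three values come from the text above, while the `rows` records, the `id` key, and the `filter_split` helper are hypothetical.

```python
from collections import Counter

# Hypothetical records standing in for rows of the upstream single split.
# Only the `split_type` field name and its values come from the text above;
# the `id` key and the row contents are invented for illustration.
rows = [
    {"id": "q1", "split_type": "train"},
    {"id": "q2", "split_type": "val"},
    {"id": "q3", "split_type": "test"},
    {"id": "q4", "split_type": "val"},
]


def filter_split(records, split_type="val"):
    """Keep only the records whose `split_type` field matches."""
    return [r for r in records if r.get("split_type") == split_type]


val_rows = filter_split(rows)
print(Counter(r["split_type"] for r in rows))  # per-subset counts before filtering
print([r["id"] for r in val_rows])             # ids that survive the val filter
```

With a Hugging Face `datasets.Dataset`, the same predicate could presumably be passed to `Dataset.filter`; whether the environment does exactly that is an assumption.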

Notes:

  • FCT’s validation subset is purposely imbalanced: almost all examples contain incorrect student answers.
  • This is expected and documented behavior of the source dataset.

Task

  • Type: single-turn
  • Rubric overview: Binary scoring based on correct letter choice (A, B, C, D, etc.)

Quickstart

Run an evaluation with default settings (the default question type is FCT):

prime eval run med_halt -m "openai/gpt-5-mini" -n 5 -s

Configure model and sampling:

medarc-eval med_halt -m "openai/gpt-5-mini" -n 100 -s --question-type reasoning_nota --num-few-shot 3 --no-shuffle-answers

Notes:

  • With medarc-eval, environment arguments are passed as direct flags (for example, --question-type reasoning_nota or --num-few-shot 3, as in the command above).
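
To make the effect of `--num-few-shot 3` in the command above more concrete, here is a rough sketch of how a few-shot multiple-choice prompt could be assembled. The `build_prompt` helper, the `Question:`/`Answer:` layout, and the example records are all assumptions for illustration; the environment's real prompt template is not shown in this document.

```python
import string


def render_question(question, options):
    """Render one question with lettered options (A, B, C, ...)."""
    lines = [f"Question: {question}"]
    lines += [f"{string.ascii_uppercase[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


def build_prompt(question, options, few_shot_examples, num_few_shot=2):
    """Prepend up to `num_few_shot` solved examples to the target question.

    Hypothetical helper: the layout is an assumption, not the environment's template.
    """
    blocks = []
    for ex in few_shot_examples[:num_few_shot]:
        solved = render_question(ex["question"], ex["options"])
        blocks.append(f"{solved}\nAnswer: {ex['answer']}")
    # The target question is left unanswered for the model to complete.
    blocks.append(render_question(question, options) + "\nAnswer:")
    return "\n\n".join(blocks)


# Invented example records, used only to exercise the helper.
examples = [
    {"question": "Example 1?", "options": ["w", "x"], "answer": "A"},
    {"question": "Example 2?", "options": ["y", "z"], "answer": "B"},
    {"question": "Example 3?", "options": ["p", "q"], "answer": "A"},
]
prompt = build_prompt("Target?", ["m", "n", "None of the above"], examples, num_few_shot=2)
print(prompt)
```

Setting `num_few_shot=0` in this sketch yields a zero-shot prompt containing only the target question.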

Environment Arguments

Arg               Type          Default            Description
system_prompt     str | None    None               Custom system prompt (defaults to the standard XML/BOXED prompt based on answer_format)
shuffle_answers   bool          False              Whether to shuffle answer choices
shuffle_seed      int | None    1618               Random seed for reproducible answer shuffling
question_type     str           "reasoning_fct"    Question family to evaluate: "reasoning_fct" or "reasoning_nota"
num_few_shot      int           2                  Number of few-shot examples to include in prompts
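
The interaction between shuffle_answers and shuffle_seed can be illustrated with a small sketch of seeded shuffling that also tracks where the gold answer ends up. The `shuffle_options` helper and the sample options are hypothetical; only the default seed value 1618 is taken from the table above, and the environment's own shuffling logic may differ.

```python
import random
import string


def shuffle_options(options, gold_index, seed=1618):
    """Shuffle answer options reproducibly and return the new gold letter.

    Hypothetical helper: a sketch of seeded shuffling, not the environment's code.
    """
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The gold option now sits at the position where its old index appears in `order`.
    new_gold_index = order.index(gold_index)
    return shuffled, string.ascii_uppercase[new_gold_index]


# Invented options, used only to exercise the helper; "Heparin" is the gold answer.
options = ["Aspirin", "Heparin", "Warfarin", "None of the above"]
shuffled, gold = shuffle_options(options, gold_index=1)
again, gold_again = shuffle_options(options, gold_index=1)
print(shuffled, gold)
```

Calling the helper twice with the same seed gives the same order, which is the reproducibility a fixed seed is meant to provide. In this sketch, passing `seed=None` makes `random.Random` seed itself from system entropy, so the order would vary between runs.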

Metrics

Metric      Meaning
accuracy    1.0 if the parsed letter matches the gold letter, else 0.0
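
The binary accuracy metric can be sketched as a parse-then-compare step. This is a minimal illustration under assumptions: the `<answer>X</answer>` and `\boxed{X}` formats are guesses based on the "XML/BOXED" wording in the arguments table, and the `parse_letter` and `accuracy` helpers are hypothetical rather than the environment's actual parser or rubric.

```python
import re

# Assumed answer formats: <answer>X</answer> (XML) or \boxed{X} (BOXED).
_PATTERNS = [
    re.compile(r"<answer>\s*([A-Za-z])\s*</answer>"),
    re.compile(r"\\boxed\{\s*([A-Za-z])\s*\}"),
]


def parse_letter(completion):
    """Return the last answer letter found (uppercased), or None if absent."""
    for pattern in _PATTERNS:
        matches = pattern.findall(completion)
        if matches:
            return matches[-1].upper()
    return None


def accuracy(completion, gold_letter):
    """Binary score: 1.0 if the parsed letter equals the gold letter, else 0.0."""
    parsed = parse_letter(completion)
    return 1.0 if parsed == gold_letter.upper() else 0.0


print(accuracy("The student is wrong. <answer>B</answer>", "B"))
print(accuracy("I think the answer is B", "B"))  # no parseable answer tag
```

A completion with no parseable letter scores 0.0 in this sketch, so formatting failures count as wrong answers.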

Credits

Dataset:

@article{pal2023medhalt,
  title={Med-HALT: Medical Domain Hallucination Test for Large Language Models},
  author={Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan},
  journal={arXiv preprint arXiv:2307.15343},
  year={2023}
}