medhallu
Overview
- Environment ID: `medhallu`
- Short description: Medical hallucination detection benchmark evaluating whether models can identify factual vs. hallucinated medical answers.
- Tags: hallucination-detection, medical, classification, single-turn
Datasets Information
- Paper: MedHallu: A Comprehensive Benchmark for Detecting Medical Hallucinations in Large Language Models.
- Source links: UTAustin-AIHealth/MedHallu
- Split sizes:
  - `pqa_labeled`: ~1k high-quality human-labeled examples
  - `pqa_artificial`: ~9k synthetically generated examples
Task
- Type: single-turn
- Rubric overview:
  - `+1.0` for correct classification (matching target `0` or `1`)
  - `+0.01` for abstaining with `2` (unsure) (configurable via `unsure_reward`)
  - `0.0` for incorrect classification or malformed answer
The model is presented with a medical question and an answer, then must judge:
- `0` = Answer is factual
- `1` = Answer is hallucinated
- `2` = Unsure (partial credit)
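The rubric above can be sketched as a small scoring function. This is an illustration only, not the environment's actual implementation: the names `parse_label` and `score` are hypothetical, and taking the last `\boxed{...}` in a completion as the final answer is an assumption.

```python
import re
from typing import Optional

# Matches \boxed{0}, \boxed{1}, or \boxed{2}, allowing whitespace inside the braces.
BOXED_RE = re.compile(r"\\boxed\{\s*([012])\s*\}")


def parse_label(completion: str) -> Optional[int]:
    # Assumption: if several boxed answers appear, the last one is the final answer.
    matches = BOXED_RE.findall(completion)
    return int(matches[-1]) if matches else None


def score(completion: str, target: int, unsure_reward: float = 0.01) -> float:
    label = parse_label(completion)
    if label is None:
        return 0.0  # malformed or missing answer
    if label == 2:
        return unsure_reward  # abstention earns the configurable partial credit
    return 1.0 if label == target else 0.0
```

With the default `unsure_reward`, abstaining scores slightly above a wrong answer but far below a correct one, so it only pays off when the model would otherwise guess wrongly.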
Differences versus the MedHallu paper
This environment intentionally differs from the MedHallu paper’s evaluation protocol:
- We evaluate both options per item: for each dataset row, we create two evaluation examples, one pairing the question with the Ground Truth answer (label `0`) and one pairing it with the Hallucinated Answer (label `1`). The paper’s implementation samples one of the two.
- F1 is computed via postprocessing: the paper reports F1 (treating hallucination as the positive class). In this repo, compute F1 by postprocessing the `results.jsonl` output and dropping `\boxed{2}` (unsure) predictions.
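The two-examples-per-row expansion described above might look like the following sketch. The column names `"Question"`, `"Ground Truth"`, and `"Hallucinated Answer"` are assumed from the field names mentioned in this README, and `expand_row` is a hypothetical helper, not a function from this repo.

```python
def expand_row(row: dict) -> list:
    # Each dataset row yields two evaluation examples that share one question.
    question = row["Question"]
    return [
        # Ground-truth answer: the correct judgment is 0 (factual).
        {"question": question, "answer": row["Ground Truth"], "label": 0},
        # Hallucinated answer: the correct judgment is 1 (hallucinated).
        {"question": question, "answer": row["Hallucinated Answer"], "label": 1},
    ]
```

One consequence of this design is that the expanded evaluation set is balanced by construction: exactly half of the examples carry label `1`.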
Quickstart
Run an evaluation with default settings:
prime eval run medhallu -m "openai/gpt-5-mini" -n 5 -s
Configure model and sampling:
medarc-eval medhallu -m "openai/gpt-5-mini" -n 20 --subset pqa_labeled
Notes:
- Use direct environment flags with `medarc-eval` (for example, `--split validation` or `--judge-model gpt-5-mini`).
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| `subset` | str | `"pqa_labeled"` | Dataset subset: `"pqa_labeled"` (1k high-quality) or `"pqa_artificial"` (9k generated) |
| `difficulty` | str | `"all"` | Filter by difficulty: `"easy"`, `"medium"`, `"hard"`, or `"all"` |
| `use_knowledge` | bool | `False` | If `True`, includes the "Knowledge" field in the prompt as additional context |
| `unsure_reward` | float | `0.01` | Reward assigned when the model outputs `\boxed{2}` |
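As a rough illustration of how `difficulty` and `use_knowledge` could shape the data and the prompt, consider the sketch below. Everything here is hypothetical: the helpers `keep_row` and `build_prompt`, the `"Difficulty Level"` column name, and the prompt wording are assumptions, and the environment's real prompt template may differ.

```python
from typing import Optional


def keep_row(row: dict, difficulty: str = "all") -> bool:
    # "all" disables filtering; otherwise keep rows whose (assumed) difficulty column matches.
    return difficulty == "all" or row.get("Difficulty Level") == difficulty


def build_prompt(question: str, answer: str, knowledge: Optional[str] = None) -> str:
    parts = []
    if knowledge:
        # Only included when use_knowledge=True supplies the Knowledge field.
        parts.append(f"Knowledge: {knowledge}")
    parts.append(f"Question: {question}")
    parts.append(f"Answer: {answer}")
    parts.append(
        "Reply with \\boxed{0} (factual), \\boxed{1} (hallucinated), or \\boxed{2} (unsure)."
    )
    return "\n\n".join(parts)
```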
Metrics
| Metric | Meaning |
|---|---|
| `reward` | Scalar reward used by Verifiers (see rubric overview above) |
| `accuracy` | Exact match on the target label (`0` or `1`) |
| `precision`/`recall`/`f1` | Not produced by the environment directly; compute via postprocessing (below) |
Postprocessing (F1)
After you run an eval, compute paper-style F1 (positive label 1) and update the run’s metadata.json:
python environments/medhallu/postprocess.py /path/to/results.jsonl
This script:
- extracts `\boxed{0|1|2}` from completions
- drops missing/malformed answers
- drops `\boxed{2}` (unsure)
- computes `accuracy`, `precision`, `recall`, `f1` (with `1` as the positive class)
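The metric step listed above can be sketched as follows. This is not the contents of `postprocess.py`: `binary_metrics` is a hypothetical function that takes already-extracted `(predicted, target)` pairs, because the record schema of `results.jsonl` is not shown here. Returning `0.0` when a denominator is zero is also an assumption.

```python
def binary_metrics(pairs) -> dict:
    # pairs: iterable of (predicted, target); predicted is 0, 1, 2, or None (malformed).
    # Drop unsure (2) and missing/malformed (None) predictions before scoring.
    kept = [(p, t) for p, t in pairs if p in (0, 1)]

    # Label 1 (hallucinated) is the positive class.
    tp = sum(1 for p, t in kept if p == 1 and t == 1)
    fp = sum(1 for p, t in kept if p == 1 and t == 0)
    fn = sum(1 for p, t in kept if p == 0 and t == 1)
    correct = sum(1 for p, t in kept if p == t)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = correct / len(kept) if kept else 0.0
    return {
        "n_kept": len(kept),
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

Because dropped predictions leave the denominator entirely, a model that abstains often is scored only on the items it chose to answer; reporting `n_kept` alongside F1 makes that visible.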
Hallucination Types
The benchmark covers these hallucination categories, which the model must learn to detect:
- Misinterpretation of Question: Off-topic or irrelevant responses due to misunderstanding
- Incomplete Information: Pointing out what's false without providing correct information
- Mechanism and Pathway Misattribution: False attribution of biological mechanisms or disease processes
- Methodological and Evidence Fabrication: Invented research methods, statistics, or clinical outcomes