Medhallu RL Env (Medarc)

Medical hallucination detection benchmark

Type: RL Env
Publisher: Medarc
Runtime: single-turn
License: unknown
Size: v0.1.0
Published: Feb 2026

medhallu

Overview

  • Environment ID: medhallu
  • Short description: Medical hallucination detection benchmark evaluating whether models can identify factual vs. hallucinated medical answers.
  • Tags: hallucination-detection, medical, classification, single-turn

Datasets Information

Task

  • Type: single-turn
  • Rubric overview:
    • +1.0 for correct classification (matching target 0 or 1)
    • +0.01 for abstaining with 2 (unsure); configurable via unsure_reward
    • 0.0 for incorrect classification or malformed answer

The model is presented with a medical question and an answer, then must judge:

  • 0 = Answer is factual
  • 1 = Answer is hallucinated
  • 2 = Unsure (partial credit)
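
The rubric and label scheme above can be sketched as a small scoring function. This is a minimal illustration and not the environment's actual implementation: the names extract_label and score are hypothetical, and reading the last \boxed{...} in the completion is an assumption about how answers are parsed.

```python
import re
from typing import Optional

# Assumption: the model's answer is the LAST \boxed{...} in the completion.
_BOXED = re.compile(r"\\boxed\{\s*([^{}]*?)\s*\}")


def extract_label(completion: str) -> Optional[int]:
    """Return 0, 1, or 2 from the final boxed answer, or None if malformed."""
    matches = _BOXED.findall(completion)
    if not matches:
        return None
    value = matches[-1]
    return int(value) if value in {"0", "1", "2"} else None


def score(completion: str, target: int, unsure_reward: float = 0.01) -> float:
    """Hypothetical reward mirroring the rubric: 1.0 / unsure_reward / 0.0."""
    label = extract_label(completion)
    if label is None:
        return 0.0  # malformed or missing answer
    if label == 2:
        return unsure_reward  # abstention earns a small, configurable reward
    return 1.0 if label == target else 0.0
```

With the default unsure_reward of 0.01, abstaining scores slightly above a wrong answer but far below a correct one, so a model that can do better than chance has little incentive to abstain.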

Differences versus MedHallu paper

This environment intentionally differs from the MedHallu paper’s evaluation protocol:

  • We evaluate both options per item: for each dataset row, we create two evaluation examples — one pairing the question with the Ground Truth answer (label 0) and one pairing it with the Hallucinated Answer (label 1). The paper’s implementation samples one of the two.
  • F1 is computed via postprocessing: the paper reports F1 (treating hallucination as the positive class). In this repo, you should compute F1 by postprocessing the results.jsonl output and dropping \boxed{2} (unsure) predictions.
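
The per-item expansion described above could look roughly like the following. It is a sketch only: the column names "Question", "Ground Truth", and "Hallucinated Answer" are assumed from the description, and expand_row and expand_dataset are hypothetical helpers, not functions from this repo.

```python
from typing import Dict, Iterable, List


def expand_row(row: Dict[str, str]) -> List[Dict[str, object]]:
    """Turn one dataset row into two evaluation examples (labels 0 and 1)."""
    question = row["Question"]  # assumed column name
    return [
        # Factual pairing: question + ground-truth answer, target label 0.
        {"question": question, "answer": row["Ground Truth"], "target": 0},
        # Hallucinated pairing: question + hallucinated answer, target label 1.
        {"question": question, "answer": row["Hallucinated Answer"], "target": 1},
    ]


def expand_dataset(rows: Iterable[Dict[str, str]]) -> List[Dict[str, object]]:
    """Expand every row, doubling the number of evaluation examples."""
    examples: List[Dict[str, object]] = []
    for row in rows:
        examples.extend(expand_row(row))
    return examples
```

Under this scheme the evaluation set is exactly balanced between the two labels, whereas sampling one answer per row gives a balance that depends on the random draw.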

Quickstart

Run an evaluation with default settings:

prime eval run medhallu -m "openai/gpt-5-mini" -n 5 -s

Configure model and sampling:

medarc-eval medhallu -m "openai/gpt-5-mini" -n 20 --subset pqa_labeled

Notes:

  • Use direct environment flags with medarc-eval (for example, --split validation or --judge-model gpt-5-mini).

Environment Arguments

Arg | Type | Default | Description
--- | --- | --- | ---
subset | str | "pqa_labeled" | Dataset subset: "pqa_labeled" (1k high-quality) or "pqa_artificial" (9k generated)
difficulty | str | "all" | Filter by difficulty: "easy", "medium", "hard", or "all"
use_knowledge | bool | False | If True, includes the "Knowledge" field in the prompt as additional context
unsure_reward | float | 0.01 | Reward assigned when the model outputs \boxed{2}

Metrics

Metric | Meaning
--- | ---
reward | Scalar reward used by Verifiers (see rubric overview above)
accuracy | Exact match on the target label (0 or 1)
precision/recall/f1 | Not produced by the environment directly; compute via postprocessing (below)

Postprocessing (F1)

After you run an eval, compute paper-style F1 (positive label 1) and update the run’s metadata.json:

python environments/medhallu/postprocess.py /path/to/results.jsonl

This script:

  • extracts \boxed{0|1|2} from completions
  • drops missing/malformed answers
  • drops \boxed{2} (unsure)
  • computes accuracy, precision, recall, f1 (with 1 as the positive class)
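
For readers who want to see the arithmetic, the steps above can be approximated as follows. This is an independent sketch, not the contents of postprocess.py: parse_prediction and binary_metrics are hypothetical names, the regex is an assumed extraction rule, and the sketch works on (prediction, target) pairs rather than guessing at the results.jsonl schema.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Assumed extraction rule: take the last well-formed \boxed{0|1|2}.
_BOXED = re.compile(r"\\boxed\{\s*([012])\s*\}")


def parse_prediction(completion: str) -> Optional[int]:
    """Return the boxed label, or None when no valid boxed answer is found."""
    matches = _BOXED.findall(completion)
    return int(matches[-1]) if matches else None


def binary_metrics(pairs: Iterable[Tuple[Optional[int], int]]) -> Dict[str, float]:
    """Compute accuracy/precision/recall/F1 with label 1 as the positive class.

    Predictions of None (malformed) and 2 (unsure) are dropped before scoring.
    """
    kept = [(p, t) for p, t in pairs if p in (0, 1)]
    tp = sum(1 for p, t in kept if p == 1 and t == 1)
    fp = sum(1 for p, t in kept if p == 1 and t == 0)
    fn = sum(1 for p, t in kept if p == 0 and t == 1)
    correct = sum(1 for p, t in kept if p == t)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = correct / len(kept) if kept else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "n_scored": float(len(kept)),
    }
```

Because unsure and malformed predictions are dropped, these metrics describe only the examples the model committed to; reporting n_scored alongside them shows how many examples that was.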

Hallucination Types

The hallucinated answers in the benchmark fall into these categories, which the model is expected to detect:

  • Misinterpretation of Question: Off-topic or irrelevant responses due to misunderstanding
  • Incomplete Information: Pointing out what's false without providing correct information
  • Mechanism and Pathway Misattribution: False attribution of biological mechanisms or disease processes
  • Methodological and Evidence Fabrication: Invented research methods, statistics, or clinical outcomes