LongHealth
Overview
- Environment ID: longhealth
- Short description: Evaluates an LLM's ability to process and extract information from long clinical documents (5K-6.7K words per case)
- Tags: medical, clinical, single-turn, multiple-choice, long-context, eval
Datasets
- Paper: https://arxiv.org/abs/2401.14490
- Dataset / Project: https://github.com/kbressem/LongHealth
- Split sizes: 400 questions total (20 patients × 20 questions each). No official train/val splits
- Context length: 5,090 to 6,754 words per patient case
Benchmark Tasks
Task 1: Information Extraction
- Tests ability to extract correct information from long clinical documents
- Answer is ALWAYS present in provided documents
- 5-option MCQ (A/B/C/D/E)
Task 2: Negation Detection & Hallucination Prevention
- Tests ability to identify when information is NOT available
- Creates pairs of examples:
  - Negation: Only distractor documents (should answer F: "Cannot be answered")
  - Identification: Answer docs + distractors (should answer correctly)
- 6-option MCQ (A/B/C/D/E/F)
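The pairing described above can be sketched in a few lines of Python. This is a minimal illustration, not the environment's actual code: `build_task2_pair` and every field name in it (`text`, `options`, `correct`, `docs`) are hypothetical, and shuffling the answer documents in with the distractors is an assumption.

```python
import random


def build_task2_pair(question, answer_docs, distractor_docs, seed=0):
    """Build the (negation, identification) pair for one question.

    All field names here are hypothetical, chosen only for illustration.
    """
    # Task 2 adds a sixth option, F, to the five original choices.
    options = dict(question["options"])
    options["F"] = "Cannot be answered"

    # Negation: the context holds only distractors, so F is correct.
    negation = {
        "question": question["text"],
        "options": options,
        "docs": list(distractor_docs),
        "answer": "F",
        "task": "task2_negation",
        "has_answer_docs": False,
    }

    # Identification: answer documents are mixed with distractors
    # (shuffling them together is an assumption of this sketch).
    mixed = list(answer_docs) + list(distractor_docs)
    random.Random(seed).shuffle(mixed)
    identification = {
        "question": question["text"],
        "options": options,
        "docs": mixed,
        "answer": question["correct"],
        "task": "task2_identification",
        "has_answer_docs": True,
    }
    return negation, identification
```

Each source question therefore yields two examples that differ only in their document context and expected letter.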
Task 3: Temporal Reasoning
- Embedded within Task 2 framework
- Questions focus on chronological ordering and temporal relationships
- Tests understanding of event sequences in medical timelines
Quickstart
Run an evaluation with default settings (Task 1, first 5 examples):
prime eval run longhealth -m "openai/gpt-5-mini" -n 5 -s
Configure model and task:
medarc-eval longhealth -m "openai/gpt-5-mini" -n 20 -s --task task2
medarc-eval longhealth -m "openai/gpt-5-mini" -n 10 -s --task all --doc-shuffle-seed 2718 --max-context-tokens 30000
Environment Arguments
| Arg | Type | Default | Description |
|---|---|---|---|
| task | str | "task1" | Which task(s): "task1" (extraction), "task2" (negation), "all" (both) |
| max_context_tokens | int | 16000 | Maximum tokens for document context |
| shuffle_docs | bool | True | Shuffle document order to test positional bias |
| doc_shuffle_seed | int \| None | -1 | Seed for document shuffling (-1 for nondeterministic order each run) |
| shuffle_answers | bool | False | Shuffle answer options |
| shuffle_seed | int \| None | 1618 | Seed for answer shuffling |
| max_examples | int | -1 | Limit number of examples (-1 for all) |
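To make the interplay of `shuffle_docs`, `doc_shuffle_seed`, and `max_context_tokens` concrete, here is a rough sketch. The helpers `shuffle_documents` and `fit_to_budget` are hypothetical, not functions from the environment. Treating `None` like `-1` (unseeded) is an assumption, and so are the whitespace word count standing in for a real tokenizer and the rule of dropping documents once the budget is exceeded.

```python
import random


def shuffle_documents(docs, shuffle_docs=True, doc_shuffle_seed=-1):
    """Return a (possibly shuffled) copy of docs."""
    if not shuffle_docs:
        return list(docs)
    # Assumption: -1 or None means "no fixed seed", so order varies per run.
    if doc_shuffle_seed in (None, -1):
        rng = random.Random()
    else:
        rng = random.Random(doc_shuffle_seed)
    shuffled = list(docs)
    rng.shuffle(shuffled)
    return shuffled


def fit_to_budget(docs, max_context_tokens=16000,
                  count_tokens=lambda text: len(text.split())):
    """Keep documents in order until the next one would exceed the budget.

    count_tokens defaults to a crude whitespace count; a real tokenizer
    would give different numbers.
    """
    kept, used = [], 0
    for doc in docs:
        cost = count_tokens(doc)
        if used + cost > max_context_tokens:
            break
        kept.append(doc)
        used += cost
    return kept
```

With a fixed seed such as `--doc-shuffle-seed 2718`, repeated runs see the same document order, which makes positional-bias comparisons reproducible.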
Metrics
| Metric | Meaning |
|---|---|
| reward | Exact match accuracy (1.0 if correct letter, 0.0 otherwise) |
| info.task | Which sub-task: task1, task2_negation, or task2_identification |
| info.has_answer_docs | Whether answer-containing documents were included |
| info.num_docs | Number of documents in the context |
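As an illustration of the exact-match reward and of breaking results down by `info.task`, consider the sketch below. The answer parser is an assumption: how the environment extracts the letter from a completion is not documented here, and `extract_letter`, `exact_match_reward`, and `accuracy_by_task` are hypothetical names.

```python
import re
from collections import defaultdict


def extract_letter(completion):
    """Return a leading option letter (A-F), or None if there is none.

    Assumed format: the reply starts with the letter, optionally in
    parentheses, and the letter is not the start of a longer word.
    """
    match = re.match(r"\(?([A-F])\)?(?![A-Za-z0-9])", completion.strip())
    return match.group(1) if match else None


def exact_match_reward(completion, answer):
    """1.0 if the parsed letter equals the gold letter, else 0.0."""
    return 1.0 if extract_letter(completion) == answer else 0.0


def accuracy_by_task(records):
    """Average reward per sub-task, keyed by each record's info['task']."""
    totals = defaultdict(lambda: [0.0, 0])
    for record in records:
        bucket = totals[record["info"]["task"]]
        bucket[0] += record["reward"]
        bucket[1] += 1
    return {task: total / count for task, (total, count) in totals.items()}
```

Reporting `task2_negation` and `task2_identification` separately shows whether a model tends to hallucinate an answer or to over-select "Cannot be answered".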
Example Usage
import verifiers as vf
# Load Task 1
env = vf.load_environment("longhealth", task="task1")
# Load Task 2 with custom settings
env = vf.load_environment(
    "longhealth",
    task="task2",
    max_context_tokens=14000,
    shuffle_docs=True,
)
# Run evaluation programmatically (top-level await needs an async context, e.g. a notebook)
from openai import AsyncOpenAI
client = AsyncOpenAI()
results = await env.evaluate(client, "gpt-4.1-mini", num_examples=10)
Authors
This environment has been put together by:
Shamus Sim Zi Yang - (@ss8319)