Longhealth RL Env (Medarc)

LongHealth: A Question Answering Benchmark with Long Clinical Documents - 20 patients, 400 MCQ questions

Type: RL Env
Publisher: Medarc
Capabilities: Long Context
License: unknown
Size: v0.1.1
Published: Dec 2025

LongHealth

Overview

  • Environment ID: longhealth
  • Short description: Evaluates LLM ability to process and extract information from long clinical documents (5K-6.7K words per case)
  • Tags: medical, clinical, single-turn, multiple-choice, long-context, eval

Datasets

Benchmark Tasks

Task 1: Information Extraction

  • Tests ability to extract correct information from long clinical documents
  • Answer is ALWAYS present in provided documents
  • 5-option MCQ (A/B/C/D/E)

Task 2: Negation Detection & Hallucination Prevention

  • Tests ability to identify when information is NOT available
  • Creates pairs of examples:
    • Negation: Only distractor documents (should answer F: "Cannot be answered")
    • Identification: Answer docs + distractors (should answer correctly)
  • 6-option MCQ (A/B/C/D/E/F)
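The pairing described above can be sketched roughly as follows. This is an illustration only, not the environment's actual code: `build_task2_pair`, the dictionary keys, and the `question` fields are hypothetical names chosen for the example. Only the two sub-task labels and the "F" option come from this README.

```python
import random

CANNOT_ANSWER = "F"  # sixth option: "Cannot be answered"


def build_task2_pair(question, answer_docs, distractor_docs, seed=0):
    """Return (negation_example, identification_example) for one question.

    Hypothetical helper: the real environment may build examples differently.
    """
    rng = random.Random(seed)

    # Negation: the context holds only distractors, so the right answer is F.
    negation = {
        "question": question["text"],
        "docs": list(distractor_docs),
        "answer": CANNOT_ANSWER,
        "task": "task2_negation",
        "has_answer_docs": False,
    }

    # Identification: answer documents are mixed in with the distractors,
    # so the original correct letter still applies.
    mixed = list(answer_docs) + list(distractor_docs)
    rng.shuffle(mixed)
    identification = {
        "question": question["text"],
        "docs": mixed,
        "answer": question["correct_letter"],
        "task": "task2_identification",
        "has_answer_docs": True,
    }
    return negation, identification
```

Scoring both halves of a pair separates two failure modes: answering from distractors when nothing relevant is present, and refusing to answer when the evidence is actually in context.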

Task 3: Temporal Reasoning

  • Embedded within Task 2 framework
  • Questions focus on chronological ordering and temporal relationships
  • Tests understanding of event sequences in medical timelines

Quickstart

Run an evaluation with default settings (Task 1, first 5 examples):

prime eval run longhealth -m "openai/gpt-5-mini" -n 5 -s

Configure model and task:

medarc-eval longhealth -m "openai/gpt-5-mini" -n 20 -s --task task2

medarc-eval longhealth -m "openai/gpt-5-mini" -n 10 -s --task all --doc-shuffle-seed 2718 --max-context-tokens 30000

Environment Arguments

Arg                 Type        Default  Description
task                str         "task1"  Which task(s): "task1" (extraction), "task2" (negation), "all" (both)
max_context_tokens  int         16000    Maximum tokens for document context
shuffle_docs        bool        True     Shuffle document order to test positional bias
doc_shuffle_seed    int | None  -1       Seed for document shuffling (-1 for nondeterministic order each run)
shuffle_answers     bool        False    Shuffle answer options
shuffle_seed        int | None  1618     Seed for answer shuffling
max_examples        int         -1       Limit number of examples (-1 for all)
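
As a rough illustration of how `shuffle_docs`, `doc_shuffle_seed`, and `max_context_tokens` might interact, the sketch below orders documents and then keeps as many as fit a token budget. Several things here are assumptions rather than documented behavior: the helper names are hypothetical, `None` is treated the same as -1, documents are dropped whole rather than truncated, and whitespace word counts stand in for a real tokenizer.

```python
import random


def order_documents(docs, shuffle_docs=True, doc_shuffle_seed=-1):
    """Return docs in evaluation order (hypothetical helper)."""
    docs = list(docs)
    if not shuffle_docs:
        return docs
    # Assumption: -1 (and None) mean "use a fresh, unseeded RNG each run".
    if doc_shuffle_seed is None or doc_shuffle_seed == -1:
        rng = random.Random()
    else:
        rng = random.Random(doc_shuffle_seed)
    rng.shuffle(docs)
    return docs


def fit_to_budget(docs, max_context_tokens=16000,
                  count_tokens=lambda text: len(text.split())):
    """Keep leading docs until the next one would exceed the budget.

    Assumption: whole documents are dropped; the real environment may
    truncate instead, and it will use a real tokenizer.
    """
    kept, used = [], 0
    for doc in docs:
        cost = count_tokens(doc)
        if used + cost > max_context_tokens:
            break
        kept.append(doc)
        used += cost
    return kept
```

With a fixed seed such as 2718 the order is reproducible across runs, which makes it possible to tell positional-bias effects apart from run-to-run noise.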

Metrics

Metric                Meaning
reward                Exact match accuracy (1.0 if correct letter, 0.0 otherwise)
info.task             Which sub-task: task1, task2_negation, or task2_identification
info.has_answer_docs  Whether answer-containing documents were included
info.num_docs         Number of documents in the context
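
A minimal sketch of the scoring idea, assuming the model's answer has already been parsed down to a single letter. The function names and the record layout are hypothetical, and normalizing case and whitespace before comparison is an assumption. Only the 1.0/0.0 exact-match rule and the `info.task` values come from the table above.

```python
from collections import defaultdict


def exact_match_reward(predicted_letter, correct_letter):
    """1.0 if the letters match, else 0.0.

    Assumption: comparison ignores case and surrounding whitespace.
    """
    same = predicted_letter.strip().upper() == correct_letter.strip().upper()
    return 1.0 if same else 0.0


def accuracy_by_task(records):
    """Average reward per sub-task, grouped by info.task.

    Hypothetical record shape: {"reward": float, "info": {"task": str}}.
    """
    rewards = defaultdict(list)
    for rec in records:
        rewards[rec["info"]["task"]].append(rec["reward"])
    return {task: sum(vals) / len(vals) for task, vals in rewards.items()}
```

Reporting task2_negation and task2_identification separately is useful because a model that always answers F, or never does, can look acceptable on a pooled average.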

Example Usage

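import asyncio

import verifiers as vf
from openai import AsyncOpenAI

# Load Task 1 (information extraction)
env = vf.load_environment("longhealth", task="task1")

# Load Task 2 with custom settings (replaces the Task 1 environment above)
env = vf.load_environment(
    "longhealth",
    task="task2",
    max_context_tokens=14000,
    shuffle_docs=True,
)

# Run evaluation programmatically; `await` must run inside a coroutine
async def main():
    client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
    return await env.evaluate(client, "gpt-4.1-mini", num_examples=10)

results = asyncio.run(main())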

Authors

This environment has been put together by:

Shamus Sim Zi Yang - (@ss8319)