HEAD-QA v2 RL Env (Medarc)

HEAD-QA v2 environment

Type
RL Env
Publisher
Medarc
Runtime
single-turn
License
unknown
Version
v0.1.1
Published
Dec 2025

HEAD-QA v2

Evaluation environment for the HEAD-QA v2 dataset.

Overview

  • Environment ID: head-qa-v2
  • Short description: Single-turn medical multiple-choice QA
  • Tags: medical, single-turn, multiple-choice, eval

Datasets

  • Primary dataset(s): HEAD-QA v2 (HF datasets)
  • Source links: alesi12/head_qa_v2
  • Splits: Evaluation uses the dataset's provided train split

Task

  • Type: Single-turn
  • Rubric overview: Binary scoring; each response earns 1.0 if it selects the correct answer and 0.0 otherwise
  • Reward function: accuracy, which returns 1.0 if the predicted answer matches the correct answer, else 0.0.
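
The binary reward described above can be sketched in a few lines of Python. This is only an illustration: the function signature and the whitespace/case normalization are assumptions, and the environment's actual answer parsing may differ.

```python
def accuracy(predicted: str, correct: str) -> float:
    """Binary reward: 1.0 when the predicted option label matches the correct one.

    Assumption: answers are compared as option labels (e.g. "A", "B") after
    trimming whitespace and ignoring case. The real environment may parse
    model output differently.
    """
    return 1.0 if predicted.strip().upper() == correct.strip().upper() else 0.0


if __name__ == "__main__":
    print(accuracy(" b ", "B"))
    print(accuracy("A", "B"))
```

Averaging this value over all evaluated questions gives the overall accuracy.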

Quickstart

Run an evaluation with default settings:

prime eval run head-qa-v2 -m "openai/gpt-5-mini" -n 5 -s

Usage

To run an evaluation using medarc-eval:

medarc-eval head-qa-v2 -m "openai/gpt-5-mini" -n 5 -s --shuffle-answers --shuffle-seed 42
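
The --shuffle-answers and --shuffle-seed flags suggest that answer options are reordered reproducibly. The sketch below shows one way seeded shuffling can work while keeping track of the correct option. The helper name shuffle_options and its signature are hypothetical and are not taken from medarc-eval, whose implementation may differ.

```python
import random


def shuffle_options(options: list[str], correct_index: int, seed: int) -> tuple[list[str], int]:
    """Return the options in a seeded random order plus the new index of the correct one.

    Hypothetical helper for illustration only; not the medarc-eval implementation.
    """
    rng = random.Random(seed)          # local RNG, so global random state is untouched
    order = list(range(len(options)))  # order[k] = original index placed at position k
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_correct_index = order.index(correct_index)
    return shuffled, new_correct_index


if __name__ == "__main__":
    options = ["option one", "option two", "option three", "option four"]
    shuffled, idx = shuffle_options(options, correct_index=2, seed=42)
    # The correct option text is preserved at its new position.
    print(shuffled[idx] == options[2])
```

Fixing the seed makes the permutation repeatable, so two runs with the same seed present the options in the same order.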

Authors

This environment was put together by:

Ratna Sagari Grandhi (@sagarigrandhi)

Credits

Dataset:

@inproceedings{vilares-gomez-rodriguez-2019-head,
    title = "{HEAD}-{QA}: A Healthcare Dataset for Complex Reasoning",
    author = "Vilares, David  and
      G{\'o}mez-Rodr{\'i}guez, Carlos",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1092",
    doi = "10.18653/v1/P19-1092",
    pages = "960--966",
    abstract = "We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.",
}