HEAD-QA v2

Evaluation environment for the HEAD-QA v2 dataset.

Overview

Environment ID: head-qa-v2
Short description: Single-turn medical multiple-choice QA
Tags: medical, single-turn, multiple-choice, eval

Datasets

Primary dataset(s): HEAD-QA v2 (HF datasets)
Source links: alesi12/head_qa_v2
Split sizes: Not listed here; evaluation uses the dataset's provided train split

Task

Type: Single-turn
Rubric overview: Binary scoring; each response earns 1.0 if it selects the correct answer and 0.0 otherwise
Reward function: accuracy, which returns 1.0 if the predicted answer matches the correct answer, else 0.0.
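
As a rough illustration of the binary scoring described above, a minimal sketch in Python follows. The function signature and the normalization step (trimming whitespace, ignoring case) are assumptions made for this example; the environment's actual answer parsing is not shown here and may differ.

```python
def accuracy(predicted: str, correct: str) -> float:
    """Binary reward: 1.0 when the predicted answer matches, 0.0 otherwise.

    Assumption: answers are compared as option labels (e.g. "A", "B")
    after trimming whitespace and ignoring case.
    """
    same = predicted.strip().upper() == correct.strip().upper()
    return 1.0 if same else 0.0


if __name__ == "__main__":
    # Hypothetical predictions for a question whose correct option is "B".
    for guess in ["B", " b ", "C"]:
        print(repr(guess), accuracy(guess, "B"))
```

Averaging this reward over all evaluated questions gives the overall accuracy for a run.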

Quickstart

Run an evaluation with default settings:

prime eval run head-qa-v2 -m "openai/gpt-5-mini" -n 5 -s

Usage

To run an evaluation using medarc-eval, here with answer options shuffled under a fixed seed:

medarc-eval head-qa-v2 -m "openai/gpt-5-mini" -n 5 -s --shuffle-answers --shuffle-seed 42
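
To make the idea behind --shuffle-answers and --shuffle-seed concrete, the sketch below shows one way seeded answer shuffling can work. It is not medarc-eval's implementation: the function name shuffle_options and its arguments are hypothetical, and the only point illustrated is that a fixed seed yields the same option order on every run while the correct answer is tracked to its new position.

```python
import random


def shuffle_options(options, answer_index, seed):
    """Return a shuffled copy of the options and the new index of the answer.

    Hypothetical helper for illustration only; not part of medarc-eval.
    """
    rng = random.Random(seed)          # local RNG, so global state is untouched
    order = list(range(len(options)))  # original positions 0..n-1
    rng.shuffle(order)                 # deterministic for a given seed
    shuffled = [options[i] for i in order]
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


if __name__ == "__main__":
    # Made-up options; the correct one sits at index 2 before shuffling.
    options = ["option one", "option two", "option three", "option four"]
    shuffled, idx = shuffle_options(options, answer_index=2, seed=42)
    print(shuffled, "-> correct:", shuffled[idx])
```

Using a local random.Random instance rather than the module-level functions keeps the shuffle reproducible regardless of other random calls in the same process.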

Authors

This environment was put together by:

Ratna Sagari Grandhi - (@sagarigrandhi)

Credits

Dataset:

@inproceedings{vilares-gomez-rodriguez-2019-head,
    title = "{HEAD}-{QA}: A Healthcare Dataset for Complex Reasoning",
    author = "Vilares, David  and
      G{\'o}mez-Rodr{\'i}guez, Carlos",
    booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P19-1092",
    doi = "10.18653/v1/P19-1092",
    pages = "960--966",
    abstract = "We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.",
}