HEAD-QA v2
Evaluation environment for the HEAD-QA v2 dataset.
Overview
- Environment ID:
head-qa-v2 - Short description: Single-turn medical multiple-choice QA
- Tags: medical, single-turn, multiple-choice, eval
Datasets
- Primary dataset(s): HEAD-QA v2 (HF datasets)
- Source links: alesi12/head_qa_v2
- Split sizes: Uses the provided train split for evaluation
Task
- Type: Single-turn
- Rubric overview: Binary scoring (1.0 / 0.0), based on correct answer
- Reward function:
accuracy— returns 1.0 if the predicted answer matches, else 0.0.
Quickstart
Run an evaluation with default settings:
prime eval run head-qa-v2 -m "openai/gpt-5-mini" -n 5 -s
Usage
To run an evaluation using medarc-eval:
medarc-eval head-qa-v2 -m "openai/gpt-5-mini" -n 5 -s --shuffle-answers --shuffle-seed 42
Authors
This environment has been put together by:
Ratna Sagari Grandhi - (@sagarigrandhi)
Credits
Dataset:
@inproceedings{vilares-gomez-rodriguez-2019-head,
title = "{HEAD}-{QA}: A Healthcare Dataset for Complex Reasoning",
author = "Vilares, David and
G{\'o}mez-Rodr{\'i}guez, Carlos",
booktitle = "Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/P19-1092",
doi = "10.18653/v1/P19-1092",
pages = "960--966",
abstract = "We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.",
}