Pubhealthbench RL Env (Medarc)

Evaluation environment for the Joshua-Harris/PubHealthBench public health MCQ dataset

Type: RL Env
Publisher: Medarc
License: unknown
Size: v0.1.0
Published: Feb 2026

PubHealthBench

Evaluation environment for the Joshua-Harris/PubHealthBench dataset.

Overview

  • Environment ID: pubhealthbench
  • Short description: Public health MCQ and free-form evaluation derived from UK Health Security Agency (UKHSA) guidance documents
  • Tags: medical, public-health, single-turn, multiple-choice, llm-judge, eval

Datasets

PubHealthBench contains public health questions derived from UK Health Security Agency (UKHSA) guidance documents. Questions cover topics including:

  • Gastro/food safety
  • Chemicals/toxicology
  • Vaccine-preventable diseases and immunisation
  • And more

Splits

Split           Type          Questions  Description
full            MCQ           7,929      Full test set
validation      MCQ           161        Validation set
reviewed        MCQ           760        Human-reviewed questions (default)
freeform        LLM-as-judge  760        Reviewed set with open-ended evaluation
freeform_valid  LLM-as-judge  161        Validation set with open-ended evaluation

Quickstart

Install:

vf-install pubhealthbench

Run MCQ evaluation (default: reviewed split):

prime eval run pubhealthbench -m "openai/gpt-5-mini" -n 5 -s

Use full test split:

medarc-eval pubhealthbench --split full -m "openai/gpt-5-mini" -n 10

With answer shuffling:

medarc-eval pubhealthbench --shuffle-answers -m "openai/gpt-5-mini" -n 10
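
As a rough illustration of what seeded answer shuffling involves, the sketch below permutes the option texts under fixed letter labels and remaps the correct letter. It is a minimal sketch, not the environment's actual implementation: the function name `shuffle_options` and the dict-of-letters input shape are assumptions made for this example. Only the default seed value 1618 comes from the environment arguments.

```python
import random


def shuffle_options(options: dict[str, str], answer: str, seed: int = 1618):
    """Shuffle option texts under fixed labels and remap the answer letter.

    Hypothetical helper for illustration; the real environment may differ.
    """
    labels = list(options)                      # e.g. ["A", "B", "C", "D"]
    texts = [options[label] for label in labels]

    # order[i] = index of the original option that moves to position i.
    order = list(range(len(texts)))
    random.Random(seed).shuffle(order)          # local RNG keeps it deterministic

    shuffled = {labels[i]: texts[j] for i, j in enumerate(order)}

    # Find the new position of the originally correct option.
    old_index = labels.index(answer)
    new_answer = labels[order.index(old_index)]
    return shuffled, new_answer


options = {"A": "Measles", "B": "Cholera", "C": "Rabies", "D": "Tetanus"}
shuffled, new_answer = shuffle_options(options, answer="B")
print(shuffled, new_answer)
```

Using a local `random.Random(seed)` instead of the global generator means the same seed always yields the same option order, regardless of other random calls in the process.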

Freeform (LLM-as-judge) single-judge evaluation:

medarc-eval pubhealthbench --split freeform -m "openai/gpt-5-mini" -n 10 --judge-model "openai/gpt-5-mini"

This environment supports multi-judge evaluation through the medarc-verifiers MultiJudge rubric. When multiple judge models are specified, each judge scores the response independently and the final reward is the mean of their scores. judge_model, judge_base_url, and judge_api_key each accept a list, and entries are matched by position; if only one judge_base_url or judge_api_key is provided, it is reused for every judge model.

medarc-eval pubhealthbench --split freeform -m "openai/gpt-5-mini" -n 10 --judge-model "openai/gpt-5-mini" --judge-model "google/gemini-3-flash-preview"
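
The positional matching and mean reward described above can be sketched as follows. This is an illustrative sketch only and does not reproduce the MultiJudge rubric's code: `resolve_judges` and `combine_scores` are hypothetical names, and raising `ValueError` on a length mismatch is an assumption about how such a case might be handled.

```python
from statistics import mean


def _as_list(value):
    """Wrap a single value (or None) in a list; copy an existing sequence."""
    if value is None or isinstance(value, str):
        return [value]
    return list(value)


def resolve_judges(judge_model, judge_base_url=None, judge_api_key=None):
    """Pair each judge model with a base URL and API key by position."""
    models = _as_list(judge_model)

    def expand(value, name):
        items = _as_list(value)
        if len(items) == 1:
            return items * len(models)          # one value is shared by all judges
        if len(items) != len(models):
            # Assumption: a mismatch is treated as a configuration error.
            raise ValueError(f"{name} must have 1 or {len(models)} entries")
        return items

    urls = expand(judge_base_url, "judge_base_url")
    keys = expand(judge_api_key, "judge_api_key")
    return [
        {"model": m, "base_url": u, "api_key": k}
        for m, u, k in zip(models, urls, keys)
    ]


def combine_scores(scores):
    """Final reward is the mean of the independent judge scores."""
    return mean(scores)


judges = resolve_judges(
    ["openai/gpt-5-mini", "google/gemini-3-flash-preview"],
    judge_base_url="https://example.invalid/v1",   # placeholder URL
)
print(judges)
print(combine_scores([1.0, 0.0]))
```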

Environment Arguments

Parameter        Type             Default        Description
split            str              "reviewed"     Dataset split (see table above)
shuffle_answers  bool             False          Randomize answer option order (MCQ only)
shuffle_seed     int              1618           Seed for deterministic shuffling (MCQ only)
answer_format    str              "xml"          Answer format: "xml" or "boxed" (MCQ only)
judge_model      str | list[str]  "gpt-4o-mini"  Judge model(s) for freeform evaluation
judge_base_url   str | list[str]  None           Base URL(s) for judge API
judge_api_key    str | list[str]  None           API key(s) for judge
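
To make the two answer_format values concrete, the sketch below extracts a letter choice from a model completion. The exact tag and delimiter conventions are assumptions, since this README does not spell them out: "xml" is taken to mean an `<answer>...</answer>` tag and "boxed" a LaTeX-style `\boxed{...}`. The function name `extract_choice` is hypothetical.

```python
import re

# Assumed output conventions; the environment's real parsers may differ.
PATTERNS = {
    "xml": re.compile(r"<answer>\s*\(?([A-Za-z])\)?\s*</answer>", re.S),
    "boxed": re.compile(r"\\boxed\{\s*\(?([A-Za-z])\)?\s*\}"),
}


def extract_choice(completion: str, answer_format: str = "xml"):
    """Return the last answer letter in the completion, or None if absent."""
    if answer_format not in PATTERNS:
        raise ValueError(f"unknown answer_format: {answer_format!r}")
    matches = PATTERNS[answer_format].findall(completion)
    # Take the last match so a restated final answer wins over earlier drafts.
    return matches[-1].upper() if matches else None


print(extract_choice("Reasoning... <answer>b</answer>"))
print(extract_choice(r"Therefore the answer is \boxed{C}.", "boxed"))
```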

Authors

This environment has been put together by:

Benjamin Warner - (@warner-benjamin)

Citation

Dataset:

@misc{harris2025healthyllmsbenchmarkingllm,
      title={Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information},
      author={Joshua Harris and Fan Grayson and Felix Feldman and Timothy Laurence and Toby Nonnenmacher and Oliver Higgins and Leo Loman and Selina Patel and Thomas Finnie and Samuel Collins and Michael Borowitz},
      year={2025},
      eprint={2505.06046},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.06046},
}