MedXpertQA RL Env (Medarc)

  • Type: RL Env
  • Publisher: Medarc
  • Runtime: single-turn
  • License: unknown
  • Size: v0.1.1
  • Published: Dec 2025

medxpertqa

Overview

  • Environment ID: medxpertqa
  • Short description: MedXpertQA is a highly challenging and comprehensive benchmark designed to evaluate expert-level medical knowledge and advanced reasoning capabilities. Only the text subset is currently used.
  • Tags: mcq

Datasets

  • Primary dataset(s): TsinghuaC3I/MedXpertQA
  • Source links: HuggingFace
  • Split sizes: test subset - 2.45k rows

Task

  • Type: single-turn
  • Rubric overview: Binary scoring (1.0 / 0.0) based on correct letter or answer text match
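
The binary rubric above can be illustrated with a minimal sketch. This is not the environment's actual scoring code: the function names `normalize` and `binary_reward` and the exact normalization rules (lowercasing, whitespace collapsing, stripping wrappers such as "(B)") are assumptions made for illustration.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace so formatting does not affect matching."""
    return re.sub(r"\s+", " ", text.strip().lower())


def binary_reward(prediction: str, correct_letter: str, correct_text: str) -> float:
    """Return 1.0 if the prediction matches the correct letter or answer text, else 0.0.

    Hypothetical helper: the real rubric may normalize differently.
    """
    pred = normalize(prediction)
    # Accept a bare letter, optionally wrapped like "(B)" or "B."
    letter = pred.strip("().: ")
    if letter == correct_letter.lower():
        return 1.0
    # Otherwise fall back to matching the full answer text
    if pred == normalize(correct_text):
        return 1.0
    return 0.0
```

For example, `binary_reward("(b)", "B", "Aortic dissection")` and `binary_reward("aortic  dissection", "B", "Aortic dissection")` both score 1.0 under these assumptions, while `binary_reward("C", "B", "Aortic dissection")` scores 0.0.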

Quickstart

Run an evaluation with default settings:

prime eval run medxpertqa -m "openai/gpt-5-mini" -n 5 -s

Configure model and sampling:

medarc-eval medxpertqa -m "openai/gpt-5-mini" -n 20 --answer-format boxed

Notes:

  • Pass environment arguments to medarc-eval directly as flags (for example, --answer-format boxed, as in the command above).

Environment Arguments

| Arg | Type | Default | Description |
| --- | --- | --- | --- |
| question_type | str | "all" | Question subset to evaluate (e.g., all, text-only subset variants supported by the environment). |
| use_think | bool | False | Whether to expect reasoning in <think>...</think> with boxed answers. |
| shuffle_answers | bool | False | Whether to shuffle answer options per question. |
| shuffle_seed | int \| None | 1618 | Seed for deterministic answer shuffling. |
| answer_format | str | "xml" | Output format parser to use (xml or boxed). |
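
To make the answer_format, shuffle_answers, and shuffle_seed arguments more concrete, here is a rough sketch of what they might do. It is not the environment's implementation: the `<answer>` tag name, the helper names `extract_answer` and `shuffle_options`, and the choice to reuse one seed per call are all assumptions.

```python
import random
import re
from typing import Optional


def extract_answer(completion: str, answer_format: str = "xml") -> Optional[str]:
    """Pull the final answer out of a completion in either supported format.

    Assumes xml answers look like <answer>B</answer> (the tag name is a guess)
    and boxed answers look like \\boxed{B}.
    """
    if answer_format == "xml":
        matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    elif answer_format == "boxed":
        matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    else:
        raise ValueError(f"unsupported answer_format: {answer_format!r}")
    # Use the last match so earlier drafts in the reasoning are ignored
    return matches[-1].strip() if matches else None


def shuffle_options(options: dict, answer: str, seed: Optional[int] = 1618):
    """Reassign option texts to the same letters in a seeded random order.

    Returns the shuffled options and the letter that now holds the correct text.
    """
    letters = list(options)
    order = list(range(len(letters)))
    random.Random(seed).shuffle(order)
    # Position i receives the text that was originally at position order[i]
    shuffled = {letters[i]: options[letters[j]] for i, j in enumerate(order)}
    new_answer = letters[order.index(letters.index(answer))]
    return shuffled, new_answer
```

With a fixed seed, repeated calls on the same question yield the same permutation, which is what makes shuffled runs reproducible. Passing `seed=None` in this sketch falls back to a nondeterministic shuffle.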

Metrics

| Metric | Meaning |
| --- | --- |
| reward | Main scalar reward (weighted sum of criteria) |
| accuracy | Exact match on target answer |
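
As a small illustration of how these two metrics could relate, the sketch below computes a weighted sum over per-criterion scores and averages it across a batch. The criterion name "accuracy" with weight 1.0 and the helpers `weighted_reward` and `summarize` are assumptions, not values taken from the environment.

```python
from statistics import mean


def weighted_reward(scores: dict, weights: dict) -> float:
    """Combine per-criterion scores into one scalar as a weighted sum."""
    return sum(weights[name] * scores[name] for name in weights)


def summarize(per_example_scores: list, weights: dict) -> dict:
    """Average reward and accuracy over a batch of scored examples."""
    if not per_example_scores:
        raise ValueError("no examples to summarize")
    return {
        "reward": mean(weighted_reward(s, weights) for s in per_example_scores),
        "accuracy": mean(s["accuracy"] for s in per_example_scores),
    }


# Hypothetical setup: a single exact-match criterion with weight 1.0
weights = {"accuracy": 1.0}
batch = [{"accuracy": 1.0}, {"accuracy": 0.0}, {"accuracy": 1.0}, {"accuracy": 1.0}]
summary = summarize(batch, weights)
```

Under this single-criterion assumption the two metrics coincide; they would diverge only if the rubric weighted additional criteria.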