
Supergpqa Medicine RL Env (Medarc)

Single-turn medicine MCQ

Type: RL Env
Publisher: Medarc
Runtime: single-turn
License: unknown
Size: v0.1.0
Published: Feb 2026

SuperGPQA Medicine

Overview

  • Environment ID: supergpqa_medicine
  • Short description: Filtered medicine split from SuperGPQA
  • Tags: medicine, single-turn, multiple-choice, test, evaluation, supergpqa

Datasets

  • Primary dataset: m-a-p/SuperGPQA (train split, Medicine discipline only)

  • Source links: Paper, GitHub, HF Dataset

  • Split sizes:

    | Split (by difficulty) | Choices | Count |
    |-----------------------|---------|-------|
    | all                   | A-J     | 2755  |
    | easy                  | A-J     | 909   |
    | middle                | A-J     | 1629  |
    | hard                  | A-J     | 217   |
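
The discipline filter and the difficulty splits can be pictured with a small sketch. The row layout and the key names `discipline` and `difficulty` are assumptions made for illustration, as are the helper names; this is not the environment's actual loading code.

```python
from collections import Counter

# Hypothetical rows; the keys "discipline" and "difficulty" are assumed here.
rows = [
    {"discipline": "Medicine", "difficulty": "easy"},
    {"discipline": "Medicine", "difficulty": "hard"},
    {"discipline": "Engineering", "difficulty": "middle"},
    {"discipline": "Medicine", "difficulty": "middle"},
]


def filter_rows(rows, discipline="Medicine", difficulty="all"):
    """Keep one discipline, optionally restricted to a single difficulty."""
    kept = [r for r in rows if r["discipline"] == discipline]
    if difficulty != "all":
        kept = [r for r in kept if r["difficulty"] == difficulty]
    return kept


def count_by_difficulty(rows):
    """Count rows per difficulty label."""
    return Counter(r["difficulty"] for r in rows)


medicine = filter_rows(rows)
print(count_by_difficulty(medicine))
```

The per-difficulty counts listed above (909 easy, 1629 middle, 217 hard) sum to the 2755 rows of the "all" split.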

Task

  • Type: single-turn
  • Rubric overview: Binary scoring on whether the boxed letter choice is correct, with optional think-tag formatting

Quickstart

Run an evaluation with default settings:

prime eval run supergpqa_medicine -m "openai/gpt-5-mini" -n 5 -s

Enable few-shot prompting and filter to a field/difficulty using medarc-eval:

medarc-eval supergpqa_medicine -m "openai/gpt-5-mini" -n -1 --few-shot --field clinical_medicine --difficulty hard

Notes:

  • Pass environment arguments directly as flags with medarc-eval (for example, --field clinical_medicine or --difficulty hard).
  • The dataset has a validation split with 3 rows, but these rows are used as few-shot examples, following the official MMLU-Pro eval code.
  • Set few_shot=true to include the fixed five-shot examples from the official setup.
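
As a rough illustration of what the few_shot option changes, the sketch below prepends worked examples to the target question. The prompt layout, the `Question:` and `Answer:` labels, and the helper names `format_question` and `build_prompt` are all hypothetical; the environment's real template may differ.

```python
import string


def format_question(question, options, answer=None):
    """Render one MCQ with lettered options; append a boxed answer if given."""
    letters = string.ascii_uppercase
    lines = [f"Question: {question}"]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    if answer is not None:
        lines.append(f"Answer: \\boxed{{{answer}}}")
    return "\n".join(lines)


def build_prompt(target, examples=(), few_shot=False):
    """Prepend solved examples only when few_shot is enabled (assumed layout)."""
    blocks = [format_question(**ex) for ex in examples] if few_shot else []
    blocks.append(format_question(target["question"], target["options"]))
    return "\n\n".join(blocks)


example = {"question": "Q1?", "options": ["x", "y"], "answer": "B"}
target = {"question": "Q2?", "options": ["p", "q", "r"]}
print(build_prompt(target, [example], few_shot=True))
```

With few_shot left at False, the same call returns only the target question block.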

Environment Arguments

| Arg | Type / Choices | Default | Description |
|-----|----------------|---------|-------------|
| field | "all" or one of basic_medicine, clinical_medicine, pharmacy, public_health_and_preventive_medicine, stomatology, traditional_chinese_medicine | all | Filter by medical field. |
| difficulty | "all", "easy", "middle", "hard" | all | Filter by question difficulty. |
| few_shot | bool | False | Include fixed five-shot examples in prompts when True. |
| shuffle_answers | bool | False | Shuffle answer choices per row. |
| shuffle_seed | int or null | 1618 | Seed for deterministic shuffling when enabled. |
| jitter_age | bool | False | Add small decimal jitter (~±2 weeks) to age mentions. |
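
To show how shuffle_answers and shuffle_seed could interact, here is a minimal sketch of seeded per-row shuffling that also remaps the correct letter. Deriving the per-row generator from the seed plus a row index is an assumption, and `shuffle_options` is a hypothetical helper, not the environment's implementation.

```python
import random

LETTERS = "ABCDEFGHIJ"


def shuffle_options(options, answer_letter, seed=1618, row_index=0):
    """Shuffle options deterministically and return the remapped answer letter."""
    # Assumption: one RNG per row, keyed by seed and row index, so each row
    # gets a different but reproducible permutation.
    rng = random.Random(f"{seed}-{row_index}")
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # Find where the originally correct option landed after the shuffle.
    new_position = order.index(LETTERS.index(answer_letter))
    return shuffled, LETTERS[new_position]


options = ["aspirin", "ibuprofen", "paracetamol", "naproxen"]
shuffled, letter = shuffle_options(options, "C")
print(shuffled, letter)
```

Calling the function twice with the same seed and row index yields the same order, which is what makes a shuffled evaluation reproducible.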

Metrics

| Metric | Meaning |
|--------|---------|
| accuracy | (weight 1.0): 1.0 if parsed letter is correct, else 0.0 |
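
The binary scoring can be sketched as follows, assuming answers arrive as `\boxed{X}` with X in A-J and that any `<think>...</think>` span is ignored before parsing. The regular expressions and the function names are illustrative assumptions, not the environment's parser.

```python
import re

# Assumed formats: optional <think>...</think> reasoning, then \boxed{LETTER}.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{\s*([A-J])\s*\}")


def parse_boxed_letter(completion):
    """Return the last boxed letter outside any think block, or None."""
    visible = THINK_RE.sub("", completion)
    matches = BOXED_RE.findall(visible)
    return matches[-1] if matches else None


def accuracy(completion, answer):
    """Binary reward: 1.0 for a correct parsed letter, otherwise 0.0."""
    return 1.0 if parse_boxed_letter(completion) == answer else 0.0


reply = "<think>Could be \\boxed{A}?</think> Final answer: \\boxed{C}"
print(parse_boxed_letter(reply), accuracy(reply, "C"))
```

Taking the last boxed match is a design choice in this sketch: it favors the final answer when a response boxes more than one letter.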