Medbullets RL Env (Medarc)

Single-turn medical MCQ

Type: RL Env
Publisher: Medarc
Runtime: single-turn
License: unknown
Size: v0.1.1
Published: Dec 2025

medbullets

Overview

  • Environment ID: medbullets
  • Short description: USMLE-style multiple-choice questions from Medbullets.
  • Tags: medical, clinical, single-turn, multiple-choice, USMLE, train, evaluation

Datasets

  • Primary dataset(s): Medbullets-4 and Medbullets-5

  • Source links: Paper, Github, HF Dataset

  • Split sizes:

    Split      Choices            Count
    op4_test   {A, B, C, D}       308
    op5_test   {A, B, C, D, E}    308

    op5_test contains the same content as op4_test, but with one additional answer choice to increase difficulty. Note that while the content is the same, the letter choice corresponding to the correct answer is sometimes different between these splits.

Task

  • Type: single-turn
  • Rubric overview: Binary scoring on whether the parsed letter choice is correct (boxed by default); <think>...</think> formatting is optionally checked but carries zero weight.
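
To make the binary scoring concrete, here is a minimal sketch of how a boxed letter could be parsed and compared with the gold answer. The helper names (extract_boxed_letter, score) and the regular expression are illustrative assumptions, not the environment's actual parser, which may handle more edge cases.

```python
import re
from typing import Optional

# Hypothetical pattern: matches \boxed{X} where X is a single letter A-E.
BOXED_RE = re.compile(r"\\boxed\{\s*([A-Ea-e])\s*\}")


def extract_boxed_letter(completion: str) -> Optional[str]:
    """Return the last boxed letter in the completion (uppercased), or None."""
    matches = BOXED_RE.findall(completion)
    return matches[-1].upper() if matches else None


def score(completion: str, answer: str) -> float:
    """Binary reward: 1.0 if the parsed letter equals the gold letter, else 0.0."""
    return 1.0 if extract_boxed_letter(completion) == answer.upper() else 0.0
```

Taking the last boxed occurrence is a design choice in this sketch: it rewards the final answer when a model boxes an intermediate guess first.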

Quickstart

Run an evaluation with default settings:

prime eval run medbullets -m "openai/gpt-5-mini" -n 5 -s

Configure model and sampling:

medarc-eval medbullets -m "openai/gpt-5-mini" -n -1 --num-options 5 --shuffle-answers --shuffle-seed 1618 --answer-format boxed

Notes:

  • With medarc-eval, pass environment arguments directly as flags (for example, --num-options 5 or --answer-format boxed).

Environment Arguments

The environment supports the following arguments:

Arg               Type         Default   Description
num_options       int          4         Number of options: 4 → {A, B, C, D}; 5 → {A, B, C, D, E}
use_think         bool         False     Whether to check for <think>...</think> formatting with ThinkParser
shuffle_answers   bool         False     Whether to shuffle answer choices
shuffle_seed      int | None   1618      Seed for deterministic answer shuffling
answer_format     str          "boxed"   Output parser format: "boxed" or "xml"
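
As an illustration of what shuffle_answers and shuffle_seed imply, the sketch below shuffles a question's options with a seeded generator and remaps the correct letter. The function name shuffle_options and the dict-based option layout are assumptions made for this example; the environment's own implementation may differ, for instance in how it derives per-question randomness.

```python
import random
from string import ascii_uppercase
from typing import Dict, Optional, Tuple


def shuffle_options(
    options: Dict[str, str], answer: str, seed: Optional[int] = 1618
) -> Tuple[Dict[str, str], str]:
    """Shuffle option texts deterministically and return the new gold letter."""
    letters = list(options)            # original letters, in order
    order = list(range(len(letters)))  # indices into the original options
    random.Random(seed).shuffle(order)

    new_letters = ascii_uppercase[: len(letters)]
    # Position i of the shuffled question holds original option order[i].
    shuffled = {new_letters[i]: options[letters[j]] for i, j in enumerate(order)}
    # Find where the original correct option ended up.
    new_answer = new_letters[order.index(letters.index(answer))]
    return shuffled, new_answer


question = {"A": "Aspirin", "B": "Heparin", "C": "Warfarin", "D": "Insulin"}
shuffled, gold = shuffle_options(question, answer="C", seed=1618)
```

Shuffling indices rather than texts keeps the remapping correct even if two options happen to share the same text.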

Metrics

The rubric emits the following metrics:

Metric                            Meaning
correct_answer_reward_func        (weight 1.0): 1.0 if parsed letter is correct, else 0.0
parser.get_format_reward_func()   (weight 0.0): optional format adherence (not counted)
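
The weights above mean the format metric is reported but does not change the final score. A minimal sketch, assuming the rubric combines metrics as a weighted sum; the function total_reward and the key "format_reward_func" are placeholder names for illustration only.

```python
from typing import Dict

# Weights as listed in the metrics table; the format key is a placeholder label.
WEIGHTS: Dict[str, float] = {
    "correct_answer_reward_func": 1.0,
    "format_reward_func": 0.0,
}


def total_reward(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of per-metric scores; zero-weight metrics do not contribute."""
    return sum(weight * scores.get(name, 0.0) for name, weight in weights.items())


# Correct answer with poor formatting still earns full reward under these weights.
reward = total_reward({"correct_answer_reward_func": 1.0, "format_reward_func": 0.0})
```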