Medcalc Bench RL Env (Medarc)

MedCalc-Bench clinical calculator evaluation

Type: RL Env
Publisher: Medarc
Runtime: multi-turn
License: unknown
Size: v0.1.1
Published: Dec 2025

MedCalc-Bench

Overview

  • Environment ID: medcalc-bench
  • Short description: Evaluate clinical calculator reasoning and numeric/date outputs. Optionally equips the model with a Python execution tool or a calculator tool.
  • Tags: medical, clinical, single-turn, numeric, date, evaluation

Dataset

Two dataset variants are available:

  • verified (default): nsk7153/MedCalc-Bench-Verified
  • v1.2: ncbi/MedCalc-Bench-v1.2

| Split | v1.2   | verified |
|-------|--------|----------|
| train | 10,543 | 10,538   |
| test  | 1,100  | 1,100    |

Each example includes a Patient Note, Question, Calculator ID, Ground Truth, Lower Bound, and Upper Bound.

Task

  • Type: single-turn, multi-turn with tool use
  • Prompt: _build_prompt(patient_note, question) instructs the model to put its reasoning in <think>...</think> and its final answer in <answer>...</answer>.
  • Rubric: check_correctness validates by calculator type:
    • IDs 13, 68: date equality (MM/DD/YYYY)
    • ID 69: tuple (weeks, days) equality
    • Integer IDs: integer equality (with rounding as needed)
    • Decimal IDs: numeric value within [lower_bound, upper_bound]
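
The rubric rules above can be sketched in a few lines of Python. This is a minimal illustration, not the environment's actual check_correctness: the function name, its signature, the integer_ids parameter, and the weeks/days parsing are all assumptions, since this page does not list which calculator IDs count as integer-valued or how tuple answers are serialized.

```python
import re
from datetime import datetime

# Calculator IDs taken from the rubric above.
DATE_IDS = {13, 68}
WEEKS_DAYS_ID = 69


def _parse_weeks_days(text):
    """Pull exactly two integers (weeks, days) out of text such as '(4, 2)'."""
    numbers = re.findall(r"-?\d+", str(text))
    if len(numbers) != 2:
        raise ValueError(f"expected two integers, got {text!r}")
    return int(numbers[0]), int(numbers[1])


def check_correctness_sketch(answer, ground_truth, calc_id, lower, upper,
                             integer_ids=frozenset()):
    """Return True if `answer` is correct under the rule for `calc_id`.

    `integer_ids` is a hypothetical parameter: the caller supplies the set of
    calculator IDs whose answers are compared as integers.
    """
    try:
        if calc_id in DATE_IDS:
            fmt = "%m/%d/%Y"  # MM/DD/YYYY
            return (datetime.strptime(str(answer).strip(), fmt)
                    == datetime.strptime(str(ground_truth).strip(), fmt))
        if calc_id == WEEKS_DAYS_ID:
            return _parse_weeks_days(answer) == _parse_weeks_days(ground_truth)
        value = float(answer)
        if calc_id in integer_ids:
            # Integer equality after rounding both sides.
            return round(value) == round(float(ground_truth))
        # Decimal calculators: accept anything inside the tolerance window.
        return float(lower) <= value <= float(upper)
    except (ValueError, TypeError):
        # Unparseable answers are scored as incorrect.
        return False
```

Note that Python's round uses round-half-to-even, so an answer that lands exactly on .5 may round differently than the real rubric does.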

Quickstart

Run an evaluation with default settings:

prime eval run medcalc-bench -m "openai/gpt-5-mini" -n 5 -s

Configure model and sampling:

medarc-eval medcalc-bench -m "openai/gpt-5-mini" -n 5 -s --one-shot --add-python-tool

Notes:

  • Pass environment arguments directly as flags to medarc-eval (for example, --one-shot or --add-python-tool).
  • Setting use_think to True works best with one_shot set to True, because the one-shot example shows the model exactly how to format its answer.
  • The packaged medarc_verifiers XMLParser suppresses the upstream warning about <think> and still parses <answer> even if <think> is malformed.
  • Tool safety: The Python tool uses RestrictedPython for sandboxed execution with limited builtins (only math, numpy, scipy imports allowed). The calculator tool uses simpleeval with only safe math operations.
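
To illustrate the lenient parsing described in the notes above, the following sketch extracts the final answer even when the <think> block is missing or unclosed. It is not the medarc_verifiers XMLParser; extract_answer is a hypothetical helper that uses only the standard library and assumes the last <answer> block is the one to score.

```python
import re

# Non-greedy match so several <answer> blocks are captured separately.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_answer(completion):
    """Return the text of the last <answer>...</answer> block, or None.

    The <think> tags are never inspected, so a malformed or unclosed
    <think> block does not prevent the answer from being found.
    """
    matches = _ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None
```

For example, extract_answer("<think>unclosed reasoning <answer> 12.5 </answer>") returns "12.5", and a completion with no <answer> tags returns None.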

Environment Arguments

| Arg | Type | Default | Description |
|-----|------|---------|-------------|
| one_shot | bool | False | Whether to use the one-shot prompt |
| add_python_tool | bool | False | Add the Python code execution tool (uses restricted Python with limited builtins) |
| add_calculator_tool | bool | False | Add the calculator tool (uses simple eval with safe math operations) |
| max_turns | int | 20 | Maximum number of turns in tool use environment |
| version | str | "verified" | Dataset variant: "verified" (default) or "1.2" |
| answer_format | str | "xml" | Answer format: "xml" (default) or "boxed" |
| use_think | bool | False | Whether to instruct <think>...</think> formatting |
| system_prompt | str | None | Custom system prompt (defaults to standard XML/BOXED prompt based on answer_format) |
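
To give a sense of what "safe math operations" can mean for a tool like the one enabled by add_calculator_tool, here is a standard-library stand-in built on ast. It is not simpleeval and not the environment's implementation; safe_calculate and its operator whitelist are assumptions chosen for illustration.

```python
import ast
import operator

# Whitelisted operators; any other syntax node is rejected.
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
    ast.Pow: operator.pow,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression):
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access, and so on all end up here.
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return _eval(ast.parse(expression, mode="eval").body)
```

An expression such as "(140 - 60) * 70 / (72 * 1.2)" evaluates normally, while anything containing a name or function call raises ValueError. The sketch puts no cap on exponent size, which a production tool would likely need.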

Metrics

| Metric | Meaning |
|--------|---------|
| check_correctness | Weight 1.0; validates numeric, date, and tuple answers per calculator ID |

Adjustments

Adjusted the prompt so the model outputs its step-by-step thinking and final answer in <think> and <answer> tags instead of responding with JSON.

References

@misc{khandekar2024medcalcbench,
      title={MedCalc-Bench: Evaluating Large Language Models for Medical Calculations},
      author={Nikhil Khandekar and Qiao Jin and Guangzhi Xiong and Soren Dunn and Serina S Applebaum and Zain Anwar and Maame Sarfo-Gyamfi and Conrad W Safranek and Abid A Anwar and Andrew Zhang and Aidan Gilson and Maxwell B Singer and Amisha Dave and Andrew Taylor and Aidong Zhang and Qingyu Chen and Zhiyong Lu},
      year={2024},
      eprint={2406.12036},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.12036},
}