0

Medqa Followup

Multi-turn robustness evaluation for medical LLMs - tests whether models maintain correct answers when challenged with follow-up interventions

Domain
rl-env
License
unknown
Published
Dec 2025

Cite

Notes

Only stored in your browser.

Attribution

Leaderboard scores
prime-hub
Attribution policy →

Top score 10.8% by GPT-4o - 2 models reporting (2 frontier)

Score history

2
0%25%50%75%100%Nov 24Jan 25Mar 25May 25GPT-4o

Top models

2
Medqa FollowupBar chart with 2 bars. Highest value: GPT-4o at 10.8.
2 models

Related tools

1
View all

Implementations, trainers, datasets and scaffolds linked to this eval.

FAQ

What is Medqa Followup?
Multi-turn robustness evaluation for medical LLMs - tests whether models maintain correct answers when challenged with follow-up interventions
What is the current top score on Medqa Followup?
The top reported score is 10.8% by GPT-4o, across 2 models reporting (2 from frontier labs).
How can a model improve its Medqa Followup score?
Tools linked to Medqa Followup on Sophon include Medqa Followup RL Env (Community) - RL environments, datasets, and scaffolds that target this eval.
What license is Medqa Followup under?
Medqa Followup is available under unknown.