Question 1

What is HealthBench: Evaluating Large Language Models Towards Improved Human Health?

Accepted Answer

HealthBench is a comprehensive evaluation benchmark that assesses language models' medical capabilities across a wide range of healthcare scenarios.

Question 2

What is the current top score on HealthBench: Evaluating Large Language Models Towards Improved Human Health?

Accepted Answer

The top reported score is 66.0%, achieved by Claude Fable 5. Four models have reported scores, all four from frontier labs.

Question 3

How can a model improve its HealthBench: Evaluating Large Language Models Towards Improved Human Health score?

Accepted Answer

Sophon lists RL environments, datasets, and scaffolds that target this eval. The tools linked to HealthBench: Evaluating Large Language Models Towards Improved Human Health are Healthbench RL Env (Medarc) and VF Openbench RL Env (Community).

Question 4

What license is HealthBench: Evaluating Large Language Models Towards Improved Human Health under?

Accepted Answer

HealthBench: Evaluating Large Language Models Towards Improved Human Health is available under the MIT license.
