MedRepBench: A Comprehensive Benchmark for Medical Report Interpretation

Year: 2025

arXiv: arxiv.org/abs/2508.16674

Abstract

Medical report understanding from real-world document images is essential for generating patient-facing explanations and enabling structured information exchange in clinical systems. Existing VLMs and LLMs have shown strong performance on document understanding, but structured understanding of medical reports remains insufficiently benchmarked. Therefore, we introduce MedRepBench, a benchmark with 1,925 de-identified Chinese medical report images spanning diverse departments, patient demographics, and acquisition formats. In MedRepBench, we focus on report-grounded interpretation rather than on evaluating diagnostic reasoning, treatment recommendation, or the integration of patient history. Interpretation is defined as structured extraction of report fields (e.g., item, value, unit, reference range, abnormal flag) plus a patient-facing explanation grounded strictly in the report content. The benchmark primarily evaluates end-to-end VLMs, and also includes a controlled text-only setting (high-quality OCR + LLM) to approximate an upper bound when character recognition errors are minimized. Our evaluation framework provides two complementary protocols: (1) an objective protocol measuring field-level recall of structured items, and (2) an automated subjective protocol that uses an LLM-based judge to score factuality, interpretability, and reasoning quality under a fixed prompt. Using the objective metric as a reward signal, we also provide a lightweight GRPO-based alignment baseline for a mid-sized VLM, which improves field-level recall by up to 6%. Finally, we analyze practical limitations of OCR+LLM pipelines, including layout-related errors and additional system latency, which underscores the need for robust end-to-end vision-based medical report understanding. The dataset and evaluation resources are publicly available at https://huggingface.co/datasets/MedRepBench/MedRepBench.
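
To make the objective protocol more concrete, the following is a minimal sketch of how field-level recall over structured items might be computed. The abstract does not specify the matching rules, so everything here is an assumption for illustration: the `ReportItem` record (its field names mirror the examples listed in the abstract), the whitespace- and case-insensitive normalization in `_norm`, pairing predictions to ground truth by item name, and exact string matching per field. It should not be read as the benchmark's official scorer.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ReportItem:
    """Hypothetical record for one report row; field names follow the abstract's examples."""
    item: str
    value: Optional[str] = None
    unit: Optional[str] = None
    reference_range: Optional[str] = None
    abnormal_flag: Optional[str] = None


def _norm(text: Optional[str]) -> str:
    """Assumed normalization: drop all whitespace and lowercase; None becomes ''."""
    return "".join(text.split()).lower() if text else ""


def field_level_recall(predicted: List[ReportItem], gold: List[ReportItem]) -> float:
    """Fraction of non-empty gold fields reproduced exactly (after normalization)."""
    # Pair predictions with gold rows by normalized item name (an assumption).
    pred_by_name = {_norm(p.item): p for p in predicted}
    total = 0
    hits = 0
    for g in gold:
        p = pred_by_name.get(_norm(g.item))
        for f in fields(ReportItem):
            gold_value = _norm(getattr(g, f.name))
            if not gold_value:
                continue  # fields absent from the gold annotation are not counted
            total += 1
            if p is not None and _norm(getattr(p, f.name)) == gold_value:
                hits += 1
    return hits / total if total else 0.0


# Illustrative, made-up rows (not taken from the dataset).
gold = [
    ReportItem("WBC", "11.2", "10^9/L", "4.0-10.0", "H"),
    ReportItem("HGB", "135", "g/L", "130-175"),
]
predicted = [
    ReportItem("wbc", "11.2", "10^9/L", "4.0-10.0"),  # abnormal flag missed
    ReportItem("HGB", "135", "g/L", "130-175"),
]
print(field_level_recall(predicted, gold))
```

In this toy case the gold annotation has nine non-empty fields and the prediction misses only the abnormal flag, so the score is 8/9. A scalar of this kind is also the sort of quantity that could plausibly serve as the per-sample reward in the GRPO-based alignment baseline, although the exact reward definition is not given in the abstract.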