Scientific papers use schematic diagrams to communicate methods, workflows, and system structure, yet existing scientific-figure corpora often mix them with plots, screenshots, and photographs and rarely preserve document context. We introduce DiagramBank, a quality-audited dataset of 57,100 schematic diagrams curated from OpenReview-hosted AI/ML venues. Each record links a diagram image to its paper title, abstract, figure caption, in-text figure-reference spans, venue/year metadata, provenance fields, and filtering labels. DiagramBank is a reusable resource for scientific-document understanding, diagram retrieval, corpus analysis, and future benchmark construction. We describe its extraction and cascade-filtering pipeline, release schema, confidence-controlled views, dataset card, and indexing utilities. A manual blind audit of the released cascade-filtered records estimates 93.67% precision, and a separate CLIP threshold analysis characterizes the precision--coverage trade-off for simpler filtering views. We further provide lightweight metadata-indexing and authoring examples to illustrate downstream protocols without treating these utilities as standalone methods. The code is public at: https://github.com/csml-rpi/DiagramBank.
DiagramBank: A Quality-Audited Dataset of Scientific Schematic Diagrams with Multi-Level Document Context
Scientific papers use schematic diagrams to communicate methods, workflows, and system structure, yet existing scientific-figure corpora often mix them with plots, screenshots, and photographs and rarely preserve document context.
- Year
- 2026
- Venue
- arXiv 2026
- Authors
- 4
- Hosting
- Abstract onlyARXIV-DEFAULT
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2604.20857ARXIV-DEFAULT
- TL;DR
- Semantic Scholar