InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

InstructDr, a novel instruction-based document reading and understanding model, leverages document images and large language models with a trainable bridging module to achieve high generalization across diverse VDU tasks and datasets.

Year: 2024
Venue: arXiv 2024
Authors: 5

Attribution

Abstract & full text: arxiv.org/abs/2401.13313
TL;DR: Semantic Scholar

Abstract

We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and large language models (LLMs) through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.
