0

LANISTR: Multimodal Learning from Structured and Unstructured Data

A new multimodal pretraining framework, LANISTR, improves performance on a combination of structured and unstructured data using an attention-based approach and a similarity-based multimodal masking loss.

Year
2023
Venue
arXiv 2023
Authors
4
Hosting
Abstract onlyARXIV-DEFAULT

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2305.16556v3ARXIV-DEFAULT
TL;DR
Semantic Scholar
Attribution policy →

Abstract

Multimodal large-scale pretraining has shown impressive performance for unstructured data such as language and image. However, a prevalent real-world scenario involves structured data types, tabular and time-series, along with unstructured data. Such scenarios have been understudied. To bridge this gap, we propose LANISTR, an attention-based framework to learn from LANguage, Image, and STRuctured data. The core of LANISTR's methodology is rooted in \textit{masking-based} training applied across both unimodal and multimodal levels. In particular, we introduce a new similarity-based multimodal masking loss that enables it to learn cross-modal relations from large-scale multimodal data with missing modalities. On two real-world datasets, MIMIC-IV (from healthcare) and Amazon Product Review (from retail), LANISTR demonstrates remarkable improvements, 6.6% (in AUROC) and 14% (in accuracy) when fine-tuned with 0.1% and 0.01% of labeled data, respectively, compared to the state-of-the-art alternatives. Notably, these improvements are observed even with very high ratio of samples (35.7% and 99.8% respectively) not containing all modalities, underlining the robustness of LANISTR to practical missing modality challenge. Our code and models will be available at https://github.com/google-research/lanistr

Authors

4