How2: A Large-scale Dataset for Multimodal Language Understanding

A multimodal instructional dataset with English subtitles and Portuguese translations is introduced, along with sequence-to-sequence baselines for various tasks, promoting research on multimodality in language processing.

Year
2018
Venue
arXiv 2018
Authors
7

Abstract & full text
arxiv.org/abs/1811.00347v2

Abstract

In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization. By making available data and code for several multimodal natural language tasks, we hope to stimulate more research on these and similar challenges, to obtain a deeper understanding of multimodality in language processing.
