How2: A Large-scale Dataset for Multimodal Language Understanding

A multimodal instructional dataset with English subtitles and Portuguese translations is introduced, along with sequence-to-sequence baselines for various tasks, promoting research on multimodality in language processing.

Year: 2018
Venue: arXiv 2018
ArXiv: arxiv.org/abs/1811.00347
Authors: 7

Attribution

Abstract & full text: arxiv.org/abs/1811.00347v2
TL;DR: Semantic Scholar

Abstract

In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization. By making available data and code for several multimodal natural language tasks, we hope to stimulate more research on these and similar challenges, to obtain a deeper understanding of multimodality in language processing.

Authors

Loïc Barrault, Desmond Elliott, Florian Metze, Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Lucia Specia