We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the data requires compositional joint reasoning, including about quantities, comparisons, and relations. Evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge.
A Corpus for Reasoning About Natural Language Grounded in Photographs
A new dataset for joint reasoning about natural language and images, emphasizing semantic diversity, compositionality, and visual reasoning challenges.
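The task described in the abstract (deciding whether a sentence is true of a pair of photographs) amounts to binary classification over sentence and image-pair inputs. The following is a minimal sketch of that setup, not the authors' code or data format: the `Example` record, its field names, the file names, the toy sentences, and the majority-class baseline are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    """One task instance (hypothetical layout, not the released format)."""
    sentence: str      # English caption to verify
    left_image: str    # identifier or path of the first photograph
    right_image: str   # identifier or path of the second photograph
    label: bool        # True if the sentence holds for the image pair


def accuracy(predict: Callable[[Example], bool],
             examples: Sequence[Example]) -> float:
    """Fraction of examples whose predicted truth value matches the label."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(predict(ex) == ex.label for ex in examples)
    return correct / len(examples)


def majority_baseline(train: Sequence[Example]) -> Callable[[Example], bool]:
    """Return a predictor that ignores its input and always outputs the
    more frequent label in `train` (ties resolve to True)."""
    true_count = sum(ex.label for ex in train)
    majority = true_count * 2 >= len(train)
    return lambda ex: majority


# Invented toy examples; they are NOT taken from the dataset.
toy = [
    Example("Both images show exactly two dogs.", "a_left.jpg", "a_right.jpg", True),
    Example("One image has more bottles than the other.", "b_left.jpg", "b_right.jpg", False),
    Example("There is a dog on a sofa in the left image.", "c_left.jpg", "c_right.jpg", True),
    Example("At least one image shows a bird in flight.", "d_left.jpg", "d_right.jpg", True),
]

baseline = majority_baseline(toy)
print(accuracy(baseline, toy))
```

A real model would replace `baseline` with a predictor that actually inspects the sentence and both images; the majority-class predictor is only a floor against which such a model could be compared.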
- Year
- 2018
- Venue
- a-corpus-for-reasoning-about-natural-language-1
- Authors
- 6
- Hosting
- Abstract only
Attribution
- Abstract & full text
- arxiv.org/abs/1811.00491v3
- TL;DR
- Semantic Scholar