We introduce The Benchmark of Linguistic Minimal Pairs (shortened to BLiMP), a challenge set for evaluating what language models (LMs) know about major grammatical phenomena in English. BLiMP consists of 67 sub-datasets, each containing 1000 minimal pairs isolating specific contrasts in syntax, morphology, or semantics. The data is automatically generated according to expert-crafted grammars, and aggregate human agreement with the labels is 96.4%. We use it to evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs. We find that state-of-the-art models identify morphological contrasts reliably, but they struggle with semantic restrictions on the distribution of quantifiers and negative polarity items, as well as with subtle syntactic phenomena such as extraction islands.
BLiMP: The Benchmark of Linguistic Minimal Pairs for English
BLiMP evaluates n-gram, LSTM, and Transformer language models on English grammatical phenomena using minimal pairs, showing that state-of-the-art models identify morphological contrasts reliably but struggle with semantic restrictions and subtle syntactic phenomena such as extraction islands.
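To make the idea of minimal-pair evaluation concrete, here is a minimal sketch in Python. It assumes the common scoring convention for minimal pairs, namely that a model counts as correct on a pair when it assigns a higher probability to the acceptable sentence than to the unacceptable one; the abstract above does not spell out the scoring rule, so treat this as an assumption. The tiny add-one smoothed bigram model, the toy corpus, the example pairs, and all function names are hypothetical and are not taken from BLiMP or its released code.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Collect context and bigram counts from whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens[1:])          # words that can be predicted
        contexts.update(tokens[:-1])      # words that appear as a left context
        bigrams.update(zip(tokens, tokens[1:]))
    # +1 reserves probability mass for a single unseen-word type.
    return contexts, bigrams, len(vocab) + 1


def log_prob(sentence, contexts, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a sentence."""
    tokens = ["<s>"] + sentence.lower().split()
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return total


def minimal_pair_accuracy(pairs, score):
    """Fraction of (acceptable, unacceptable) pairs where the acceptable one scores higher."""
    correct = sum(score(good) > score(bad) for good, bad in pairs)
    return correct / len(pairs)


# Hypothetical toy training data and minimal pairs (subject-verb agreement).
corpus = [
    "the cats sleep",
    "the cat sleeps",
    "the dogs bark",
    "the dog barks",
]
contexts, bigrams, vocab_size = train_bigram(corpus)


def score(sentence):
    return log_prob(sentence, contexts, bigrams, vocab_size)


pairs = [
    ("the cats sleep", "the cats sleeps"),
    ("the dog barks", "the dog bark"),
]
accuracy = minimal_pair_accuracy(pairs, score)
print(accuracy)
```

Because the two sentences in a minimal pair differ only in the contrast being tested, comparing whole-sentence scores isolates that contrast. Any scorer that maps a sentence to a log-probability, such as a neural LM, could be passed in place of `score`.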
- Year
- 2019
- Venue
- blimp-the-benchmark-of-linguistic-minimal
- Authors
- 7
- Hosting
- Abstract only
Attribution
- Abstract & full text
- arxiv.org/abs/1912.00582v4
- TL;DR
- Semantic Scholar