0

Typhoon: Towards an Effective Task-Specific Masking Strategy for Pre-trained Language Models

The choice of \emph{which} tokens to mask is a central, under-examined design decision in masked language modeling (MLM). Standard pretraining masks tokens uniformly at random, but several studies show that more informative masking targets can improve downstream performance.

Year
2023
Hosting
Full text hostedCC-BY-4.0

Cite

Notes

Only stored in your browser.

Attribution

Abstract & full text
arxiv.org/abs/2303.15619CC-BY-4.0
TL;DR
Semantic Scholar
Attribution policy →

Abstract

The choice of which tokens to mask is a central, under-examined design decision in masked language modeling (MLM). Standard pretraining masks tokens uniformly at random, but several studies show that more informative masking targets can improve downstream performance. We study masking as a task-adaptive component of the fine-tuning pipeline and introduce Typhoon, a masking strategy that uses the gradient of the task loss with respect to one-hot token inputs to estimate, online, how much each token type contributes to the objective. Typhoon maintains an exponential moving average of per-token-type saliency and calibrates these scores into a masking distribution whose expected masking rate matches a target budget, under a token-independence approximation. We formalize the method and evaluate it against random masking and whole-word masking on two GLUE tasks, MRPC and CoLA, across three BERT-family backbones (TinyBERT, DistilBERT, and BERT-base) and five random seeds per configuration (90 training runs in total). Our main finding is that, once seed variance is accounted for, no masking strategy is reliably better than the others on these tasks: on MRPC the gap between Typhoon and the best baseline stays within 0.004 F_1, across all twelve Typhoon comparisons no paired test reaches significance, and every 95% confidence interval contains zero. Typhoon's apparent advantage in single-run experiments does not survive this more careful evaluation. We read this as a cautionary, reproducibility-focused result -- gradient-based task-adaptive masking is competitive but not clearly better than resource-free random masking at this scale -- and we describe a clean modern reimplementation to support follow-up work.