In this paper, we propose $FastDoc$ (Fast Continual Pre-training Technique using Document Level Metadata and Taxonomy), a novel, compute-efficient framework that utilizes Document metadata and Domain-Specific Taxonomy as supervision signals to continually pre-train transformer encoder on a domain-specific corpus. The main innovation is that during domain-specific pretraining, an open-domain encoder is continually pre-trained using sentence-level embeddings as inputs (to accommodate long documents), however, fine-tuning is done with token-level embeddings as inputs to this encoder. We perform such domain-specific pre-training on three different domains namely customer support, scientific, and legal domains, and compare performance on 6 different downstream tasks and 9 different datasets. The novel use of document-level supervision along with sentence-level embedding input for pre-training reduces pre-training compute by around $1,000$, $4,500$, and $500$ times compared to MLM and/or NSP in Customer Support, Scientific, and Legal Domains, respectively. The reduced training time does not lead to a deterioration in performance. In fact we show that $FastDoc$ either outperforms or performs on par with several competitive transformer-based baselines in terms of character-level F1 scores and other automated metrics in the Customer Support, Scientific, and Legal Domains. Moreover, reduced training aids in mitigating the risk of catastrophic forgetting. Thus, unlike baselines, $FastDoc$ shows a negligible drop in performance on open domain.
$FastDoc$: Domain-Specific Fast Continual Pre-training Technique using Document-Level Metadata and Taxonomy
A fast pre-training technique using document-level metadata reduces pre-training compute for domain-specific transformers while maintaining performance on both domain-specific and open-domain benchmarks.
- Year
- 2023
- Venue
- arXiv 2023
- Authors
- 6
- Hosting
- Abstract onlyARXIV-DEFAULT
Cite
Notes
Only stored in your browser.
Attribution
- Abstract & full text
- arxiv.org/abs/2306.06190v3ARXIV-DEFAULT
- TL;DR
- Semantic Scholar