DNABERT-S: Pioneering Species Differentiation with Species-Aware DNA Embeddings

DNABERT-S, a specialized genome foundation model, improves DNA embeddings through Manifold Instance Mixup and Curriculum Contrastive Learning, outperforming baselines in species clustering, species classification, and metagenomics binning.

Year
2024
Venue
arXiv 2024
Authors
8
Hosting
Abstract only

Abstract & full text
arxiv.org/abs/2402.08777v3

Abstract

We introduce DNABERT-S, a tailored genome model that develops species-aware embeddings to naturally cluster and segregate DNA sequences of different species in the embedding space. Differentiating species from genomic sequences (i.e., DNA and RNA) is vital yet challenging, since many real-world species remain uncharacterized and lack known reference genomes. Embedding-based methods are therefore used to differentiate species in an unsupervised manner. DNABERT-S builds upon a pre-trained genome foundation model named DNABERT-2. To encourage effective embeddings for error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C$^2$LR) strategy. Empirical results on 23 diverse datasets show DNABERT-S's effectiveness, especially in realistic label-scarce scenarios. For example, it identifies twice as many species from a mixture of unlabeled genomic sequences, doubles the Adjusted Rand Index (ARI) in species clustering, and, with just 2-shot training, outperforms the top baseline's 10-shot species classification performance. Model, code, and data are publicly available at https://github.com/MAGICS-LAB/DNABERT_S.
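
To make the MI-Mix idea in the abstract more concrete, here is a minimal pure-Python sketch of a mixup-style contrastive objective. It is an illustration under stated assumptions, not the authors' implementation: the toy tanh layers stand in for transformer layers, the mixing weight is drawn uniformly, and the names `toy_layer`, `mix_hidden`, `encode`, and `mixed_contrastive_loss` are all hypothetical. The sketch mixes hidden states at one randomly chosen layer, then asks the output to match both source positives in the same proportions used for mixing.

```python
import math
import random

def toy_layer(scale):
    """Hypothetical stand-in for one encoder layer: elementwise tanh(scale * x)."""
    return lambda batch: [[math.tanh(scale * x) for x in row] for row in batch]

def mix_hidden(batch, perm, lam):
    """Mix row i with its partner row perm[i]: lam * h_i + (1 - lam) * h_perm[i]."""
    return [[lam * a + (1.0 - lam) * b for a, b in zip(batch[i], batch[perm[i]])]
            for i in range(len(batch))]

def encode(batch, layers, mix_at=None, perm=None, lam=1.0):
    """Run the layer stack; optionally mix hidden states right after layer mix_at."""
    hidden = batch
    for k, layer in enumerate(layers):
        hidden = layer(hidden)
        if k == mix_at:
            hidden = mix_hidden(hidden, perm, lam)
    return hidden

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)

def mixed_contrastive_loss(anchors, positives, layers, mix_at, perm, lam,
                           temperature=0.1):
    """Soft-target contrastive loss: mixed anchor i should match positive i
    with weight lam and positive perm[i] with weight 1 - lam."""
    z_mixed = encode(anchors, layers, mix_at, perm, lam)
    z_pos = encode(positives, layers)  # positives are encoded without mixing
    total = 0.0
    for i, z in enumerate(z_mixed):
        logits = [cosine(z, p) / temperature for p in z_pos]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
        total -= (lam * (logits[i] - log_norm)
                  + (1.0 - lam) * (logits[perm[i]] - log_norm))
    return total / len(z_mixed)

# Toy usage: three "sequences" and slightly perturbed positive views.
rng = random.Random(0)
layers = [toy_layer(1.0), toy_layer(0.5), toy_layer(2.0)]
anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
positives = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
perm = [1, 2, 0]                      # mixing partner for each row
mix_at = rng.randrange(len(layers))   # randomly selected layer
lam = rng.uniform(0.0, 1.0)           # assumption: uniform mixing weight
loss = mixed_contrastive_loss(anchors, positives, layers, mix_at, perm, lam)
print(f"mixed at layer {mix_at} with lam={lam:.2f}: loss={loss:.4f}")
```

With `lam = 1.0` the mixing step is a no-op and the loss reduces to an ordinary InfoNCE-style contrastive loss, which gives one way to sanity-check the sketch.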
