Researchers from InstaDeep and NVIDIA have open-sourced Nucleotide Transformers (NT), a set of foundation models for genomics data. The largest NT model has 2.5 billion parameters and was trained on genetic sequence data from 850 species. It outperforms other state-of-the-art genomics foundation models on several genomics benchmarks.
The InstaDeep published a technical description of the models in Nature. NT uses an encoder-only Transformer architecture and is pre-trained using the same masked language model objective as BERT. The pre-trained NT models can be used in two ways: to produce embeddings for use as features in smaller models, or fine-tuned with a task-specific head replacing the language model head. InstaDeep evaluated NT on 18 downstream tasks, such as epigenetic marks prediction and promoter sequence prediction, and compared it to three baseline models. NT achieved the “highest overall performance across tasks” and outperformed all other models on promoter and splicing tasks. According to InstaDeep:
The Nucleotide Transformer opens doors to novel applications in genomics. Intriguingly, even probing of intermediate layers reveals rich contextual embeddings that capture key genomic features, such as promoters and enhancers, despite no supervision during training. [We] show that the zero-shot learning capabilities of NT enable [predicting] the impact of genetic mutations, offering potentially new tools for understanding disease mechanisms.
The best-performing NT model, Multispecies 2.5B, contains 2.5 billion parameters and was trained on data from 850 species of “diverse phyla,” including bacteria, fungi, and invertebrates as well as mammals such as mice and humans. Because this model outperformed a 2.5B parameter NT model trained only on human data, InstaDeep says that the multi-species data is “key to improving our understanding of the human genome.”
InstaDeep compared Multispecies 2.5B’s performance to three other genomics foundational models: Enformer, HyenaDNA, and DNABERT-2. All models were fine-tuned for each of the 18 downstream tasks. While Enformer had the best performance on enhancer prediction and “some” chromatin tasks, NT was the best overall. It outperformed HyenaDNA on all tasks, even though HyenaDNA was trained on the “human reference genome.”
Besides its use on downstream tasks, InstaDeep also investigated the model’s ability to predict the severity of genetic mutations. This was done using “zero-shot scores” of sequences, calculated using cosine distances in embedding space. They noted that this score produced a “moderate” correlation with severity.
An Instadeep employee BioGeek joined a Hacker News discussion about the work, pointing out example use cases in a Huggingface notebook. BioGeek also mentioned a previous Instadeep model called ChatNT:
[Y]ou can ask natural language questions like “Determine the degradation rate of the human RNA sequence @myseq.fna on a scale from -5 to 5.” and the ChatNT will answer with “The degradation rate for this sequence is 1.83.”
Another user said:
I’ve been trialing a bunch of these models at work. They basically learn where the DNA has important functions, and what those functions are. It’s very approximate, but up to now that’s been very hard to do from just the sequence and no other data.
The Nucleotide Transformers code is available on GitHub. The model files can be downloaded from Huggingface.