ProkBERT PhaStyle: Accurate Phage Lifestyle Prediction with Pretrained Genomic Language Models

Julianna Juhász, Babett Bodnár, János Juhász et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Dec. 8, 2024

Abstract
Background: Phage lifestyle prediction, i.e. classifying phage sequences as virulent or temperate, is crucial in biomedical and ecological applications. Sequences from metagenome and metavirome assemblies are often fragmented, and the diversity of environmental phages is not well known. Current computational approaches rely on database comparisons or machine learning algorithms that require significant effort and expertise to update. We propose using genomic language models for lifestyle classification, allowing efficient and direct analysis of nucleotide sequences without the need for sophisticated preprocessing pipelines or manually curated databases.
Methods: We trained three genomic language models (DNABERT-2, Nucleotide Transformer, ProkBERT) on datasets of short, fragmented phage sequences. These models were then compared with dedicated lifestyle prediction methods (PhaTYP, DeePhage, BACPHLIP) in terms of accuracy, inference speed, and generalization capability.
Results: ProkBERT PhaStyle consistently outperforms existing models in various scenarios. It generalizes well to out-of-sample data, accurately classifies phages from extreme environments, and also demonstrates high inference speed. Despite having up to 20 times fewer parameters, it proved to be better performing than much larger models.
Conclusions: Genomic language models offer a simple and computationally efficient alternative for solving complex classification tasks, such as lifestyle prediction. ProkBERT PhaStyle's simplicity, performance, and generalization capability suggest its utility in clinical and ecological applications.
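The workflow the abstract describes, fine-tuning a pretrained genomic language model into a binary virulent/temperate fragment classifier, maps onto a few lines of Hugging Face code. Below is a minimal sketch in Python; the model id, label order, and tokenizer behavior are assumptions for illustration, not the paper's exact configuration, and the classification head is freshly initialized here, so real use requires fine-tuning on labeled fragments first.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "neuralbioinfo/prokbert-mini"  # assumed checkpoint id; verify before use

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=2, trust_remote_code=True  # assumed: 0 = temperate, 1 = virulent
)

# A short contig standing in for a fragmented metavirome assembly.
fragment = "ATGCGTACGTTAGCCGGATAACGTTGCAGGCTTAAACGGT"
inputs = tokenizer(fragment, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)
print(f"P(temperate)={probs[0, 0]:.3f}  P(virulent)={probs[0, 1]:.3f}")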

Language: English

The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics

Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza‐Revilla et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: Jan. 15, 2023

Closing the gap between measurable genetic information and observable traits is a longstanding challenge in genomics. Yet, the prediction of molecular phenotypes from DNA sequences alone remains limited and inaccurate, often driven by the scarcity of annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named the Nucleotide Transformer, ranging from 50M up to 2.5B parameters and integrating information from 3,202 diverse human genomes, as well as 850 genomes selected across diverse phyla, including both model and non-model organisms. These transformer models yield transferable, context-specific representations of nucleotide sequences, which allow for accurate molecular phenotype prediction even in low-data settings. We show that the developed models can be fine-tuned at low cost to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements, including those that regulate gene expression, such as enhancers. Lastly, we demonstrate that utilizing model representations can improve the prioritization of functional variants. The training and application of foundational models in genomics explored in this study provide a widely applicable stepping stone to bridge the gap between genotype and molecular phenotype. Code and weights are available at https://github.com/instadeepai/nucleotide-transformer in Jax and at https://huggingface.co/InstaDeepAI in PyTorch. Example notebooks showing how to apply these models to any downstream task are available at https://huggingface.co/docs/transformers/notebooks#pytorch-bio.
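Since the abstract points to published weights, the practical entry point is loading a checkpoint and pooling its hidden states into sequence embeddings. A minimal sketch, assuming the 500M human-reference checkpoint from the InstaDeepAI Hugging Face collection (verify the exact model id against the repository before use):

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_ID = "InstaDeepAI/nucleotide-transformer-500m-human-ref"  # assumed id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

sequence = "ATTCCGATTCCGATTCCGGGAACTTACGTACGTAGCTAGCTAACG"
tokens = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**tokens, output_hidden_states=True)

# Mean-pool the final hidden layer into a single vector per sequence; this
# frozen representation can feed a lightweight downstream predictor.
embedding = outputs.hidden_states[-1].mean(dim=1)
print(embedding.shape)  # (1, hidden_size)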

Language: English

Citations: 103

Sequence modeling and design from molecular to genome scale with Evo

Eric Nguyen, Michael Poli, Matthew G. Durrant et al.

Science, Journal Year: 2024, Volume and Issue: 386(6723)

Published: Nov. 14, 2024

The genome is a sequence that encodes the DNA, RNA, and proteins that orchestrate an organism's function. We present Evo, a long-context genomic foundation model with a frontier architecture trained on millions of prokaryotic and phage genomes, and report scaling laws on DNA to complement observations in language and vision. Evo generalizes across DNA, RNA, and proteins, enabling zero-shot function prediction competitive with domain-specific models and the generation of functional CRISPR-Cas and transposon systems, representing the first examples of protein-RNA and protein-DNA codesign with a language model. Evo also learns how small mutations affect whole-organism fitness and generates megabase-scale sequences with plausible genomic architecture. These capabilities span molecular to genome scales of complexity, advancing our understanding and control of biology.

Language: English

Citations: 55

Sequence modeling and design from molecular to genome scale with Evo
Eric Nguyen, Michael Poli, Matthew G. Durrant et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Feb. 27, 2024

The genome is a sequence that completely encodes the DNA, RNA, and proteins that orchestrate the function of a whole organism. Advances in machine learning combined with massive datasets of whole genomes could enable a biological foundation model that accelerates the mechanistic understanding and generative design of complex molecular interactions. We report Evo, a genomic foundation model that enables prediction and generation tasks from the molecular to the genome scale. Using an architecture based on advances in deep signal processing, we scale Evo to 7 billion parameters with a context length of 131 kilobases (kb) at single-nucleotide, byte resolution. Trained on whole prokaryotic genomes, Evo can generalize across the three fundamental modalities of the central dogma of molecular biology to perform zero-shot function prediction that is competitive with, or outperforms, leading domain-specific language models. Evo also excels at multi-element generation tasks, which we demonstrate by generating synthetic CRISPR-Cas molecular complexes and entire transposable systems for the first time. Using information learned over whole genomes, Evo can predict gene essentiality at nucleotide resolution and can generate coding-rich sequences up to 650 kb in length, orders of magnitude longer than previous methods. Advances in multi-modal and multi-scale learning with Evo provide a promising path toward improving our understanding and control of biology across multiple levels of complexity.
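One capability the abstract mentions, predicting how small mutations affect fitness, reduces in practice to zero-shot likelihood scoring with an autoregressive model: a variant the model finds less probable is predicted to be more deleterious. A minimal sketch, assuming an Evo checkpoint distributed through Hugging Face with custom code and support for the standard labels API (the model id and that support are assumptions; any causal DNA language model fits the same pattern):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "togethercomputer/evo-1-8k-base"  # assumed id; verify before use

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)

def mean_log_likelihood(seq: str) -> float:
    """Average per-token log-probability the model assigns to seq."""
    ids = tokenizer(seq, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids returns the causal-LM cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return -loss.item()

wild_type = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGT"
mutant = wild_type[:6] + "T" + wild_type[7:]  # single-nucleotide substitution

# A drop in likelihood for the mutant suggests a deleterious change.
print(mean_log_likelihood(wild_type), mean_log_likelihood(mutant))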

Language: English

Citations: 52

Nucleotide Transformer: building and evaluating robust foundation models for human genomics

Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza‐Revilla et al.

Nature Methods, Journal Year: 2024, Volume and Issue: unknown

Published: Nov. 28, 2024

The prediction of molecular phenotypes from DNA sequences remains a longstanding challenge in genomics, often driven by limited annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named the Nucleotide Transformer, ranging from 50 million up to 2.5 billion parameters and integrating information from 3,202 human genomes and 850 genomes of diverse species. These transformer models yield context-specific representations of nucleotide sequences, which allow for accurate molecular phenotype predictions even in low-data settings. We show that the developed models can be fine-tuned at low cost to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements and can be used to improve the prioritization of genetic variants. The training and application of foundational models in genomics provides a widely applicable approach for accurate molecular phenotype prediction from sequence. The Nucleotide Transformer is a series of foundation models of different parameter sizes, trained on different datasets, that can be applied to various downstream tasks by fine-tuning.

Language: English

Citations: 36

A DNA language model based on multispecies alignment predicts the effects of genome-wide variants
Gonzalo Benegas, Carlos Albors, Alan J. Aw et al.

Nature Biotechnology, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 2, 2025

Language: English

Citations: 4

Scientific Large Language Models: A Survey on Biological & Chemical Domains
Qiang Zhang, Keyan Ding, Tingting Lv et al.

ACM Computing Surveys, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 26, 2025

Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered for facilitating scientific discovery. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of "scientific language", whilst providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.

Language: English

Citations: 2

Species-aware DNA language models capture regulatory elements and their evolution
Alexander Karollus, Johannes Hingerl, Dennis Gankin et al.

Genome Biology, Journal Year: 2024, Volume and Issue: 25(1)

Published: April 2, 2024

Abstract
Background: The rise of large-scale multi-species genome sequencing projects promises to shed new light on how genomes encode gene regulatory instructions. To this end, algorithms are needed that can leverage conservation to capture regulatory elements while accounting for their evolution.
Results: Here, we introduce species-aware DNA language models, which we trained on more than 800 species spanning over 500 million years of evolution. Investigating their ability to predict masked nucleotides from context, we show that DNA language models distinguish transcription factor and RNA-binding protein motifs from background non-coding sequence. Owing to their flexibility, DNA language models capture conserved regulatory elements over much further evolutionary distances than sequence alignment would allow. Remarkably, DNA language models reconstruct motif instances bound in vivo better than unbound ones and account for the evolution of motif sequences and their positional constraints, showing that these models capture functional high-order sequence and evolutionary context. We further show that species-aware training yields improved sequence representations for endogenous and MPRA-based gene expression prediction, as well as for motif discovery.
Conclusions: Collectively, these results demonstrate that species-aware DNA language models are a powerful, flexible, and scalable tool to integrate information from large compendia of highly diverged genomes.
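The core evaluation, predicting a masked nucleotide from its context, is easy to state in code. A minimal sketch of that probe follows; the model id below is a hypothetical placeholder, not a real checkpoint, and any masked DNA language model with a single-nucleotide vocabulary would slot in:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_ID = "some-org/species-aware-dna-lm"  # hypothetical placeholder id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

sequence = "ACGTTTGCATGACGTTGCA"
inputs = tokenizer(sequence, return_tensors="pt")
masked = inputs.input_ids.clone()
pos = 9  # token position to hide
masked[0, pos] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(input_ids=masked).logits

# Distribution over bases at the masked position; a well-captured motif
# concentrates probability mass on the true nucleotide.
probs = torch.softmax(logits[0, pos], dim=-1)
for base in "ACGT":
    token_id = tokenizer.convert_tokens_to_ids(base)
    print(base, f"{probs[token_id].item():.3f}")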

Language: English

Citations: 14

Large language models in plant biology
Hilbert Yuen In Lam, Xing Er Ong, Marek Mutwil et al.

Trends in Plant Science, Journal Year: 2024, Volume and Issue: 29(10), P. 1145 - 1155

Published: May 26, 2024

Language: English

Citations: 14

Evaluating the representational power of pre-trained DNA language models for regulatory genomics
Ziqi Tang, Nikunj V. Somia, Yiyang Yu et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: March 4, 2024

The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues with current pre-training strategies for the non-coding genome.
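The paper's central contrast, a probe on frozen gLM embeddings versus the same probe on one-hot encoded sequence, can be reproduced with a handful of lines. A minimal sketch of the one-hot baseline, using a toy motif-presence label standing in for real functional data; swapping the feature matrix for gLM embeddings gives the other arm of the comparison:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Flatten a DNA sequence into an L x 4 one-hot feature vector."""
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        out[i, BASES.index(base)] = 1.0
    return out.ravel()

rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list(BASES), size=50)) for _ in range(200)]
labels = np.array([1 if "TAT" in s else 0 for s in seqs])  # toy label

X = np.stack([one_hot(s) for s in seqs])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("one-hot probe accuracy:", probe.score(X_te, y_te))
# Replace X with frozen gLM embeddings to run the paper's other condition.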

Language: English

Citations: 12

Human Genome Book: Words, Sentences and Paragraphs
Liang Wang

Published: Feb. 19, 2025

Since the completion of the human genome sequencing project in 2001, significant progress has been made in areas such as gene regulation editing and protein structure prediction. However, given the vast amount of genomic data, the segments that can be fully annotated and understood remain relatively limited. If we consider the genome as a book, constructing its equivalents of words, sentences, and paragraphs has been a long-standing and popular research direction. Recently, studies on transfer learning in large language models have provided a novel approach to this challenge. Multilingual transfer ability, which assesses how well a model fine-tuned on a source language can be applied to other languages, has been extensively studied in multilingual pre-trained models. Similarly, the transfer of natural language capabilities to "DNA language" has also been validated. Building upon these findings, we first trained a foundational model capable of transferring linguistic capabilities from English to DNA sequences. Using this model, we constructed a vocabulary of DNA words and mapped DNA words to their English equivalents. Subsequently, we fine-tuned this model using English datasets for paragraphing and sentence segmentation to develop models capable of segmenting DNA sequences into sentences and paragraphs. Leveraging these models, we processed the GRCh38.p14 genome by segmenting, tokenizing, and organizing it into a "book" comprised of "words," "sentences," and "paragraphs." Additionally, based on the DNA-to-English mapping, we created an "English version" of the genome book. This study offers a new perspective on understanding the genome and provides exciting possibilities for developing innovative tools for DNA search, generation, and analysis.
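The "words" step the abstract describes is, at heart, learning a subword vocabulary over nucleotide text. A minimal sketch using byte-pair encoding from the tokenizers library, to illustrate the concept only; it is not the paper's actual English-transfer pipeline:

from tokenizers import Tokenizer, models, trainers

tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(vocab_size=64, show_progress=False)

# Tiny illustrative corpus; a real run would stream genome-scale text.
corpus = [
    "ATGCGTACGTTAGCATGCATGCGGATCCATGCATGC",
    "TTAGCATGCGGATCCTTAGCATGCATGCGTACGTTA",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoded = tokenizer.encode("ATGCGTACGTTAGCATGC")
print(encoded.tokens)  # merged subunits play the role of DNA "words"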

Language: English

Citations: 0