bioRxiv (Cold Spring Harbor Laboratory),
journal year: 2024, issue: unknown. Published: Dec. 8, 2024
Abstract
Background
Phage lifestyle prediction, i.e., classifying phage sequences as virulent or temperate, is crucial in biomedical and ecological applications. Phage sequences from metagenome and metavirome assemblies are often fragmented, and the diversity of environmental phages is not well known. Current computational approaches rely on database comparisons and machine learning algorithms that require significant effort and expertise to update. We propose using genomic language models for lifestyle classification, allowing efficient, direct analysis of nucleotide sequences without the need for sophisticated preprocessing pipelines or manually curated databases.
Methods
We trained three genomic language models (DNABERT-2, Nucleotide Transformer, and ProkBERT) on datasets of short, fragmented sequences. These models were then compared with dedicated phage lifestyle prediction methods (PhaTYP, DeePhage, and BACPHLIP) in terms of accuracy, inference speed, and generalization capability.
Results
ProkBERT PhaStyle consistently outperforms existing models in various scenarios. It generalizes well to out-of-sample data, accurately classifies phages from extreme environments, and also demonstrates high inference speed. Despite having up to 20 times fewer parameters, it proved to perform better than much larger models.
Conclusions
Genomic language models offer a simple and computationally efficient alternative for solving complex classification tasks, such as phage lifestyle prediction. ProkBERT PhaStyle's simplicity and performance suggest its utility in clinical applications.
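To make the setup concrete, here is a minimal sketch of fine-tuning a pretrained genomic language model for virulent-vs-temperate classification; the Hub model ID (neuralbioinfo/prokbert-mini), the toy data, and all hyperparameters are assumptions for illustration, not the paper's exact pipeline.

```python
# Hedged sketch: fine-tuning a genomic LM for binary phage lifestyle
# classification. Model ID, data, and settings are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "neuralbioinfo/prokbert-mini"  # assumed Hub ID of a genomic LM

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=2, trust_remote_code=True)  # 0=temperate, 1=virulent

# Toy stand-in for a real training set of short, fragmented phage contigs.
train = Dataset.from_dict({
    "sequence": ["ATGCGTTTGACAGGCA", "TTGACAGCTAGGCATG"],
    "label": [1, 0],
})

def tokenize(batch):
    # Raw nucleotides go straight into the tokenizer: no gene calling,
    # protein search, or manually curated database is involved.
    return tokenizer(batch["sequence"], truncation=True, max_length=512)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phage-lifestyle-sketch",
                           per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=train,
)
trainer.train()
```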
bioRxiv (Cold Spring Harbor Laboratory),
journal year: 2023, issue: unknown. Published: Jan. 15, 2023
Closing the gap between measurable genetic information and observable traits is a longstanding challenge in genomics. Yet, prediction of molecular phenotypes from DNA sequences alone remains limited and inaccurate, often driven by the scarcity of annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named Nucleotide Transformer, ranging from 50M up to 2.5B parameters and integrating information from 3,202 diverse human genomes, as well as 850 genomes selected across diverse phyla, including both model and non-model organisms. These transformer models yield transferable, context-specific representations of nucleotide sequences, which allow for accurate molecular phenotype prediction even in low-data settings. We show that the developed models can be fine-tuned at low cost, even in low-data regimes, to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements, including those that regulate gene expression, such as enhancers. Lastly, we demonstrate that utilizing model representations can improve the prioritization of functional variants. The training and application of the foundational models explored in this study provide a widely applicable stepping stone to bridge the gap between DNA sequence and molecular phenotype. Code and weights are available at https://github.com/instadeepai/nucleotide-transformer (Jax) and https://huggingface.co/InstaDeepAI (PyTorch). Example notebooks showing how to apply these models to any downstream task are available at https://huggingface.co/docs/transformers/notebooks#pytorch-bio.
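As a concrete example of using the released checkpoints, the following sketch extracts sequence embeddings from one of the PyTorch models; the checkpoint choice and the mean-pooling strategy are assumptions made here for illustration, not a prescription from the paper.

```python
# Hedged sketch: extracting embeddings from a Nucleotide Transformer
# checkpoint for use as features in downstream, low-data prediction tasks.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

CKPT = "InstaDeepAI/nucleotide-transformer-500m-human-ref"

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForMaskedLM.from_pretrained(CKPT)
model.eval()

inputs = tokenizer("ATTCCGATTCCGATTCCG", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Last hidden layer, mean-pooled over tokens: one vector per sequence.
embedding = out.hidden_states[-1].mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```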
The genome is a sequence that encodes the DNA, RNA, and proteins that orchestrate an organism's function. We present Evo, a long-context genomic foundation model with a frontier architecture trained on millions of prokaryotic and phage genomes, and report scaling laws on DNA to complement observations in language and vision. Evo generalizes across DNA, RNA, and proteins, enabling zero-shot function prediction competitive with domain-specific models and the generation of functional CRISPR-Cas and transposon systems, representing the first examples of protein-RNA and protein-DNA codesign with a language model. Evo also learns how small mutations affect whole-organism fitness and generates megabase-scale sequences with plausible genomic architecture. These capabilities span molecular to genome scales of complexity, advancing our understanding and control of biology.
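A minimal sketch of how such zero-shot prediction can be set up with an autoregressive genomic model: score a wild-type and a mutant sequence by log-likelihood and compare. The Hub ID, the scoring recipe, and the toy sequences below are assumptions for illustration.

```python
# Hedged sketch: zero-shot mutation scoring via sequence log-likelihoods.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "togethercomputer/evo-1-8k-base"  # assumed Hub ID; needs custom code

tokenizer = AutoTokenizer.from_pretrained(CKPT, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(CKPT, trust_remote_code=True)
model.eval()

def log_likelihood(seq: str) -> float:
    # Sum of per-token log-probabilities under the autoregressive model.
    ids = tokenizer(seq, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    return logp.gather(-1, ids[:, 1:, None]).sum().item()

wild_type = "ATGAAACGCATTAGCACCACC"
mutant    = "ATGAAACGCATTAGCACCACG"  # single-nucleotide substitution
# A negative delta means the model assigns the mutant lower likelihood,
# which the paper relates to effects on whole-organism fitness.
print(log_likelihood(mutant) - log_likelihood(wild_type))
```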
bioRxiv (Cold Spring Harbor Laboratory),
journal year: 2024, issue: unknown. Published: Feb. 27, 2024
The genome is a sequence that completely encodes the DNA, RNA, and proteins that orchestrate the function of a whole organism. Advances in machine learning combined with massive datasets of whole genomes could enable a biological foundation model that accelerates the mechanistic understanding and generative design of complex molecular interactions. We report Evo, a genomic foundation model that enables prediction and generation tasks from the molecular to the genome scale. Using an architecture based on advances in deep signal processing, we scale Evo to 7 billion parameters with a context length of 131 kilobases (kb) at single-nucleotide, byte-level resolution. Trained on whole prokaryotic genomes, Evo can generalize across the three fundamental modalities of the central dogma of molecular biology to perform zero-shot function prediction that is competitive with, or outperforms, leading domain-specific language models. Evo also excels at multi-element generation tasks, which we demonstrate by generating synthetic CRISPR-Cas molecular complexes and entire transposable systems for the first time. Using information learned over whole genomes, Evo can also predict gene essentiality at nucleotide resolution and generate coding-rich sequences up to 650 kb in length, orders of magnitude longer than previous methods. Advances in multi-modal and multi-scale learning with Evo provide a promising path toward improving our understanding and control of biology across multiple levels of complexity.
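To illustrate generation at single-nucleotide resolution, here is a hedged sketch of sampling a continuation from a long-context checkpoint; the Hub ID and sampling settings are assumptions, and the paper's pipeline for 650 kb generations is considerably more involved.

```python
# Hedged sketch: prompting an autoregressive genomic model for a
# sequence continuation, in the spirit of Evo's generation experiments.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "togethercomputer/evo-1-131k-base"  # assumed Hub ID; needs custom code

tokenizer = AutoTokenizer.from_pretrained(CKPT, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(CKPT, trust_remote_code=True)
model.eval()

# Seed with a short prompt and sample a continuation base by base.
prompt = tokenizer("ATGAAACGCATTAGCACC", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(prompt, max_new_tokens=256, do_sample=True,
                         temperature=0.7, top_k=4)
print(tokenizer.decode(out[0]))
```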
The prediction of molecular phenotypes from DNA sequences alone remains a longstanding challenge in genomics, often driven by limited annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named Nucleotide Transformer, ranging from 50 million up to 2.5 billion parameters and integrating information from 3,202 human genomes and 850 genomes from diverse species. These transformer models yield context-specific representations of nucleotide sequences, which allow for accurate predictions of molecular phenotypes even in low-data settings. We show that the developed models can be fine-tuned at low cost to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements and can be used to improve the prioritization of genetic variants. The training and application of foundational models in genomics provides a widely applicable approach for accurate molecular phenotype prediction from DNA sequence. Nucleotide Transformer is a series of models with different parameter sizes and training datasets that can be applied to various downstream tasks through fine-tuning.
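One way the low-cost fine-tuning mentioned above can be realized is with parameter-efficient adapters. The sketch below uses LoRA via the peft library as an illustrative stand-in for the paper's parameter-efficient recipe; the target-module names and LoRA settings are assumptions.

```python
# Hedged sketch: parameter-efficient fine-tuning of a Nucleotide
# Transformer checkpoint with LoRA adapters (illustrative settings).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

CKPT = "InstaDeepAI/nucleotide-transformer-500m-human-ref"

model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=2)

lora = LoraConfig(
    task_type="SEQ_CLS",
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections (assumed names)
)
model = get_peft_model(model, lora)

# Only the small adapter matrices (and the task head) are trainable,
# which keeps fine-tuning cheap even for billion-parameter checkpoints.
model.print_trainable_parameters()
```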
ACM Computing Surveys,
journal year: 2025, issue: unknown. Published: Jan. 26, 2025
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered for facilitating scientific discovery. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of "scientific language", whilst providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of models for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions along with advances in LLMs. By offering a comprehensive overview of the technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.
Abstract
Background
The rise of large-scale multi-species genome sequencing projects promises to shed new light on how genomes encode gene regulatory instructions. To this end, algorithms are needed that can leverage conservation to capture regulatory elements while accounting for their evolution.
Results
Here, we introduce species-aware DNA language models, which we trained on more than 800 species spanning over 500 million years of evolution. Investigating their ability to predict masked nucleotides from context, we show that these models distinguish transcription factor and RNA-binding protein motifs from background non-coding sequence. Owing to their flexibility, DNA language models capture conserved regulatory elements over much further evolutionary distances than sequence alignment would allow. Remarkably, the models reconstruct motif instances bound in vivo better than unbound ones and account for the evolution of motif sequences and their positional constraints, showing that they capture functional high-order sequence and evolutionary context. We further show that species-aware training yields improved sequence representations for endogenous and MPRA-based gene expression prediction, as well as motif discovery.
Conclusions
Collectively, these results demonstrate that species-aware DNA language models are a powerful, flexible, and scalable tool to integrate information from large compendia of highly diverged genomes.
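The masked-nucleotide probe described above can be reproduced in a few lines against any masked DNA language model; the checkpoint ID below is a hypothetical placeholder, and the masked position is arbitrary.

```python
# Hedged sketch: probing a masked DNA LM's ability to reconstruct a
# nucleotide from context. The checkpoint ID is a hypothetical placeholder.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

CKPT = "example-org/species-aware-dna-lm"  # hypothetical placeholder ID

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForMaskedLM.from_pretrained(CKPT)
model.eval()

inputs = tokenizer("ACGTGGCTAACGGTAGCT", return_tensors="pt")

# Mask one position and ask the model to reconstruct it from context.
pos = 8
inputs["input_ids"][0, pos] = tokenizer.mask_token_id
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits[0, pos], dim=-1)
# High probability on the true base inside a bound motif, but not in
# background sequence, is the signal used to detect regulatory elements.
top = probs.topk(4)
for p, token_id in zip(top.values, top.indices):
    print(tokenizer.decode([int(token_id)]), float(p))
```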
bioRxiv (Cold Spring Harbor Laboratory),
journal year: 2024, issue: unknown. Published: March 4, 2024
ABSTRACT
The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since these gLMs were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues with current pre-training strategies for the non-coding genome.
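The probing comparison has a simple skeleton: freeze the features (gLM embeddings or one-hot encodings), fit the same lightweight head on both, and compare scores. The sketch below shows the one-hot side with placeholder data; the gLM side would substitute frozen embeddings for X.

```python
# Hedged sketch: a linear probe over one-hot encoded sequences, the
# conventional baseline the abstract compares gLM embeddings against.
# Data here are random placeholders, not a real benchmark.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def one_hot(seq: str) -> np.ndarray:
    # Flattened L x 4 one-hot encoding over A/C/G/T.
    table = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        out[i, table[base]] = 1.0
    return out.ravel()

# Placeholder data: random sequences with a scalar functional readout.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list("ACGT"), size=100)) for _ in range(200)]
y = rng.normal(size=200)

X = np.stack([one_hot(s) for s in seqs])
# For the gLM side, X would instead hold frozen (e.g. mean-pooled) embeddings.

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = RidgeCV().fit(X_tr, y_tr)
print("one-hot probe R^2:", probe.score(X_te, y_te))
```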
Since the completion of the human genome sequencing project in 2001, significant progress has been made in areas such as gene regulation, gene editing, and protein structure prediction. However, given the vast amount of genomic data, the segments that can be fully annotated and understood remain relatively limited. If we consider the genome as a book, constructing its equivalents of words, sentences, and paragraphs has been a long-standing and popular research direction. Recently, studies on transfer learning in large language models have provided a novel approach to this challenge. Multilingual ability, which assesses how well a model fine-tuned on a source language can be applied to other languages, has been extensively studied in multilingual pre-trained models. Similarly, the transfer of natural language capabilities to the "DNA language" has also been validated. Building upon these findings, we first trained a foundational model capable of transferring linguistic capabilities from English to DNA sequences. Using this model, we constructed a vocabulary of DNA "words" and mapped them to their English equivalents. Subsequently, using English datasets for paragraphing and sentence segmentation, we developed models for segmenting DNA sequences into sentences and paragraphs. Leveraging these models, we processed the GRCh38.p14 genome by segmenting, tokenizing, and organizing it into a "book" comprised of "words," "sentences," and "paragraphs." Additionally, based on the DNA-to-English mapping, we created an "English version" of the book. This study offers a new perspective on understanding the genome and provides exciting possibilities for developing innovative tools for DNA search, generation, and analysis.
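To illustrate the "vocabulary" idea, the sketch below trains a tiny subword tokenizer over nucleotide strings and attaches a toy token-to-English mapping; both the tokenizer settings and the mapping are illustrative assumptions, since the paper learns its mapping with a transfer model rather than assigning words arbitrarily.

```python
# Hedged sketch: building a DNA "word" vocabulary with BPE and a toy
# DNA-token -> English-word mapping, loosely mirroring the paper's idea.
from tokenizers import Tokenizer, models, trainers

# Train a small BPE vocabulary over raw nucleotide strings ("DNA words").
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(vocab_size=64, special_tokens=["[UNK]"])
corpus = ["ATGCGTACGTTAGCATGCGTACGT", "TTGACATTGACAGGCATGCATGCA"]
tokenizer.train_from_iterator(corpus, trainer)

# Toy mapping; the paper instead learns the DNA-to-English correspondence
# with a model that transfers linguistic capability from English to DNA.
dna_to_english = {tok: f"word{idx}"
                  for tok, idx in tokenizer.get_vocab().items()}

encoding = tokenizer.encode("ATGCGTACGT")
print(encoding.tokens)                               # DNA "words"
print([dna_to_english[t] for t in encoding.tokens])  # toy "English version"
```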