GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction
Gonzalo Benegas, Carlos Albors, Alan J. Aw et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2023, Issue: unknown

Published: October 11, 2023

Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, their DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect prediction, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework for DNA language models that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC, OMIM), experimental functional assays (DMS, DepMap), and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and non-coding variants.

Language: English
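As an illustration of the scoring recipe this abstract describes, the hypothetical sketch below computes a variant log-likelihood ratio under a generic HuggingFace-style masked DNA language model. GPN-MSA itself additionally conditions on the aligned bases from other species at each position, which this sketch omits; `model`, `tokenizer`, and `variant_llr` are illustrative names, not the paper's API.

```python
import torch

def variant_llr(model, tokenizer, seq: str, pos: int, ref: str, alt: str) -> float:
    """Masked-LM variant scoring: log P(alt) - log P(ref) at the masked site.
    More negative scores mark the alternate allele as less likely under the
    model, i.e. putatively more deleterious."""
    assert seq[pos] == ref, "reference allele must match the input sequence"
    masked = seq[:pos] + tokenizer.mask_token + seq[pos + 1:]
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # locate the masked position among the tokenized inputs
    mask_idx = int((inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0])
    log_probs = logits[0, mask_idx].log_softmax(dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref)
    alt_id = tokenizer.convert_tokens_to_ids(alt)
    return (log_probs[alt_id] - log_probs[ref_id]).item()
```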

Simulating 500 million years of evolution with a language model DOI
Thomas Hayes, Roshan Rao, Halil Akin et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown

Published: July 2, 2024

More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained on tokens generated by evolution can act as evolutionary simulators to generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative model that reasons over sequence, structure, and function. ESM3 can follow complex prompts combining its modalities and is highly responsive to biological alignment. We prompted ESM3 to generate fluorescent proteins with a chain of thought. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% identity) from known fluorescent proteins. Similarly distant natural fluorescent proteins are separated by five hundred million years of evolution.

Language: English

Cited by: 87

Machine learning for functional protein design
Pascal Notin, Nathan Rollins, Yarin Gal et al.

Nature Biotechnology, Journal year: 2024, Issue: 42(2), pp. 216–228

Published: February 1, 2024

Language: English

Cited by: 84

Sequence modeling and design from molecular to genome scale with Evo

Eric Nguyen, Michael Poli, Matthew G. Durrant et al.

Science, Journal year: 2024, Issue: 386(6723)

Published: November 14, 2024

The genome is a sequence that encodes the DNA, RNA, and proteins that orchestrate an organism's function. We present Evo, a long-context genomic foundation model with a frontier architecture trained on millions of prokaryotic and phage genomes, and report scaling laws on DNA to complement observations in language and vision. Evo generalizes across DNA, RNA, and proteins, enabling zero-shot function prediction competitive with domain-specific models and the generation of functional CRISPR-Cas and transposon systems, representing the first examples of protein-RNA and protein-DNA codesign with a language model. Evo also learns how small mutations affect whole-organism fitness and generates megabase-scale sequences with plausible genomic architecture. These capabilities span molecular to genomic scales of complexity, advancing our understanding and control of biology.

Language: English

Cited by: 58
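The zero-shot mutation-effect scoring this abstract mentions is typically a likelihood comparison under the autoregressive model. A minimal sketch, assuming a causal genomic LM with a HuggingFace-style `.logits` output; the function names are illustrative, not Evo's API.

```python
import torch

def sequence_log_likelihood(model, token_ids: torch.Tensor) -> float:
    """Sum of log P(token_t | tokens_<t) under a causal LM.
    token_ids: (1, L) integer tensor of nucleotide tokens."""
    with torch.no_grad():
        logits = model(token_ids).logits            # (1, L, vocab)
    log_probs = logits[:, :-1].log_softmax(dim=-1)  # predictions for positions 1..L-1
    targets = token_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()

def mutation_score(model, wildtype_ids: torch.Tensor, mutant_ids: torch.Tensor) -> float:
    """Delta log-likelihood; negative values flag likely fitness-reducing mutations."""
    return sequence_log_likelihood(model, mutant_ids) - sequence_log_likelihood(model, wildtype_ids)
```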

Sequence modeling and design from molecular to genome scale with Evo
Eric Nguyen, Michael Poli, Matthew G. Durrant et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown

Published: February 27, 2024

The genome is a sequence that completely encodes the DNA, RNA, and proteins that orchestrate the function of a whole organism. Advances in machine learning combined with massive datasets of whole genomes could enable a biological foundation model that accelerates the mechanistic understanding and generative design of complex molecular interactions. We report Evo, a genomic foundation model that enables prediction and generation tasks from the molecular to the genome scale. Using an architecture based on advances in deep signal processing, we scale Evo to 7 billion parameters with a context length of 131 kilobases (kb) at single-nucleotide, byte resolution. Trained on whole prokaryotic genomes, Evo can generalize across the three fundamental modalities of the central dogma of biology to perform zero-shot function prediction that is competitive with, or outperforms, leading domain-specific language models. Evo also excels at multi-element generation tasks, which we demonstrate by generating synthetic CRISPR-Cas complexes and entire transposable systems for the first time. Using information learned over whole genomes, Evo can predict gene essentiality at nucleotide resolution and generate coding-rich sequences up to 650 kb in length, orders of magnitude longer than previous methods. This multi-modal and multi-scale learning provides a promising path toward improving our understanding and control of biology across multiple levels of complexity.

Language: English

Cited by: 52

Simulating 500 million years of evolution with a language model
Thomas Hayes, Roshan Rao, Halil Akin et al.

Science, Journal year: 2025, Issue: unknown

Published: January 16, 2025

More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained at scale on evolutionary data can generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative model that reasons over sequence, structure, and function. ESM3 can follow complex prompts combining its modalities and is highly responsive to alignment to improve fidelity. We prompted ESM3 to generate fluorescent proteins. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% sequence identity) from known fluorescent proteins, which we estimate is equivalent to simulating five hundred million years of evolution.

Language: English

Cited by: 35

Protein language models are biased by unequal sequence sampling across the tree of life
Frances Ding, Jacob Steinhardt

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown

Published: March 12, 2024

Protein language models (pLMs) trained on large protein sequence databases have been used to understand disease and to design novel proteins. In design tasks, the likelihood of a sequence under a pLM is often used as a proxy for protein fitness, so it is critical to understand what signals pLM likelihoods capture. In this work we find that pLM likelihoods unintentionally encode a species bias: sequences from certain species are systematically assigned higher likelihoods, independent of the protein in question. We quantify this bias and show that it arises in large part because of the unequal representation of species in popular protein sequence databases. We further show that this bias can be detrimental for some protein design applications, such as enhancing thermostability. These results highlight the importance of understanding and curating pLM training data to mitigate such biases and to improve design capabilities in under-explored parts of sequence space.

Language: English

Cited by: 23
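The "likelihood as a proxy for fitness" quantity this paper interrogates is usually the masked pseudo-log-likelihood. A minimal sketch of that standard recipe, assuming an ESM-2-style masked pLM with a HuggingFace-style interface; names are illustrative.

```python
import torch

def pseudo_log_likelihood(model, tokenizer, seq: str) -> float:
    """Mask each residue in turn and accumulate log P(true residue | context),
    the usual pLM fitness proxy whose cross-species offsets the paper measures."""
    ids = tokenizer(seq, return_tensors="pt")["input_ids"]
    total = 0.0
    for i in range(1, ids.shape[1] - 1):            # skip BOS/EOS special tokens
        masked = ids.clone()
        masked[0, i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked).logits
        total += logits[0, i].log_softmax(dim=-1)[ids[0, i]].item()
    return total
```

Comparing such scores across orthologous sequences from different species is the kind of measurement in which the reported bias surfaces.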

Guiding questions to avoid data leakage in biological machine learning applications
Judith Bernett, David B. Blumenthal, Dominik G. Grimm et al.

Nature Methods, Journal year: 2024, Issue: 21(8), pp. 1444–1453

Published: August 1, 2024

Language: English

Cited by: 20

Rapid in silico directed evolution by a protein language model with EVOLVEpro
Kaiyi Jiang, Zhaoqing Yan, Matteo Di Bernardo et al.

Science, Journal year: 2024, Issue: unknown

Published: November 21, 2024

Directed protein evolution is central to biomedical applications but faces challenges such as experimental complexity, inefficient multi-property optimization, and local maxima traps. While in silico methods using protein language models (PLMs) can provide guidance from a modeled fitness landscape, they struggle to generalize across diverse protein families and to map to protein activity. We present EVOLVEpro, a few-shot active learning framework that combines PLMs and regression models to rapidly improve protein activity. EVOLVEpro surpasses current methods, yielding up to 100-fold improvements in desired properties. We demonstrate its effectiveness across six proteins in RNA production, genome editing, and antibody binding applications. These results highlight the advantages of few-shot active learning with minimal experimental data over zero-shot predictions. EVOLVEpro opens new possibilities for AI-guided protein engineering in biology and medicine.

Language: English

Cited by: 20
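A hedged sketch of the few-shot active-learning loop this abstract describes, pairing pLM embeddings with a top-layer regressor; the regressor choice, batch size, and function names are assumptions for illustration, not EVOLVEpro's actual implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def propose_next_round(embed, pool, assayed, activities, batch_size=12):
    """One active-learning round: fit a small regressor on pLM embeddings of
    assayed variants, then nominate the highest-predicted unassayed variants
    for the next wet-lab round.
    embed: callable mapping a sequence to a fixed-length embedding (e.g. a
    mean-pooled pLM representation)."""
    reg = RandomForestRegressor(n_estimators=200, random_state=0)
    reg.fit(np.stack([embed(s) for s in assayed]), activities)
    candidates = [s for s in pool if s not in set(assayed)]
    preds = reg.predict(np.stack([embed(s) for s in candidates]))
    ranked = np.argsort(preds)[::-1][:batch_size]   # highest predicted activity first
    return [candidates[i] for i in ranked]
```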

PTM-Mamba: A PTM-Aware Protein Language Model with Bidirectional Gated Mamba Blocks
Zhangzhi Peng, Benjamin Schussheim, Pranam Chatterjee et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown

Published: February 29, 2024

Proteins serve as the workhorses of living organisms, orchestrating a wide array of vital functions. Post-translational modifications (PTMs) of their amino acids greatly influence the structural and functional diversity of different protein types and uphold proteostasis, allowing cells to swiftly respond to environmental changes and intricately regulate complex biological processes. To this point, efforts to model the features of proteins have involved training large and expressive protein language models (pLMs) such as ESM-2 and ProtT5, which accurately encode the structural, functional, and physicochemical properties of input sequences. However, the over 200 million sequences that these pLMs were trained on merely scratch the surface of proteomic diversity, as they neither encode nor account for the effects of PTMs. In this work, we fill this major gap in protein sequence modeling by introducing PTM tokens into the pLM regime. We then leverage recent advancements in structured state space models (SSMs), specifically Mamba, which utilizes efficient hardware-aware primitives to overcome the quadratic time complexities of Transformers. After adding a comprehensive set of PTM tokens to the vocabulary, we train bidirectional Mamba blocks whose outputs are fused with state-of-the-art ESM-2 embeddings via a novel gating mechanism. We demonstrate that our resultant PTM-aware pLM, PTM-Mamba, improves upon ESM-2's performance on various PTM-specific tasks. PTM-Mamba is the first and only pLM that can uniquely represent both wild-type and post-translationally modified sequences, motivating downstream design applications specific to post-translationally modified proteins. To facilitate such applications, the model is made available at: https://huggingface.co/ChatterjeeLab/PTM-Mamba

Language: English

Cited by: 17
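A minimal sketch of the kind of gating the abstract describes for fusing Mamba-block outputs with ESM-2 embeddings; the module name, dimensions, and single-gate design are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Per-position learned gate that blends a PTM-token stream (e.g., the
    output of a bidirectional Mamba block) with frozen ESM-2 embeddings."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, mamba_out: torch.Tensor, esm_emb: torch.Tensor) -> torch.Tensor:
        # mamba_out, esm_emb: (batch, length, d_model)
        g = self.gate(torch.cat([mamba_out, esm_emb], dim=-1))
        # convex combination: g -> 1 favors the PTM-aware stream
        return g * mamba_out + (1.0 - g) * esm_emb
```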

Machine learning-guided co-optimization of fitness and diversity facilitates combinatorial library design in enzyme engineering
Kerr Ding, M. A. Chin, Yunlong Zhao et al.

Nature Communications, Journal year: 2024, Issue: 15(1)

Published: July 29, 2024

The effective design of combinatorial libraries to balance fitness and diversity facilitates the engineering of useful enzyme functions, particularly those that are poorly characterized or unknown in biology. We introduce MODIFY, a machine learning (ML) algorithm that learns from natural protein sequences to infer evolutionarily plausible mutations and to predict enzyme fitness. MODIFY co-optimizes the predicted fitness and sequence diversity of starting libraries, prioritizing high-fitness variants while ensuring broad sequence coverage. In silico evaluation shows that MODIFY outperforms state-of-the-art unsupervised methods in zero-shot fitness prediction and enables ML-guided directed evolution with enhanced efficiency. Using MODIFY, we engineer generalist biocatalysts derived from a thermostable cytochrome c to achieve enantioselective C-B and C-Si bond formation via a new-to-nature carbene transfer mechanism, leading to biocatalysts six mutations away from previously developed enzymes while exhibiting superior or comparable activities. These results demonstrate MODIFY's potential for solving challenging enzyme engineering problems beyond the reach of classic directed evolution.

Language: English

Cited by: 16
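A toy sketch of fitness-diversity co-optimization in the spirit of this abstract, using greedy selection with a max-min distance term; MODIFY's actual objective and optimizer differ, and all names here are illustrative.

```python
import numpy as np

def select_library(fitness: np.ndarray, dist: np.ndarray, k: int, lam: float = 0.5):
    """Greedily pick k variants trading off predicted fitness against diversity.
    fitness[i]: predicted fitness of variant i.
    dist[i, j]: pairwise sequence distance, e.g. normalized Hamming distance.
    lam: weight on the diversity term."""
    chosen = [int(np.argmax(fitness))]              # seed with the fittest variant
    while len(chosen) < k:
        rest = [i for i in range(len(fitness)) if i not in chosen]
        # marginal gain: own fitness plus distance to the nearest already-chosen variant
        gains = [fitness[i] + lam * min(dist[i, j] for j in chosen) for i in rest]
        chosen.append(rest[int(np.argmax(gains))])
    return chosen
```

With lam = 0 this reduces to picking the top-k variants by predicted fitness; larger lam spreads the library across sequence space at some cost in predicted fitness.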