Using protein language models for protein interaction hot spot prediction with limited data
Karen Sargsyan, Carmay Lim

BMC Bioinformatics, Journal Year: 2024, Volume and Issue: 25(1)

Published: March 16, 2024

Protein language models, inspired by the success of large language models in deciphering human language, have emerged as powerful tools for unraveling the intricate code of life inscribed within protein sequences. They have gained significant attention for their promising applications across various areas, including the sequence-based prediction of secondary and tertiary structure, the discovery of new functional protein sequences/folds, and the assessment of mutational impact on protein fitness. However, their utility in learning to predict residue properties from scant datasets, such as protein-protein interaction (PPI)-hotspots whose mutations significantly impair PPIs, remained unclear. Here, we explore the feasibility of using protein-language-learned representations as features for machine learning of PPI-hotspots using a dataset containing 414 experimentally confirmed PPI-hotspots and 504 PPI-nonhot spots.
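
The workflow this abstract describes, PLM embeddings as input features for a classifier trained on a small labeled set, can be pictured with the sketch below. This is a minimal illustration, not the authors' pipeline: the random arrays stand in for per-residue PLM embeddings, and the logistic-regression classifier and 1280-dimensional features are assumptions; only the 414/504 class sizes come from the abstract.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-ins for language-model embeddings of each residue of interest;
# in practice these would come from a pretrained protein language model.
n_hot, n_nonhot, dim = 414, 504, 1280  # class sizes from the abstract
X = np.vstack([rng.normal(0.3, 1.0, size=(n_hot, dim)),
               rng.normal(0.0, 1.0, size=(n_nonhot, dim))])
y = np.concatenate([np.ones(n_hot), np.zeros(n_nonhot)])

# Cross-validation matters with fewer than 1000 examples and 1280 features.
clf = LogisticRegression(max_iter=2000)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
print(f"mean 5-fold ROC AUC: {auc:.3f}")
```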

Language: English

Machine learning for functional protein design
Pascal Notin, Nathan Rollins, Yarin Gal

et al.

Nature Biotechnology, Journal Year: 2024, Volume and Issue: 42(2), P. 216 - 228

Published: Feb. 1, 2024

Language: English

Citations: 94

Simulating 500 million years of evolution with a language model
Thomas Hayes, Roshan Rao, Halil Akin

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: July 2, 2024

Abstract More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained on tokens generated by evolution can act as evolutionary simulators to generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative model that reasons over sequence, structure, and function. ESM3 can follow complex prompts combining its modalities and is highly responsive to biological alignment. We prompted ESM3 to generate fluorescent proteins with a chain of thought. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% identity) from known fluorescent proteins. Similarly distant natural fluorescent proteins are separated by over five hundred million years of evolution.
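
ESM3 generates sequences by iteratively filling in masked tokens. The sketch below is a hypothetical, heavily simplified version of that idea: a randomly initialized toy network stands in for ESM3, the prompt pins individual residues, and the confidence-based unmasking schedule is an assumption rather than ESM3's actual sampling procedure.

```python
import torch

AA = "ACDEFGHIKLMNPQRSTVWY"
MASK = len(AA)  # id of an extra mask token

torch.manual_seed(0)
# Toy stand-in for the generative model: a randomly initialized embedding
# plus linear head producing per-position logits over the 20 amino acids.
embed = torch.nn.Embedding(len(AA) + 1, 32)
head = torch.nn.Linear(32, len(AA))

def generate(prompt: dict, length: int, steps: int = 8) -> str:
    """Iterative masked decoding: start fully masked except for the
    prompted positions, then unmask the most confident positions each step."""
    tokens = torch.full((length,), MASK)
    for pos, aa in prompt.items():  # the prompt pins residues at positions
        tokens[pos] = AA.index(aa)
    for step in range(steps):
        masked = tokens == MASK
        if not masked.any():
            break
        logits = head(embed(tokens))                      # (length, 20)
        conf, choice = torch.softmax(logits, -1).max(-1)  # per-position best
        k = max(1, int(masked.sum().item() / (steps - step)))
        ranked = torch.argsort(
            torch.where(masked, conf, torch.full_like(conf, -1.0)),
            descending=True)[:k]
        tokens[ranked] = choice[ranked]
    return "".join(AA[int(t)] for t in tokens)

print(generate({0: "M", 10: "W"}, length=24))
```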

Language: English

Citations: 89

Simulating 500 million years of evolution with a language model
Thomas Hayes, Roshan Rao, Halil Akin

et al.

Science, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 16, 2025

More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained at scale on evolutionary data can generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative model that reasons over sequence, structure, and function. ESM3 can follow complex prompts combining its modalities and is highly responsive to alignment to improve fidelity. We prompted ESM3 to generate fluorescent proteins. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% sequence identity) from known fluorescent proteins, which we estimate is equivalent to simulating five hundred million years of evolution.
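
The 58% sequence identity quoted above is the kind of number produced by a global pairwise alignment. Below is a self-contained sketch of that computation using a bare-bones Needleman-Wunsch alignment; the unit match/mismatch/gap scores and the example sequences are illustrative simplifications (real protein comparisons typically use a substitution matrix such as BLOSUM62).

```python
def needleman_wunsch(a: str, b: str, match=1, mismatch=-1, gap=-1):
    """Global alignment with unit scores; returns aligned character pairs."""
    n, m = len(a), len(b)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i-1][j-1] + sub, S[i-1][j] + gap, S[i][j-1] + gap)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:  # trace back one optimal path
        sub = match if i > 0 and j > 0 and a[i-1] == b[j-1] else mismatch
        if i > 0 and j > 0 and S[i][j] == S[i-1][j-1] + sub:
            pairs.append((a[i-1], b[j-1])); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i-1][j] + gap:
            pairs.append((a[i-1], "-")); i -= 1
        else:
            pairs.append(("-", b[j-1])); j -= 1
    return pairs[::-1]

def percent_identity(a: str, b: str) -> float:
    """Identical aligned positions over alignment length, as a percentage."""
    pairs = needleman_wunsch(a, b)
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

print(f"{percent_identity('MKWVTFISLLFLFSSAYS', 'MKWVTFLSLLLLFSSAYS'):.1f}% identity")
```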

Language: English

Citations: 50

Sparks of function by de novo protein design
Alexander E. Chu, Tianyu Lu, Po‐Ssu Huang

et al.

Nature Biotechnology, Journal Year: 2024, Volume and Issue: 42(2), P. 203 - 215

Published: Feb. 1, 2024

Language: English

Citations: 33

Feature Reuse and Scaling: Understanding Transfer Learning with Protein Language Models
Francesca-Zhoufan Li, Ava P. Amini, Yisong Yue

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Feb. 8, 2024

Abstract Large pretrained protein language models (PLMs) have improved property and structure prediction from sequences via transfer learning, in which weights and representations from PLMs are repurposed for downstream tasks. Although they have shown great promise, currently there is little understanding of how the features learned by pretraining relate to and are useful for downstream tasks. We perform a systematic analysis of transfer learning using PLMs, conducting 370 experiments across a comprehensive suite of factors including different downstream tasks, architectures, model sizes, model depths, and pretraining time. We observe that while almost all downstream tasks do benefit from pretrained models compared to naive sequence representations, for the majority of tasks performance does not scale with pretraining, and instead relies on low-level features learned early in pretraining. Our results point to a mismatch between current PLM pretraining paradigms and most applications of these models, indicating a need for better pretraining methods.
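
The paper's layer-by-layer analysis can be pictured as a probing harness like the one below: fit a cheap linear probe on embeddings taken from different depths of the PLM and compare scores across layers. Everything here is a stand-in, the random features, the Ridge probe, and the layer indices; in a real run `layer_embeddings` would mean-pool hidden states from an actual pretrained model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_seqs, dim = 200, 64
y = rng.normal(size=n_seqs)  # some per-sequence property to predict

def layer_embeddings(layer: int) -> np.ndarray:
    """Placeholder: in practice, run the PLM and mean-pool the hidden
    states of the requested layer. Random features keep this runnable
    without model weights."""
    return rng.normal(size=(n_seqs, dim))

# A linear probe per layer shows where task-relevant features live.
for layer in (1, 6, 12, 24):
    r2 = cross_val_score(Ridge(alpha=1.0), layer_embeddings(layer), y,
                         cv=5, scoring="r2").mean()
    print(f"layer {layer:2d}: mean CV R^2 = {r2:+.3f}")
```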

Language: English

Citations: 26

Enhancing efficiency of protein language models with minimal wet-lab data through few-shot learning
Ziyi Zhou, Liang Zhang, Yuanxi Yu

et al.

Nature Communications, Journal Year: 2024, Volume and Issue: 15(1)

Published: July 2, 2024

Accurately modeling protein fitness landscapes holds great importance for protein engineering. Pre-trained protein language models have achieved state-of-the-art performance in predicting protein fitness without wet-lab experimental data, but their accuracy and interpretability remain limited. On the other hand, traditional supervised deep learning models require abundant labeled training examples for performance improvements, posing a practical barrier. In this work, we introduce FSFP, a training strategy that can effectively optimize protein language models under extreme data scarcity for fitness prediction. By combining meta-transfer learning, learning to rank, and parameter-efficient fine-tuning, FSFP can significantly boost the performance of various protein language models using merely tens of labeled single-site mutants from the target protein. In silico benchmarks across 87 deep mutational scanning datasets demonstrate FSFP's superiority over both unsupervised and supervised baselines. Furthermore, we successfully apply FSFP to engineer the Phi29 DNA polymerase through wet-lab experiments, achieving a 25% increase in the positive rate. These results underscore the potential of our approach in aiding AI-guided protein engineering.
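
Two of FSFP's three ingredients, parameter-efficient fine-tuning and learning to rank, can be sketched in a few lines: freeze a scoring backbone and train only a low-rank (LoRA-style) correction with a pairwise ranking loss on a few dozen labeled mutants. The backbone, features, and hyperparameters below are placeholders, and the meta-transfer learning stage is omitted entirely.

```python
import torch

torch.manual_seed(0)
dim, rank = 128, 4

# Frozen stand-in for a pretrained scoring model; only the LoRA branch trains.
backbone = torch.nn.Linear(dim, 1)
for p in backbone.parameters():
    p.requires_grad_(False)
lora_down = torch.nn.Linear(dim, rank, bias=False)
lora_up = torch.nn.Linear(rank, 1, bias=False)
torch.nn.init.zeros_(lora_up.weight)  # start as the unmodified backbone

def score(x: torch.Tensor) -> torch.Tensor:
    return (backbone(x) + lora_up(lora_down(x))).squeeze(-1)

# Tens of labeled single-site mutants: placeholder features and labels.
X = torch.randn(40, dim)
fitness = torch.randn(40)

params = list(lora_down.parameters()) + list(lora_up.parameters())
opt = torch.optim.Adam(params, lr=1e-2)
for _ in range(200):
    s = score(X)
    # Pairwise logistic ranking loss: the higher-fitness mutant of each
    # pair should receive the higher score (learning to rank).
    diff_s = s[:, None] - s[None, :]
    diff_y = fitness[:, None] - fitness[None, :]
    loss = torch.nn.functional.softplus(-diff_s[diff_y > 0]).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final ranking loss: {loss.item():.3f}")
```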

Language: English

Citations: 24

PTM-Mamba: a PTM-aware protein language model with bidirectional gated Mamba blocks
Peng Fei, Chentong Wang, Tong Chen

et al.

Nature Methods, Journal Year: 2025, Volume and Issue: unknown

Published: April 10, 2025

Language: English

Citations: 2

Generative artificial intelligence for de novo protein design
Adam Winnifrith, Carlos Outeiral, Brian Hie

et al.

Current Opinion in Structural Biology, Journal Year: 2024, Volume and Issue: 86, P. 102794 - 102794

Published: April 24, 2024

Engineering new molecules with desirable functions and properties has the potential to extend our ability to engineer proteins beyond what nature has so far evolved. Advances in the so-called 'de novo' design problem have recently been brought forward by developments in artificial intelligence. Generative architectures, such as language models and diffusion processes, seem adept at generating novel, yet realistic proteins that display desirable properties and perform specified functions. State-of-the-art design protocols now achieve experimental success rates nearing 20%, thus widening access to de novo designed proteins. Despite extensive progress, there are clear field-wide challenges, for example, in determining the best in silico metrics to prioritise designs for testing, and in designing proteins that can undergo large conformational changes or be regulated by post-translational modifications. With an increase in the number of models being developed, this review provides a framework to understand how these tools fit into the overall process of protein design. Throughout, we highlight the power of incorporating biochemical knowledge to improve performance and interpretability.

Language: English

Citations: 17

Zero-shot prediction of mutation effects with multimodal deep representation learning guides protein engineering
Peng Cheng, Cong Mao, Jin Tang

et al.

Cell Research, Journal Year: 2024, Volume and Issue: 34(9), P. 630 - 647

Published: July 5, 2024

Abstract Mutations in amino acid sequences can provoke changes in protein function. Accurate and unsupervised prediction of mutation effects is critical in biotechnology and biomedicine, but remains a fundamental challenge. To resolve this challenge, here we present the Protein Mutational Effect Predictor (ProMEP), a general and multiple-sequence-alignment-free method that enables zero-shot prediction of mutation effects. A multimodal deep representation learning model embedded in ProMEP was developed to comprehensively learn both sequence and structure contexts from ~160 million proteins. ProMEP achieves state-of-the-art performance in mutational effect prediction and accomplishes a tremendous improvement in speed, enabling efficient and intelligent protein engineering. Specifically, ProMEP accurately forecasts mutational consequences for the gene-editing enzymes TnpB and TadA, and successfully guides the development of high-performance gene-editing tools with their engineered variants. The editing efficiency of a 5-site TnpB mutant reaches up to 74.04% (vs 24.66% for the wild type); the base editing tool built on a 15-site TadA mutant (in addition to the A106V/D108N double mutation that renders deoxyadenosine deaminase activity to TadA) exhibits an A-to-G conversion frequency of up to 77.27% (vs 69.80% for ABE8e, a previous TadA-based adenine base editor) with significantly reduced bystander and off-target effects compared to ABE8e. ProMEP not only showcases superior performance in predicting mutational effects on proteins but also demonstrates a great capability to guide protein engineering. Therefore, exploration of the gigantic mutant space with ProMEP facilitates practical design of proteins, thereby advancing studies in biomedicine and synthetic biology.
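
The zero-shot scoring idea, comparing model likelihoods of the mutant and wild-type residue at a masked position, can be sketched as follows. The toy embedding-plus-linear network below is a placeholder for ProMEP (which is multimodal and also encodes structure), and the example sequence and mutation are invented for illustration.

```python
import torch

AA = "ACDEFGHIKLMNPQRSTVWY"
MASK = len(AA)  # id of an extra mask token

torch.manual_seed(0)
# Toy placeholder for a pretrained model: per-position amino-acid logits.
embed = torch.nn.Embedding(len(AA) + 1, 16)
head = torch.nn.Linear(16, len(AA))

@torch.no_grad()
def mutation_score(seq: str, pos: int, mut: str) -> float:
    """Zero-shot effect score: log P(mutant) - log P(wild type) at the
    mutated position, computed with that position masked."""
    tokens = torch.tensor([AA.index(a) for a in seq])
    tokens[pos] = MASK
    logp = torch.log_softmax(head(embed(tokens))[pos], dim=-1)
    return (logp[AA.index(mut)] - logp[AA.index(seq[pos])]).item()

seq = "MKTAYIAKQRQISFVKSHFSRQ"
# A4G: alanine at 1-indexed position 4 mutated to glycine.
print(f"A4G score: {mutation_score(seq, 3, 'G'):+.3f}")
```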

Language: English

Citations: 14

Multi-modal deep learning enables efficient and accurate annotation of enzymatic active sites
Xiaorui Wang, Xiaodan Yin, Dejun Jiang

et al.

Nature Communications, Journal Year: 2024, Volume and Issue: 15(1)

Published: Aug. 26, 2024

Annotating active sites in enzymes is crucial for advancing multiple fields including drug discovery, disease research, enzyme engineering, and synthetic biology. Despite the development of numerous automated annotation algorithms, a significant trade-off between speed and accuracy limits their large-scale practical applications. We introduce EasIFA, an active-site annotation algorithm that fuses latent enzyme representations from a protein language model and a 3D structural encoder, and then aligns protein-level information with the knowledge of enzymatic reactions using a multi-modal cross-attention framework. EasIFA outperforms BLASTp with a 10-fold speed increase and improved recall, precision, F1 score, and MCC by 7.57%, 13.08%, 9.68%, and 0.1012, respectively. It also surpasses empirical-rule-based algorithms and other state-of-the-art deep learning methods based on PSSM features, achieving speed increases ranging from 650 to 1400 times while enhancing annotation quality. This makes EasIFA a suitable replacement for conventional tools in both industrial and academic settings. EasIFA can effectively transfer knowledge gained from coarsely annotated enzyme databases to smaller, high-precision datasets, highlighting its ability to model sparse but high-quality databases. Additionally, EasIFA shows potential as a catalytic-site monitoring tool for designing enzymes with desired functions beyond their natural distribution. Wang et al. propose an efficient active-site annotation algorithm to advance these fields.
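
The fusion step this abstract describes, sequence-track features attending to structure-track features through cross-attention before a per-residue active-site head, can be sketched as below. The dimensions, the single attention block, and the random feature tensors are assumptions; EasIFA's additional alignment with enzymatic-reaction knowledge is omitted.

```python
import torch

torch.manual_seed(0)
n_res, d_seq, d_struct, d_model = 120, 1280, 128, 256

# Stand-ins for per-residue outputs of the two encoders.
seq_feats = torch.randn(1, n_res, d_seq)        # e.g., protein language model
struct_feats = torch.randn(1, n_res, d_struct)  # e.g., 3D structure encoder

# Project both modalities to a shared width, then let sequence features
# attend to structural features (multi-modal cross-attention).
proj_seq = torch.nn.Linear(d_seq, d_model)
proj_struct = torch.nn.Linear(d_struct, d_model)
cross_attn = torch.nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
site_head = torch.nn.Linear(d_model, 2)         # per residue: active site or not

q = proj_seq(seq_feats)
kv = proj_struct(struct_feats)
fused, _ = cross_attn(query=q, key=kv, value=kv)
probs = site_head(fused + q).softmax(dim=-1)[..., 1]  # residual, then head
print(probs.shape)  # torch.Size([1, 120]): active-site probability per residue
```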

Language: English

Citations: 12