Computational scoring and experimental evaluation of enzymes generated by neural networks
Sean R. Johnson, Xiaozhi Fu, Sandra Viknander

et al.

Nature Biotechnology, Journal year: 2024, Issue: unknown

Published: April 23, 2024

In recent years, generative protein sequence models have been developed to sample novel sequences. However, predicting whether generated proteins will fold and function remains challenging. We evaluate a set of 20 diverse computational metrics to assess the quality of enzyme sequences produced by three contrasting generative models: ancestral sequence reconstruction, a generative adversarial network and a protein language model. Focusing on two enzyme families, we expressed and purified over 500 natural and generated sequences with 70-90% identity to the most similar natural sequences to benchmark computational metrics for predicting in vitro activity. Over three rounds of experiments, we developed a computational filter that improved the rate of experimental success by 50-150%. The proposed metrics will drive protein engineering research by serving as a benchmark for generative protein sequence models and by helping to select active variants for experimental testing.
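The filtering idea above - score each generated sequence with several computational metrics and carry only the top-ranked candidates into the wet lab - can be sketched as follows. The metric names, values and weights here are illustrative placeholders, not the paper's actual set of 20 metrics:

```python
def composite_score(metrics: dict, weights: dict) -> float:
    """Weighted sum of per-sequence quality metrics (higher is better)."""
    return sum(weights[name] * metrics[name] for name in weights)

def filter_candidates(candidates, weights, keep_fraction=0.25):
    """Rank generated sequences by composite score and keep the top fraction."""
    ranked = sorted(candidates,
                    key=lambda c: composite_score(c["metrics"], weights),
                    reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]

# Hypothetical metric values for four generated enzyme variants:
# a language-model log-probability, a predicted-structure confidence,
# and identity to the closest natural sequence.
candidates = [
    {"id": "gen_01", "metrics": {"lm_logp": -1.2, "plddt": 0.91, "identity": 0.82}},
    {"id": "gen_02", "metrics": {"lm_logp": -3.5, "plddt": 0.60, "identity": 0.71}},
    {"id": "gen_03", "metrics": {"lm_logp": -0.8, "plddt": 0.95, "identity": 0.88}},
    {"id": "gen_04", "metrics": {"lm_logp": -2.9, "plddt": 0.72, "identity": 0.75}},
]
weights = {"lm_logp": 1.0, "plddt": 2.0, "identity": 1.0}
selected = filter_candidates(candidates, weights, keep_fraction=0.5)
print([c["id"] for c in selected])  # top half by composite score
```

In practice the weights (or a learned classifier over the metrics) would be fit against each round of experimental results, which is how a filter of this kind can improve hit rates across rounds.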

Language: English

ProtGPT2 is a deep unsupervised language model for protein design
Noelia Ferruz, Steffen Schmidt, Birte Höcker

et al.

Nature Communications, Journal year: 2022, Issue: 13(1)

Published: July 27, 2022

Protein design aims to build novel proteins customized for specific purposes, thereby holding the potential to tackle many environmental and biomedical problems. Recent progress in Transformer-based architectures has enabled the implementation of language models capable of generating text with human-like capabilities. Here, motivated by this success, we describe ProtGPT2, a language model trained on the protein space that generates de novo protein sequences following the principles of natural ones. The generated proteins display natural amino acid propensities, while disorder predictions indicate that 88% of ProtGPT2-generated proteins are globular, in line with natural sequences. Sensitive sequence searches in protein databases show that ProtGPT2 sequences are distantly related to natural ones, and similarity networks further demonstrate that ProtGPT2 is sampling unexplored regions of protein space. AlphaFold prediction of ProtGPT2 sequences yields well-folded, non-idealized structures with embodiments of large loops and reveals topologies not captured in current structure databases. ProtGPT2 generates sequences in a matter of seconds and is freely available.
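ProtGPT2 generates sequences autoregressively: each residue is sampled from the model's predicted distribution over the next token, conditioned on everything generated so far. A minimal sketch of that sampling loop, with a uniform toy distribution standing in for the trained network (the real model must be loaded from its published weights):

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_conditional(prefix):
    """Stand-in for the model's next-residue distribution.
    A real LM would condition on the full prefix; here we return
    uniform probabilities purely for illustration."""
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def generate(length, temperature=1.0, seed=0):
    """Autoregressive sampling: draw one residue at a time."""
    rng = random.Random(seed)
    seq = ""
    for _ in range(length):
        probs = toy_conditional(seq)
        # Temperature < 1 sharpens the distribution, > 1 flattens it.
        scaled = {aa: p ** (1.0 / temperature) for aa, p in probs.items()}
        total = sum(scaled.values())
        r = rng.random() * total
        acc = 0.0
        for aa, w in scaled.items():
            acc += w
            if r <= acc:
                seq += aa
                break
    return seq

seq = generate(50, temperature=1.0, seed=42)
print(len(seq))  # 50 residues, all from the 20-letter amino acid alphabet
```

Swapping `toy_conditional` for a trained Transformer's softmax output gives exactly the generation procedure this class of model uses; sampling hyperparameters such as temperature control the diversity/novelty of the designs.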

Language: English

Cited by

437

DeepLoc 2.0: multi-label subcellular localization prediction using protein language models

Vineet Thumuluri,

José Juan Almagro Armenteros, Alexander Rosenberg Johansen

et al.

Nucleic Acids Research, Journal year: 2022, Issue: 50(W1), pp. W228-W234

Published: April 19, 2022

The prediction of protein subcellular localization is of great relevance for proteomics research. Here, we propose an update to the popular tool DeepLoc with multi-localization prediction and improvements in both performance and interpretability. For training and validation, we curate eukaryotic and human multi-location protein datasets with stringent homology partitioning, enriched with sorting signal information compiled from the literature. We achieve state-of-the-art performance in DeepLoc 2.0 by using a pre-trained protein language model. It has the further advantage of using sequence input rather than relying on slower protein profiles. We provide two means of better interpretability: an attention output along the sequence and highly accurate prediction of nine different types of protein sorting signals. We find that the attention output correlates well with the position of sorting signals. The webserver is available at services.healthtech.dtu.dk/service.php?DeepLoc-2.0.

Language: English

Cited by

429

Genome-wide prediction of disease variant effects with a deep protein language model
Nadav Brandes,

Grant Goldman,

Charlotte H. Wang

et al.

Nature Genetics, Journal year: 2023, Issue: 55(9), pp. 1512-1522

Published: August 10, 2023

Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and in predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in specific protein isoforms, demonstrating the importance of considering all isoforms when predicting variant effects. Our approach also generalizes to more complex coding variants such as in-frame indels and stop-gains. Together, these results establish ESM1b as an effective, accurate and general approach to predicting variant effects.
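Variant effect scores in ESM1b-style workflows are commonly computed as a log-likelihood ratio: mask the mutated position, then compare the model's probability for the alternate residue against the reference residue. A minimal sketch, with a hypothetical per-position probability table standing in for the language model's output:

```python
import math

def variant_effect_score(position_probs, ref, alt):
    """Log-likelihood ratio log P(alt) - log P(ref) at a masked position.
    Strongly negative scores suggest the substitution is disfavored
    by the model, i.e. likely damaging."""
    return math.log(position_probs[alt]) - math.log(position_probs[ref])

# Hypothetical masked-LM output at one position: the model strongly
# prefers the reference glycine over a proline substitution.
probs = {"G": 0.70, "A": 0.15, "P": 0.01, "S": 0.14}
score = variant_effect_score(probs, ref="G", alt="P")
print(round(score, 3))  # large negative value -> predicted damaging
```

Running this over every position and every alternate residue of every protein isoform is what yields the ~450 million precomputed scores; a threshold on the score then separates putatively pathogenic from benign substitutions.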

Language: English

Cited by

222

BepiPred‐3.0: Improved B‐cell epitope prediction using protein language models
Joakim Nøddeskov Clifford, Magnus Haraldson Høie, Sebastian Deleuran

et al.

Protein Science, Journal year: 2022, Issue: 31(12)

Published: November 11, 2022

B-cell epitope prediction tools are of great medical and commercial interest due to their practical applications in vaccine development and disease diagnostics. The introduction of protein language models (LMs), trained on unprecedentedly large datasets of protein sequences and structures, taps into a powerful numeric representation that can be exploited to accurately predict local and global protein structural features from amino acid sequence only. In this paper, we present BepiPred-3.0, a sequence-based epitope prediction tool that, by exploiting LM embeddings, greatly improves prediction accuracy for both linear and conformational epitopes on several independent test sets. Furthermore, by carefully selecting additional input variables and the epitope residue annotation strategy, performance was further improved, thus achieving greater predictive power. Our tool can predict epitopes across hundreds of sequences in minutes. It is freely available as a web server and a standalone package at https://services.healthtech.dtu.dk/service.php?BepiPred-3.0 with a user-friendly interface to navigate the results.

Language: English

Cited by

124

ProteInfer, deep neural networks for protein functional inference
Theo Sanderson, Maxwell L. Bileschi, David Belanger

et al.

eLife, Journal year: 2023, Issue: 12

Published: February 27, 2023

Predicting the function of a protein from its amino acid sequence is a long-standing challenge in bioinformatics. Traditional approaches use sequence alignment to compare a query either to thousands of models of protein families or to large databases of individual protein sequences. Here we introduce ProteInfer, which instead employs deep convolutional neural networks to directly predict a variety of protein functions – Enzyme Commission (EC) numbers and Gene Ontology (GO) terms – from an unaligned amino acid sequence. This approach provides precise predictions that complement alignment-based methods, and the computational efficiency of a single neural network permits novel and lightweight software interfaces, which we demonstrate with an in-browser graphical interface for protein function prediction in which all computation is performed on the user’s personal computer with no data uploaded to remote servers. Moreover, these models place full-length amino acid sequences into a generalised functional space, facilitating downstream analysis and interpretation. To read an interactive version of this paper, please visit https://google-research.github.io/proteinfer/ .
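A key property of ProteInfer's alignment-free design is that a convolutional network plus pooling maps an unaligned sequence of any length to a fixed-size embedding, from which multi-label function predictions are made. The sketch below illustrates only that shape-handling idea; one-hot encoding and mean pooling stand in for the trained convolutional layers:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq):
    """Encode an unaligned amino acid sequence as an L x 20 matrix."""
    mat = [[0.0] * len(AMINO_ACIDS) for _ in seq]
    for i, aa in enumerate(seq):
        mat[i][AA_INDEX[aa]] = 1.0
    return mat

def mean_pool(mat):
    """Collapse variable-length input to a fixed-size vector, so
    sequences of any length land in the same feature space. In the
    real model, convolutional layers transform the input first."""
    n = len(mat)
    return [sum(row[j] for row in mat) / n for j in range(len(mat[0]))]

emb = mean_pool(one_hot("MKTAYIAKQR"))
print(len(emb))  # fixed 20-dim embedding regardless of sequence length
```

A final per-label sigmoid over such an embedding is what makes multi-label outputs (many EC numbers and GO terms per protein) natural in this architecture, in contrast to one-alignment-per-family approaches.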

Language: English

Cited by

111

Transformer-based deep learning for predicting protein properties in the life sciences
Abel Chandra, Laura Tünnermann, Tommy Löfstedt

et al.

eLife, Journal year: 2023, Issue: 12

Published: January 18, 2023

Recent developments in deep learning, coupled with an increasing number of sequenced proteins, have led to a breakthrough in life science applications, in particular in protein property prediction. There is hope that deep learning can close the gap between proteins and their known properties based on lab experiments. Language models from the field of natural language processing have gained popularity for protein property predictions and have led to a new computational revolution in biology, where old prediction results are being improved regularly. Such models can learn useful multipurpose representations of proteins from large open repositories of protein sequences and can be used, for instance, to predict protein properties. The field is growing quickly, largely because of one class of model: the Transformer model. We review recent developments and the use of large-scale language models for protein property prediction in the life sciences, and how such models can be used to predict, for example, post-translational modifications. We review shortcomings of other deep learning models and explain how the Transformer models have proven to be a very promising way to unravel the information hidden in the sequences of amino acids.

Language: English

Cited by

99

Machine learning for functional protein design
Pascal Notin, Nathan Rollins, Yarin Gal

et al.

Nature Biotechnology, Journal year: 2024, Issue: 42(2), pp. 216-228

Published: February 1, 2024

Language: English

Cited by

98

Applications of transformer-based language models in bioinformatics: a survey
Shuang Zhang, Rui Fan, Yuti Liu

et al.

Bioinformatics Advances, Journal year: 2023, Issue: 3(1)

Published: January 1, 2023

The transformer-based language models, including vanilla transformer, BERT and GPT-3, have achieved revolutionary breakthroughs in the field of natural language processing (NLP). Since there are inherent similarities between various biological sequences and natural languages, the remarkable interpretability and adaptability of these models have prompted a new wave of their application in bioinformatics research. To provide a timely and comprehensive review, we introduce key developments of transformer-based language models by describing the detailed structure of transformers and summarize their contribution to a wide range of bioinformatics research, from basic sequence analysis to drug discovery. While the applications are diverse and multifaceted, we identify and discuss the common challenges, including heterogeneity of training data, computational expense and model interpretability, and the opportunities in the context of bioinformatics research. We hope that the broader community of NLP researchers, bioinformaticians and biologists will be brought together to foster future research and development in transformer-based language models, and inspire novel bioinformatics applications that are unattainable by traditional methods. Supplementary information and data are available at Bioinformatics Advances online.

Language: English

Cited by

96

Pre-trained Language Models in Biomedical Domain: A Systematic Survey
Benyou Wang, Qianqian Xie, Jiahuan Pei

et al.

ACM Computing Surveys, Journal year: 2023, Issue: 56(3), pp. 1-52

Published: August 1, 2023

Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing tasks. This also benefits the biomedical domain: researchers from the informatics, medicine, and computer science communities propose various PLMs trained on biomedical datasets, e.g., biomedical text, electronic health records, protein, and DNA sequences. However, the cross-discipline characteristics of biomedical PLMs hinder their spreading among communities; some existing works are isolated from each other without comprehensive comparison and discussion. It is nontrivial to make a survey that not only systematically reviews recent advances in biomedical PLMs and their applications but also standardizes terminology and benchmarks. This article summarizes the recent progress of pre-trained language models in the biomedical domain and their applications in downstream biomedical tasks. Particularly, we discuss the motivations of PLMs in the biomedical domain and introduce the key concepts of pre-trained language models. We then propose a taxonomy of existing biomedical PLMs that categorizes them from various perspectives systematically. Plus, their applications in biomedical downstream tasks are exhaustively discussed. Last, we illustrate various limitations and future trends, which aims to provide inspiration for future research.

Language: English

Cited by

94

Progress and challenges for the machine learning-based design of fit-for-purpose monoclonal antibodies
Rahmad Akbar, Habib Bashour, Puneet Rawat

et al.

mAbs, Journal year: 2022, Issue: 14(1)

Published: March 16, 2022

Although the therapeutic efficacy and commercial success of monoclonal antibodies (mAbs) are tremendous, the design and discovery of new candidates remain a time- and cost-intensive endeavor. In this regard, progress in the generation of data describing antigen binding and developability, computational methodology, and artificial intelligence may pave the way for a new era of in silico on-demand immunotherapeutics design and discovery. Here, we argue that the main necessary machine learning (ML) components for an mAb sequence generator are: understanding of the rules of mAb-antigen binding, the capacity to modularly combine mAb design parameters, and algorithms for unconstrained parameter-driven in silico mAb sequence synthesis. We review the current progress toward the realization of these necessary components and discuss the challenges that must be overcome to allow the ML-based design of fit-for-purpose mAb therapeutic candidates.

Language: English

Cited by

86