Machine learning in RNA structure prediction: Advances and challenges DOI
Sicheng Zhang, Jun Li, Shi‐Jie Chen

et al.

Biophysical Journal, Journal Year: 2024, Volume and Issue: 123(17), P. 2647 - 2657

Published: Jan. 30, 2024

Language: English

ProteinBERT: a universal deep-learning model of protein sequence and function DOI Creative Commons
Nadav Brandes, Dan Ofer, Yam Peleg

et al.

Bioinformatics, Journal Year: 2022, Volume and Issue: 38(8), P. 2102 - 2110

Published: Jan. 8, 2022

Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed for biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model designed specifically for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.

Language: English

Citations

510

Genome-wide prediction of disease variant effects with a deep protein language model DOI Creative Commons
Nadav Brandes, Grant Goldman, Charlotte H. Wang

et al.

Nature Genetics, Journal Year: 2023, Volume and Issue: 55(9), P. 1512 - 1522

Published: Aug. 10, 2023

Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to their dependency on close homologs or to software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and in predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in specific protein isoforms, demonstrating the importance of considering all isoforms when predicting variant effects. Our approach also generalizes to more complex coding variants such as in-frame indels and stop-gains. Together, these results establish ESM1b as an effective, accurate and general predictor of variant effects.
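The scoring idea behind this kind of protein-language-model variant effect prediction can be sketched as a log-likelihood ratio between the mutant and wild-type residue under the model's predicted distribution at the masked position. The probability vector below is a toy stand-in, not real ESM1b output, and `variant_effect_score` is a hypothetical helper:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def variant_effect_score(masked_probs: np.ndarray, wt_aa: str, mut_aa: str) -> float:
    """Log-likelihood ratio of mutant vs. wild-type amino acid at a masked
    position; strongly negative scores suggest the substitution is damaging."""
    return float(np.log(masked_probs[AA_INDEX[mut_aa]])
                 - np.log(masked_probs[AA_INDEX[wt_aa]]))

# Toy distribution standing in for a pLM's output at one masked position:
# the model strongly favors the wild-type residue 'L'.
probs = np.full(20, 0.01)
probs[AA_INDEX["L"]] = 0.81
score = variant_effect_score(probs, wt_aa="L", mut_aa="P")  # negative -> likely damaging
```

A synonymous "substitution" (wild type to itself) scores exactly zero under this definition, which makes the scale easy to interpret.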

Language: English

Citations

221

Rhea, the reaction knowledgebase in 2022 DOI Creative Commons
Parit Bansal, Anne Morgat, Kristian B. Axelsen

et al.

Nucleic Acids Research, Journal Year: 2021, Volume and Issue: 50(D1), P. D693 - D700

Published: Nov. 9, 2021

Rhea (https://www.rhea-db.org) is an expert-curated knowledgebase of biochemical reactions based on the chemical ontology ChEBI (Chemical Entities of Biological Interest) (https://www.ebi.ac.uk/chebi). In this paper, we describe a number of key developments in Rhea since our last report in the database issue of Nucleic Acids Research in 2019. These include improved reaction coverage in Rhea, the adoption of Rhea as the reference vocabulary for enzyme annotation in the UniProt knowledgebase UniProtKB (https://www.uniprot.org), the development of a new website, and the designation of Rhea as an ELIXIR Core Data Resource. We hope that these and other developments will enhance the utility of the resource to all those who study and engineer enzymes and the metabolic systems in which they function.

Language: English

Citations

161

Embeddings from protein language models predict conservation and variant effects DOI Creative Commons
Céline Marquet, Michael Heinzinger, Tobias Olenyi

et al.

Human Genetics, Journal Year: 2021, Volume and Issue: 141(10), P. 1629 - 1647

Published: Dec. 30, 2021

The emergence of SARS-CoV-2 variants stressed the demand for tools that interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) datasets continue to expand our understanding of the mutational landscape of single proteins, their results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences; these methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation from single sequences almost as accurately as ConSeq using MSAs (two-state Matthews Correlation Coefficient (MCC) of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions on a standard set of 39 DMS experiments to those of other methods (incl. ESM-1v, DeepSequence and GEMME) revealed our approach to be competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20 k proteins) within 40 min on one Nvidia Quadro RTX 8000 GPU. All methods and data are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
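A logistic-regression ensemble over these three feature types (a conservation prediction, a BLOSUM62 substitution score, and a mask reconstruction probability) can be sketched as below. The weights and bias are invented for illustration only; they are not the fitted coefficients of the published VESPA ensemble:

```python
import math

def lr_ensemble_score(conservation: float, blosum62: float, recon_prob: float,
                      weights=(2.0, -0.3, -4.0), bias=0.0) -> float:
    """Logistic combination of three SAV features into a probability-like
    effect score in (0, 1). Weights are illustrative placeholders, not
    trained values: high conservation raises the score, a favorable
    BLOSUM62 score and a high reconstruction probability lower it."""
    z = bias + weights[0] * conservation + weights[1] * blosum62 + weights[2] * recon_prob
    return 1.0 / (1.0 + math.exp(-z))

# A substitution at a highly conserved position that the pLM considers unlikely...
likely_effect = lr_ensemble_score(conservation=0.9, blosum62=-3.0, recon_prob=0.02)
# ...vs. a conservative substitution at a variable position.
likely_neutral = lr_ensemble_score(conservation=0.1, blosum62=1.0, recon_prob=0.6)
```

The point of the sketch is the simplicity the abstract emphasizes: once the pLM features exist, the variant scorer itself is a three-feature linear model.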

Language: English

Citations

110

Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction DOI Creative Commons
Konstantin Weißenow, Michael Heinzinger, Burkhard Rost

et al.

Structure, Journal Year: 2022, Volume and Issue: 30(8), P. 1169 - 1177.e4

Published: May 23, 2022

Language: English

Citations

102

Transformer-based deep learning for predicting protein properties in the life sciences DOI Creative Commons
Abel Chandra, Laura Tünnermann, Tommy Löfstedt

et al.

eLife, Journal Year: 2023, Volume and Issue: 12

Published: Jan. 18, 2023

Recent developments in deep learning, coupled with an increasing number of sequenced proteins, have led to a breakthrough in life science applications, in particular in protein property prediction. There is hope that deep learning can close the gap between the number of sequenced proteins and the number of proteins whose properties are known from lab experiments. Language models from the field of natural language processing have gained popularity for protein property predictions and have led to a new computational revolution in biology, where old prediction results are being improved regularly. Such models can learn useful multipurpose representations of proteins from large open repositories of protein sequences and can be used, for instance, to predict protein properties. The field is growing quickly because of a particular class of model, the Transformer. We review the recent use of large-scale Transformer models in applications for predicting protein characteristics and how such models can be used to predict, for example, post-translational modifications. We discuss the shortcomings of other deep learning models and explain how Transformers have proven a very promising way to unravel the information hidden in sequences of amino acids.

Language: English

Citations

98

Applications of transformer-based language models in bioinformatics: a survey DOI Creative Commons
Shuang Zhang, Rui Fan, Yuti Liu

et al.

Bioinformatics Advances, Journal Year: 2023, Volume and Issue: 3(1)

Published: Jan. 1, 2023

The transformer-based language models, including the vanilla transformer, BERT and GPT-3, have achieved revolutionary breakthroughs in the field of natural language processing (NLP). Since there are inherent similarities between various biological sequences and natural languages, the remarkable interpretability and adaptability of these models have prompted a new wave of their application in bioinformatics research. To provide a timely and comprehensive review, we introduce key developments of transformer-based language models by describing the detailed structure of transformers, and summarize their contribution to a wide range of bioinformatics research, from basic sequence analysis to drug discovery. While these applications are diverse and multifaceted, we identify and discuss common challenges, including the heterogeneity of training data, computational expense and model interpretability, as well as opportunities in the context of bioinformatics research. We hope that the broader community of NLP researchers, bioinformaticians and biologists will be brought together to foster future research and development, and to inspire novel bioinformatics applications that are unattainable by traditional methods. Supplementary data are available at Bioinformatics Advances online.

Language: English

Citations

95

Progress and challenges for the machine learning-based design of fit-for-purpose monoclonal antibodies DOI Creative Commons
Rahmad Akbar, Habib Bashour, Puneet Rawat

et al.

mAbs, Journal Year: 2022, Volume and Issue: 14(1)

Published: March 16, 2022

Although the therapeutic efficacy and commercial success of monoclonal antibodies (mAbs) are tremendous, the design and discovery of new candidates remain a time- and cost-intensive endeavor. In this regard, progress in the generation of data describing antigen binding and developability, in computational methodology, and in artificial intelligence may pave the way for a new era of in silico, on-demand immunotherapeutics design and discovery. Here, we argue that the main necessary machine learning (ML) components for an in silico mAb sequence generator are: understanding of the rules of mAb-antigen binding, the capacity to modularly combine mAb design parameters, and algorithms for unconstrained parameter-driven in silico mAb sequence synthesis. We review the current progress toward the realization of these necessary components and discuss the challenges that must be overcome to allow the ML-based design of fit-for-purpose mAb therapeutic candidates.

Language: English

Citations

83

Contrastive learning on protein embeddings enlightens midnight zone DOI Creative Commons
Michael Heinzinger, Maria Littmann, Ian Sillitoe

et al.

NAR Genomics and Bioinformatics, Journal Year: 2022, Volume and Issue: 4(2)

Published: March 31, 2022

Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, has an improved ability to recognize distant homologous relationships compared with more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have seemingly random pairwise sequence similarity. The novelty of this work lies in the particular combination of tools and sampling techniques that ascertained performance comparable to, or better than, existing state-of-the-art methods. Additionally, since the method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.

Language: English

Citations

78

From sequence to function through structure: Deep learning for protein design DOI Creative Commons
Noelia Ferruz, Michael Heinzinger, Mehmet Akdel

et al.

Computational and Structural Biotechnology Journal, Journal Year: 2022, Volume and Issue: 21, P. 238 - 250

Published: Nov. 19, 2022

The process of designing biomolecules, in particular proteins, is witnessing a rapid change in available tooling and approaches, moving from design through physicochemical force fields to fast, end-to-end differentiable statistical models that produce plausible, complex sequences. To achieve conditional and controllable protein design, researchers at the interface of artificial intelligence and biology leverage advances in natural language processing (NLP) and computer vision techniques, coupled with advances in computing hardware, to learn patterns from growing biological databases, from curated annotations thereof, or from both. Once learned, these patterns can be used to provide novel insights into the mechanistic workings of biomolecules. However, navigating and understanding the practical applications of the many recent tools is complex. To facilitate this, we 1) document recent advances in deep learning (DL)-assisted protein design over the last three years, and 2) present a practical pipeline that allows going from de novo-generated sequences to their predicted properties.

Language: English

Citations

78