Biophysical Journal, Journal Year: 2024, Volume and Issue: 123(17), P. 2647 - 2657
Published: Jan. 30, 2024
Language: English
Bioinformatics, Journal Year: 2022, Volume and Issue: 38(8), P. 2102 - 2110
Published: Jan. 8, 2022
Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed for biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model designed specifically for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.
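The dual-task pretraining objective described above (masked language modeling over residues plus GO annotation prediction per protein) can be sketched as a combined loss. The shapes, function names and the `go_weight` factor below are illustrative assumptions for this sketch, not ProteinBERT's actual implementation:

```python
import numpy as np

def masked_lm_loss(lm_logits, tokens, mask):
    """Cross-entropy over masked positions only (the local, per-residue output).
    lm_logits: (L, V) logits over the amino-acid vocabulary, tokens: (L,) true ids,
    mask: (L,) bool, True where the input token was corrupted/masked."""
    shifted = lm_logits - lm_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(tokens)), tokens]
    return nll[mask].mean()

def go_annotation_loss(go_logits, go_labels):
    """Multi-label binary cross-entropy over GO terms (the global, per-protein output)."""
    p = 1.0 / (1.0 + np.exp(-go_logits))
    eps = 1e-12
    return -(go_labels * np.log(p + eps) + (1.0 - go_labels) * np.log(1.0 - p + eps)).mean()

def pretraining_loss(lm_logits, tokens, mask, go_logits, go_labels, go_weight=1.0):
    """Combined dual-task objective: language modeling plus GO annotation prediction."""
    return masked_lm_loss(lm_logits, tokens, mask) + go_weight * go_annotation_loss(go_logits, go_labels)
```

The key design point is that the two terms train the two representation streams together: the per-residue (local) stream is supervised by token reconstruction, the per-protein (global) stream by annotation prediction.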
Language: English
Citations: 510
Nature Genetics, Journal Year: 2023, Volume and Issue: 55(9), P. 1512 - 1522
Published: Aug. 10, 2023
Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and in predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in specific protein isoforms, demonstrating the importance of considering all isoforms when predicting variant effects. Our approach also generalizes to more complex coding variants such as in-frame indels and stop-gains. Together, these results establish ESM1b as an effective, accurate and general approach to predicting variant effects.
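A common way protein language models of this kind score a missense variant is the log-odds ratio between the alternate and reference amino acids predicted at the masked position. The sketch below illustrates that scoring rule only; `prob_fn` is an assumed stand-in for a real ESM1b forward pass, not its API:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def variant_effect_score(prob_fn, sequence, pos, ref, alt):
    """Masked-marginal variant effect score: log p(alt) - log p(ref) at the masked
    position; more negative means predicted more damaging. `prob_fn(masked_seq, pos)`
    stands in for a pLM forward pass (an assumption of this sketch) and must return
    a dict mapping each amino acid to its predicted probability at `pos`."""
    if sequence[pos] != ref:
        raise ValueError("reference amino acid does not match the sequence")
    masked = sequence[:pos] + "?" + sequence[pos + 1:]  # "?" marks the masked residue
    probs = prob_fn(masked, pos)
    return float(np.log(probs[alt]) - np.log(probs[ref]))
```

Because the scoring needs only the model's own token probabilities, it requires no multiple sequence alignment or close homologs, which is what lets such a workflow cover every possible missense variant.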
Language: English
Citations: 221
Nucleic Acids Research, Journal Year: 2021, Volume and Issue: 50(D1), P. D693 - D700
Published: Nov. 9, 2021
Abstract Rhea (https://www.rhea-db.org) is an expert-curated knowledgebase of biochemical reactions based on the chemical ontology ChEBI (Chemical Entities of Biological Interest) (https://www.ebi.ac.uk/chebi). In this paper, we describe a number of key developments in Rhea since our last report in the database issue of Nucleic Acids Research in 2019. These include improved reaction coverage in Rhea, the adoption of Rhea as the reference vocabulary for enzyme annotation in the UniProt knowledgebase UniProtKB (https://www.uniprot.org), the development of a new website, and the designation of Rhea as an ELIXIR Core Data Resource. We hope that these and other developments will enhance the utility of the resource to study and engineer enzymes and the metabolic systems in which they function.
Language: English
Citations: 161
Human Genetics, Journal Year: 2021, Volume and Issue: 141(10), P. 1629 - 1647
Published: Dec. 30, 2021
The emergence of SARS-CoV-2 variants stressed the demand for tools allowing to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation from single sequences almost as accurately as ConSeq using MSAs (two-state Matthews Correlation Coefficient (MCC) for ProtT5 embeddings: 0.596 ± 0.006 vs. 0.608 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions on a standard set of 39 DMS experiments to those of other methods (incl. ESM-1v, DeepSequence and GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20 k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
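The three-feature logistic regression this abstract describes (conservation prediction, BLOSUM62 score, mask reconstruction probability) can be sketched as below. The weights are placeholders chosen for illustration, not the published VESPA parameters:

```python
import numpy as np

def vespa_features(conservation, blosum62_score, recon_prob_alt):
    """One feature vector per SAV: predicted conservation of the wild-type residue,
    the BLOSUM62 substitution score of the exchange, and the pLM's mask-reconstruction
    probability for the variant amino acid."""
    return np.array([conservation, blosum62_score, recon_prob_alt], dtype=float)

def lr_effect_score(features, weights, bias=0.0):
    """Simplistic logistic regression: P(SAV has an effect). The weights passed in
    are illustrative placeholders, not trained parameters."""
    return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))
```

With any reasonable weighting, a variant at a conserved position, with a radical (negative BLOSUM62) exchange and a low reconstruction probability, receives a higher effect score than a variant at a variable position with a conservative exchange.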
Language: English
Citations: 110
Structure, Journal Year: 2022, Volume and Issue: 30(8), P. 1169 - 1177.e4
Published: May 23, 2022
Language: English
Citations: 102
eLife, Journal Year: 2023, Volume and Issue: 12
Published: Jan. 18, 2023
Recent developments in deep learning, coupled with an increasing number of sequenced proteins, have led to a breakthrough in life science applications, in particular in protein property prediction. There is hope that deep learning can close the gap between the number of sequenced proteins and the number of proteins with properties known from lab experiments. Language models from the field of natural language processing have gained popularity for protein property predictions and have led to a new computational revolution in biology, where old prediction results are being improved regularly. Such models can learn useful multipurpose representations of proteins from large open repositories of protein sequences and can be used, for instance, to predict protein properties. The field is growing quickly because of a particular class of model: the Transformer. We review the recent use of large-scale language models in life science applications, for predicting protein characteristics and how such models can be used to predict, for example, post-translational modifications. We discuss the shortcomings of other methods and explain why language models have proven a very promising way to unravel the information hidden in sequences of amino acids.
Language: English
Citations: 98
Bioinformatics Advances, Journal Year: 2023, Volume and Issue: 3(1)
Published: Jan. 1, 2023
Abstract Summary: The transformer-based language models, including the vanilla transformer, BERT and GPT-3, have achieved revolutionary breakthroughs in the field of natural language processing (NLP). Since there are inherent similarities between various biological sequences and natural languages, the remarkable interpretability and adaptability of these models have prompted a new wave of their application in bioinformatics research. To provide a timely and comprehensive review, we introduce key developments of transformer-based language models by describing the detailed structure of transformers and summarize their contribution to a wide range of bioinformatics research, from basic sequence analysis to drug discovery. While applications are diverse and multifaceted, we identify and discuss common challenges, including the heterogeneity of training data, computational expense and model interpretability, as well as opportunities in the context of bioinformatics research. We hope that the broader community of NLP researchers, bioinformaticians and biologists will be brought together to foster future research and development and inspire novel applications unattainable by traditional methods. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
Language: English
Citations: 95
mAbs, Journal Year: 2022, Volume and Issue: 14(1)
Published: March 16, 2022
Although the therapeutic efficacy and commercial success of monoclonal antibodies (mAbs) are tremendous, the design and discovery of new candidates remain a time- and cost-intensive endeavor. In this regard, progress in the generation of data describing antigen binding and developability, computational methodology, and artificial intelligence may pave the way for a new era of in silico on-demand immunotherapeutics design and discovery. Here, we argue that the main necessary machine learning (ML) components for an in silico mAb sequence generator are: understanding of the rules of mAb-antigen binding, capacity to modularly combine mAb design parameters, and algorithms for unconstrained parameter-driven in silico mAb sequence synthesis. We review the current progress toward the realization of these necessary components and discuss the challenges that must be overcome to allow the ML-based discovery of fit-for-purpose mAb therapeutic candidates.
Language: English
Citations: 83
NAR Genomics and Bioinformatics, Journal Year: 2022, Volume and Issue: 4(2)
Published: March 31, 2022
Abstract Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, has an improved ability to recognize distant homologous relationships compared with more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have seemingly random pairwise sequence similarity. The novelty of this work lies in the particular combination of tools and sampling techniques that ascertained good performance, comparable to or better than existing state-of-the-art methods. Additionally, since the method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
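The two ingredients described here, contrastive learning on pLM embeddings and the EAT nearest-neighbor lookup it enables, can be sketched as follows. The triplet margin, the Euclidean distance and the toy CATH labels are illustrative assumptions of this sketch, not ProtTucker's exact training setup:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Contrastive (triplet) objective: pull the embedding of a protein toward one
    sharing its CATH class (positive) and away from one that differs (negative)."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def eat_transfer(query_emb, reference_embs, reference_labels):
    """Embedding-based annotation transfer (EAT): copy the label of the
    nearest reference embedding to the query, replacing sequence-distance
    lookup with embedding-distance lookup."""
    dists = np.linalg.norm(reference_embs - query_emb, axis=1)
    return reference_labels[int(np.argmin(dists))]
```

Since EAT is a single nearest-neighbor search in embedding space, it avoids alignment generation entirely, which is why the approach is orders of magnitude faster than alignment-based transfer.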
Language: English
Citations: 78
Computational and Structural Biotechnology Journal, Journal Year: 2022, Volume and Issue: 21, P. 238 - 250
Published: Nov. 19, 2022
The process of designing biomolecules, in particular proteins, is witnessing a rapid change in available tooling and approaches, moving from design through physicochemical force fields to producing plausible, complex sequences fast via end-to-end differentiable statistical models. To achieve conditional and controllable protein design, researchers at the interface of artificial intelligence and biology leverage advances in natural language processing (NLP) and computer vision techniques, coupled with advances in computing hardware, to learn patterns from growing biological databases, curated annotations thereof, or both. Once learned, these patterns can be used to provide novel insights into mechanistic biology and the design of biomolecules. However, navigating and understanding the practical applications of the many recent tools is complex. To facilitate this, we 1) document deep learning (DL) assisted protein design over the last three years, and 2) present a practical pipeline that allows going from generated sequences to their predicted properties.
Language: English
Citations: 78