Pre-trained Language Models in Biomedical Domain: A Systematic Survey
Benyou Wang, Qianqian Xie, Jiahuan Pei, et al.

ACM Computing Surveys, Journal Year: 2023, Volume and Issue: 56(3), P. 1 - 52

Published: Aug. 1, 2023

Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing tasks. This also benefits the biomedical domain: researchers from the informatics, medicine, and computer science communities propose various PLMs trained on various datasets, e.g., biomedical text, electronic health records, protein sequences, and DNA sequences. However, the cross-discipline characteristics of biomedical PLMs hinder their spreading among communities; some existing works are isolated from each other without comprehensive comparison and discussion. It is nontrivial to make a survey that not only systematically reviews recent advances in biomedical PLMs and their applications but also standardizes terminology and benchmarks. This article summarizes the recent progress of pre-trained language models in the biomedical domain and their applications in downstream biomedical tasks. Particularly, we discuss the motivations of PLMs in the biomedical domain and introduce the key concepts of pre-trained language models. We then propose a taxonomy of existing biomedical PLMs that categorizes them from various perspectives systematically. Plus, the applications of biomedical PLMs in downstream biomedical tasks are exhaustively discussed. Last, we illustrate various limitations and future trends, which aims to provide inspiration for future research.

Language: English

Deep Learning in Protein Structural Modeling and Design
Wenhao Gao, Sai Pooja Mahajan, Jeremias Sulam, et al.

Patterns, Journal Year: 2020, Volume and Issue: 1(9), P. 100142 - 100142

Published: Nov. 12, 2020

Deep learning is catalyzing a scientific revolution fueled by big data, accessible toolkits, and powerful computational resources, impacting many fields, including protein structural modeling. Protein structural modeling, such as predicting structure from amino acid sequence and evolutionary information, designing proteins toward desirable functionality, or predicting properties or behavior of a protein, is critical to understand and engineer biological systems at the molecular level. In this review, we summarize recent advances in applying deep learning techniques to tackle problems in protein structural modeling and design. We dissect the emerging approaches using deep learning techniques for protein structural modeling and discuss advances and challenges that must be addressed. We argue for the central importance of structure, following the "sequence → structure → function" paradigm. This review is directed to help both biologists gain familiarity with deep learning methods applied in protein modeling and computer scientists gain perspective on the biologically meaningful problems that may benefit from deep learning techniques.

Language: English

Citations: 188

Rhea, the reaction knowledgebase in 2022
Parit Bansal, Anne Morgat, Kristian B. Axelsen, et al.

Nucleic Acids Research, Journal Year: 2021, Volume and Issue: 50(D1), P. D693 - D700

Published: Nov. 9, 2021

Rhea (https://www.rhea-db.org) is an expert-curated knowledgebase of biochemical reactions based on the chemical ontology ChEBI (Chemical Entities of Biological Interest) (https://www.ebi.ac.uk/chebi). In this paper, we describe a number of key developments in Rhea since our last report in the database issue of Nucleic Acids Research in 2019. These include improved reaction coverage in Rhea, the adoption of Rhea as the reference vocabulary for enzyme annotation in the UniProt knowledgebase UniProtKB (https://www.uniprot.org), the development of a new website, and the designation of Rhea as an ELIXIR Core Data Resource. We hope that these and other developments will enhance the utility of Rhea as a resource to study and engineer enzymes and the metabolic systems in which they function.

Language: English

Citations: 161

A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information
Nguyen Quoc Khanh Le, Quang‐Thai Ho, Trinh‐Trung‐Duong Nguyen, et al.

Briefings in Bioinformatics, Journal Year: 2021, Volume and Issue: 22(5)

Published: Jan. 4, 2021

Recently, language representation models have drawn a lot of attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple, yet powerful language model that achieved novel state-of-the-art performance. BERT adopted the concept of contextualized word embedding to capture the semantics and context of the words in which they appeared. In this study, we present a novel technique by incorporating a BERT-based multilingual model in bioinformatics to represent the information of DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, which is a well-known and challenging problem in this field. We observed that our proposed features improved performance by more than 5-10% in terms of sensitivity, specificity, accuracy, and Matthews correlation coefficient compared to the current state-of-the-art features in bioinformatics. Moreover, advanced experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) holds potential to learn such features better than other traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
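To make the pipeline this abstract describes concrete, here is a minimal sketch, not the authors' exact setup: DNA is split into overlapping k-mers treated as words, encoded with a multilingual BERT checkpoint into a fixed-length embedding matrix, and classified with a small 2D CNN. The checkpoint name, k-mer size, and CNN shape are all illustrative assumptions.

```python
# Sketch: DNA k-mers as "words" for a BERT-style encoder, then a 2D CNN.
# Model name and all hyperparameters are illustrative, not the paper's setup.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

def kmers(seq: str, k: int = 3) -> str:
    """Split a DNA sequence into overlapping k-mers, space-separated like words."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

seq = "ACGTACGTGGCTAGCTAAGGCT"
inputs = tokenizer(kmers(seq), return_tensors="pt",
                   padding="max_length", truncation=True, max_length=64)
with torch.no_grad():
    # (1, 64, 768) token-by-dimension matrix, used below as a 1-channel "image"
    emb = encoder(**inputs).last_hidden_state

class EnhancerCNN(nn.Module):
    """Tiny 2D CNN over the fixed-length embedding matrix (binary output)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.head = nn.Linear(8 * 4 * 4, 1)

    def forward(self, x):                  # x: (batch, 1, seq_len, hidden)
        h = self.conv(x).flatten(1)
        return torch.sigmoid(self.head(h))

model = EnhancerCNN()
prob_enhancer = model(emb.unsqueeze(1))    # untrained; illustrates the data flow
```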

Language: English

Citations: 147

Embeddings from deep learning transfer GO annotations beyond homology
Maria Littmann, Michael Heinzinger, Christian Dallago, et al.

Scientific Reports, Journal Year: 2021, Volume and Issue: 11(1)

Published: Jan. 13, 2021

Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods bridge this sequence-annotation gap typically through homology-based annotation transfer, by identifying sequence-similar proteins with known function, or through prediction methods using evolutionary information. Here, we propose predicting GO terms based on proximity of proteins in SeqVec embedding space rather than in sequence space. These embeddings originate from deep learned language models (LMs) for protein sequences (SeqVec), transferring the knowledge gained from predicting the next amino acid in 33 million protein sequences. Replicating the conditions of CAFA3, our method reaches an Fmax of 37 ± 2%, 50 ± 3%, and 57 ± 2% for BPO, MFO, and CCO, respectively. Numerically, this appears close to the top ten CAFA3 methods. When restricting the lookup set to proteins with < 20% pairwise sequence identity to the query, performance drops (Fmax for MFO 43 ± 3%, CCO 53 ± 2%); this still outperforms naïve sequence-based transfer. Preliminary results from CAFA4 appear to confirm these findings. Overall, this new concept is likely to change the annotation of proteins, in particular for smaller families or proteins with intrinsically disordered regions.
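A minimal sketch of the core idea, with entirely synthetic placeholder data: annotate a query protein with the GO terms of its nearest neighbors in embedding space (Euclidean distance), rather than its nearest hit in sequence space. The embedding dimension and the k parameter are illustrative.

```python
# Sketch of GO-term transfer by embedding proximity (hypothetical data).
import numpy as np

# Hypothetical lookup set: per-protein embeddings (e.g., mean-pooled, 1024-d)
# paired with experimentally supported GO annotation sets.
lookup_emb = np.random.rand(1000, 1024).astype(np.float32)
lookup_go = [{"GO:0005515"}] * 1000        # placeholder annotation sets

def transfer_go(query_emb: np.ndarray, k: int = 1) -> set:
    """Transfer the union of GO terms from the k nearest embeddings (Euclidean)."""
    dists = np.linalg.norm(lookup_emb - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]
    terms = set()
    for idx in nearest:
        terms |= lookup_go[idx]
    return terms

predicted = transfer_go(np.random.rand(1024).astype(np.float32), k=3)
```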

Language: English

Citations: 144

Language models generalize beyond natural proteins
Robert Verkuil, Ori Kabeli, Yilun Du, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2022, Volume and Issue: unknown

Published: Dec. 22, 2022

Learning the design patterns of proteins from sequences across evolution may have promise toward generative protein design. However, it is unknown whether language models, trained on sequences of natural proteins, will be capable of more than memorization of existing protein families. Here we show that language models generalize beyond natural proteins to generate de novo proteins. We focus on two protein design tasks: fixed backbone design, where the structure is specified, and unconstrained generation, where the structure is sampled from the model. Remarkably, although the models are trained only on sequences, we find that they are capable of designing structure. A total of 228 generated proteins are evaluated experimentally with high overall success rates (152/228 or 67%) in producing a soluble and monomeric species by size exclusion chromatography. Out of the 152 successful designs, 35 have no significant sequence match to known natural proteins. Of the remaining 117, sequence identity to the nearest sequence match is at median 27%, below 20% for 6 designs, and as low as 18% for 3 designs. For fixed backbone design, the language model generates successful designs for each of eight artificially created backbone targets. For unconstrained generation, sampled proteins cover diverse topologies and secondary structure compositions, and have a high experimental success rate (71/129 or 55%). The designs reflect deep patterns linking sequence and structure, including motifs that occur in related natural structures and motifs that are not observed in similar structural contexts. These results show that language models, though trained only on sequences, learn a deep grammar that enables designing protein structure, extending beyond natural proteins.
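As a rough illustration of unconstrained generation, the sketch below resamples randomly chosen masked positions from a public protein masked language model. This is a simplified stand-in for the paper's sampling procedure, and the ESM-2 checkpoint named here is an assumption chosen for illustration.

```python
# Simplified sketch: generate a sequence by iterative masked resampling from a
# protein masked LM. The paper's actual procedure is more involved; checkpoint,
# length, step count, and temperature are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
lm = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D").eval()

length, steps, temperature = 60, 200, 1.0
ids = tok("M" + "A" * (length - 1), return_tensors="pt")["input_ids"]

with torch.no_grad():
    for _ in range(steps):
        pos = torch.randint(1, length + 1, (1,)).item()  # residue positions only
        ids[0, pos] = tok.mask_token_id                  # mask one position
        logits = lm(ids).logits[0, pos] / temperature
        probs = torch.softmax(logits, dim=-1)
        probs[tok.all_special_ids] = 0.0                 # never sample specials
        ids[0, pos] = torch.multinomial(probs, 1).item() # resample the residue

print(tok.decode(ids[0], skip_special_tokens=True).replace(" ", ""))
```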

Language: English

Citations: 128

Learning meaningful representations of protein sequences
Nicki Skafte Detlefsen, Søren Hauberg, Wouter Boomsma, et al.

Nature Communications, Journal Year: 2022, Volume and Issue: 13(1)

Published: April 8, 2022

How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This begs the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking the representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.

Language: English

Citations: 117

Prediction of protein–protein interaction using graph neural networks
Kanchan Jha, Sriparna Saha, Hiteshi Singh, et al.

Scientific Reports, Journal Year: 2022, Volume and Issue: 12(1)

Published: May 19, 2022

Proteins are the essential biological macromolecules required to perform nearly all biological processes and cellular functions. Proteins rarely carry out their tasks in isolation but interact with other proteins (known as protein-protein interaction) present in their surroundings to complete biological activities. The knowledge of protein-protein interactions (PPIs) unravels cellular behavior and functionality. Computational methods automate the prediction of PPIs and are less expensive than experimental methods in terms of resources and time. So far, most works on PPI prediction have mainly focused on sequence information. Here, we use a graph convolutional network (GCN) and graph attention network (GAT) to predict the interaction between proteins by utilizing the protein's structural information and sequence features. We build the graphs from PDB files, which contain the 3D coordinates of atoms. The protein graph represents the amino acid network, also known as the residue contact network, where each node is a residue. Two nodes are connected if they have a pair of atoms (one from each node) within a threshold distance. To extract node/residue features, we use a protein language model. The input to the language model is the protein sequence, and the output is a feature vector for each residue of the underlying sequence. We validate the predictive capability of the proposed graph-based approach on two PPI datasets: Human and S. cerevisiae. Obtained results demonstrate the effectiveness of the proposed approach, as it outperforms previous leading methods. The source code for training and the data to train the model are available at https://github.com/JhaKanchan15/PPI_GNN.git.
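A minimal sketch of the graph construction described here, with placeholder coordinates and features: residues become nodes, edges connect residue pairs whose C-alpha atoms fall within a distance threshold, and node features stand in for protein language model embeddings. The two-layer GCN is generic, not the paper's exact architecture, and the 10 Å threshold is an illustrative choice.

```python
# Sketch: residue contact graph + generic GCN (hypothetical inputs throughout).
import torch
from torch_geometric.nn import GCNConv, global_mean_pool

def contact_edges(ca_coords: torch.Tensor, threshold: float = 10.0):
    """Edge index for residue pairs with C-alpha distance < threshold (Angstrom)."""
    dist = torch.cdist(ca_coords, ca_coords)           # (n, n) pairwise distances
    src, dst = torch.nonzero(dist < threshold, as_tuple=True)
    keep = src != dst                                  # drop self-loops
    return torch.stack([src[keep], dst[keep]])

class ProteinGCN(torch.nn.Module):
    def __init__(self, in_dim: int = 1024, hidden: int = 128):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index)
        # One vector per protein; a PPI head would combine two such vectors.
        return global_mean_pool(h, torch.zeros(h.size(0), dtype=torch.long))

coords = torch.rand(120, 3) * 50      # placeholder C-alpha coordinates
features = torch.rand(120, 1024)      # placeholder per-residue pLM embeddings
protein_vec = ProteinGCN()(features, contact_edges(coords))
```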

Language: English

Citations: 111

Embeddings from protein language models predict conservation and variant effects
Céline Marquet, Michael Heinzinger, Tobias Olenyi, et al.

Human Genetics, Journal Year: 2021, Volume and Issue: 141(10), P. 1629 - 1647

Published: Dec. 30, 2021

The emergence of SARS-CoV-2 variants stressed the demand for tools allowing to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of single proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient, MCC, of 0.596 ± 0.006 for ProtT5 embeddings vs. 0.608 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) predicted SAV effect magnitude without any optimization on DMS data. Comparing predictions for a standard set of 39 DMS experiments to those of other methods (incl. ESM-1v, DeepSequence, GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman and Pearson correlation). Finally, we investigated binary effect predictions on DMS experiments for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~20 k proteins) within 40 min on one Nvidia Quadro RTX 8000. All methods and data are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
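The following toy sketch, on entirely synthetic arrays, shows the shape of the kind of logistic-regression ensemble the abstract describes: per-variant inputs combine a conservation prediction, a BLOSUM62 substitution score, and mask reconstruction probabilities, and the fitted model emits a continuous variant-effect score. Feature layout and labels are placeholders, not the published pipeline.

```python
# Toy logistic-regression ensemble over hand-built per-variant features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.random(n),               # predicted conservation of the mutated site
    rng.integers(-4, 12, n),     # BLOSUM62 score of the substitution
    rng.random(n),               # pLM reconstruction prob. of wild-type residue
    rng.random(n),               # pLM reconstruction prob. of variant residue
])
y = (rng.random(n) > 0.5).astype(int)   # placeholder effect/neutral labels

lr = LogisticRegression().fit(X, y)
effect_scores = lr.predict_proba(X)[:, 1]   # continuous variant-effect score
```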

Language: English

Citations: 110

Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction
Konstantin Weißenow, Michael Heinzinger, Burkhard Rost, et al.

Structure, Journal Year: 2022, Volume and Issue: 30(8), P. 1169 - 1177.e4

Published: May 23, 2022

Language: English

Citations: 102

Transformer-based deep learning for predicting protein properties in the life sciences
Abel Chandra, Laura Tünnermann, Tommy Löfstedt, et al.

eLife, Journal Year: 2023, Volume and Issue: 12

Published: Jan. 18, 2023

Recent developments in deep learning, coupled with an increasing number of sequenced proteins, have led to a breakthrough in life science applications, in particular in protein property prediction. There is hope that deep learning can close the gap between the number of sequenced proteins and the number of properties known from lab experiments. Language models from the field of natural language processing have gained popularity for protein property predictions and have led to a new computational revolution in biology, where old prediction results are being improved regularly. Such models can learn useful multipurpose representations of proteins from large open repositories of protein sequences and can be used, for instance, to predict protein properties. The field is growing quickly because of developments in one class of models, the Transformer model. We review the recent use of large-scale language models in protein applications and how such models can be used to predict, for example, post-translational modifications. We review shortcomings of other deep learning models and explain how the Transformer models have proven to be a very promising way to unravel the information hidden in the sequences of amino acids.
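A generic sketch of the workflow such reviews describe: mean-pool the per-residue states of a pretrained protein Transformer into one vector per protein, then fit a simple downstream property classifier. The ESM-2 checkpoint and the toy sequences and labels are illustrative assumptions, not from the review.

```python
# Sketch: pretrained Transformer embeddings + a simple property classifier.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
encoder = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D").eval()

def embed(seq: str) -> torch.Tensor:
    """Mean-pool the last hidden layer into one fixed-length protein vector."""
    inputs = tok(seq, return_tensors="pt")
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state.mean(dim=1).squeeze(0)

seqs = ["MKTAYIAKQR", "MDSKGSSQKG"]        # toy sequences
labels = [1, 0]                            # toy property labels
X = torch.stack([embed(s) for s in seqs]).numpy()
clf = LogisticRegression().fit(X, labels)  # downstream property predictor
```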

Language: English

Citations: 98