Pre-trained Language Models in Biomedical Domain: A Systematic Survey DOI Open Access
Benyou Wang, Qianqian Xie, Jiahuan Pei

et al.

ACM Computing Surveys, Journal Year: 2023, Volume and Issue: 56(3), P. 1 - 52

Published: Aug. 1, 2023

Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing tasks. This also benefits the biomedical domain: researchers from the informatics, medicine, and computer science communities propose various PLMs trained on various datasets, e.g., biomedical text, electronic health records, protein sequences, and DNA sequences. However, the cross-discipline characteristics of biomedical PLMs hinder their spread among communities; some existing works are isolated from each other without comprehensive comparison and discussion. It is nontrivial to make a survey that not only systematically reviews recent advances in biomedical PLMs and their applications but also standardizes terminology and benchmarks. This article summarizes the recent progress of pre-trained language models in the biomedical domain and their applications in downstream biomedical tasks. In particular, we discuss the motivations for PLMs in the biomedical domain and introduce the key concepts of pre-trained language models. We then propose a taxonomy of existing biomedical PLMs that categorizes them from various perspectives systematically. In addition, their applications in biomedical downstream tasks are exhaustively discussed. Last, we illustrate various limitations and future trends, which aims to provide inspiration for future research.

Language: English

Highly accurate protein structure prediction with AlphaFold DOI Creative Commons
John Jumper, Richard Evans, Alexander Pritzel

et al.

Nature, Journal Year: 2021, Volume and Issue: 596(7873), P. 583 - 589

Published: July 15, 2021

Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort, the structures of around 100,000 unique proteins have been determined, but this represents a small fraction of the billions of known protein sequences. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the 'protein folding problem'—has been an important open research problem for more than 50 years. Despite recent progress, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14), demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.

Language: English

Citations

31214

Evolutionary-scale prediction of atomic-level protein structure with a language model DOI Creative Commons
Zeming Lin, Halil Akin, Roshan Rao

et al.

Science, Journal Year: 2023, Volume and Issue: 379(6637), P. 1123 - 1130

Published: March 16, 2023

Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.
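
For readers who want to probe the underlying language model directly, the sketch below pulls per-residue representations and predicted contacts from a pretrained ESM-2 checkpoint. It assumes the authors' fair-esm Python package and the 650M-parameter model name used in its documentation; full atomic-level structure prediction in the paper goes through the separate ESMFold folding head rather than this raw interface.

```python
import torch
import esm  # pip install fair-esm (assumed package name from the ESM repository)

# Load a pretrained ESM-2 model; the 650M-parameter variant is shown here,
# while the paper's largest model has 15B parameters.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=True)

per_residue = out["representations"][33]  # (batch, seq_len, hidden) embeddings
contact_map = out["contacts"]             # predicted residue-residue contact probabilities
```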

Language: English

Citations

2210

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences DOI Creative Commons
Alexander Rives, Joshua Meier, Tom Sercu

et al.

Proceedings of the National Academy of Sciences, Journal Year: 2021, Volume and Issue: 118(15)

Published: April 5, 2021

Learning biological properties from sequence data is a logical step toward generative and predictive artificial intelligence for biology. Here, we propose scaling a deep contextual language model with unsupervised learning to sequences spanning evolutionary diversity. We find that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and biological activity. We show the learned representations are useful across benchmarks for remote homology detection, prediction of secondary structure, long-range residue–residue contacts, and mutational effect. Unsupervised representation learning enables state-of-the-art supervised prediction of mutational effect and secondary structure and improves state-of-the-art features for long-range contact prediction.
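
As a rough illustration of how such representations support remote homology detection, the sketch below ranks database proteins by cosine similarity to a query in embedding space. The embedding dimension and the random vectors are placeholders, not values or code from the paper.

```python
import numpy as np

# Hypothetical per-protein embeddings: mean-pooled per-residue vectors from a
# protein language model (shape: n_proteins x hidden_dim). Values are stand-ins.
query_emb = np.random.rand(1280)
database_embs = np.random.rand(1000, 1280)

def cosine_similarity(query, matrix):
    """Cosine similarity between one vector and each row of a matrix."""
    query = query / np.linalg.norm(query)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ query

# Rank database proteins by similarity to the query; nearest neighbours in
# embedding space are candidate remote homologs.
scores = cosine_similarity(query_emb, database_embs)
top_hits = np.argsort(scores)[::-1][:10]
print(top_hits, scores[top_hits])
```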

Language: English

Citations

1983

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning DOI Creative Commons
Ahmed Elnaggar, Michael Heinzinger, Christian Dallago

et al.

IEEE Transactions on Pattern Analysis and Machine Intelligence, Journal Year: 2021, Volume and Issue: 44(10), P. 7112 - 7127

Published: July 7, 2021

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw pLM embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of sub-cellular location (ten-state accuracy: Q10=81%) and membrane versus water-soluble proteins (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using multiple sequence alignments (MSAs) or evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans
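
A minimal sketch of extracting ProtT5 embeddings as exclusive input features, assuming the Hugging Face transformers library and the encoder-only ProtT5-XL checkpoint name distributed through the repository above (treat the exact checkpoint identifier as an assumption and check the repository for current model names):

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

# Checkpoint name assumed from the ProtTrans Hugging Face hub listing.
ckpt = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(ckpt, do_lower_case=False)
model = T5EncoderModel.from_pretrained(ckpt).eval()

seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"]
# ProtT5 expects space-separated residues with rare amino acids mapped to X.
seqs = [" ".join(re.sub(r"[UZOB]", "X", s)) for s in seqs]

batch = tokenizer(seqs, add_special_tokens=True, padding="longest", return_tensors="pt")
with torch.no_grad():
    emb = model(input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"]).last_hidden_state

# emb: (batch, seq_len, 1024) per-residue embeddings; mean-pool over residues
# to obtain a single per-protein vector for pooling-style tasks.
per_protein = emb.mean(dim=1)
```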

Language: English

Citations

1098

ProteinBERT: a universal deep-learning model of protein sequence and function DOI Creative Commons
Nadav Brandes, Dan Ofer, Yam Peleg

et al.

Bioinformatics, Journal Year: 2022, Volume and Issue: 38(8), P. 2102 - 2110

Published: Jan. 8, 2022

Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed for biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.
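
As a toy illustration of the language-modeling half of such pretraining (ProteinBERT additionally predicts GO annotations from a global representation), the sketch below corrupts residues so that a model can be trained to recover them. The masking scheme and ratios are illustrative defaults, not the paper's exact settings.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def corrupt_sequence(seq, mask_token="<MASK>", p=0.15, seed=None):
    """Randomly corrupt residues for a masked/denoising language-modeling objective:
    each selected position is replaced by a mask token or a random amino acid,
    and the model is trained to recover the original residue at that position."""
    rng = random.Random(seed)
    tokens, targets = [], []
    for aa in seq:
        if rng.random() < p:
            targets.append(aa)  # position contributes to the reconstruction loss
            tokens.append(mask_token if rng.random() < 0.8 else rng.choice(AMINO_ACIDS))
        else:
            targets.append(None)  # position is not predicted
            tokens.append(aa)
    return tokens, targets

tokens, targets = corrupt_sequence("MKTAYIAKQRQISFVKSHFSRQ", seed=0)
print(tokens, targets)
```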

Language: English

Citations

510

Evaluating Protein Transfer Learning with TAPE DOI Creative Commons
Roshan Rao, Nicholas Bhattacharya, Neil Thomas

et al.

arXiv (Cornell University), Journal Year: 2019, Volume and Issue: unknown

Published: Jan. 1, 2019

Machine learning applied to protein sequences is an increasingly popular area of research. Semi-supervised learning for proteins has emerged as an important paradigm due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
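
A common way to run such a benchmark is to freeze the pretrained representation and fit a simple supervised head on the fixed splits. The sketch below uses scikit-learn on hypothetical pre-computed embeddings; the array shapes and labels are placeholders, and the TAPE repository itself provides the actual datasets and splits.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical pre-computed protein embeddings and labels for one downstream task,
# already divided into the fixed train/test splits a benchmark provides.
X_train, y_train = np.random.rand(800, 128), np.random.randint(0, 2, 800)
X_test, y_test = np.random.rand(200, 128), np.random.randint(0, 2, 200)

# A simple supervised head on frozen embeddings isolates how much task-relevant
# signal the representation itself carries.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```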

Language: English

Citations

312

Single-sequence protein structure prediction using a language model and deep learning DOI
Ratul Chowdhury, Nazim Bouatta, Surojit Biswas

et al.

Nature Biotechnology, Journal Year: 2022, Volume and Issue: 40(11), P. 1617 - 1623

Published: Oct. 3, 2022

Language: English

Citations

301

The language of proteins: NLP, machine learning & protein sequences DOI Creative Commons
Dan Ofer, Nadav Brandes, Michal Linial

et al.

Computational and Structural Biotechnology Journal, Journal Year: 2021, Volume and Issue: 19, P. 1750 - 1758

Published: Jan. 1, 2021

Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit for many NLP methods. We explore the conceptual similarities and differences between proteins and language, and the range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized learning and deep learning neural models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.
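
For the classic encodings mentioned above, a k-mer (n-gram) bag-of-words featurization of a protein sequence can be written in a few lines; this is a generic sketch, not code from the review.

```python
from collections import Counter

def kmer_bag_of_words(seq, k=3):
    """Encode a protein sequence as counts of overlapping k-mers (residue n-grams),
    the bag-of-words style featurization used in classic sequence analysis."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

features = kmer_bag_of_words("MKTAYIAKQRQISFVKSHFSRQ", k=3)
print(features.most_common(5))
```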

Language: English

Citations

283

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences DOI Creative Commons
Alexander Rives, Joshua Meier, Tom Sercu

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2019, Volume and Issue: unknown

Published: April 29, 2019

In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multi-scale organization reflecting structure, from the level of biochemical properties of amino acids to the remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.
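
A rough sketch of the "linear projection" idea: fit a linear probe on per-residue embeddings to test whether 3-state secondary structure is linearly decodable. The embeddings and labels here are random placeholders standing in for real annotated data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-residue embeddings (n_residues x hidden_dim) and 3-state
# secondary-structure labels (0 = helix, 1 = strand, 2 = coil).
X = np.random.rand(5000, 1280)
y = np.random.randint(0, 3, 5000)

# A linear probe (multinomial logistic regression) checks whether structural
# information is linearly decodable from the learned representations.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("linear-probe training accuracy:", probe.score(X, y))
```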

Language: English

Citations

239

PredictProtein - Predicting Protein Structure and Function for 29 Years DOI Creative Commons
Michael Bernhofer, Christian Dallago, Tim Karl

et al.

Nucleic Acids Research, Journal Year: 2021, Volume and Issue: 49(W1), P. W535 - W540

Published: May 11, 2021

Since 1992, PredictProtein (https://predictprotein.org) has been a one-stop online resource for protein sequence analysis, with its main site hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane segments, disordered regions, protein flexibility, and disulfide bridges) and predictions of protein function (functional effects of sequence variation or point mutations, Gene Ontology (GO) terms, subcellular localization, and protein-, RNA-, and DNA-binding). PredictProtein's infrastructure has moved to the LCSB, increasing throughput; the use of MMseqs2 sequence search reduced runtime five-fold (apparently without lowering the performance of prediction methods); user interface elements improved usability, and new prediction methods were added. PredictProtein recently included predictions from deep learning embeddings (GO terms and secondary structure) and a method for the prediction of proteins and residues binding DNA, RNA, or other proteins. PredictProtein.org aspires to provide reliable predictions to computational and experimental biologists alike. All scripts and methods are freely available for offline execution in high-throughput settings.

Language: English

Citations

222