Protein sequence design with deep generative models
Zachary Wu, Kadina E. Johnston, Frances H. Arnold et al.

Current Opinion in Chemical Biology, Journal Year: 2021, Volume and Issue: 65, P. 18 - 27

Published: May 26, 2021

Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.

Language: English

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences
Alexander Rives, Joshua Meier, Tom Sercu et al.

Proceedings of the National Academy of Sciences, Journal Year: 2021, Volume and Issue: 118(15)

Published: April 5, 2021

Significance: Learning biological properties from sequence data is a logical step toward generative and predictive artificial intelligence for biology. Here, we propose scaling a deep contextual language model with unsupervised learning to sequences spanning evolutionary diversity. We find that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and activity. We show the representations are useful across benchmarks for remote homology detection, prediction of secondary structure, long-range residue–residue contacts, and mutational effect. Unsupervised representation learning enables state-of-the-art supervised prediction of mutational effect and secondary structure, and improves state-of-the-art features for long-range contact prediction.
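As an aside to this entry: a minimal sketch of the embedding workflow the abstract describes, assuming the authors' fair-esm package (https://github.com/facebookresearch/esm). The checkpoint name, call signatures, and example sequence follow the public release and are assumptions here, not part of the cited paper.

```python
# Minimal sketch: per-residue representations and predicted contacts from the
# pretrained ESM-1b model in the fair-esm package; names may differ between versions.
import torch
import esm

# Load the pretrained 650M-parameter ESM-1b model and its alphabet/tokenizer.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()  # inference only

# Hypothetical example sequence (truncated for illustration).
data = [("example_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=True)

# Per-residue embeddings from the final (33rd) transformer layer.
residue_reprs = out["representations"][33]
# Unsupervised contact map predicted from attention heads.
contacts = out["contacts"]
print(residue_reprs.shape, contacts.shape)
```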

Language: English

Citations: 2006

ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning
Ahmed Elnaggar, Michael Heinzinger, Christian Dallago et al.

IEEE Transactions on Pattern Analysis and Machine Intelligence, Journal Year: 2021, Volume and Issue: 44(10), P. 7112 - 7127

Published: July 7, 2021

Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw pLM embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3 = 81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (ten-state accuracy: Q10 = 81%) and membrane versus water-soluble (2-state accuracy Q2 = 91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using multiple sequence alignments (MSAs) or evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans
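A minimal sketch of the per-residue and per-protein (pooled) embedding use described above, assuming the ProtT5 encoder checkpoint released with ProtTrans on Hugging Face; the checkpoint name and preprocessing follow the repository's examples and may differ between versions.

```python
# Minimal sketch: per-residue and mean-pooled per-protein embeddings from a
# ProtT5 encoder released with ProtTrans (https://github.com/agemagician/ProtTrans).
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical example
# ProtT5 expects space-separated residues, with rare amino acids mapped to X.
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

ids = tokenizer(prepared, return_tensors="pt")
with torch.no_grad():
    out = model(input_ids=ids["input_ids"], attention_mask=ids["attention_mask"])

# Per-residue (per-token) embeddings, dropping the trailing special token.
per_residue = out.last_hidden_state[0, : len(sequence)]   # (L, 1024)
# Per-protein embedding by mean pooling over residues.
per_protein = per_residue.mean(dim=0)                      # (1024,)
print(per_residue.shape, per_protein.shape)
```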

Language: English

Citations: 1098

Machine-learning-guided directed evolution for protein engineering
Kevin Yang, Zachary Wu, Frances H. Arnold et al.

Nature Methods, Journal Year: 2019, Volume and Issue: 16(8), P. 687 - 694

Published: July 15, 2019

Language: English

Citations: 870

Modeling aspects of the language of life through transfer-learning protein sequences
Michael Heinzinger, Ahmed Elnaggar, Yu Wang et al.

BMC Bioinformatics, Journal Year: 2019, Volume and Issue: 20(1)

Published: Dec. 1, 2019

Predicting protein function and structure from sequence is one important challenge for computational biology. For 26 years, most state-of-the-art approaches combined machine learning and evolutionary information. However, for some applications retrieving related proteins is becoming too time-consuming. Additionally, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome. Both these problems are addressed by the new methodology introduced here. We introduce a novel way to represent protein sequences as continuous vectors (embeddings) using the language model ELMo taken from natural language processing. By modeling protein sequences, ELMo effectively captured the biophysical properties of the language of life from unlabeled big data (UniRef50). We refer to these new embeddings as SeqVec (Sequence-to-Vector) and demonstrate their effectiveness by training simple neural networks for two different tasks. At the per-residue level, secondary structure (Q3 = 79% ± 1, Q8 = 68% ± 1) and regions with intrinsic disorder (MCC = 0.59 ± 0.03) were predicted significantly better than through one-hot encoding or Word2vec-like approaches. At the per-protein level, subcellular localization was predicted in ten classes, and membrane-bound proteins were distinguished from water-soluble ones (Q2 = 87% ± 1). Although SeqVec generated the best predictions from single sequences, no solution improved over the best existing method using evolutionary information. Nevertheless, our approach improved over some popular methods using evolutionary information, and for some proteins it even did beat the best. Thus, the embeddings prove to condense the underlying principles of protein sequences. The overall novelty is speed: where the lightning-fast HHblits needed minutes on average to generate the evolutionary information for a target protein, SeqVec created embeddings in 0.03 s on average. As this speed-up is independent of the size of growing sequence databases, SeqVec provides a highly scalable approach for the analysis of big data in proteomics, i.e. microbiome or metaproteome analysis. Transfer-learning succeeded in extracting information from unlabeled sequence databases relevant for various protein prediction tasks. SeqVec modeled the language of life, namely the principles underlying protein sequences, better than any features suggested by textbooks and prediction methods. The exception is evolutionary information; however, that information is not available on the level of a single sequence.
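A minimal sketch of the "simple neural networks on top of embeddings" idea described above: one position-wise head for a per-residue task and one pooled head for a per-protein task. The embedding dimensionality, layer widths, and random input are assumptions for illustration; the actual SeqVec ELMo embedder is distributed by the authors and is not reproduced here.

```python
# Minimal sketch: simple downstream heads on precomputed SeqVec-style embeddings.
import torch
import torch.nn as nn

EMB_DIM = 1024  # assumed per-residue embedding dimensionality

# Per-residue head: one hidden layer applied position-wise (e.g. 3-state secondary structure).
per_residue_head = nn.Sequential(
    nn.Linear(EMB_DIM, 32), nn.ReLU(), nn.Linear(32, 3)
)

# Per-protein head: mean-pool residue embeddings, then classify (e.g. 10 localization classes).
per_protein_head = nn.Sequential(
    nn.Linear(EMB_DIM, 32), nn.ReLU(), nn.Linear(32, 10)
)

# Hypothetical embeddings for one protein of length 120.
residue_embeddings = torch.randn(120, EMB_DIM)

ss_logits = per_residue_head(residue_embeddings)            # (120, 3)
loc_logits = per_protein_head(residue_embeddings.mean(0))   # (10,)
print(ss_logits.shape, loc_logits.shape)
```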

Language: English

Citations: 514

Evaluating Protein Transfer Learning with TAPE
Roshan Rao, Nicholas Bhattacharya, Neil Thomas et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2019, Volume and Issue: unknown

Published: June 20, 2019

Abstract: Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases the features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape
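A minimal sketch of embedding a sequence with a pretrained model from the TAPE repository cited above (pip package tape_proteins); the class and checkpoint names follow the repository's documented usage and may have changed in later releases.

```python
# Minimal sketch: per-residue and pooled per-protein features from TAPE's pretrained transformer.
import torch
from tape import ProteinBertModel, TAPETokenizer

model = ProteinBertModel.from_pretrained("bert-base")  # TAPE's pretrained BERT-style model
tokenizer = TAPETokenizer(vocab="iupac")               # amino-acid vocabulary

sequence = "GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ"       # hypothetical example
token_ids = torch.tensor([tokenizer.encode(sequence)])

with torch.no_grad():
    outputs = model(token_ids)

# outputs[0]: per-residue features; outputs[1]: one pooled vector per protein,
# which can feed the benchmark's supervised task heads.
sequence_output, pooled_output = outputs[0], outputs[1]
print(sequence_output.shape, pooled_output.shape)
```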

Language: English

Citations: 472

Learning the protein language: Evolution, structure, and function
Tristan Bepler, Bonnie Berger

Cell Systems, Journal Year: 2021, Volume and Issue: 12(6), P. 654 - 669.e3

Published: June 1, 2021

Language models have recently emerged as a powerful machine-learning approach for distilling information from massive protein sequence databases. From readily available sequence data alone, these models discover evolutionary, structural, and functional organization across protein space. Using language models, we can encode amino-acid sequences into distributed vector representations that capture their structural properties, as well as evaluate the evolutionary fitness of sequence variants. We discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. We then consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding that knowledge into the learned representations. The knowledge distilled by these models allows us to improve downstream function prediction through transfer learning. Deep protein language models are revolutionizing protein biology. They suggest new ways to approach protein and therapeutic design. However, further developments are needed to encode strong biological priors into protein language models and to increase their accessibility to the broader community.

Language: English

Citations: 359

The language of proteins: NLP, machine learning & protein sequences
Dan Ofer, Nadav Brandes, Michal Linial et al.

Computational and Structural Biotechnology Journal, Journal Year: 2021, Volume and Issue: 19, P. 1750 - 1758

Published: Jan. 1, 2021

Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise, and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit for many NLP methods. We explore the conceptual similarities and differences between proteins and language, and the range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams, and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning, and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning, and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.
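A minimal sketch of the classic bag-of-words-over-k-mers encoding mentioned in this review, implemented with scikit-learn's character n-gram vectorizer; the sequences and the choice of k = 3 are illustrative assumptions, not the review's own code.

```python
# Minimal sketch: bag-of-words over amino-acid 3-mers, the classic NLP-style featurization.
from sklearn.feature_extraction.text import CountVectorizer

sequences = [
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",   # hypothetical protein 1
    "MSHHWGYGKHNGPEHWHKDFPIAKGERQSPVDI",   # hypothetical protein 2
]

# Treat each protein as a "document" and each 3-mer of amino acids as a "word".
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(sequences)    # sparse count matrix: proteins x 3-mers

print(X.shape)                              # (2, number of distinct 3-mers)
print(vectorizer.get_feature_names_out()[:5])
# These counts can feed any standard classifier, in contrast to the contextualized,
# attention-based representations the review surveys next.
```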

Language: English

Citations: 284

Protein design and variant prediction using autoregressive generative models
Jung-Eun Shin, Adam J. Riesselman, Aaron W. Kollasch et al.

Nature Communications, Journal Year: 2021, Volume and Issue: 12(1)

Published: April 23, 2021

Abstract: The ability to design functional sequences and predict the effects of variation is central to protein engineering and biotherapeutics. State-of-the-art computational methods rely on models that leverage evolutionary information but are inadequate for important applications where multiple sequence alignments are not robust. Such applications include the prediction of variant effects of indels, disordered proteins, and proteins such as antibodies, due to their highly variable complementarity-determining regions. We introduce a deep generative model adapted from natural language processing for the prediction and design of diverse functional sequences without the need for alignments. The model performs state-of-the-art prediction of missense and indel effects, and we successfully design and test a diverse 10^5-nanobody library that shows better expression than a 1000-fold larger synthetic library. Our results demonstrate the power of the alignment-free autoregressive model in generalizing to regions of sequence space traditionally considered beyond the reach of prediction and design.
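A minimal sketch of the alignment-free scoring idea behind this work: an autoregressive model assigns each sequence, including indel variants, a total log-likelihood that can be used to rank variants. The tiny untrained model below is a stand-in for illustration, not the authors' architecture.

```python
# Minimal sketch: score variants by summed log p(x_t | x_<t) from an autoregressive model.
import torch
import torch.nn as nn
import torch.nn.functional as F

AA = "ACDEFGHIKLMNPQRSTVWY"
stoi = {a: i for i, a in enumerate(AA)}

class TinyAutoregressiveLM(nn.Module):
    """Predicts residue t from residues < t (causal LSTM stand-in)."""
    def __init__(self, vocab=20, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens):                 # tokens: (1, L)
        h, _ = self.rnn(self.embed(tokens))
        return self.out(h)                     # logits: (1, L, vocab)

def log_likelihood(model, seq):
    """Sum of log p(x_t | x_<t) over all positions after the first."""
    tokens = torch.tensor([[stoi[a] for a in seq]])
    with torch.no_grad():
        logits = model(tokens[:, :-1])         # predict each next residue
        logp = F.log_softmax(logits, dim=-1)
        target = tokens[:, 1:]
        return logp.gather(-1, target.unsqueeze(-1)).sum().item()

model = TinyAutoregressiveLM()   # untrained stand-in; a real model is trained on family sequences
wild_type = "MKTAYIAKQR"
variant = "MKTAYIAKAQR"          # insertion variant: no alignment needed to score it
print(log_likelihood(model, wild_type), log_likelihood(model, variant))
```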

Language: English

Citations: 284

High-resolution de novo structure prediction from primary sequence
Ruidong Wu, Fan Ding, Rui Wang et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2022, Volume and Issue: unknown

Published: July 22, 2022

Abstract: Recent breakthroughs have used deep learning to exploit evolutionary information in multiple sequence alignments (MSAs) to accurately predict protein structures. However, MSAs of homologous proteins are not always available, such as with orphan proteins or fast-evolving proteins like antibodies, and a protein typically folds in a natural setting from its primary amino acid sequence into a three-dimensional structure, suggesting that evolutionary information and MSAs should not be necessary to predict a protein's folded form. Here, we introduce OmegaFold, the first computational method to successfully predict high-resolution protein structure from a single primary sequence alone. Using a new combination of a protein language model that allows us to make predictions from single sequences and a geometry-inspired transformer model trained on protein structures, OmegaFold outperforms RoseTTAFold and achieves similar prediction accuracy to AlphaFold2 on recently released structures. OmegaFold enables accurate predictions on orphan proteins that do not belong to any functionally characterized protein family and on antibodies that tend to have noisy MSAs due to fast evolution. Our study fills a much-encountered gap in structure prediction and brings us a step closer to understanding protein folding in nature.

Language: English

Citations: 284

Expanding functional protein sequence spaces using generative adversarial networks
Donatas Repecka, Vykintas Jauniškis, Laurynas Karpus et al.

Nature Machine Intelligence, Journal Year: 2021, Volume and Issue: 3(4), P. 324 - 333

Published: March 4, 2021

Language: English

Citations: 283