ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction
Pascal Notin, Aaron W. Kollasch, Daniel P. Ritter, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: Dec. 8, 2023

Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins that can address our most pressing challenges in climate, agriculture and healthcare. Despite a surge in machine learning-based protein models to tackle these questions, an assessment of their respective benefits is challenging due to the use of distinct, often contrived, experimental datasets, and variable performance across different protein families. Addressing these challenges requires scale. To that end we introduce ProteinGym, a large-scale and holistic set of benchmarks specifically designed for protein fitness prediction and design. It encompasses both a broad collection of over 250 standardized deep mutational scanning assays, spanning millions of mutated sequences, as well as curated clinical datasets providing high-quality expert annotations about mutation effects. We devise a robust evaluation framework that combines metrics for both fitness prediction and design, factors in known limitations of the underlying experimental methods, and covers both zero-shot and supervised settings. We report the performance of a diverse set of over 70 high-performing models from various subfields (e.g., alignment-based, inverse folding) into a unified benchmark suite. We open source the corresponding codebase, MSAs, structures, and model predictions, and develop a user-friendly website that facilitates data access and analysis.
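
To make the zero-shot setting above concrete, the following is a minimal sketch of how a ProteinGym-style assay might be scored: rank-correlating a model's mutation scores against measured fitness values, then averaging across assays. The column names and toy values are hypothetical, not the benchmark's actual data layout.

```python
# Minimal sketch of a zero-shot fitness evaluation: Spearman rank correlation
# between model mutation scores and measured DMS fitness, per assay.
# The data below is a hypothetical toy example, not ProteinGym data.
import pandas as pd
from scipy.stats import spearmanr

def evaluate_assay(df: pd.DataFrame) -> float:
    """Rank-correlate model scores with measured fitness for one assay."""
    rho, _ = spearmanr(df["model_score"], df["DMS_score"])
    return rho

# Two toy assays; benchmarks typically average per-assay correlations
# rather than pooling mutants across assays.
assays = {
    "toy_assay_A": pd.DataFrame({"mutant": ["A12G", "L45P", "K78R"],
                                 "DMS_score": [0.9, 0.1, 0.5],
                                 "model_score": [1.2, -0.8, 0.3]}),
    "toy_assay_B": pd.DataFrame({"mutant": ["D3N", "F67Y"],
                                 "DMS_score": [0.2, 0.7],
                                 "model_score": [-0.5, 0.9]}),
}
per_assay = {name: evaluate_assay(df) for name, df in assays.items()}
print(per_assay, sum(per_assay.values()) / len(per_assay))
```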

Language: English

Citations

98

cACP-DeepGram: Classification of anticancer peptides via deep neural network and skip-gram-based word embedding model
Shahid Akbar, Maqsood Hayat, Muhammad Tahir, et al.

Artificial Intelligence in Medicine, Journal Year: 2022, Volume and Issue: 131, P. 102349 - 102349

Published: July 6, 2022
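
No abstract is shown for this entry, but the title names the approach: skip-gram word embeddings over peptide sequences feeding a deep neural network. The snippet below is a hedged sketch of that general idea using k-mer "words", gensim's skip-gram Word2Vec, and a small classifier; the sequences, labels, dimensions and classifier are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch of the approach named in the title: skip-gram embeddings of
# peptide k-mers fed to a small neural classifier. Everything below is
# illustrative toy data and toy hyperparameters.
import numpy as np
from gensim.models import Word2Vec
from sklearn.neural_network import MLPClassifier

def kmers(seq: str, k: int = 3):
    """Split a peptide sequence into overlapping k-mer 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Hypothetical toy peptides with anticancer (1) / non-anticancer (0) labels.
peptides = ["FLPIIAKLLSGLL", "GIGKFLHSAKKFGKAFVGEIMNS",
            "AAAKAAAKAAAK", "KWKLFKKIEKVGQNIRDG"]
labels = [1, 1, 0, 0]

corpus = [kmers(p) for p in peptides]
w2v = Word2Vec(corpus, vector_size=50, window=5, sg=1, min_count=1)  # sg=1 -> skip-gram

# Represent each peptide as the mean of its k-mer embedding vectors.
X = np.array([np.mean([w2v.wv[km] for km in kmers(p)], axis=0) for p in peptides])
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X, labels)
print(clf.predict(X))
```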

Language: English

Citations

96

Applications of transformer-based language models in bioinformatics: a survey
Shuang Zhang, Rui Fan, Yuti Liu, et al.

Bioinformatics Advances, Journal Year: 2023, Volume and Issue: 3(1)

Published: Jan. 1, 2023

Summary: Transformer-based language models, including the vanilla transformer, BERT and GPT-3, have achieved revolutionary breakthroughs in the field of natural language processing (NLP). Since there are inherent similarities between various biological sequences and natural languages, the remarkable interpretability and adaptability of these models have prompted a new wave of their application in bioinformatics research. To provide a timely and comprehensive review, we introduce key developments of transformer-based language models by describing the detailed structure of transformers and summarize their contribution to a wide range of bioinformatics research, from basic sequence analysis to drug discovery. While the applications are diverse and multifaceted, we identify and discuss the common challenges, including heterogeneity of training data, computational expense and model interpretability, as well as opportunities in the context of bioinformatics research. We hope that the broader community of NLP researchers, bioinformaticians and biologists will be brought together to foster future development and inspire novel applications unattainable by traditional methods. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
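
Since the survey centers on the structure of transformers, here is a minimal NumPy sketch of the scaled dot-product self-attention operation at the core of the models it reviews; the shapes and random inputs are illustrative only.

```python
# Minimal NumPy sketch of scaled dot-product self-attention, the core block of
# the transformer architectures surveyed above. Shapes/values are illustrative.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                 # context-mixed token representations

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                           # e.g., 6 residue/token embeddings
out = self_attention(X, *(rng.normal(size=(16, 8)) for _ in range(3)))
print(out.shape)  # (6, 8)
```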

Language: English

Citations

95

Contrastive learning in protein language space predicts interactions between drugs and protein targets
Rohit Singh, Samuel Sledzieski, Bryan D. Bryson, et al.

Proceedings of the National Academy of Sciences, Journal Year: 2023, Volume and Issue: 120(24)

Published: June 8, 2023

Sequence-based prediction of drug-target interactions has the potential to accelerate drug discovery by complementing experimental screens. Such computational prediction needs to be generalizable and scalable while remaining sensitive to subtle variations in the inputs. However, current techniques fail to simultaneously meet these goals, often sacrificing performance on one to achieve the others. We develop a deep learning model, ConPLex, successfully leveraging advances in pretrained protein language models ("PLex") and employing a protein-anchored contrastive coembedding ("Con") to outperform state-of-the-art approaches. ConPLex achieves high accuracy, broad adaptivity to unseen data, and specificity against decoy compounds. It makes predictions of binding based on the distance between learned representations, enabling predictions at the scale of massive compound libraries and the human proteome. Experimental testing of 19 kinase-drug interaction predictions validated 12 interactions, including four with subnanomolar affinity, plus a strongly binding EPHB1 inhibitor (KD = 1.3 nM). Furthermore, the embeddings are interpretable, which enables us to visualize the embedding space and use the embeddings to characterize the function of cell-surface proteins. We anticipate that ConPLex will facilitate efficient drug discovery by making highly scalable in silico screening feasible at the genome scale. ConPLex is available open source at https://ConPLex.csail.mit.edu.
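
As a rough illustration of the idea described above (distance-based binding prediction in a shared protein-drug co-embedding space with a protein-anchored contrastive objective), the following sketch uses two projection heads and a triplet-style loss. The dimensions, margin and architecture are assumptions for illustration, not ConPLex's actual implementation.

```python
# Illustrative sketch of distance-based drug-target scoring in a shared
# co-embedding space. All dimensions, the margin and the projection heads
# are assumptions, not ConPLex's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoEmbedder(nn.Module):
    def __init__(self, prot_dim=1024, drug_dim=2048, shared_dim=128):
        super().__init__()
        self.prot_proj = nn.Linear(prot_dim, shared_dim)  # maps a PLM protein embedding
        self.drug_proj = nn.Linear(drug_dim, shared_dim)  # maps a molecular fingerprint

    def score(self, prot_emb, drug_emb):
        p = F.normalize(self.prot_proj(prot_emb), dim=-1)
        d = F.normalize(self.drug_proj(drug_emb), dim=-1)
        return (p * d).sum(-1)  # cosine similarity; higher -> predicted binder

# Protein-anchored triplet loss sketch: pull a true binder toward its target
# protein, push a decoy compound away, with the protein as the anchor.
def contrastive_loss(model, prot, drug_pos, drug_neg, margin=0.5):
    pos, neg = model.score(prot, drug_pos), model.score(prot, drug_neg)
    return F.relu(margin - pos + neg).mean()

model = CoEmbedder()
prot, pos, neg = torch.randn(4, 1024), torch.randn(4, 2048), torch.randn(4, 2048)
print(contrastive_loss(model, prot, pos, neg).item())
```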

Language: English

Citations

92

Pre-trained Language Models in Biomedical Domain: A Systematic Survey
Benyou Wang, Qianqian Xie, Jiahuan Pei, et al.

ACM Computing Surveys, Journal Year: 2023, Volume and Issue: 56(3), P. 1 - 52

Published: Aug. 1, 2023

Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing tasks. This also benefits the biomedical domain: researchers from the informatics, medicine, and computer science communities propose various PLMs trained on various datasets, e.g., biomedical text, electronic health records, protein, and DNA sequences. However, the cross-discipline characteristics of biomedical PLMs hinder their spreading among communities; some existing works are isolated from each other without comprehensive comparison and discussion. It is nontrivial to make a survey that not only systematically reviews recent advances in biomedical PLMs and their applications but also standardizes terminology and benchmarks. This article summarizes recent progress of pre-trained language models in the biomedical domain and their applications in downstream biomedical tasks. Particularly, we discuss the motivations for PLMs in the biomedical domain and introduce the key concepts of pre-trained language models. We then propose a taxonomy of existing biomedical PLMs that categorizes them from various perspectives systematically. Plus, their applications in downstream biomedical tasks are exhaustively discussed, respectively. Last, we illustrate various limitations and future trends, which aims to provide inspiration for future research.

Language: English

Citations

89