Contrasting Sequence with Structure: Pre-training Graph Representations with PLMs DOI Creative Commons
Louis Robinson, Timothy Atkinson, Liviu Copoiu

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: Dec. 4, 2023

Abstract Understanding protein function is vital for drug discovery, disease diagnosis, and protein engineering. While Protein Language Models (PLMs) pre-trained on vast sequence datasets have achieved remarkable success, equivalent Protein Structure Models (PSMs) remain underrepresented. We attribute this to the relative lack of high-confidence structural data and of suitable pre-training objectives. In this context, we introduce BioCLIP, a contrastive learning framework that pre-trains PSMs by leveraging PLMs, generating meaningful per-residue and per-chain representations. When evaluated on tasks such as protein-protein interaction, Gene Ontology annotation, and Enzyme Commission number prediction, BioCLIP-trained PSMs consistently outperform models trained from scratch and further enhance performance when merged with sequence embeddings. Notably, BioCLIP approaches, or exceeds, specialized methods across all benchmarks using its singular design. Our work addresses the challenges of obtaining quality structural data and of designing self-supervised objectives, setting the stage for more comprehensive models of protein function. Source code is publicly available.
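A CLIP-style contrastive objective of the kind BioCLIP describes can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation; the per-chain embedding shapes and the temperature value are assumptions.

```python
import numpy as np

def clip_contrastive_loss(seq_emb: np.ndarray, struct_emb: np.ndarray,
                          tau: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    seq_emb, struct_emb: (batch, dim) arrays; row i of each matrix is
    assumed to come from the same protein chain (sequence vs. structure).
    """
    # L2-normalise so the dot product is a cosine similarity.
    seq = seq_emb / np.linalg.norm(seq_emb, axis=1, keepdims=True)
    struct = struct_emb / np.linalg.norm(struct_emb, axis=1, keepdims=True)
    logits = seq @ struct.T / tau          # (batch, batch) similarity matrix
    labels = np.arange(len(logits))        # matching pairs lie on the diagonal

    def log_softmax(x: np.ndarray) -> np.ndarray:
        x = x - x.max(axis=1, keepdims=True)   # numerical stability
        return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

    # Cross-entropy in both directions (sequence->structure and back).
    loss_s2t = -log_softmax(logits)[labels, labels].mean()
    loss_t2s = -log_softmax(logits.T)[labels, labels].mean()
    return float((loss_s2t + loss_t2s) / 2)
```

Minimising this loss pulls each structure embedding towards the PLM embedding of the same chain and away from the other chains in the batch, which is how a frozen PLM can supervise a structure encoder.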

Language: English

Learning from pre-pandemic data to forecast viral escape DOI Creative Commons
Nicole N. Thadani, Sarah F. Gurev, Pascal Notin

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2022, Volume and Issue: unknown

Published: July 22, 2022

Summary Effective pandemic preparedness relies on anticipating viral mutations that are able to evade host immune responses in order to facilitate vaccine and therapeutic design. However, current strategies for viral evolution prediction are not available early in a pandemic: experimental approaches require polyclonal antibodies to test against, and existing computational methods draw heavily from current strain prevalence to make reliable predictions of variants of concern. To address this, we developed EVEscape, a generalizable, modular framework that combines fitness predictions from a deep learning model of historical sequences with biophysical and structural information. EVEscape quantifies the escape potential of mutations at scale and has the advantage of being applicable before surveillance sequencing, experimental scans, or 3D structures of antibody complexes are available. We demonstrate that EVEscape, trained on sequences available prior to 2020, is as accurate as high-throughput experimental scans at anticipating pandemic variation for SARS-CoV-2, and that it generalizes to other viruses, including Influenza and HIV, as well as to understudied viruses such as Lassa and Nipah. We provide continually updated escape scores for all current SARS-CoV-2 strains, predict likely additional mutations, and forecast emerging strains as a tool for ongoing vaccine development ( evescape.org ).
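Combining fitness, antibody accessibility, and antigenic dissimilarity into one escape score can be illustrated with a minimal sketch. The logistic squashing and equal weighting here are assumptions for illustration, not EVEscape's exact published transform.

```python
import math

def escape_score(fitness: float, accessibility: float, dissimilarity: float) -> float:
    """Combine three standardised mutation-level components into one score.

    Each input is assumed to be z-scored across all candidate mutations.
    Components are squashed through a logistic function and their logs
    are summed, so a mutation must score well on *all three* axes
    (viable, antibody-accessible, antigenically novel) to rank highly.
    """
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    return sum(math.log(sigmoid(z)) for z in (fitness, accessibility, dissimilarity))
```

Because the log-probabilities are summed, a single very poor component (e.g. a mutation that destroys fitness) drags the total down regardless of the other two, which matches the intuition that escape variants must remain viable.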

Language: English

Citations: 10

Mapping Data to Deep Understanding: Making the Most of the Deluge of SARS-CoV-2 Genome Sequences DOI Creative Commons
Bahrad A. Sokhansanj, Gail Rosen

mSystems, Journal Year: 2022, Volume and Issue: 7(2)

Published: March 21, 2022

Next-generation sequencing has been essential to the global response to the COVID-19 pandemic. As of January 2022, nearly 7 million severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequences are available to researchers in public databases. Sequence databases are an abundant resource from which to extract biologically relevant and clinically actionable information. As the pandemic has gone on, SARS-CoV-2 has rapidly evolved, involving complex genomic changes that challenge current approaches to classifying variants. Deep sequence learning could be a potentially powerful way to build sequence-to-phenotype models. Unfortunately, while they can be predictive, deep learning typically produces "black box" models that cannot directly provide biological or clinical insight. Researchers should therefore consider implementing emerging methods for visualizing and interpreting these models. Finally, researchers must address important data limitations, including (i) sampling disparities, (ii) insufficient metadata, and (iii) screening artifacts due to poor quality control.

Language: English

Citations: 8

Interpretable and Predictive Deep Neural Network Modeling of the SARS-CoV-2 Spike Protein Sequence to Predict COVID-19 Disease Severity DOI Creative Commons
Bahrad A. Sokhansanj, Zhengqiao Zhao, Gail Rosen

et al.

Biology, Journal Year: 2022, Volume and Issue: 11(12), P. 1786 - 1786

Published: Dec. 8, 2022

Through the COVID-19 pandemic, SARS-CoV-2 has gained and lost multiple mutations in novel or unexpected combinations. Predicting how complex mutations affect disease severity is critical for planning public health responses as the virus continues to evolve. This paper presents a computational framework that complements conventional lineage classification and applies it to predict the severe-disease potential of viral genetic variation. The transformer-based neural network model architecture includes additional layers that provide sample embeddings and sequence-wide attention for interpretation and visualization. First, training on taxonomy validates the architecture's interpretability. Second, an interpretable predictive model is trained on spike protein sequences and patient metadata from GISAID. Confounding effects of changing demographics, increasing vaccination rates, and improving treatment over time are addressed by including demographics and case date as independent inputs to the model. The resulting model can be interpreted to identify potentially significant mutations and proves to be a robust predictive tool. Although trained on data obtained entirely before the availability of empirical data for Omicron, the model predicts Omicron's reduced risk of severe disease, in accord with epidemiological and experimental data.
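The sequence-wide attention used for interpretation can be sketched as a single-head scaled dot-product attention, reduced to a per-position importance profile. This is an illustrative NumPy sketch; the reduction to a per-residue profile is an assumption, not the paper's exact visualization pipeline.

```python
import numpy as np

def attention_profile(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention reduced to a per-position profile.

    Q, K: (seq_len, d) query/key projections for one attention head.
    Returns a (seq_len,) vector: how much total attention each residue
    position receives, which could be plotted along the spike sequence.
    """
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)                  # (L, L) raw attention logits
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over key positions
    return weights.sum(axis=0)                     # column sums: attention received
```

Peaks in such a profile highlight residues the model attends to heavily, which is one common way to surface candidate functionally significant mutations from a trained transformer.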

Language: English

Citations: 7

Mechanistic study of the transmission pattern of the SARS‐CoV‐2 omicron variant DOI Creative Commons
Ke An, Xianzhi Yang, Mengqi Luo

et al.

Proteins Structure Function and Bioinformatics, Journal Year: 2024, Volume and Issue: 92(6), P. 705 - 719

Published: Jan. 5, 2024

Abstract The omicron variant of severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2), characterized by 30 mutations in its spike protein, has spread rapidly worldwide since November 2021, significantly exacerbating the ongoing COVID‐19 pandemic. To investigate the relationship between these mutations and the variant's high transmissibility, we conducted a systematic analysis of their effect on spike–angiotensin‐converting enzyme‐2 (ACE2) interactions and explored the structural/energy correlation of key mutations, utilizing a reliable coarse‐grained model. Our study extended beyond the receptor‐binding domain (RBD) through comprehensive modeling of the full‐length spike trimer rather than just the RBD. The free‐energy calculation revealed that the enhanced binding affinity of the spike protein for the ACE2 receptor is correlated with the increased structural stability of the isolated spike protein, thus explaining the variant's heightened transmissibility. This conclusion was supported by our experimental analyses involving the expression and purification of the spike trimer. Furthermore, the energy decomposition established that electrostatic interactions make major contributions to this effect. We categorized the key mutations into four groups and developed an analytical framework that can be employed in studying future mutations. Additionally, our calculations rationalized the variant's reduced affinity towards most available therapeutic neutralizing antibodies, compared with the wild type. By providing concrete data and offering a solid explanation, this study contributes to a better understanding of both theories and observations and lays the foundation for future investigations.
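The free-energy decomposition referred to above follows the standard partition of binding free energy; a generic form is shown below (the paper's coarse-grained model may differ in detail).

```latex
% Binding free energy of the spike-ACE2 complex, partitioned into
% physical components (generic form, not the paper's exact Hamiltonian):
\Delta G_{\mathrm{bind}}
  = \Delta E_{\mathrm{elec}} + \Delta E_{\mathrm{vdW}}
  + \Delta G_{\mathrm{solv}} - T\Delta S

% Per-mutation effect, relative to the wild type:
\Delta\Delta G
  = \Delta G_{\mathrm{bind}}^{\mathrm{mut}}
  - \Delta G_{\mathrm{bind}}^{\mathrm{wt}}
```

In this framing, "electrostatics make major contributions" means the $\Delta E_{\mathrm{elec}}$ term dominates the mutation-induced change $\Delta\Delta G$.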

Language: English

Citations: 1

Contrasting Sequence with Structure: Pre-training Graph Representations with PLMs DOI Creative Commons
Louis Robinson, Timothy Atkinson, Liviu Copoiu

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: Dec. 4, 2023

Abstract Understanding protein function is vital for drug discovery, disease diagnosis, and protein engineering. While Protein Language Models (PLMs) pre-trained on vast sequence datasets have achieved remarkable success, equivalent Protein Structure Models (PSMs) remain underrepresented. We attribute this to the relative lack of high-confidence structural data and of suitable pre-training objectives. In this context, we introduce BioCLIP, a contrastive learning framework that pre-trains PSMs by leveraging PLMs, generating meaningful per-residue and per-chain representations. When evaluated on tasks such as protein-protein interaction, Gene Ontology annotation, and Enzyme Commission number prediction, BioCLIP-trained PSMs consistently outperform models trained from scratch and further enhance performance when merged with sequence embeddings. Notably, BioCLIP approaches, or exceeds, specialized methods across all benchmarks using its singular design. Our work addresses the challenges of obtaining quality structural data and of designing self-supervised objectives, setting the stage for more comprehensive models of protein function. Source code is publicly available.

Language: English

Citations: 3