GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction DOI Creative Commons
Gonzalo Benegas, Carlos Albors, Alan J. Aw

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: Oct. 11, 2023

Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC, OMIM), experimental functional assays (DMS, DepMap), and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and non-coding variants.

Language: English

A DNA language model based on multispecies alignment predicts the effects of genome-wide variants DOI
Gonzalo Benegas, Carlos Albors, Alan J. Aw

et al.

Nature Biotechnology, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 2, 2025

Language: English

Citations

5

Scientific Large Language Models: A Survey on Biological & Chemical Domains DOI Open Access
Qiang Zhang, Keyan Ding, Tingting Lv

et al.

ACM Computing Surveys, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 26, 2025

Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered for facilitating scientific discovery. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of “scientific language”, whilst providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions along with the advances of LLMs. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.

Language: English

Citations

3

Engineering of highly active and diverse nuclease enzymes by combining machine learning and ultra-high-throughput screening DOI Creative Commons
Neil Thomas, David Belanger, Chenling Xu

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: March 24, 2024

Optimizing enzymes to function in novel chemical environments is a central goal of synthetic biology, but optimization is often hindered by a rugged, expansive protein search space and costly experiments. In this work, we present TeleProt, an ML framework that blends evolutionary and experimental data to design diverse protein variant libraries, and employ it to improve the catalytic activity of a nuclease enzyme that degrades biofilms that accumulate on chronic wounds. After multiple rounds of high-throughput experiments using both TeleProt and standard directed evolution (DE) approaches in parallel, we find that our approach found a significantly better top-performing enzyme variant than DE, had a better hit rate at finding diverse, high-activity variants, and was even able to design a high-performance initial library using no prior experimental data. We have released a dataset of 55K nuclease variants, one of the most extensive genotype-phenotype landscapes to date, to drive further progress in ML-guided design.

Language: English

Citations

11

Variant effect predictor correlation with functional assays is reflective of clinical classification performance DOI Creative Commons
Benjamin Livesey, Joseph A. Marsh

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: May 14, 2024

Understanding the relationship between protein sequence and function is crucial for accurate genetic variant classification. Variant effect predictors (VEPs) play a vital role in deciphering this complex relationship, yet evaluating their performance remains challenging due to data circularity, where the same or related data is used for training and assessment. High-throughput experimental strategies like deep mutational scanning (DMS) offer a promising solution. In this study, we extend upon our previous benchmarking approach, assessing the performance of 84 different VEPs using DMS experiments from 36 human proteins. In addition, a new pairwise, VEP-centric ranking method reduces the impact of VEP score availability on the overall ranking. We observe a remarkably high correspondence between VEP performance in DMS-based benchmarks and clinical variant classification, especially for predictors that have not been directly trained on human clinical variants. Our results suggest that comparing VEP performance against diverse functional assays represents a reliable strategy for assessing their relative performance in clinical variant classification. However, major challenges in the clinical interpretation of VEP scores persist, highlighting the need for further research to fully leverage computational predictors for genetic diagnosis. We also address practical considerations for end users in terms of the choice of methodology.

Language: English

Citations

9

Genetic constraint at single amino acid resolution in protein domains improves missense variant prioritisation and gene discovery DOI Creative Commons
Xiaolei Zhang, Pantazis Theotokis, Nicholas Li

et al.

Genome Medicine, Journal Year: 2024, Volume and Issue: 16(1)

Published: July 11, 2024

Background: One of the major hurdles in clinical genetics is interpreting the clinical consequences associated with germline missense variants in humans. Recent significant advances have leveraged natural variation observed in large-scale human populations to uncover genes or genomic regions that show a depletion of natural variation, indicative of selection pressure. We refer to this as “genetic constraint”. Although existing genetic constraint metrics have been demonstrated to be successful in prioritising genes or genomic regions associated with diseases, their spatial resolution is limited in distinguishing pathogenic variants from benign variants within genes. Methods: We aim to identify missense variants that are significantly depleted in the general human population. Given the size of currently available human populations with exome or genome sequencing data, it is not possible to directly detect depletion of individual missense variants, since the average expected number of observations of a variant at most positions is less than one. We instead focus on protein domains, grouping homologous variants with similar functional impacts to examine the depletion of natural variations within these comparable sets. To accomplish this, we develop the Homologous Missense Constraint (HMC) score. We utilise the Genome Aggregation Database (gnomAD) 125 K exome sequencing data and evaluate genetic constraint at quasi amino-acid resolution by combining signals across protein homologues. Results: We identify one million possible missense variants under strong negative selection within protein domains. Though our approach annotates only protein domains, it nonetheless allows us to assess 22% of the exome confidently. It precisely distinguishes pathogenic variants from benign variants for both early-onset and adult-onset disorders. It outperforms existing constraint metrics and pathogenicity meta-predictors in prioritising de novo mutations from probands with developmental disorders (DD). It is also methodologically independent from these, adding power to predict variant pathogenicity when used in combination. We demonstrate utility for gene discovery by identifying seven genes newly significantly associated with DD that could act through an altered-function mechanism. Conclusions: Grouping variants of comparable functional impacts is effective in evaluating their genetic constraint. HMC is a novel and accurate predictor of missense consequence for improved variant interpretation.

Language: English

Citations

9

A general temperature-guided language model to design proteins of enhanced stability and activity DOI Creative Commons
Fan Jiang, Mingchen Li, Jiajun Dong

et al.

Science Advances, Journal Year: 2024, Volume and Issue: 10(48)

Published: Nov. 27, 2024

Designing protein mutants with both high stability and activity is a critical yet challenging task in protein engineering. Here, we introduce PRIME, a deep learning model, which can suggest protein mutants with improved stability and activity without any prior experimental mutagenesis data for the specified protein. Leveraging temperature-aware language modeling, PRIME demonstrated superior predictive ability compared to current state-of-the-art models on the public mutagenesis dataset across 283 protein assays. Furthermore, we validated PRIME’s predictions on five proteins, examining the impact of the top 30 to 45 single-site mutations on various protein properties, including thermal stability, antigen-antibody binding affinity, and the ability to polymerize nonnatural nucleic acid or resilience to extreme alkaline conditions. More than 30% of PRIME-recommended mutants exhibited superior performance compared to their premutation counterparts across all proteins and desired properties. We developed an efficient and effective method based on PRIME to rapidly obtain multisite mutants with enhanced activity and stability. Hence, PRIME demonstrates broad applicability in protein engineering.

Language: English

Citations

9

Are protein language models the new universal key? DOI Creative Commons
Konstantin Weißenow, Burkhard Rost

Current Opinion in Structural Biology, Journal Year: 2025, Volume and Issue: 91, P. 102997 - 102997

Published: Feb. 7, 2025

Protein language models (pLMs) capture some aspects of the grammar of the language of life as written in protein sequences. The so-called pLM embeddings implicitly contain this information. Therefore, embeddings can serve as the exclusive input into downstream supervised methods for protein prediction. Over the last 33 years, evolutionary information extracted through simple averaging for specific protein families from multiple sequence alignments (MSAs) has been the most successful universal key to the success of protein prediction. For many applications, MSA-free pLM-based predictions now have become significantly more accurate. The reason for this is often a combination of two aspects. Firstly, embeddings condense the grammar so efficiently that downstream prediction methods succeed with small models, i.e., they need few free parameters, in particular in the era of exploding deep neural networks. Secondly, pLM-based methods provide protein-specific solutions. As an additional benefit, once the pLM pre-training is complete, pLM-based solutions tend to consume far fewer resources than MSA-based solutions. In fact, we appeal to the community to optimize foundation models rather than retrain new ones, and to evolve incentives for solutions that require fewer resources even at some loss in accuracy. Although pLMs have not yet succeeded in entirely replacing the body of solutions developed over three decades, they clearly are rapidly advancing as the universal key for protein prediction.

Language: English

Citations

1

Genome modeling and design across all domains of life with Evo 2 DOI Creative Commons
Garyk Brixi, Matthew G. Durrant, Ja‐Lok Ku

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 21, 2025

All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation (from noncoding pathogenic mutations to clinically significant BRCA1 variants) without task-specific finetuning. Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. Beyond its predictive capabilities, Evo 2 generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Guiding Evo 2 via inference-time search enables controllable generation of epigenomic structure, for which we demonstrate the first inference-time scaling results in biology. We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.

Language: English

Citations

1

Teaching AI to speak protein DOI Creative Commons
Michael Heinzinger, Burkhard Rost

Current Opinion in Structural Biology, Journal Year: 2025, Volume and Issue: 91, P. 102986 - 102986

Published: Feb. 21, 2025

Language: English

Citations

1

Predicting absolute protein folding stability using generative models DOI Creative Commons
Matteo Cagiada, Sergey Ovchinnikov, Kresten Lindorff‐Larsen

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: March 15, 2024

While there has been substantial progress in our ability to predict changes in protein stability due to amino acid substitutions, progress has been slower in methods to predict the absolute stability of a protein. Here we show how a generative model for protein sequence can be leveraged to predict absolute protein stability. We benchmark our predictions across a broad set of proteins and find a mean error of 1.5 kcal/mol and a correlation coefficient of 0.7 for the absolute stability across a range of natural, small–medium sized proteins up to ca. 150 amino acid residues. We analyse current limitations and future directions, including how such a model may be useful for predicting conformational free energies. Our approach is simple to use and freely available via an online implementation.

Language: English

Citations

7