Do protein language models learn phylogeny?
Sanjana Tule, Gabriel Foley, Mikael Bodén

et al.

Briefings in Bioinformatics, Year: 2024, Issue: 26(1)

Published: Nov. 22, 2024

Abstract Deep machine learning demonstrates a capacity to uncover evolutionary relationships directly from protein sequences, in effect internalising notions inherent to classical phylogenetic tree inference. We connect these two paradigms by assessing the capacity of protein-based language models (pLMs) to discern phylogenetic relationships without being explicitly trained to do so. We evaluate ESM2, ProtTrans, and MSA-Transformer relative to classical methods, while also considering sequence insertions and deletions (indels) across 114 Pfam datasets. The largest ESM2 model tends to outperform other pLMs (including the multimodal ESM3) at recovering relationships among homologous sequences in both low- and high-gap settings. pLMs agree with conventional methods in general, but more so for families with fewer implied indels, highlighting indels as a key factor differentiating classical phylogenetics from pLMs. We find that pLMs preferentially capture broader as opposed to finer relationships within a specific protein family, where ESM2 has a sweet spot for highly divergent sequences at a remote distance. Less than 10% of neurons are sufficient to broadly recapitulate inter-sequence distances; when used in isolation, the difference between pLMs and conventional methods is further diminished. We show that neurons are polysemantic, shared among different protein families but never fully overlapping. We highlight the potential of pLMs as a complementary tool for phylogenetic analysis, especially for extending analyses to homologs that are difficult to align and imply complex histories of insertions and deletions. Implementations of the analyses are available at https://github.com/santule/pLMEvo.
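The core comparison in this abstract, relating distances implied by pLM embeddings to distances from conventional phylogenetic or alignment-based methods, can be illustrated with a short sketch. A minimal example, assuming per-sequence embeddings have already been obtained (e.g. by mean-pooling per-residue representations from a model such as ESM2) and that a reference distance matrix is available from a tree or alignment; the arrays below are synthetic stand-ins, not the paper's data or pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic stand-ins: 20 homologous sequences, 1280-dim mean-pooled pLM embeddings,
# and a reference (e.g. patristic or alignment-based) distance matrix.
embeddings = rng.normal(size=(20, 1280))
reference_dist = squareform(pdist(rng.normal(size=(20, 5))))  # placeholder "tree" distances

# Pairwise cosine distances in embedding space.
embedding_dist = squareform(pdist(embeddings, metric="cosine"))

# Compare the two distance matrices over the upper triangle only.
iu = np.triu_indices(20, k=1)
rho, _ = spearmanr(embedding_dist[iu], reference_dist[iu])
print(f"Spearman correlation between embedding and reference distances: {rho:.3f}")
```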

Language: English

Protein sequence modelling with Bayesian flow networks
Timothy Atkinson, Thomas D. Barrett, Scott Cameron

et al.

Nature Communications, Year: 2025, Issue: 16(1)

Published: April 3, 2025

Language: English

Cited

0

Scaling unlocks broader generation and deeper functional understanding of proteins

Aadyot Bhatnagar, Sarthak Jain, Joel Beazer

et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2025, Issue: unknown

Published: April 16, 2025

Abstract Generative protein language models (PLMs) are powerful tools for designing proteins purpose-built to solve problems in medicine, agriculture, and industrial processes. Recent work has trained ever larger models, but there has been little systematic study of the optimal training distributions or of the influence of model scale on the sequences generated by PLMs. We introduce the ProGen3 family of sparse generative PLMs, and we develop compute-optimal scaling laws up to a 46B-parameter model pre-trained on 1.5T amino acid tokens. ProGen3's pre-training data is sampled from an optimized distribution over the Profluent Protein Atlas v1, a carefully curated dataset of 3.4B full-length proteins. We evaluate for the first time in the wet lab and find that larger models generate viable proteins across a much wider diversity of protein families. Finally, larger models are both computationally and experimentally more responsive to alignment with laboratory data, resulting in improved fitness prediction and sequence generation capabilities. These results indicate that PLMs like ProGen3-46B trained on larger, well-curated datasets are foundation models that push the frontier of protein design.
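Compute-optimal scaling laws of the kind mentioned here are typically obtained by fitting a power law relating held-out loss to training compute across a sweep of model sizes. A minimal, generic sketch of such a fit with synthetic points standing in for real measurements; this is not ProGen3's actual procedure, data, or functional form.

```python
import numpy as np

# Synthetic (compute, loss) measurements standing in for a model-size sweep.
compute = np.array([1e19, 1e20, 1e21, 1e22, 1e23])   # training FLOPs
loss = 2.0 * compute ** -0.05 + 0.4                   # fabricated loss curve

# Fitting loss ≈ a * compute^b + c is nonlinear; a common simplification is to fit
# the reducible loss in log-log space, assuming the irreducible term c is known or
# estimated separately (here we simply subtract the fabricated asymptote).
irreducible = 0.4
slope, intercept = np.polyfit(np.log(compute), np.log(loss - irreducible), deg=1)
print(f"fitted exponent b ≈ {slope:.3f}, coefficient a ≈ {np.exp(intercept):.3f}")

# Extrapolate the reducible loss to a larger compute budget.
target_compute = 1e24
predicted_loss = np.exp(intercept) * target_compute ** slope + irreducible
print(f"predicted loss at {target_compute:.0e} FLOPs: {predicted_loss:.3f}")
```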

Language: English

Cited

0

Fine-Tuning Protein Language Models Unlocks the Potential of Underrepresented Viral Proteomes

Rajan Sawhney, Barbra D. Ferrell, Thibaut Dejean

et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2025, Issue: unknown

Published: April 23, 2025

Abstract Protein language models (pLMs) have revolutionized computational biology by generating rich protein vector representations, or embeddings, enabling major advancements in de novo protein design, structure prediction, variant effect analysis, and evolutionary studies. Despite these breakthroughs, current pLMs often exhibit biases against proteins from underrepresented species, with viral proteins being particularly affected; these are frequently referred to as the “dark matter” of the biological world due to their vast diversity and ubiquity, yet sparse representation in training datasets. Here, we show that fine-tuning pre-trained pLMs on viral protein sequences, using diverse learning frameworks and parameter-efficient strategies, significantly enhances embedding quality and improves performance on downstream tasks. To support further research, we provide source code for benchmarking embedding quality. By enabling more accurate modeling of viral proteins, our approach advances tools for understanding viral biology, combating emerging infectious diseases, and driving biotechnological innovation.
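Parameter-efficient fine-tuning of the kind mentioned here is often implemented by freezing the pre-trained weights and learning small low-rank update matrices (LoRA). A minimal PyTorch sketch of a LoRA-wrapped linear layer under that assumption; the authors' actual fine-tuning setup may differ, and the dimensions below are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze pre-trained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)         # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Illustrative use: wrap one projection of a (hypothetical) pre-trained pLM block.
pretrained_proj = nn.Linear(1280, 1280)            # stand-in for a pLM weight matrix
adapted = LoRALinear(pretrained_proj, rank=8)
trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")        # only the low-rank factors
```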

Language: English

Cited

0

VespaG: Expert-guided protein Language Models enable accurate and blazingly fast fitness prediction
Céline Marquet, Julius Schlensok, Marina Abakarova

et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown

Published: April 28, 2024

Exhaustive experimental annotation of the effect of all known protein variants remains daunting and expensive, stressing the need for scalable predictions. We introduce VespaG, a blazingly fast missense amino acid variant effect predictor, leveraging protein Language Model (pLM) embeddings as input to a minimal deep learning model. To overcome the sparsity of training data, we created a dataset of 39 million single amino acid variants from the human proteome by applying the multiple sequence alignment-based predictor GEMME as a pseudo standard-of-truth. This setup increases interpretability compared to the baseline pLM and is easily retrainable with novel or updated pLMs. Assessed against the ProteinGym benchmark (217 multiplexed assays of variant effect, MAVE, with 2.5 million variants), VespaG achieved a mean Spearman correlation of 0.48 +/- 0.02, matching top-performing methods evaluated on the same data. VespaG has the advantage of being orders of magnitude faster, predicting the mutational landscapes of all proteins in proteomes such as Homo sapiens and Drosophila melanogaster in under 30 minutes on a consumer laptop (12-core CPU, 16 GB RAM).
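The VespaG setup, a small supervised model trained on pLM embeddings with GEMME scores as pseudo-labels, can be sketched generically. A minimal sketch assuming per-variant embeddings and pseudo-labels are already available (synthetic tensors below); this is not the authors' actual architecture, loss, or hyperparameters.

```python
import torch
import torch.nn as nn

rng = torch.Generator().manual_seed(0)

# Synthetic stand-ins: 10,000 variants, each represented by a 1280-dim pLM
# embedding, with a GEMME-style scalar pseudo-label.
X = torch.randn(10_000, 1280, generator=rng)
y = torch.randn(10_000, 1, generator=rng)

# A minimal regression head over pLM embeddings.
head = nn.Sequential(nn.Linear(1280, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(5):
    opt.zero_grad()
    loss = loss_fn(head(X), y)
    loss.backward()
    opt.step()
    print(f"epoch {epoch}: MSE {loss.item():.4f}")
```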

Language: English

Cited

3

Improvements in viral gene annotation using large language models and soft alignments
William L. Harrigan, Barbra D. Ferrell, K. Eric Wommack

et al.

BMC Bioinformatics, Year: 2024, Issue: 25(1)

Published: April 25, 2024

Abstract Background The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological gap by annotating protein sequences based on their embeddings. Results Central to our contribution is the soft alignment algorithm, drawing from traditional alignment but leveraging embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method not only surpasses pooled embedding-based models in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, the approach provides transparent, BLAST-like visualizations, combining biological research with AI advancements to elevate annotation through embedding-based analysis while ensuring interpretability. Tests on Virus Orthologous Groups and ViralZone indicated that the method recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect. Conclusion Annotation based on embeddings shows the great potential of LLMs for enhancing sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference.
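The key idea, replacing a fixed substitution matrix with per-residue embedding similarity inside a standard dynamic-programming alignment, can be sketched compactly. Below is a simplified Smith-Waterman-style local alignment scored by cosine similarity between residue embeddings; it illustrates the general technique rather than the paper's exact algorithm, and the embeddings are synthetic.

```python
import numpy as np

def soft_local_alignment(emb_a: np.ndarray, emb_b: np.ndarray, gap: float = 0.5) -> float:
    """Smith-Waterman-style score using cosine similarity of residue embeddings
    in place of a substitution matrix. Returns the best local alignment score."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                                     # (len_a, len_b) cosine similarities

    n, m = sim.shape
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(
                0.0,
                H[i - 1, j - 1] + sim[i - 1, j - 1],  # (mis)match scored by similarity
                H[i - 1, j] - gap,                    # gap in sequence b
                H[i, j - 1] - gap,                    # gap in sequence a
            )
    return float(H.max())

# Synthetic per-residue embeddings for two short sequences (lengths 40 and 55).
rng = np.random.default_rng(1)
score = soft_local_alignment(rng.normal(size=(40, 1280)), rng.normal(size=(55, 1280)))
print(f"soft alignment score: {score:.2f}")
```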

Language: English

Cited

2

Beware of Data Leakage from Protein LLM Pretraining
Leon Hermann, Tobias Fiedler, Hoang An Nguyen

et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown

Published: July 24, 2024

Abstract Pretrained protein language models are becoming increasingly popular as a backbone for protein property inference tasks such as structure prediction or function annotation, accelerating biological research. However, related research oftentimes does not consider the effects of data leakage from pretraining on the actual downstream task, resulting in potentially unrealistic performance estimates. Reported generalization might not necessarily be reproducible for proteins highly dissimilar to the pretraining set. In this work, we measure the effects of data leakage from model pretraining in the domain of thermostability prediction. Specifically, we compare two different dataset split strategies: a pretraining-aware split, designed to avoid sequence similarity between the pretraining data and held-out test sets, and a commonly-used naive split relying on clustering the training data of the downstream task without taking the pretraining data into account. Our experiments suggest that data leakage has a consistent effect on melting point prediction across all experiments, distorting the measured performance. The source code and our data splits are available at https://github.com/tfiedlerdev/pretraining-aware-hotprot.
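A pretraining-aware split of the kind described here amounts to removing (or separately reporting) test proteins that are too similar to any sequence seen during pretraining. A minimal sketch using a k-mer Jaccard similarity as a cheap stand-in for the sequence-similarity criterion; the study's actual similarity measure and threshold are assumptions here.

```python
def kmer_set(seq: str, k: int = 3) -> set[str]:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def max_jaccard(seq: str, reference: list[str], k: int = 3) -> float:
    """Highest k-mer Jaccard similarity between seq and any reference sequence."""
    q = kmer_set(seq, k)
    best = 0.0
    for ref in reference:
        r = kmer_set(ref, k)
        best = max(best, len(q & r) / len(q | r))
    return best

# Toy data: pretraining corpus and a candidate downstream test set.
pretraining = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MSLNFLDFEQPIAELEAKIDSL"]
test_candidates = ["MKTAYIAKQRQISFVKSHFSRQLE",  # near-duplicate of a pretraining sequence
                   "MAHHHHHHVGTENLYFQGSGS"]      # dissimilar

threshold = 0.5  # assumed similarity cut-off
aware_test = [s for s in test_candidates if max_jaccard(s, pretraining) < threshold]
print(f"kept {len(aware_test)} of {len(test_candidates)} test sequences")
```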

Language: English

Cited

2

Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation

Pranav Kantroo, Günter P. Wagner, Benjamin B. Machta

et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown

Published: July 13, 2024

Protein language models trained on the masked language modeling objective learn to predict the identity of hidden amino acid residues within a sequence using the remaining observable residues as context. They do so by embedding the residues into a high dimensional space that encapsulates the relevant contextual cues. These embedding vectors serve as an informative context-sensitive representation that not only aids with the defined training objective, but can also be used for other tasks by downstream models. We propose a scheme to use the embeddings of an unmasked sequence to estimate the corresponding masked probabilities for all positions in a single forward pass through the model. This One Fell Swoop (OFS) approach allows us to efficiently estimate the pseudo-perplexity of a sequence, a measure of the model's uncertainty in its predictions, that can also be used as a fitness estimate. We find that ESM2 OFS performs nearly as well as the true pseudo-perplexity at fitness estimation, and more notably it defines a new state of the art on the ProteinGym Indels benchmark. The strong performance prompted us to investigate if the measure could detect the elevated stability reported for reconstructed ancestral sequences. We find that it ranks ancestral reconstructions as more fit than extant sequences. Finally, we show that the computational efficiency of the technique enables Monte Carlo methods to rapidly explore functional sequence space.
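The distinction between the true pseudo-perplexity and a single-pass estimate can be made concrete with a small sketch. Assuming a masked language model exposed as a function that returns per-position log-probability matrices (the `logprobs` callable below is a hypothetical stand-in backed by random numbers), the true version re-runs the model once per masked position, while the one-pass version reads everything off a single unmasked forward pass.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(0)

def logprobs(seq: str, masked_pos: int | None = None) -> np.ndarray:
    """Hypothetical model interface: per-position log-probabilities over the 20
    amino acids, optionally with one position masked. Random stand-in here,
    which ignores the mask argument."""
    logits = rng.normal(size=(len(seq), len(AA)))
    return logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)

def true_pseudo_perplexity(seq: str) -> float:
    # One forward pass per position, each with that position masked.
    lls = [logprobs(seq, masked_pos=i)[i, AA.index(aa)] for i, aa in enumerate(seq)]
    return float(np.exp(-np.mean(lls)))

def ofs_pseudo_perplexity(seq: str) -> float:
    # Single forward pass on the unmasked sequence; read off all positions at once.
    lp = logprobs(seq)
    lls = [lp[i, AA.index(aa)] for i, aa in enumerate(seq)]
    return float(np.exp(-np.mean(lls)))

seq = "MKTAYIAKQRQISFVKSHFSRQ"
print(true_pseudo_perplexity(seq), ofs_pseudo_perplexity(seq))
```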

Language: English

Cited

1

SeqDance: A Protein Language Model for Representing Protein Dynamic Properties
Chao Hou, Yufeng Shen

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown

Published: Oct. 15, 2024

Abstract Proteins perform their functions by folding amino acid sequences into dynamic structural ensembles. Despite the important role of protein dynamics, its complexity and the absence of efficient representation methods have limited the integration of dynamics into studies on protein function and mutation fitness, especially in deep learning applications. To address this, we present SeqDance, a protein language model designed to learn dynamic properties directly from sequence alone. SeqDance is pre-trained on biophysical properties derived from over 30,400 molecular dynamics trajectories and 28,600 normal mode analyses. Our results show that SeqDance effectively captures local dynamic interactions, co-movement patterns, and global conformational features, even for proteins lacking homologs in the pre-training set. Additionally, we showed that SeqDance enhances the prediction of protein fitness landscapes, disorder-to-order transition binding regions, and phase-separating proteins. By learning dynamic properties from sequence, SeqDance complements conventional evolution- and static structure-based methods, offering new insights into protein behavior and function.

Language: English

Cited

1

Functional protein mining with conformal guarantees
Ron Boger, Seyone Chithrananda, Anastasios N. Angelopoulos

et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown

Published: June 28, 2024

Abstract Molecular structure prediction and homology detection provide a promising path to discovering new protein function and evolutionary relationships. However, current approaches lack statistical reliability assurances, limiting their practical utility for selecting proteins for further experimental and in-silico characterization. To address this challenge, we introduce a novel approach to protein search leveraging principles from conformal prediction, offering a framework that ensures statistical guarantees with user-specified risk and provides calibrated probabilities (rather than raw ML scores) for any model. Our method (1) lets users select from many biologically-relevant loss metrics (i.e. false discovery rate) and assigns reliable probabilities for annotating genes of unknown function; (2) achieves state-of-the-art performance in enzyme classification without training new models; and (3) robustly and rapidly pre-filters proteins for computationally intensive structural alignment algorithms. Our approach enhances the statistical reliability of protein search and enables the selection of proteins with likely desirable properties.
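The general mechanism, calibrating a score threshold on labeled data so that a user-specified risk such as the false discovery rate is controlled on new queries, can be illustrated with a short sketch. This is a simplified split-calibration illustration of risk-controlled thresholding, not the authors' exact procedure; scores and labels are synthetic, and formal risk control would add a finite-sample correction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic calibration set: similarity scores for candidate hits plus ground-truth
# labels (1 = true functional match, 0 = false hit). Higher score = more confident.
scores = np.concatenate([rng.normal(2.0, 1.0, 500), rng.normal(0.0, 1.0, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])

def empirical_fdr(threshold: float) -> float:
    selected = scores >= threshold
    if selected.sum() == 0:
        return 0.0
    return 1.0 - labels[selected].mean()

# Choose the most permissive threshold whose empirical FDR on calibration data
# stays below the user-specified risk level alpha (simplified criterion).
alpha = 0.1
candidates = np.sort(scores)
threshold = min((t for t in candidates if empirical_fdr(t) <= alpha), default=candidates[-1])
print(f"calibrated threshold: {threshold:.2f}, empirical FDR: {empirical_fdr(threshold):.3f}")
```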

Language: English

Cited

0

Benchmarking text-integrated protein language model embeddings and embedding fusion on diverse downstream tasks
Young Su Ko, Jonathan Parkinson, Wei Wang

et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown

Published: Aug. 26, 2024

Abstract Protein language models (pLMs) have traditionally been trained in an unsupervised manner using large protein sequence databases with an autoregressive or masked-language modeling training paradigm. Recent methods have attempted to enhance pLMs by integrating additional information, in the form of text, and are referred to as “text+protein” language models (tpLMs). We evaluate and compare six tpLMs (OntoProtein, ProteinDT, ProtST, ProteinCLIP, ProTrek, ESM3) against ESM2, a baseline text-free pLM, across downstream tasks designed to assess the learned protein representations. We find that while tpLMs outperform ESM2 in five out of six benchmarks, no tpLM was consistently the best. Thus, we additionally investigate the potential of embedding fusion, exploring whether combinations of embeddings can improve performance on the benchmarks by exploiting the strengths of multiple tpLMs. We find that fused embeddings outperform any single embedding, highlighting its potential as a useful strategy in the field of machine-learning for proteins. To facilitate the practical application of embedding fusion, we outline a heuristic framework to efficiently identify the optimal combination of embeddings, reducing the exponential time complexity of an exhaustive search down to a manageable linear complexity. Using our fusion framework, we achieve state-of-the-art performances on the protein-protein interaction prediction and homologous sequence recovery tasks without any specific model adjustments or hyperparameter tuning. Our experiments suggest embedding fusion is a useful tool in the machine-learning for proteins toolbox. Lastly, this study highlights potential future research strategies for maximizing the utility of pLMs.
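The heuristic the authors allude to, avoiding an exhaustive search over all 2^n embedding subsets, is in the spirit of greedy forward selection: add one embedding at a time and keep it only if the validation score improves, which costs a linear number of evaluations. A generic sketch under that interpretation, with a hypothetical `evaluate` function and synthetic embeddings standing in for real tpLM outputs; this is not necessarily the authors' exact framework.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_proteins = 300
y = rng.integers(0, 2, size=n_proteins)                     # synthetic binary task

# Synthetic stand-ins for per-protein embeddings from several pLMs/tpLMs.
embeddings = {name: rng.normal(size=(n_proteins, 64)) for name in
              ["ESM2", "ProtST", "ProTrek", "ProteinCLIP"]}

def evaluate(names: list[str]) -> float:
    """Hypothetical scoring function: cross-validated accuracy of a linear probe
    on the concatenation (fusion) of the selected embeddings."""
    X = np.concatenate([embeddings[n] for n in names], axis=1)
    return cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3).mean()

# Greedy forward selection: linear in the number of embeddings, instead of
# evaluating all 2^n possible combinations.
selected, best = [], -np.inf
for name in embeddings:
    score = evaluate(selected + [name])
    if score > best:
        selected, best = selected + [name], score
print(f"selected fusion: {selected}, cv accuracy: {best:.3f}")
```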

Language: English

Cited

0