Do protein language models learn phylogeny? (Creative Commons)
Sanjana Tule, Gabriel Foley, Mikael Bodén, et al.

Briefings in Bioinformatics, Year: 2024, Issue: 26(1)

Published: Nov. 22, 2024

Deep machine learning demonstrates a capacity to uncover evolutionary relationships directly from protein sequences, in effect internalising notions inherent to classical phylogenetic tree inference. We connect these two paradigms by assessing the capacity of protein-based language models (pLMs) to discern phylogenetic relationships without being explicitly trained to do so. We evaluate ESM2, ProtTrans, and MSA-Transformer relative to classical phylogenetic methods, while also considering sequence insertions and deletions (indels) across 114 Pfam datasets. The largest ESM2 model tends to outperform other pLMs (including the multimodal ESM3) in recovering relationships among homologous sequences in both low- and high-gap settings. pLMs agree with conventional methods in general, but more so for protein families with fewer implied indels, highlighting indels as a key factor differentiating classical phylogenetics from pLMs. We find that pLMs preferentially capture broader as opposed to finer relationships within a specific protein family, where ESM2 has a sweet spot for highly divergent sequences at remote distance. Less than 10% of neurons are sufficient to broadly recapitulate evolutionary distances; when used in isolation, the difference between pLMs is further diminished. We show that these neurons are polysemantic, shared among different homologous families but never fully overlapping. We highlight the potential of pLMs as a complementary tool for phylogenetic analysis, especially when extending to remote homologs that are difficult to align and imply complex histories of insertions and deletions. Implementations and analyses are available at https://github.com/santule/pLMEvo.

Language: English
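The comparison described in the abstract above (distances between pLM embeddings versus distances implied by a phylogenetic tree) can be illustrated with a minimal sketch. This is not the authors' pipeline, which is in their repository; the embeddings, the tree distances, and the function names `pairwise_distances` and `pearson` are all hypothetical toy values chosen for illustration, and Euclidean distance with Pearson correlation is only one of several reasonable choices.

```python
import math
from itertools import combinations


def pairwise_distances(embeddings):
    """Euclidean distance for every unordered pair of sequence ids (hypothetical helper)."""
    return {
        (a, b): math.dist(embeddings[a], embeddings[b])
        for a, b in combinations(sorted(embeddings), 2)
    }


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy data (assumed): per-sequence embeddings, e.g. mean-pooled pLM outputs,
# and patristic distances read off a reference tree for the same sequences.
embeddings = {"seqA": [0.0, 0.0], "seqB": [1.0, 0.0], "seqC": [4.0, 3.0]}
tree_dist = {("seqA", "seqB"): 0.1, ("seqA", "seqC"): 0.6, ("seqB", "seqC"): 0.5}

emb_dist = pairwise_distances(embeddings)
pairs = sorted(tree_dist)
r = pearson([emb_dist[p] for p in pairs], [tree_dist[p] for p in pairs])
print(f"correlation between embedding and tree distances: {r:.3f}")
```

A high correlation on real data would suggest that the embedding space tracks evolutionary distance for that family; with only three toy sequences the number itself carries no meaning.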

Rapid in silico directed evolution by a protein language model with EVOLVEpro
Kaiyi Jiang, Zhaoqing Yan, Matteo Di Bernardo, et al.

Science, Year: 2024, Issue: unknown

Published: Nov. 21, 2024

Directed protein evolution is central to biomedical applications but faces challenges like experimental complexity, inefficient multi-property optimization, and local maxima traps. While in silico methods using protein language models (PLMs) can provide modeled fitness landscape guidance, they struggle to generalize across diverse protein families and map to protein activity. We present EVOLVEpro, a few-shot active learning framework that combines PLMs and regression models to rapidly improve protein activity. EVOLVEpro surpasses current methods, yielding up to 100-fold improvements in desired properties. We demonstrate its effectiveness across six proteins in RNA production, genome editing, and antibody binding applications. These results highlight the advantages of few-shot active learning with minimal experimental data over zero-shot predictions. EVOLVEpro opens new possibilities for AI-guided protein engineering in biology and medicine.

Language: English

Cited by: 20

AI Methods for Antimicrobial Peptides: Progress and Challenges (Creative Commons)
Carlos A. Brizuela, Gary Liu, J Stokes, et al.

Microbial Biotechnology, Year: 2025, Issue: 18(1)

Published: Jan. 1, 2025

Antimicrobial peptides (AMPs) are promising candidates to combat multidrug-resistant pathogens. However, the high cost of extensive wet-lab screening has made AI methods for identifying and designing AMPs increasingly important, with machine learning (ML) techniques playing a crucial role. AI approaches have recently revolutionised this field by accelerating the discovery of new peptides with anti-infective activity, particularly in preclinical mouse models. Initially, classical ML approaches dominated the field, but there has been a shift towards deep learning (DL) models. Despite significant contributions, existing reviews have not thoroughly explored the potential of large language models (LLMs), graph neural networks (GNNs) and structure-guided AMP discovery and design. This review aims to fill that gap by providing a comprehensive overview of the latest advancements, challenges and opportunities in using AI methods, with a particular emphasis on LLMs, GNNs and structure-guided design. We discuss the limitations of current approaches and highlight the most relevant topics to address in the coming years.

Language: English

Cited by: 3

How to build the virtual cell with artificial intelligence: Priorities and opportunities (Creative Commons)
Charlotte Bunne, Yusuf Roohani, Yanay Rosen, et al.

Cell, Year: 2024, Issue: 187(25), pp. 7045-7063

Published: Dec. 1, 2024

Cells are essential to understanding health and disease, yet traditional models fall short of modeling and simulating their function and behavior. Advances in AI and omics offer groundbreaking opportunities to create an AI virtual cell (AIVC), a multi-scale, multi-modal large-neural-network-based model that can represent and simulate the behavior of molecules, cells, and tissues across diverse states. This Perspective provides a vision on their design and how collaborative efforts to build AIVCs will transform biological research by allowing high-fidelity simulations, accelerating discoveries, and guiding experimental studies, offering new opportunities for understanding cellular functions and fostering interdisciplinary collaborations in open science.

Language: English

Cited by: 15

Conditional language models enable the efficient design of proficient enzymes (Creative Commons)
Geraldene Munsamy, R. Illanes, Silvia Fruncillo, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown

Published: May 5, 2024

The design of functional enzymes holds promise for transformative solutions across various domains but presents significant challenges. Inspired by the success of language models in generating nature-like proteins, we explored the potential of an enzyme-specific language model in designing catalytically active artificial enzymes. Here, we introduce ZymCTRL ('enzyme control'), a conditional language model trained on the enzyme sequence space, capable of generating enzymes based on user-defined specifications. Experimental validation at diverse data regimes and for different enzyme families demonstrated ZymCTRL's ability to generate active enzymes across various sequence identity ranges. Specifically, we describe carbonic anhydrases and lactate dehydrogenases in zero-shot, without requiring further training of the model, showcasing activity at sequence identities below 40% compared to natural proteins. Biophysical analysis confirmed the globularity and well-folded nature of the generated sequences. Furthermore, fine-tuning the model enabled the generation of lactate dehydrogenases outside natural sequence space with activity comparable to their natural counterparts. Two were selected for scale production and successfully lyophilised, maintaining activity and demonstrating preliminary conversion in one-pot enzymatic cascades under extreme conditions. Our findings open a new door towards the rapid and cost-effective design of proficient enzymes. The model and dataset are freely available to the community.

Language: English

Cited by: 12

Biophysics-based protein language models for protein engineering (Creative Commons)
Sam Gelman, Bryce Johnson, Chase R. Freschlin, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown

Published: Mar. 17, 2024

Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure, and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose Mutational Effect Transfer Learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure, and energetics. We finetune METL on experimental sequence-function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity, and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL's ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.

Language: English

Cited by: 11

Functional protein mining with conformal guarantees (Creative Commons)
Ron Boger, Seyone Chithrananda, Anastasios N. Angelopoulos, et al.

Nature Communications, Year: 2025, Issue: 16(1)

Published: Jan. 2, 2025

Molecular structure prediction and homology detection offer promising paths to discovering protein function and evolutionary relationships. However, current approaches lack statistical reliability assurances, limiting their practical utility for selecting proteins for further experimental and in-silico characterization. To address this challenge, we introduce a statistically principled approach to protein search leveraging principles from conformal prediction, offering a framework that ensures statistical guarantees with user-specified risk and provides calibrated probabilities (rather than raw ML scores) for any protein search model. Our method (1) lets users select many biologically-relevant loss metrics (i.e. false discovery rate) and assigns reliable functional probabilities for annotating genes of unknown function; (2) achieves state-of-the-art performance in enzyme classification without training new models; and (3) robustly and rapidly pre-filters proteins for computationally intensive structural alignment algorithms. Our framework enhances the reliability of protein homology detection and enables the discovery of uncharacterized proteins with likely desirable functional properties.

Language: English

Cited by: 1
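The idea of turning raw search scores into a statistically backed cutoff, as described in the abstract above, can be sketched with plain split conformal prediction. This is a simplified stand-in and not the paper's procedure: it bounds the chance of missing a true match rather than controlling the false discovery rate, and the calibration scores, query scores, and the name `conformal_threshold` are hypothetical. It assumes that calibration scores and future true-match scores are exchangeable and that higher scores mean stronger matches.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Score cutoff such that a new true match falls below it with
    probability at most alpha, assuming exchangeability (hypothetical helper).

    calibration_scores: similarity scores of known true matches.
    """
    n = len(calibration_scores)
    k = math.floor(alpha * (n + 1))  # how many calibration scores may fall below the cutoff
    if k == 0:
        return float("-inf")  # too few calibration points: accept everything
    return sorted(calibration_scores)[k - 1]  # k-th smallest calibration score


# Toy calibration set (assumed): scores of known homolog pairs from any search model.
calibration = [0.91, 0.62, 0.85, 0.77, 0.95, 0.70, 0.88, 0.81, 0.66]
threshold = conformal_threshold(calibration, alpha=0.2)

# Keep only query hits whose score clears the calibrated cutoff.
query_scores = [0.50, 0.68, 0.90]
hits = [s for s in query_scores if s >= threshold]
print(threshold, hits)
```

The guarantee follows from a rank argument: under exchangeability a new true-match score is equally likely to land at any rank among the n + 1 scores, so it falls below the k-th smallest calibration score with probability at most k / (n + 1), which is at most alpha.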

The OMG dataset: An Open MetaGenomic corpus for mixed-modality genomic language modeling (Creative Commons)

Andre Cornman, Jacob West-Roberts, Antônio Pedro Camargo, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown

Published: Aug. 17, 2024

Biological language model performance depends heavily on pretraining data quality, diversity, and size. While metagenomic datasets feature enormous biological diversity, their utilization as pretraining data has been limited due to challenges in data accessibility, quality filtering and deduplication. Here, we present the Open MetaGenomic (OMG) corpus, a genomic pretraining dataset totalling 3.1T base pairs and 3.3B protein coding sequences, obtained by combining the two largest metagenomic dataset repositories (JGI's IMG and EMBL's MGnify). We first document the composition of the dataset and describe the quality filtering steps taken to remove poor quality data. We make the OMG corpus available as a mixed-modality genomic sequence dataset that represents multi-gene encoding genomic sequences with translated amino acids for protein coding sequences, and nucleic acids for intergenic sequences. We train the first mixed-modality genomic language model (gLM2) that leverages genomic context information to learn robust functional representations, as well as coevolutionary signals in protein-protein interfaces and genomic regulatory syntax. Furthermore, we show that deduplication in embedding space can be used to balance the corpus, demonstrating improved performance on downstream tasks. The OMG dataset is publicly hosted on the Hugging Face Hub at https://huggingface.co/datasets/tattabio/OMG and gLM2 is available at https://huggingface.co/tattabio/gLM2_650M.

Language: English

Cited by: 6

Diverse Genomic Embedding Benchmark for functional evaluation across the tree of life
Jacob West-Roberts, Joshua Kravitz, Nishant Jha, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown

Published: July 16, 2024

Biological foundation models hold significant promise for deciphering complex biological functions. However, evaluating their performance on functional tasks remains challenging due to the lack of standardized benchmarks encompassing diverse sequences and functions. Existing functional annotations are often scarce, biased, and susceptible to train-test leakage, hindering robust evaluation. Furthermore, biological functions manifest at multiple scales, from individual residues to large genomic segments. To address these limitations, we introduce the Diverse Genomic Embedding Benchmark (DGEB), inspired by natural language embedding benchmarks. DGEB comprises six embedding tasks across 18 expert curated datasets, spanning sequences from all domains of life and encompassing both nucleic acid and amino acid modalities. Notably, four datasets enable direct comparison between models trained on different modalities. Benchmarking protein and genomic language models (pLMs and gLMs) on DGEB reveals performance saturation with model scaling on numerous tasks, especially on those with underrepresented sequences (e.g. Archaea). This highlights the limitations of existing modeling objectives and training data distributions for capturing diverse biological functions. DGEB is available as an open-source package with a public leaderboard at https://github.com/TattaBio/DGEB.

Language: English

Cited by: 4

Phyla: Towards a Foundation Model for Phylogenetic Inference (Creative Commons)

Andrew Shen, Yasha Ektefaie, Lakhmi C. Jain, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2025, Issue: unknown

Published: Jan. 22, 2025

Deep learning has made strides in modeling protein sequences but often struggles to generalize beyond its training distribution. Current models focus on learning individual sequences through masked language modeling, but effective protein sequence analysis demands the ability to reason across sequences, a critical step in phylogenetic analysis. Training biological foundation models explicitly for intersequence reasoning could enhance their generalizability and performance for phylogenetic inference and other sequence analysis tasks in computational biology. Here, we report an ongoing development of PHYLA, an architecture that operates on an explicit, higher-level semantic representation of phylogenetic trees. PHYLA employs a hybrid state-space transformer architecture and a novel tree loss function to achieve state-of-the-art performance on sequence reasoning benchmarks and phylogenetic tree reconstruction. To validate PHYLA's capabilities, we applied it to reconstruct the tree of life, where PHYLA accurately reclassified archaeal organisms, such as Lokiarchaeota, as more closely related to bacteria, aligning with recent phylogenetic insights. PHYLA represents a step toward molecular sequence reasoning, emphasizing structured reasoning over memorization and advancing protein sequence analysis and phylogenetic inference.

Language: English

Cited by: 0

Pathogen genomic surveillance and the AI revolution (Creative Commons)
Spyros Lytras, Kieran D. Lamb, Jumpei Ito, et al.

Journal of Virology, Year: 2025, Issue: unknown

Published: Jan. 29, 2025

The unprecedented sequencing efforts during the COVID-19 pandemic paved the way for genomic surveillance to become a powerful tool for monitoring the evolution of circulating viruses. Herein, we discuss how a state-of-the-art artificial intelligence approach called protein language models (pLMs) can be used for effectively analyzing pathogen genomic data. We highlight examples of pLMs applied to predicting viral properties and evolution and lay out a framework for integrating pLMs into genomic surveillance pipelines.

Language: English

Cited by: 0