MethylGPT: a foundation model for the DNA methylome
Kejun Ying, Jinyeop Song, Haotian Cui

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Nov. 3, 2024

DNA methylation serves as a powerful biomarker for disease diagnosis and biological age assessment. However, current analytical approaches often rely on linear models that cannot capture the complex, context-dependent nature of methylation regulation. Here we present MethylGPT, a transformer-based foundation model trained on 226,555 (154,063 after QC and deduplication) human methylation profiles spanning diverse tissue types from 5,281 datasets, with 49,156 curated CpG sites and 7.6 billion training tokens. MethylGPT learns biologically meaningful representations of CpG sites, capturing both local genomic context and higher-order chromosomal features without external supervision. The model demonstrates robust methylation value prediction (Pearson R=0.929) and maintains stable performance in downstream tasks with up to 70% missing data. Applied to age prediction across multiple tissue types, MethylGPT achieves superior accuracy compared to existing methods. Analysis of the model's attention patterns reveals distinct methylation signatures between young and old samples, with differential enrichment of developmental and aging-associated pathways. When finetuned for mortality and disease prediction across 60 major conditions using 18,859 samples from Generation Scotland, MethylGPT achieves robust predictive performance and enables systematic evaluation of intervention effects on disease risks, demonstrating potential for clinical applications. Our results demonstrate that transformer architectures can effectively model DNA methylation patterns while preserving biological interpretability, suggesting broad utility for epigenetic analysis.

Language: English

Citations

3

Ribosomal protein phylogeography offers quantitative insights into the efficacy of genome-resolved surveys of microbial communities
Matthew S. Schechter, Florian Trigodet, Iva Veseli

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 15, 2025

The increasing availability of microbial genomes is essential to gain insights into microbial ecology and evolution that can propel biotechnological and biomedical advances. Recent advances in genome recovery have significantly expanded the catalogue of microbial genomes from diverse habitats. However, the ability to explain how well a set of genomes accounts for the diversity in a given environment remains challenging for individual studies or biome-specific databases. Here we present EcoPhylo, a computational workflow to characterize the phylogeography of any gene family through integrated analyses of genomes and metagenomes, and our application of this approach to ribosomal proteins to quantify phylogeny-aware genome recovery rates across three biomes. Our findings show that genome recovery rates vary widely across taxa and biomes, and that single amplified genomes, metagenome-assembled genomes, and isolate genomes have non-uniform yet quantifiable representation of environmental microbes. EcoPhylo reveals highly resolved, reference-free, multi-domain phylogenies in conjunction with distribution patterns of individual clades across environments, providing a means to assess and benchmark genome recovery in individual studies and biome-level genome collections.

Language: English

Citations

0

Phyla: Towards a Foundation Model for Phylogenetic Inference

Andrew Shen, Yasha Ektefaie, Lakhmi C. Jain

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 22, 2025

Deep learning has made strides in modeling protein sequences but often struggles to generalize beyond its training distribution. Current models focus on modeling individual sequences through masked language modeling, but effective protein sequence analysis demands the ability to reason across sequences, a critical step in phylogenetic analysis. Training biological foundation models explicitly for intersequence reasoning could enhance their generalizability and performance for phylogenetic inference and other tasks in computational biology. Here, we report an ongoing development of PHYLA, an architecture that operates on an explicit, higher-level semantic representation of phylogenetic trees. PHYLA employs a hybrid state-space transformer architecture and a novel tree loss function to achieve state-of-the-art performance on sequence reasoning benchmarks and phylogenetic tree reconstruction. To validate PHYLA's capabilities, we applied it to reconstruct the tree of life, where PHYLA accurately reclassified archaeal organisms, such as Lokiarchaeota, as more closely related to bacteria, aligning with recent phylogenetic insights. PHYLA represents a step toward molecular sequence reasoning, emphasizing structured reasoning over memorization and advancing protein sequence analysis and phylogenetic inference.

Language: English

Citations

0

Selective State Space Models Outperform Transformers at Predicting RNA-Seq Read Coverage
Ian Holmes, Johannes Linder, David R. Kelley

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 17, 2025

Transformers are the basis for many state-of-the-art machine learning tools, including those for predicting gene expression data from DNA sequence. The considerable time and cost of training transformer models has motivated development of alternative approaches inspired by ideas from the signal-processing literature, such as state-space models (Mamba), Fourier transforms (Hyena), and wavelet transforms (MultiResNet). To evaluate these methods as potential replacements for (or complements to) attention, we developed a software library, bilby, implemented using Python and Jax/Flax, providing convolutional, attention, bidirectional Hyena, bidirectional Mamba, and striped-architecture models for supervised multi-task learning in functional genomics. We report a comparison of these architectures, testing several hyperparameters and variations, and reporting performance statistics for the withheld test set as well as downstream SNP classifiers. Relative to models comprising convolution and attention layers (implemented in Python and TensorFlow via the Baskerville library used by the Borzoi software), models comprising convolutional, Mamba, and (optionally) attention layers achieve small but consistent improvements in prediction accuracy, for roughly comparable training times and parameter counts, when averaged across all output tracks and data splits (a proportional increase of 3-4% in Pearson R, and 1-2% in r^2, with the highest gains achieved when Mamba and attention layers were combined in a striped architecture). In contrast, Hyena (when reimplemented as described in the literature) was not competitive with attention-based models at these tasks, while MultiResNet proved too slow to be practical. The gains in prediction accuracy for the Mamba-based models do not yet translate to significantly improved performance on downstream SNP classification tasks: benchmarks using a GTEx eQTL dataset yield roughly similar results for Mamba- and attention-based classifiers, with attention marginally outperforming Mamba in one metric (a difference of +0.007 in area under ROC) and slightly underperforming by another metric (a difference of −0.006 in Spearman rank correlation). We argue that these results suggest selective state-space models (such as Mamba and Striped Mamba) warrant further exploration for functional genomics tasks. Our code and trained models are publicly available at https://github.com/ihh/bilby .

Language: English

Citations

0

Transformer model generated bacteriophage genomes are compositionally distinct from natural sequences
Jeremy Ratcliff

NAR Genomics and Bioinformatics, Journal Year: 2024, Volume and Issue: 6(3)

Published: July 2, 2024

Novel applications of language models in genomics promise to have a large impact on the field. The megaDNA model is the first publicly available generative model for creating synthetic viral genomes. To evaluate megaDNA's ability to recapitulate the nonrandom genome composition of viruses and assess whether synthetic genomes can be algorithmically detected, compositional metrics for 4969 natural bacteriophage genomes and 1002

Language: English

Citations

3
