MethylGPT: a foundation model for the DNA methylome
Kejun Ying, Jinyeop Song, Haotian Cui

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Nov. 3, 2024

Abstract DNA methylation serves as a powerful biomarker for disease diagnosis and biological age assessment. However, current analytical approaches often rely on linear models that cannot capture the complex, context-dependent nature of methylation regulation. Here we present MethylGPT, a transformer-based foundation model trained on 226,555 human methylation profiles (154,063 after QC and deduplication) spanning diverse tissue types from 5,281 datasets, covering 49,156 curated CpG sites and 7.6 billion training tokens. MethylGPT learns biologically meaningful representations that capture both local genomic context and higher-order chromosomal features without external supervision. The model demonstrates robust methylation value prediction (Pearson R = 0.929) and maintains stable performance in downstream tasks with up to 70% missing data. Applied to age prediction across multiple tissue types, MethylGPT achieves superior accuracy compared with existing methods. Analysis of the model's attention patterns reveals distinct methylation signatures between young and old samples, with differential enrichment of developmental and aging-associated pathways. When fine-tuned to predict mortality and 60 major disease conditions using 18,859 samples from Generation Scotland, the model's predictive performance enables systematic evaluation of intervention effects on disease risks, demonstrating potential for clinical applications. Our results demonstrate that transformer architectures can effectively model DNA methylation patterns while preserving biological interpretability, suggesting broad utility for epigenetic analysis.
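The abstract describes masked methylation-value prediction over tokenized CpG sites. Below is a minimal, hypothetical PyTorch sketch of such a pretraining step; the class name, dimensions, and masking scheme are illustrative assumptions, not MethylGPT's actual implementation.

```python
# Hypothetical sketch of masked methylation-value pretraining, loosely
# following the abstract (CpG-site tokens, value regression, robustness to
# missing data). All names and dimensions are illustrative only.
import torch
import torch.nn as nn

class MethylValueModel(nn.Module):
    def __init__(self, n_cpg_sites: int, d_model: int = 128, n_layers: int = 2):
        super().__init__()
        self.site_emb = nn.Embedding(n_cpg_sites, d_model)  # one token per CpG site
        self.value_proj = nn.Linear(1, d_model)             # continuous beta value
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 1)                   # regress the masked value

    def forward(self, site_ids, values, mask):
        # Masked positions contribute no value information, only site identity.
        v = self.value_proj(values.unsqueeze(-1)) * (~mask).unsqueeze(-1)
        h = self.encoder(self.site_emb(site_ids) + v)
        return self.head(h).squeeze(-1)

# One training step: hide 15% of values and regress them from context.
model = MethylValueModel(n_cpg_sites=49156)
site_ids = torch.randint(0, 49156, (8, 256))  # batch of 256-site windows
values = torch.rand(8, 256)                   # beta values in [0, 1]
mask = torch.rand(8, 256) < 0.15
pred = model(site_ids, values, mask)
loss = nn.functional.mse_loss(pred[mask], values[mask])
loss.backward()
```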

Language: English

Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences
Jeffrey A. Ruffolo, Stephen Nayfach, Joseph P. Gallagher

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: April 22, 2024

Gene editing has the potential to solve fundamental challenges in agriculture, biotechnology, and human health. CRISPR-based gene editors derived from microbes, while powerful, often show significant functional tradeoffs when ported into non-native environments, such as human cells. Artificial intelligence (AI)-enabled design provides a powerful alternative with the potential to bypass evolutionary constraints and generate editors with optimal properties. Here, using large language models (LLMs) trained on biological diversity at scale, we demonstrate the first successful precision editing of the human genome with a programmable gene editor designed with AI. To achieve this goal, we curated a dataset of over one million CRISPR operons through systematic mining of 26 terabases of assembled genomes and metagenomes. We demonstrate the capacity of our models by generating 4.8x the number of protein clusters across CRISPR-Cas families found in nature and by tailoring single-guide RNA sequences for Cas9-like effector proteins. Several of the generated gene editors show comparable or improved activity and specificity relative to SpCas9, the prototypical gene editing effector, while being 400 mutations away in sequence. Finally, we demonstrate that an AI-generated gene editor, denoted OpenCRISPR-1, exhibits compatibility with base editing. We release OpenCRISPR-1 publicly to facilitate broad, ethical usage across research and commercial applications.

Language: English

Citations

35

Democratizing protein language models with parameter-efficient fine-tuning
Samuel Sledzieski, Meghana Kshirsagar, Minkyung Baek

et al.

Proceedings of the National Academy of Sciences, Journal Year: 2024, Volume and Issue: 121(26)

Published: June 20, 2024

Proteomics has been revolutionized by large protein language models (PLMs), which learn unsupervised representations from large corpora of protein sequences. These models are typically fine-tuned in a supervised setting to adapt them to specific downstream tasks. However, the computational and memory footprint of fine-tuning (FT) large PLMs presents a barrier for many research groups with limited computational resources. Natural language processing has seen a similar explosion in model size, where these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we introduce this paradigm to proteomics by leveraging the parameter-efficient method LoRA and training new models for two important tasks: predicting protein–protein interactions (PPIs) and predicting the symmetry of homooligomer quaternary structures. We show that these approaches are competitive with traditional FT while requiring reduced memory and substantially fewer parameters. We additionally show that, for the PPI prediction task, training only the classification head also remains competitive with full FT while using five orders of magnitude fewer parameters, and that each of these methods outperforms state-of-the-art PPI prediction methods with comparable compute. We further perform a comprehensive evaluation of the hyperparameter space, demonstrate that PEFT of PLMs is robust to variations in these hyperparameters, and elucidate where best practices for PEFT in proteomics differ from those in natural language processing. All our model adaptation and evaluation code is available open-source at https://github.com/microsoft/peft_proteomics . Thus, we provide a blueprint to democratize the power of PLM adaptation for groups with limited compute.
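As a concrete illustration of the LoRA recipe the abstract describes, the sketch below wires a low-rank adapter into a protein language model using the Hugging Face transformers and peft libraries. The checkpoint, rank, and target modules are assumptions for illustration, not the authors' exact configuration (see their repository for that).

```python
# Illustrative LoRA fine-tuning setup for a protein language model, in the
# spirit of the paper (not the authors' exact code). Assumes the Hugging Face
# `transformers` and `peft` libraries and a public ESM-2 checkpoint.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "facebook/esm2_t12_35M_UR50D", num_labels=2  # e.g., interacting vs. not
)
config = LoraConfig(
    r=8,                                 # low-rank update dimension
    lora_alpha=16,                       # scaling factor for the update
    lora_dropout=0.1,
    target_modules=["query", "value"],   # adapters on attention projections
)
model = get_peft_model(base, config)
model.print_trainable_parameters()       # typically well under 1% of the model
```

Because only the adapter weights and classification head receive gradients, the optimizer state shrinks accordingly, which is what makes FT feasible on a single modest GPU.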

Language: English

Citations

16

Language models for biological research: a primer
Elana P. Simon, Kyle Swanson, James Zou

et al.

Nature Methods, Journal Year: 2024, Volume and Issue: 21(8), P. 1422 - 1429

Published: Aug. 1, 2024

Language: English

Citations

13

Evaluating the representational power of pre-trained DNA language models for regulatory genomics
Ziqi Tang, Nikunj V. Somia, Yiyang Yu

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: March 4, 2024

ABSTRACT The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since those studies tested gLMs upon fine-tuning their weights for each downstream task, whether pre-trained gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data spanning DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues with conventional pre-training strategies for the non-coding genome.
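"Probing" here means fitting a simple model on frozen representations. Below is a minimal sketch of the comparison the authors describe, with embed_with_glm as a hypothetical stand-in for any pre-trained gLM encoder; the same linear head is fit on both feature sets so only the representation differs.

```python
# Minimal sketch of a probing comparison: the same linear head is fit on
# (a) frozen gLM embeddings and (b) one-hot encoded sequences.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Flattened one-hot encoding of a DNA sequence."""
    out = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        if b in BASES:
            out[i, BASES.index(b)] = 1.0
    return out.ravel()

def probe(features: np.ndarray, labels: np.ndarray) -> float:
    """Linear probe: mean cross-validated accuracy of a logistic regression."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, labels, cv=5).mean()

# Hypothetical usage (loader and gLM encoder not shown):
# seqs, labels = load_regulatory_dataset()
# X_onehot = np.stack([one_hot(s) for s in seqs])
# X_glm = np.stack([embed_with_glm(s) for s in seqs])  # frozen embeddings
# print("one-hot probe:", probe(X_onehot, labels))
# print("gLM probe:   ", probe(X_glm, labels))
```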

Language: English

Citations

12

‘ChatGPT for CRISPR’ creates new gene-editing tools
Ewen Callaway

Nature, Journal Year: 2024, Volume and Issue: 629(8011), P. 272 - 272

Published: April 29, 2024

Language: English

Citations

8

Responsible AI in biotechnology: balancing discovery, innovation and biosecurity risks
Nicole E. Wheeler

Frontiers in Bioengineering and Biotechnology, Journal Year: 2025, Volume and Issue: 13

Published: Feb. 5, 2025

The integration of artificial intelligence (AI) into protein design presents unparalleled opportunities for innovation in bioengineering and biotechnology. However, it also raises significant biosecurity concerns. This review examines the changing landscape of bioweapon risks, the dual-use potential of AI-driven protein design tools, and the safeguards necessary to prevent misuse while fostering innovation. It highlights emerging policy frameworks, technical safeguards, and community responses aimed at mitigating risks and enabling the responsible development and application of AI in protein design.

Language: English

Citations

1

ProtMamba: a homology-aware but alignment-free protein state space model
Damiano Sgarbossa, Cyril Malbranke, Anne‐Florence Bitbol

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: May 25, 2024

Abstract Protein design has important implications for drug discovery, personalized medicine, and biotechnology. Models based on multiple sequence alignments efficiently capture the evolutionary information in homologous protein sequences, but multiple sequence alignment construction is imperfect. We present ProtMamba, a homology-aware but alignment-free protein language model based on the Mamba architecture. In contrast with attention-based models, ProtMamba efficiently handles very long context, comprising hundreds of protein sequences. We trained ProtMamba on a large dataset of concatenated homologous sequences, using two GPUs. We combined autoregressive modeling and masked language modeling through a fill-in-the-middle training objective. This makes the model adapted to various protein design applications. We demonstrate ProtMamba's usefulness for the generation of novel sequences and for fitness prediction. ProtMamba reaches competitive performance with other protein language models despite its smaller size, which sheds light on the importance of long-context conditioning.
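A fill-in-the-middle objective rearranges each training sequence so that an autoregressive model learns to infill a missing span. The sketch below illustrates the idea on a protein sequence; the sentinel tokens and span-sampling scheme are illustrative assumptions, not ProtMamba's actual vocabulary.

```python
# Sketch of a fill-in-the-middle (FIM) data transformation: a middle span is
# cut out and appended after sentinel tokens, so next-token prediction on the
# rearranged sequence amounts to infilling. Sentinel names are illustrative.
import random

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_transform(seq: str, rng: random.Random) -> str:
    """Rearrange `seq` as prefix + suffix + middle for infilling training."""
    i, j = sorted(rng.sample(range(1, len(seq)), 2))
    prefix, middle, suffix = seq[:i], seq[i:j], seq[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

rng = random.Random(0)
print(fim_transform("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", rng))
```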

Language: English

Citations

6

Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model
Jingjing Zhai, Aaron Gokaslan, Yair Schiff

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: June 5, 2024

Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and, when fine-tuned on limited labeled data, offer better cross-species prediction than supervised models. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a curated dataset of 16 Angiosperm genomes. Fine-tuning PlantCaduceus on limited Arabidopsis data for four tasks, including predicting translation initiation/termination sites and splice donor and acceptor sites, demonstrated high transferability to maize, which diverged 160 million years ago, outperforming the best existing DNA LM by 1.45- to 7.23-fold. PlantCaduceus is competitive with state-of-the-art protein LMs in terms of deleterious mutation identification, performing threefold better than PhyloP. Additionally, PlantCaduceus successfully identifies well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.

Language: English

Citations

6

Nucleotide dependency analysis of DNA language models reveals genomic functional elements
Pedro Tomaz da Silva, Alexander Karollus, Johannes Hingerl

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: July 27, 2024

Deciphering how nucleotides in genomes encode regulatory instructions and molecular machines is a long-standing goal of biology. DNA language models (LMs) implicitly capture functional elements and their organization from genomic sequences alone by modeling the probabilities of each nucleotide given its sequence context. However, using DNA LMs to discover functional elements has been challenging due to the lack of interpretable methods. Here, we introduce nucleotide dependencies, which quantify how substitutions at one position in a sequence affect the probabilities of nucleotides at other positions. We generated genome-wide maps of pairwise nucleotide dependencies within kilobase ranges for animal, fungal, and bacterial species. We show that nucleotide dependencies indicate the deleteriousness of human genetic variants more effectively than sequence alignment and LM reconstruction. Regulatory elements appear as dense blocks in dependency maps, enabling systematic identification of transcription factor binding sites as accurately as models trained on experimental data. Nucleotide dependencies also highlight bases in contact within RNA structures, including pseudoknots and tertiary structure contacts, with remarkable accuracy. This led to the discovery of four novel, experimentally validated RNA structures in Escherichia coli. Finally, we reveal critical limitations of several DNA LM architectures and training sequence selection strategies by benchmarking and visual diagnosis. Altogether, nucleotide dependency analysis opens a new avenue for studying the interactions underlying genomes.
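In outline, a dependency map is built by substituting each position in turn and measuring how the model's predicted probabilities shift at every other position. The sketch below is a hedged approximation of that procedure; predict_probs is a hypothetical stand-in for a DNA LM returning per-position nucleotide probabilities, and the exact dependency statistic in the paper may differ.

```python
# Hedged sketch of pairwise nucleotide dependency computation: substitute
# position i, then record the largest log-odds shift of the reference base
# at every other position j. `predict_probs` is a hypothetical DNA LM hook.
import numpy as np

BASES = "ACGT"

def dependency_map(seq: str, predict_probs) -> np.ndarray:
    """dep[i, j]: largest log-odds shift at j caused by a substitution at i."""
    L, eps = len(seq), 1e-9
    ref = predict_probs(seq)                      # shape (L, 4), rows sum to 1
    ref_idx = [BASES.index(b) for b in seq]       # column of the reference base
    dep = np.zeros((L, L))
    for i in range(L):
        for alt in BASES.replace(seq[i], ""):     # the three alternative bases
            var = predict_probs(seq[:i] + alt + seq[i + 1:])
            shift = np.abs(np.log(var + eps) - np.log(ref + eps))
            dep[i] = np.maximum(dep[i], shift[np.arange(L), ref_idx])
        dep[i, i] = 0.0                           # ignore self-dependency
    return dep
```

Dense off-diagonal blocks in `dep` would then mark candidate regulatory elements or base-paired RNA positions, per the abstract.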

Language: English

Citations

6

The OMG dataset: An Open MetaGenomic corpus for mixed-modality genomic language modeling
Andre Cornman, Jacob West-Roberts, Antônio Pedro Camargo

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Aug. 17, 2024

Abstract Biological language model performance depends heavily on pretraining data quality, diversity, and size. While metagenomic datasets feature enormous biological diversity, their utilization as pretraining data has been limited due to challenges in data accessibility, quality filtering, and deduplication. Here, we present the Open MetaGenomic (OMG) corpus, a genomic pretraining dataset totalling 3.1T base pairs and 3.3B protein coding sequences, obtained by combining the two largest metagenomic repositories (JGI's IMG and EMBL's MGnify). We first document the composition of the dataset and describe the quality filtering steps taken to remove poor quality data. We make the OMG corpus available as a mixed-modality genomic sequence dataset that represents multi-gene encoding genomic sequences with translated amino acids for protein coding sequences and nucleic acids for intergenic sequences. We train the first mixed-modality genomic language model (gLM2) that leverages genomic context information to learn robust functional representations, as well as coevolutionary signals in protein-protein interfaces and genomic regulatory syntax. Furthermore, we show that deduplication in embedding space can be used to balance the corpus, demonstrating improved performance on downstream tasks. The OMG dataset is publicly hosted on the Hugging Face Hub at https://huggingface.co/datasets/tattabio/OMG and gLM2 at https://huggingface.co/tattabio/gLM2_650M .
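The mixed-modality representation interleaves two alphabets over one contig: translated amino acids for coding sequences and nucleotides for intergenic stretches. Below is a toy sketch under assumed element names and modality tags (not gLM2's actual tokenization).

```python
# Toy sketch of a mixed-modality contig representation: coding elements are
# emitted as translated amino acids, intergenic elements as nucleotides.
# Tags and the truncated codon table are illustrative assumptions.
from dataclasses import dataclass

CODON_TABLE = {"ATG": "M", "AAA": "K", "GAT": "D", "TAA": "*"}  # toy subset

@dataclass
class Element:
    seq: str
    is_cds: bool  # True for a protein-coding sequence

def translate(dna: str) -> str:
    """Toy translation; a real pipeline would use the full codon table."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def mixed_modality(elements: list[Element]) -> str:
    """Concatenate a contig's elements, switching modality per element."""
    parts = []
    for el in elements:
        tag = "<cds>" if el.is_cds else "<igs>"  # hypothetical modality tags
        parts.append(tag + (translate(el.seq) if el.is_cds else el.seq.lower()))
    return "".join(parts)

contig = [Element("ATGAAAGATTAA", True), Element("TTGACA", False)]
print(mixed_modality(contig))  # -> "<cds>MKD*<igs>ttgaca"
```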

Language: English

Citations

6