ProkBERT PhaStyle: Accurate Phage Lifestyle Prediction with Pretrained Genomic Language Models

Julianna Juhász,

Bodnár Babett,

János Juhász

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Dec. 8, 2024

Abstract Background Phage lifestyle prediction, i.e. classifying phage sequences as virulent or temperate, is crucial in biomedical and ecological applications. Phage sequences from metagenome and metavirome assemblies are often fragmented, and the diversity of environmental phages is not well known. Current computational approaches rely on database comparisons and machine learning algorithms that require significant effort and expertise to update. We propose using genomic language models for lifestyle classification, allowing efficient and direct analysis of nucleotide sequences without the need for sophisticated preprocessing pipelines or manually curated databases. Methods We trained three genomic language models (DNABERT-2, Nucleotide Transformer, ProkBERT) on datasets of short, fragmented sequences. These models were then compared with dedicated phage lifestyle prediction methods (PhaTYP, DeePhage, BACPHLIP) in terms of accuracy, speed, and generalization capability. Results ProkBERT PhaStyle consistently outperforms existing models in various scenarios. It generalizes to out-of-sample data, accurately classifies phages from extreme environments, and also demonstrates high inference speed. Despite having up to 20 times fewer parameters, it proved to be better performing than much larger models. Conclusions Genomic language models offer a simple and computationally efficient alternative for solving complex classification tasks, such as phage lifestyle prediction. PhaStyle's simplicity, speed, and performance suggest its utility in clinical and ecological applications.
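The fragment-level classification the abstract describes can be sketched as follows: chop a contig into short fragments (mimicking fragmented metavirome assemblies), score each fragment, and aggregate by majority vote. The `score_fragment` stub below is a hypothetical placeholder; a real pipeline would call a fine-tuned genomic language model such as ProkBERT at that point.

```python
# Minimal sketch of fragment-wise phage lifestyle prediction.
# `score_fragment` is a toy stand-in, NOT the actual model.

def fragment(seq, size=512):
    """Chop a contig into non-overlapping fragments, mimicking short assemblies."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def score_fragment(frag):
    """Placeholder scorer returning P(virulent).

    Toy heuristic for illustration only (GC content); a fine-tuned
    genomic language model would replace this stub in practice.
    """
    return sum(1 for b in frag if b in "GC") / max(len(frag), 1)

def predict_lifestyle(contig, threshold=0.5):
    """Classify each fragment, then aggregate by majority vote."""
    votes = [score_fragment(f) >= threshold for f in fragment(contig)]
    return "virulent" if sum(votes) * 2 >= len(votes) else "temperate"
```

The per-fragment design matters because, as the abstract notes, assembled phage sequences are often too fragmented for whole-genome methods.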

Language: English

Species-aware DNA language models capture regulatory elements and their evolution
Alexander Karollus, Johannes Hingerl,

Dennis Gankin

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: Jan. 27, 2023

Abstract The rise of large-scale multi-species genome sequencing projects promises to shed new light on how genomes encode gene regulatory instructions. To this end, algorithms are needed that can leverage conservation to capture regulatory elements while accounting for their evolution. Here we introduce species-aware DNA language models (LMs), which we trained on more than 800 species spanning over 500 million years of evolution. Investigating their ability to predict masked nucleotides from context, we show that LMs distinguish transcription factor and RNA-binding protein motifs from background non-coding sequence. Owing to their flexibility, they capture conserved elements over much further evolutionary distances than sequence alignment would allow. Remarkably, LMs reconstruct motif instances bound in vivo better than unbound ones and account for the evolution of sequences and positional constraints, showing that they capture functional high-order context. We show that species-aware training yields improved representations for endogenous and MPRA-based expression prediction, as well as motif discovery. Collectively, these results demonstrate a powerful, flexible, and scalable tool to integrate information from large compendia of highly diverged genomes.
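The masked-nucleotide objective mentioned above can be illustrated with a toy model: predict a hidden base from its flanking context. Real DNA LMs use transformers trained across hundreds of species; the sketch below uses simple context-frequency counts on a made-up corpus purely to make the objective concrete.

```python
from collections import Counter, defaultdict

def train_masked_model(corpus, k=2):
    """Count center-base frequencies given k flanking bases on each side."""
    table = defaultdict(Counter)
    for seq in corpus:
        for i in range(k, len(seq) - k):
            ctx = seq[i - k:i] + seq[i + 1:i + k + 1]
            table[ctx][seq[i]] += 1
    return table

def predict_masked(table, seq, i, k=2):
    """Most likely base at position i given its context, or None if unseen."""
    ctx = seq[i - k:i] + seq[i + 1:i + k + 1]
    if ctx not in table:
        return None
    return table[ctx].most_common(1)[0][0]
```

The paper's key observation maps onto this toy directly: positions inside recurring, conserved motifs are reconstructed well (their contexts are frequent), while background positions are not.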

Language: English

Citations

6

Semi-supervised learning with pseudo-labeling compares favorably with large language models for regulatory sequence prediction
Hai Thanh Phan, Céline Brouard, Raphaël Mourad

et al.

Briefings in Bioinformatics, Journal Year: 2024, Volume and Issue: 25(6)

Published: Sept. 23, 2024

Abstract Predicting molecular processes using deep learning is a promising approach to provide biological insights for non-coding single nucleotide polymorphisms identified in genome-wide association studies. However, most methods rely on supervised learning, which requires DNA sequences associated with functional data, and whose amount is severely limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequence is growing exponentially due to ongoing large-scale sequencing projects, but in most cases without functional data. To alleviate these limitations, we propose a novel semi-supervised learning (SSL) approach based on pseudo-labeling, which allows us to exploit unlabeled sequences from numerous genomes during model pre-training. We further improved it by incorporating principles from the Noisy Student algorithm to predict the confidence of pseudo-labeled data used for pre-training, which showed improvements for transcription factors with very few binding sites (very small training data). The approach is flexible and can be used to train any neural network architecture, including state-of-the-art models, and shows strong predictive performance compared with standard supervised learning. Moreover, models trained with SSL performed similarly to or better than large language models such as DNABERT2.
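The pseudo-labeling loop the abstract describes can be sketched with a toy classifier in place of a neural network: train on the few labeled points, pseudo-label only the unlabeled points the model is confident about, then retrain on both. The nearest-centroid "model" and margin-based confidence below are illustrative stand-ins; the confidence filter echoes the Noisy Student idea of keeping only high-confidence pseudo-labels.

```python
# Toy semi-supervised pseudo-labeling sketch (1-D nearest-centroid classifier).

def fit_centroids(xs, ys):
    """Mean of each class's points; the entire 'model' in this toy."""
    return {label: sum(x for x, y in zip(xs, ys) if y == label) /
                   sum(1 for y in ys if y == label)
            for label in set(ys)}

def predict(cents, x):
    return min(cents, key=lambda c: abs(x - cents[c]))

def confidence(cents, x):
    """Margin between the two closest centroids; larger means more confident."""
    d = sorted(abs(x - c) for c in cents.values())
    return d[1] - d[0]

def pseudo_label_fit(xs_lab, ys_lab, xs_unlab, min_margin=1.0):
    """Train, pseudo-label confident unlabeled points, then retrain on both."""
    cents = fit_centroids(xs_lab, ys_lab)
    keep = [(x, predict(cents, x)) for x in xs_unlab
            if confidence(cents, x) >= min_margin]
    return fit_centroids(xs_lab + [x for x, _ in keep],
                         ys_lab + [y for _, y in keep])
```

Note how the ambiguous midpoint is dropped by the margin filter while clearly one-sided points are absorbed, shifting the centroids toward the true class means.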

Language: English

Citations

0

Limitations and Enhancements in Genomic Language Models: Dynamic Selection Approach

S. Qiu

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Nov. 26, 2024

Genomic Language Models (GLMs), which learn from nucleotide sequences, have become essential tools for understanding the principles of life and have demonstrated outstanding performance in downstream genomic analysis tasks such as sequence generation and classification. However, models that achieve state-of-the-art (SoTA) results in benchmark tests often exhibit significant differences in training methods, model architectures, and tokenization techniques, leading to varied strengths and weaknesses. Motivated by these differences, we propose a multi-model fusion approach built on a dynamic selector. By effectively integrating three models with significantly different characteristics, the method enhances overall predictive performance on downstream tasks. Experimental results indicate that the fused model outperforms any single model on the testing tasks (SoTA), achieving complementary advantages among the models. Additionally, because most researchers focus on improving performance, they may overlook detailed analysis of sequence-processing capabilities across architectures. To address this gap, this study conducts a comprehensive analysis of the models' sequence classification abilities, with hypotheses and validations of possible underlying causes. The findings reveal a strong correlation between model performance and the prominence of motifs in sequences. Excessive reliance on motifs may result in limitations in capturing the biological functions of ultra-short core genes and the contextual relationships of ultra-long sequences. We suggest these issues need a novel architectural module to compensate for deficiencies on such genes. The code, data, and pre-trained models are available at https://github.com/Jacob-S-Qiu/glm_dynamic_selection.
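The dynamic-selector fusion described above can be sketched as a router that inspects each input and dispatches it to whichever model suits it. The three stub "models" and the hand-written routing rule below are hypothetical stand-ins (the paper uses trained GLMs and a learned selector); they only illustrate the control flow.

```python
# Sketch of dynamic model selection over three hypothetical specialists.

def short_seq_model(seq):
    """Imagined specialist for ultra-short sequences (toy rule)."""
    return "coding" if len(seq) % 3 == 0 else "noncoding"

def motif_model(seq):
    """Imagined specialist for motif-rich sequences (toy rule)."""
    return "coding" if "TATAAT" in seq else "noncoding"

def long_ctx_model(seq):
    """Imagined specialist for long-range context (toy GC rule)."""
    return "coding" if seq.count("G") + seq.count("C") > len(seq) // 2 else "noncoding"

def dynamic_select(seq):
    """Route by simple input features; a learned selector would replace this."""
    if len(seq) < 30:
        return short_seq_model
    if "TATAAT" in seq:
        return motif_model
    return long_ctx_model

def fused_predict(seq):
    return dynamic_select(seq)(seq)
```

The point of the design is that each specialist only ever sees inputs matching its strength, so the ensemble's accuracy can exceed any single model's.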

Language: English

Citations

0

Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale
Caleb N. Ellington, Nian X. Sun, Nicholas Ho

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Dec. 5, 2024

Abstract Language models applied to protein sequences have become a panacea, enabling therapeutics development, materials engineering, and core biology research. Despite the successes of protein language models, genome language models remain nascent. Recent studies suggest the bottleneck is data volume or modeling context size, since long-range interactions are widely acknowledged but sparsely annotated. However, it may be the case that even short DNA sequences are modeled poorly by existing approaches, and that current models are unable to represent the wide array of functions encoded by DNA. To study this, we develop AIDO.DNA, a pretrained module for DNA representation in an AI-driven Digital Organism [1]. AIDO.DNA is a seven billion parameter encoder-only transformer trained on 10.6 billion nucleotides from a dataset of 796 species. By scaling model size while maintaining a context length of 4k nucleotides, AIDO.DNA shows substantial improvements across a breadth of supervised, generative, and zero-shot tasks relevant to functional genomics, synthetic biology, and drug development. Notably, it outperforms prior architectures without new data, suggesting that new scaling laws are needed to achieve compute-optimal models. Models and code are available through ModelGenerator at https://github.com/genbio-ai/AIDO and on Hugging Face at https://huggingface.co/genbio-ai .
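The "seven billion parameter encoder-only transformer" scale can be sanity-checked with the standard back-of-the-envelope estimate of roughly 12·d² parameters per layer (4·d² for attention projections plus 8·d² for a 4×-width feed-forward block). The width and depth below are illustrative values that land near 7B under this estimate, not AIDO.DNA's actual configuration.

```python
# Rough parameter count for an encoder-only transformer.
# d_model, n_layers, and vocab below are hypothetical, chosen only to
# show how a ~7B configuration arises from the per-layer estimate.

def encoder_params(d_model, n_layers, vocab=4096, ffn_mult=4):
    """Approximate parameters: attention (4*d^2) + FFN (2*ffn_mult*d^2) per
    layer, plus a token-embedding table. Ignores biases and layer norms."""
    per_layer = 4 * d_model ** 2 + 2 * ffn_mult * d_model ** 2
    return n_layers * per_layer + vocab * d_model

# One configuration that lands near seven billion parameters:
total = encoder_params(d_model=4096, n_layers=35)  # ~7.06e9
```

This kind of estimate is useful when reading scaling claims: doubling width quadruples the per-layer cost, while doubling depth only doubles it.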

Language: English

Citations

0
