ProkBERT PhaStyle: Accurate Phage Lifestyle Prediction with Pretrained Genomic Language Models

Julianna Juhász, Babett Bodnár, János Juhász et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Dec. 8, 2024

Abstract. Background: Phage lifestyle prediction, i.e. classifying phage sequences as virulent or temperate, is crucial in biomedical and ecological applications. Phage sequences from metagenome and metavirome assemblies are often fragmented, and the diversity of environmental phages is not well known. Current computational approaches rely on database comparisons or machine learning algorithms that require significant effort and expertise to update. We propose using genomic language models for phage lifestyle classification, allowing efficient and direct analysis of nucleotide sequences without the need for sophisticated preprocessing pipelines or manually curated databases. Methods: We trained three genomic language models (DNABERT-2, Nucleotide Transformer, and ProkBERT) on datasets of short, fragmented phage sequences. These models were then compared with dedicated phage lifestyle prediction methods (PhaTYP, DeePhage, and BACPHLIP) in terms of accuracy, speed, and generalization capability. Results: ProkBERT PhaStyle consistently outperforms existing tools in various scenarios. It generalizes to out-of-sample data, accurately classifies phages from extreme environments, and also demonstrates high inference speed. Despite having up to 20 times fewer parameters, it proved to perform better than much larger models. Conclusions: Genomic language models offer a simple and computationally efficient alternative for solving complex classification tasks, such as phage lifestyle prediction. ProkBERT PhaStyle's simplicity, speed, and performance suggest its utility in ecological and clinical applications.
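
For readers who want to reproduce this kind of setup, the sketch below shows the standard Hugging Face recipe for fine-tuning a pretrained genomic language model as a binary sequence classifier (temperate vs. virulent). The checkpoint (DNABERT-2, one of the models compared in the paper), toy sequences, and hyperparameters are illustrative assumptions, not the configuration used in the study.

```python
# Minimal sketch: fine-tuning a pretrained genomic language model as a
# binary phage lifestyle classifier (0 = temperate, 1 = virulent).
# Checkpoint, toy data, and hyperparameters are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "zhihan1996/DNABERT-2-117M"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2, trust_remote_code=True)

# Toy fragments standing in for short, fragmented phage contigs.
train = Dataset.from_dict({
    "sequence": ["ATGCGTAGGCTA" * 50, "GGCTTACCGTAA" * 50],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(batch["sequence"], truncation=True, max_length=512)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phage_lifestyle_ft",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train,
    tokenizer=tokenizer,  # enables padding collation of variable-length inputs
)
trainer.train()
```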

Language: English

GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction
Gonzalo Benegas, Carlos Albors, Alan J. Aw et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: Oct. 11, 2023

Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, their DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks based on clinical databases (ClinVar, COSMIC, OMIM), experimental functional assays (DMS, DepMap), and population genomic data (gnomAD), our model for the human genome achieves outstanding performance in deleteriousness prediction for both coding and non-coding variants.
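
The core scoring idea behind masked DNA language models of this kind can be summarized as a log-likelihood ratio between the alternate and reference allele at the masked variant position. The sketch below illustrates that computation with a generic masked nucleotide model; the checkpoint name is hypothetical, and GPN-MSA itself additionally conditions on a whole-genome multiple sequence alignment, which is omitted here.

```python
# Sketch of the masked-LM variant scoring idea: score a substitution as
# log P(alt | context) - log P(ref | context) at the masked position.
# GPN-MSA additionally conditions on a multiple sequence alignment, which
# is omitted here; the checkpoint name below is hypothetical.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "example-org/nucleotide-masked-lm"  # hypothetical single-base MLM
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint).eval()

def llr_score(left: str, ref: str, alt: str, right: str) -> float:
    """Log-likelihood ratio log P(alt) - log P(ref) at the variant site."""
    seq = left + tokenizer.mask_token + right
    inputs = tokenizer(seq, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    log_probs = torch.log_softmax(logits, dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref)
    alt_id = tokenizer.convert_tokens_to_ids(alt)
    return (log_probs[alt_id] - log_probs[ref_id]).item()

# Negative scores suggest the alternate allele is less likely, i.e. more deleterious.
print(llr_score("ACGT" * 16, "A", "G", "TGCA" * 16))
```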

Language: English

Citations: 11

Designing realistic regulatory DNA with autoregressive language models
Avantika Lal, David Garfield, Tommaso Biancalani et al.

Genome Research, Journal Year: 2024, Volume and Issue: 34(9), P. 1411 - 1420

Published: Sept. 1, 2024

Cis-regulatory elements (CREs), such as promoters and enhancers, are DNA sequences that regulate the expression of genes. The activity of a CRE is influenced by the order, composition, and spacing of sequence motifs that are bound by proteins called transcription factors (TFs). Synthetic CREs with specific properties are needed for biomanufacturing as well as for many therapeutic applications including cell and gene therapy. Here, we present regLM, a framework to design synthetic CREs with desired properties, such as high, low, or cell type-specific activity, using autoregressive language models in conjunction with supervised sequence-to-function models. We used our framework to design synthetic yeast promoters and cell type-specific human enhancers. We demonstrate that the CREs generated with our approach are not only predicted to have the desired functionality but also contain biological features similar to experimentally validated CREs. regLM thus facilitates the design of realistic regulatory DNA while providing insights into the cis-regulatory code.
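
The two-part recipe described in the abstract, an autoregressive language model to propose sequences plus a supervised sequence-to-function model to vet them, can be sketched as follows. The model name, label-token convention, and stand-in activity predictor are assumptions for illustration and do not reproduce the exact regLM setup.

```python
# Sketch of the two-stage design loop: (1) sample candidate CREs from an
# autoregressive DNA language model conditioned on a label prefix,
# (2) keep candidates that a supervised sequence-to-function model predicts
# to have the desired activity. Names and thresholds are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

lm_ckpt = "example-org/dna-causal-lm"  # hypothetical autoregressive DNA LM
tokenizer = AutoTokenizer.from_pretrained(lm_ckpt)
lm = AutoModelForCausalLM.from_pretrained(lm_ckpt).eval()

def predicted_activity(seq: str) -> float:
    """Stand-in for a trained sequence-to-function regressor."""
    return seq.count("GC") / max(len(seq), 1)  # placeholder score only

prompt = "<HIGH>"  # assumed label token requesting high activity
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    samples = lm.generate(**inputs, do_sample=True, top_k=4,
                          max_new_tokens=200, num_return_sequences=16)

candidates = [tokenizer.decode(s, skip_special_tokens=True) for s in samples]
designs = [c for c in candidates if predicted_activity(c) > 0.5]
print(f"kept {len(designs)} of {len(candidates)} candidate CREs")
```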

Language: English

Citations: 4

METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

Ollie Liu, Sami Jaghour, Johannes Hagemann et al.

Published: Jan. 12, 2025

We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable effective modeling of metagenomic data. We then show results of pretraining, providing details about losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging threats. Website: https://metagene.ai/ Model Weights: https://huggingface.co/metagene-ai Code Repository: https://github.com/metagene-ai
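
As a rough illustration of the BPE tokenization step mentioned above, the sketch below trains a byte-pair-encoding tokenizer directly on raw nucleotide reads with the Hugging Face tokenizers library; the toy reads and vocabulary size are assumptions, not METAGENE-1's actual settings.

```python
# Sketch of BPE tokenization over raw nucleotide reads. Toy reads and
# vocabulary size are assumptions, not METAGENE-1's actual configuration.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

reads = [  # stand-ins for wastewater sequencing reads
    "ATGCGTACGTTAGCCGATCGATCGGCTAGCTAGGATCCA",
    "GGCTTAACGGTTACGATCGATCGATTACGGCATGCATGC",
    "TTTACGCGCGATATCGCGCGCGATCGATCGATCGATCGA",
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # each read is one "word"
trainer = trainers.BpeTrainer(
    vocab_size=512,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train_from_iterator(reads, trainer=trainer)

encoded = tokenizer.encode(reads[0])
print(encoded.tokens)  # multi-nucleotide subword units learned by BPE
```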

Language: English

Citations: 0

Sequence-Only Prediction of Super-Enhancers in Human Cell Lines Using Transformer Models

Ekaterina V. Kravchuk, German A. Ashniev, Marina G. Gladkova et al.

Biology, Journal Year: 2025, Volume and Issue: 14(2), P. 172 - 172

Published: Feb. 7, 2025

The study discloses the application of transformer-based deep learning models to the task of super-enhancer prediction in human tumor cell lines, with a specific focus on sequence-only features of the studied entities, super-enhancer and enhancer elements in the human genome. The proposed SE-prediction method employed GENA-LM, aimed at handling long DNA sequences in a classification task that distinguishes super-enhancers from enhancers, using H3K36me, H3K4me1, H3K4me3, and H3K27ac landscape datasets from the HeLa, HEK293, H2171, Jurkat, K562, MM1S, and U87 cell lines. The model was fine-tuned on the relevant sequence data, allowing the analysis of extended genomic sequences without the need for epigenetic markers as in earlier approaches. The model achieved balanced accuracy metrics, surpassing previous models like SENet, particularly in the HEK293 and K562 cell lines. It was also shown that super-enhancers frequently co-localize with epigenetic marks such as H3K27ac. The attention mechanism provided insights into the sequence features contributing to SE classification, indicating a correlation between sequence context and the mentioned epigenetic landscapes. These findings support the potential of transformer models for further bioinformatics applications in enhancer/super-enhancer characterization and gene regulation studies.

Language: English

Citations: 0

Benchmarking DNA large language models on quadruplexes
Oleksandr Cherednichenko, Alan Herbert, Maria Poptsova et al.

Computational and Structural Biotechnology Journal, Journal Year: 2025, Volume and Issue: 27, P. 992 - 1000

Published: Jan. 1, 2025

Large language models (LLMs) in genomics have successfully predicted various functional genomic elements. While their performance is typically evaluated using benchmark datasets, it remains unclear which LLM is best suited for specific downstream tasks, particularly for generating whole-genome annotations. Current LLMs fall into three main categories: transformer-based models, long convolution-based models, and state-space models (SSMs). In this study, we benchmarked the different types of architectures on whole-genome maps of G-quadruplexes (GQ), a type of flipon, or non-B DNA structure, characterized by distinctive patterns and roles in diverse regulatory contexts. Although a GQ forms from the folding of guanosine residues into tetrads, the computational task is challenging, as the bases involved may be on different strands, separated by a large number of nucleotides, or made of RNA rather than DNA. All models performed comparably well, with DNABERT-2 and HyenaDNA achieving superior results based on F1 and MCC. Analysis of the whole-genome annotations revealed that transformer-based models recovered more quadruplexes in distal enhancers and intronic regions. The long convolution-based models were better at detecting arrays of GQs likely to contribute to nuclear condensates involved in gene transcription and chromosomal scaffolds. Caduceus formed a separate grouping and generated de novo quadruplexes, while the other models clustered together. Overall, our findings suggest that the models complement each other. Genomic LLMs with varying context lengths can detect distinct regulatory elements, underscoring the importance of selecting the appropriate model for each task. The code and data underlying the article are available at https://github.com/powidla/G4s-FMs.
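
For reference, the F1 and Matthews correlation coefficient (MCC) metrics named in the comparison above can be computed with scikit-learn as in the short sketch below; the labels are toy values standing in for per-window G-quadruplex annotations.

```python
# Sketch of the benchmark metrics named above: F1 and the Matthews
# correlation coefficient (MCC), computed over toy per-window labels
# (1 = G-quadruplex, 0 = background).
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

y_true = np.array([0, 0, 1, 1, 1, 0, 0, 1, 0, 0])  # reference annotation
y_pred = np.array([0, 1, 1, 1, 0, 0, 0, 1, 0, 0])  # model prediction

print("F1 :", f1_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
```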

Language: English

Citations: 0

A Review on the Applications of Transformer-based language models for Nucleotide Sequence Analysis
Nimisha Ghosh, Daniele Santoni, Indrajit Saha et al.

Computational and Structural Biotechnology Journal, Journal Year: 2025, Volume and Issue: unknown

Published: March 1, 2025

Language: English

Citations: 0

DNA sequence analysis landscape: a comprehensive review of DNA sequence analysis task types, databases, datasets, word embedding methods, and language models
Muhammad Nabeel Asim, Muhammad Ali Ibrahim, Alam Zaib et al.

Frontiers in Medicine, Journal Year: 2025, Volume and Issue: 12

Published: April 8, 2025

Deoxyribonucleic acid (DNA) serves as the fundamental genetic blueprint that governs the development, functioning, growth, and reproduction of all living organisms. DNA can be altered through germline and somatic mutations. Germline mutations underlie hereditary conditions, while somatic mutations are induced by various factors, including environmental influences, chemicals, lifestyle choices, and errors in replication and repair mechanisms, which can lead to cancer. DNA sequence analysis plays a pivotal role in uncovering the intricate information embedded within an organism's genome and in understanding the factors that modify it. This helps in the early detection of diseases and the design of targeted therapies. Traditional wet-lab experimental analysis is costly, time-consuming, and prone to errors. To accelerate large-scale DNA sequence analysis, researchers are developing AI applications that complement wet-lab methods. These approaches help generate hypotheses, prioritize experiments, and interpret results by identifying patterns in large genomic datasets. Effective integration of AI with experimental validation requires scientists to understand both fields. Considering the need for comprehensive literature that bridges the gap between these fields, the contributions of this paper are manifold: it presents a diverse range of DNA sequence analysis tasks and methodologies; it equips researchers with essential biological knowledge of 44 distinct tasks and aligns these tasks with 3 AI paradigms, namely classification, regression, and clustering; it streamlines dataset development by consolidating 36 databases used to develop benchmark datasets for the different tasks; and, to ensure fair performance comparisons between new and existing predictors, it provides insights into 140 related datasets and surveys word embedding and language model applications across predictor development, covering 39 word embedding based and 67 language model based predictive pipelines along with the performance values of the top performing encoding-based predictors.

Language: English

Citations: 0

A germline chimeric KANK1-DMRT1 transcript derived from a complex structural variant is associated with a congenital heart defect segregating across five generations
Silvia Souza da Costa, Veniamin Fishman, Mara Pinheiro et al.

Chromosome Research, Journal Year: 2024, Volume and Issue: 32(2)

Published: March 19, 2024

Language: English

Citations: 3

GENA-Web - GENomic Annotations Web Inference using DNA language models

Alexey Shmelev, Maxim Petrov, Dmitry Penzar et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: April 29, 2024

Abstract: The advent of advanced sequencing technologies has significantly reduced the cost and increased the feasibility of assembling high-quality genomes. Yet, the annotation of genomic elements remains a complex challenge. Even for species with comprehensively annotated reference genomes, the functional assessment of individual genetic variants is not straightforward. In response to these challenges, recent breakthroughs in machine learning have led to the development of DNA language models. These transformer-based architectures are designed to tackle a wide array of genomic tasks with enhanced efficiency and accuracy. In this context, we introduce GENA-Web, a web-based platform that consolidates a suite of genome annotation tools powered by DNA language models. The version of GENA-Web presented here encompasses a diverse set of models trained on human data, including the prediction of promoter activity, splice sites, and determination of various chromatin features, as well as a model for scoring enhancer activity in Drosophila. GENA-Web is accessible online at https://dnalm.airi.net/

Language: English

Citations: 3

Species-specific design of artificial promoters by transfer-learning based generative deep-learning model

Yan Xia, Xiaowen Du, Bin Liu et al.

Nucleic Acids Research, Journal Year: 2024, Volume and Issue: 52(11), P. 6145 - 6157

Published: May 23, 2024

Abstract: Native prokaryotic promoters share common sequence patterns, but are species dependent. For understudied species with limited data, it is challenging to predict the strength of existing promoters and to generate novel promoters. Here, we developed PromoGen, a collection of nucleotide language models to generate species-specific functional promoters across dozens of species in a data- and parameter-efficient way. Twenty-seven species-specific models in this collection were finetuned from a pretrained model which was trained on multi-species promoters. When systematically compared with native promoters, the Escherichia coli- and Bacillus subtilis-specific artificial PromoGen-generated promoters (PGPs) were demonstrated to hold all the distribution patterns of native promoters. A regression model was developed to score promoters generated either by PromoGen or by another competitive neural network, and the overall scores of PGPs were higher. Encouraged by the in silico analysis, we further experimentally characterized twenty-two B. subtilis PGPs; the results showed that four of the tested PGPs reached a strong promoter level, while a number of others were also active. Furthermore, we developed a user-friendly website to generate promoter sequences with the 27 different species-specific models in PromoGen. This work presented an efficient deep-learning strategy for de novo generation of species-specific promoters, even for species with limited datasets, providing valuable promoter toolboxes especially for metabolic engineering of understudied microorganisms.

Language: English

Citations: 2