iLEC-DNA: Identifying Long Extra-chromosomal Circular DNA by Fusing Sequence-derived Features of Physicochemical Properties and Nucleotide Distribution Patterns

Ahtisham Fazeel Abbasi, Muhammad Nabeel Asim, Andreas Dengel, et al.

Research Square, Journal year: 2023, Issue: unknown

Published: Sep. 29, 2023

Abstract Long extrachromosomal circular DNA (leccDNA) regulates several biological processes such as genomic instability, gene amplification, and oncogenesis. The identification of leccDNA holds significant importance for investigating its potential associations with cancer, autoimmune, cardiovascular, and neurological diseases. In addition, understanding these associations can provide valuable insights about disease mechanisms and therapeutic approaches. Conventionally, wet lab-based methods are utilized to identify leccDNA, which are hindered by the need for prior knowledge and resource-intensive processes, potentially limiting their broader applicability. To empower the identification process across multiple species, the paper in hand presents the very first computational predictor. The proposed iLEC-DNA predictor makes use of an SVM classifier along with sequence-derived nucleotide distribution patterns and physico-chemical properties-based features. The study introduces a set of 12 benchmark datasets related to three species, namely Homo sapiens (HM), Arabidopsis thaliana (AT), and Saccharomyces cerevisiae (SC/YS). It performs large-scale experimentation under different experimental settings using more than 140 baseline predictors. iLEC-DNA outperforms the baseline predictors across diverse datasets, producing average performance values of 80.699%, 61.45% and 80.7% in terms of ACC, MCC and AUC-ROC over all datasets. The source code is available at https://github.com/FAhtisham/Extrachrosmosomal-DNA-Prediction. To facilitate the scientific community, a web application is provided at https://sds_genetic_analysis.opendfki.de/iLEC_DNA//.
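
The abstract above describes a general recipe: derive nucleotide-distribution features from DNA sequences and train an SVM on them. Below is a minimal sketch of that recipe, not the iLEC-DNA implementation itself; the dinucleotide-composition feature, the toy sequences, and the SVM hyperparameters are illustrative assumptions.

```python
# Minimal sketch: nucleotide-composition features + SVM (illustrative, not iLEC-DNA code).
from itertools import product
import numpy as np
from sklearn.svm import SVC

def dinucleotide_composition(seq: str) -> np.ndarray:
    """Frequency of each of the 16 dinucleotides in a DNA sequence."""
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    counts = {p: 0 for p in pairs}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] = counts.get(seq[i:i + 2], 0) + 1
    total = max(len(seq) - 1, 1)
    return np.array([counts[p] / total for p in pairs])

# Toy training data: sequences labelled 1 (leccDNA) or 0 (non-leccDNA).
sequences = ["ACGTACGTGGCCTA", "GGGCCCGGGCCCAA", "ATATATATCGCGAT", "TTTTAAAACCCGGG"]
labels = [1, 1, 0, 0]

X = np.vstack([dinucleotide_composition(s) for s in sequences])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict([dinucleotide_composition("ACGTGGCCTAGGCC")]))
```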

Language: English

Large language models in bioinformatics: applications and perspectives
Jiajia Liu, Mengyuan Yang, Yankai Yu, et al.

arXiv (Cornell University), Journal year: 2024, Issue: unknown

Published: Jan. 1, 2024

Large language models (LLMs) are a class of artificial intelligence models based on deep learning, which show great performance in various tasks, especially natural language processing (NLP). LLMs typically consist of neural networks with numerous parameters, trained on large amounts of unlabeled input using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we present a summary of the prominent LLMs used in natural language processing, such as BERT and GPT, and focus on exploring their applications at different omics levels in bioinformatics, mainly including genomics, transcriptomics, proteomics, drug discovery and single-cell analysis. Finally, the review summarizes the potential and prospects of LLMs for solving bioinformatic problems.

Language: English

Cited by: 8

The Explainability of Transformers: Current Status and Directions
Paolo Fantozzi, Maurizio Naldi

Computers, Journal year: 2024, Issue: 13(4), pp. 92–92

Published: April 4, 2024

An increasing demand for model explainability has accompanied the widespread adoption of transformers in various fields of application. In this paper, we conduct a survey of the existing literature on the explainability of transformers. We provide a taxonomy of methods based on the combination of transformer components that are leveraged to arrive at the explanation. For each method, we describe its mechanism, and we find that attention-based methods, both alone and in conjunction with activation-based and gradient-based methods, are the most employed ones. Growing attention is also devoted to the deployment of visualization techniques that help the explanation process.
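
As a concrete illustration of the attention-based family the survey identifies as most common, the sketch below extracts and aggregates attention weights from a pretrained BERT model. It is not a method from the paper; the model name and the choice to average the [CLS] attention row over heads are illustrative assumptions.

```python
# Minimal sketch of attention-based explanation: inspect a transformer's attention weights.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers can be explained via attention.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, shape (batch, heads, seq_len, seq_len).
last_layer = outputs.attentions[-1][0]          # (heads, seq_len, seq_len)
token_importance = last_layer.mean(dim=0)[0]    # [CLS]-row attention, averaged over heads

for token, score in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
                        token_importance.tolist()):
    print(f"{token:>12s}  {score:.3f}")
```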

Language: English

Cited by: 5

MethylProphet: A Generalized Gene-Contextual Model for Inferring Whole-Genome DNA Methylation Landscape
Xiaoke Huang, Qi Liu, Yifei Zhao, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2025, Issue: unknown

Published: Feb. 8, 2025

Abstract DNA methylation (DNAm), an epigenetic modification, regulates gene expression, influences phenotypes, and encodes inheritable information, making it critical for disease diagnosis, treatment, and prevention. While the human genome contains approximately 28 million CpG sites where DNAm can be measured, only 1–3% of these are typically available in most datasets due to complex experimental protocols and high costs, hindering insights from the data. Leveraging the relationship between gene expression and DNAm offers promise for computational inference, but existing statistical, machine learning, and masking-based generative Transformer approaches face limitations: they cannot effectively infer DNAm at unmeasured CpGs or in new samples. To overcome these challenges, we introduce MethylProphet, a gene-guided, context-aware Transformer model designed for DNAm inference. MethylProphet employs a Bottleneck MLP for efficient expression-profile compression and a specialized sequence tokenizer, integrating global expression patterns with local sequence context through an encoder architecture. Trained on whole-genome bisulfite sequencing data from ENCODE (1.6B training CpG-sample pairs; 322B tokens), it demonstrates strong performance in hold-out evaluations, effectively inferring DNAm at unmeasured CpGs and in new samples. In addition, its application to 10842 pairs from TCGA chromosome 1 (450M CpG-sample pairs; 91B tokens) highlights its potential to facilitate pan-cancer DNAm landscape inference, offering a powerful tool for advancing research and precision medicine. All codes, data, protocols, and models are publicly available via https://github.com/xk-huang/methylprophet/.
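
The following is a rough PyTorch sketch of the architectural idea summarized above: a bottleneck-compressed expression profile fused with a tokenized local sequence context feeding a small encoder. It is not the authors' MethylProphet code; all layer sizes, the fusion-by-addition step, and the toy inputs are assumptions.

```python
# Illustrative sketch only: expression bottleneck MLP + sequence-context encoder.
import torch
import torch.nn as nn

class BottleneckMLP(nn.Module):
    def __init__(self, n_genes: int, bottleneck: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_genes, 1024), nn.ReLU(),
                                 nn.Linear(1024, bottleneck))

    def forward(self, expr):                       # expr: (batch, n_genes)
        return self.net(expr)                      # (batch, bottleneck)

class CpGContextModel(nn.Module):
    def __init__(self, n_genes: int, vocab_size: int = 6, d_model: int = 128):
        super().__init__()
        self.expr_encoder = BottleneckMLP(n_genes, d_model)
        self.seq_embed = nn.Embedding(vocab_size, d_model)     # A/C/G/T/N/pad tokens
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(d_model, 1)                      # methylation level in [0, 1]

    def forward(self, seq_tokens, expr):
        h_seq = self.encoder(self.seq_embed(seq_tokens))       # (batch, L, d_model)
        h = h_seq.mean(dim=1) + self.expr_encoder(expr)        # fuse sequence and expression
        return torch.sigmoid(self.head(h)).squeeze(-1)

model = CpGContextModel(n_genes=2000)
seq = torch.randint(0, 4, (8, 200))      # toy batch: 8 CpG windows of 200 bp
expr = torch.randn(8, 2000)              # toy expression profiles
print(model(seq, expr).shape)            # torch.Size([8])
```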

Language: English

Cited by: 0

Peptide classification landscape: An in-depth systematic literature review on peptide types, databases, datasets, predictors architectures and performance
Muhammad Nabeel Asim, Tayyaba Asif, Faiza Mehmood, et al.

Computers in Biology and Medicine, Journal year: 2025, Issue: 188, pp. 109821–109821

Published: Feb. 22, 2025

Language: English

Cited by: 0

Deciphering genomic codes using advanced natural language processing techniques: a scoping review
Shuyan Cheng, Yishu Wei, Yiliang Zhou, et al.

Journal of the American Medical Informatics Association, Journal year: 2025, Issue: unknown

Published: Feb. 25, 2025

Abstract Objectives The vast and complex nature of human genomic sequencing data presents challenges for effective analysis. This review aims to investigate the application of natural language processing (NLP) techniques, particularly large language models (LLMs) and transformer architectures, in deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. The goal of this review is to assess data and model accessibility in the most recent literature, gaining a better understanding of the existing capabilities and constraints of these tools in processing genomic sequencing data. Materials and Methods Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, our scoping review was conducted across PubMed, Medline, Scopus, Web of Science, Embase, and the ACM Digital Library. Studies were included if they focused on NLP methodologies applied to genomic sequencing data analysis, without restrictions on publication date or article type. Results A total of 26 studies published between 2021 and April 2024 were selected for this review. The review highlights that tokenization and transformer models enhance the processing and understanding of genomic data, with applications in predicting regulatory annotations like transcription-factor binding sites and chromatin accessibility. Discussion LLM-based genomic interpretation is a promising field that can help streamline large-scale data analysis while also providing a better understanding of genomic structures. It has the potential to drive advancements in personalized medicine by offering more efficient and scalable solutions. Further research is needed to discuss and overcome current limitations, enhancing transparency and applicability. Conclusion This review highlights the growing role of NLP, particularly LLMs, in genomic data analysis. While these models improve processing and prediction, challenges remain in interpretability. Further work is needed to refine their application in genomics.
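
One recurring preprocessing step the review discusses is tokenization of genomic sequences. A minimal, self-contained sketch of overlapping k-mer tokenization is given below; the choice of k and the example sequence are assumptions, not taken from any reviewed study.

```python
# Minimal sketch: overlapping k-mer tokenization of a DNA sequence.
def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into overlapping k-mer tokens."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

tokens = kmer_tokenize("ACGTACGTGGCTA", k=6)
print(tokens)  # ['ACGTAC', 'CGTACG', 'GTACGT', 'TACGTG', 'ACGTGG', 'CGTGGC', 'GTGGCT', 'TGGCTA']
```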

Language: English

Cited by: 0

Beyond digital twins: the role of foundation models in enhancing the interpretability of multiomics modalities in precision medicine
Sakhaa B. Alsaedi, Xin Gao, Takashi Gojobori, et al.

FEBS Open Bio, Journal year: 2025, Issue: unknown

Published: Feb. 24, 2025

Medical digital twins (MDTs) are virtual representations of patients that simulate the biological, physiological, and clinical processes of individuals to enable personalized medicine. With the increasing complexity of omics data, particularly multiomics, there is a growing need for advanced computational frameworks to interpret these data effectively. Foundation models (FMs), large-scale machine learning models pretrained on diverse data types, have recently emerged as powerful tools for improving interpretability and decision-making in precision medicine. This review discusses the integration of FMs into MDT systems and their role in enhancing the interpretability of multiomics data. We examine current challenges, recent advancements, and future opportunities in leveraging FMs for multiomics analysis in MDTs, with a focus on their application in precision medicine.

Language: English

Cited by: 0

A Review on the Applications of Transformer-based language models for Nucleotide Sequence Analysis
Nimisha Ghosh, Daniele Santoni, Indrajit Saha, et al.

Computational and Structural Biotechnology Journal, Journal year: 2025, Issue: unknown

Published: March 1, 2025

Language: English

Cited by: 0

MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction
Wenhuan Zeng, Anupam Gautam, Daniel H. Huson, et al.

GigaScience, Journal year: 2022, Issue: 12

Published: Dec. 28, 2022

Abstract Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning–based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the "pretrain and fine-tune" paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the methylation status for each methylation type. The models then collectively predict the methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. MuLan-Methyl is open source, and we provide a web server that implements the approach.
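
The core ensemble idea, combining predictions from several fine-tuned language models, can be sketched as follows. This is not the MuLan-Methyl code: the stand-in scorers simply return fixed probabilities where the real framework would query its five fine-tuned transformers.

```python
# Minimal sketch: average per-model methylation probabilities (illustrative stand-ins).
from statistics import mean
from typing import Callable, Sequence

def ensemble_predict(models: Sequence[Callable[[str], float]], dna: str) -> float:
    """Average the per-model probability that the site in `dna` is methylated."""
    return mean(model(dna) for model in models)

# Stand-in scorers (assumptions): each returns P(methylated) for a sequence window.
toy_models = [
    lambda s: 0.82, lambda s: 0.77, lambda s: 0.64, lambda s: 0.71, lambda s: 0.90,
]
p = ensemble_predict(toy_models, "ACGGATCGAAGCTTAGCCAT")
print(f"P(methylated) = {p:.2f}, call = {'methylated' if p > 0.5 else 'unmethylated'}")
```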

Language: English

Cited by: 20

Large language models and their applications in bioinformatics
Oluwafemi A. Sarumi, Dominik Heider

Computational and Structural Biotechnology Journal, Journal year: 2024, Issue: 23, pp. 3498–3505

Published: Oct. 5, 2024

Language: English

Cited by: 4

RNA Sequence Analysis Landscape: A Comprehensive Review of Task Types, Databases, Datasets, Word Embedding Methods, and Language Models
Muhammad Nabeel Asim, Muhammad Ali Ibrahim, Tayyaba Asif, et al.

Heliyon, Journal year: 2025, Issue: 11(2), pp. e41488–e41488

Published: Jan. 1, 2025

Deciphering the information in RNA sequences reveals their diverse roles in living organisms, including gene regulation and protein synthesis. Aberrations in RNA sequences, such as dysregulation and mutations, can drive a spectrum of diseases, including cancers, genetic disorders, and neurodegenerative conditions. Furthermore, researchers are harnessing RNA's therapeutic potential for transforming traditional treatment paradigms into personalized therapies through the development of RNA-based drugs and therapies. To gain insights into biological functions, to detect diseases at early stages, and to develop potent therapeutics, researchers perform various types of RNA sequence analysis tasks. Performing these analyses with conventional wet-lab methods is expensive, time-consuming, and error prone. To enable large-scale RNA sequence analysis, empowerment of experimental methods with Artificial Intelligence (AI) applications necessitates that scientists have comprehensive knowledge of both RNA and AI fields. While molecular biologists encounter challenges in understanding AI methods, computer scientists often lack basic foundations in RNA biology. Considering the absence of literature that bridges this research gap and promotes AI-driven applications, the contributions of this manuscript are manifold: it equips researchers with 47 distinct sets of benchmark datasets related to RNA analysis tasks by summarizing the cruxes of 64 different databases. It presents word embeddings and language models applications across RNA analysis tasks and streamlines the development of new predictors by providing a survey of 58 word embedding and 70 language model based predictive pipelines, along with their performance values as well as top-performing encodings.
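
As one example of the word-embedding methods such surveys cover, the sketch below trains word2vec vectors over k-mer tokens of toy RNA sequences with gensim; the sequences, k, and hyperparameters are assumptions rather than settings from any surveyed pipeline.

```python
# Minimal sketch: word2vec embeddings over RNA k-mer "words" (illustrative only).
from gensim.models import Word2Vec

def kmers(seq: str, k: int = 3) -> list[str]:
    """Overlapping k-mers of an RNA sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

rna_sequences = ["AUGGCUACGUAGCUAGCUA", "GCUAGCUAGGAUCCGAUUA", "AUGCGAUCGAUCGUAGCUA"]
corpus = [kmers(s) for s in rna_sequences]

model = Word2Vec(sentences=corpus, vector_size=32, window=5, min_count=1, sg=1, epochs=50)
print(model.wv["AUG"][:5])                   # first 5 dimensions of the 'AUG' k-mer vector
print(model.wv.most_similar("AUG", topn=3))  # nearest k-mers in embedding space
```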

Language: English

Cited by: 0