MuLan-Methyl—multiple transformer-based language models for accurate DNA methylation prediction
Wenhuan Zeng, Anupam Gautam, Daniel H. Huson, et al.

GigaScience, Journal Year: 2022, Volume and Issue: 12

Published: Dec. 28, 2022

Abstract Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning–based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the “pretrain and fine-tune” paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status of each type. The 5 models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. MuLan-Methyl is open source, and we provide a web server that implements the approach.

Language: English

DeepBIO: an automated and interpretable deep-learning platform for high-throughput biological sequence prediction, functional annotation and visualization analysis
Ruheng Wang, Yi Jiang, Junru Jin, et al.

Nucleic Acids Research, Journal Year: 2023, Volume and Issue: 51(7), P. 3017 - 3029

Published: Feb. 17, 2023

Abstract Here, we present DeepBIO, the first-of-its-kind automated and interpretable deep-learning platform for high-throughput biological sequence functional analysis. DeepBIO is a one-stop-shop web service that enables researchers to develop new deep-learning architectures to answer any biological question. Specifically, given any biological sequence data, DeepBIO supports a total of 42 state-of-the-art deep-learning algorithms for model training, comparison, optimization and evaluation in a fully automated pipeline. DeepBIO provides a comprehensive result visualization analysis for predictive models covering several aspects, such as model interpretability, feature analysis and functional sequential region discovery. Additionally, DeepBIO supports nine base-level functional annotation tasks using deep-learning architectures, with comprehensive interpretations and graphical visualizations to validate the reliability of annotated sites. Empowered by high-performance computers, DeepBIO allows ultra-fast prediction with up to million-scale sequence data in a few hours, demonstrating its usability in real application scenarios. Case study results show that DeepBIO provides an accurate, robust and interpretable prediction, demonstrating the power of deep learning in biological sequence functional analysis. Overall, we expect DeepBIO to ensure the reproducibility of deep-learning biological sequence analysis, lessen the programming and hardware burden for biologists and provide meaningful functional insights at both the sequence level and base level from biological sequences alone. DeepBIO is publicly available at https://inner.wei-group.net/DeepBIO.

Language: English

Citations: 98

iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations
Junru Jin, Yingying Yu, Ruheng Wang, et al.

Genome biology, Journal Year: 2022, Volume and Issue: 23(1)

Published: Oct. 17, 2022

Abstract In this study, we propose iDNA-ABF, a multi-scale deep biological language learning model that enables the interpretable prediction of DNA methylations based on genomic sequences only. Benchmarking comparisons show that our iDNA-ABF outperforms state-of-the-art methods for different methylation predictions. Importantly, we show the power of deep language learning in capturing both sequential and functional semantics information from background genomes. Moreover, by integrating the interpretable analysis mechanism, we can explain what the model learns, helping us build the mapping from the discovery of important sequential determinants to the in-depth analysis of their biological functions.

Language: English

Citations: 91

DisoFLAG: accurate prediction of protein intrinsic disorder and its functions using graph-based interaction protein language model
Yihe Pang, Bin Liu

BMC Biology, Journal Year: 2024, Volume and Issue: 22(1)

Published: Jan. 2, 2024

Abstract Intrinsically disordered proteins and regions (IDPs/IDRs) are functionally important proteins and regions that exist as highly dynamic conformations under natural physiological conditions. IDPs/IDRs exhibit a broad range of molecular functions, and their functions involve binding interactions with partners and remaining native structural flexibility. The rapid increase in the number of proteins in sequence databases and the diversity of disordered functions challenge existing computational methods for predicting protein intrinsic disorder and disordered functions. A disordered region interacts with different partners to perform multiple functions, and these disordered functions exhibit different dependencies and correlations. In this study, we introduce DisoFLAG, a computational method that leverages a graph-based interaction protein language model (GiPLM) for jointly predicting disorder and its multiple potential functions. GiPLM integrates protein semantic information based on pre-trained protein language models into graph-based interaction units to enhance the correlation of the semantic representation of multiple disordered functions. The DisoFLAG predictor takes amino acid sequences as the only inputs and provides predictions of intrinsic disorder and six disordered functions for proteins, including protein-binding, DNA-binding, RNA-binding, ion-binding, lipid-binding, and flexible linker. We evaluated the predictive performance of DisoFLAG following the Critical Assessment of protein Intrinsic Disorder (CAID) experiments, and the results demonstrated that DisoFLAG offers accurate and comprehensive predictions of disordered functions, extending the current coverage of computationally predicted disordered function categories. The standalone package and web server of DisoFLAG have been established to provide accurate prediction tools for intrinsic disorders and their associated functions.

Language: English

Citations: 12

iDNA-OpenPrompt: OpenPrompt learning model for identifying DNA methylation
Xia Yu, Jia Ren, Haixia Long, et al.

Frontiers in Genetics, Journal Year: 2024, Volume and Issue: 15

Published: April 16, 2024

Introduction: DNA methylation is a critical epigenetic modification involving the addition of a methyl group to the DNA molecule, playing a key role in regulating gene expression without changing the DNA sequence. The main difficulty in identifying DNA methylation sites lies in the subtle and complex nature of methylation patterns, which may vary across different tissues, developmental stages, and environmental conditions. Traditional methods for methylation site identification, such as bisulfite sequencing, are typically labor-intensive and costly and require large amounts of DNA, hindering high-throughput analysis. Moreover, these methods may not always provide the resolution needed to detect methylation at specific sites, especially in genomic regions that are rich in repetitive sequences or have low levels of methylation. Furthermore, current deep learning approaches generally lack sufficient accuracy. Methods: This study introduces the iDNA-OpenPrompt model, leveraging the novel OpenPrompt learning framework. The model combines a prompt template, a prompt verbalizer, and a Pre-trained Language Model (PLM) to construct the prompt-learning framework for DNA methylation sequences. A DNA vocabulary library, a BERT tokenizer, and specific label words are also introduced into the model to enable accurate identification of DNA methylation sites. Results and Discussion: An extensive analysis is conducted to evaluate the predictive, reliability, and consistency capabilities of the iDNA-OpenPrompt model. The experimental outcomes, covering 17 benchmark datasets that include various species and three DNA methylation modifications (4mC, 5hmC, 6mA), consistently indicate that our model surpasses existing approaches in performance and robustness.

Language: English

Citations: 12

Prediction of blood–brain barrier penetrating peptides based on data augmentation with Augur
Zhi-Feng Gu, Yu-Duo Hao, Tianyu Wang, et al.

BMC Biology, Journal Year: 2024, Volume and Issue: 22(1)

Published: April 19, 2024

Abstract Background: The blood–brain barrier serves as a critical interface between the bloodstream and brain tissue, mainly composed of pericytes, neurons, endothelial cells, and tightly connected basal membranes. It plays a pivotal role in safeguarding the brain from harmful substances, thus protecting the integrity of the nervous system and preserving overall brain homeostasis. However, this remarkable selective transmission also poses a formidable challenge in the treatment of central nervous system diseases, hindering the delivery of large-molecule drugs into the brain. In response to this challenge, many researchers have devoted themselves to developing drug delivery systems capable of breaching the blood–brain barrier. Among these, blood–brain barrier penetrating peptides have emerged as promising candidates. These peptides have the advantages of high biosafety, ease of synthesis, and exceptional penetration efficiency, making them an effective drug delivery solution. While previous studies have developed a few prediction models for blood–brain barrier penetrating peptides, their performance has often been hampered by the issue of limited positive data. Results: In this study, we present Augur, a novel prediction model using borderline-SMOTE-based data augmentation and machine learning. We extract highly interpretable physicochemical properties of blood–brain barrier penetrating peptides while solving the issues of small sample size and imbalance of positive and negative samples. Experimental results demonstrate the superior prediction performance of Augur, with an AUC value of 0.932 on the training set and 0.931 on the independent test set. Conclusions: This newly developed Augur demonstrates superior performance in predicting blood–brain barrier penetrating peptides, offering valuable insights for drug development targeting neurological disorders. This breakthrough may enhance the efficiency of peptide-based drug discovery and pave the way for innovative treatment strategies for central nervous system diseases.

Language: English

Citations: 12

GenoM7GNet: An Efficient N7-Methylguanosine Site Prediction Approach Based on a Nucleotide Language Model
Chuang Li, Heshi Wang, Yanhua Wen, et al.

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Journal Year: 2024, Volume and Issue: 21(6), P. 2258 - 2268

Published: Sept. 20, 2024

Language: English

Citations: 12

AGF-PPIS: A protein–protein interaction site predictor based on an attention mechanism and graph convolutional networks

Xiuhao Fu, Ye Yuan, Haoye Qiu, et al.

Methods, Journal Year: 2024, Volume and Issue: 222, P. 142 - 151

Published: Jan. 17, 2024

Language: English

Citations: 11

A computational model of circRNA-associated diseases based on a graph neural network: prediction and case studies for follow-up experimental validation
Mengting Niu, Chunyu Wang, Zhanguo Zhang, et al.

BMC Biology, Journal Year: 2024, Volume and Issue: 22(1)

Published: Jan. 29, 2024

Abstract Background: Circular RNAs (circRNAs) have been confirmed to play a vital role in the occurrence and development of diseases. Exploring the relationship between circRNAs and diseases is of far-reaching significance for studying etiopathogenesis and treating diseases. To this end, based on the graph Markov neural network algorithm (GMNN) constructed in our previous work GMNN2CD, we further considered the multisource biological data that affects the association between circRNA and disease, developed an updated web server, CircDA, and used human hepatocellular carcinoma (HCC) tissue data to verify the prediction results of CircDA. Results: CircDA is built on a Tumarkov-based deep learning framework. The algorithm regards biomolecules as nodes and the interactions between molecules as edges, reasonably abstracts multiomics data, and models them as a heterogeneous biomolecular association network, which can reflect the complex relationship between different biomolecules. Case studies using literature data from HCC, cervical, and gastric cancers demonstrate that the CircDA predictor can identify missing associations between known circRNAs and diseases. In a quantitative real-time PCR (RT-qPCR) experiment on human HCC tissue samples, five circRNAs were found to be significantly differentially expressed, which proved that CircDA can predict diseases related to new circRNAs. Conclusions: This efficient computational prediction and case analysis with sufficient feedback allows us to identify circRNA-associated diseases and disease-associated circRNAs. Our work provides a method to predict circRNA-associated diseases and can provide guidance for the association of diseases with certain circRNAs. For ease of use, an online prediction server ( http://server.malab.cn/CircDA ) is provided, and the code is open-sourced ( https://github.com/nmt315320/CircDA.git ) for the convenience of algorithm improvement.

Language: English

Citations: 11

A BERT-based model for the prediction of lncRNA subcellular localization in Homo sapiens
Zhao‐Yue Zhang, Zheng Zhang, Xiucai Ye, et al.

International Journal of Biological Macromolecules, Journal Year: 2024, Volume and Issue: 265, P. 130659 - 130659

Published: March 11, 2024

Language: English

Citations: 11

CODENET: A deep learning model for COVID-19 detection

Hong Ju, Yanyan Cui, Qiaosen Su, et al.

Computers in Biology and Medicine, Journal Year: 2024, Volume and Issue: 171, P. 108229 - 108229

Published: Feb. 29, 2024

Language: English

Citations: 10