M6A-BERT-Stacking: A Tissue-Specific Predictor for Identifying RNA N6-Methyladenosine Sites Based on BERT and Stacking Strategy DOI Open Access

Qianyue Li, Xin Cheng, Chen Song et al.

Symmetry, Year: 2023, Volume: 15(3), pp. 731 - 731

Published: March 15, 2023

As the most abundant RNA methylation modification, N6-methyladenosine (m6A) can regulate the asymmetric and symmetric division of hematopoietic stem cells and plays an important role in various diseases. Therefore, the precise identification of m6A sites around the genomes of different species is a critical step toward further revealing their biological functions and their influence on these diseases. However, traditional wet-lab experimental methods for identifying m6A sites are often laborious and expensive. In this study, we proposed an ensemble deep learning model called m6A-BERT-Stacking, a powerful predictor for the detection of m6A sites in tissues of three species. First, we utilized two encoding methods, i.e., di ribonucleotide index (DiNUCindex_RNA) and k-mer word segmentation, to extract sequence features. Second, the encoding matrices together with the original sequences were respectively input into deep learning models in parallel to train sub-models, namely residual networks with convolutional block attention module (Resnet-CBAM), bidirectional long short-term memory with attention (BiLSTM-Attention), and the pre-trained bidirectional encoder representations from transformers model for DNA-language (DNABERT). Finally, the outputs of all sub-models were ensembled based on the stacking strategy to obtain the final prediction through a fully connected layer. The results demonstrated that m6A-BERT-Stacking outperformed existing methods on the same independent datasets.

Language: English
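
The k-mer word segmentation step named in the m6A-BERT-Stacking abstract can be illustrated with a minimal Python sketch. The abstract does not state the value of k or the stride, so the overlapping stride-1 window, the default k = 3, and the function name `kmer_segment` are assumptions made here for illustration, not the authors' implementation.

```python
from typing import List


def kmer_segment(sequence: str, k: int = 3) -> List[str]:
    """Split a nucleotide sequence into overlapping k-mer "words".

    Assumption: stride-1 sliding window; k = 3 is an illustrative default.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # A sequence of length n yields n - k + 1 overlapping words
    # (an empty list when the sequence is shorter than k).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]
```

For instance, `kmer_segment("AUGGAC", 3)` yields the words `AUG`, `UGG`, `GGA`, `GAC`, which a language-model tokenizer could then map to vocabulary indices.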

Anticancer peptides prediction with deep representation learning features DOI
Zhibin Lv, Feifei Cui, Quan Zou et al.

Briefings in Bioinformatics, Year: 2021, Volume: 22(5)

Published: Jan. 6, 2021

Anticancer peptides constitute one of the most promising therapeutic agents for combating common human cancers. Using wet experiments to verify whether a peptide displays anticancer characteristics is time-consuming and costly. Hence, in this study, we proposed a computational method named identify anticancer peptides via deep representation learning features (iACP-DRLF), using a light gradient boosting machine algorithm and deep representation learning features. Two kinds of sequence embedding technologies were used, namely soft symmetric alignment embedding and unified representation (UniRep) embedding, both of which involved deep neural network models based on long short-term memory networks and their derived networks. The results showed that the use of deep representation learning features greatly improved the capability of the models to discriminate anticancer peptides from other peptides. Also, UMAP (uniform manifold approximation and projection for dimension reduction) and SHAP (Shapley additive explanations) analysis proved that UniRep features have an advantage over other features for anticancer peptide identification. The Python script and pretrained models can be downloaded from https://github.com/zhibinlv/iACP-DRLF or http://public.aibiochem.net/iACP-DRLF/.

Language: English

Cited by: 120

Biological Sequence Classification: A Review on Data and General Methods DOI Creative Commons
Chunyan Ao, Shihu Jiao, Yansu Wang et al.

Research, Year: 2022, Volume: 2022

Published: Jan. 1, 2022

With the rapid development of biotechnology, the number of biological sequences has grown exponentially. The continuous expansion of biological sequence data promotes the application of machine learning to biological sequences to construct predictive models for mining biological sequence information. There are many branches of biological sequence classification research. In this review, we mainly focus on the function and modification classification of biological sequences based on machine learning. Sequence-based prediction and analysis are the basic tasks for understanding the biological functions of DNA, RNA, proteins, and peptides. However, there are hundreds of classification models developed for biological sequences, and the quite varied specific methods seem dizzying at first glance. Here, we aim to establish a long-term support website (http://lab.malab.cn/~acy/BioseqData/home.html), which provides readers with detailed information on the classification methods and download links to relevant datasets. We briefly introduce the steps to build an effective model framework for biological sequence data. In addition, a brief introduction to single-cell sequencing data and its applications in biology is also included. Finally, we discuss the current challenges and future perspectives of biological sequence classification research.

Language: English

Cited by: 73

Discovering Consensus Regions for Interpretable Identification of RNA N6-Methyladenosine Modification Sites via Graph Contrastive Clustering DOI
Guodong Li, Bo-Wei Zhao, Xiaorui Su et al.

IEEE Journal of Biomedical and Health Informatics, Year: 2024, Volume: 28(4), pp. 2362 - 2372

Published: Jan. 24, 2024

As a pivotal post-transcriptional modification of RNA, N6-methyladenosine (m6A) has a substantial influence on gene expression modulation and cellular fate determination. Although a variety of computational models have been developed to accurately identify potential m6A modification sites, few of them are capable of interpreting the identification process with insights gained from consensus knowledge. To overcome this problem, we propose a deep learning model, namely M6A-DCR, by discovering consensus regions for interpretable identification of m6A modification sites. In particular, M6A-DCR first constructs an instance graph for each RNA sequence by integrating the specific positions and types of nucleotides. The discovery of consensus regions is then formulated as a graph clustering problem in light of aggregating all instance graphs. After that, M6A-DCR adopts a motif-aware graph reconstruction optimization process to learn high-quality embeddings of input RNA sequences, thus achieving the identification of m6A modification sites in an end-to-end manner. Experimental results demonstrate the superior performance of M6A-DCR by comparing it with several state-of-the-art identification models. The consideration of consensus regions empowers our model to make interpretable predictions at the motif level. The analysis of cross validation through different species and tissues further verifies the consistency between the identification results of M6A-DCR and the evolutionary relationships among species.

Language: English

Cited by: 28

Computational prediction and interpretation of cell-specific replication origin sites from multiple eukaryotes by exploiting stacking framework DOI
Leyi Wei, Wenjia He, Adeel Malik et al.

Briefings in Bioinformatics, Year: 2020, Volume: 22(4)

Published: Sept. 22, 2020

Origins of replication sites (ORIs), which refer to the initiation locations of genomic DNA replication, play essential roles in the DNA replication process. Detection of the distribution of ORIs at the genome scale is one of the key steps toward an in-depth understanding of their regulation mechanisms. In this study, we presented a novel machine learning-based approach called Stack-ORI, encompassing 10 cell-specific prediction models for identifying ORIs from four different eukaryotic species (Homo sapiens, Mus musculus, Drosophila melanogaster and Arabidopsis thaliana). For each cell-specific model, we employed 12 feature encoding schemes that cover nucleic acid composition, position-specific information and physicochemical properties information. The optimal feature set was identified individually for each encoding and used to develop the respective baseline model using the eXtreme Gradient Boosting (XGBoost) classifier. Subsequently, the predicted scores of the baseline models are integrated as a feature vector to train XGBoost and develop the final model. Extensive experimental results show that Stack-ORI achieves significantly better performance compared with the baseline models on both training and independent datasets. Interestingly, Stack-ORI consistently outperforms the existing predictor in all cell-specific models, not only on the training but also on the independent test. Moreover, our approach provides the interpretations needed to help explain the model's success by leveraging the powerful SHapley Additive exPlanation algorithm, thus underlining the feature encoding schemes that are most important and significant for predicting cell-specific ORIs.

Language: English

Cited by: 114
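
The stacking step described in the Stack-ORI abstract, where the predicted scores of per-encoding baseline models are integrated into a feature vector for a final model, can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions: a small logistic regression fitted by gradient descent stands in for XGBoost at both levels, the meta-features are in-sample scores rather than the cross-validated scores a real pipeline would normally use, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_score(model: Model, x: Sequence[float]) -> float:
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def fit_logistic(X: List[List[float]], y: List[int],
                 lr: float = 0.5, epochs: int = 500) -> Model:
    """Logistic regression by batch gradient descent (stand-in for XGBoost)."""
    n, d = len(X), len(X[0])
    model: Model = ([0.0] * d, 0.0)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_score(model, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        weights, bias = model
        model = ([w - lr * g / n for w, g in zip(weights, grad_w)],
                 bias - lr * grad_b / n)
    return model


def fit_stack(feature_views: List[List[List[float]]], y: List[int]):
    """Train one baseline model per encoding, then a meta-model on their scores."""
    base_models = [fit_logistic(X, y) for X in feature_views]
    # Meta-feature vector of a sample = scores from every baseline model.
    # Assumption: in-sample scores for brevity; out-of-fold scores avoid leakage.
    meta_X = [[predict_score(m, X[i]) for m, X in zip(base_models, feature_views)]
              for i in range(len(y))]
    return base_models, fit_logistic(meta_X, y)


def predict_stack(stack, sample_views: List[List[float]]) -> float:
    base_models, meta_model = stack
    scores = [predict_score(m, x) for m, x in zip(base_models, sample_views)]
    return predict_score(meta_model, scores)
```

Here `feature_views` holds one feature matrix per encoding scheme for the same samples, and `predict_stack` takes one feature vector per encoding for a single new sample.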

Attention-based multi-label neural networks for integrated prediction and interpretation of twelve widely occurring RNA modifications DOI Creative Commons
Zitao Song, Daiyun Huang, Bowen Song et al.

Nature Communications, Year: 2021, Volume: 12(1)

Published: June 29, 2021

Recent studies suggest that epi-transcriptome regulation via post-transcriptional RNA modifications is vital for all RNA types. Precise identification of RNA modification sites is essential for understanding the functions and regulatory mechanisms of RNAs. Here, we present MultiRM, a method for the integrated prediction and interpretation of post-transcriptional RNA modifications from RNA sequences. Built upon an attention-based multi-label deep learning framework, MultiRM not only simultaneously predicts the putative sites of twelve widely occurring transcriptome modifications (m6A, m1A, m5C, m5U, m6Am, m7G, Ψ, I, Am, Cm, Gm, and Um), but also returns the key sequence contents that contribute most to the positive predictions. Importantly, our model revealed a strong association among different types of RNA modifications from the perspective of their associated sequence contexts. Our work provides a solution for detecting multiple RNA modifications, enabling an integrated analysis of these RNA modifications and a better understanding of sequence-based RNA modification mechanisms.

Language: English

Cited by: 91

AI applications in functional genomics DOI Creative Commons
Claudia Caudai, Antonella Galizia, Filippo Geraci et al.

Computational and Structural Biotechnology Journal, Year: 2021, Volume: 19, pp. 5762 - 5790

Published: Jan. 1, 2021

We review the current applications of artificial intelligence (AI) in functional genomics. The recent explosion of AI follows the remarkable achievements made possible by "deep learning", along with a burst of "big data" that can meet its hunger. Biology is about to overthrow astronomy as the paradigmatic representative of big data producers. This has been made possible by huge advancements in the field of high throughput technologies, applied to determine how the individual components of a biological system work together to accomplish different processes. The disciplines contributing to this bulk of data are collectively known as functional genomics. They consist of studies of: i) the information contained in the DNA (genomics); ii) the modifications that DNA can reversibly undergo (epigenomics); iii) the RNA transcripts originated by a genome (transcriptomics); iv) the ensemble of chemical modifications decorating different types of RNA transcripts (epitranscriptomics); v) the products of protein-coding transcripts (proteomics); and vi) the small molecules produced from cell metabolism (metabolomics) present in an organism or system at a given time, in physiological or pathological conditions. After reviewing the main applications of AI in functional genomics, we discuss important accompanying issues, including ethical, legal and economic issues and the importance of explainability.

Language: English

Cited by: 84

AcrPred: A hybrid optimization with enumerated machine learning algorithm to predict Anti-CRISPR proteins DOI
Fanny Dao, Menglu Liu, Wei Su et al.

International Journal of Biological Macromolecules, Year: 2022, Volume: 228, pp. 706 - 714

Published: Dec. 28, 2022

Language: English

Cited by: 49

Identification of sub-Golgi protein localization by use of deep representation learning features DOI Creative Commons
Zhibin Lv, Pingping Wang, Quan Zou et al.

Bioinformatics, Year: 2020, Volume: 36(24), pp. 5600 - 5609

Published: Dec. 14, 2020

The Golgi apparatus has a key functional role in protein biosynthesis within the eukaryotic cell, with malfunction resulting in various neurodegenerative diseases. For a better understanding of the Golgi apparatus, it is essential to identify sub-Golgi protein localization. Although some machine learning methods have been used to identify the sub-Golgi localization of proteins by sequence representation fusion, more accurate sub-Golgi protein identification is still challenging with existing methodology. We developed a protocol using deep representation learning features with 107 dimensions. By this protocol, we demonstrated that, instead of the multi-type protein sequence feature representation fusion used in previous state-of-the-art sub-Golgi-protein classifiers, it is sufficient to exploit only one type of feature representation for accurately identifying sub-Golgi proteins. Compared with the independent testing results of previous state-of-the-art predictors on benchmark datasets, our protocol is able to perform generally, reliably and robustly for sub-Golgi protein prediction. A user-friendly webserver is freely accessible at http://isGP-DRLF.aibiochem.net and the prediction code is accessible at https://github.com/zhibinlv/isGP-DRLF. Supplementary data are available at Bioinformatics online.

Language: English

Cited by: 55

Deep-4mCW2V: A sequence-based predictor to identify N4-methylcytosine sites in Escherichia coli DOI
Hasan Zulfiqar, Zi‐Jie Sun, Qin-Lai Huang et al.

Methods, Year: 2021, Volume: 203, pp. 558 - 563

Published: Aug. 3, 2021

Language: English

Cited by: 54

Identification of cyclin protein using gradient boost decision tree algorithm DOI Creative Commons
Hasan Zulfiqar, Shi-Shi Yuan, Qin-Lai Huang et al.

Computational and Structural Biotechnology Journal, Year: 2021, Volume: 19, pp. 4123 - 4131

Published: Jan. 1, 2021

Cyclin proteins are capable of regulating the cell cycle by forming a complex with cyclin-dependent kinases to activate the cell cycle. Correct recognition of cyclin proteins could provide key clues for studying their functions. However, their sequences share low similarity, which results in poor prediction by sequence similarity-based methods. Thus, it is urgent to construct a machine learning model to identify cyclin proteins. This study aimed to develop a computational model to discriminate cyclin proteins from non-cyclin proteins. In our model, protein sequences were encoded by seven kinds of features, namely amino acid composition, composition of k-spaced amino acid pairs, tripeptide composition, pseudo amino acid composition, Geary autocorrelation, normalized Moreau-Broto autocorrelation and composition/transition/distribution. Afterward, these features were optimized using analysis of variance (ANOVA) and minimum redundancy maximum relevance (mRMR) with the incremental feature selection (IFS) technique. A gradient boost decision tree (GBDT) classifier was trained on the optimal features. Five-fold cross-validated results showed that our model could identify cyclins with an accuracy of 93.06% and an AUC value of 0.971, which are higher than those of two recent studies on the same data.

Language: English

Cited by: 53
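
Two of the sequence encodings listed in the cyclin abstract above, amino acid composition and the composition of k-spaced amino acid pairs, can be sketched in Python as below. The abstract gives no parameter values, so the gap size `k = 1`, the alphabetical ordering of the 20 standard residues, the decision to skip non-standard residues, and the function names are all assumptions made for illustration.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def amino_acid_composition(sequence: str) -> List[float]:
    """Fraction of each standard residue in the sequence (20 values)."""
    seq = [r for r in sequence.upper() if r in AMINO_ACIDS]
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def kspaced_pair_composition(sequence: str, k: int = 1) -> List[float]:
    """Frequency of residue pairs separated by k positions (400 values).

    Assumption: pairs containing a non-standard residue are skipped.
    """
    seq = sequence.upper()
    pairs = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    # Position i is paired with position i + k + 1, i.e. k residues in between.
    for i in range(len(seq) - k - 1):
        pair = seq[i] + seq[i + k + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in pairs]
```

Concatenating such vectors gives a fixed-length numeric representation of a variable-length protein, which is the form a feature selection step and a tree-based classifier would consume.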