Computational analysis and prediction of PE_PGRS proteins using machine learning
Fuyi Li, Xudong Guo, Dongxu Xiang, et al.

Computational and Structural Biotechnology Journal, Year: 2022, Volume: 20, Pages: 662-674

Published: Jan. 1, 2022

The Mycobacterium tuberculosis genome comprises approximately 10% of two families of poorly characterised genes due to their high GC content and highly repetitive nature. The largest sub-group, the proline-glutamic acid polymorphic guanine-cytosine-rich sequence (PE_PGRS) family, is thought to be involved in host response and disease pathogenicity. Due to their high genetic variability and the complexity of their analysis, they are typically disregarded for further research in genomic studies. There are currently limited online resources and homology-based computational tools that can identify and analyse PE_PGRS proteins. In addition, these tools are computationally intensive and time-consuming, and lack sensitivity. Therefore, computational methods that can rapidly and accurately identify PE_PGRS proteins are valuable to facilitate the functional elucidation of the PE_PGRS family. In this study, we developed the first machine learning-based bioinformatics approach, termed PEPPER, to allow users to identify PE_PGRS proteins rapidly and accurately. PEPPER was built upon a comprehensive evaluation of 13 popular machine learning algorithms with various physicochemical features. Empirical studies demonstrated that PEPPER achieved significantly better performance than the alignment-based approaches BLASTP and PHMMER in both prediction accuracy and speed. PEPPER is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE_PGRS proteins.

Language: English

A comprehensive review and performance evaluation of bioinformatics tools for HLA class I peptide-binding prediction
Shutao Mei, Fuyi Li, André Leier, et al.

Briefings in Bioinformatics, Year: 2019, Volume: 21(4), Pages: 1119-1135

Published: April 6, 2019

Human leukocyte antigen class I (HLA-I) molecules are encoded by major histocompatibility complex (MHC) class I loci in humans. The binding and interaction between HLA-I molecules and intracellular peptides derived from a variety of proteolytic mechanisms play a crucial role in subsequent T-cell recognition of target cells and the specificity of the immune response. In this context, tools that predict the likelihood for a peptide to bind to specific HLA class I allotypes are important for selecting the most promising antigenic targets for immunotherapy. In this article, we comprehensively review a variety of currently available tools for predicting the binding of peptides to a selection of HLA-I allomorphs. Specifically, we compare their calculation methods for the prediction score, employed algorithms, evaluation strategies and software functionalities. In addition, we have evaluated the prediction performance of the reviewed tools based on an independent validation data set, containing 21 101 experimentally verified ligands across 19 HLA-I allotypes. The benchmarking results show that MixMHCpred 2.0.1 achieves the best performance for predicting peptides binding to most of the HLA-I allomorphs studied, while NetMHCpan 4.0 and NetMHCcons 1.1 outperform the other machine learning-based and consensus-based tools, respectively. Importantly, it should be noted that a peptide predicted with a higher binding score for a specific HLA allotype will not necessarily be immunogenic. That said, peptide-binding predictors are still very useful in that they can help to significantly reduce the large number of epitope candidates that need to be experimentally verified. Several other factors, including susceptibility to proteasome cleavage, peptide transport into the endoplasmic reticulum and the T-cell receptor repertoire, also contribute to the immunogenicity of peptide antigens, and some of them can be considered by some predictors. Therefore, integrating features derived from these additional factors together with HLA-binding properties by using machine-learning algorithms may increase the prediction accuracy for immunogenic peptides. As such, we anticipate that this review and benchmarking survey can assist researchers in selecting appropriate prediction tools that best suit their purposes and provide useful guidelines for the development of improved predictors in the future.

Language: English

Cited by: 153

An Interpretable Prediction Model for Identifying N7-Methylguanosine Sites Based on XGBoost and SHAP
Yue Bi, Dongxu Xiang, Zongyuan Ge, et al.

Molecular Therapy - Nucleic Acids, Year: 2020, Volume: 22, Pages: 362-372

Published: Aug. 25, 2020

Recent studies have increasingly shown that the chemical modification of mRNA plays an important role in the regulation of gene expression. N7-methylguanosine (m7G) is a type of positively-charged mRNA modification that is essential for efficient gene expression and cell viability. However, research on m7G has received little attention to date. Bioinformatics tools can be applied as auxiliary methods to identify m7G sites in transcriptomes. In this study, we develop a novel interpretable machine learning-based approach, termed XG-m7G, for the differentiation of m7G sites using the XGBoost algorithm and six different types of sequence-encoding schemes. Both 10-fold and jackknife cross-validation tests indicate that XG-m7G outperforms iRNA-m7G. Moreover, using the powerful SHAP algorithm, this new framework also provides desirable interpretations of the model performance and highlights the most important features for identifying m7G sites. XG-m7G is anticipated to serve as a useful tool and guide for researchers in their future studies.
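The two validation schemes named in this abstract can be sketched in a few lines. This is only an illustration of how the index splits are formed, not the authors' pipeline: the function names `k_fold_indices` and `jackknife_indices` are hypothetical, the folds are contiguous (real pipelines usually shuffle first), and "jackknife" is taken in the sense common in this literature, i.e. leave-one-out cross-validation.

```python
from typing import Iterator, List, Tuple

Split = Tuple[List[int], List[int]]  # (training indices, test indices)


def k_fold_indices(n_samples: int, k: int) -> Iterator[Split]:
    """Yield k (train, test) splits using contiguous folds of near-equal size."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    # The first (n_samples % k) folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def jackknife_indices(n_samples: int) -> Iterator[Split]:
    """Leave-one-out splits: k-fold with one sample per fold (assumed meaning of 'jackknife')."""
    return k_fold_indices(n_samples, n_samples)


if __name__ == "__main__":
    for train, test in k_fold_indices(25, 10):
        print(len(train), test)
```

Each sample appears in exactly one test fold, so averaging a metric over the folds uses every sample once for evaluation.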

Language: English

Cited by: 140

DeepTorrent: a deep learning-based approach for predicting DNA N4-methylcytosine sites
Quanzhong Liu, Jin-Xiang Chen, Yanze Wang, et al.

Briefings in Bioinformatics, Year: 2020, Volume: 22(3)

Published: May 20, 2020

DNA N4-methylcytosine (4mC) is an important epigenetic modification that plays a vital role in regulating DNA replication and expression. However, it is challenging to detect 4mC sites through experimental methods, which are time-consuming and costly. Thus, computational tools that can identify 4mC sites would be very useful for understanding the mechanism of this important type of DNA modification. Several machine learning-based 4mC predictors have been proposed in the past 3 years, although their performance is unsatisfactory. Deep learning is a promising technique for the development of more accurate 4mC site predictions. In this work, we propose a deep learning-based approach, called DeepTorrent, for improved prediction of 4mC sites from DNA sequences. It combines four different feature encoding schemes to encode raw DNA sequences and employs multi-layer convolutional neural networks with an inception module integrated with bidirectional long short-term memory to effectively learn the higher-order feature representations. Dimension reduction and concatenated feature maps from the filters of different sizes are then applied to the inception module. In addition, an attention mechanism and transfer learning techniques are also employed to train the robust predictor. Extensive benchmarking experiments demonstrate that DeepTorrent significantly improves the performance of 4mC site prediction compared with several state-of-the-art methods.
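The abstract does not name its four encoding schemes, so the following is only a sketch of one widely used way to turn a raw DNA sequence into a numeric input for a convolutional network: one-hot encoding. Whether DeepTorrent uses this exact scheme is an assumption, and the name `one_hot_encode` is hypothetical.

```python
from typing import List, Tuple

# One channel per nucleotide; ambiguous bases such as N map to all zeros.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def one_hot_encode(sequence: str) -> List[Tuple[int, int, int, int]]:
    """Encode a DNA string as a length-L list of 4-dimensional indicator vectors."""
    return [ONE_HOT.get(base, (0, 0, 0, 0)) for base in sequence.upper()]


if __name__ == "__main__":
    for base, vector in zip("ACGTN", one_hot_encode("ACGTN")):
        print(base, vector)
```

The resulting L x 4 matrix is the usual shape fed to a one-dimensional convolution, with the four nucleotide indicators acting as input channels.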

Language: English

Cited by: 113

Procleave: Predicting Protease-Specific Substrate Cleavage Sites by Combining Sequence and Structural Information
Fuyi Li, André Leier, Quanzhong Liu, et al.

Genomics Proteomics & Bioinformatics, Year: 2020, Volume: 18(1), Pages: 52-64

Published: Feb. 1, 2020

Proteases are enzymes that cleave and hydrolyse the peptide bonds between two specific amino acid residues of target substrate proteins. Protease-controlled proteolysis plays a key role in the degradation and recycling of proteins, which is essential for various physiological processes. Thus, solving the substrate identification problem will have important implications for the precise understanding of the functions and physiological roles of proteases, as well as for therapeutic target identification and pharmaceutical applicability. Consequently, there is a great demand for bioinformatics methods that can predict novel substrate cleavage events with high accuracy by utilizing both sequence and structural information. In this study, we present Procleave, a novel bioinformatics approach for predicting protease-specific substrates and specific cleavage sites by taking into account both their sequence and 3D structural information. Structural features of known cleavage sites were represented by discrete values using a LOWESS data-smoothing optimization method, which turned out to be critical for the performance of Procleave. The optimal approximations of all structural parameter values were encoded in a conditional random field (CRF) computational framework, alongside sequence and chemical group-based features. Here, we demonstrate the outstanding performance of Procleave through extensive benchmarking and independent tests. Procleave is capable of correctly identifying most cleavage sites in the case study. Importantly, when applied to the human structural proteome encompassing 17,628 protein structures, Procleave suggests a number of potential novel target substrates and their corresponding cleavage sites for different proteases. Procleave is implemented as a webserver and is freely accessible at http://procleave.erc.monash.edu/.

Language: English

Cited by: 98

Prediction of lysine formylation sites using the composition of k-spaced amino acid pairs via Chou's 5-steps rule and general pseudo components
Zhe Ju, Shiyun Wang

Genomics, Year: 2019, Volume: 112(1), Pages: 859-866

Published: June 6, 2019

Language: English

Cited by: 85

Progresses in Predicting Post-translational Modification
Kuo‐Chen Chou

International Journal of Peptide Research and Therapeutics, Year: 2019, Volume: 26(2), Pages: 873-888

Published: July 12, 2019

Language: English

Cited by: 82

Large-scale comparative review and assessment of computational methods for anti-cancer peptide identification
Xiao Liang, Fuyi Li, Jin-Xiang Chen, et al.

Briefings in Bioinformatics, Year: 2020, Volume: 22(4)

Published: Oct. 19, 2020

Anti-cancer peptides (ACPs) are known as potential therapeutics for cancer. Due to their unique ability to target cancer cells without affecting healthy cells directly, they have been extensively studied. Many peptide-based drugs are currently being evaluated in preclinical and clinical trials. Accurate identification of ACPs has received considerable attention in recent years; as such, a number of machine learning-based methods for in silico identification of ACPs have been developed. These methods promote research on the mechanism of ACP therapeutics against cancer to some extent. There is a vast difference between these methods in terms of their training/testing datasets, machine learning algorithms, feature encoding schemes, feature selection methods and evaluation strategies used. Therefore, it is desirable to summarize the advantages and disadvantages of the existing methods, and to provide useful insights and suggestions for the development and improvement of novel computational tools to characterize and identify ACPs. With this in mind, we firstly comprehensively investigate 16 state-of-the-art predictors for ACPs in terms of their core algorithms, feature encoding schemes, performance evaluation metrics and webserver/software usability. Then, a comprehensive performance assessment is conducted to evaluate the robustness and scalability of the existing predictors using a well-prepared benchmark dataset. We provide potential strategies for model performance improvement. Moreover, we propose a novel ensemble learning framework, termed ACPredStackL, for the accurate identification of ACPs. ACPredStackL is developed based on the stacking ensemble strategy combined with SVM, Naïve Bayesian, lightGBM and KNN. Empirical benchmarking experiments against the state-of-the-art methods demonstrate that ACPredStackL achieves a comparative performance for predicting ACPs. The webserver and source code of ACPredStackL are freely available at http://bigdata.biocie.cn/ACPredStackL/ and https://github.com/liangxiaoq/ACPredStackL, respectively.

Language: English

Cited by: 70

Positive-unlabeled learning in bioinformatics and computational biology: a brief review
Fuyi Li, Shuangyu Dong, André Leier, et al.

Briefings in Bioinformatics, Year: 2021, Volume: 23(1)

Published: Oct. 8, 2021

Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be potentially mislabeled due to the limited sensitivity of the experimental equipment. The positive unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive or negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications to address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and for further developing next-generation PU learning frameworks for critical biological applications.

Language: English

Cited by: 56

LMNglyPred: prediction of human N-linked glycosylation sites using embeddings from a pre-trained protein language model
Subash C. Pakhrin, Suresh Pokharel, Kiyoko F. Aoki‐Kinoshita, et al.

Glycobiology, Year: 2023, Volume: 33(5), Pages: 411-422

Published: April 17, 2023

Protein N-linked glycosylation is an important post-translational mechanism in Homo sapiens, playing essential roles in many vital biological processes. It occurs at the N-X-[S/T] sequon in amino acid sequences, where X can be any amino acid except proline. However, not all N-X-[S/T] sequons are glycosylated; thus, the N-X-[S/T] sequon is a necessary but not sufficient determinant for protein glycosylation. In this regard, computational prediction of N-linked glycosylation sites confined to N-X-[S/T] sequons is an important problem that has not been extensively addressed by the existing methods, especially in regard to the creation of negative sets and leveraging the distilled information from protein language models (pLMs). Here, we developed LMNglyPred, a deep learning-based approach, to predict N-linked glycosylated sites in human proteins using embeddings from a pre-trained pLM. On a benchmark-independent test set, LMNglyPred achieves a sensitivity, specificity, precision and accuracy of 76.50%, 75.36%, 60.99% and 75.74%, respectively, and a Matthews Correlation Coefficient of 0.49. These results demonstrate that LMNglyPred is a robust computational tool to predict N-linked glycosylation sites confined to the N-X-[S/T] sequon.
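The sequon rule stated in this abstract (N, then any residue except proline, then S or T) translates directly into a regular expression. The sketch below only locates candidate sequons, which the abstract describes as necessary but not sufficient for glycosylation; it does not reproduce LMNglyPred itself. The function name `find_sequons` and the example peptide are hypothetical.

```python
import re
from typing import List

# N-X-[S/T] with X != P. The zero-width lookahead lets overlapping
# sequons (for example the two in "NNSS") be reported separately.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence: str) -> List[int]:
    """Return the 1-based position of the asparagine in every N-X-[S/T] sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]


if __name__ == "__main__":
    # N at position 6 is followed by proline, so it is not reported.
    print(find_sequons("MNGTANPSQNNSS"))  # [2, 10, 11]
```

A predictor of the kind described would then score each returned position, since only some of these candidate sites are actually glycosylated.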

Language: English

Cited by: 24

RAACBook: a web server of reduced amino acid alphabet for sequence-dependent inference by using Chou’s five-step rule
Lei Zheng, Shenghui Huang, Nengjiang Mu, et al.

Database, Year: 2019, Volume: 2019

Published: Jan. 1, 2019

By reducing the amino acid alphabet, protein complexity can be significantly simplified, which could improve computational efficiency, decrease information redundancy and reduce the chance of overfitting. Although some reduced alphabets have been proposed, different classification rules could produce distinctive results for protein sequence analysis. Thus, it is urgent to construct a systematic framework for reduced alphabets. In this work, we constructed a comprehensive web server called RAACBook for protein sequence analysis and machine learning application by integrating reduced alphabets. The web server contains three parts: (i) 74 types of reduced amino acid alphabet were manually extracted to generate 673 reduced amino acid clusters (RAACs) for dealing with unique protein problems. It is easy for users to select desired RAACs from a multilayer browser tool. (ii) An online tool was developed to analyze the primary sequence of a protein. The tool can produce the K-tuple reduced amino acid composition by defining three correlation parameters (K-tuple, g-gap, λ-correlation). The results are visualized as sequence alignment, mergence of RAA composition, feature distribution and a logo of the reduced sequence. (iii) A machine learning server is provided to train a protein classification model based on K-tuple RAAC. The optimal model can be selected according to the evaluation indexes (ROC, AUC, MCC, etc.). In conclusion, RAACBook presents a powerful and user-friendly service for protein sequence analysis and computational proteomics. RAACBook is freely available at http://bioinfor.imu.edu.cn/raacbook. Database URL: http://bioinfor.imu.edu.cn/raacbook
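To make the idea of a reduced alphabet and a K-tuple composition concrete, here is a minimal sketch. The six-cluster grouping in `CLUSTERS` is a hypothetical example chosen only for illustration and is not one of the 673 RAACs from the server; the reading of "g-gap" as the number of residues skipped between tuple members is also an assumption, and λ-correlation is not covered.

```python
from collections import Counter
from itertools import product
from typing import Dict

# Hypothetical clustering of the 20 standard amino acids (illustration only).
CLUSTERS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
# Each residue is replaced by the first letter of its cluster.
RESIDUE_TO_CLUSTER = {aa: cluster[0] for cluster in CLUSTERS for aa in cluster}
LETTERS = [cluster[0] for cluster in CLUSTERS]


def reduce_sequence(sequence: str) -> str:
    """Rewrite a protein sequence in the reduced alphabet (KeyError on non-standard residues)."""
    return "".join(RESIDUE_TO_CLUSTER[aa] for aa in sequence.upper())


def ktuple_composition(sequence: str, k: int = 2, gap: int = 0) -> Dict[str, float]:
    """Frequency of every reduced K-tuple whose members are separated by `gap` residues."""
    reduced = reduce_sequence(sequence)
    step = gap + 1
    span = (k - 1) * step + 1  # residues covered by one gapped tuple
    n_windows = max(len(reduced) - span + 1, 0)
    counts = Counter(reduced[i:i + span:step] for i in range(n_windows))
    return {
        "".join(t): (counts["".join(t)] / n_windows if n_windows else 0.0)
        for t in product(LETTERS, repeat=k)
    }


if __name__ == "__main__":
    print(reduce_sequence("VLKR"))  # AAKK
    features = ktuple_composition("VLKR", k=2, gap=0)
    print({name: value for name, value in features.items() if value > 0})
```

With six clusters a 2-tuple vector has 36 entries instead of the 400 needed for the full 20-letter alphabet, which is the dimensionality reduction the abstract refers to.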

Language: English

Cited by: 69