Computational and Structural Biotechnology Journal, Year: 2022, Volume: 20, pp. 662-674
Published: Jan. 1, 2022
The Mycobacterium tuberculosis genome comprises approximately 10% of two families of poorly characterised genes due to their high GC content and highly repetitive nature. The largest sub-group, the proline-glutamic acid polymorphic guanine-cytosine-rich sequence (PE_PGRS) family, is thought to be involved in host response and disease pathogenicity. Due to their high genetic variability and complexity of analysis, they are typically disregarded for further research in genomic studies. There are currently limited online resources and homology computational tools that can identify and analyse PE_PGRS proteins. In addition, they are computationally intensive and time-consuming, and lack sensitivity. Therefore, computational methods that can rapidly and accurately identify PE_PGRS proteins are valuable to facilitate the functional elucidation of the PE_PGRS family proteins. In this study, we developed the first machine learning-based bioinformatics approach, termed PEPPER, to allow users to identify PE_PGRS proteins rapidly and accurately. PEPPER was built upon a comprehensive evaluation of 13 popular machine learning algorithms with various sequence and physicochemical features. Empirical studies demonstrated that PEPPER achieved significantly better performance than the alignment-based approaches BLASTP and PHMMER in both prediction accuracy and speed. PEPPER is anticipated to facilitate community-wide efforts to conduct high-throughput identification and analysis of PE_PGRS proteins.
Briefings in Bioinformatics, Year: 2019, Volume: 21(4), pp. 1119-1135
Published: April 6, 2019
Human leukocyte antigen class I (HLA-I) molecules are encoded by major histocompatibility complex (MHC) class I loci in humans. The binding and interaction between HLA-I molecules and intracellular peptides derived from a variety of proteolytic mechanisms play a crucial role in subsequent T-cell recognition of target cells and the specificity of the immune response. In this context, tools that predict the likelihood for a peptide to bind to specific HLA class I allotypes are important for selecting the most promising antigenic targets for immunotherapy. In this article, we comprehensively review a variety of currently available tools for predicting the binding of peptides to a selection of HLA-I allomorphs. Specifically, we compare their calculation methods for the prediction score, employed algorithms, evaluation strategies and software functionalities. In addition, we have evaluated the prediction performance of the reviewed tools based on an independent validation data set, containing 21,101 experimentally verified ligands across 19 HLA-I allotypes. The benchmarking results show that MixMHCpred 2.0.1 achieves the best performance for predicting peptides binding to most of the HLA-I allomorphs studied, while NetMHCpan 4.0 and NetMHCcons 1.1 outperform the other machine learning-based and consensus-based tools, respectively. Importantly, it should be noted that a peptide predicted with a higher binding score for a specific HLA allotype does not necessarily imply it will be immunogenic. That said, peptide-binding predictors are still very useful in that they can help to significantly reduce the large number of epitope candidates that need to be experimentally verified. Several other factors, including susceptibility to proteasome cleavage, peptide transport into the endoplasmic reticulum and T-cell receptor repertoire, also contribute to the immunogenicity of peptide antigens, and some of them are considered by some predictors. Therefore, integrating features derived from these additional factors together with HLA-binding properties by using machine-learning algorithms may increase the prediction accuracy of immunogenic peptides. As such, we anticipate that this review and benchmarking survey can assist researchers in selecting appropriate prediction tools that best suit their purposes and provide useful guidelines for the development of improved predictors in the future.
Molecular Therapy - Nucleic Acids, Year: 2020, Volume: 22, pp. 362-372
Published: Aug. 25, 2020
Recent studies have increasingly shown that the chemical modification of mRNA plays an important role in the regulation of gene expression. N7-methylguanosine (m7G) is a type of positively-charged mRNA modification that is essential for efficient gene expression and cell viability. However, research on m7G has received little attention to date. Bioinformatics tools can be applied as auxiliary methods to identify m7G sites in transcriptomes. In this study, we develop a novel interpretable machine learning-based approach termed XG-m7G for the differentiation of m7G sites using the XGBoost algorithm and six different types of sequence-encoding schemes. Both 10-fold and jackknife cross-validation tests indicate that XG-m7G outperforms iRNA-m7G. Moreover, using the powerful SHAP algorithm, this new framework also provides desirable interpretations of the model performance and highlights the most important features for identifying m7G sites. XG-m7G is anticipated to serve as a useful tool and guide for researchers in their future studies.
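The abstract above reports both 10-fold and jackknife cross-validation. The following is a minimal sketch of how the two index-splitting schemes relate, assuming jackknife is used in the leave-one-out sense. The function names are hypothetical and are not taken from XG-m7G; in practice the sample order would normally be shuffled before splitting.

```python
from typing import Iterator, List, Tuple


def k_fold_splits(n_samples: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first `extra` folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def jackknife_splits(n_samples: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Jackknife (leave-one-out): every sample forms the test set exactly once."""
    return k_fold_splits(n_samples, n_samples)
```

Under this reading, the jackknife test is simply the limiting case of k-fold cross-validation in which k equals the number of samples, which is why it is more expensive but leaves no choice in how the folds are drawn.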
Briefings in Bioinformatics, Year: 2020, Volume: 22(3)
Published: May 20, 2020
Abstract
DNA N4-methylcytosine (4mC) is an important epigenetic modification that plays a vital role in regulating DNA replication and expression. However, it is challenging to detect 4mC sites through experimental methods, which are time-consuming and costly. Thus, computational tools that can identify 4mC sites would be very useful for understanding the mechanism of this important type of modification. Several machine learning-based 4mC predictors have been proposed in the past 3 years, although their performance is unsatisfactory. Deep learning is a promising technique for the development of more accurate 4mC site predictions. In this work, we propose a deep learning-based approach, called DeepTorrent, for improved prediction of 4mC sites from DNA sequences. It combines four different feature encoding schemes to encode raw DNA sequences and employs multi-layer convolutional neural networks with an inception module integrated with bidirectional long short-term memory to effectively learn the higher-order feature representations. Dimension reduction and concatenated feature maps from the filters of different sizes are then applied to the inception module. In addition, an attention mechanism and transfer learning techniques are also employed to train the robust predictor. Extensive benchmarking experiments demonstrate that DeepTorrent significantly improves the performance of 4mC site prediction compared with several state-of-the-art methods.
Genomics Proteomics & Bioinformatics, Year: 2020, Volume: 18(1), pp. 52-64
Published: Feb. 1, 2020
Abstract
Proteases are enzymes that cleave and hydrolyse the peptide bonds between two specific amino acid residues of target substrate proteins. Protease-controlled proteolysis plays a key role in the degradation and recycling of proteins, which is essential for various physiological processes. Thus, solving the substrate identification problem will have important implications for the precise understanding of the functions and physiological roles of proteases, as well as for therapeutic target identification and pharmaceutical applicability. Consequently, there is a great demand for bioinformatics methods that can predict novel substrate cleavage events with high accuracy by utilizing both sequence and structural information. In this study, we present Procleave, a novel bioinformatics approach for predicting protease-specific substrates and specific cleavage sites by taking into account both their sequence and 3D structural information. Structural features of known cleavage sites were represented by discrete values using a LOWESS data-smoothing optimization method, which turned out to be critical for the performance of Procleave. The optimal approximations of all structural parameter values were encoded in a conditional random field (CRF) computational framework, alongside sequence and chemical group-based features. Here, we demonstrate the outstanding performance of Procleave through extensive benchmarking and independent tests. Procleave is capable of correctly identifying most cleavage sites in the case study. Importantly, when applied to the human proteome encompassing 17,628 protein structures, Procleave suggests a number of potential novel target substrates and their corresponding cleavage sites of different proteases. Procleave is implemented as a webserver and is freely accessible at http://procleave.erc.monash.edu/.
Briefings in Bioinformatics, Year: 2020, Volume: 22(4)
Published: Oct. 19, 2020
Anti-cancer peptides (ACPs) are known as potential therapeutics for cancer. Due to their unique ability to target cancer cells without affecting healthy cells directly, they have been extensively studied. Many peptide-based drugs are currently being evaluated in preclinical and clinical trials. Accurate identification of ACPs has received considerable attention in recent years; as such, a large number of machine learning-based methods for in silico identification of ACPs have been developed. These methods promote the research on the mechanism of ACPs against cancer to some extent. There is a vast difference among these methods in terms of their training/testing datasets, machine learning algorithms, feature encoding schemes, feature selection methods and evaluation strategies used. Therefore, it is desirable to summarize the advantages and disadvantages of the existing methods, and to provide useful insights and suggestions for the development and improvement of novel computational tools to characterize and identify ACPs. With this in mind, we firstly comprehensively investigate 16 state-of-the-art predictors for ACPs in terms of their core algorithms, performance metrics and webserver/software usability. Then, a comprehensive performance assessment is conducted to evaluate the robustness and scalability of the existing predictors using a well-prepared benchmark dataset. We provide suggestions for model performance improvement. Moreover, we propose a novel ensemble framework, termed ACPredStackL, for the accurate identification of ACPs. ACPredStackL is developed based on the stacking ensemble strategy combined with SVM, Naïve Bayesian, lightGBM and KNN. Empirical benchmarking experiments against the state-of-the-art methods demonstrate that ACPredStackL achieves a comparative performance for predicting ACPs. The webserver and source code of ACPredStackL are freely available at http://bigdata.biocie.cn/ACPredStackL/ and https://github.com/liangxiaoq/ACPredStackL, respectively.
Briefings in Bioinformatics, Year: 2021, Volume: 23(1)
Published: Oct. 8, 2021
Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be potentially mislabeled due to the limited sensitivity of the experimental equipment. The positive unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive or negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications to address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and further developing next-generation PU learning frameworks for critical biological applications.
Glycobiology, Year: 2023, Volume: 33(5), pp. 411-422
Published: April 17, 2023
Abstract
Protein N-linked glycosylation is an important post-translational mechanism in Homo sapiens, playing essential roles in many vital biological processes. It occurs at the N-X-[S/T] sequon in amino acid sequences, where X can be any amino acid except proline. However, not all N-X-[S/T] sequons are glycosylated; thus, the N-X-[S/T] sequon is a necessary but not sufficient determinant for protein glycosylation. In this regard, computational prediction of N-linked glycosylation sites confined to N-X-[S/T] sequons is an important problem that has not been extensively addressed by the existing methods, especially in regard to the creation of negative sets and leveraging the distilled information from protein language models (pLMs). Here, we developed LMNglyPred, a deep learning-based approach, to predict N-linked glycosylated sites in human proteins using embeddings from a pre-trained pLM. LMNglyPred produces sensitivity, specificity, precision, and accuracy of 76.50%, 75.36%, 60.99%, and 75.74%, respectively, with a Matthews Correlation Coefficient of 0.49, on a benchmark-independent test set. These results demonstrate that LMNglyPred is a robust computational tool to predict N-linked glycosylation sites confined to the N-X-[S/T] sequon.
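The N-X-[S/T] rule described above (X being any residue except proline) can be illustrated with a short pattern search. This is a minimal sketch for locating candidate sequons only; it is not part of LMNglyPred, and the function name find_sequons is a hypothetical helper. As the abstract notes, a sequon is necessary but not sufficient, so the positions returned are candidates rather than predicted glycosylation sites.

```python
import re
from typing import List

# N, then any character other than P, then S or T.
# The zero-width lookahead allows overlapping sequons (e.g. in "NNSS") to be reported.
# Note: [^P] also accepts non-standard letters; input validation is left out of this sketch.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(sequence: str) -> List[int]:
    """Return 1-based positions of the asparagine in every N-X-[S/T] sequon."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence.upper())]
```

For example, in the toy sequence "MNGTANPSNNSS" the asparagines at positions 2, 9 and 10 start a sequon, while the one at position 6 does not because it is followed by proline.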
Abstract
By reducing the amino acid alphabet, the protein complexity can be significantly simplified, which could improve computational efficiency, decrease information redundancy and reduce the chance of overfitting. Although some reduced alphabets have been proposed, different classification rules could produce distinctive results for protein sequence analysis. Thus, it is urgent to construct a systematic framework for reduced alphabets. In this work, we constructed a comprehensive web server called RAACBook for protein sequence analysis and machine learning application by integrating reduction alphabets. The web server contains three parts: (i) 74 types of reduced amino acid alphabet were manually extracted to generate 673 reduced amino acid clusters (RAACs) for dealing with unique protein problems. It is easy for users to select desired RAACs from a multilayer browser tool. (ii) An online tool was developed to analyze the primary sequence of a protein. The K-tuple composition is calculated by defining correlation parameters (K-tuple, g-gap, λ-correlation). Results are visualized as sequence alignment, mergence of RAA composition, feature distribution and logo of the reduced sequence. (iii) The machine learning server is provided to train the model of protein classification based on the K-tuple composition of RAAC. The optimal model could be selected according to the evaluation indexes (ROC, AUC, MCC, etc.). In conclusion, RAACBook presents a powerful and user-friendly service in protein sequence analysis and computational proteomics. RAACBook is freely available at http://bioinfor.imu.edu.cn/raacbook.
Database URL: http://bioinfor.imu.edu.cn/raacbook
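The idea of mapping a protein onto a reduced amino acid alphabet and then computing a K-tuple composition can be sketched as follows. This is an illustrative example under stated assumptions: the six-cluster scheme below is hypothetical and is not one of the 673 RAACs in RAACBook, the function names are invented for this sketch, and only contiguous K-tuples are counted (the g-gap and λ-correlation parameters are not modelled).

```python
from collections import Counter
from typing import Dict, Sequence

# Hypothetical 6-cluster reduction covering the 20 standard residues.
# Each cluster is represented by its first letter.
EXAMPLE_CLUSTERS = ("AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP")


def reduce_sequence(sequence: str, clusters: Sequence[str] = EXAMPLE_CLUSTERS) -> str:
    """Replace every residue by the representative letter of its cluster."""
    mapping = {residue: cluster[0] for cluster in clusters for residue in cluster}
    # A residue outside the clusters raises KeyError, which flags unexpected input.
    return "".join(mapping[residue] for residue in sequence.upper())


def k_tuple_composition(sequence: str, k: int = 2) -> Dict[str, float]:
    """Relative frequency of each contiguous k-tuple observed in the sequence."""
    if k < 1 or len(sequence) < k:
        raise ValueError("k must be at least 1 and no longer than the sequence")
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {k_tuple: count / total for k_tuple, count in counts.items()}
```

With this scheme, a dipeptide composition has at most 6 x 6 = 36 possible features instead of 20 x 20 = 400, which is the kind of dimensionality reduction the abstract associates with lower redundancy and a reduced chance of overfitting.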