Abstract
The emergence of self-supervised deep language models has revolutionized natural language processing tasks and has recently extended its applications to biological sequence analysis. Traditional models, primarily based on Transformer architectures, demonstrate substantial effectiveness in various applications. However, these models are inherently constrained by the attention mechanism's quadratic computational complexity, which limits their efficiency and leads to high computational costs. To address these limitations, we introduce ProtHyena, a novel approach that leverages the Hyena operator for protein language modeling. This innovative methodology alternates between subquadratic long convolutions and element-wise gating operations, which circumvents the constraints imposed by attention mechanisms and reduces the computational complexity to subquadratic levels. This enables faster and more memory-efficient modeling of protein sequences. ProtHyena can achieve state-of-the-art results and comparable performance in 8 downstream tasks, including protein engineering (protein fluorescence and stability prediction), protein property prediction (neuropeptide cleavage, signal peptide, solubility, disorder, and gene function prediction), and protein structure prediction, with only 1.6 M parameters. The architecture of ProtHyena represents a highly efficient solution for protein language modeling, offering a promising avenue for fast and efficient analysis of protein sequences.
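The combination of long convolutions and element-wise gating described in the abstract can be illustrated with a small sketch. This is not the ProtHyena implementation: it shows a single gating step on one channel, and the function names and the inputs `values`, `gate`, and `kernel` are hypothetical stand-ins for learned projections and a learned filter. The convolution is evaluated through an FFT, which is the standard way to make a length-L convolution cost O(L log L) rather than O(L^2).

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def causal_long_conv(signal, kernel):
    """Causal convolution y[t] = sum_{s<=t} kernel[s] * signal[t-s] via FFT."""
    length = len(signal)
    if len(kernel) != length:
        raise ValueError("this sketch assumes the kernel spans the whole sequence")
    size = 1
    while size < 2 * length:  # zero-pad so the circular convolution does not wrap
        size *= 2
    padded_signal = list(signal) + [0.0] * (size - length)
    padded_kernel = list(kernel) + [0.0] * (size - length)
    spectrum = [a * b for a, b in zip(fft(padded_signal), fft(padded_kernel))]
    result = fft(spectrum, inverse=True)
    return [(v / size).real for v in result[:length]]


def gated_long_conv(values, gate, kernel):
    """One illustrative step: long convolution followed by element-wise gating."""
    mixed = causal_long_conv(values, kernel)
    return [g * m for g, m in zip(gate, mixed)]
```

For example, `gated_long_conv([1.0, 2.0, 3.0, 4.0], [1.0, 0.0, 1.0, 2.0], [1.0, 0.5, 0.25, 0.125])` mixes each position with all earlier positions and then rescales it by the gate value at that position.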
Symmetry,
Journal year:
2023,
Issue
15(9), P. 1723
Published: Sep. 8, 2023
Deep learning techniques have found applications across diverse fields, enhancing the efficiency and effectiveness of decision-making processes. The integration of these techniques underscores the significance of interdisciplinary research. In particular, decisions often rely on the output's projected value or probability from neural networks, considering different values of the relevant output factor. This review examines the impact of deep learning on decision-making systems, analyzing 25 papers published between 2017 and 2022. The review highlights improved accuracy but emphasizes the need for addressing issues like interpretability and generalizability to build reliable decision support systems. Future research directions include transparency, explainability, and real-world validation, underscoring the importance of interdisciplinary collaboration for successful implementation.
Scientific Reports,
Journal year:
2024,
Issue
14(1)
Published: Feb. 23, 2024
Abstract
The voltage-gated sodium (Nav) channel is a crucial molecular component responsible for initiating and propagating action potentials. While the α subunit, forming the channel pore, plays a central role in this function, the complete physiological function of Nav channels relies on crucial interactions between the α subunit and auxiliary proteins, known as protein–protein interactions (PPI). Nav blocking peptides (NaBPs) have been recognized as a promising and alternative therapeutic agent for pain and itch. Although traditional experimental methods can precisely determine the effect and activity of NaBPs, they remain time-consuming and costly. Hence, machine learning (ML)-based methods that are capable of accurately contributing to the in silico prediction of NaBPs are highly desirable. In this study, we develop an innovative meta-learning-based NaBP prediction method (MetaNaBP). MetaNaBP generates new feature representations by employing a wide range of sequence-based descriptors that cover multiple perspectives, in combination with powerful ML algorithms. Then, these feature representations were optimized to identify informative features using a two-step feature selection method. Finally, the selected informative features were applied to develop the final meta-predictor. To the best of our knowledge, MetaNaBP is the first meta-predictor for NaBP prediction. Experimental results demonstrated that MetaNaBP achieved an accuracy of 0.948 and a Matthews correlation coefficient of 0.898 over the independent test dataset, which were 5.79% and 11.76% higher than the existing method. In addition, the discriminative power of our feature representations surpassed that of conventional feature descriptors over both the training and independent test datasets. We anticipate that MetaNaBP will be exploited for the large-scale prediction and analysis of NaBPs to narrow down the potential NaBPs.
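The two metrics reported above, accuracy and the Matthews correlation coefficient (MCC), are both derived from a binary confusion matrix. The sketch below shows the standard formulas; the function names are illustrative, and any counts passed in are hypothetical rather than values from the MetaNaBP study.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC in [-1, 1]; returns 0.0 when any marginal total is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes are unbalanced.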
BMC Bioinformatics,
Journal year:
2024,
Issue
25(1)
Published: March 16, 2024
Protein language models, inspired by the success of large language models in deciphering human language, have emerged as powerful tools for unraveling the intricate code of life inscribed within protein sequences. They have gained significant attention for their promising applications across various areas, including the sequence-based prediction of secondary and tertiary protein structure, the discovery of new functional protein sequences/folds, and the assessment of mutational impact on protein fitness. However, their utility in learning to predict protein residue properties based on scant datasets, such as protein-protein interaction (PPI)-hotspots whose mutations significantly impair PPIs, remained unclear. Here, we explore the feasibility of using protein language-learned representations as features for machine learning to predict PPI-hotspots using a dataset containing 414 experimentally confirmed PPI-hotspots and 504 PPI-nonhot spots.
This study proposes an advanced machine learning (ML) framework for breast cancer diagnostics by integrating transcriptomic profiling with optimized feature selection and classification techniques. A dataset of 1759 samples (987 breast cancer patients, 772 healthy controls) was analyzed using Recursive Feature Elimination, Boruta, and ElasticNet for feature selection. Dimensionality reduction techniques, including Non-Negative Matrix Factorization (NMF), Autoencoders, and transformer-based embeddings (BioBERT, DNABERT), were applied to enhance model interpretability. Classifiers such as XGBoost, LightGBM, ensemble voting, Multi-Layer Perceptron, and Stacking were trained using grid search and cross-validation. Model evaluation was conducted using accuracy, AUC, MCC, Kappa Score, ROC, and PR curves, with external validation performed on an independent dataset of 175 samples. XGBoost and LightGBM achieved the highest test accuracies (0.91 and 0.90) and AUC values (up to 0.92), particularly with NMF and BioBERT. The ensemble Voting method exhibited the best external accuracy (0.92), confirming its robustness. Transformer-based techniques significantly improved model performance compared to conventional approaches like PCA and Decision Trees. The proposed ML framework enhances diagnostic accuracy and interpretability, demonstrating strong generalizability on an external dataset. These findings highlight its potential for precision oncology and personalized breast cancer diagnostics.
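The ensemble voting step mentioned above can be sketched as a simple majority ("hard") vote over per-model class predictions. Whether the study used hard or soft voting is not stated in the abstract, so hard voting is an assumption here, and the function name and example labels are hypothetical.

```python
from collections import Counter


def hard_vote(predictions_per_model):
    """Return, for each sample, the label predicted by the most models.

    predictions_per_model is a list with one list of labels per model;
    all inner lists must have the same length.
    """
    n_samples = len(predictions_per_model[0])
    voted = []
    for i in range(n_samples):
        labels = [predictions[i] for predictions in predictions_per_model]
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted
```

With an odd number of models and binary labels, as in `hard_vote([[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 1]])`, ties cannot occur and each sample receives the label chosen by at least two of the three models.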
ABSTRACT
We here present a chatbot assistant infrastructure (https://www.ebi.ac.uk/pride/chatbot/) that simplifies user interactions with the PRIDE database's documentation and dataset search functionality. The framework utilizes multiple Large Language Models (LLM): llama2, chatglm, mixtral (mistral), and openhermes. It also includes a web service API (Application Programming Interface), a web interface, and components for indexing and managing vector databases. An Elo-ranking system-based benchmark component is included in the framework as well, which allows for evaluating the performance of each LLM and for improving PRIDE documentation. The chatbot not only allows users to interact with PRIDE documentation but can also be used to search and find PRIDE datasets using an LLM-based recommendation system, enabling dataset discoverability. Importantly, while our infrastructure is exemplified through its application in the PRIDE database context, the modular and adaptable nature of our approach positions it as a valuable tool for improving user experiences across a spectrum of bioinformatics and proteomics tools and resources, among other domains. The integration of advanced LLMs, innovative vector-based construction, the benchmarking framework, and optimized documentation collectively form a robust and transferable chatbot infrastructure. The framework is open-source (https://github.com/PRIDE-Archive/pride-chatbot).
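An Elo-based benchmark ranks models from pairwise comparisons of their answers. The sketch below shows the standard Elo update rule, not the PRIDE chatbot's own code; the function names, the K-factor of 32, and the starting ratings are assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was,
    and 0.5 for a tie. The k value is an assumed default.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Because the same amount is added to one rating and subtracted from the other, the total rating across models stays constant, and an unexpected win against a higher-rated model moves both ratings more than an expected one.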
bioRxiv (Cold Spring Harbor Laboratory),
Journal year:
2023,
Issue
unknown
Published: Oct. 21, 2023
Abstract
Large Language Models (LLMs) have garnered significant recognition in the life sciences for their capacity to comprehend and utilize knowledge. The contemporary expectation in diverse industries extends beyond employing LLMs merely as chatbots; instead, there is a growing emphasis on harnessing their potential as adept analysts proficient in dissecting intricate issues within these sectors. The realm of bioinformatics is no exception to this trend. In this paper, we introduce Bioinfo-Bench, a novel yet straightforward benchmark framework suite crafted to assess the academic knowledge and data mining capabilities of foundational models in bioinformatics. Bioinfo-Bench systematically gathered data from three distinct perspectives: knowledge acquisition, knowledge analysis, and knowledge application, facilitating a comprehensive examination of LLMs. Our evaluation encompassed prominent models: ChatGPT, Llama, and Galactica. The findings revealed that these LLMs excel in knowledge acquisition, drawing heavily upon their training data for retention. However, their proficiency in addressing practical professional queries and conducting nuanced knowledge inference remains constrained. Given these insights, we are poised to delve deeper into this domain, engaging in further extensive research and discourse. It is pertinent to note that project Bioinfo-Bench is currently in progress, and all associated materials will be made publicly accessible.