ACM Computing Surveys, Journal Year: 2023, Volume and Issue: 56(3), P. 1 - 52, Published: Aug. 1, 2023
Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing tasks. This also benefits the biomedical domain: researchers from the informatics, medicine, and computer science communities propose various PLMs trained on various datasets, e.g., biomedical text, electronic health records, protein, and DNA sequences. However, the cross-discipline characteristics of biomedical PLMs hinder their spreading among communities; some existing works are isolated from each other without comprehensive comparison and discussion. It is nontrivial to make a survey that not only systematically reviews recent advances in biomedical PLMs and their applications but also standardizes terminology and benchmarks. This article summarizes the recent progress of pre-trained language models in the biomedical domain and their applications in downstream biomedical tasks. Particularly, we discuss the motivations for PLMs in the biomedical domain and introduce the key concepts of pre-trained language models. We then propose a taxonomy that categorizes existing biomedical PLMs from various perspectives systematically. Plus, their applications in downstream biomedical tasks are exhaustively discussed. Last, we illustrate various limitations and future trends, which we hope can provide inspiration for future research.
Patterns, Journal Year: 2020, Volume and Issue: 1(9), P. 100142 - 100142, Published: Nov. 12, 2020
Deep learning is catalyzing a scientific revolution fueled by big data, accessible toolkits, and powerful computational resources, impacting many fields, including protein structural modeling. Protein structural modeling, such as predicting structure from amino acid sequence and evolutionary information, designing proteins toward desirable functionality, or predicting properties or behavior of a protein, is critical to understand and engineer biological systems at the molecular level. In this review, we summarize recent advances in applying deep learning techniques to tackle problems in protein structural modeling and design. We dissect the emerging approaches that use deep learning techniques for protein structural modeling and discuss advances and challenges that must be addressed. We argue for the central importance of structure, following the "sequence → structure → function" paradigm. This review is directed to help both biologists gain familiarity with the deep learning methods applied in protein modeling and computer scientists gain perspective on the biologically meaningful problems that may benefit from deep learning techniques.
Nucleic Acids Research, Journal Year: 2021, Volume and Issue: 50(D1), P. D693 - D700, Published: Nov. 9, 2021
Abstract
Rhea (https://www.rhea-db.org) is an expert-curated knowledgebase of biochemical reactions based on the chemical ontology ChEBI (Chemical Entities of Biological Interest) (https://www.ebi.ac.uk/chebi). In this paper, we describe a number of key developments in Rhea since our last report in the database issue of Nucleic Acids Research in 2019. These include improved reaction coverage in Rhea, the adoption of Rhea as the reference vocabulary for enzyme annotation in the UniProt knowledgebase UniProtKB (https://www.uniprot.org), the development of a new Rhea website, and the designation of Rhea as an ELIXIR Core Data Resource. We hope that these and other developments will enhance the utility of Rhea as a reference resource to study and engineer enzymes and the metabolic systems in which they function.
Briefings in Bioinformatics, Journal Year: 2021, Volume and Issue: 22(5), Published: Jan. 4, 2021
Recently, language representation models have drawn a lot of attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple yet powerful language model that achieved novel state-of-the-art performance. BERT adopted the concept of contextualized word embedding to capture the semantics and context of the words in which they appeared. In this study, we present a novel technique that incorporates a BERT-based multilingual model in bioinformatics to represent the information of DNA sequences. We treated DNA sequences as natural sentences and then used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, which is a well-known and challenging problem in this field. We observed that the BERT-based features improved results by more than 5-10% in terms of sensitivity, specificity, accuracy, and Matthews correlation coefficient compared to the current state-of-the-art features in bioinformatics. Moreover, advanced experiments show that deep learning (as represented by 2D convolutional neural networks; CNN) holds potential to learn such features better than other traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
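The core trick here, treating a DNA sequence as a "sentence" so that a BERT encoder can embed it into a fixed-length matrix, is simple to sketch. The snippet below is a minimal illustration under assumed choices (overlapping 3-mers as words, the open bert-base-multilingual-cased checkpoint, max length 128); it is not the authors' exact pipeline, whose tokenization and feature extraction may differ.

```python
# Minimal sketch (an assumption, not the authors' released pipeline):
# treat a DNA sequence as a "sentence" of overlapping k-mers and embed it
# with a BERT-style multilingual encoder into a fixed-length matrix.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def dna_to_sentence(seq: str, k: int = 3) -> str:
    """Turn 'ACGT...' into a whitespace-separated sentence of k-mers."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

sequence = "ACGTAGCTAGCTAGGCTAACG"
inputs = tokenizer(dna_to_sentence(sequence), return_tensors="pt",
                   padding="max_length", truncation=True, max_length=128)
with torch.no_grad():
    output = model(**inputs)

# One fixed-length numerical matrix per sequence (128 x 768 here),
# suitable as input to a downstream 2D CNN classifier.
matrix = output.last_hidden_state.squeeze(0)
print(matrix.shape)  # torch.Size([128, 768])
```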
Scientific Reports, Journal Year: 2021, Volume and Issue: 11(1), Published: Jan. 13, 2021
Abstract
Knowing protein function is crucial to advance molecular and medical biology, yet experimental function annotations through the Gene Ontology (GO) exist for fewer than 0.5% of all known proteins. Computational methods bridge this sequence-annotation gap typically through homology-based annotation transfer, by identifying sequence-similar proteins with known function, or through prediction methods using evolutionary information. Here, we propose predicting GO terms through annotation transfer based on proximity of proteins in SeqVec embedding space rather than in sequence space. These embeddings originate from deep learned language models (LMs) for protein sequences (SeqVec), transferring the knowledge gained from predicting the next amino acid in 33 million protein sequences. Replicating the conditions of CAFA3, our method reaches an Fmax of 37 ± 2%, 50 ± 3%, and 57 ± 2% for BPO, MFO, and CCO, respectively. Numerically, this appears close to the top ten CAFA3 methods. When restricting the transfer to proteins with < 20% pairwise sequence identity to the query, performance drops (Fmax BPO 33 ± 2%, MFO 43 ± 3%, CCO 53 ± 2%); this still outperforms naïve sequence-based transfer. Preliminary results from CAFA4 appear to confirm these findings. Overall, this new concept is likely to change the annotation of proteins, in particular for smaller families and proteins with intrinsically disordered regions.
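Annotation transfer by embedding proximity reduces to a nearest-neighbor lookup. The sketch below shows the idea with cosine similarity over stand-in vectors; the function name and toy data are illustrative, not the authors' implementation (SeqVec itself is not loaded here, but its per-protein embeddings are 1024-dimensional, which the toys mimic).

```python
# Hypothetical sketch of GO-term transfer by embedding proximity: transfer
# the annotations of the k nearest labeled proteins in embedding space.
import numpy as np

def transfer_go_terms(query_emb, lookup_embs, lookup_go, k=1):
    """Return the GO terms of the k most cosine-similar lookup proteins."""
    q = query_emb / np.linalg.norm(query_emb)
    L = lookup_embs / np.linalg.norm(lookup_embs, axis=1, keepdims=True)
    sims = L @ q
    nearest = np.argsort(-sims)[:k]
    terms = set()
    for i in nearest:
        terms |= lookup_go[i]
    return terms, sims[nearest]

rng = np.random.default_rng(0)
lookup_embs = rng.normal(size=(100, 1024))   # stand-in embedding lookup set
lookup_go = [{"GO:0005515"}] * 100           # stand-in GO annotations
query_emb = rng.normal(size=1024)
print(transfer_go_terms(query_emb, lookup_embs, lookup_go, k=3))
```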
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2022, Volume and Issue: unknown, Published: Dec. 22, 2022
Abstract
Learning the design patterns of proteins from sequences across evolution may have promise toward generative protein design. However, it is unknown whether language models, trained on sequences of natural proteins, will be capable of more than memorization of existing protein families. Here we show that language models generalize beyond natural proteins to generate de novo proteins. We focus on two protein design tasks: fixed backbone design, where the structure is specified, and unconstrained generation, where the structure is sampled from the model. Remarkably, although the models are trained only on sequences, we find that they are capable of designing structure. A total of 228 generated proteins were evaluated experimentally, with high overall success rates (152/228 or 67%) in producing a soluble and monomeric species by size exclusion chromatography. Out of the 152 successful designs, 35 have no significant sequence match to known natural proteins. Of the remaining 117, sequence identity to the nearest match is at a median of 27%, below 20% for 6 designs, and as low as 18% for 3 designs. For fixed backbone design, the language model generates successful designs for each of eight artificially created fixed backbone targets. For unconstrained generation, sampled proteins cover diverse topologies and secondary structure compositions, with a high experimental success rate (71/129 or 55%). The designs reflect deep patterns linking sequence and structure, including motifs that occur in related natural structures and motifs that are not observed in similar structural contexts in known protein families. These results show that language models, though trained only on sequences, learn a deep grammar that enables designing protein structure, extending beyond natural proteins.
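As a rough intuition for unconstrained generation from a sequence-only model, one can iteratively resample masked positions from a masked protein language model. The sketch below uses the open-source fair-esm package (ESM-2, 650M) with an assumed simple Gibbs-style loop; starting sequence, sweep count, and temperature are arbitrary choices, and the paper's actual generation and design procedure is considerably more involved.

```python
# Illustrative sketch only: Gibbs-style resampling from a masked protein
# language model as a stand-in for unconstrained sequence generation.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

AAS = "ACDEFGHIKLMNPQRSTVWY"
aa_idx = torch.tensor([alphabet.get_idx(a) for a in AAS])

length = 100
seq = "A" * length  # trivial starting sequence (an assumption)
for _ in range(500):  # number of resampling steps is an arbitrary choice
    pos = torch.randint(length, (1,)).item()
    _, _, tokens = batch_converter([("gen", seq)])
    tokens[0, pos + 1] = alphabet.mask_idx  # +1 skips the BOS token
    with torch.no_grad():
        logits = model(tokens)["logits"][0, pos + 1]
    probs = torch.softmax(logits[aa_idx], dim=-1)  # restrict to the 20 AAs
    new_aa = AAS[torch.multinomial(probs, 1).item()]
    seq = seq[:pos] + new_aa + seq[pos + 1:]
print(seq)
```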
Nature Communications, Journal Year: 2022, Volume and Issue: 13(1), Published: April 8, 2022
How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This begs the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking the representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.
Scientific Reports, Journal Year: 2022, Volume and Issue: 12(1), Published: May 19, 2022
Proteins are the essential biological macromolecules required to perform nearly all biological processes and cellular functions. Proteins rarely carry out their tasks in isolation; instead, they interact with other proteins (known as protein-protein interaction) present in their surroundings to complete biological activities. The knowledge of protein-protein interactions (PPIs) unravels cellular behavior and functionality. Computational methods that automate the prediction of PPIs are less expensive than experimental methods in terms of resources and time. So far, most works on PPI prediction have mainly focused on sequence information. Here, we use a graph convolutional network (GCN) and a graph attention network (GAT) to predict the interaction between proteins by utilizing each protein's structural information and sequence features. We build the graphs of proteins from their PDB files, which contain the 3D coordinates of atoms. The protein graph represents the amino acid network, also known as the residue contact network, where each node is a residue. Two nodes are connected if they have a pair of atoms (one from each node) within a threshold distance. To extract the node/residue features, we use a protein language model: the input to the model is the protein sequence, and the output is a feature vector for each amino acid of the underlying sequence. We validate the predictive capability of the proposed graph-based approach on two PPI datasets: Human and S. cerevisiae. The obtained results demonstrate the effectiveness of the approach, as it outperforms previous leading methods. The source code and the training data are available at https://github.com/JhaKanchan15/PPI_GNN.git.
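The residue-contact-graph construction described above is straightforward to prototype. The sketch below builds nodes and edges from a PDB file with Biopython, connecting two residues when any pair of their atoms falls within a threshold distance; the 6.0 Å cutoff, chain ID, and file path are illustrative assumptions, not values taken from the paper (see the authors' repository for their pipeline).

```python
# Sketch of a residue contact graph: nodes are residues, and an edge joins
# two residues if any pair of their atoms is within a threshold distance.
import numpy as np
from Bio.PDB import PDBParser

def contact_graph(pdb_path: str, chain_id: str = "A", threshold: float = 6.0):
    structure = PDBParser(QUIET=True).get_structure("prot", pdb_path)
    # Keep standard residues only (hetero/water records have id[0] != " ").
    residues = [r for r in structure[0][chain_id] if r.id[0] == " "]
    coords = [np.array([atom.coord for atom in r]) for r in residues]
    edges = []
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            # Minimum atom-atom distance between residues i and j.
            d = np.linalg.norm(coords[i][:, None, :] - coords[j][None, :, :],
                               axis=-1).min()
            if d <= threshold:
                edges.append((i, j))
    return residues, edges

residues, edges = contact_graph("example.pdb")  # hypothetical input file
print(len(residues), "residues,", len(edges), "contact edges")
```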
Human Genetics, Journal Year: 2021, Volume and Issue: 141(10), P. 1629 - 1647, Published: Dec. 30, 2021
The emergence of SARS-CoV-2 variants stressed the demand for tools allowing to interpret the effect of single amino acid variants (SAVs) on protein function. While Deep Mutational Scanning (DMS) sets continue to expand our understanding of the mutational landscape of proteins, the results continue to challenge analyses. Protein Language Models (pLMs) use the latest deep learning (DL) algorithms to leverage growing databases of protein sequences. These methods learn to predict missing or masked amino acids from the context of entire sequence regions. Here, we used pLM representations (embeddings) to predict sequence conservation and SAV effects without multiple sequence alignments (MSAs). Embeddings alone predicted residue conservation almost as accurately from single sequences as ConSeq using MSAs (two-state Matthews Correlation Coefficient, MCC, for ProtT5 embeddings of 0.596 ± 0.006 vs. 0.608 for ConSeq). Inputting the conservation prediction along with BLOSUM62 substitution scores and pLM mask reconstruction probabilities into a simplistic logistic regression (LR) ensemble for Variant Effect Score Prediction without Alignments (VESPA) reached state-of-the-art performance without any optimization on DMS data. Comparing predictions on a standard set of 39 DMS experiments to other methods (incl. ESM-1v, DeepSequence, GEMME) revealed our approach as competitive with the state-of-the-art (SOTA) methods using MSA input. No method outperformed all others, neither consistently nor statistically significantly, independently of the performance measure applied (Spearman or Pearson correlation). Finally, we investigated binary effect predictions for four human proteins. Overall, embedding-based methods have become competitive with methods relying on MSAs at a fraction of the costs in computing/energy. Our method predicted SAV effects for the entire human proteome (~ 20 k proteins) within 40 min on one Nvidia Quadro RTX 8000. All data are freely available for local and online execution through bioembeddings.com, https://github.com/Rostlab/VESPA, and PredictProtein.
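The LR ensemble at the heart of this idea is small enough to sketch. Below, per-variant features (a conservation prediction, a BLOSUM62 substitution score, a log mask-reconstruction probability) feed a scikit-learn logistic regression; the feature values and labels are invented placeholders, not VESPA's released model or training data.

```python
# Toy sketch of a logistic-regression ensemble over per-variant features.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows: [conservation_pred, blosum62_score, log_mask_prob] (toy values)
X_train = np.array([
    [0.9, -3.0, -4.2],   # e.g., conserved position, unfavorable substitution
    [0.1,  2.0, -0.5],   # e.g., variable position, favorable substitution
    [0.8, -2.0, -3.1],
    [0.2,  1.0, -0.9],
])
y_train = np.array([1, 0, 1, 0])  # 1 = effect, 0 = neutral (toy labels)

clf = LogisticRegression().fit(X_train, y_train)
x_new = np.array([[0.7, -1.0, -2.5]])
print(clf.predict_proba(x_new)[0, 1])  # variant effect score in [0, 1]
```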
Recent developments in deep learning, coupled with an increasing number of sequenced proteins, have led to a breakthrough in life science applications, in particular in protein property prediction. There is hope that deep learning can close the gap between the number of sequenced proteins and the number of proteins with known properties based on lab experiments. Language models from the field of natural language processing have gained popularity for protein property predictions and have led to a new computational revolution in biology, where old prediction results are being improved regularly. Such language models learn useful multipurpose representations of proteins from large open repositories of protein sequences and can be used, for instance, to predict protein properties. The field is growing quickly because of a new class of model: the Transformer model. We review the recent use of large-scale language models in applications of predicting protein characteristics and how such models can be used to predict, for example, post-translational modifications. We also review the shortcomings of other methods and explain how language models have proven a very promising way to unravel the information hidden in sequences of amino acids.