ACM Computing Surveys, Journal Year: 2023, Volume and Issue: 56(3), P. 1 - 52, Published: Aug. 1, 2023
Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing tasks. This also benefits the biomedical domain: researchers from informatics, medicine, and computer science communities propose various PLMs trained on various datasets, e.g., biomedical text, electronic health records, protein, and DNA sequences. However, the cross-discipline characteristics of biomedical PLMs hinder their spreading among communities; some existing works are isolated from each other without comprehensive comparison and discussions. It is nontrivial to make a survey that not only systematically reviews recent advances in biomedical PLMs and their applications but also standardizes terminology and benchmarks. This article summarizes the recent progress of pre-trained language models in the biomedical domain and their applications in downstream biomedical tasks. Particularly, we discuss the motivations of PLMs in the biomedical domain and introduce the key concepts of pre-trained language models. We then propose a taxonomy of existing biomedical PLMs that categorizes them from various perspectives systematically. Plus, their applications in biomedical downstream tasks are exhaustively discussed, respectively. Last, we illustrate various limitations and future trends, which aims to provide inspiration for future research.

Nature, Journal Year: 2021, Volume and Issue: 596(7873), P. 583 - 589, Published: July 15, 2021
Abstract
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort1–4, the structures of around 100,000 unique proteins have been determined5, but this represents a small fraction of the billions of known protein sequences6,7. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’8—has been an important open research problem for more than 50 years9. Despite recent progress10–14, existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14)15, demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.
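
A recurring ingredient in this entry is the multiple sequence alignment: AlphaFold's inputs leverage the evolutionary record of a protein family. As a rough, hedged illustration of that signal (a toy NumPy sketch, not AlphaFold's actual feature pipeline; the alignment below is made up), the following computes a per-column amino-acid profile and conservation entropy:

```python
# Toy sketch: per-column amino-acid profile from a multiple sequence
# alignment (MSA). Illustrates the evolutionary signal MSA-based models
# consume; NOT AlphaFold's actual feature pipeline.
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids + gap

# Hypothetical toy MSA: aligned homologous sequences of equal length.
msa = [
    "MKT-LLV",
    "MKS-LLV",
    "MRT-ILV",
    "MKTALLI",
]

L = len(msa[0])
profile = np.zeros((L, len(ALPHABET)))
for seq in msa:
    for i, aa in enumerate(seq):
        profile[i, ALPHABET.index(aa)] += 1
profile /= len(msa)  # column-wise relative frequencies

# Per-column Shannon entropy: low entropy = evolutionarily conserved column.
eps = 1e-9
entropy = -(profile * np.log2(profile + eps)).sum(axis=1)
for i, h in enumerate(entropy):
    print(f"column {i}: entropy = {h:.2f} bits")
```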

Science, Journal Year: 2023, Volume and Issue: 379(6637), P. 1123 - 1130, Published: March 16, 2023
Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.
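
For readers who want to try this kind of single-sequence structure prediction on a small scale, a minimal sketch follows using the authors' open-source `esm` package; the calls mirror the README of https://github.com/facebookresearch/esm at the time of writing, but the sequence is a made-up toy example, and the checkpoint download plus a CUDA GPU are required in practice:

```python
# Hedged sketch: predict a structure with ESMFold from the open-source
# `esm` package. Requires the esmfold extras and a GPU; sequence is a toy.
import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)

with open("prediction.pdb", "w") as fh:
    fh.write(pdb_string)  # PDB text; per-residue confidence stored in B-factors
```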

Proceedings of the National Academy of Sciences, Journal Year: 2021, Volume and Issue: 118(15), Published: April 5, 2021
Significance
Learning biological properties from sequence data is a logical step toward generative and predictive artificial intelligence for biology. Here, we propose scaling a deep contextual language model with unsupervised learning to sequences spanning evolutionary diversity. We find that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and biological activity. We show the learned representations are useful across benchmarks for remote homology detection, prediction of secondary structure, long-range residue–residue contacts, and mutational effect. Unsupervised representation learning enables state-of-the-art supervised prediction of mutational effect and secondary structure and improves state-of-the-art features for long-range contact prediction.
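
The claims about contacts and learned representations can be probed directly with the released models. A hedged sketch (the pip package `fair-esm` exposes the calls below per its README; the sequence is a toy example) extracting per-residue representations and unsupervised contact probabilities:

```python
# Hedged sketch: per-residue representations and unsupervised contact maps
# from the authors' `esm` package (pip install fair-esm). Toy sequence.
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("toy_protein", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVAT")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33], return_contacts=True)

per_residue = out["representations"][33]  # (1, seq_len+2, 1280), incl. BOS/EOS
contacts = out["contacts"]                # (1, seq_len, seq_len) contact probabilities
print(per_residue.shape, contacts.shape)
```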

IEEE Transactions on Pattern Analysis and Machine Intelligence, Journal Year: 2021, Volume and Issue: 44(10), P. 7112 - 7127, Published: July 7, 2021
Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up-to 1024 cores. Dimensionality reduction revealed that the raw pLM-embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) a per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (ten-state accuracy: Q10=81%) and membrane versus water-soluble (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using multiple sequence alignments (MSAs) or evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans
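
The per-residue and per-protein (pooled) embedding uses described above can be reproduced with the released checkpoints. A hedged sketch following the ProtTrans README (toy sequences; the ProtT5-XL checkpoint is a multi-gigabyte download):

```python
# Hedged sketch: per-residue ProtT5 embeddings via Hugging Face transformers,
# following https://github.com/agemagician/ProtTrans. Toy sequences.
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").eval()

sequences = ["MKTVRQERLK", "SEQWENCE"]
# ProtT5 expects space-separated residues, with rare amino acids mapped to X.
sequences = [" ".join(re.sub(r"[UZOB]", "X", s)) for s in sequences]

batch = tokenizer(sequences, padding=True, return_tensors="pt")
with torch.no_grad():
    emb = model(input_ids=batch.input_ids,
                attention_mask=batch.attention_mask).last_hidden_state

# emb: (batch, max_len+1, 1024) per-residue embeddings; mean-pool over the
# length dimension for a per-protein embedding, as in the paper's pooling tasks.
per_protein = emb.mean(dim=1)
print(per_protein.shape)
```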

Bioinformatics, Journal Year: 2022, Volume and Issue: 38(8), P. 2102 - 2110, Published: Jan. 8, 2022
Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs. ProteinBERT obtains near state-of-the-art performance, and sometimes exceeds it, on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.
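
To make the local/global two-track idea concrete, here is a schematic PyTorch sketch: a per-residue track trained with masked language modeling and a pooled global track trained on multi-label GO prediction. It is an illustrative skeleton under assumed sizes, not the authors' actual implementation:

```python
# Schematic sketch of ProteinBERT's two-track idea: a local (per-residue)
# representation with a masked-LM head and a global (per-protein)
# representation with a GO-annotation head. Sizes are assumptions, and this
# is NOT the authors' implementation.
import torch
import torch.nn as nn

VOCAB, N_GO, D_LOCAL, D_GLOBAL = 26, 8000, 128, 512  # assumed sizes

class TwoTrackModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_LOCAL)
        self.local_encoder = nn.Conv1d(D_LOCAL, D_LOCAL, kernel_size=9, padding=4)
        self.to_global = nn.Linear(D_LOCAL, D_GLOBAL)
        self.lm_head = nn.Linear(D_LOCAL, VOCAB)   # masked-token recovery
        self.go_head = nn.Linear(D_GLOBAL, N_GO)   # multi-label GO prediction

    def forward(self, tokens):                      # tokens: (B, L)
        local = self.embed(tokens).transpose(1, 2)  # (B, D_LOCAL, L)
        local = torch.relu(self.local_encoder(local)).transpose(1, 2)  # (B, L, D_LOCAL)
        global_repr = self.to_global(local.mean(dim=1))                # (B, D_GLOBAL)
        return self.lm_head(local), self.go_head(global_repr)

model = TwoTrackModel()
tokens = torch.randint(0, VOCAB, (2, 100))
lm_logits, go_logits = model(tokens)
# A pretraining loss would combine cross-entropy on masked positions with
# binary cross-entropy over GO labels (nn.BCEWithLogitsLoss on go_logits).
```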

arXiv (Cornell University), Journal Year: 2019, Volume and Issue: unknown, Published: Jan. 1, 2019
Machine learning applied to protein sequences is an increasingly popular area of research. Semi-supervised learning for proteins has emerged as an important paradigm due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
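
The TAPE repository also ships pretrained models behind a small API. A hedged usage sketch following the README at https://github.com/songlab-cal/tape (pip package `tape_proteins`; the sequence is a toy example):

```python
# Hedged sketch: embed a protein with TAPE's reference implementation,
# following the repository README.
import torch
from tape import ProteinBertModel, TAPETokenizer

model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')

sequence = 'GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ'  # toy sequence
token_ids = torch.tensor([tokenizer.encode(sequence)])
output = model(token_ids)

sequence_output = output[0]  # per-residue embeddings
pooled_output = output[1]    # per-protein embedding
print(sequence_output.shape, pooled_output.shape)
```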

Computational and Structural Biotechnology Journal, Journal Year: 2021, Volume and Issue: 19, P. 1750 - 1758, Published: Jan. 1, 2021
Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit for many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review the range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.
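
As a concrete example of the "classic" encodings the review covers, here is a minimal bag-of-k-mers sketch in plain Python (the sequences are made-up toy examples):

```python
# Minimal sketch of a classic bag-of-words encoding for proteins: treat
# overlapping k-mers (n-grams of amino acids) as the vocabulary, as in the
# pre-deep-learning methods the review describes.
from collections import Counter

def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Count overlapping k-mers in a protein sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

proteins = {
    "toy_kinase": "MKKFFDSRREQGGSGLGSG",
    "toy_channel": "MKVLLLAIVFLTGSQA",
}
for name, seq in proteins.items():
    counts = kmer_counts(seq)
    print(name, counts.most_common(3))
```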

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2019, Volume and Issue: unknown, Published: April 29, 2019
Abstract
In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In the life sciences, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Protein language modeling at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multi-scale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections. Representation learning produces features that generalize across a range of applications, enabling state-of-the-art supervised prediction of mutational effect and secondary structure, and improving state-of-the-art features for long-range contact prediction.
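
The "linear projections" claim suggests a simple probing recipe: fit a linear classifier on frozen per-residue embeddings. A hedged sketch with random stand-ins for the real inputs (in practice, representations extracted with a pretrained protein LM and DSSP-derived 3-state secondary-structure labels):

```python
# Sketch of a linear probe over frozen per-residue embeddings. Embeddings
# and labels here are random stand-ins, so the printed accuracy is chance.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_residues, dim = 5000, 1280              # e.g., ESM-1b embedding width
X = rng.normal(size=(n_residues, dim))    # stand-in for frozen embeddings
y = rng.integers(0, 3, size=n_residues)   # stand-in 3-state labels (H/E/C)

probe = LogisticRegression(max_iter=1000)
probe.fit(X[:4000], y[:4000])
print("probe accuracy:", probe.score(X[4000:], y[4000:]))
```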

Nucleic Acids Research, Journal Year: 2021, Volume and Issue: 49(W1), P. W535 - W540, Published: May 11, 2021
Abstract
Since 1992, PredictProtein (https://predictprotein.org) is a one-stop online resource for protein sequence analysis with its main site hosted at the Luxembourg Centre for Systems Biomedicine (LCSB) and queried monthly by over 3,000 users in 2020. PredictProtein was the first Internet server for protein predictions. It pioneered combining evolutionary information and machine learning. Given a protein sequence as input, the server outputs multiple sequence alignments, predictions of protein structure in 1D and 2D (secondary structure, solvent accessibility, transmembrane segments, disordered regions, protein flexibility, and disulfide bridges) and predictions of protein function (functional effects of sequence variation or point mutations, Gene Ontology (GO) terms, subcellular localization, protein-, RNA-, and DNA binding). PredictProtein's infrastructure has moved to the LCSB, increasing throughput; the use of MMseqs2 sequence search reduced runtime five-fold (apparently without lowering the performance of the prediction methods); user interface elements improved usability, and new prediction methods were added. Recently included are predictions from deep learning embeddings (GO terms and secondary structure) and a method predicting proteins and residues binding DNA, RNA, or other proteins. PredictProtein.org aspires to provide reliable predictions to computational and experimental biologists alike. All scripts and methods are freely available for offline execution in high-throughput settings.
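
The MMseqs2 speed-up mentioned above, and the offline high-throughput use case, can be combined from Python. A hedged sketch follows: `easy-search` is a documented MMseqs2 subcommand, MMseqs2 must be installed and on PATH, and all file paths below are placeholders:

```python
# Hedged sketch: run an MMseqs2 sequence search from Python, the kind of
# fast database search PredictProtein's pipeline relies on. Paths are
# placeholders; MMseqs2 must be installed separately.
import subprocess

query_fasta = "query.fasta"      # placeholder input
target_db = "uniref50.fasta"     # placeholder target database
result_tsv = "hits.m8"           # BLAST-tab-style output
tmp_dir = "mmseqs_tmp"

subprocess.run(
    ["mmseqs", "easy-search", query_fasta, target_db, result_tsv, tmp_dir],
    check=True,
)

# Each line of hits.m8 describes one hit: query, target, sequence identity,
# alignment length, mismatches, gap opens, query/target coordinates,
# e-value, and bit score.
with open(result_tsv) as fh:
    for line in fh:
        print(line.rstrip())
```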