Current Opinion in Chemical Biology, 2021, Volume 65, pp. 18–27. Published: May 26, 2021.
Protein engineering seeks to identify protein sequences with optimized properties. When guided by machine learning, sequence generation methods can draw on prior knowledge and experimental efforts to improve this process. In this review, we highlight recent applications of machine learning to generate protein sequences, focusing on the emerging field of deep generative methods.
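To make the loop this review surveys more concrete, below is a minimal sketch of machine-learning-guided sequence generation: propose variants, score them with a learned surrogate model, and carry the best candidates into the next round. Everything here is illustrative; the mutation operator, the toy scoring function, and all names are assumptions, not the method of any particular paper.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def mutate(seq: str, n_mutations: int = 1) -> str:
    """Propose a variant by substituting residues at random positions."""
    seq = list(seq)
    for pos in random.sample(range(len(seq)), n_mutations):
        seq[pos] = random.choice(AMINO_ACIDS)
    return "".join(seq)

def guided_generation(wild_type, score, n_rounds=10, pool_size=100, keep=10):
    """Greedy model-guided search: in each round, generate candidates,
    rank them with the surrogate model, and keep the top scorers."""
    population = [wild_type]
    for _ in range(n_rounds):
        candidates = [mutate(s) for s in population
                      for _ in range(pool_size // len(population))]
        candidates.sort(key=score, reverse=True)  # higher = better predicted property
        population = candidates[:keep]
    return population

# Toy usage: the lambda stands in for a trained property predictor.
best = guided_generation("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
                         score=lambda s: s.count("K"))
```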
Proceedings of the National Academy of Sciences, 2021, Volume 118, Issue 15. Published: April 5, 2021.
Significance
Learning biological properties from sequence data is a logical step toward generative and predictive artificial intelligence for biology. Here, we propose scaling a deep contextual language model with unsupervised learning to sequences spanning evolutionary diversity. We find that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and biological activity. We show the learned representations are useful across benchmarks for remote homology detection and for prediction of secondary structure, long-range residue–residue contacts, and mutational effect. Unsupervised representation learning enables state-of-the-art supervised prediction of mutational effect and secondary structure, and improves state-of-the-art features for long-range contact prediction.
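For readers who want to reproduce the representation-extraction step, the sketch below uses the fair-esm Python package (pip install fair-esm), which distributes pretrained weights for this family of models; the model name, layer index, and output keys follow that package's documented interface, and the example sequence is arbitrary.

```python
import torch
import esm

# Load a pretrained protein language model and its tokenizer-like alphabet.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("query", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)

token_embeddings = results["representations"][33]  # per-residue vectors (incl. BOS/EOS tokens)
contact_map = results["contacts"]                  # predicted residue-residue contact probabilities
```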
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, Volume 44, Issue 10, pp. 7112–7127. Published: July 7, 2021.
Computational biology and bioinformatics provide vast data gold-mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw pLM embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) per-residue (per-token) prediction of secondary structure (3-state accuracy Q3 = 81%-87%); (2) per-protein (pooling) predictions of sub-cellular location (ten-state accuracy Q10 = 81%) and membrane versus water-soluble proteins (2-state accuracy Q2 = 91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state-of-the-art without using multiple sequence alignments (MSAs) or evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans
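A minimal embedding-extraction sketch for the most informative model (ProtT5) follows, using the Hugging Face transformers package; the checkpoint id and the preprocessing (spaces between residues, rare amino acids mapped to X) follow the linked repository's documentation, but treat the details as assumptions rather than a verbatim recipe.

```python
import re
import torch
from transformers import T5Tokenizer, T5EncoderModel

checkpoint = "Rostlab/prot_t5_xl_half_uniref50-enc"  # encoder-only ProtT5 weights
tokenizer = T5Tokenizer.from_pretrained(checkpoint, do_lower_case=False)
model = T5EncoderModel.from_pretrained(checkpoint).eval()

seq = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVAT"
# ProtT5 expects space-separated residues, with rare amino acids mapped to X.
spaced = " ".join(re.sub(r"[UZOB]", "X", seq))
inputs = tokenizer(spaced, return_tensors="pt")

with torch.no_grad():
    emb = model(**inputs).last_hidden_state  # (1, L+1, 1024); final token is </s>

per_residue = emb[0, : len(seq)]             # one 1024-d vector per residue
per_protein = per_residue.mean(dim=0)        # mean-pool for per-protein tasks
```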
BMC Bioinformatics, 2019, Volume 20, Issue 1. Published: Dec. 1, 2019.
Predicting protein function and structure from sequence is one important challenge for computational biology. For 26 years, most state-of-the-art approaches combined machine learning and evolutionary information. However, for some applications retrieving related proteins is becoming too time-consuming. Additionally, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome. Both these problems are addressed by the new methodology introduced here. We introduce a novel way to represent protein sequences as continuous vectors (embeddings) by using the language model ELMo taken from natural language processing. By modeling protein sequences, ELMo effectively captured the biophysical properties of the language of life from unlabeled big data (UniRef50). We refer to these new embeddings as SeqVec (Sequence-to-Vector) and demonstrate their effectiveness by training simple neural networks for two different tasks. At the per-residue level, secondary structure (Q3 = 79% ± 1, Q8 = 68% ± 1) and regions with intrinsic disorder (MCC = 0.59 ± 0.03) were predicted significantly better than through one-hot encoding or through Word2vec-like approaches. At the per-protein level, subcellular localization was predicted in ten classes (Q10 = 68%) and membrane-bound proteins were distinguished from water-soluble proteins (Q2 = 87% ± 1). Although SeqVec embeddings generated the best predictions from single sequences, no solution improved over the best existing method using evolutionary information. Nevertheless, our approach improved over some popular methods using evolutionary information, and for some proteins even did beat the best. Thus, they prove to condense the underlying principles of protein sequences. Overall, the important novelty is speed: where the lightning-fast HHblits needed on average about two minutes to generate the evolutionary information for a target protein, SeqVec created embeddings on average in 0.03 s. As this speed-up is independent of the size of growing sequence databases, SeqVec provides a highly scalable approach for the analysis of big data in proteomics, i.e. microbiome or metaproteome analysis. Transfer-learning succeeded in extracting information from unlabeled sequence databases relevant for various protein prediction tasks. SeqVec modeled the language of life, namely the principles underlying protein sequences, better than any features suggested by textbooks and prediction methods. The exception is evolutionary information; however, that information is not available on the level of a single sequence.
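For orientation, a sketch of how SeqVec-style embeddings are typically computed and pooled is shown below, based on the ELMo implementation in the allennlp package. The option and weight file names refer to local copies of the authors' released model files, and the pooling choices (layer sum, length mean) mirror the per-residue/per-protein split described above; paths and details here are assumptions, not a verbatim recipe.

```python
import torch
from allennlp.commands.elmo import ElmoEmbedder

# Paths to locally downloaded SeqVec model files (assumed file names).
embedder = ElmoEmbedder("options.json", "weights.hdf5", cuda_device=-1)

seq = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELS"
layers = embedder.embed_sentence(list(seq))  # numpy array, shape (3, L, 1024)

emb = torch.tensor(layers).sum(dim=0)        # per-residue: sum the three ELMo layers
protein_vec = emb.mean(dim=0)                # per-protein: average over the sequence
```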
bioRxiv (Cold Spring Harbor Laboratory), 2019. Published: June 20, 2019.
Abstract
Protein modeling is an increasingly popular area of machine learning research. Semi-supervised learning has emerged as an important paradigm in protein modeling due to the high cost of acquiring supervised labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
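The repository linked above ships a pip-installable package (published as tape_proteins, imported as tape); the sketch below, adapted from its documented usage, shows how a pretrained TAPE model yields both per-residue and per-protein features. The model name and vocabulary string follow that documentation and should be treated as assumptions if the package has since changed.

```python
import torch
from tape import ProteinBertModel, TAPETokenizer

model = ProteinBertModel.from_pretrained("bert-base")  # pretrained TAPE transformer
tokenizer = TAPETokenizer(vocab="iupac")

sequence = "GCTVEDRCLIGMGAILLNGCVIGSGSLVAAGALITQ"
token_ids = torch.tensor([tokenizer.encode(sequence)])

output = model(token_ids)
sequence_output = output[0]  # per-residue embeddings for per-token tasks
pooled_output = output[1]    # single vector per protein for per-sequence tasks
```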
Cell Systems, 2021, Volume 12, Issue 6, pp. 654–669.e3. Published: June 1, 2021.
Language models have recently emerged as a powerful machine-learning approach for distilling information from massive protein sequence databases. From readily available sequence data alone, these models discover evolutionary, structural, and functional organization across protein space. Using language models, we can encode amino-acid sequences into distributed vector representations that capture their structural and functional properties, as well as evaluate the evolutionary fitness of sequence variants. We discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. We then consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding protein structural knowledge into the learned representations. The knowledge distilled by these models allows us to improve downstream function prediction through transfer learning. Deep protein language models are revolutionizing protein biology. They suggest new ways to approach protein and therapeutic design. However, further developments are needed to encode strong biological priors into protein language models and to increase their accessibility to the broader community.
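The claim that language models can "evaluate the evolutionary fitness of sequence variants" is often operationalized as a masked-marginal log-odds score: mask the mutated site and compare the model's log-probabilities of the mutant and wild-type residues. A minimal sketch follows, again using the fair-esm package; this scoring heuristic is one common choice in the literature, not the specific method of this review.

```python
import torch
import esm

model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
batch_converter = alphabet.get_batch_converter()
model.eval()

wild_type = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
_, _, tokens = batch_converter([("wt", wild_type)])

def variant_score(pos: int, mut_aa: str) -> float:
    """Log-odds of mutant vs. wild-type residue with the site masked
    (a common proxy for the fitness effect of a substitution)."""
    masked = tokens.clone()
    masked[0, pos + 1] = alphabet.mask_idx  # +1 skips the BOS token
    with torch.no_grad():
        logits = model(masked)["logits"]
    log_probs = torch.log_softmax(logits[0, pos + 1], dim=-1)
    wt_idx = alphabet.get_idx(wild_type[pos])
    return (log_probs[alphabet.get_idx(mut_aa)] - log_probs[wt_idx]).item()

print(variant_score(10, "A"))  # predicted effect of mutating position 11 to alanine
```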
Computational and Structural Biotechnology Journal, 2021, Volume 19, pp. 1750–1758. Published: Jan. 1, 2021.
Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore the conceptual similarities and differences between proteins and language, and review the range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized methods, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.
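As a concrete instance of the classic encodings this review lists, here is a minimal sketch of a k-mer (n-gram) bag-of-words featurization of protein sequences using scikit-learn's character n-gram analyzer; the choice of k = 3 and the toy sequences are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer

sequences = [
    "MKTVRQERLKSIVRILERSKEPVSGAQLAEELS",
    "MKQLEDKVEELLSKNYHLENEVARLKKLVGER",
]

# Treat each protein as a "document" and each overlapping 3-mer as a "word".
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3), lowercase=False)
X = vectorizer.fit_transform(sequences)  # sparse matrix: sequences x observed 3-mers

print(X.shape)
print(vectorizer.get_feature_names_out()[:10])  # first few 3-mer features
```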
Nature Communications, 2021, Volume 12, Issue 1. Published: April 23, 2021.
Abstract
The ability to design functional sequences and predict effects of variation is central to protein engineering and biotherapeutics. State-of-art computational methods rely on models that leverage evolutionary information but are inadequate for important applications where multiple sequence alignments are not robust. Such applications include the prediction of variant effects of indels, disordered proteins, and proteins such as antibodies due to their highly variable complementarity determining regions. We introduce a deep generative model adapted from natural language processing for prediction and design of diverse functional sequences without the need for alignments. The model performs state-of-art prediction of missense and indel effects, and we successfully design and test a diverse 10^5-nanobody library that shows better expression than a 1000-fold larger synthetic library. Our results demonstrate the power of alignment-free autoregressive models in generalizing to regions of sequence space traditionally considered beyond the reach of prediction and design.
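To illustrate the alignment-free scoring such autoregressive models enable, a generic PyTorch sketch follows: it computes a sequence's log-likelihood as the sum of per-residue conditionals under a left-to-right model. The tiny LSTM stands in for the paper's actual architecture, which this sketch does not attempt to reproduce.

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
stoi = {a: i + 1 for i, a in enumerate(AA)}  # index 0 reserved as start token

class TinyAutoregressiveLM(nn.Module):
    """Left-to-right model: predicts residue t from residues < t."""
    def __init__(self, vocab=len(AA) + 1, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)

def log_likelihood(model, seq: str) -> float:
    """Sum of log p(x_t | x_<t): an alignment-free sequence score."""
    idx = torch.tensor([[0] + [stoi[a] for a in seq]])  # prepend start token
    logits = model(idx[:, :-1])
    logp = torch.log_softmax(logits, dim=-1)
    target = idx[:, 1:]
    return logp.gather(-1, target.unsqueeze(-1)).sum().item()

model = TinyAutoregressiveLM()  # untrained here; in practice fit on a sequence family
print(log_likelihood(model, "MKTVRQERLKSIVRILERS"))
```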
bioRxiv (Cold Spring Harbor Laboratory), 2022. Published: July 22, 2022.
Abstract
Recent breakthroughs have used deep learning to exploit evolutionary information in multiple sequence alignments (MSAs) to accurately predict protein structures. However, MSAs of homologous proteins are not always available, such as with orphan proteins or fast-evolving proteins like antibodies, and a protein typically folds in a natural setting from its primary amino acid sequence into three-dimensional structure, suggesting that evolutionary information and MSAs should not be necessary to predict a protein's folded form. Here, we introduce OmegaFold, the first computational method to successfully predict high-resolution protein structure from a single primary sequence alone. Using a new combination of a protein language model that allows us to make predictions from single sequences and a geometry-inspired transformer model trained on protein structures, OmegaFold outperforms RoseTTAFold and achieves similar prediction accuracy to AlphaFold2 on recently released structures. OmegaFold enables accurate predictions on orphan proteins that do not belong to any functionally characterized protein family and on antibodies that tend to have noisy MSAs due to fast evolution. Our study fills a much-encountered gap in structure prediction and brings us a step closer to understanding protein folding in nature.