BMC Bioinformatics, Journal Year: 2024, Volume and Issue: 25(1), Published: March 16, 2024

Protein language models, inspired by the success of large language models in deciphering human language, have emerged as powerful tools for unraveling the intricate code of life inscribed within protein sequences. They have gained significant attention for their promising applications across various areas, including sequence-based prediction of secondary and tertiary structure, discovery of new functional sequences/folds, and assessment of mutational impact on fitness. However, their utility in learning to predict residue properties based on scant datasets, such as protein-protein interaction (PPI)-hotspots whose mutations significantly impair PPIs, remained unclear. Here, we explore the feasibility of using protein-language-learned representations as features for machine learning of PPI-hotspots using a dataset containing 414 experimentally confirmed PPI-hotspots and 504 PPI-nonhot spots.
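As an illustration of the feature-based setup this abstract describes (fixed language-model representations fed to a conventional classifier), here is a minimal sketch in which random vectors stand in for PLM embeddings and a plain logistic-regression classifier is trained on synthetic labels. Nothing here reproduces the paper's actual models, features, or data.

```python
import numpy as np

# Hypothetical sketch: fixed "language-model" residue representations are used
# as features for a binary hotspot/non-hotspot classifier. Random vectors
# stand in for real PLM embeddings; labels are synthetic.
rng = np.random.default_rng(0)

def make_toy_data(n=200, dim=16):
    # Linearly separable toy "embeddings" so the classifier has signal to learn.
    w_true = rng.normal(size=dim)
    X = rng.normal(size=(n, dim))
    y = (X @ w_true > 0).astype(float)
    return X, y

def train_logistic_regression(X, y, lr=0.5, steps=300):
    # Plain batch gradient descent on the logistic loss.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

X, y = make_toy_data()
w, b = train_logistic_regression(X, y)
accuracy = np.mean(((X @ w + b) > 0).astype(float) == y)
```

In a real pipeline the rows of `X` would come from a pretrained protein language model (one embedding per residue of interest) and the classifier could be any standard supervised model; the point is only that the representation is frozen and the downstream model is small.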
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown, Published: July 2, 2024

Abstract: More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained on tokens generated by evolution can act as evolutionary simulators to generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to biological alignment. We prompted ESM3 to generate fluorescent proteins with a chain of thought. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% identity) from known fluorescent proteins. Similarly distant natural fluorescent proteins are separated by over five hundred million years of evolution.
Science, Journal Year: 2025, Volume and Issue: unknown, Published: Jan. 16, 2025

More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained at scale on evolutionary data can generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to alignment to improve fidelity. We prompted ESM3 to generate fluorescent proteins. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% sequence identity) from known fluorescent proteins, which we estimate is equivalent to simulating five hundred million years of evolution.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown, Published: Feb. 8, 2024

Abstract: Large pretrained protein language models (PLMs) have improved protein property and structure prediction from sequences via transfer learning, in which weights and representations from PLMs are repurposed for downstream tasks. Although PLMs have shown great promise, currently there is little understanding of how the features learned by pretraining relate to and are useful for downstream tasks. We perform a systematic analysis of transfer learning using PLMs, conducting 370 experiments across a comprehensive suite of factors including different downstream tasks, architectures, model sizes, model depths, and pretraining time. We observe that while almost all downstream tasks do benefit compared to naive sequence representations, for the majority of tasks performance does not scale with pretraining and instead relies on low-level features learned early in pretraining. Our results point to a mismatch between current PLM pretraining paradigms and most applications of these models, indicating a need for better pretraining methods.
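A minimal sketch of the transfer-learning setup this abstract studies: frozen per-residue embeddings are pooled into a fixed-size vector and handed to a downstream model. The embedding table here is a random stand-in for a pretrained PLM's output, chosen only to keep the example self-contained; the paper's actual experiments use real pretrained models.

```python
import numpy as np

# Random stand-in for the per-residue embeddings a pretrained PLM would emit.
rng = np.random.default_rng(1)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
EMBED_DIM = 8
embedding_table = {aa: rng.normal(size=EMBED_DIM) for aa in AMINO_ACIDS}

def mean_pool(seq):
    # Mean of per-residue embeddings: sequences of any length map to the
    # same EMBED_DIM-dimensional feature space, ready for a downstream model.
    residue_embeddings = np.stack([embedding_table[aa] for aa in seq])
    return residue_embeddings.mean(axis=0)

feat_short = mean_pool("ACDE")
feat_long = mean_pool("ACDEFGHIKLMNPQRSTVWY")
```

The "naive sequence representation" baseline the abstract mentions would replace the learned embedding table with something like one-hot vectors; the comparison in the paper asks how much the pretrained table actually helps.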
Nature Communications, Journal Year: 2024, Volume and Issue: 15(1), Published: July 2, 2024

Accurately modeling protein fitness landscapes holds great importance for protein engineering. Pre-trained protein language models have achieved state-of-the-art performance in predicting fitness without wet-lab experimental data, but their accuracy and interpretability remain limited. On the other hand, traditional supervised deep learning models require abundant labeled training examples for performance improvements, posing a practical barrier. In this work, we introduce FSFP, a training strategy that can effectively optimize protein language models under extreme data scarcity for fitness prediction. By combining meta-transfer learning, learning to rank, and parameter-efficient fine-tuning, FSFP can significantly boost the few-shot learning performance of various protein language models using merely tens of labeled single-site mutants from the target protein. In silico benchmarks across 87 deep mutational scanning datasets demonstrate FSFP's superiority over both unsupervised and supervised baselines. Furthermore, we successfully apply FSFP to engineer the Phi29 DNA polymerase through wet-lab experiments, achieving a 25% increase in the positive rate. These results underscore the potential of our approach in aiding AI-guided protein engineering.
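Of the three ingredients named above, learning to rank is the easiest to illustrate: rather than regressing absolute fitness values, a ranking loss penalizes mis-ordered pairs of mutants. The sketch below is a generic pairwise logistic ranking loss on toy numbers, not FSFP's actual objective or model outputs.

```python
import numpy as np

# Hedged sketch of the learning-to-rank idea: penalize every pair of mutants
# whose predicted scores disagree with their measured fitness ordering.
def pairwise_ranking_loss(scores, fitness):
    # Logistic loss on the score difference for each fitness-ordered pair;
    # a correctly ordered pair with a wide margin contributes almost nothing.
    loss, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if fitness[i] > fitness[j]:
                loss += np.log1p(np.exp(-(scores[i] - scores[j])))
                n_pairs += 1
    return loss / max(n_pairs, 1)

fitness = np.array([0.1, 0.5, 0.9])                            # toy measured fitness
loss_correct = pairwise_ranking_loss(np.array([0.0, 1.0, 2.0]), fitness)
loss_reversed = pairwise_ranking_loss(np.array([2.0, 1.0, 0.0]), fitness)
```

Ranking losses suit the few-shot regime because with only tens of labeled mutants the relative ordering is usually more reliable, and more useful for selecting candidates, than calibrated absolute fitness values.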
Current Opinion in Structural Biology, Journal Year: 2024, Volume and Issue: 86, P. 102794 - 102794, Published: April 24, 2024

Engineering new protein molecules with desirable functions and properties has the potential to extend our ability to engineer proteins beyond what nature has so far evolved. Advances in the so-called 'de novo' design problem have recently been brought forward by developments in artificial intelligence. Generative architectures, such as language models and diffusion processes, seem adept at generating novel, yet realistic proteins that display desirable properties and perform specified functions. State-of-the-art design protocols now achieve experimental success rates nearing 20%, thus widening access to de novo designed proteins. Despite extensive progress, there are clear field-wide challenges, for example, in determining the best in silico metrics to prioritise designs for testing, and in designing proteins that can undergo large conformational changes or be regulated by post-translational modifications. With an increase in the number of models being developed, this review provides a framework to understand how these tools fit into the overall process of protein design. Throughout, we highlight the power of incorporating biochemical knowledge to improve performance and interpretability.
Cell Research, Journal Year: 2024, Volume and Issue: 34(9), P. 630 - 647, Published: July 5, 2024

Abstract: Mutations in amino acid sequences can provoke changes in protein function. Accurate and unsupervised prediction of mutation effects is critical in biotechnology and biomedicine, but remains a fundamental challenge. To resolve this challenge, here we present Protein Mutational Effect Predictor (ProMEP), a general, multiple-sequence-alignment-free method that enables zero-shot prediction of mutation effects. A multimodal deep representation learning model embedded in ProMEP was developed to comprehensively learn both sequence and structure contexts from ~160 million proteins. ProMEP achieves state-of-the-art performance in mutational effect prediction and accomplishes a tremendous improvement in speed, enabling efficient and intelligent protein engineering. Specifically, ProMEP accurately forecasts mutational consequences on the gene-editing enzymes TnpB and TadA, and successfully guides the development of high-performance gene-editing tools with their engineered variants. The editing efficiency of a 5-site TnpB mutant reaches up to 74.04% (vs 24.66% for the wild type); the base editing tool developed on the basis of a TadA 15-site mutant (in addition to the A106V/D108N double mutation that renders deoxyadenosine deaminase activity to TadA) exhibits an A-to-G conversion frequency of up to 77.27% (vs 69.80% for ABE8e, a previous TadA-based adenine base editor) with significantly reduced bystander and off-target effects compared to ABE8e. ProMEP not only showcases superior performance in predicting mutational effects on proteins but also demonstrates a great capability to guide protein engineering. Therefore, ProMEP enables efficient exploration of the gigantic mutant space and facilitates practical design of proteins, thereby advancing studies in biomedicine and synthetic biology.
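For context on what "zero-shot prediction of mutation effects" typically means in this literature, here is a generic sketch (illustrative only, not ProMEP's architecture): the mutation score is the log-likelihood ratio between the mutant and wild-type residue under a model's per-position amino-acid distribution. The probability matrix below is made up for demonstration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def mutation_score(probs, wt_seq, position, mut_aa):
    # probs: (sequence_length, 20) per-position probabilities over amino acids.
    # Positive score -> model prefers the mutant residue; negative -> wild type.
    wt_aa = wt_seq[position]
    return np.log(probs[position, AA_INDEX[mut_aa]]) - np.log(probs[position, AA_INDEX[wt_aa]])

probs = np.full((5, 20), 0.05)           # uniform: no preference anywhere
neutral = mutation_score(probs, "AAAAA", 2, "C")

probs_pref = probs.copy()
probs_pref[2, AA_INDEX["C"]] = 0.5       # model strongly favors C at position 2
favored = mutation_score(probs_pref, "AAAAA", 2, "C")
```

Because the score needs no labeled mutants, only a model's output distribution, it can be evaluated for every point mutation of a protein in a single pass, which is what makes large mutant spaces searchable.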
Nature Communications, Journal Year: 2024, Volume and Issue: 15(1), Published: Aug. 26, 2024

Annotating active sites in enzymes is crucial for advancing multiple fields including drug discovery, disease research, enzyme engineering, and synthetic biology. Despite the development of numerous automated annotation algorithms, a significant trade-off between speed and accuracy limits their large-scale practical applications. We introduce EasIFA, an enzyme active site annotation algorithm that fuses latent representations from a Protein Language Model and a 3D structural encoder, then aligns protein-level information with knowledge of enzymatic reactions using a multi-modal cross-attention framework. EasIFA outperforms BLASTp with a 10-fold speed increase and improved recall, precision, F1 score, and MCC by 7.57%, 13.08%, 9.68%, and 0.1012, respectively. It also surpasses empirical-rule-based algorithms and other state-of-the-art deep learning methods based on PSSM features, achieving speed increases ranging from 650 to 1400 times while enhancing annotation quality. This makes EasIFA a suitable replacement for conventional annotation tools in both industrial and academic settings. EasIFA can effectively transfer knowledge gained from coarsely annotated enzyme databases to smaller, high-precision datasets, highlighting its ability to model sparse but high-quality databases. Additionally, EasIFA shows potential as a catalytic site monitoring tool for designing enzymes with desired functions beyond the natural distribution.
Wang et al. propose EasIFA, an efficient enzyme active site annotation algorithm, to advance various fields.
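The multi-modal cross-attention fusion named in the EasIFA abstract can be sketched generically: per-residue protein features act as queries that attend over reaction-side features (keys and values). The sketch below is plain scaled dot-product attention with toy shapes and values, not EasIFA's actual implementation.

```python
import numpy as np

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each protein residue (query) forms a
    # softmax-weighted mixture of reaction features (values).
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)         # (n_residues, n_reaction_tokens)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over reaction tokens
    return weights @ values                        # one fused vector per residue

rng = np.random.default_rng(2)
protein_feats = rng.normal(size=(3, 4))            # 3 residues, feature dim 4
reaction_feats = rng.normal(size=(5, 4))           # 5 reaction tokens, dim 4
fused = cross_attention(protein_feats, reaction_feats, reaction_feats)
```

The fused per-residue vectors then feed a per-residue classifier (active site vs. not), which is how reaction knowledge can sharpen a site-level prediction.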