bioRxiv (Cold Spring Harbor Laboratory),
Journal Year: 2023,
Volume and Issue: unknown
Published: March 4, 2023
Abstract
In recent years, generative protein sequence models have been developed to sample novel sequences. However, predicting whether generated proteins will fold and function remains challenging. We evaluate computational metrics to assess the quality of enzyme sequences produced by three contrasting generative models: ancestral sequence reconstruction, a generative adversarial network, and a protein language model. Focusing on two enzyme families, we expressed and purified over 440 natural and generated sequences with 70-90% identity to the most similar natural sequences to benchmark computational metrics for predicting in vitro activity. Over several rounds of experiments, we developed a computational filter that improved experimental success rates to 44-100%. Surprisingly, neither sequence identity to natural sequences nor AlphaFold2 residue-confidence scores were predictive of enzyme activity. The proposed metrics will drive protein engineering research by serving as a benchmark and by helping to select active variants to test experimentally.
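The kind of computational filter described above can be pictured as a simple threshold rule over per-candidate metrics. The following is a minimal sketch; the metric names, thresholds, and candidate scores are hypothetical illustrations, not the filter from the paper.

```python
# Toy sketch of filtering generated enzyme candidates by computational
# metrics before experimental testing. All names and values are invented.

def passes_filter(candidate, min_identity=0.70, max_identity=0.90,
                  min_model_score=-2.0):
    """Keep candidates in a target identity band with an acceptable model score."""
    return (min_identity <= candidate["identity"] <= max_identity
            and candidate["model_score"] >= min_model_score)

candidates = [
    {"id": "gen_001", "identity": 0.82, "model_score": -1.1},
    {"id": "gen_002", "identity": 0.95, "model_score": -0.4},  # too close to natural
    {"id": "gen_003", "identity": 0.74, "model_score": -3.5},  # poor model score
]

selected = [c["id"] for c in candidates if passes_filter(c)]
print(selected)  # ['gen_001']
```

In practice the thresholds would be tuned against measured in vitro activity from earlier experimental rounds.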
Cell,
Journal Year: 2024,
Volume and Issue: 187(3), P. 526 - 544
Published: Feb. 1, 2024
Methods from artificial intelligence (AI) trained on large datasets of protein sequences and structures can now "write" proteins with new shapes and molecular functions de novo, without starting from proteins found in nature. In this Perspective, I will discuss the state of the field of de novo protein design at the juncture of physics-based modeling approaches and AI. New protein folds and higher-order assemblies can be designed with considerable experimental success rates, and difficult problems requiring tunable control over protein conformations and precise shape complementarity for molecular recognition are coming into reach. Emerging approaches incorporate engineering principles (tunability, controllability, and modularity) into the design process from the beginning. Exciting frontiers lie in deconstructing cellular functions with de novo proteins and, conversely, in constructing synthetic cellular signaling from the ground up. As methods improve, many more challenges remain to be solved.
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year: 2024,
Volume and Issue: unknown
Published: July 2, 2024
Abstract
More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained on tokens generated by evolution can act as evolutionary simulators to generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to biological alignment. We prompted ESM3 to generate fluorescent proteins with a chain of thought. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% identity) from known fluorescent proteins. Similarly distant natural fluorescent proteins are separated by over five hundred million years of evolution.
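A multimodal prompt of the kind described can be thought of as parallel "tracks" (sequence, structure, function) in which only some positions are specified and the rest are masked for the model to fill in. The sketch below is a hypothetical illustration of that idea; the token layout and field names are invented and are not the actual ESM3 interface.

```python
# Hypothetical illustration of a partially specified multimodal prompt:
# a few sequence positions are pinned, structure is left fully masked,
# and a function tag conditions the generation. Invented layout.

MASK = "_"

def make_prompt(length, sequence_hints, function_tag):
    """Build a prompt where only a few positions/tracks are specified."""
    seq_track = [MASK] * length
    for pos, aa in sequence_hints.items():
        seq_track[pos] = aa
    return {
        "sequence": "".join(seq_track),
        "structure": MASK * length,  # fully masked: model free to choose
        "function": function_tag,    # conditioning keyword
    }

prompt = make_prompt(8, {0: "M", 3: "G"}, function_tag="fluorescence")
print(prompt["sequence"])  # M__G____
```

A generative model would then complete all masked positions jointly, so that the filled-in sequence, structure, and function tracks remain consistent with one another.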
Abstract
Large-scale artificial intelligence (AI) models such as ChatGPT have the potential to improve performance on many benchmarks and real-world tasks. However, it is difficult to develop and maintain these models because of their complexity and resource requirements. As a result, they are still inaccessible to healthcare industries and clinicians. This situation might soon be changed by advancements in graphics processing unit (GPU) programming and parallel computing. More importantly, leveraging existing large-scale AI models such as GPT-4 and Med-PaLM and integrating them into multiagent systems (e.g., Visual-ChatGPT) will facilitate real-world implementations. This review aims to raise awareness of the potential applications of these models in healthcare. We provide a general overview of several advanced large-scale AI models, including language models, vision-language models, graph learning models, language-conditioned multimodal models, and embodied models. We discuss their medical applications, in addition to challenges and future directions. Importantly, we stress the need to align these models with human values and goals, for example by using reinforcement learning from human feedback, to ensure that they provide accurate and personalized insights that support human decision-making and improve healthcare outcomes.
ACS Central Science,
Journal Year: 2024,
Volume and Issue: 10(2), P. 226 - 241
Published: Feb. 5, 2024
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency, or even to unlock new activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering an enzyme starting point that has some desired activity, followed by directed evolution to improve its "fitness" for a desired application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting-point discovery, by functional annotation of known protein sequences or by generating novel sequences with desired functions, and (2) navigating protein fitness landscapes for fitness optimization, by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential to enable improved outcomes.
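The second ML contribution above, learning a sequence-to-fitness mapping and using it to rank candidate mutants, can be sketched with a deliberately trivial per-position "model". All sequences, fitness values, and the scoring scheme below are made up for illustration; a real workflow would use a proper regression model and far more data.

```python
# Toy sketch of ML-guided fitness-landscape navigation: fit a per-position
# scoring table on measured variants, then rank proposed point mutants by
# predicted fitness. Data and model are invented.

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

measured = {"MKV": 0.2, "MRV": 0.9, "MRL": 1.1}  # sequence -> measured fitness

def fit_position_weights(data):
    """Average the fitness of every (position, residue) pair seen in training."""
    sums, counts = {}, {}
    for seq, fitness in data.items():
        for i, aa in enumerate(seq):
            sums[(i, aa)] = sums.get((i, aa), 0.0) + fitness
            counts[(i, aa)] = counts.get((i, aa), 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

def predict(seq, weights):
    """Sum the learned per-position contributions (0 for unseen residues)."""
    return sum(weights.get((i, aa), 0.0) for i, aa in enumerate(seq))

weights = fit_position_weights(measured)
parent = "MRV"
mutants = {parent[:i] + aa + parent[i + 1:]
           for i in range(len(parent)) for aa in ALPHABET}
best = max(mutants, key=lambda s: predict(s, weights))
print(best)  # MRL
```

The loop would then repeat: synthesize and measure the top-ranked mutants, add them to the training data, and refit, which is the ML-augmented analogue of a round of directed evolution.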
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year: 2023,
Volume and Issue: unknown
Published: July 25, 2023
Abstract
Adapting large language models (LLMs) to protein sequences spawned the development of powerful protein language models (pLMs). Concurrently, AlphaFold2 broke through in protein structure prediction. Now we can systematically and comprehensively explore the dual nature of proteins that act and exist as three-dimensional (3D) machines and evolve as linear strings of one-dimensional (1D) sequences. Here, we leverage pLMs to simultaneously model both modalities by combining 1D sequences with 3D structure in a single model. We encode protein structures as token sequences using the 3Di-alphabet introduced by the 3D-alignment method Foldseek. This new foundation pLM extracts the features and patterns of the resulting “structure-sequence” representation. Toward this end, we built a non-redundant dataset from AlphaFoldDB and fine-tuned an existing pLM (ProtT5) to translate between 3Di and amino acid sequences. As a proof-of-concept for our novel approach, dubbed Protein structure-sequence T5 (ProstT5), we showed improved performance for subsequent prediction tasks, and for “inverse folding”, namely the generation of novel protein sequences adopting a given structural scaffold (“fold”). Our work showcased the potential of pLMs to tap into the information-rich revolution fueled by AlphaFold2. It paves the way to develop new tools integrating the vast resource of 3D structure predictions, and opens new research avenues in the post-AlphaFold2 era. ProstT5 is freely available for all at https://github.com/mheinzinger/ProstT5
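A bidirectional translation model like this needs to be told which direction to translate. A common T5-style convention, and the one the ProstT5 repository documents, is a direction prefix token plus a case convention (amino acids upper case, 3Di states lower case). The sketch below only formats inputs in that spirit; treat the exact tokens as illustrative and check the repository for the authoritative format.

```python
# Sketch of signaling translation direction to a sequence<->structure
# model with a prefix token, loosely following the ProstT5 convention
# (<AA2fold> / <fold2AA>, upper-case AAs, lower-case 3Di states).

def format_input(seq, direction):
    """Prefix the whitespace-separated sequence with a direction token."""
    if direction == "aa2fold":
        prefix, body = "<AA2fold>", seq.upper()
    elif direction == "fold2aa":
        prefix, body = "<fold2AA>", seq.lower()
    else:
        raise ValueError(f"unknown direction: {direction}")
    return prefix + " " + " ".join(body)

print(format_input("MKTV", "aa2fold"))  # <AA2fold> M K T V
print(format_input("DVVS", "fold2aa"))  # <fold2AA> d v v s
```

With inputs formatted this way, a single encoder-decoder model can be trained on both translation directions at once, which is what allows it to serve both structure prediction and inverse folding.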
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year: 2023,
Volume and Issue: unknown
Published: Sept. 12, 2023
Abstract
Deep generative models are increasingly powerful tools for the in silico design of novel proteins. Recently, a family of models called diffusion models has demonstrated the ability to generate biologically plausible proteins that are dissimilar to any actual proteins seen in nature, enabling unprecedented capability and control in de novo protein design. However, current state-of-the-art models generate protein structures, which limits the scope of their training data and restricts generations to a small and biased subset of protein space. Here, we introduce a general-purpose diffusion framework, EvoDiff, that combines evolutionary-scale data with the distinct conditioning capabilities of diffusion models for controllable protein generation in sequence space. EvoDiff generates high-fidelity, diverse, and structurally-plausible proteins that cover natural sequence and functional space. We show experimentally that EvoDiff generations express, fold, and exhibit expected secondary structure elements. Critically, EvoDiff can generate proteins inaccessible to structure-based models, such as those with disordered regions, while maintaining the ability to design scaffolds for functional structural motifs. We validate the universality of our sequence-based formulation by experimentally characterizing intrinsically-disordered mitochondrial targeting signals, metal-binding proteins, and protein binders designed using EvoDiff. We envision that EvoDiff will expand capabilities in protein engineering beyond the structure-function paradigm toward programmable, sequence-first design.
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year: 2023,
Volume and Issue: unknown
Published: July 6, 2023
Protein language models have shown remarkable success in learning biological information from protein sequences. However, most existing models are limited by either autoencoding or autoregressive pre-training objectives, which makes them struggle to handle protein understanding and generation tasks concurrently. We propose a unified protein language model, xTrimoPGLM, to address these two types of tasks simultaneously through an innovative pre-training framework. Our key technical contribution is an exploration of the compatibility and the potential for joint optimization of the two types of objectives, which has led to a strategy for training xTrimoPGLM at an unprecedented scale of 100 billion parameters and 1 trillion training tokens. Our extensive experiments reveal that 1) xTrimoPGLM significantly outperforms other advanced baselines in 18 protein understanding benchmarks across four categories. The model also facilitates an atomic-resolution view of protein structures, leading to a 3D structural prediction model that surpasses existing language model-based tools. 2) xTrimoPGLM not only can generate de novo protein sequences following the principles of natural ones, but also can perform programmable generation after supervised fine-tuning (SFT) on curated sequences. These results highlight the substantial capability and versatility of xTrimoPGLM in understanding and generating protein sequences, contributing to the evolving landscape of foundation models in protein science.
Science,
Journal Year: 2025,
Volume and Issue: unknown
Published: Jan. 16, 2025
More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained at scale on evolutionary data can generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to alignment to improve fidelity. We prompted ESM3 to generate fluorescent proteins. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% sequence identity) from known fluorescent proteins, which we estimate is equivalent to simulating five hundred million years of evolution.
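The distance figure quoted above is a pairwise sequence identity. Given two sequences already aligned to equal length, percent identity is just the fraction of positions that match. The sketch below shows that calculation on toy sequences; real figures like the 58% in the abstract come from proper alignments of full-length proteins.

```python
# Minimal sketch of pairwise percent identity over a pre-computed
# alignment: identical non-gap positions divided by aligned length.

def percent_identity(a, b):
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)

print(percent_identity("MKTAYIAK", "MKSAYLAK"))  # 75.0
```

Low identity to every known protein, combined with retained function, is what supports the claim that the generation lies far from the natural sequence space.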