bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2023,
Issue: unknown
Published: Oct. 11, 2023
Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC, OMIM), experimental functional assays (DMS, DepMap), and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and non-coding variants.
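Models of this kind typically score a variant by comparing the model's likelihood of the reference and alternate alleles at that position. Below is a minimal sketch of that log-likelihood-ratio scoring rule; the `p_nucleotide` function is a hypothetical stand-in for real model inference, and its hard-coded distribution is purely illustrative.

```python
import math

# Hypothetical stand-in for a trained model's per-position output:
# a probability distribution over the four bases at a genomic position.
def p_nucleotide(position):
    # In a real system this would come from model inference over the
    # alignment context; here one position is hard-coded for illustration.
    return {"A": 0.90, "C": 0.04, "G": 0.04, "T": 0.02}

def variant_score(position, ref, alt):
    """Log-likelihood ratio of alt vs. ref; more negative = more deleterious."""
    probs = p_nucleotide(position)
    return math.log(probs[alt]) - math.log(probs[ref])

score = variant_score(12345, ref="A", alt="T")
print(round(score, 3))  # -3.807: the model strongly disfavors the alt allele
```

The sign convention matters: a conserved position gives the reference allele high probability, so any substitution there receives a strongly negative score.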
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024,
Issue: unknown
Published: July 2, 2024
Abstract
More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained on tokens generated by evolution can act as evolutionary simulators to generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to biological alignment. We prompted ESM3 to generate fluorescent proteins with a chain of thought. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% sequence identity) from known fluorescent proteins. Similarly distant natural fluorescent proteins are separated by over five hundred million years of evolution.
The genome is a sequence that encodes the DNA, RNA, and proteins that orchestrate an organism's function. We present Evo, a long-context genomic foundation model with a frontier architecture trained on millions of prokaryotic and phage genomes, and report scaling laws on DNA to complement observations in language and vision. Evo generalizes across DNA, RNA, and proteins, enabling zero-shot function prediction competitive with domain-specific language models and the generation of functional CRISPR-Cas and transposon systems, representing the first examples of protein-RNA and protein-DNA codesign with a language model. Evo also learns how small mutations affect whole-organism fitness and generates megabase-scale sequences with plausible genomic architecture. These capabilities span molecular to genome scales of complexity, advancing our understanding and control of biology.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024,
Issue: unknown
Published: Feb. 27, 2024
The genome is a sequence that completely encodes the DNA, RNA, and proteins that orchestrate the function of a whole organism. Advances in machine learning combined with massive datasets of whole genomes could enable a biological foundation model that accelerates the mechanistic understanding and generative design of complex molecular interactions. We report Evo, a genomic foundation model that enables prediction and generation tasks from the molecular to the genome scale. Using an architecture based on advances in deep signal processing, we scale Evo to 7 billion parameters with a context length of 131 kilobases (kb) at single-nucleotide, byte resolution. Trained on whole prokaryotic genomes, Evo can generalize across the three fundamental modalities of the central dogma of molecular biology to perform zero-shot function prediction that is competitive with, or outperforms, leading domain-specific language models. Evo also excels at multi-element generation tasks, which we demonstrate by generating synthetic CRISPR-Cas molecular complexes and entire transposable systems for the first time. Using information learned over whole genomes, Evo can predict gene essentiality at nucleotide resolution and can generate coding-rich sequences up to 650 kb in length, orders of magnitude longer than previous methods. Evo's multi-modal and multi-scale learning capability provides a promising path toward improving our understanding and control of biology across multiple levels of complexity.
More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained at scale on evolutionary data can generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to alignment to improve fidelity. We prompted ESM3 to generate fluorescent proteins. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% sequence identity) from known fluorescent proteins, which we estimate is equivalent to simulating five hundred million years of evolution.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024,
Issue: unknown
Published: March 12, 2024
Abstract
Protein language models (pLMs) trained on large protein sequence databases have been used to understand disease and design novel proteins. In these tasks, the likelihood of a sequence under a pLM is often used as a proxy for protein fitness, so it is critical to understand what signals the likelihoods capture. In this work we find that pLM likelihoods unintentionally encode a species bias: likelihoods of sequences from certain species are systematically higher, independent of the protein in question. We quantify this bias and show that it arises in part because of unequal species representation in popular sequence databases. We further show that the bias can be detrimental for some protein design applications, such as enhancing thermostability. These results highlight the importance of understanding and curating pLM training data to mitigate biases and improve capabilities in under-explored parts of sequence space.
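The bias described above can be quantified by grouping per-sequence model likelihoods by source species and comparing the group means. A toy sketch of that measurement follows; the `scores` values are synthetic illustrations, not real pLM outputs or figures from the paper.

```python
from collections import defaultdict
from statistics import mean

# Synthetic per-sequence pLM log-likelihoods, labeled by source species
# (illustrative numbers only, not real model outputs).
scores = [
    ("E. coli", -1.9), ("E. coli", -2.0), ("E. coli", -2.1),
    ("H. sapiens", -2.4), ("H. sapiens", -2.5), ("H. sapiens", -2.6),
]

by_species = defaultdict(list)
for species, ll in scores:
    by_species[species].append(ll)

means = {sp: mean(lls) for sp, lls in by_species.items()}

# A systematic gap between species, independent of the proteins themselves,
# is the signature of the bias the abstract describes.
gap = means["E. coli"] - means["H. sapiens"]
print(round(gap, 2))  # 0.5
```

In practice one would control for protein family before attributing such a gap to species identity rather than to the proteins sampled.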
Directed protein evolution is central to biomedical applications but faces challenges like experimental complexity, inefficient multi-property optimization, and local maxima traps. While in silico methods using protein language models (PLMs) can provide guidance from a modeled fitness landscape, they struggle to generalize across diverse protein families and to map sequence to activity. We present EVOLVEpro, a few-shot active learning framework that combines PLMs and regression models to rapidly improve protein activity. EVOLVEpro surpasses current methods, yielding up to 100-fold improvements in desired properties. We demonstrate its effectiveness across six proteins in RNA production, genome editing, and antibody binding applications. These results highlight the advantages of active learning with minimal experimental data over zero-shot predictions. EVOLVEpro opens new possibilities for AI-guided protein engineering in biology and medicine.
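A few-shot active-learning loop of this general shape can be sketched as: embed variants, fit a small regressor on the handful of measured variants, propose the top-ranked unmeasured variants, and repeat. The sketch below uses synthetic embeddings, a noise-free linear ground truth, and a least-squares regressor; these are illustrative stand-ins, not EVOLVEpro's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "PLM embeddings" for 50 variants and a hidden ground-truth
# fitness (placeholders for real embeddings and wet-lab measurements).
X = rng.normal(size=(50, 3))
w_true = rng.normal(size=3)
fitness = X @ w_true  # what the assay would measure

measured = list(range(4))  # start from a few measured variants
for _ in range(3):  # a few rounds of evolution
    # Fit a simple least-squares regressor on the measured variants only.
    w, *_ = np.linalg.lstsq(X[measured], fitness[measured], rcond=None)
    preds = X @ w
    # Propose the top-2 unmeasured variants predicted to be fittest.
    ranked = [i for i in np.argsort(preds)[::-1] if i not in measured]
    measured.extend(ranked[:2])

# With only 10 of 50 variants ever measured, the loop recovers the best one.
print(np.isclose(max(fitness[measured]), fitness.max()))  # True
```

The toy landscape is exactly linear, so the regressor becomes perfect after one round; real campaigns face noise and nonlinearity, which is why multiple rounds and richer regressors matter.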
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024,
Issue: unknown
Published: Feb. 29, 2024
Abstract
Proteins serve as the workhorses of living organisms, orchestrating a wide array of vital functions. Post-translational modifications (PTMs) of their amino acids greatly influence the structural and functional diversity of different protein types and uphold proteostasis, allowing cells to swiftly respond to environmental changes and intricately regulate complex biological processes. To this point, efforts to model the features of proteins have involved training large, expressive protein language models (pLMs) such as ESM-2 and ProtT5, which accurately encode the structural, functional, and physicochemical properties of input sequences. However, the over 200 million sequences that these pLMs were trained on merely scratch the surface of proteomic diversity, as they neither represent nor account for the effects of PTMs. In this work, we fill this major gap in protein sequence modeling by introducing PTM tokens into the pLM training regime. We then leverage recent advancements in structured state space models (SSMs), specifically Mamba, which utilizes efficient hardware-aware primitives to overcome the quadratic time complexities of Transformers. After adding a comprehensive set of PTM tokens to the vocabulary, we train bidirectional Mamba blocks whose outputs are fused with state-of-the-art ESM-2 embeddings via a novel gating mechanism. We demonstrate that our resultant PTM-aware pLM, PTM-Mamba, improves upon ESM-2's performance on various PTM-specific tasks. PTM-Mamba is the first and only pLM that can uniquely represent both wild-type and PTM sequences, motivating downstream design applications specific to post-translationally modified proteins. To facilitate such applications, PTM-Mamba is made available at: https://huggingface.co/ChatterjeeLab/PTM-Mamba.
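A gating fusion of the kind described above can be illustrated in a few lines: a gate computed from both embedding streams interpolates between them elementwise. The numpy sketch below uses random placeholder weights and an arbitrary dimension, not PTM-Mamba's learned parameters or architecture details.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 8  # embedding dimension (placeholder)

h_mamba = rng.normal(size=d)  # stand-in for a Mamba-block output
h_esm2 = rng.normal(size=d)   # stand-in for an ESM-2 embedding

# Gate computed from both streams; these weights are random placeholders,
# whereas in the model they would be learned end to end.
W = rng.normal(size=(d, 2 * d))
g = sigmoid(W @ np.concatenate([h_mamba, h_esm2]))

# Elementwise interpolation: gate values near 1 favor the PTM-aware stream.
fused = g * h_mamba + (1.0 - g) * h_esm2
print(fused.shape)  # (8,)
```

Because the gate is a sigmoid, each fused coordinate is a convex combination of the two inputs, so the fusion can fall back to the pretrained ESM-2 embedding wherever the PTM-aware signal is uninformative.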
Nature Communications,
Journal year: 2024,
Issue: 15(1)
Published: July 29, 2024
Abstract
The effective design of combinatorial libraries to balance fitness and diversity facilitates the engineering of useful enzyme functions, particularly those that are poorly characterized or unknown in biology. We introduce MODIFY, a machine learning (ML) algorithm that learns from natural protein sequences to infer evolutionarily plausible mutations and predict enzyme fitness. MODIFY co-optimizes the predicted fitness and sequence diversity of starting libraries, prioritizing high-fitness variants while ensuring broad sequence coverage. In silico evaluation shows that MODIFY outperforms state-of-the-art unsupervised methods in zero-shot fitness prediction and enables ML-guided directed evolution with enhanced efficiency. Using MODIFY, we engineer generalist biocatalysts derived from a thermostable cytochrome c to achieve enantioselective C-B and C-Si bond formation via a new-to-nature carbene transfer mechanism, leading to biocatalysts six mutations away from previously developed enzymes while exhibiting superior or comparable activities. These results demonstrate MODIFY's potential in solving challenging protein engineering problems beyond the reach of classic directed evolution.
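Co-optimizing fitness and diversity can be sketched as a greedy selection that scores each candidate by predicted fitness plus a bonus for distance to the library chosen so far. The toy sequences, fitness values, and weighting below are illustrative, not MODIFY's actual objective.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

# Toy candidate variants with illustrative predicted fitness values.
candidates = {
    "AAAA": 0.9, "AAAT": 0.8, "TTTT": 0.6,
    "AATT": 0.7, "TTTA": 0.5, "ATAT": 0.4,
}

def select_library(candidates, k, alpha=0.2):
    """Greedily pick k variants balancing fitness and diversity."""
    library = []
    pool = dict(candidates)
    while len(library) < k and pool:
        def score(seq):
            fit = pool[seq]
            # Diversity bonus: distance to the closest already-chosen member.
            div = min((hamming(seq, s) for s in library), default=0)
            return fit + alpha * div
        best = max(pool, key=score)
        library.append(best)
        del pool[best]
    return library

lib = select_library(candidates, k=3)
print(lib)  # → ['AAAA', 'TTTT', 'AATT']
```

Note how the diversity bonus pulls in "TTTT" despite its modest fitness: a pure fitness ranking would have chosen the three near-identical A-rich variants instead.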