bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2023, Issue: unknown
Published: Oct. 11, 2023
Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC, OMIM), experimental functional assays (DMS, DepMap), and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and non-coding variants.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown
Published: Aug. 3, 2024
Generative models for protein design trained on experimentally determined structures have proven useful for a variety of design tasks. However, such methods are limited by the quantity and diversity of the structures used for training, which represent a small, biased fraction of protein space. Here, we describe proseLM, a method for protein sequence design based on adaptation of protein language models to incorporate structural and functional context. We show that proseLM benefits from the scaling trends of the underlying language models, and that the addition of non-protein context (nucleic acids, ligands, and ions) improves recovery of native residues during design by 4-5% across model scales. These improvements are most pronounced for residues that directly interface with non-protein context, which are faithfully recovered at rates >70% by the most capable proseLM models. We experimentally validated proseLM by optimizing the editing efficiency of genome editors in human cells, achieving a 50% increase in base editing activity, and by redesigning therapeutic antibodies, resulting in a PD-1 binder with 2.2 nM affinity.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown
Published: April 8, 2024
Engineering enzyme biocatalysts for higher efficiency is key to enabling sustainable, ‘green’ production processes for the chemical and pharmaceutical industry. This challenge can be tackled from two angles: by directed evolution, based on labor-intensive experimental testing of variant libraries, or by computational methods, where sequence-function data are used to predict biocatalyst improvements. Here, we combine both approaches into a two-week workflow, where ultra-high throughput screening of a library of imine reductases (IREDs) in microfluidic devices provides not only selected ‘hits’, but also long-read sequence data linked to fitness scores of >17 thousand variants. We demonstrate engineering of an IRED for chiral amine synthesis by mapping functional information in one go, ready to be used for interpretation and extrapolation by protein engineers with the help of machine learning (ML). We calculate position-dependent mutability and combinability of mutations and comprehensively illuminate a complex interplay of mutations driven by synergistic, often positively epistatic effects. Interpreted by easy-to-use regression and tree-based ML algorithms designed to suit the evaluation of random whole-gene mutagenesis data, 3-fold improved ‘hits’ obtained from experimental screening are extrapolated further to give up to 23-fold improvements in catalytic rate after testing a handful of mutants. Our campaign is paradigmatic for future enzyme engineering that will rely on access to large sequence-function maps as profiles of the way a biocatalyst responds to mutation. These maps will chart the way to improved function by exploiting the synergy of rapid experimental screening combined with ML evaluation and extrapolation.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown
Published: July 18, 2024
Directed evolution of proteins is critical for applications in basic biological research, therapeutics, diagnostics, and sustainability. However, directed evolution methods are labor intensive, cannot efficiently optimize over multiple protein properties, and are often trapped by local maxima.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2023, Issue: unknown
Published: Oct. 6, 2023
Amino acid insertions and deletions (indels) are an abundant class of genetic variants. However, compared to substitutions, the effects of indels on protein stability are not well understood and are poorly predicted. To better understand indels, here we analyze new and existing large-scale deep indel mutagenesis (DIM) of structurally diverse proteins. The effects of indels on protein stability vary extensively among and within proteins and are not well predicted by existing computational methods. To address this shortcoming we present INDELi, a series of models that combine experimental or predicted substitution effects and secondary structure information to provide good prediction of the effects of indels on both protein stability and pathogenicity. Moreover, quantifying the effects of indels on protein-protein interactions suggests that insertions can be an important class of gain-of-function variants. Our results provide an overview of the impact of indels on proteins and a method to predict their effects genome-wide.