bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown
Published: April 20, 2023
A protein's genetic architecture, the set of causal rules by which its sequence produces its functions, also determines its possible evolutionary trajectories. Prior research has proposed that the genetic architecture of proteins is very complex, with pervasive epistatic interactions that constrain evolution and make function difficult to predict from sequence. Most of this work has analyzed only the direct paths between two proteins of interest, excluding the vast majority of possible genotypes and evolutionary trajectories, and has considered only a single protein function, leaving unaddressed the genetic architecture of functional specificity and its impact on the evolution of new functions. Here we develop a method based on ordinal logistic regression to directly characterize the global genetic determinants of multiple protein functions from 20-state combinatorial deep mutational scanning (DMS) experiments. We use it to dissect the genetic architecture and evolution of a transcription factor's specificity for DNA, using data from a combinatorial DMS of an ancient steroid hormone receptor's capacity to activate transcription from biologically relevant DNA elements. We show that the genetic architecture of DNA recognition consists of a dense set of main and pairwise effects that involve virtually every possible amino acid state in the protein-DNA interface, but higher-order epistasis plays only a tiny role. Pairwise interactions enlarge the set of functional sequences and are the primary determinants of specificity for different DNA elements. They also massively expand the number of opportunities for single-residue mutations to switch specificity from one DNA target to another. By bringing variants with different functions close together in sequence space, pairwise epistasis therefore facilitates rather than constrains the evolution of new functions.
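As a rough illustration of the kind of model the abstract describes, the sketch below combines main and pairwise effects into a latent score and passes it through a proportional-odds (ordinal logistic) link to get probabilities for ordered functional categories. It is not the authors' implementation: the effect tables `MAIN` and `PAIR`, the cutpoints `CUTPOINTS`, and the three-category reading (for example null, weak, strong) are all hypothetical values chosen for the example.

```python
import math
from itertools import combinations

# Hypothetical toy parameters (not taken from the paper).
MAIN = {(0, "A"): 1.0, (1, "K"): -0.5}   # (site, state) -> main effect
PAIR = {(0, "A", 1, "K"): 2.0}           # (site_i, state_i, site_j, state_j) -> pairwise effect
CUTPOINTS = [-1.0, 1.0]                  # ordered thresholds separating 3 categories


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def latent_score(seq, main, pair):
    """Sum main effects over sites and pairwise effects over all site pairs."""
    score = sum(main.get((i, aa), 0.0) for i, aa in enumerate(seq))
    for (i, a), (j, b) in combinations(enumerate(seq), 2):
        score += pair.get((i, a, j, b), 0.0)
    return score


def category_probs(score, cutpoints):
    """Proportional-odds link: P(Y <= k) = sigmoid(cutpoint_k - score)."""
    cumulative = [sigmoid(c - score) for c in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs


if __name__ == "__main__":
    for seq in ("AK", "GK"):
        s = latent_score(seq, MAIN, PAIR)
        print(seq, s, category_probs(s, CUTPOINTS))
```

In this toy setup the pairwise term moves the variant `AK` to a much higher latent score than its main effects alone would give, which shifts probability toward the top category.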
ACS Central Science, Journal Year: 2024, Volume and Issue: 10(2), P. 226 - 241
Published: Feb. 5, 2024
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency, or even to unlock new catalytic activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering an enzyme starting point that has some level of the desired activity, followed by directed evolution to improve its "fitness" for a desired application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting point discovery by functional annotation of known protein sequences or generating novel protein sequences with desired functions and (2) navigating protein fitness landscapes for fitness optimization by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential to unlock improved engineering outcomes.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: April 29, 2024
Abstract
Missense variants that change the amino acid sequences of proteins cause one third of human genetic diseases [1]. Tens of millions of missense variants exist in the current human population, with the vast majority having unknown functional consequences. Here we present the first large-scale experimental analysis of human missense variants across many different proteins. Using DNA synthesis and cellular selection experiments, we quantify the impact of >500,000 variants on the abundance of >500 human protein domains. This dataset, Human Domainome 1, reveals that >60% of pathogenic missense variants reduce protein stability. The contribution of stability to protein fitness varies across proteins and diseases, and is particularly important in recessive disorders. Combining stability measurements with protein language models annotates functional sites across proteins. Mutational effects on stability are largely conserved in homologous domains, allowing accurate stability prediction across entire protein families using energy models. Domainome 1 demonstrates the feasibility of assaying human protein variants at scale and provides a large consistent reference dataset for clinical variant interpretation and for training and benchmarking computational methods.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: Jan. 24, 2024
Abstract
The massively parallel nature of deep mutational scanning (DMS) allows the quantification of the phenotypic effects of thousands of perturbations in a single experiment. We have developed MoCHI, a software tool that allows the parameterisation of arbitrarily complex models using DMS data. MoCHI simplifies the task of building custom models from measurements of mutant effects on any number of phenotypes. It allows the inference of free energy changes, as well as pairwise and higher-order interaction terms (energetic couplings), for specified biophysical models. When a suitable user-specified mechanistic model is not available, global nonlinearities (epistasis) can be estimated directly from the data. MoCHI also builds upon and leverages the theory of ensemble (or background-averaged) epistasis to learn sparse predictive models that can incorporate higher-order epistatic terms and are informative of the genetic architecture of the underlying biological system. The combination of DMS and MoCHI allows biophysical measurements to be performed at scale, including the construction of complete allosteric maps of proteins. MoCHI is freely available (https://github.com/lehner-lab/MoCHI) and implemented as an easy-to-use python package relying on the PyTorch machine learning framework.
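To make the idea of inferring free energy changes under a specified biophysical model more concrete, here is a minimal sketch of a two-state folding model in which mutational effects add in free energy and the observable (fraction folded) is a nonlinear function of the total. This does not use or reproduce MoCHI's API: the function name `fraction_folded`, the temperature, and all energy values are assumptions made for the example.

```python
import math

# Assumed constants: gas constant in kcal/(mol*K) and a temperature of 303 K.
R_KCAL = 0.001987
TEMPERATURE_K = 303.0
RT = R_KCAL * TEMPERATURE_K


def fraction_folded(dg_wt, ddgs, couplings=()):
    """Two-state model: total folding free energy is the wild-type value
    plus additive mutation effects plus optional pairwise coupling terms.
    Positive free energy here means destabilised (less folded)."""
    dg_total = dg_wt + sum(ddgs) + sum(couplings)
    return 1.0 / (1.0 + math.exp(dg_total / RT))


if __name__ == "__main__":
    wild_type = fraction_folded(-2.0, [])
    single = fraction_folded(-2.0, [1.0])
    double = fraction_folded(-2.0, [1.0, 1.0])
    print(wild_type, single, double)
```

Because the energies add but the fraction folded saturates, two mutations with identical free energy effects can have very different effects on the measured phenotype, which is the kind of global nonlinearity the abstract refers to.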
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: April 17, 2024
Abstract
Cells have evolved mechanisms to distribute ∼10 billion protein molecules to subcellular compartments where diverse proteins involved in shared functions must efficiently assemble. Here, we demonstrate that proteins with shared functions share amino acid sequence codes that guide them to compartment destinations. A protein language model, ProtGPS, was developed that predicts with high performance the compartment localization of human proteins excluded from the training set. ProtGPS successfully guided generation of novel protein sequences that selectively assemble in targeted subcellular compartments. ProtGPS also identified pathological mutations that change this code and lead to altered subcellular localization of proteins. Our results indicate that protein sequences contain not only a folding code, but also a previously unrecognized code governing their distribution to specific cellular compartments.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: May 12, 2024
Abstract
Protein folding is driven by the burial of hydrophobic amino acids in a tightly-packed core that excludes water. The genetics, biophysics and evolution of hydrophobic cores are not well understood, in part because of a lack of systematic experimental data on sequence combinations that do, and do not, constitute stable and functional cores. Here we randomize protein hydrophobic cores and evaluate their stability and function at scale. The data show that vast numbers of amino acid combinations can constitute stable protein cores but that these alternative cores frequently disrupt protein function because of allosteric effects. These strong allosteric effects are not due to complicated, highly epistatic fitness landscapes but rather to the pervasive nature of allostery, with many individually small energy changes combining to disrupt function. Indeed, both protein stability and ligand binding can be accurately predicted over very large evolutionary distances using additive energy models with a small contribution from pairwise energetic couplings. As a result, energy models trained on one protein can accurately predict core stability across hundreds of millions of years of evolution, with only rare energetic couplings that we experimentally identify limiting the transplantation of cores between diverged proteins. Our results reveal the simple energetic architecture of protein hydrophobic cores and suggest that allostery is a major constraint on sequence evolution.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: Jan. 30, 2024
ABSTRACT
Antibody therapeutic candidates must exhibit not only tight binding to their target but also good developability properties, especially low risk of immunogenicity. In this work, we fit a simple generative model, SAM, to sixty million human heavy and seventy million human light chains. We show that the probability of a sequence calculated by the model distinguishes human sequences from other species with the same or better accuracy on a variety of benchmark datasets containing >400 million sequences than any other model in the literature, outperforming large language models (LLMs) by large margins. SAM can humanize sequences, generate new sequences, and score sequences for humanness. It is both fast and fully interpretable. Our results highlight the importance of using simple models as baselines for protein engineering tasks. We additionally introduce a new tool for numbering antibody sequences which is orders of magnitude faster than existing tools in the literature. Both these tools are available at https://github.com/Wang-lab-UCSD/AntPack.