bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024,
Issue: unknown
Published: Nov. 15, 2024
Abstract
Accurate prediction of immune protein structures is crucial for understanding the immune system and advancing immunotherapy development. While deep learning methods have significantly advanced protein structure prediction by extracting evolutionary constraints from homologous sequences of a target protein, they struggle with immune proteins due to the limited number of known structures and the lack of evolutionary constraints in hypervariable regions. To address this challenge, we propose ImmuneFold, a transfer learning approach that fine-tunes ESMFold specifically for immune proteins. We leverage low-rank adaptation (LoRA), a parameter-efficient fine-tuning technique that requires considerably less memory and substantially fewer trainable parameters. Evaluations on various immune proteins, including T-cell receptors, antibodies, and nanobodies, demonstrate that ImmuneFold outperforms existing methods in structure prediction accuracy. Furthermore, we apply ImmuneFold to develop a zero-shot protocol for TCR-epitope binding prediction. Unlike previous supervised methods, which suffer from severe overfitting to limited experimental data, our protocol first predicts the structure of the TCR-epitope complex using ImmuneFold and then directly estimates binding affinity by calculating the Rosetta energy. Evaluations on binding datasets suggest that our method is robust and accurate in predicting TCR-epitope binding. In summary, ImmuneFold demonstrates accurate predictions of immune protein structures and TCR-epitope binding, highlighting its potential to advance the development of immunotherapies.
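The LoRA idea used by ImmuneFold (and by several of the other methods below) can be illustrated with a minimal NumPy sketch: a frozen pretrained weight matrix is augmented by a trainable low-rank product, so only a small fraction of parameters is updated during fine-tuning. This is a toy illustration of the general technique, not ImmuneFold's or ESMFold's actual code; the dimensions, rank, and scaling are arbitrary choices.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: y = x @ (W + (alpha/r) * A @ B).

    W is the frozen pretrained weight; only the low-rank factors
    A (d_in x r) and B (r x d_out) would be trained.
    """

    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                               # frozen, shape (d_in, d_out)
        d_in, d_out = W.shape
        self.A = rng.normal(0, 0.01, (d_in, r))  # trainable
        self.B = np.zeros((r, d_out))            # trainable, zero-initialized
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.W + self.scale * (x @ self.A @ self.B)

    def n_trainable(self):
        return self.A.size + self.B.size

d = 1024
W = np.random.default_rng(1).normal(size=(d, d))
layer = LoRALinear(W, r=8)
x = np.ones((2, d))

# B is zero-initialized, so before training the adapter is a no-op.
assert np.allclose(layer.forward(x), x @ W)
# 2*d*r trainable parameters vs. d*d frozen ones.
print(layer.n_trainable(), W.size)
```

With d = 1024 and r = 8, the adapter holds 16,384 trainable parameters against the 1,048,576 frozen weights, which is where the memory savings the abstract mentions come from.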
Journal of Chemical Information and Modeling,
Journal year: 2025,
Issue: unknown
Published: March 17, 2025
Accurately predicting protein–ligand interactions and enzymatic kinetics remains a challenge for computational biology. Here, we present SELFprot, a suite of modular transformer-based machine learning architectures that leverage the ESM2–35M model architecture with protein sequence and small-molecule embeddings to improve predictions of complex biochemical interactions. SELFprot employs multitask learning and parameter-efficient fine-tuning through low-rank adaptation, allowing adaptive, data-driven refinement. Furthermore, ensemble techniques are used to enhance robustness and reduce prediction variance. Evaluated on the BindingDB and CatPred-DB data sets, SELFprot achieves competitive performance, with notable improvements in predicting kcat, Km, Ki, Kd, IC50, and EC50 values as well as in the classification of functional site residues. With accuracy comparable to existing models and an order of magnitude fewer parameters, SELFprot demonstrates versatility and efficiency, making it a valuable tool for interaction studies and bioengineering.
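The variance reduction that SELFprot's abstract attributes to ensembling can be seen in a toy simulation: averaging several unbiased predictors with independent noise divides the error variance by roughly the ensemble size. The predictors below are synthetic stand-ins, not SELFprot models.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=1000)

# Simulate 10 ensemble members, each an unbiased predictor whose
# errors are independent noise (a stand-in for ensembled model heads).
members = [y_true + rng.normal(0, 1.0, size=y_true.shape) for _ in range(10)]

mse_single = np.mean((members[0] - y_true) ** 2)
y_ens = np.mean(members, axis=0)           # ensemble = average prediction
mse_ens = np.mean((y_ens - y_true) ** 2)

# With independent errors, averaging K members cuts error variance ~K-fold.
print(round(float(mse_single), 2), round(float(mse_ens), 2))
assert mse_ens < mse_single
```

In practice the members' errors are correlated, so the gain is smaller than this idealized K-fold reduction, but the direction of the effect is the same.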
Coiled coils are a common protein structural motif involved in cellular functions ranging from mediating protein-protein interactions to facilitating processes such as signal transduction or regulation of gene expression. They are formed by two or more alpha helices that wind around a central axis to form a buried hydrophobic core. Various forms of coiled-coil bundles have been reported, each characterized by the number, orientation, and degree of winding of the constituent helices. This variability is underpinned by short sequence repeats whose properties determine both the overall topology and the local geometry of coiled coils. The strikingly repetitive nature of these sequences has enabled the development of accurate sequence-based prediction methods; however, modeling coiled-coil domains remains a challenging task. In this work, we evaluated the accuracy of AlphaFold2 in modeling coiled-coil domains and in predicting their global topological properties. Furthermore, we show that prediction of the oligomeric state of coiled coils can be achieved using the internal representations of AlphaFold2, with performance better than any previous state-of-the-art method (code available at https://github.com/labstructbioinf/dc2_oligo).
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024,
Issue: unknown
Published: Aug. 26, 2024
Abstract
Protein language models (pLMs) have traditionally been trained in an unsupervised manner on large protein sequence databases under an autoregressive or masked-language modeling paradigm. Recent methods have attempted to enhance pLMs by integrating additional information in the form of text; these are referred to as "text+protein" language models (tpLMs). We evaluate and compare six tpLMs (OntoProtein, ProteinDT, ProtST, ProteinCLIP, ProTrek, and ESM3) against ESM2, a baseline text-free pLM, across a set of downstream tasks designed to assess their learned representations. We find that while tpLMs outperform ESM2 in five of the benchmarks, no single tpLM was consistently best. Thus, we additionally investigate the potential of embedding fusion, exploring whether combinations of tpLM embeddings can improve performance on the benchmarks by exploiting the strengths of multiple tpLMs. We find that fused embeddings can outperform the best single embedding, highlighting embedding fusion as a useful strategy in the field of machine learning for proteins. To facilitate its practical application, we outline a heuristic framework to efficiently identify the optimal combination of embeddings, reducing the exponential time complexity of an exhaustive search down to a manageable linear complexity. Using our fusion framework, we achieve state-of-the-art performances on protein-protein interaction prediction and homologous sequence recovery without any task-specific model adjustments or hyperparameter tuning. Our experiments suggest embedding fusion is a useful tool in the machine-learning-for-proteins toolbox. Lastly, this study highlights future research strategies for maximizing the utility of pLMs.
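A linear-complexity subset search like the one this abstract describes can be realized in several ways; one common scheme, sketched below with a hypothetical score table and model names taken from the abstract, ranks embedding sources by their individual score and then sweeps once through the ranking, keeping a source only if fusing it in improves performance. The paper's exact heuristic may differ from this sketch.

```python
def linear_fusion(embeddings, score):
    """Heuristic embedding-subset search in a linear number of evaluations.

    Rank sources by individual score, then sweep once through the
    ranking, keeping each source only if adding it improves the fused
    score: about 2n calls to `score` instead of 2^n subsets.
    """
    ranked = sorted(embeddings, key=lambda e: score([e]), reverse=True)
    chosen, best = [], float("-inf")
    for e in ranked:
        s = score(chosen + [e])
        if s > best:
            chosen, best = chosen + [e], s
    return chosen, best

# Hypothetical validation scores: ESM3 and ProtST fuse well here,
# while adding a third source does not help. Purely illustrative.
table = {
    frozenset(["ESM3"]): 0.80,
    frozenset(["ProtST"]): 0.78,
    frozenset(["OntoProtein"]): 0.70,
    frozenset(["ESM3", "ProtST"]): 0.86,
}
def score(subset):
    return table.get(frozenset(subset), 0.5)

best_set, best_score = linear_fusion(["ESM3", "ProtST", "OntoProtein"], score)
print(best_set, best_score)
```

On this toy table the sweep selects ESM3, then keeps ProtST (0.86 > 0.80), then rejects OntoProtein, touching far fewer subsets than the 2^n an exhaustive search would evaluate.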
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024,
Issue: unknown
Published: Oct. 25, 2024
Abstract
Protein language models have shown great promise in predicting protein structure, function, and the effects of missense variants on fitness. However, their use has been limited by the substantial computational resources required. In this work, we focus on improving the efficiency of protein language models (PLMs), specifically the Evolutionary Scale Modeling (ESM) family, to increase the accessibility of PLMs. By implementing optimizations such as FlashAttention and Partition-Attention, a novel technique designed to handle proteins of variable length, we achieved a 16-fold speedup in inference time and reduced memory usage 3 to 14 times for long proteins. Additionally, 4-bit quantization applied to the billion-parameter models led to a 2-fold reduction in memory consumption with minimal performance loss on the variant effect prediction task. Training was also improved, with a 6-fold reduction in runtime through activation checkpointing and the DeepSpeed Zero-Offload strategy. For fine-tuning, we employed parameter-efficient methods, enabling state-of-the-art predictions of protein properties and functions while training only the model head or a small fraction of adapter weights. For instance, we obtained a Spearman's correlation coefficient of 70% for melting point prediction and an 87% area under the precision-recall curve (AU-PRC) for transcription factor prediction. Our efficient ESM (ESME) implementation significantly lowers the barrier to using these powerful models, making them accessible to academic laboratories with limited resources. ESME is available on PyPI (pypi.org/project/esm-efficient) and GitHub (github.com/uci-cbcl/esm-efficient).
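As a rough illustration of the 4-bit quantization idea, the sketch below implements per-tensor absmax quantization with two 4-bit codes packed per byte. This is a generic scheme, not necessarily the one ESME uses; its memory cut versus float32 is 8x, while the 2-fold figure the abstract reports is relative to a 16-bit baseline plus overhead.

```python
import numpy as np

def quantize_4bit(w):
    """Absmax 4-bit quantization: map floats to 15 signed levels [-7, 7]
    scaled by the tensor's max magnitude, two codes packed per byte."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    u = (q + 8).astype(np.uint8)               # shift codes into [1, 15]
    packed = (u[0::2] << 4) | u[1::2]          # pack two codes per byte
    return packed, scale

def dequantize_4bit(packed, scale, n):
    hi = (packed >> 4).astype(np.int8) - 8     # unpack high nibbles
    lo = (packed & 0x0F).astype(np.int8) - 8   # unpack low nibbles
    q = np.empty(n, dtype=np.int8)
    q[0::2], q[1::2] = hi, lo
    return q.astype(np.float32) * scale

w = np.random.default_rng(0).normal(0, 0.02, size=4096).astype(np.float32)
packed, scale = quantize_4bit(w)
w_hat = dequantize_4bit(packed, scale, w.size)

assert packed.nbytes == w.nbytes // 8          # 4 bits vs. 32 bits per weight
assert np.max(np.abs(w - w_hat)) <= 0.5 * scale + 1e-6  # bounded rounding error
```

Production schemes typically quantize per block rather than per tensor and keep outlier channels in higher precision, which is how large models retain accuracy on downstream tasks.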
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024,
Issue: unknown
Published: Oct. 29, 2024
Abstract
Understanding T-cell receptor (TCR) and epitope interactions is critical for advancing our knowledge of the human immune system. Traditional approaches that use sequence similarity or structure data often struggle to scale and generalize across diverse TCR/epitope interactions. To address these limitations, we introduce ImmuneCLIP, a contrastive fine-tuning method that leverages pre-trained protein language models to align TCR and epitope embeddings in a shared latent space. ImmuneCLIP is evaluated on ranking and binding prediction tasks, where it consistently outperforms sequence-similarity based methods and existing deep learning models. Furthermore, ImmuneCLIP shows strong generalization capabilities even with limited training data, highlighting its potential for studying TCR-epitope interactions and uncovering patterns that improve our understanding of immune recognition systems.
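The CLIP-style contrastive objective that ImmuneCLIP's name alludes to can be sketched as a symmetric InfoNCE loss over cosine similarities, where batch row i pairs TCR i with its cognate epitope i. This NumPy version is illustrative, with arbitrary embedding sizes and temperature, and is not the authors' implementation.

```python
import numpy as np

def clip_loss(tcr_emb, epi_emb, temperature=0.07):
    """Symmetric InfoNCE: matched TCR/epitope pairs sit on the diagonal
    of the similarity matrix and are pulled together; mismatches are
    pushed apart."""
    # L2-normalize so dot products are cosine similarities
    t = tcr_emb / np.linalg.norm(tcr_emb, axis=1, keepdims=True)
    e = epi_emb / np.linalg.norm(epi_emb, axis=1, keepdims=True)
    logits = (t @ e.T) / temperature           # shape (batch, batch)

    def xent(lg):  # cross-entropy with targets on the diagonal
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # average the TCR->epitope and epitope->TCR directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
epi = rng.normal(size=(8, 32))
# near-perfectly aligned embeddings vs. unrelated random embeddings
aligned = clip_loss(epi + 0.01 * rng.normal(size=(8, 32)), epi)
unrelated = clip_loss(rng.normal(size=(8, 32)), epi)
assert aligned < unrelated
print(float(aligned), float(unrelated))
```

Training the two encoders to minimize this loss is what places TCRs and their epitopes near each other in the shared latent space, after which ranking candidate binders reduces to a nearest-neighbor search.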