bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown. Published: Nov. 24, 2024
Protein language models such as the transformer-based Evolutionary Scale Modeling 2 (ESM2) can offer deep insights into the evolutionary and structural properties of proteins. While larger models, such as ESM2 15B, promise to capture more complex patterns in sequence space, they also present practical challenges due to their high dimensionality and computational cost. We systematically evaluated the performance of all ESM2 models across many biological datasets to determine the impact of model size on transfer learning. Surprisingly, larger models do not always outperform smaller ones, especially when data is limited. Medium-sized models, such as ESM2 650M, exhibited consistent performance, falling only slightly behind the 15B-parameter model despite being over 20 times smaller. Additionally, we compared various methods of embedding compression to identify the most effective approach, and found that mean embeddings consistently outperformed other methods. Our results show that ESM2 650M with mean embeddings offers an optimal balance between performance and efficiency, making it a scalable choice for transfer learning in a variety of applications.
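To make the winning compression concrete, here is a minimal sketch of mean-pooled per-residue embeddings from ESM2 650M, assuming the fair-esm package; the abstract does not specify the authors' exact pipeline, so the model checkpoint and token handling below are illustrative.

```python
# Minimal sketch: mean-pooled ESM2 embeddings (assumes `pip install fair-esm`).
import torch
import esm

# Load ESM2 650M (33 layers); use the final layer's representations.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

sequences = [("prot1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(sequences)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
reps = out["representations"][33]        # (batch, seq_len + 2, 1280)

# Mean over real residues only, skipping the BOS/EOS special tokens.
seq_len = len(sequences[0][1])
mean_embedding = reps[0, 1 : seq_len + 1].mean(dim=0)   # (1280,)
print(mean_embedding.shape)
```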
Significance Statement
This work challenges the common belief that larger models yield better results, here in the context of protein biochemistry. By comparing transformer models of different sizes across transfer-learning tasks, we demonstrate that medium-sized models frequently perform as well as larger variants, especially when data is limited. These findings provide an efficient strategy for machine learning-based protein analysis and promote broader accessibility of AI in biology. Smaller, more efficient models can help democratize advanced machine-learning tools, making them accessible to researchers with limited computational resources.
ACS Central Science, Journal Year: 2024, Volume and Issue: 10(2), P. 226 - 241. Published: Feb. 5, 2024
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency, or even to unlock new activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering a starting point that has some desired activity, followed by directed evolution to improve its "fitness" for a particular application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting-point discovery, through functional annotation of known protein sequences or by generating novel protein sequences with desired functions, and (2) navigating fitness landscapes for optimization, by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential for improved engineering outcomes.
PLoS ONE, Journal Year: 2025, Volume and Issue: 20(3), P. e0319208 - e0319208. Published: March 26, 2025
Intrinsically disordered proteins (IDPs) and their intrinsically disordered regions (IDRs) lack stable three-dimensional structures, posing significant challenges for computational prediction. This study introduces PUNCH2 and PUNCH2-light, advanced disorder predictors designed to address these challenges through curated datasets, innovative feature extraction, and optimized neural architectures. By integrating experimental datasets from the PDB (PDB_missing) and fully disordered sequences from DisProt (DisProt_FD), we enhanced model performance and robustness. Three embedding strategies—One-Hot, MSA-based, and PLM-based embeddings—were evaluated, with ProtTrans emerging as the most effective single embedding and combined embeddings achieving the best results. The predictors employ a 12-layer convolutional network (CNN_L12_narrow), offering a balance between accuracy and efficiency. PUNCH2 combines One-Hot, ProtTrans, and MSA-Transformer embeddings, while PUNCH2-light provides a faster alternative by excluding MSA-based embeddings. PUNCH2 and its streamlined variant are competitive with other predictors on the CAID2 benchmark and rank in the top two in the CAID3 competition. These tools provide efficient, accurate solutions to advance IDP research and understanding.
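The abstract names a 12-layer narrow convolutional network over concatenated per-residue embeddings but gives no architectural details; the sketch below is a hypothetical PyTorch rendering of that idea (the channel width, kernel size, and per-residue sigmoid head are assumptions, not the published CNN_L12_narrow).

```python
# Hypothetical per-residue disorder predictor: a narrow 12-layer 1D CNN
# over concatenated residue embeddings (widths/kernels are assumptions).
import torch
import torch.nn as nn

class NarrowDisorderCNN(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 64, layers: int = 12):
        super().__init__()
        blocks, dim = [], in_dim
        for _ in range(layers):
            blocks += [nn.Conv1d(dim, hidden, kernel_size=5, padding=2),
                       nn.ReLU()]
            dim = hidden
        self.backbone = nn.Sequential(*blocks)
        self.head = nn.Conv1d(hidden, 1, kernel_size=1)   # per-residue logit

    def forward(self, x):             # x: (batch, seq_len, in_dim)
        x = x.transpose(1, 2)         # Conv1d expects (batch, channels, len)
        logits = self.head(self.backbone(x))
        return torch.sigmoid(logits).squeeze(1)   # (batch, seq_len) in [0, 1]

# One-Hot (20) + ProtTrans (1024) concatenated per residue, for example:
model = NarrowDisorderCNN(in_dim=20 + 1024)
scores = model(torch.randn(1, 150, 1044))   # disorder propensity per residue
```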
Current Opinion in Structural Biology, Journal Year: 2025, Volume and Issue: 91, P. 102997 - 102997. Published: Feb. 7, 2025
Protein language models (pLMs) capture some aspects of the grammar of the language of life as written in protein sequences. The so-called pLM embeddings implicitly contain this information. Therefore, embeddings can serve as the exclusive input into downstream supervised methods for protein prediction. Over the last 33 years, evolutionary information extracted through simple averaging for specific protein families from multiple sequence alignments (MSAs) has been the most successful universal key to the success of protein prediction. For many applications, MSA-free pLM-based predictions have now become significantly more accurate. The reason is often a combination of two aspects. Firstly, embeddings condense the information so efficiently that prediction methods succeed with small models, i.e., they need few free parameters, in particular in an era of exploding deep neural networks. Secondly, pLM-based methods provide protein-specific solutions. As an additional benefit, once the pre-training is complete, pLM-based solutions tend to consume much fewer resources than MSA-based solutions. In fact, we appeal to the community to rather optimize existing foundation models than retrain new ones, and to evolve incentives that encourage this, even at some loss of accuracy. Although pLMs have not yet succeeded in entirely replacing the body of solutions developed over three decades, they clearly are rapidly advancing.
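As an illustration of the "few free parameters" point, a downstream predictor can be as small as a logistic regression on fixed per-protein embeddings; this is a generic sketch, not a method from the cited opinion piece (scikit-learn and the toy array shapes are assumptions).

```python
# Tiny downstream model on precomputed pLM embeddings: the pLM stays
# frozen, so the supervised part has only (embedding_dim + 1) parameters.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1024))   # stand-in for 200 per-protein embeddings
y = rng.integers(0, 2, size=200)   # stand-in binary labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.score(X, y))             # 1025 trainable parameters in total
```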
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown. Published: Jan. 31, 2024
Abstract
Protein language models (PLMs) have emerged as powerful approaches for mapping protein sequences into embeddings suitable for various applications. As representation schemes, PLMs generate per-token (i.e., per-residue) representations, resulting in variable-sized outputs based on protein length. This variability poses a challenge for protein-level prediction tasks that require uniform-sized embeddings for consistent analysis across different proteins. Previous work has typically used average pooling to summarize token-level PLM outputs, but it is unclear whether this method effectively prioritizes the relevant information in token-level representations. We introduce a novel method utilizing optimal transport to convert variable-length PLM outputs into fixed-length representations. We conceptualize per-token representations as samples from a probabilistic distribution and employ sliced-Wasserstein distances to map these samples against a reference set, creating a Euclidean embedding in the output space. The resulting embedding is agnostic to the length of the input and represents the entire protein. We demonstrate the superiority of our method over average pooling on several downstream prediction tasks, particularly with constrained embedding sizes, enabling smaller-scale PLMs to match or exceed the performance of average-pooled larger-scale PLMs. Our aggregation scheme is especially effective for longer sequences, by capturing essential information that might be lost through average pooling.
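To show the shape of the idea, here is a NumPy sketch of a sliced-Wasserstein embedding: per-token embeddings of any length are projected onto random directions, sorted, and compared against a fixed reference set, yielding a fixed-length vector whose Euclidean distances approximate sliced-Wasserstein distances. The slice count, reference construction, and scaling below are assumptions, not the paper's exact recipe.

```python
# Sketch: variable-length token embeddings -> fixed-length vector via
# sliced-Wasserstein displacement against a shared reference point set.
import numpy as np

def sw_embed(tokens, reference, n_slices=64, seed=0):
    # tokens: (n, d) per-residue embeddings; reference: (m, d) fixed set.
    n, d = tokens.shape
    m = reference.shape[0]
    rng = np.random.default_rng(seed)
    thetas = rng.normal(size=(n_slices, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # unit directions

    qs = (np.arange(m) + 0.5) / m          # common quantile grid of size m
    feats = []
    for theta in thetas:
        x = np.sort(tokens @ theta)        # sorted 1D projection of tokens
        r = np.sort(reference @ theta)     # sorted projection of reference
        # Interpolate the input's quantile function onto the grid, then
        # record its displacement from the reference's quantiles.
        xq = np.interp(qs, (np.arange(n) + 0.5) / n, x)
        feats.append(xq - r)
    return np.concatenate(feats) / np.sqrt(n_slices * m)   # length L * m

ref = np.random.randn(50, 32)                   # shared reference set
protein_a = sw_embed(np.random.randn(120, 32), ref)   # 120 residues
protein_b = sw_embed(np.random.randn(300, 32), ref)   # 300 residues
print(protein_a.shape, np.linalg.norm(protein_a - protein_b))
```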
Journal of Chemical Information and Modeling, Journal Year: 2025, Volume and Issue: unknown. Published: Feb. 24, 2025
Natural Language Processing (NLP) has revolutionized the way computers are used to study and interact with human languages and is increasingly influential in the study of protein and ligand binding, which is critical for drug discovery and development. This review examines how NLP techniques have been adapted to decode the "language" of proteins and small-molecule ligands to predict protein–ligand interactions (PLIs). We discuss how methods such as long short-term memory (LSTM) networks, transformers, and attention mechanisms can leverage different data types to identify potential interaction patterns. Significant challenges are highlighted, including the scarcity of high-quality negative data, difficulties in interpreting model decisions, and sampling biases in existing data sets. We argue that focusing on improving data quality, enhancing model robustness, and fostering both collaboration and competition could catalyze future advances in machine-learning-based predictions of PLIs.
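As a reminder of the core building block such models share, here is a generic scaled dot-product cross-attention in which ligand tokens attend to protein residues; this is textbook machinery, not a specific architecture from the review (the shapes and names are illustrative).

```python
# Generic scaled dot-product cross-attention: ligand tokens attend to
# protein residues (illustrative shapes; not a specific PLI model).
import torch
import torch.nn.functional as F

def cross_attention(queries, keys, values):
    # queries: (Lq, d); keys/values: (Lk, d)
    d = queries.shape[-1]
    scores = queries @ keys.T / d**0.5      # (Lq, Lk) affinity map
    weights = F.softmax(scores, dim=-1)     # which residues each ligand
    return weights @ values, weights        # token "looks at"

ligand = torch.randn(24, 64)     # e.g., SMILES token embeddings
protein = torch.randn(350, 64)   # e.g., residue embeddings
ctx, attn = cross_attention(ligand, protein, protein)
print(ctx.shape, attn.shape)     # (24, 64), (24, 350)
```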
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown. Published: Jan. 29, 2025
Abstract
Linking sequence variation to phenotypic effects is critical for efficient exploitation of large genomic datasets. Here we present a novel approach combining directed evolution with protein language modeling to characterize naturally-evolved variants of a rice immune receptor. Using high-throughput directed evolution, we engineered the rice immune receptor Pik-1 to bind and recognize the fungal proteins Avr-PikC and Avr-PikF, which evade detection by currently characterized Pik-1 alleles. A protein language model was fine-tuned on these data to correlate receptor sequence with ligand binding behavior. This model was then used to score Pik-1 variants found in the 3,000 Rice Genomes Project dataset. Two variants scored highly against Avr-PikC, and in vitro analyses confirmed their improved binding over the wild-type receptor. Overall, this machine learning approach identified promising sources of disease resistance and shows potential utility for exploring other proteins of interest.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown. Published: Feb. 7, 2025
Abstract
Predicting functional properties of mutations, like the change in enzyme activity, remains challenging and is not well captured by traditional pathogenicity prediction. Yet such predictions are crucial in areas like targeted cancer therapy, where some drugs may only be administered if a mutation causes an increase in activity. Current approaches either leverage static Protein-Language Model (PLM) embeddings or complex multi-modal features (e.g., PLM embeddings, structure, and evolutionary data) and either (1) fall short in accuracy or (2) involve complex data processing and pre-training. Standardized datasets and metrics for robust benchmarking would benefit model development but do not yet exist for mutation effect prediction. To address these challenges we develop ESM-Effect, an optimized PLM-based mutation effect prediction framework built through extensive ablation studies. ESM-Effect fine-tunes ESM2 with an inductive-bias regression head to achieve state-of-the-art performance. It surpasses the multi-modal method PreMode, indicating redundancy of structural features, while training 6.7-times faster. In addition, to test prediction strategies, we propose a novel metric termed relative Bin-Mean Error (rBME): rBME emphasizes challenging, non-clustered, rare gain-of-function regions and correlates more intuitively with prediction performance than the commonly used Spearman's rho.
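The abstract does not spell out the rBME formula; one plausible reading, sketched below, bins mutations by their true effect and averages per-bin mean errors so that sparse gain-of-function bins weigh as much as dense neutral ones. The bin count and normalization here are assumptions, not the published definition.

```python
# Hypothetical reading of a bin-mean error: equal-width bins over the true
# effect scale, mean absolute error per bin, averaged over non-empty bins
# so rare gain-of-function bins are not drowned out by dense neutral ones.
# Illustration of the idea only, not the published rBME definition.
import numpy as np

def bin_mean_error(y_true, y_pred, n_bins=10):
    edges = np.linspace(y_true.min(), y_true.max(), n_bins + 1)
    idx = np.clip(np.digitize(y_true, edges) - 1, 0, n_bins - 1)
    per_bin = [np.abs(y_true[idx == b] - y_pred[idx == b]).mean()
               for b in range(n_bins) if np.any(idx == b)]
    # "relative": normalize by the span of the true effect scale.
    return np.mean(per_bin) / (y_true.max() - y_true.min())

rng = np.random.default_rng(1)
y = rng.normal(size=500)
print(bin_mean_error(y, y + rng.normal(scale=0.1, size=500)))
```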
Finally, we demonstrate ESM-Effect's partial generalization to unseen mutational regions within the same protein, illustrating its potential for precision medicine applications. Extending this generalization across different proteins is a promising direction for future research. ESM-Effect is available at: https://github.com/moritzgls/ESM-Effect.
JACS Au, Journal Year: 2025, Volume and Issue: 5(2), P. 955 - 964. Published: Feb. 10, 2025
A quenchbody (Q-body) is a fluorophore-labeled homogeneous immunosensor in which the fluorophore is quenched by tryptophan (Trp) residues in the vicinity of the antigen-binding paratope and dequenched in response to antigen binding. Developing Q-bodies against targets on demand remains challenging due to the large sequence space of the complementarity-determining regions (CDRs) related to binding and quenching. In this study, we pioneered a strategy using high-throughput screening and a protein language model (pLM) to predict the effects of mutations on quenching with single-amino-acid resolution, thereby enhancing the performance of Q-bodies. We collected yeasts displaying nanobodies with high- and low-quenching properties for TAMRA from a modified synthetic nanobody library, followed by next-generation sequencing. The pretrained pLM, connected to a single-layer perceptron, was trained end-to-end on the enriched CDR sequences.
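A minimal sketch of the kind of architecture described follows, assuming a HuggingFace ESM2 checkpoint as a stand-in for the pretrained pLM; the paper's actual model, pooling, and training details are not given in the abstract.

```python
# Minimal sketch: pretrained pLM + single-layer perceptron head, trained
# end-to-end to classify CDR sequences as high- vs low-quenching.
# (ESM2 8M via HuggingFace is an assumed stand-in for the paper's pLM.)
import torch
import torch.nn as nn
from transformers import AutoTokenizer, EsmModel

class QuenchClassifier(nn.Module):
    def __init__(self, name="facebook/esm2_t6_8M_UR50D"):
        super().__init__()
        self.plm = EsmModel.from_pretrained(name)   # fine-tuned, not frozen
        self.head = nn.Linear(self.plm.config.hidden_size, 1)

    def forward(self, **batch):
        h = self.plm(**batch).last_hidden_state   # (B, L, H)
        pooled = h.mean(dim=1)                    # simple mean pooling
        return self.head(pooled).squeeze(-1)      # quenching logit

tok = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = QuenchClassifier()
batch = tok(["GSTWYSSAG", "GRAFSSYAM"], return_tensors="pt", padding=True)
loss = nn.functional.binary_cross_entropy_with_logits(
    model(**batch), torch.tensor([1.0, 0.0]))
loss.backward()   # gradients flow into the pLM: end-to-end training
```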
The prediction model that focused on CDR1 + 3 performed best in an evaluation using precision-recall curves. Using this model, we predicted and validated effective mutations for two anti-SARS-CoV-2 nanobodies, RBD1i13 and RBD10i14, and converted them into Q-bodies. For RBD1i13, three Trp mutants were predicted to have high probability scores through in silico scanning; these were verified via yeast surface display, and all showed enhanced quenching. For RBD10i14, four positions close to an existing Trp gave high scores in in silico saturation mutagenesis, and six of eight high-score mutants derived from these positions exhibited deeper quenching on the yeast surface. Next, combining these mutations successfully enhanced the responses. Overall, our strategy allows the prediction of fluorescence responses solely on the basis of antibody sequence and will be essential for the rational selection and design of antibodies to achieve immunosensors with larger responses.