Briefings in Bioinformatics,
Journal year: 2024, Issue: 26(1)
Published: Nov. 22, 2024
Abstract
Deep machine learning demonstrates a capacity to uncover evolutionary relationships directly from protein sequences, in effect internalising notions inherent to classical phylogenetic tree inference. We connect these two paradigms by assessing the ability of protein-based language models (pLMs) to discern evolutionary relationships without being explicitly trained to do so. We evaluate ESM2, ProtTrans, and MSA-Transformer relative to classical phylogenetic methods, while also considering sequence insertions and deletions (indels), across 114 Pfam datasets. The largest ESM2 model tends to outperform the other pLMs (including the multimodal ESM3) at recovering relationships among homologous sequences in both low- and high-gap settings. pLMs agree with conventional methods in general, but more so for families with fewer implied indels, highlighting indels as a key factor differentiating phylogenetics and pLMs. We find that pLMs preferentially capture broader as opposed to finer relationships within a specific family, with a sweet spot for highly divergent sequences at remote distance. Less than 10% of neurons are sufficient to broadly recapitulate evolutionary distances; when these neurons are used in isolation, the difference between the models is further diminished. We show that these neurons are polysemantic, shared across different protein families but never fully overlapping. We highlight the potential of pLMs as a complementary tool for evolutionary analysis, especially for extending to homologs that are difficult to align or imply complex histories of insertions and deletions. Implementations and analyses are available at https://github.com/santule/pLMEvo.
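As a minimal illustration of how pLM embeddings can be compared against tree-based evolutionary distances, the sketch below mean-pools per-residue embeddings into one vector per sequence, computes pairwise cosine distances, and correlates them with reference tree distances via Spearman's rank correlation. It assumes the embeddings and a reference distance matrix are already available; it is a hedged reproduction of the general idea, not the paper's evaluation pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def embedding_distance_matrix(per_residue_embeddings):
    """Mean-pool per-residue pLM embeddings and return pairwise cosine distances.

    per_residue_embeddings: list of arrays, each of shape (length_i, dim).
    """
    pooled = np.stack([e.mean(axis=0) for e in per_residue_embeddings])
    return squareform(pdist(pooled, metric="cosine"))

def tree_agreement(embed_dist, tree_dist):
    """Spearman correlation between the upper triangles of two distance matrices
    (e.g. embedding-based distances vs. patristic distances from a reference tree)."""
    iu = np.triu_indices_from(embed_dist, k=1)
    rho, pval = spearmanr(embed_dist[iu], tree_dist[iu])
    return rho, pval

# Usage (the caller supplies the embeddings and the reference matrix):
# rho, _ = tree_agreement(embedding_distance_matrix(embs), patristic_matrix)
```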
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2025, Issue: unknown
Published: April 16, 2025
Abstract
Generative protein language models (PLMs) are powerful tools for designing proteins purpose-built to solve problems in medicine, agriculture, and industrial processes. Recent work has trained ever larger models, but there has been little systematic study of the optimal training distributions or of the influence of model scale on the sequences generated by PLMs. We introduce the ProGen3 family of sparse generative PLMs, and we develop compute-optimal scaling laws, scaling up to a 46B-parameter model pre-trained on 1.5T amino acid tokens. ProGen3's pre-training data is sampled from an optimized distribution over the Profluent Protein Atlas v1, a carefully curated dataset of 3.4B full-length proteins. We evaluate, for the first time in the wet lab, the influence of model scale and find that larger models generate viable proteins across a much wider diversity of protein families. Finally, larger models are both computationally and experimentally more responsive to alignment with laboratory data, resulting in improved fitness prediction and sequence generation capabilities. These results indicate that PLMs like ProGen3-46B, trained on larger, well-curated datasets, are a foundation to push the frontier of protein design.
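Compute-optimal scaling laws of this kind are typically obtained by fitting a power law between training compute and loss; the sketch below shows a generic log-log least-squares fit. It illustrates the technique in general, not ProGen3's actual fitting procedure, and it assumes the caller supplies measured (compute, loss) pairs from real training runs.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ~ a * compute**(-b) by linear regression in log-log space.

    compute, loss: 1-D arrays of measured training compute (e.g. FLOPs)
    and final validation loss for a set of training runs. Returns (a, b).
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return np.exp(intercept), -slope

def predict_loss(a, b, compute):
    """Extrapolate the fitted law to a new compute budget."""
    return a * np.asarray(compute, dtype=float) ** (-b)
```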
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2025, Issue: unknown
Published: April 23, 2025
ABSTRACT
Protein language models (pLMs) have revolutionized computational biology by generating rich protein vector representations—or embeddings—enabling major advancements in de novo design, structure prediction, variant effect analysis, and evolutionary studies. Despite these breakthroughs, current pLMs often exhibit biases against proteins from underrepresented species, with viral proteins being particularly affected—frequently referred to as the “dark matter” of the biological world due to their vast diversity and ubiquity, yet sparse representation in training datasets. Here, we show that fine-tuning pre-trained pLMs on viral sequences, using diverse learning frameworks and parameter-efficient strategies, significantly enhances embedding quality and improves performance on downstream tasks. To support further research, we provide source code for benchmarking embedding quality. By enabling more accurate modeling of viral proteins, our approach advances tools for understanding viral biology, combating emerging infectious diseases, and driving biotechnological innovation.
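One common parameter-efficient strategy of the kind described is low-rank adaptation (LoRA) on top of a pre-trained pLM; the sketch below wraps a public ESM2 checkpoint from HuggingFace with a LoRA adapter for continued masked-language-model training on viral sequences. The checkpoint name, target modules, and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
# Minimal LoRA fine-tuning sketch (assumed setup, not the authors' exact recipe).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model

base = "facebook/esm2_t12_35M_UR50D"          # small checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Inject low-rank adapters into the attention projections; in the transformers
# ESM implementation these linear layers are named "query" and "value".
lora = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.1,
                  target_modules=["query", "value"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()            # typically well under 1% of all weights

# A standard MLM collator masks a fraction of residues for the fine-tuning objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
viral_seqs = ["MKTAYIAKQRQISFVK", "MSDNGPQNQRNAPRIT"]   # placeholder sequences
batch = collator([tokenizer(s, truncation=True, max_length=512) for s in viral_seqs])

model.train()
loss = model(**batch).loss                    # one fine-tuning step's loss
loss.backward()
```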
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: April 28, 2024
Exhaustive experimental annotation of the effect of all known protein variants remains daunting and expensive, stressing the need for scalable effect predictions. We introduce VespaG, a blazingly fast missense amino acid variant effect predictor, leveraging protein Language Model (pLM) embeddings as input to a minimal deep learning model. To overcome the sparsity of training data, we created a dataset of 39 million single amino acid variants from the human proteome by applying the multiple sequence alignment-based predictor GEMME as a pseudo standard-of-truth. This setup increases interpretability compared to baseline pLMs and is easily retrainable with novel or updated pLMs. Assessed against the ProteinGym benchmark (217 multiplex assays of variant effect - MAVEs - with 2.5 million variants), VespaG achieved a mean Spearman correlation of 0.48 +/- 0.02, matching top-performing methods evaluated on the same data. VespaG has the advantage of being orders of magnitude faster, predicting the mutational landscapes of proteins in entire proteomes such as Homo sapiens or Drosophila melanogaster in under 30 minutes on a consumer laptop (12-core CPU, 16 GB RAM).
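The core recipe, a small feed-forward network reading pLM embeddings and regressing onto GEMME scores used as pseudo-labels, can be sketched as follows. Layer sizes, the loss, and the assumed embedding dimension of 1280 are placeholders; consult the VespaG repository for the actual architecture.

```python
import torch
import torch.nn as nn

class MinimalVariantEffectHead(nn.Module):
    """Tiny MLP mapping one per-residue pLM embedding to 20 substitution scores
    (one per possible target amino acid), in the spirit of embedding-based
    variant-effect predictors trained on GEMME pseudo-labels."""

    def __init__(self, embed_dim: int = 1280, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 20),
        )

    def forward(self, residue_embeddings: torch.Tensor) -> torch.Tensor:
        # residue_embeddings: (num_residues, embed_dim) -> (num_residues, 20)
        return self.net(residue_embeddings)

# One training step against pseudo-labels (tensors below are shape placeholders only).
model = MinimalVariantEffectHead()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
embeddings = torch.randn(300, 1280)       # per-residue pLM embeddings for one protein
gemme_scores = torch.randn(300, 20)       # GEMME pseudo standard-of-truth scores

loss = nn.functional.mse_loss(model(embeddings), gemme_scores)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```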
BMC Bioinformatics,
Journal year: 2024, Issue: 25(1)
Published: April 25, 2024
Abstract
Background
The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings.
Results
Central to our contribution is the soft alignment algorithm, drawing from traditional alignment but leveraging embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. The method not only surpasses pooled embedding-based models in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, the approach provides transparent, BLAST-like visualizations, combining biological research with AI advancements to elevate annotation through embedding-based analysis while ensuring interpretability. Tests on the Virus Orthologous Groups and ViralZone databases indicated that the method recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect.
Conclusion
The use of embeddings shows the great potential of LLMs for enhancing annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference.
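A soft alignment of this kind can be approximated by replacing the substitution matrix in a standard dynamic-programming alignment with per-residue embedding similarities. The sketch below runs a Needleman-Wunsch-style global alignment over a cosine-similarity matrix; the gap penalty and similarity scaling are assumptions, not the published parameters.

```python
import numpy as np

def soft_align(emb_a: np.ndarray, emb_b: np.ndarray, gap: float = -0.5):
    """Global alignment of two proteins using embedding similarity as the score.

    emb_a: (n, d) per-residue embeddings of protein A
    emb_b: (m, d) per-residue embeddings of protein B
    Returns the alignment score and a list of aligned residue index pairs.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T                                   # cosine similarity, shape (n, m)

    n, m = sim.shape
    dp = np.zeros((n + 1, m + 1))
    dp[1:, 0] = gap * np.arange(1, n + 1)
    dp[0, 1:] = gap * np.arange(1, m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = max(dp[i - 1, j - 1] + sim[i - 1, j - 1],  # residue match
                           dp[i - 1, j] + gap,                    # gap in B
                           dp[i, j - 1] + gap)                    # gap in A

    # Traceback to recover which residues were paired (BLAST-like trace).
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if np.isclose(dp[i, j], dp[i - 1, j - 1] + sim[i - 1, j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif np.isclose(dp[i, j], dp[i - 1, j] + gap):
            i -= 1
        else:
            j -= 1
    return dp[n, m], pairs[::-1]
```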
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: July 24, 2024
Abstract
Pretrained protein language models are becoming increasingly popular as a backbone for property inference tasks such as structure prediction or function annotation, accelerating biological research. However, related research oftentimes does not consider the effects of data leakage from pretraining on the actual downstream task, resulting in potentially unrealistic performance estimates. Reported generalization might not necessarily be reproducible for proteins highly dissimilar to the pretraining set. In this work, we measure the effects of such leakage on model generalization in the domain of protein thermostability prediction. Specifically, we compare two different dataset split strategies: a pretraining-aware split, designed to avoid similarity between the pretraining data and the held-out test sets, and the commonly-used naive split, relying on clustering of the training data for the downstream task without taking the pretraining data into account. Our experiments suggest that data leakage from pretraining has a consistent effect on melting point prediction across all experiments, distorting the measured performance. The source code and our dataset splits are available at https://github.com/tfiedlerdev/pretraining-aware-hotprot.
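A pretraining-aware split can be realized by removing, from the downstream test set, any protein that is too similar to a sequence seen during pretraining. The sketch below uses a caller-supplied sequence-identity function (for instance a wrapper around an alignment tool) and an identity threshold; both the threshold and the similarity measure are assumptions for illustration, not the repository's exact procedure.

```python
from typing import Callable, Dict, Iterable, Tuple

def pretraining_aware_split(
    downstream: Dict[str, str],
    pretraining_seqs: Iterable[str],
    identity: Callable[[str, str], float],
    max_identity: float = 0.3,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split downstream proteins into a leakage-free test set and the rest.

    downstream: mapping protein_id -> sequence for the downstream task
    pretraining_seqs: sequences the pLM saw during pretraining
    identity: function returning pairwise sequence identity in [0, 1]
    max_identity: proteins above this identity to ANY pretraining sequence
                  are excluded from the held-out test set
    """
    pretraining_seqs = list(pretraining_seqs)
    clean_test, leaky = {}, {}
    for pid, seq in downstream.items():
        if any(identity(seq, ref) > max_identity for ref in pretraining_seqs):
            leaky[pid] = seq          # similar to pretraining data: not used for testing
        else:
            clean_test[pid] = seq     # safe to use as a held-out test protein
    return clean_test, leaky
```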
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: July 13, 2024
Protein language models trained on the masked language modeling objective learn to predict the identity of hidden amino acid residues within a sequence using the remaining observable residues as context. They do so by embedding residues into a high dimensional space that encapsulates the relevant contextual cues. These vectors serve as an informative, context-sensitive representation that not only aids with the defined training objective, but can also be used for other tasks by downstream models. We propose a scheme to use the embeddings of unmasked residues to estimate the corresponding amino acid probabilities at all positions in a single forward pass through the model. This One Fell Swoop (OFS) approach allows us to efficiently estimate the pseudo-perplexity of a sequence, a measure of the model's uncertainty in its predictions, which can serve as a fitness estimate. We find that ESM2 OFS performs nearly as well as the true pseudo-perplexity at fitness estimation, and more notably it defines a new state of the art on the ProteinGym Indels benchmark. The strong performance prompted us to investigate whether the approach could detect the elevated stability reported for reconstructed ancestral sequences; it ranks ancestral reconstructions as more fit than extant sequences. Finally, we show how the computational efficiency of the technique enables Monte Carlo methods to rapidly explore functional sequence space.
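One straightforward approximation in this spirit is to run a single unmasked forward pass through a masked LM and read off, at each position, the model's probability of the residue that is actually there; exponentiating the negative mean log-probability gives a pseudo-perplexity-like score. The snippet below does this with a small public ESM2 checkpoint via HuggingFace transformers; it illustrates the single-pass idea rather than reproducing the paper's exact OFS estimator.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "facebook/esm2_t6_8M_UR50D"            # small checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

def single_pass_pseudo_perplexity(sequence: str) -> float:
    """Score a protein with one unmasked forward pass (no per-position masking)."""
    enc = tokenizer(sequence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits              # (1, L, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    ids = enc["input_ids"]
    # Log-probability of the observed residue at every position,
    # excluding the special start/end tokens added by the tokenizer.
    token_lp = log_probs[0].gather(1, ids[0].unsqueeze(1)).squeeze(1)[1:-1]
    return torch.exp(-token_lp.mean()).item()

print(single_pass_pseudo_perplexity("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```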
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: Oct. 15, 2024
Abstract
Proteins perform their functions by folding amino acid sequences into dynamic structural ensembles. Despite the important role of protein dynamics, their complexity and the absence of efficient representation methods have limited the integration of dynamics into studies on protein function and mutation fitness, especially in deep learning applications. To address this, we present SeqDance, a protein language model designed to learn dynamic properties directly from sequence alone. SeqDance is pre-trained on biophysical properties derived from over 30,400 molecular dynamics trajectories and 28,600 normal mode analyses. Our results show that SeqDance effectively captures local dynamic interactions, co-movement patterns, and global conformational features, even for proteins lacking homologs in the pre-training set. Additionally, we showed that SeqDance enhances the prediction of protein fitness landscapes, disorder-to-order transition binding regions, and phase-separating proteins. By learning dynamic properties from sequence, SeqDance complements conventional evolution- and static structure-based methods, offering new insights into protein behavior and function.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: June 28, 2024
Abstract
Molecular structure prediction and homology detection provide a promising path to discovering new protein functions and evolutionary relationships. However, current approaches lack statistical reliability assurances, limiting their practical utility for selecting proteins for further experimental or in-silico characterization. To address this challenge, we introduce a novel approach to protein search leveraging principles from conformal prediction, offering a framework that ensures statistical guarantees at a user-specified risk level and provides calibrated probabilities (rather than raw ML scores) for any model. Our method (1) lets users select among many biologically-relevant loss metrics (i.e. false discovery rate) and assigns reliable functional annotations to genes of unknown function; (2) achieves state-of-the-art performance in enzyme classification without training new models; and (3) robustly and rapidly pre-filters proteins for computationally intensive structural alignment algorithms. Our approach enhances the reliability of protein search and enables the selection of proteins likely to have desirable properties.
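The core conformal-prediction step can be illustrated with split-conformal p-values: scores from a labeled calibration set turn a raw model score into a calibrated, probability-like quantity with a coverage guarantee, which can then be thresholded at a user-specified risk level. The sketch below is a generic illustration of conformal calibration, not the authors' specific loss-controlling procedure.

```python
import numpy as np

def conformal_pvalue(score: float, calibration_scores: np.ndarray) -> float:
    """Split-conformal p-value for a query score.

    calibration_scores: model scores for calibration examples known to be
    'null' (e.g. non-homologous pairs); higher scores indicate a stronger hit.
    The p-value is the (smoothed) fraction of null scores at least as large
    as the query score, giving a calibrated false-positive probability.
    """
    n = len(calibration_scores)
    return (1 + np.sum(calibration_scores >= score)) / (n + 1)

def select_hits(query_scores: np.ndarray, calibration_scores: np.ndarray, alpha: float = 0.1):
    """Keep queries whose conformal p-value is below the user-specified risk alpha."""
    pvals = np.array([conformal_pvalue(s, calibration_scores) for s in query_scores])
    return np.where(pvals <= alpha)[0], pvals
```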
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: Aug. 26, 2024
Abstract
Protein language models (pLMs) have traditionally been trained in an unsupervised manner using large protein sequence databases with an autoregressive or masked-language modeling training paradigm. Recent methods have attempted to enhance pLMs by integrating additional information in the form of text; these are referred to as “text+protein” language models (tpLMs). We evaluate and compare six tpLMs (OntoProtein, ProteinDT, ProtST, ProteinCLIP, ProTrek, and ESM3) against ESM2, a baseline text-free pLM, across six downstream tasks designed to assess the learned protein representations. We find that while tpLMs outperform ESM2 in five out of six benchmarks, no tpLM was consistently best. Thus, we additionally investigate the potential of embedding fusion, exploring whether combinations of embeddings can improve performance on the benchmarks by exploiting the strengths of multiple tpLMs. We find that fused embeddings can outperform the best single embedding, highlighting embedding fusion as a useful strategy for the field of machine learning for proteins. To facilitate its practical application, we outline a heuristic framework to efficiently identify the optimal combination of embeddings, reducing the exponential time complexity of an exhaustive search down to a manageable linear complexity. Using our fusion framework, we achieve state-of-the-art performances on protein-protein interaction prediction and homologous sequence recovery without any specific model adjustments or hyperparameter tuning. Our experiments suggest that embedding fusion is a useful tool in the machine-learning-for-proteins toolbox. Lastly, this study highlights future research strategies for maximizing the utility of pLMs.
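Embedding fusion in its simplest form concatenates fixed-length embeddings from several models. One way to avoid the exponential cost of trying every subset is to rank embeddings by their individual benchmark score and then grow the fused representation in that order, keeping the best-scoring prefix, which requires only a linear number of evaluations. The sketch below implements that idea under the assumption of a caller-supplied evaluation function; it is an illustrative heuristic in the spirit of the framework described, not necessarily the authors' exact procedure.

```python
import numpy as np
from typing import Callable, Dict, List, Tuple

def fuse(embeddings: Dict[str, np.ndarray], chosen: List[str]) -> np.ndarray:
    """Concatenate per-protein embeddings from the chosen models along the feature axis."""
    return np.concatenate([embeddings[name] for name in chosen], axis=1)

def linear_fusion_search(
    embeddings: Dict[str, np.ndarray],
    evaluate: Callable[[np.ndarray], float],
) -> Tuple[List[str], float]:
    """Heuristic search over embedding combinations with a linear number of evaluations.

    embeddings: model name -> (num_proteins, dim) matrix
    evaluate: returns downstream benchmark performance for a fused matrix (higher is better)

    1. Score each model's embedding on its own (k evaluations).
    2. Concatenate embeddings in decreasing order of individual score and keep
       the best-scoring prefix (up to k more evaluations), instead of testing
       all 2^k subsets exhaustively.
    """
    singles = sorted(embeddings, key=lambda name: evaluate(embeddings[name]), reverse=True)
    best_combo, best_score = [singles[0]], evaluate(embeddings[singles[0]])
    for i in range(2, len(singles) + 1):
        combo = singles[:i]
        score = evaluate(fuse(embeddings, combo))
        if score > best_score:
            best_combo, best_score = combo, score
    return best_combo, best_score
```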