Briefings in Bioinformatics, Year: 2024, Issue: 26(1)
Published: Nov. 22, 2024
Abstract
Deep machine learning demonstrates a capacity to uncover evolutionary relationships directly from protein sequences, in effect internalising notions inherent to classical phylogenetic tree inference. We connect these two paradigms by assessing the capacity of protein-based language models (pLMs) to discern evolutionary relationships without being explicitly trained to do so. We evaluate ESM2, ProtTrans, and MSA-Transformer relative to classical phylogenetic methods, while also considering sequence insertions and deletions (indels) across 114 Pfam datasets. The largest ESM2 model tends to outperform other pLMs (including multimodal ESM3) in recovering relationships among homologous sequences in both low- and high-gap settings. pLMs agree with conventional methods in general, but more so for protein families with fewer implied indels, highlighting indels as a key factor differentiating classical phylogenetics from pLMs. We find that pLMs preferentially capture broader as opposed to finer relationships within a specific protein family, where ESM2 has a sweet spot for highly divergent sequences, at remote distance. Less than 10% of neurons are sufficient to broadly recapitulate evolutionary distances; when used in isolation, the difference between pLMs is further diminished. We show that these neurons are polysemantic, shared among different homologous families but never fully overlapping. We highlight the potential of pLMs as a complementary tool for phylogenetic analysis, especially when extending to remote homologs that are difficult to align and imply complex histories of insertions and deletions. Implementations and analyses are available at https://github.com/santule/pLMEvo.
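
As a rough illustration of what comparing a pLM against a phylogenetic tree can look like, the sketch below correlates pairwise distances between per-sequence embeddings with pairwise tree distances. It is not the authors' pipeline: the toy embeddings, the toy tree distances, the choice of Euclidean distance and Pearson correlation, and the helper names `pairwise_euclidean` and `pearson` are all assumptions made for illustration.

```python
import math
from itertools import combinations


def pairwise_euclidean(embeddings):
    """Euclidean distance for every unordered pair of sequences, keyed by (name, name)."""
    names = sorted(embeddings)
    return {
        (a, b): math.dist(embeddings[a], embeddings[b])
        for a, b in combinations(names, 2)
    }


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical 2-D "embeddings"; real pLM embeddings have hundreds of dimensions.
toy_embeddings = {
    "seqA": [0.0, 0.0],
    "seqB": [1.0, 0.0],
    "seqC": [4.0, 3.0],
}

# Hypothetical tree (patristic) distances for the same sequence pairs.
toy_tree_distances = {
    ("seqA", "seqB"): 0.2,
    ("seqA", "seqC"): 1.0,
    ("seqB", "seqC"): 0.9,
}

emb_dist = pairwise_euclidean(toy_embeddings)
pairs = sorted(emb_dist)
r = pearson(
    [emb_dist[p] for p in pairs],
    [toy_tree_distances[p] for p in pairs],
)
print(f"Pearson r between embedding and tree distances: {r:.3f}")
```

A high correlation on real data would suggest that the embedding space preserves the ordering of evolutionary distances. Restricting the embedding to a subset of dimensions before calling `pairwise_euclidean` would be one way to probe the neuron-subset finding.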

Directed protein evolution is central to biomedical applications but faces challenges like experimental complexity, inefficient multi-property optimization, and local maxima traps. While in silico methods using protein language models (PLMs) can provide modeled fitness landscape guidance, they struggle to generalize across diverse protein families and map to protein activity. We present EVOLVEpro, a few-shot active learning framework that combines PLMs and regression models to rapidly improve protein activity. EVOLVEpro surpasses current methods, yielding up to 100-fold improvements in desired properties. We demonstrate its effectiveness across six proteins in RNA production, genome editing, and antibody binding applications. These results highlight the advantages of few-shot active learning with minimal experimental data over zero-shot predictions. EVOLVEpro opens new possibilities for AI-guided protein engineering in biology and medicine.

Microbial Biotechnology, Year: 2025, Issue: 18(1)
Published: Jan. 1, 2025
ABSTRACT
Antimicrobial peptides (AMPs) are promising candidates to combat multidrug‐resistant pathogens. However, the high cost of extensive wet‐lab screening has made AI methods for identifying and designing AMPs increasingly important, with machine learning (ML) techniques playing a crucial role. AI approaches have recently revolutionised this field by accelerating the discovery of new peptides with anti‐infective activity, particularly in preclinical mouse models. Initially, classical ML approaches dominated the field, but there has been a shift towards deep learning (DL). Despite significant contributions, existing reviews have not thoroughly explored the potential of large language models (LLMs), graph neural networks (GNNs) and structure‐guided AMP design. This review aims to fill that gap by providing a comprehensive overview of the latest advancements, challenges and opportunities in using AI methods, with particular emphasis on LLMs, GNNs and structure‐guided design. We discuss the limitations of current approaches and highlight the most relevant topics to address in the coming years.

Cell, Year: 2024, Issue: 187(25), pp. 7045-7063
Published: Dec. 1, 2024
Cells are essential to understanding health and disease, yet traditional models fall short of modeling and simulating their function and behavior. Advances in AI and omics offer groundbreaking opportunities to create an AI virtual cell (AIVC), a multi-scale, multi-modal large-neural-network-based model that can represent and simulate the behavior of molecules, cells, and tissues across diverse states. This Perspective provides a vision on their design and how collaborative efforts to build AIVCs will transform biological research by allowing high-fidelity simulations, accelerating discoveries, and guiding experimental studies, offering new opportunities for understanding cellular functions and fostering interdisciplinary collaborations in open science.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown
Published: May 5, 2024
Abstract
The design of functional enzymes holds promise for transformative solutions across various domains but presents significant challenges. Inspired by the success of language models in generating nature-like proteins, we explored the potential of an enzyme-specific language model in designing catalytically active artificial enzymes. Here, we introduce ZymCTRL ('enzyme control'), a conditional language model trained on the enzyme sequence space, capable of generating enzymes based on user-defined specifications. Experimental validation at diverse data regimes and for different enzyme families demonstrated ZymCTRL's ability to generate active enzymes across various sequence identity ranges. Specifically, we describe carbonic anhydrases and lactate dehydrogenases designed zero-shot, without requiring further training of the model, and showcasing activity at sequence identities below 40% compared to natural proteins. Biophysical analysis confirmed the globularity and well-folded nature of the generated sequences. Furthermore, fine-tuning the model enabled the generation of enzymes outside natural sequence space with activity comparable to their natural counterparts. Two were selected for scale production and successfully lyophilised, maintaining activity and demonstrating preliminary conversion in one-pot enzymatic cascades under extreme conditions. Our findings open a new door towards the rapid and cost-effective design of proficient enzymes. The model and dataset are freely available to the community.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown
Published: March 17, 2024
Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure, and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose Mutational Effect Transfer Learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure, and energetics. We finetune METL on experimental sequence-function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity, and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL's ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.

Nature Communications, Year: 2025, Issue: 16(1)
Published: Jan. 2, 2025
Abstract
Molecular structure prediction and homology detection offer promising paths to discovering protein function and evolutionary relationships. However, current approaches lack statistical reliability assurances, limiting their practical utility for selecting proteins for further experimental and in-silico characterization. To address this challenge, we introduce a statistically principled approach to protein search leveraging principles from conformal prediction, offering a framework that ensures statistical guarantees with user-specified risk and provides calibrated probabilities (rather than raw ML scores) for any protein search model. Our method (1) lets users select many biologically-relevant loss metrics (i.e. false discovery rate) and assigns reliable functional probabilities for annotating genes of unknown function; (2) achieves state-of-the-art performance in enzyme classification without training new models; and (3) robustly and rapidly pre-filters proteins for computationally intensive structural alignment algorithms. Our framework enhances the reliability of protein homology detection and enables the discovery of uncharacterized proteins with likely desirable functional properties.
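
To make the conformal idea concrete, here is a minimal sketch of standard split conformal calibration applied to search scores. It assumes a hypothetical similarity score from some search model and a held-out calibration set of scores for known true matches. The function name `conformal_threshold` and all numbers are illustrative and not taken from the paper. This simple variant only bounds the chance of missing a true match; it does not implement richer risk metrics such as false discovery rate.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Pick a similarity cutoff from calibration scores of known true matches.

    Nonconformity is defined as 1 - similarity. Under exchangeability, the
    split conformal quantile ensures that a new true match scores at or above
    the returned cutoff with probability at least 1 - alpha.
    """
    n = len(calibration_scores)
    nonconformity = sorted(1.0 - s for s in calibration_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: no cutoff can be certified,
        # so every candidate has to be retained.
        return float("-inf")
    return 1.0 - nonconformity[rank - 1]


# Hypothetical similarity scores of known homolog pairs (calibration set).
calibration = [0.91, 0.85, 0.78, 0.95, 0.66, 0.88, 0.72, 0.99, 0.81]
cutoff = conformal_threshold(calibration, alpha=0.25)

# Hypothetical hits for a new query, filtered with the calibrated cutoff.
candidates = {"P1": 0.93, "P2": 0.70, "P3": 0.75}
kept = sorted(name for name, score in candidates.items() if score >= cutoff)
print(f"cutoff={cutoff:.2f}, kept={kept}")
```

The only tuning knob is `alpha`, the tolerated miss rate. Lowering it moves the cutoff down and keeps more candidates, which is the trade-off a pre-filter for expensive structural alignment would need to balance.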

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown
Published: Aug. 17, 2024
Abstract
Biological language model performance depends heavily on pretraining data quality, diversity, and size. While metagenomic datasets feature enormous biological diversity, their utilization as pretraining data has been limited due to challenges in data accessibility, quality filtering and deduplication. Here, we present the Open MetaGenomic (OMG) corpus, a genomic pretraining dataset totalling 3.1T base pairs and 3.3B protein coding sequences, obtained by combining the two largest metagenomic dataset repositories (JGI's IMG and EMBL's MGnify). We first document the composition of the dataset and describe the quality filtering steps taken to remove poor quality data. We make the OMG corpus available as a mixed-modality genomic sequence dataset that represents multi-gene encoding genomic sequences with translated amino acids for protein coding sequences, and nucleic acids for intergenic sequences. We train the first mixed-modality genomic language model (gLM2) that leverages genomic context information to learn robust functional representations, as well as coevolutionary signals in protein-protein interfaces and genomic regulatory syntax. Furthermore, we show that deduplication in embedding space can be used to balance the corpus, demonstrating improved performance on downstream tasks. The OMG dataset is publicly hosted on the Hugging Face Hub at https://huggingface.co/datasets/tattabio/OMG and gLM2 is available at https://huggingface.co/tattabio/gLM2_650M.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown
Published: July 16, 2024
Abstract
Biological foundation models hold significant promise for deciphering complex biological functions. However, evaluating their performance on functional tasks remains challenging due to the lack of standardized benchmarks encompassing diverse sequences and functions. Existing functional annotations are often scarce, biased, and susceptible to train-test leakage, hindering robust evaluation. Furthermore, biological functions manifest at multiple scales, from individual residues to large genomic segments. To address these limitations, we introduce the Diverse Genomic Embedding Benchmark (DGEB), inspired by natural language embedding benchmarks. DGEB comprises six embedding tasks across 18 expert curated datasets, spanning sequences from all domains of life and encompassing both nucleic acid and amino acid modalities. Notably, four datasets enable direct comparison between models trained on different modalities. Benchmarking protein and genomic language models (pLMs and gLMs) on DGEB reveals performance saturation with model scaling on numerous tasks, especially on those with underrepresented sequences (e.g. Archaea). This highlights the limitations of existing modeling objectives and training data distributions for capturing diverse biological functions. DGEB is available as an open-source package with a public leaderboard at https://github.com/TattaBio/DGEB.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2025, Issue: unknown
Published: Jan. 22, 2025
Deep learning has made strides in modeling protein sequences but often struggles to generalize beyond its training distribution. Current models focus on modeling individual sequences through masked language modeling, but effective protein sequence analysis demands the ability to reason across sequences, a critical step in phylogenetic analysis. Training biological foundation models explicitly for intersequence reasoning could enhance their generalizability and performance for phylogenetic inference and other sequence analysis tasks in computational biology. Here, we report an ongoing development of PHYLA, an architecture that operates on an explicit, higher-level semantic representation of phylogenetic trees. PHYLA employs a hybrid state-space transformer architecture and a novel tree loss function to achieve state-of-the-art performance on sequence reasoning benchmarks and phylogenetic tree reconstruction. To validate PHYLA's capabilities, we applied it to reconstruct the tree of life, where PHYLA accurately reclassified archaeal organisms, such as Lokiarchaeota, as more closely related to bacteria, aligning with recent phylogenetic insights. PHYLA represents a step toward molecular sequence reasoning, emphasizing structured reasoning over memorization and advancing protein sequence analysis and phylogenetic inference.

Journal of Virology, Year: 2025, Issue: unknown
Published: Jan. 29, 2025
ABSTRACT
The unprecedented sequencing efforts during the COVID-19 pandemic paved the way for genomic surveillance to become a powerful tool for monitoring the evolution of circulating viruses. Herein, we discuss how a state-of-the-art artificial intelligence approach called protein language models (pLMs) can be used for effectively analyzing pathogen genomic data. We highlight examples of pLMs applied to predicting viral properties and evolution and lay out a framework for integrating pLMs into genomic surveillance pipelines.