bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown
Published: Oct. 11, 2023
Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework for DNA language models that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC, OMIM), experimental functional assays (DMS, DepMap), and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and non-coding variants.
ACM Computing Surveys, Journal Year: 2025, Volume and Issue: unknown
Published: Jan. 26, 2025
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered for facilitating scientific discovery. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of “scientific language”, whilst providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions along with the advances of LLMs. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: March 24, 2024
Abstract
Optimizing enzymes to function in novel chemical environments is a central goal of synthetic biology, but optimization is often hindered by a rugged, expansive protein search space and costly experiments. In this work, we present TeleProt, an ML framework that blends evolutionary and experimental data to design diverse protein variant libraries, and employ it to improve the catalytic activity of a nuclease enzyme that degrades biofilms that accumulate on chronic wounds. After multiple rounds of high-throughput experiments using both TeleProt and standard directed evolution (DE) approaches in parallel, we find that our approach found a significantly better top-performing enzyme variant than DE, had a better hit rate at finding diverse, high-activity variants, and was even able to design a high-performance initial library using no prior experimental data. We have released a dataset of 55K nuclease variants, one of the most extensive genotype-phenotype enzyme activity landscapes to date, to drive further progress in ML-guided design.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: May 14, 2024
Abstract
Understanding the relationship between protein sequence and function is crucial for accurate genetic variant classification. Variant effect predictors (VEPs) play a vital role in deciphering this complex relationship, yet evaluating their performance remains challenging due to data circularity, where the same or related data is used for training and assessment. High-throughput experimental strategies like deep mutational scanning (DMS) offer a promising solution. In this study, we extend upon our previous benchmarking approach, assessing the performance of 84 different VEPs and DMS experiments from 36 different human proteins. In addition, a new pairwise, VEP-centric ranking method reduces the impact of VEP score availability on the overall ranking. We observe a remarkably high correspondence between VEP performance in DMS-based benchmarks and clinical variant classification, especially for predictors that have not been directly trained on human clinical variants. Our results suggest that comparing VEP performance against diverse functional assays represents a reliable strategy for assessing their relative performance in clinical variant classification. However, major challenges in the clinical interpretation of VEP scores persist, highlighting the need for further research to fully leverage computational predictors for genetic diagnosis. We also address practical considerations for end users in terms of choice of methodology.
Genome Medicine, Journal Year: 2024, Volume and Issue: 16(1)
Published: July 11, 2024
Abstract
Background
One of the major hurdles in clinical genetics is interpreting the clinical consequences associated with germline missense variants in humans. Recent significant advances have leveraged natural variation observed in large-scale human populations to uncover genes or genomic regions that show a depletion of natural variation, indicative of selection pressure. We refer to this as “genetic constraint”. Although existing genetic constraint metrics have been demonstrated to be successful in prioritising genes or genomic regions associated with diseases, their spatial resolution is limited in distinguishing pathogenic variants from benign variants within genes.
Methods
We aim to identify missense variants that are significantly depleted in the general human population. Given the size of currently available human populations with exome or genome sequencing data, it is not possible to directly detect depletion of individual missense variants, since the average expected number of observations of a variant at most positions is less than one. We instead focus on protein domains, grouping homologous variants with similar functional impacts to examine the depletion of natural variations within these comparable sets. To accomplish this, we develop the Homologous Missense Constraint (HMC) score. We utilise the Genome Aggregation Database (gnomAD) 125 K exome sequencing data and evaluate genetic constraint at quasi amino-acid resolution by combining signals across protein homologues.
Results
We identify one million possible missense variants under strong negative selection within protein domains. Though our approach annotates only protein domains, it nonetheless allows us to assess 22% of the exome confidently. It precisely distinguishes pathogenic variants from benign variants for both early-onset and adult-onset disorders. It outperforms existing constraint metrics and pathogenicity meta-predictors in prioritising de novo mutations from probands with developmental disorders (DD). It is also methodologically independent of these, adding power to predict variant pathogenicity when used in combination. We demonstrate utility for gene discovery by identifying seven genes newly significantly associated with DD that could act through an altered-function mechanism.
Conclusions
Grouping variants of comparable functional impacts is effective in evaluating their genetic constraint. HMC is a novel and accurate predictor of missense consequence for improved variant interpretation.
Science Advances, Journal Year: 2024, Volume and Issue: 10(48)
Published: Nov. 27, 2024
Designing protein mutants with both high stability and activity is a critical yet challenging task in protein engineering. Here, we introduce PRIME, a deep learning model, which can suggest protein mutants with improved stability and activity without any prior experimental mutagenesis data for the specified protein. Leveraging temperature-aware language modeling, PRIME demonstrated superior predictive ability compared to current state-of-the-art models on the public mutagenesis dataset across 283 protein assays. Furthermore, we validated PRIME’s predictions on five proteins, examining the impact of the top 30 to 45 single-site mutations on various protein properties, including thermal stability, antigen-antibody binding affinity, and the ability to polymerize nonnatural nucleic acid or resilience to extreme alkaline conditions. More than 30% of PRIME-recommended mutants exhibited superior performance compared to their premutation counterparts across all proteins and desired properties. We developed an efficient and effective method based on PRIME to rapidly obtain multisite mutants with enhanced activity and stability. Hence, PRIME demonstrates broad applicability in protein engineering.
Current Opinion in Structural Biology, Journal Year: 2025, Volume and Issue: 91, P. 102997
Published: Feb. 7, 2025
Protein language models (pLMs) capture some aspects of the grammar of the language of life as written in protein sequences. The so-called pLM embeddings implicitly contain this information. Therefore, embeddings can serve as the exclusive input into downstream supervised methods for protein prediction. Over the last 33 years, evolutionary information extracted through simple averaging for specific protein families from multiple sequence alignments (MSAs) has been the most successful universal key to the success of protein prediction. For many applications, MSA-free pLM-based predictions now have become significantly more accurate. The reason for this is often a combination of two aspects. Firstly, embeddings condense the grammar so efficiently that downstream prediction methods succeed with small models, i.e., they need few free parameters, in particular in the era of exploding deep neural networks. Secondly, pLM-based methods provide protein-specific solutions. As an additional benefit, once the pLM pre-training is complete, pLM-based solutions tend to consume much fewer resources than MSA-based solutions. In fact, we appeal to the community to rather optimize foundation models than to retrain new ones, and to evolve incentives for solutions that require fewer resources even at some loss in accuracy. Although pLMs have not, yet, succeeded to entirely replace the body of solutions developed over three decades, they clearly are rapidly advancing as the universal key for protein prediction.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown
Published: Feb. 21, 2025
Abstract
All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation, from noncoding pathogenic mutations to clinically significant BRCA1 variants, without task-specific finetuning. Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. Beyond its predictive capabilities, Evo 2 generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Guiding Evo 2 via inference-time search enables controllable generation of epigenomic structure, for which we demonstrate the first inference-time scaling results in biology. We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: March 15, 2024
Abstract
While there has been substantial progress in our ability to predict changes in protein stability due to amino acid substitutions, progress has been slower in methods to predict the absolute stability of a protein. Here we show how a generative model for protein sequence can be leveraged to predict absolute protein stability. We benchmark our predictions across a broad set of proteins and find a mean error of 1.5 kcal/mol and a correlation coefficient of 0.7 for the absolute stability across a range of natural, small–medium sized proteins up to ca. 150 amino acid residues. We analyse current limitations and future directions, including how such a model may be useful for predicting conformational free energies. Our approach is simple to use and freely available via an online implementation.