bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: Nov. 3, 2024
Abstract
DNA methylation serves as a powerful biomarker for disease diagnosis and biological age assessment. However, current analytical approaches often rely on linear models that cannot capture the complex, context-dependent nature of methylation regulation. Here we present MethylGPT, a transformer-based foundation model trained on 226,555 (154,063 after QC and deduplication) human methylation profiles spanning diverse tissue types from 5,281 datasets, curated 49,156 CpG sites, and 7.6 billion training tokens. MethylGPT learns biologically meaningful representations of CpG sites, capturing both local genomic context and higher-order chromosomal features without external supervision. The model demonstrates robust methylation value prediction (Pearson R=0.929) and maintains stable performance in downstream tasks with up to 70% missing data. Applied to age prediction across multiple tissue types, MethylGPT achieves superior accuracy compared to existing methods. Analysis of the model’s attention patterns reveals distinct methylation signatures between young and old samples, with differential enrichment of developmental and aging-associated pathways. When finetuned to mortality and disease prediction across 60 major conditions using 18,859 samples from Generation Scotland, MethylGPT achieves robust predictive performance and enables systematic evaluation of intervention effects on disease risks, demonstrating potential for clinical applications. Our results demonstrate that transformer architectures can effectively model DNA methylation patterns while preserving biological interpretability, suggesting broad utility for epigenetic analysis and clinical applications.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown
Published: Jan. 15, 2025
The increasing availability of microbial genomes is essential to gain insights into microbial ecology and evolution that can propel biotechnological and biomedical advances. Recent advances in genome recovery have significantly expanded the catalogue of microbial genomes from diverse habitats. However, the ability to explain how well a set of genomes account for the diversity in a given environment remains challenging for individual studies or biome-specific databases. Here we present EcoPhylo, a computational workflow to characterize the phylogeography of any gene family through integrated analyses of genomes and metagenomes, and our application of this approach to ribosomal proteins to quantify phylogeny-aware genome recovery rates across three biomes. Our findings show that genome recovery rates vary widely across taxa and biomes, and that single amplified genomes, metagenome-assembled genomes, and isolate genomes have non-uniform yet quantifiable representation of environmental microbes. EcoPhylo reveals highly resolved, reference-free, multi-domain phylogenies in conjunction with distribution patterns of individual clades across environments, providing a means to assess and benchmark biome-level genome collections.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown
Published: Jan. 22, 2025
Deep learning has made strides in modeling protein sequences but often struggles to generalize beyond its training distribution. Current models focus on individual sequences through masked language modeling, but effective protein sequence analysis demands the ability to reason across sequences, a critical step in phylogenetic analysis. Training biological foundation models explicitly for intersequence reasoning could enhance their generalizability and performance for phylogenetic inference and other tasks in computational biology. Here, we report an ongoing development of PHYLA, an architecture that operates on an explicit, higher-level semantic representation of phylogenetic trees. PHYLA employs a hybrid state-space transformer architecture and a novel tree loss function to achieve state-of-the-art performance on sequence reasoning benchmarks and phylogenetic tree reconstruction. To validate PHYLA's capabilities, we applied it to reconstruct the tree of life, where PHYLA accurately reclassified archaeal organisms, such as Lokiarchaeota, as more closely related to bacteria, aligning with recent phylogenetic insights. PHYLA represents a step toward molecular sequence reasoning, emphasizing structured reasoning over memorization and advancing protein sequence analysis and phylogenetic inference.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown
Published: Feb. 17, 2025
Abstract
Transformers are the basis for many state-of-the-art machine learning tools, including those for predicting gene expression data from DNA sequence. The considerable time and cost of training transformer models has motivated the development of alternative approaches inspired by ideas from the signal-processing literature, such as state-space models (Mamba), Fourier transforms (Hyena), and wavelet transforms (MultiResNet). To evaluate these methods as potential replacements for (or complements to) attention, we developed a software library, bilby, implemented using Python and Jax/Flax, providing convolutional, attention, bidirectional Hyena, bidirectional Mamba, and striped-architecture models for supervised multi-task learning in functional genomics. We report a comparison of these architectures, testing several hyperparameters and variations, and reporting performance statistics for the withheld test set as well as downstream SNP classifiers. Relative to models comprising convolution and attention layers (implemented in TensorFlow via the Baskerville library used by the Borzoi software), models comprising convolutional, bidirectional Mamba, and (optionally) attention layers achieve small but consistent improvements in prediction accuracy, for roughly comparable training times and parameter counts, when averaged across all output tracks and data splits (a proportional increase of 3-4% in Pearson R, and 1-2% in r², with the highest gains achieved when Mamba and attention layers were combined in a striped architecture). In contrast, Hyena (when reimplemented as described in the literature) was not competitive with attention-based models at these tasks, while MultiResNet proved too slow to be practical. The gains in prediction accuracy for the Mamba-based models do not yet translate to significantly improved performance on downstream SNP classification tasks: benchmarks on a GTEx eQTL dataset yield similar results for Mamba- and attention-based classifiers, marginally outperforming in one metric (a difference of +0.007 in area under ROC) and slightly underperforming in another (a difference of −0.006 in Spearman rank correlation). We argue that these results suggest that selective state-space models (such as Mamba and Striped Mamba) warrant further exploration for functional genomics tasks. Our code and trained models are publicly available at https://github.com/ihh/bilby.
NAR Genomics and Bioinformatics, Journal Year: 2024, Volume and Issue: 6(3)
Published: July 2, 2024
Novel applications of language models in genomics promise to have a large impact on the field. The megaDNA model is the first publicly available generative model for creating synthetic viral genomes. To evaluate megaDNA's ability to recapitulate the nonrandom genome composition of viruses and assess whether synthetic genomes can be algorithmically detected, compositional metrics for 4969 natural bacteriophage genomes and 1002