bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: Dec. 8, 2024
Abstract
Background
Phage lifestyle prediction, i.e. classifying phage sequences as virulent or temperate, is crucial in biomedical and ecological applications. Sequences from metagenome and metavirome assemblies are often fragmented, and the diversity of environmental phages is not well known. Current computational approaches rely on database comparisons or machine learning algorithms that require significant effort and expertise to update. We propose using genomic language models for phage lifestyle classification, allowing efficient, direct analysis of nucleotide sequences without the need for sophisticated preprocessing pipelines or manually curated databases.
Methods
We trained three genomic language models (DNABERT-2, Nucleotide Transformer, ProkBERT) on datasets of short, fragmented sequences. These models were then compared with dedicated phage lifestyle prediction methods (PhaTYP, DeePhage, BACPHLIP) in terms of accuracy, speed, and generalization capability.
Results
ProkBERT PhaStyle consistently outperforms the existing methods in various scenarios. It generalizes to out-of-sample data, accurately classifies phages from extreme environments, and also demonstrates high inference speed. Despite having up to 20 times fewer parameters, it proved to be better performing than much larger models.
Conclusions
Genomic language models offer a simple and computationally efficient alternative for solving complex classification tasks, such as phage lifestyle prediction. PhaStyle's simplicity and performance suggest its utility in clinical
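The Methods section mentions training on short, fragmented sequences, mimicking the contigs produced by metagenome and metavirome assembly. A minimal sketch of generating such fragments from a genome is shown below; this is a toy illustration under assumed fragment-length bounds, not the paper's actual preprocessing:

```python
import random

def fragment_sequence(seq, min_len=400, max_len=2000, rng=None):
    """Split a genome into short, variable-length fragments, mimicking
    the fragmented contigs produced by metagenome/metavirome assembly.
    The length bounds here are illustrative assumptions."""
    rng = rng or random.Random(0)
    fragments, pos = [], 0
    while pos < len(seq):
        length = rng.randint(min_len, max_len)
        frag = seq[pos:pos + length]
        if len(frag) >= min_len:  # drop a too-short trailing piece
            fragments.append(frag)
        pos += length
    return fragments

# Toy "genome": 10 kb of random nucleotides
genome = "".join(random.Random(1).choice("ACGT") for _ in range(10_000))
frags = fragment_sequence(genome)
```

Each fragment would then be labeled with the lifestyle of its source phage and fed to the classifier, so the model learns to decide from partial sequence alone.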
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown
Published: Jan. 27, 2023
Abstract
The rise of large-scale multi-species genome sequencing projects promises to shed new light on how genomes encode gene regulatory instructions. To this end, algorithms are needed that can leverage conservation to capture regulatory elements while accounting for their evolution. Here we introduce species-aware DNA language models (LMs), which we trained on more than 800 species spanning over 500 million years of evolution. Investigating their ability to predict masked nucleotides from context, we show that LMs distinguish transcription factor and RNA-binding protein motifs from background non-coding sequence. Owing to their flexibility, LMs capture conserved elements over much further evolutionary distances than sequence alignment would allow. Remarkably, LMs reconstruct motif instances bound in vivo better than unbound ones and account for the evolution of motif sequences and their positional constraints, showing that they capture functional high-order sequence and evolutionary context. We further show that species-aware training yields improved sequence representations for endogenous and MPRA-based expression prediction, as well as motif discovery. Collectively, these results demonstrate a powerful, flexible, and scalable tool to integrate information from large compendia of highly diverged genomes.
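The core probe in this abstract is predicting a masked nucleotide from its flanking context, with motif positions being more predictable than background. A toy count-based version of that idea is sketched below; real DNA LMs use neural networks, and this stand-in only illustrates why a recurring motif makes its masked positions easy to reconstruct:

```python
from collections import Counter, defaultdict

def train_context_model(sequences, k=3):
    """Count, for each pair of k-mer flanks, how often each nucleotide
    appears in the (masked) middle position."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(k, len(seq) - k):
            context = (seq[i - k:i], seq[i + 1:i + 1 + k])
            counts[context][seq[i]] += 1
    return counts

def predict_masked(counts, left, right):
    """Most likely nucleotide for a masked position, given its flanks."""
    dist = counts.get((left, right))
    return dist.most_common(1)[0][0] if dist else None

# A motif ("TATAAA") recurring across toy sequences makes its interior
# positions highly predictable from context, unlike random background.
corpus = ["ACGTATAAACGG", "TTGTATAAACCA", "GCGTATAAATTG"]
model = train_context_model(corpus, k=3)
pred = predict_masked(model, "TAT", "AAC")  # masked base inside the motif
```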
Briefings in Bioinformatics, Journal Year: 2024, Volume and Issue: 25(6)
Published: Sept. 23, 2024
Abstract
Predicting molecular processes using deep learning is a promising approach to provide biological insights for non-coding single nucleotide polymorphisms identified in genome-wide association studies. However, most deep learning methods rely on supervised learning, which requires DNA sequences associated with functional data, and whose amount is severely limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequence is growing exponentially due to ongoing large-scale sequencing projects, but in most cases without functional data. To alleviate these limitations, we propose a novel semi-supervised learning (SSL) approach based on pseudo-labeling, which allows us to exploit unlabeled DNA sequences from numerous genomes during model pre-training. We further improved it by incorporating principles of the Noisy Student algorithm to predict the confidence of pseudo-labeled data used for pre-training, which showed improvements for transcription factors with very few binding sites (very small training data). The approach is flexible, can be used to train any neural architecture including state-of-the-art models, and shows strong predictive performance compared to standard supervised learning. Moreover, models trained with SSL showed similar or better performance than the large language model DNABERT2.
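The confidence-filtered pseudo-labeling step described above can be sketched as follows. This is a minimal toy, not the paper's implementation: the "teacher" here is a stand-in scoring function (assumed, for illustration, to score GC content), and the confidence thresholds are arbitrary:

```python
def teacher_predict(seq):
    """Toy 'teacher': returns P(bound) for a sequence. A real teacher
    would be a trained network; GC content is a placeholder."""
    return sum(b in "GC" for b in seq) / len(seq)

def pseudo_label(unlabeled, low=0.2, high=0.8):
    """Noisy-Student-style filtering: keep only sequences whose teacher
    prediction is confident, and assign them pseudo-labels."""
    kept = []
    for seq in unlabeled:
        p = teacher_predict(seq)
        if p >= high:
            kept.append((seq, 1))   # confident positive
        elif p <= low:
            kept.append((seq, 0))   # confident negative
        # sequences with low < p < high are discarded as low-confidence
    return kept

unlabeled = ["GCGCGCGC", "ATATATAT", "ACGTACGT", "GGGGCCCC", "AATTAATA"]
pseudo = pseudo_label(unlabeled)
```

The retained pseudo-labeled pairs would then be mixed into pre-training data for the student model, which is what lets unlabeled genomes contribute to training.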
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: Nov. 26, 2024
Genomic Language Models (GLMs), which learn from nucleotide sequences, have become essential tools for understanding the principles of life and have demonstrated outstanding performance in downstream tasks of genomic analysis, such as sequence generation and classification. However, models that achieve state-of-the-art (SoTA) results in benchmark tests often exhibit significant differences in training methods, model architectures, and tokenization techniques, leading to varied strengths and weaknesses. Based on these differences, we propose a multi-model fusion approach based on a dynamic selector. By effectively integrating three models with significantly different characteristics, the method enhances overall predictive performance in downstream tasks. Experimental results indicate that the fused model outperforms any single model on the testing tasks (SoTA), achieving complementary advantages among the models. Additionally, considering that most researchers focus on improving performance, they may overlook detailed analysis of the sequence-processing capabilities of different architectures. To address this gap, this study conducts a comprehensive analysis of the sequence classification abilities of the models, with hypotheses and validations of possible underlying causes. The findings reveal a strong correlation between model performance and the prominence of motifs in sequences. However, excessive reliance on motifs may result in limitations in capturing the biological functions of ultra-short core genes and the contextual relationships of ultra-long sequences. We suggest that these issues need a novel architectural module to compensate for the deficiencies in processing such genes. The code, data, and pre-trained models are available at https://github.com/Jacob-S-Qiu/glm_dynamic_selection.
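The dynamic-selector idea above can be sketched in miniature: per input, a selector routes the sequence to the member model whose presumed strength (motif sensitivity, long context, composition) matches the input. All three "models" and the routing rules here are illustrative stand-ins, not the repository's implementation:

```python
def model_a(seq):
    """Motif-driven toy model: fires on a TATA box."""
    return 1 if "TATA" in seq else 0

def model_b(seq):
    """Long-context toy model: fires on long inputs."""
    return 1 if len(seq) > 20 else 0

def model_c(seq):
    """Composition toy model: fires on G-rich inputs."""
    return 1 if seq.count("G") > len(seq) // 4 else 0

def dynamic_select(seq):
    """Toy dynamic selector: route each input to the model assumed to
    handle its characteristics best (length, then motif, then default)."""
    if len(seq) > 20:
        return model_b(seq)   # ultra-long input -> context model
    if "TATA" in seq:
        return model_a(seq)   # prominent motif -> motif model
    return model_c(seq)       # otherwise -> composition model

preds = [dynamic_select(s) for s in ["ACGTATAGG", "A" * 30, "AATTAC"]]
```

The fusion wins exactly when the members' error patterns differ, which is what the abstract's "complementary advantages" refers to.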
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: Dec. 5, 2024
Abstract
Language models applied to protein sequences have become a panacea, enabling therapeutics development, materials engineering, and core biology research. Despite the successes of protein language models, genome language models remain nascent. Recent studies suggest the bottleneck is data volume or modeling context size, since long-range interactions are widely acknowledged but sparsely annotated. However, it may be the case that even short DNA sequences are modeled poorly by existing approaches, with current models unable to represent the wide array of functions encoded in DNA. To study this, we develop AIDO.DNA, a pretrained module for DNA representation in an AI-driven Digital Organism [1]. AIDO.DNA is a seven billion parameter encoder-only transformer trained on 10.6 billion nucleotides from a dataset of 796 species. By scaling model size while maintaining a context length of 4k nucleotides, AIDO.DNA shows substantial improvements across a breadth of supervised, generative, and zero-shot tasks relevant to functional genomics, synthetic biology, and drug development. Notably, it outperforms prior architectures without new data, suggesting that new scaling laws are needed to achieve compute-optimal DNA language models. Models and code are available through Model-Generator at https://github.com/genbio-ai/AIDO and on Hugging Face at https://huggingface.co/genbio-ai.