bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: Dec. 8, 2024
Abstract

Background
Phage lifestyle prediction, i.e. classifying phage sequences as virulent or temperate, is crucial in biomedical and ecological applications. Phage sequences from metagenome or metavirome assemblies are often fragmented, and the diversity of environmental phages is not well known. Current computational approaches often rely on database comparisons and machine learning algorithms that require significant effort and expertise to update. We propose using genomic language models for phage lifestyle classification, allowing efficient direct analysis from nucleotide sequences without the need for sophisticated preprocessing pipelines and manually curated databases.

Methods
We trained three genomic language models (DNABERT-2, Nucleotide Transformer, and ProkBERT) on datasets of short, fragmented sequences. These models were then compared with dedicated phage lifestyle prediction methods (PhaTYP, DeePhage, BACPHLIP) in terms of accuracy, prediction speed, and generalization capability.

Results
ProkBERT PhaStyle consistently outperforms existing models in various scenarios. It generalizes well for out-of-sample data, accurately classifies phage sequences from extreme environments, and also demonstrates high inference speed. Despite having up to 20 times fewer parameters, it proved to be better performing than much larger genomic language models.

Conclusions
Genomic language models offer a simple and computationally efficient alternative for solving complex classification tasks, such as phage lifestyle prediction. ProkBERT PhaStyle’s simplicity, speed, and performance suggest its utility in various ecological and clinical applications.
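
The abstract above refers to datasets of short, fragmented sequences. As a minimal sketch of how such fragments might be cut from a longer phage sequence, the helper below uses fixed-length windows. The function name `fragment_sequence`, the window length, and the stride are illustrative assumptions, not the authors' preprocessing.

```python
def fragment_sequence(seq, length, stride=None):
    """Cut a nucleotide string into fixed-length fragments.

    Hypothetical helper for illustration only; the window length and
    stride are assumptions, not values taken from the paper.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if stride is None:
        stride = length  # non-overlapping windows by default
    if stride <= 0:
        raise ValueError("stride must be positive")
    # Keep only full-length windows; a trailing remainder is dropped.
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, stride)]


if __name__ == "__main__":
    genome = "ACGTACGTAC"
    print(fragment_sequence(genome, 4))            # non-overlapping windows
    print(fragment_sequence(genome, 4, stride=2))  # overlapping windows
```

Dropping the trailing remainder keeps every fragment the same length, which simplifies batching. Padding the last window instead would be an equally reasonable choice.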

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown
Published: Oct. 11, 2023
Whereas protein language models have demonstrated remarkable efficacy in predicting the effects of missense variants, DNA counterparts have not yet achieved a similar competitive edge for genome-wide variant effect predictions, especially in complex genomes such as that of humans. To address this challenge, we here introduce GPN-MSA, a novel framework for DNA language models that leverages whole-genome sequence alignments across multiple species and takes only a few hours to train. Across several benchmarks on clinical databases (ClinVar, COSMIC, and OMIM), experimental functional assays (DMS, DepMap), and population genomic data (gnomAD), our model for the human genome achieves outstanding performance on deleteriousness prediction for both coding and non-coding variants.

Genome Research, Journal Year: 2024, Volume and Issue: 34(9), P. 1411 - 1420
Published: Sept. 1, 2024
Cis-regulatory elements (CREs), such as promoters and enhancers, are DNA sequences that regulate the expression of genes. The activity of a CRE is influenced by the order, composition, and spacing of sequence motifs that are bound by proteins called transcription factors (TFs). Synthetic CREs with specific properties are needed for biomanufacturing as well as for many therapeutic applications including cell and gene therapy. Here, we present regLM, a framework to design synthetic CREs with desired properties, such as high, low, or cell type–specific activity, using autoregressive language models in conjunction with supervised sequence-to-function models. We used our framework to design synthetic yeast promoters and human enhancers. We demonstrate that the synthetic CREs generated by our approach are not only predicted to have the desired functionality but also contain biological features similar to experimentally validated CREs. regLM thus facilitates the design of realistic regulatory DNA elements while providing insights into the cis-regulatory code.

We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a _metagenomic foundation model_, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.
Website: metagene.ai [https://metagene.ai/]
Model Weights: huggingface.co/metagene-ai [https://huggingface.co/metagene-ai]
Code Repository: github.com/metagene-ai [https://github.com/metagene-ai]
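
The METAGENE-1 abstract mentions byte-pair encoding (BPE) tokenization of nucleotide sequences. The toy sketch below illustrates the general BPE idea on DNA strings: repeatedly merge the most frequent adjacent token pair. It is not the METAGENE-1 tokenizer. The function names, the tiny corpus, and the lexicographic tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merge rules from nucleotide strings."""
    corpus = [list(seq) for seq in sequences]  # start from single bases
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Most frequent pair; ties broken lexicographically (an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def tokenize(seq, merges):
    """Apply learned merges, in order, to a new sequence."""
    tokens = list(seq)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


if __name__ == "__main__":
    rules = learn_bpe(["ACGTACGT", "ACGA"], num_merges=2)
    print(rules)
    print(tokenize("ACGT", rules))
```

Because merges only concatenate adjacent tokens, joining the output tokens always reproduces the input sequence. Production tokenizers add a vocabulary size target, special tokens, and much faster pair counting.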

Biology, Journal Year: 2025, Volume and Issue: 14(2), P. 172 - 172
Published: Feb. 7, 2025
The study discloses the application of transformer-based deep learning models for the task of super-enhancer prediction in human tumor cell lines, with a specific focus on sequence-only features within the studied entities: super-enhancer and enhancer elements in the human genome. The proposed SE-prediction method included GENA-LM for handling long DNA sequences in a classification task, distinguishing super-enhancers from enhancers using H3K36me, H3K4me1, H3K4me3 and H3K27ac landscape datasets from HeLa, HEK293, H2171, Jurkat, K562, MM1S and U87 cell lines. The model was fine-tuned on relevant sequence data, allowing the analysis of extended genomic sequences without the need for epigenetic markers as in early approaches. The method achieved balanced accuracy metrics surpassing previous models like SENet, particularly in the HEK293 and K562 cell lines. Also, it was shown that super-enhancers frequently co-localize with epigenetic marks such as H3K27ac. Therefore, the attention mechanism of the model provided insights into the sequence features contributing to SE classification, indicating a correlation between sequence-only features and the mentioned landscapes. These findings support the potential of transformer models for use in further genomic sequence analysis for bioinformatics applications in enhancer/super-enhancer characterization and gene regulation studies.

Computational and Structural Biotechnology Journal, Journal Year: 2025, Volume and Issue: 27, P. 992 - 1000
Published: Jan. 1, 2025
Large language models (LLMs) in genomics have successfully predicted various functional genomic elements. While their performance is typically evaluated using genomic benchmark datasets, it remains unclear which LLM is best suited for specific downstream tasks, particularly for generating whole-genome annotations. Current LLMs in genomics fall into three main categories: transformer-based models, long convolution-based models, and state-space models (SSMs). In this study, we benchmarked three different types of LLM architectures for generating whole-genome maps of G-quadruplexes (GQ), a type of flipons, or non-B DNA structures, characterized by distinctive patterns and functional roles in diverse regulatory contexts. Although GQ forms from folding guanosine residues into tetrads, the computational task is challenging as the bases involved may be on different strands, separated by a large number of nucleotides, or made from RNA rather than DNA. All LLMs performed comparably well, with DNABERT-2 and HyenaDNA achieving superior results based on F1 and MCC. Analysis of whole-genome annotations revealed that HyenaDNA recovered more quadruplexes in distal enhancers and intronic regions. The models were better suited to detecting large GQ arrays that likely contribute to the nuclear condensates involved in gene transcription and chromosomal scaffolds. HyenaDNA and Caduceus formed a separate grouping in the generated de novo quadruplexes, while transformer-based models clustered together. Overall, our findings suggest that different types of LLMs complement each other. Genomic architectures with varying context lengths can detect distinct functional regulatory elements, underscoring the importance of selecting the appropriate model based on the specific genomic task. The code and data underlying this article are available at https://github.com/powidla/G4s-FMs.
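
The benchmark above ranks models by F1 and MCC. For readers less familiar with these metrics, here is a minimal sketch of how both are computed from binary labels using the standard definitions. The function names and the toy labels are assumptions for illustration and are not taken from the linked repository.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is 0."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Toy labels: 1 = window overlaps a G-quadruplex, 0 = background.
    y_true = [1, 1, 1, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    print(f"F1={f1_score(tp, fp, fn):.3f}  MCC={mcc(tp, tn, fp, fn):.3f}")
```

Unlike F1, MCC also accounts for true negatives, which matters for whole-genome annotation, where background windows vastly outnumber positives.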

Frontiers in Medicine, Journal Year: 2025, Volume and Issue: 12
Published: April 8, 2025
Deoxyribonucleic acid (DNA) serves as the fundamental genetic blueprint that governs the development, functioning, growth, and reproduction of all living organisms. DNA can be altered through germline and somatic mutations. Germline mutations underlie hereditary conditions, while somatic mutations can be induced by various factors including environmental influences, chemicals, lifestyle choices, and errors in DNA replication and repair mechanisms, which can lead to cancer. DNA sequence analysis plays a pivotal role in uncovering the intricate information embedded within an organism's genetic blueprint and in understanding the factors that can modify it. This analysis helps in the early detection of genetic diseases and the design of targeted therapies. Traditional DNA sequence analysis through wet-lab experimental methods is costly, time-consuming, and prone to errors. To accelerate large-scale DNA sequence analysis, researchers are developing AI applications that complement wet-lab experimental methods. These AI approaches can help generate hypotheses, prioritize experiments, and interpret results by identifying patterns in large genomic datasets. Effective integration of AI methods with experimental validation requires scientists to understand both fields. Considering the need for a comprehensive literature that bridges the gap between both fields, the contributions of this paper are manifold: It presents a diverse range of DNA sequence analysis tasks and AI methodologies. It equips AI researchers with essential biological knowledge of 44 distinct DNA sequence analysis tasks and aligns these tasks with 3 distinct AI-paradigms, namely, classification, regression, and clustering. It streamlines the integration of AI into DNA sequence analysis by consolidating 36 diverse biological databases that can be used to develop benchmark datasets for different DNA sequence analysis tasks. To ensure performance comparisons between new and existing AI predictors, it provides insights into 140 benchmark datasets related to these tasks. It presents applications of word embeddings and language models across DNA sequence analysis tasks. It streamlines the development of new predictors by providing a survey of 39 word embeddings and 67 language models based predictive pipeline performance values, as well as top performing traditional sequence encoding-based predictors and their performances.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: April 29, 2024
ABSTRACT
The advent of advanced sequencing technologies has significantly reduced the cost and increased the feasibility of assembling high-quality genomes. Yet, the annotation of genomic elements remains a complex challenge. Even for species with comprehensively annotated reference genomes, the functional assessment of individual genetic variants is not straightforward. In response to these challenges, recent breakthroughs in machine learning have led to the development of DNA language models. These transformer-based architectures are designed to tackle a wide array of genomic tasks with enhanced efficiency and accuracy. In this context, we introduce GENA-Web, a web-based platform that consolidates a suite of genome annotation tools powered by DNA language models. The version of GENA-Web presented here encompasses a diverse set of models trained on human data, including the prediction of promoter activity, annotation of splice sites, determination of various chromatin features, and a model for scoring enhancer activity in Drosophila. GENA-Web is accessible online at https://dnalm.airi.net/

Nucleic Acids Research, Journal Year: 2024, Volume and Issue: 52(11), P. 6145 - 6157
Published: May 23, 2024
Abstract
Native prokaryotic promoters share common sequence patterns, but are species dependent. For understudied species with limited data, it is challenging to predict the strength of existing promoters and generate novel promoters. Here, we developed PromoGen, a collection of nucleotide language models to generate species-specific functional promoters, across dozens of species in a data and parameter efficient way. Twenty-seven species-specific models in this collection were finetuned from the pretrained model which was trained on multi-species promoters. When systematically compared with native promoters, the Escherichia coli- and Bacillus subtilis-specific artificial PromoGen-generated promoters (PGPs) were demonstrated to hold all distribution patterns of native promoters. A regression model was developed to score promoters generated either by PromoGen or by another competitive neural network, and the overall score of PGPs is higher. Encouraged by in silico analysis, we further experimentally characterized twenty-two B. subtilis PGPs; results showed that four of the tested PGPs reached the strong promoter level while all were active. Furthermore, we developed a user-friendly website to generate species-specific promoters for 27 different species by PromoGen. This work presented an efficient deep-learning strategy for de novo species-specific promoter generation even with limited datasets, providing valuable toolboxes especially for the metabolic engineering of understudied microorganisms.