PeerJ Computer Science,
Journal year: 2025,
Volume/issue: 11, pp. e2761 - e2761
Published: March 26, 2025
A promoter is a DNA segment which plays a key role in regulating gene expression. Accurate identification of promoters is significant for understanding the regulatory mechanisms involved in gene expression and for genetic disease treatment. Therefore, it is an urgent challenge to develop computational methods for identifying promoters. Most current methods were designed for promoter recognition on a few species and required complex feature extraction in order to attain high accuracy. Spiking neural networks have inherent recurrence and use spike-based sparse coding. Therefore, they have the good property of processing spatio-temporal information and are well suited for learning sequence information. In this study, iPro-CSAF, a convolutional spiking neural network combined with a spiking attention mechanism, is designed for promoter recognition. The method extracts promoter features by two parallel branches, including a spiking attention mechanism and a convolutional spiking layer. iPro-CSAF is evaluated by exhaustive experiments on both prokaryotic and eukaryotic promoter recognition from seven species. Our results show that iPro-CSAF outperforms promoter recognition methods which used parallel CNN layers, methods which combined CNNs with capsule networks, attention mechanism, LSTM or BiLSTM, and CNNs-based methods which needed a priori biological or text feature extraction, while our method has much fewer network parameters. It indicates that iPro-CSAF is an effective computational method with low complexity and good generalization for promoter recognition.
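To make the phrase "spike-based sparse coding" concrete, the following is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron, a common building block of spiking neural networks. It is not the iPro-CSAF implementation: the function name `lif_spike_train`, the `threshold` and `leak` parameters, and the hard-reset rule are all assumptions chosen for illustration.

```python
from typing import List


def lif_spike_train(inputs: List[float],
                    threshold: float = 1.0,
                    leak: float = 0.5) -> List[int]:
    """Simulate one leaky integrate-and-fire neuron (illustrative only).

    At each time step the membrane potential decays by `leak`, the input
    current is added, and a spike (1) is emitted when the potential
    reaches `threshold`; the potential is then reset to zero.
    """
    potential = 0.0
    spikes = []
    for current in inputs:
        potential = leak * potential + current  # leaky integration
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # hard reset after a spike (assumed rule)
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    # Weak inputs stay below threshold; only the strong one fires.
    print(lif_spike_train([0.6, 0.6, 0.0, 1.2]))
```

The output is a binary, mostly-zero sequence, which is the sense in which such coding is sparse; the carried-over membrane potential is what gives the neuron its dependence on earlier time steps.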
Nucleic Acids Research,
Journal year: 2025,
Volume/issue: 53(2)
Published: Jan. 11, 2025
Abstract
Recent advancements in genomics, propelled by artificial intelligence, have unlocked unprecedented capabilities in interpreting genomic sequences, mitigating the need for exhaustive experimental analysis of complex, intertwined molecular processes inherent in DNA function. A significant challenge, however, resides in accurately decoding genomic sequences, which inherently involves comprehending rich contextual information dispersed across thousands of nucleotides. To address this need, we introduce GENA language model (GENA-LM), a suite of transformer-based foundational DNA language models capable of handling input lengths up to 36 000 base pairs. Notably, integrating the newly developed recurrent memory mechanism allows these models to process even larger DNA segments. We provide pre-trained versions of GENA-LM, including multispecies and taxon-specific models, demonstrating their capability for fine-tuning and addressing a spectrum of complex biological tasks with modest computational demands. While language models have already achieved significant breakthroughs in protein biology, GENA-LM showcases a similarly promising potential for reshaping the landscape of genomics and multi-omics data analysis. All models are publicly available on GitHub (https://github.com/AIRI-Institute/GENA_LM) and on HuggingFace (https://huggingface.co/AIRI-Institute). In addition, we provide a web service (https://dnalm.airi.net/) allowing user-friendly DNA annotation with GENA-LM models.
Frontiers in Genetics,
Journal year: 2025,
Volume/issue: 15
Published: Jan. 7, 2025
Recent advancements in deep learning, particularly large language models (LLMs), made a significant impact on how researchers study microbiome and metagenomics data. Microbial protein and genomic sequences, like natural languages, form a language of life, enabling the adoption of LLMs to extract useful insights from complex microbial ecologies. In this paper, we review applications of deep learning and language models in analyzing microbiome and metagenomics data. We focus on problem formulations, necessary datasets, and the integration of language modeling techniques. We provide an extensive overview of protein/genomic language modeling and their contributions to microbiome studies. We also discuss applications such as novel viromics language modeling, biosynthetic gene cluster prediction, and knowledge integration for metagenomics studies.
Cell,
Journal year: 2024,
Volume/issue: 187(25), pp. 7045 - 7063
Published: Dec. 1, 2024
Cells are essential to understanding health and disease, yet traditional models fall short of modeling and simulating their function and behavior. Advances in AI and omics offer groundbreaking opportunities to create an AI virtual cell (AIVC), a multi-scale, multi-modal large-neural-network-based model that can represent and simulate the behavior of molecules, cells, and tissues across diverse states. This Perspective provides a vision on their design and how collaborative efforts to build AIVCs will transform biological research by allowing high-fidelity simulations, accelerating discoveries, and guiding experimental studies, offering new opportunities for understanding cellular functions and fostering interdisciplinary collaborations in open science.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024,
Volume/issue: unknown
Published: March 4, 2024
ABSTRACT
The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
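For readers unfamiliar with the baseline mentioned above, here is a minimal sketch of one-hot encoding a DNA sequence. It is an illustration only, not code from the cited study: the function name `one_hot_encode`, the A/C/G/T column order, and the choice to map ambiguous bases such as N to an all-zero row are assumptions.

```python
from typing import List

ALPHABET = "ACGT"  # assumed column order for the encoding


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Return a len(sequence) x 4 matrix of 0/1 values.

    Each row marks which of A, C, G, T occurs at that position.
    Characters outside the alphabet (e.g. N) yield an all-zero row;
    this handling is an assumption made for this sketch.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    for row in one_hot_encode("ACGTN"):
        print(row)
```

A matrix of this form is what a conventional convolutional model would take as input directly, without any pre-trained representation.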
Current Opinion in Structural Biology,
Journal year: 2025,
Volume/issue: 90, pp. 102979 - 102979
Published: Jan. 7, 2025
The mRNA splicing machinery has been estimated to generate 100,000 known protein-coding transcripts for 20,000 human genes (Ensembl, Sept. 2024). However, this set is expanding with the massive and rapidly growing data coming from high-throughput technologies, particularly single-cell and long-read sequencing. Yet, the implications of the splicing complexity at the protein level remain largely uncharted. In this review, we describe current advances toward systematically assessing the contribution of alternative splicing to proteome and function diversification. We discuss the potential and challenges of using artificial intelligence-based techniques in identifying alternative splicing proteoforms and characterising their structures, interactions, and functions.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2025,
Volume/issue: unknown
Published: Jan. 8, 2025
Modeling long-range DNA dependencies is crucial for understanding genome structure and function across a wide range of biological contexts. However, effectively capturing these extensive dependencies, which may span millions of base pairs in tasks such as three-dimensional (3D) chromatin folding prediction, remains a significant challenge. Furthermore, a comprehensive benchmark suite for evaluating tasks that rely on long-range dependencies is notably absent. To address this gap, we introduce DNALongBench, a benchmark dataset encompassing five important genomics tasks that consider long-range dependencies up to 1 million base pairs: enhancer-target gene interaction, expression quantitative trait loci, 3D genome organization, regulatory sequence activity, and transcription initiation signals. To comprehensively assess DNALongBench, we evaluate the performance of five methods: a task-specific expert model, a convolutional neural network (CNN)-based model, and three fine-tuned DNA foundation models: HyenaDNA, Caduceus-Ph, and Caduceus-PS. We envision DNALongBench as a standardized resource with the potential to facilitate comprehensive comparisons and rigorous evaluations of emerging DNA sequence-based deep learning models that account for long-range dependencies.
Drug Discovery Today,
Journal year: 2024,
Volume/issue: 29(6), pp. 103990 - 103990
Published: April 23, 2024
The enormous growth in the amount of data generated by the life sciences is continuously shifting the field from model-driven science towards data-driven science. The need for efficient processing has led to the adoption of massively parallel accelerators such as graphics processing units (GPUs). Consequently, the development of bioinformatics methods nowadays often heavily depends on the effective use of these powerful technologies. Furthermore, progress in computational techniques and architectures continues to be highly dynamic, involving novel deep neural network models and artificial intelligence (AI) accelerators, and potentially quantum processing units in the future. These are expected to be disruptive for the life sciences as a whole and for drug discovery in particular. Here, we identify three waves of acceleration and their applications in a bioinformatics context: (i) GPU computing, (ii) AI accelerators, and (iii) next-generation quantum computers.