Research Square (Research Square), Year: 2023, Issue: unknown
Published: Sep. 29, 2023
Abstract
Long extrachromosomal circular DNA (leccDNA) regulates several biological processes such as genomic instability, gene amplification, and oncogenesis. The identification of leccDNA holds significant importance to investigate its potential associations with cancer, autoimmune, cardiovascular, and neurological diseases. In addition, understanding these associations can provide valuable insights about disease mechanisms and potential therapeutic approaches. Conventionally, wet lab-based methods are utilized to identify leccDNA, but they are hindered by the need for prior knowledge and by resource-intensive processes, potentially limiting their broader applicability. To empower the process of leccDNA identification across multiple species, this paper presents the first computational predictor. The proposed iLEC-DNA predictor makes use of an SVM classifier along with sequence-derived nucleotide distribution patterns and physico-chemical properties-based features. In addition, the study introduces a set of 12 benchmark leccDNA datasets related to three species, namely Homo sapiens (HM), Arabidopsis thaliana (AT), and Saccharomyces cerevisiae (SC/YS). It performs large-scale experimentation across the 12 benchmark datasets under different experimental settings, using the proposed predictor and more than 140 baseline predictors. The proposed predictor outperforms the baseline predictors across diverse leccDNA datasets, producing average performance values of 80.699%, 61.45%, and 80.7% in terms of ACC, MCC, and AUC-ROC, respectively, across all datasets. The source code is available at https://github.com/FAhtisham/Extrachrosmosomal-DNA-Prediction. To facilitate the scientific community, a web application for leccDNA identification is available at https://sds_genetic_analysis.opendfki.de/iLEC_DNA//.
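As a rough illustration of the quantities named in this abstract, the sketch below computes a simple k-mer composition vector (one possible "nucleotide distribution" feature) and the ACC and MCC metrics from confusion-matrix counts. It is a minimal sketch under stated assumptions, not the iLEC-DNA implementation: the function names, the choice of k, and the ACGT-only alphabet are all hypothetical.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2):
    """Normalized k-mer frequencies over the ACGT alphabet.

    Hypothetical feature encoding for illustration only; k-mers that
    contain other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples (ACC)."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A vector such as `kmer_frequencies(sequence, k=3)` could be passed to any standard classifier, and `mcc` complements `accuracy` on imbalanced datasets because it uses all four cells of the confusion matrix.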
arXiv (Cornell University), Year: 2024, Issue: unknown
Published: Jan. 1, 2024
Large language models (LLMs) are a class of artificial intelligence models based on deep learning, which have great performance in various tasks, especially in natural language processing (NLP). Large language models typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled input using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we will present a summary of the prominent large language models used in natural language processing, such as BERT and GPT, and focus on exploring the applications of large language models at different omics levels in bioinformatics, mainly including genomics, transcriptomics, proteomics, drug discovery, and single cell analysis. Finally, this review summarizes the potential and prospects of large language models in solving bioinformatic problems.
Computers, Year: 2024, Issue: 13(4), p. 92
Published: April 4, 2024
An increasing demand for model explainability has accompanied the widespread adoption of transformers in various fields of applications. In this paper, we conduct a survey of the existing literature on the explainability of transformers. We provide a taxonomy of methods based on the combination of transformer components that are leveraged to arrive at the explanation. For each method, we describe its mechanism, and we find that attention-based methods, both alone and in conjunction with activation-based and gradient-based methods, are the most employed ones. Growing attention is also devoted to the deployment of visualization techniques to help the explanation process.
bioRxiv (Cold Spring Harbor Laboratory), Year: 2025, Issue: unknown
Published: Feb. 8, 2025
Abstract
DNA methylation (DNAm), an epigenetic modification, regulates gene expression, influences phenotypes, and encodes inheritable information, making it critical for disease diagnosis, treatment, and prevention. While the human genome contains approximately 28 million CpG sites where DNAm can be measured, only 1–3% of these sites are typically available in most datasets due to complex experimental protocols and high costs, hindering insights from DNAm data. Leveraging the relationship between gene expression and DNAm offers promise for computational inference, but existing statistical, machine learning, and masking-based generative Transformers face critical limitations: they cannot infer DNAm at unmeasured CpGs or in new samples effectively. To overcome these challenges, we introduce MethylProphet, a gene-guided, context-aware Transformer model designed for DNAm inference. MethylProphet employs a Bottleneck MLP for efficient gene profile compression and a specialized DNA sequence tokenizer, integrating global gene expression patterns with local CpG context through a Transformer encoder architecture. Trained on whole-genome bisulfite sequencing data from ENCODE (1.6B training CpG-sample pairs; 322B tokens), MethylProphet demonstrates strong performance in hold-out evaluations, effectively inferring DNAm for unmeasured CpGs and new samples. In addition, its application to 10842 pairs of gene expression and DNAm samples at TCGA chromosome 1 (450M training CpG-sample pairs; 91B training tokens) highlights its potential to facilitate pan-cancer DNAm landscape inference, offering a powerful tool for advancing epigenetic research and precision medicine. All codes, data, protocols, and models are publicly available via https://github.com/xk-huang/methylprophet/.
Journal of the American Medical Informatics Association, Year: 2025, Issue: unknown
Published: Feb. 25, 2025
Abstract
Objectives
The vast and complex nature of human genomic sequencing data presents challenges for effective analysis. This review aims to investigate the application of natural language processing (NLP) techniques, particularly large language models (LLMs) and transformer architectures, in deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. The goal of this review is to assess data and model accessibility in the most recent literature, gaining a better understanding of the existing capabilities and constraints of these tools in processing genomic sequencing data.
Materials and Methods
Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, our scoping review was conducted across PubMed, Medline, Scopus, Web of Science, Embase, and ACM Digital Library. Studies were included if they focused on NLP methodologies applied to genomic sequencing data analysis, without restrictions on publication date or article type.
Results
A total of 26 studies published between 2021 and April 2024 were selected for review. The review highlights that tokenization and transformer models enhance the processing and understanding of genomic data, with applications in predicting regulatory annotations like transcription-factor binding sites and chromatin accessibility.
Discussion
The application of NLP and LLMs to genomic sequencing data interpretation is a promising field that can help streamline the processing of large-scale genomic data while also providing a better understanding of its complex structures. It has the potential to drive advancements in personalized medicine by offering more efficient and scalable solutions for genomic analysis. Further research is also needed to discuss and overcome current limitations, enhancing model transparency and applicability.
Conclusion
This review highlights the growing role of NLP, particularly LLMs, in genomic sequencing data analysis. While these models improve data processing and regulatory annotation prediction, challenges remain in accessibility and interpretability. Further research is needed to refine their application in genomics.
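The tokenization step that this review focuses on can be sketched with overlapping k-mers, one common way of turning a DNA sequence into discrete tokens for a transformer. This is a minimal sketch and not taken from any of the reviewed studies: the function names, the default k, and the special tokens `[PAD]` and `[UNK]` are assumptions chosen for illustration.

```python
def kmer_tokenize(seq, k=3, stride=1):
    """Split a DNA sequence into k-mers.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocab(tokens, specials=("[PAD]", "[UNK]")):
    """Map each distinct token to an integer id, special tokens first."""
    vocab = {tok: idx for idx, tok in enumerate(specials)}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Convert tokens to ids, falling back to [UNK] for unseen tokens."""
    unk = vocab["[UNK]"]
    return [vocab.get(tok, unk) for tok in tokens]
```

For example, `kmer_tokenize("ACGTAC", k=3)` yields the four overlapping tokens `ACG`, `CGT`, `GTA`, and `TAC`. Smaller strides preserve more positional context at the cost of longer token sequences.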
Medical digital twins (MDTs) are virtual representations of patients that simulate the biological, physiological, and clinical processes of individuals to enable personalized medicine. With the increasing complexity of omics data, particularly multiomics, there is a growing need for advanced computational frameworks to interpret these data effectively. Foundation models (FMs), large‐scale machine learning models pretrained on diverse data types, have recently emerged as powerful tools for improving data interpretability and decision‐making in precision medicine. This review discusses the integration of FMs into MDT systems, particularly their role in enhancing the interpretability of multiomics data. We examine current challenges, recent advancements, and future opportunities in leveraging FMs for multiomics analysis in MDTs, with a focus on their application in precision medicine.
Abstract
Transformer-based language models are successfully used to address massive text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning–based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep learning framework for predicting DNA methylation sites, which is based on 5 popular transformer-based language models. The framework identifies methylation sites for 3 different types of DNA methylation: N6-adenine, N4-cytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the “pretrain and fine-tune” paradigm. Pretraining is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA methylation status of each type. The 5 models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between different species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that joint utilization of different language models improves model performance. Mulan-Methyl is open source, and we provide a web server that implements the approach.
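The idea of several models collectively predicting a methylation status can be illustrated with simple probability averaging (soft voting). The abstract does not state how the five models are combined, so the averaging rule, the 0.5 threshold, and the function name below are assumptions made for this sketch rather than a description of MuLan-Methyl.

```python
def average_ensemble(prob_lists, threshold=0.5):
    """Average per-site probabilities across models and threshold the mean.

    prob_lists holds one list per model; each inner list contains that
    model's predicted probability of methylation for every site.
    """
    if not prob_lists:
        raise ValueError("at least one model is required")
    n_sites = len(prob_lists[0])
    if any(len(p) != n_sites for p in prob_lists):
        raise ValueError("all models must score the same number of sites")
    n_models = len(prob_lists)
    # Mean probability per site across all models.
    mean = [sum(p[i] for p in prob_lists) / n_models for i in range(n_sites)]
    # Assumed decision rule: call a site methylated when the mean >= threshold.
    labels = [1 if m >= threshold else 0 for m in mean]
    return mean, labels
```

Averaging tends to smooth out the errors of individual models, which is one plausible reason why joint use of several language models could improve performance.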
Heliyon, Year: 2025, Issue: 11(2), p. e41488
Published: Jan. 1, 2025
Deciphering information of RNA sequences reveals their diverse roles in living organisms, including gene regulation and protein synthesis. Aberrations in RNA sequence such as dysregulation and mutations can drive a diverse spectrum of diseases, including cancers, genetic disorders, and neurodegenerative conditions. Furthermore, researchers are harnessing RNA's therapeutic potential for transforming traditional treatment paradigms into personalized therapies through the development of RNA-based drugs and gene therapies. To gain insights into the biological functions of RNAs, to detect diseases at early stages, and to develop potent therapeutics, researchers are performing diverse types of RNA sequence analysis tasks. RNA sequence analysis through conventional wet-lab methods is expensive, time-consuming, and error prone. To enable large-scale RNA sequence analysis, empowerment of wet-lab experimental methods with Artificial Intelligence (AI) applications necessitates scientists to have comprehensive knowledge of both DNA and AI fields. While molecular biologists encounter challenges in understanding AI methods, computer scientists often lack the basic foundations of RNA sequence analysis tasks. Considering the absence of comprehensive literature that bridges this research gap and promotes the development of AI-driven RNA sequence analysis applications, the contributions of this manuscript are manifold: It equips AI researchers with the biological foundations of 47 distinct RNA sequence analysis tasks. It sets a stage for the development of benchmark datasets related to these tasks by summarizing the cruxes of 64 different databases. It presents word embeddings and language models applications across the 47 distinct tasks. It streamlines the development of new predictors by providing a survey of 58 word embeddings and 70 language models based predictive pipelines, with their performance values, as well as top encoding based predictors and their performances