arXiv (Cornell University), Journal year: 2023, Number: unknown
Published: Jan. 1, 2023
This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it in the following downstream tasks: (1) identification of enhancers, promoters and splice sites, (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (3) identification of biological function annotations of genomic sequences, and (4) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.
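The abstract's point about byte-level precision can be illustrated with a small sketch. It is not ENBED's tokenizer: the function names, the example sequences, and the choice of non-overlapping 3-mers are all assumptions made here for illustration. It shows that a single-base substitution is located exactly at the byte level, while a k-mer scheme only reports which multi-base window changed, and that a single deletion shifts the frame of the k-mers that follow it.

```python
def byte_tokens(seq: str) -> list[int]:
    """One token per nucleotide: the ASCII byte value of each base."""
    return list(seq.encode("ascii"))


def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Non-overlapping k-mers; a trailing fragment shorter than k is kept."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def mismatch_positions(a: list, b: list) -> list[int]:
    """Indices at which two equal-length token lists differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical sequences chosen for illustration only.
reference = "ACGTACGTACGT"
substituted = "ACGTACCTACGT"  # G -> C at base index 6
deleted = "ACGTACTACGT"       # base index 6 removed

# Byte-level tokens pinpoint the substituted base.
byte_hits = mismatch_positions(byte_tokens(reference), byte_tokens(substituted))

# 3-mer tokens only say that the third window differs.
kmer_hits = mismatch_positions(kmer_tokens(reference), kmer_tokens(substituted))

# A deletion shifts the reading frame of every later 3-mer.
kmers_after_deletion = kmer_tokens(deleted)

print(byte_hits, kmer_hits, kmers_after_deletion)
```

In this toy case the byte-level comparison returns the exact base index, whereas the 3-mer comparison returns a window index covering three bases.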
Frontiers in Medicine, Journal year: 2025, Number: 12
Published: April 8, 2025
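Deoxyribonucleic acid (DNA) serves as the fundamental genetic blueprint that governs the development, functioning, growth, and reproduction of all living organisms. DNA can be altered through germline and somatic mutations. Germline mutations underlie hereditary conditions, while somatic mutations can be induced by various factors including environmental influences, chemicals, lifestyle choices, and errors in DNA replication and repair mechanisms, which can lead to cancer. DNA sequence analysis plays a pivotal role in uncovering the intricate information embedded within an organism's DNA and in understanding the factors that can modify it. This analysis helps in the early detection of genetic diseases and the design of targeted therapies. DNA sequence analysis through traditional wet-lab experimental methods is costly, time-consuming, and prone to errors. To accelerate large-scale DNA sequence analysis, researchers are developing AI applications that complement wet-lab experimental methods. These AI approaches can help generate hypotheses, prioritize experiments, and interpret results by identifying patterns in large genomic datasets. Effective integration of AI methods with experimental validation requires scientists to understand both fields. Considering the need for a comprehensive literature that bridges the gap between both fields, the contributions of this paper are manifold: It presents a diverse range of DNA sequence analysis tasks and AI methodologies. It equips AI researchers with essential biological knowledge of 44 distinct DNA sequence analysis tasks and aligns these tasks with 3 distinct AI-paradigms, namely, classification, regression, and clustering. It streamlines the integration of AI into DNA sequence analysis tasks by consolidating 36 diverse biological databases that can be used to develop benchmark datasets for 44 different DNA sequence analysis tasks. To ensure performance comparisons between new and existing AI predictors, it provides insights into 140 benchmark datasets related to 44 distinct DNA sequence analysis tasks. It presents word embeddings and language models applications across 44 distinct DNA sequence analysis tasks. It streamlines the development of new predictors by providing a comprehensive survey of 39 word embeddings and 67 language models based predictive pipeline performance values, as well as top performing traditional sequence encoding-based predictors and their performances, across 44 DNA sequence analysis tasks.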
Frontiers in Artificial Intelligence, Journal year: 2024, Number: 7
Published: Sep. 24, 2024
Cultured meat has the potential to provide a complementary meat industry with reduced environmental, ethical, and health impacts. However, major technological challenges remain which require time- and resource-intensive research and development efforts. Machine learning has the potential to accelerate cultured meat technology by streamlining experiments, predicting optimal results, and reducing experimentation time and resources. However, the use of machine learning in cultured meat is in its infancy. This review covers the work available to date on the use of machine learning in cultured meat and explores future possibilities. We address four major areas of cultured meat research and development: establishing cell lines, cell culture media design, microscopy and image analysis, and bioprocessing and food processing optimization. In addition, we have included a survey of datasets relevant to CM research. This review aims to provide the foundation necessary for both cultured meat and machine learning scientists to identify research opportunities at the intersection between cultured meat and machine learning.
Journal of Translational Medicine, Journal year: 2024, Number: 22(1)
Published: Aug. 12, 2024
Decoding human genomic sequences requires comprehensive analysis of DNA sequence functionality. Through computational and experimental approaches, researchers have studied the genotype-phenotype relationship and generated important datasets that help unravel complicated genetic blueprints. Thus, the recently developed artificial intelligence methods can be used to interpret the functions of those DNA sequences. This study explores the use of deep learning, particularly pre-trained genomic models like DNA_bert_6 and human_gpt2-v1, in interpreting and representing human genome sequences. Initially, we meticulously constructed multiple datasets linking genotypes and phenotypes to fine-tune those models for precise DNA sequence classification. Additionally, we evaluate the influence of sequence length on classification results and analyze the impact of feature extraction in the hidden layers of our model using the HERV dataset. To enhance our understanding of phenotype-specific patterns recognized by the model, we perform enrichment, pathogenicity and conservation analyses of specific motifs in the human endogenous retrovirus (HERV) sequence with high average local representation weight (ALRW) scores. We have constructed multiple human genomic datasets displaying commendable classification performance in comparison with random genomic sequences, particularly in the HERV dataset, which achieved binary and multi-classification accuracies and F1 values exceeding 0.935 and 0.888, respectively. Notably, the fine-tuning of the HERV dataset not only improved our ability to identify and distinguish diverse information types within DNA sequences but also successfully identified specific motifs associated with neurological disorders and cancers in regions with high ALRW scores. Subsequent analysis of these motifs shed light on the adaptive responses of species to environmental pressures and their co-evolution with pathogens. These findings highlight the potential of pre-trained genomic models in learning DNA sequence representations, particularly when utilizing the HERV dataset, and provide valuable insights for future research endeavors. This study represents an innovative strategy that combines pre-trained genomic model representations with classical methods for analyzing the functionality of genome sequences, thereby promoting cross-fertilization between genomics and artificial intelligence.
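For readers less familiar with the reported metrics, the following sketch shows how accuracy and an F1 value can be computed from predicted and true class labels. The abstract does not state how F1 was averaged across classes, so the macro average used here is an assumption, and the labels and predictions are invented toy data, not results from the study.

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy binary task: is a sequence from the HERV set or a random genomic region?
y_true = ["HERV", "HERV", "random", "random", "HERV", "random"]
y_pred = ["HERV", "random", "random", "random", "HERV", "HERV"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={f1:.3f}")
```

The same two functions apply unchanged to a multi-class setting, since the label set is derived from the data.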
The Plant Journal, Journal year: 2024, Number: unknown
Published: Dec. 12, 2024
SUMMARY
Due to its excellent performance in processing large amounts of data and capturing complex non-linear relationships, deep learning has been widely applied in many fields of plant biology. Here we first review the application of deep learning in analyzing genome sequences to predict gene expression, chromatin interactions, and epigenetic features (open chromatin, transcription factor binding sites, and methylation sites) in plants. Then, current motif mining and functional component design and synthesis based on generative adversarial networks, large models, and attention mechanisms are elaborated in detail. The progress of protein structure and function prediction, genomic prediction, and large model applications based on deep learning is also discussed. Finally, this work provides prospects for the future development of deep learning in plants with regard to multiple omics data, algorithm optimization, large language models, sequence design, and intelligent breeding.
bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2023, Number: unknown
Published: Dec. 20, 2023
Abstract
Large Language Models (LLMs) have shown great promise in their knowledge integration and problem-solving capabilities, but their ability to assist in bioinformatics research has not been systematically evaluated. To bridge this gap, we present BioLLMBench, a novel benchmarking framework coupled with a scoring metric scheme for comprehensively evaluating LLMs in solving bioinformatics tasks. Through BioLLMBench, we conducted a thorough evaluation of 2,160 experimental runs of the three most widely used models, GPT-4, Bard and LLaMA, focusing on 36 distinct tasks within the field of bioinformatics. The tasks come from six key areas of emphasis within bioinformatics that directly relate to the daily challenges and tasks faced by individuals within the field. These areas are domain expertise, mathematical problem-solving, coding proficiency, data visualization, summarizing research papers, and developing machine learning models. The tasks also span across varying levels of complexity, ranging from fundamental concepts to expert-level challenges. Each key area was evaluated using seven specifically designed task metrics, which were then used to conduct an overall evaluation of the LLM's response. To enhance our understanding of model responses under varying conditions, we implemented a Contextual Response Variability Analysis. Our results reveal a diverse spectrum of model performance, with GPT-4 leading in all tasks except mathematical problem solving. GPT-4 was able to achieve an overall proficiency score of 91.3% in domain knowledge tasks, while Bard excelled in mathematical problem-solving with a 97.5% success rate. While GPT-4 outperformed in machine learning model development tasks with an average accuracy of 65.32%, both Bard and LLaMA were unable to generate executable end-to-end code. All models faced considerable challenges in research paper summarization, with none of them exceeding a 40% score in our evaluation using the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, highlighting a significant area for future improvement. We observed an increase in model performance variance when using a new chatting window compared to using the same chat, although the average scores between the two contextual environments remained similar. Lastly, we discuss various limitations of these models and acknowledge the risks associated with their potential misuse.
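As a rough illustration of the summarization metric mentioned above, the sketch below computes a unigram ROUGE recall (ROUGE-1 recall) between a reference summary and a candidate summary. The abstract does not say which ROUGE variant or tooling BioLLMBench used, so the variant, the whitespace tokenization, and the example sentences are assumptions made here.

```python
from collections import Counter


def rouge1_recall(reference: str, candidate: str) -> float:
    """Share of reference unigrams that also appear in the candidate.

    Counts are clipped, so a word repeated in the candidate cannot be
    credited more often than it occurs in the reference.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(n, cand_counts[tok]) for tok, n in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


# Invented example sentences, not taken from the benchmark.
reference_summary = "the model predicts gene expression from dna sequence"
candidate_summary = "the model predicts expression"

score = rouge1_recall(reference_summary, candidate_summary)
print(f"ROUGE-1 recall: {score:.2f}")
```

A score below 0.40 on this kind of measure means that fewer than 40% of the reference summary's words were recovered by the generated summary.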
Accurate modeling of DNA sequences requires capturing distant semantic relationships between the nucleotide acid bases. Most existing deep neural network models face two challenges: 1) they are limited to short DNA fragments and cannot capture long-range interactions, and 2) they require many supervised labels, which is often expensive in practice. We propose a new model called SwanDNA to address the above challenges. By using a sparse and wide neural network architecture, our model enables inferences over very long DNA sequences. By incorporating the neural network into a self-supervised learning framework, our method can give accurate predictions while using fewer labels. We evaluate SwanDNA in three DNA sequence inference tasks: human variant effect, open chromatin regions detection in plant genes, and GenomicBenchmarks. SwanDNA outperforms all competitors in the first two tasks and achieves state-of-the-art results in seven of eight datasets in GenomicBenchmarks. Our code is available at https://github.com/wiedersehne/SwanDNA.
UNSTRUCTURED
In the complex and multidimensional field of medicine, multimodal data are prevalent and crucial for informed clinical decisions. Multimodal data span a broad spectrum of data types, including medical images (eg, MRI and CT scans), time-series data (eg, sensor data from wearable devices and electronic health records), audio recordings (eg, heart and respiratory sounds and patient interviews), text (eg, clinical notes and research articles), videos (eg, surgical procedures), and omics data (eg, genomics and proteomics). While advancements in large language models (LLMs) have enabled new applications for knowledge retrieval and processing in the medical field, most LLMs remain limited to processing unimodal data, typically text-based content, and often overlook the importance of integrating the diverse data modalities encountered in clinical practice. This paper aims to present a detailed, practical, and solution-oriented perspective on the use of multimodal LLMs (M-LLMs) in the medical field. Our investigation spanned M-LLM foundational principles, current and potential applications, technical and ethical challenges, and future research directions. By connecting these elements, we aimed to provide a comprehensive framework that links diverse aspects of M-LLMs, offering a unified vision for their future in health care. This approach aims to guide both future research and practical implementations of M-LLMs in health care, positioning them as a paradigm shift toward integrated, multimodal data–driven medical practice. We anticipate that this work will spark further discussion and inspire the development of innovative approaches in the next generation of medical M-LLM systems.