arXiv (Cornell University),
Journal year: 2023,
Issue: unknown
Published: Jan. 1, 2023
This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, which analyzes DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it in the following downstream tasks: (1) identification of enhancers, promotors and splice sites, (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (3) identification of biological function annotations of genomic sequences, and (4) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.
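The Masked Language Modeling objective mentioned above can be sketched at the byte level: individual nucleotide bytes are hidden and the model is trained to reconstruct them. This is a minimal toy sketch of the input-preparation step only; the masking rate, mask token, and corruption details of ENBED's actual pipeline are assumptions here.

```python
import random

def mask_bytes(seq: str, mask_rate: float = 0.15, mask_token: str = "?"):
    """Randomly mask individual nucleotide bytes for an MLM-style objective.
    Returns the corrupted sequence and a dict mapping masked positions to the
    original bytes the model must reconstruct (toy sketch, not ENBED's code)."""
    chars = list(seq)
    targets = {}  # position -> original byte
    for i in range(len(chars)):
        if random.random() < mask_rate:
            targets[i] = chars[i]
            chars[i] = mask_token
    return "".join(chars), targets

random.seed(0)
masked, targets = mask_bytes("ACGTACGTACGTACGT")
print(masked)        # sequence with some bytes replaced by "?"
print(len(targets))  # number of masked positions to predict
```

Masking single bytes rather than multi-base tokens is what preserves the single-nucleotide resolution the abstract credits for detecting base call mismatches and indel errors.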
Journal of Medical Internet Research,
Journal year: 2024,
Issue: 26, pp. e59505 - e59505
Published: Aug. 20, 2024
In the complex and multidimensional field of medicine, multimodal data are prevalent and crucial for informed clinical decisions. Multimodal data span a broad spectrum of data types, including medical images (eg, MRI and CT scans), time-series data (eg, sensor data from wearable devices and electronic health records), audio recordings (eg, heart and respiratory sounds and patient interviews), text (eg, clinical notes and research articles), videos (eg, surgical procedures), and omics data (eg, genomics and proteomics). While advancements in large language models (LLMs) have enabled new applications for knowledge retrieval and processing in the medical field, most LLMs remain limited to unimodal data, typically text-based content, and often overlook the importance of integrating the diverse data modalities encountered in clinical practice. This paper aims to present a detailed, practical, and solution-oriented perspective on the use of multimodal LLMs (M-LLMs) in the medical field. Our investigation spanned M-LLM foundational principles, current and potential applications, technical and ethical challenges, and future research directions. By connecting these elements, we aimed to provide a comprehensive framework that links the diverse aspects of M-LLMs, offering a unified vision for their future in health care. This approach can guide both future research and practical implementations of M-LLMs in health care, positioning them as a paradigm shift toward integrated, multimodal data–driven medicine. We anticipate that this work will spark further discussion and inspire the development of innovative approaches in the next generation of medical M-LLM systems.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024,
Issue: unknown
Published: March 4, 2024
ABSTRACT
The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
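The one-hot encoded baseline the abstract refers to is the conventional way of feeding raw DNA to a supervised model: each position becomes a 4-dimensional indicator vector over A/C/G/T. A minimal sketch (the handling of ambiguous bases such as N as all-zero rows is one common convention, not necessarily the paper's exact choice):

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as an (L, 4) one-hot matrix.
    Bases outside ACGT (e.g. N) map to an all-zero row."""
    idx = {base: i for i, base in enumerate(ALPHABET)}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in idx:
            out[pos, idx[base]] = 1.0
    return out

x = one_hot("ACGTN")
print(x.shape)  # (5, 4)
```

Models trained directly on this representation are the baseline against which the learned gLM embeddings were probed in the study.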
Experimental & Molecular Medicine,
Journal year: 2024,
Issue: 56(6), pp. 1293 - 1321
Published: June 14, 2024
Abstract
The exponential growth of big data in RNA biology (RB) has led to the development of deep learning (DL) models that have driven crucial discoveries. As constantly evidenced by DL studies in other fields, the successful implementation of DL in RB depends heavily on the effective utilization of large-scale datasets from public databases. In achieving this goal, data encoding methods, learning algorithms, and techniques that align well with biological domain knowledge have played pivotal roles. In this review, we provide guiding principles for applying these concepts to various problems in RB by demonstrating successful examples and associated methodologies. We also discuss the remaining challenges in developing DL models for RB and suggest strategies to overcome these challenges. Overall, this review aims to illuminate the compelling potential of DL for RB and ways to apply this powerful technology to investigate the intriguing biology of RNA more effectively.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024,
Issue: unknown
Published: June 5, 2024
Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and offer cross-species prediction better than supervised models through fine-tuning on limited labeled data. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a curated dataset of 16 Angiosperm genomes. Fine-tuning PlantCaduceus on limited Arabidopsis data for four tasks, including predicting translation initiation/termination sites and splice donor and acceptor sites, demonstrated high transferability to maize, which diverged 160 million years ago, outperforming the best existing models by 1.45- to 7.23-fold. PlantCaduceus is also competitive with state-of-the-art protein LMs in terms of deleterious mutation identification, and is threefold better than PhyloP. Additionally, PlantCaduceus successfully identifies well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.
Since the completion of the human genome sequencing project in 2001, significant progress has been made in areas such as gene regulation, gene editing, and protein structure prediction. However, given the vast amount of genomic data, the segments that can be fully annotated and understood remain relatively limited. If we consider the genome as a book, constructing its equivalents of words, sentences, and paragraphs has been a long-standing and popular research direction. Recently, studies on transfer learning in large language models have provided a novel approach to this challenge. Multilingual ability, which assesses how well a model fine-tuned on a source language can be applied to other languages, has been extensively studied in multilingual pre-trained models. Similarly, the transfer of natural language capabilities to the “DNA language” has also been validated. Building upon these findings, we first trained a foundational model capable of transferring linguistic capabilities from English to DNA sequences. Using this model, we constructed a vocabulary of DNA words and mapped them to their English equivalents. Subsequently, using datasets for paragraphing and sentence segmentation, we developed models for segmenting DNA sequences into sentences and paragraphs. Leveraging these models, we processed the GRCh38.p14 genome by segmenting, tokenizing, and organizing it into a “book” comprised of “words,” “sentences,” and “paragraphs.” Additionally, based on the DNA-to-English mapping, we created an “English version” of the book. This study offers a novel perspective for understanding the genome and provides exciting possibilities for developing innovative tools for DNA search, generation, and analysis.
Medical digital twins (MDTs) are virtual representations of patients that simulate the biological, physiological, and clinical processes of individuals to enable personalized medicine. With the increasing complexity of omics data, particularly multiomics, there is a growing need for advanced computational frameworks to interpret these data effectively. Foundation models (FMs), large-scale machine learning models pretrained on diverse data types, have recently emerged as powerful tools for improving interpretability and decision-making in precision medicine. This review discusses the integration of FMs into MDT systems, particularly their role in enhancing the interpretability of multiomics data. We examine current challenges, recent advancements, and future opportunities in leveraging FMs for multiomics analysis in MDTs, with a focus on their application in precision medicine.
We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a _metagenomic foundation model_, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.
Website: metagene.ai [https://metagene.ai/]
Model Weights: huggingface.co/metagene-ai [https://huggingface.co/metagene-ai]
Code Repository: github.com/metagene-ai [https://github.com/metagene-ai]