bioRxiv (Cold Spring Harbor Laboratory),
journal year: 2024, issue: unknown. Published: Dec. 8, 2024
Abstract
Background
Phage lifestyle prediction, i.e., classifying phage sequences as virulent or temperate, is crucial in biomedical and ecological applications. Phage sequences from metagenome and metavirome assemblies are often fragmented, and the diversity of environmental phages is not well known. Current computational approaches rely on database comparisons and machine learning algorithms that require significant effort and expertise to update. We propose using genomic language models for lifestyle classification, allowing efficient, direct analysis of nucleotide sequences without the need for sophisticated preprocessing pipelines or manually curated databases.
Methods
We trained three genomic language models (DNABERT-2, Nucleotide Transformer, and ProkBERT) on datasets of short, fragmented sequences. These models were then compared with dedicated phage lifestyle prediction methods (PhaTYP, DeePhage, and BACPHLIP) in terms of accuracy, inference speed, and generalization capability.
Results
ProkBERT PhaStyle consistently outperforms existing models in various scenarios. It generalizes well to out-of-sample data, accurately classifies phages from extreme environments, and also demonstrates high inference speed. Despite having up to 20 times fewer parameters, it proved to perform better than much larger models.
Conclusions
Genomic language models offer a simple and computationally efficient alternative for solving complex classification tasks, such as phage lifestyle prediction. ProkBERT PhaStyle's simplicity and performance suggest its utility in clinical applications.
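To make the setup concrete, here is a minimal sketch of fine-tuning a pretrained genomic language model for virulent-vs-temperate classification; the Hub model ID (neuralbioinfo/prokbert-mini), the toy data, and all hyperparameters are assumptions for illustration, not the paper's exact pipeline.

```python
# Hedged sketch: fine-tuning a genomic LM for binary phage lifestyle
# classification. Model ID, data, and settings are illustrative assumptions.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "neuralbioinfo/prokbert-mini"  # assumed Hub ID of a genomic LM

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=2, trust_remote_code=True)  # 0=temperate, 1=virulent

# Toy stand-in for a real training set of short, fragmented phage contigs.
train = Dataset.from_dict({
    "sequence": ["ATGCGTTTGACAGGCA", "TTGACAGCTAGGCATG"],
    "label": [1, 0],
})

def tokenize(batch):
    # Raw nucleotides go straight into the tokenizer: no gene calling,
    # protein search, or manually curated database is involved.
    return tokenizer(batch["sequence"], truncation=True, max_length=512)

train = train.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="phage-lifestyle-sketch",
                           per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=train,
)
trainer.train()
```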
bioRxiv (Cold Spring Harbor Laboratory),
journal year: 2023, issue: unknown. Published: Jan. 15, 2023
Closing the gap between measurable genetic information and observable traits is a longstanding challenge in genomics. Yet, prediction of molecular phenotypes from DNA sequences alone remains limited and inaccurate, often driven by the scarcity of annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named Nucleotide Transformer, ranging from 50M up to 2.5B parameters and integrating information from 3,202 diverse human genomes, as well as 850 genomes selected across diverse phyla, including both model and non-model organisms. These transformer models yield transferable, context-specific representations of nucleotide sequences, which allow for accurate molecular phenotype prediction even in low-data settings. We show that the developed models can be fine-tuned at low cost, even in low-data regimes, to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements, including those that regulate gene expression, such as enhancers. Lastly, we demonstrate that utilizing model representations can improve the prioritization of functional variants. The training and application of the foundational models explored in this study provide a widely applicable stepping stone to bridge the gap between DNA sequence and molecular phenotype. Code and weights are available at https://github.com/instadeepai/nucleotide-transformer (Jax) and https://huggingface.co/InstaDeepAI (PyTorch). Example notebooks showing how to apply these models to any downstream task are available at https://huggingface.co/docs/transformers/notebooks#pytorch-bio.
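As a concrete example of using the released checkpoints, the following sketch extracts sequence embeddings from one of the PyTorch models; the checkpoint choice and the mean-pooling strategy are assumptions made here for illustration, not a prescription from the paper.

```python
# Hedged sketch: extracting embeddings from a Nucleotide Transformer
# checkpoint for use as features in downstream, low-data prediction tasks.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

CKPT = "InstaDeepAI/nucleotide-transformer-500m-human-ref"

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForMaskedLM.from_pretrained(CKPT)
model.eval()

inputs = tokenizer("ATTCCGATTCCGATTCCG", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Last hidden layer, mean-pooled over tokens: one vector per sequence.
embedding = out.hidden_states[-1].mean(dim=1)
print(embedding.shape)  # (1, hidden_size)
```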
The genome is a sequence that encodes the DNA, RNA, and proteins that orchestrate an organism's function. We present Evo, a long-context genomic foundation model with a frontier architecture trained on millions of prokaryotic and phage genomes, and report scaling laws on DNA to complement observations in language and vision. Evo generalizes across DNA, RNA, and proteins, enabling zero-shot function prediction competitive with domain-specific models and the generation of functional CRISPR-Cas and transposon systems, representing the first examples of protein-RNA and protein-DNA codesign with a language model. Evo also learns how small mutations affect whole-organism fitness and generates megabase-scale sequences with plausible genomic architecture. These capabilities span molecular to genome scales of complexity, advancing our understanding and control of biology.
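A minimal sketch of how such zero-shot prediction can be set up with an autoregressive genomic model: score a wild-type and a mutant sequence by log-likelihood and compare. The Hub ID, the scoring recipe, and the toy sequences below are assumptions for illustration.

```python
# Hedged sketch: zero-shot mutation scoring via sequence log-likelihoods.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "togethercomputer/evo-1-8k-base"  # assumed Hub ID; needs custom code

tokenizer = AutoTokenizer.from_pretrained(CKPT, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(CKPT, trust_remote_code=True)
model.eval()

def log_likelihood(seq: str) -> float:
    # Sum of per-token log-probabilities under the autoregressive model.
    ids = tokenizer(seq, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[:, :-1], dim=-1)
    return logp.gather(-1, ids[:, 1:, None]).sum().item()

wild_type = "ATGAAACGCATTAGCACCACC"
mutant    = "ATGAAACGCATTAGCACCACG"  # single-nucleotide substitution
# A negative delta means the model assigns the mutant lower likelihood,
# which the paper relates to effects on whole-organism fitness.
print(log_likelihood(mutant) - log_likelihood(wild_type))
```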
bioRxiv (Cold Spring Harbor Laboratory),
journal year: 2024, issue: unknown. Published: Feb. 27, 2024
The genome is a sequence that completely encodes the DNA, RNA, and proteins that orchestrate the function of a whole organism. Advances in machine learning combined with massive datasets of whole genomes could enable a biological foundation model that accelerates the mechanistic understanding and generative design of complex molecular interactions. We report Evo, a genomic foundation model that enables prediction and generation tasks from the molecular to the genome scale. Using an architecture based on advances in deep signal processing, we scale Evo to 7 billion parameters with a context length of 131 kilobases (kb) at single-nucleotide, byte-level resolution. Trained on whole prokaryotic genomes, Evo can generalize across the three fundamental modalities of the central dogma of molecular biology to perform zero-shot function prediction that is competitive with, or outperforms, leading domain-specific language models. Evo also excels at multi-element generation tasks, which we demonstrate by generating synthetic CRISPR-Cas molecular complexes and entire transposable systems for the first time. Using information learned over whole genomes, Evo can also predict gene essentiality at nucleotide resolution and generate coding-rich sequences up to 650 kb in length, orders of magnitude longer than previous methods. Advances in multi-modal and multi-scale learning with Evo provide a promising path toward improving our understanding and control of biology across multiple levels of complexity.
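To illustrate generation at single-nucleotide resolution, here is a hedged sketch of sampling a continuation from a long-context checkpoint; the Hub ID and sampling settings are assumptions, and the paper's pipeline for 650 kb generations is considerably more involved.

```python
# Hedged sketch: prompting an autoregressive genomic model for a
# sequence continuation, in the spirit of Evo's generation experiments.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CKPT = "togethercomputer/evo-1-131k-base"  # assumed Hub ID; needs custom code

tokenizer = AutoTokenizer.from_pretrained(CKPT, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(CKPT, trust_remote_code=True)
model.eval()

# Seed with a short prompt and sample a continuation base by base.
prompt = tokenizer("ATGAAACGCATTAGCACC", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(prompt, max_new_tokens=256, do_sample=True,
                         temperature=0.7, top_k=4)
print(tokenizer.decode(out[0]))
```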
The prediction of molecular phenotypes from DNA sequences alone remains a longstanding challenge in genomics, often driven by limited annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named Nucleotide Transformer, ranging from 50 million up to 2.5 billion parameters and integrating information from 3,202 human genomes and 850 genomes from diverse species. These transformer models yield context-specific representations of nucleotide sequences, which allow for accurate predictions of molecular phenotypes even in low-data settings. We show that the developed models can be fine-tuned at low cost to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements and can be used to improve the prioritization of genetic variants. The training and application of foundational models in genomics provides a widely applicable approach for accurate molecular phenotype prediction from DNA sequence. Nucleotide Transformer is a series of models with different parameter sizes and training datasets that can be applied to various downstream tasks through fine-tuning.
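One way the low-cost fine-tuning mentioned above can be realized is with parameter-efficient adapters. The sketch below uses LoRA via the peft library as an illustrative stand-in for the paper's parameter-efficient recipe; the target-module names and LoRA settings are assumptions.

```python
# Hedged sketch: parameter-efficient fine-tuning of a Nucleotide
# Transformer checkpoint with LoRA adapters (illustrative settings).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

CKPT = "InstaDeepAI/nucleotide-transformer-500m-human-ref"

model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=2)

lora = LoraConfig(
    task_type="SEQ_CLS",
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections (assumed names)
)
model = get_peft_model(model, lora)

# Only the small adapter matrices (and the task head) are trainable,
# which keeps fine-tuning cheap even for billion-parameter checkpoints.
model.print_trainable_parameters()
```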
ACM Computing Surveys,
journal year: 2025, issue: unknown. Published: Jan. 26, 2025
Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered for facilitating scientific discovery. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of "scientific language", whilst providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of models for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions along with advances in LLMs. By offering a comprehensive overview of the technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.
Abstract
Background
The rise of large-scale multi-species genome sequencing projects promises to shed new light on how genomes encode gene regulatory instructions. To this end, algorithms are needed that can leverage conservation to capture regulatory elements while accounting for their evolution.
Results
Here, we introduce species-aware DNA language models, which we trained on more than 800 species spanning over 500 million years of evolution. Investigating their ability to predict masked nucleotides from context, we show that these models distinguish transcription factor and RNA-binding protein motifs from background non-coding sequence. Owing to their flexibility, DNA language models capture conserved regulatory elements over much further evolutionary distances than sequence alignment would allow. Remarkably, the models reconstruct motif instances bound in vivo better than unbound ones and account for the evolution of motif sequences and their positional constraints, showing that they capture functional high-order sequence and evolutionary context. We further show that species-aware training yields improved sequence representations for endogenous and MPRA-based gene expression prediction, as well as motif discovery.
Conclusions
Collectively, these results demonstrate that species-aware DNA language models are a powerful, flexible, and scalable tool to integrate information from large compendia of highly diverged genomes.
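The masked-nucleotide probe described above can be reproduced in a few lines against any masked DNA language model; the checkpoint ID below is a hypothetical placeholder, and the masked position is arbitrary.

```python
# Hedged sketch: probing a masked DNA LM's ability to reconstruct a
# nucleotide from context. The checkpoint ID is a hypothetical placeholder.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

CKPT = "example-org/species-aware-dna-lm"  # hypothetical placeholder ID

tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForMaskedLM.from_pretrained(CKPT)
model.eval()

inputs = tokenizer("ACGTGGCTAACGGTAGCT", return_tensors="pt")

# Mask one position and ask the model to reconstruct it from context.
pos = 8
inputs["input_ids"][0, pos] = tokenizer.mask_token_id
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits[0, pos], dim=-1)
# High probability on the true base inside a bound motif, but not in
# background sequence, is the signal used to detect regulatory elements.
top = probs.topk(4)
for p, token_id in zip(top.values, top.indices):
    print(tokenizer.decode([int(token_id)]), float(p))
```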
bioRxiv (Cold Spring Harbor Laboratory),
journal year: 2024, issue: unknown. Published: March 4, 2024
ABSTRACT
The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since these gLMs were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues with current pre-training strategies for the non-coding genome.
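The probing comparison has a simple skeleton: freeze the features (gLM embeddings or one-hot encodings), fit the same lightweight head on both, and compare scores. The sketch below shows the one-hot side with placeholder data; the gLM side would substitute frozen embeddings for X.

```python
# Hedged sketch: a linear probe over one-hot encoded sequences, the
# conventional baseline the abstract compares gLM embeddings against.
# Data here are random placeholders, not a real benchmark.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def one_hot(seq: str) -> np.ndarray:
    # Flattened L x 4 one-hot encoding over A/C/G/T.
    table = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        out[i, table[base]] = 1.0
    return out.ravel()

# Placeholder data: random sequences with a scalar functional readout.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list("ACGT"), size=100)) for _ in range(200)]
y = rng.normal(size=200)

X = np.stack([one_hot(s) for s in seqs])
# For the gLM side, X would instead hold frozen (e.g. mean-pooled) embeddings.

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = RidgeCV().fit(X_tr, y_tr)
print("one-hot probe R^2:", probe.score(X_te, y_te))
```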
Since the completion of the human genome sequencing project in 2001, significant progress has been made in areas such as gene regulation, gene editing, and protein structure prediction. However, given the vast amount of genomic data, the segments that can be fully annotated and understood remain relatively limited. If we consider the genome as a book, constructing its equivalents of words, sentences, and paragraphs has been a long-standing and popular research direction. Recently, studies on transfer learning in large language models have provided a novel approach to this challenge. Multilingual ability, which assesses how well a model fine-tuned on a source language can be applied to other languages, has been extensively studied in multilingual pre-trained models. Similarly, the transfer of natural language capabilities to the "DNA language" has also been validated. Building upon these findings, we first trained a foundational model capable of transferring linguistic capabilities from English to DNA sequences. Using this model, we constructed a vocabulary of DNA "words" and mapped them to their English equivalents. Subsequently, using English datasets for paragraphing and sentence segmentation, we developed models for segmenting DNA sequences into sentences and paragraphs. Leveraging these models, we processed the GRCh38.p14 genome by segmenting, tokenizing, and organizing it into a "book" comprised of "words," "sentences," and "paragraphs." Additionally, based on the DNA-to-English mapping, we created an "English version" of the book. This study offers a new perspective on understanding the genome and provides exciting possibilities for developing innovative tools for DNA search, generation, and analysis.
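To illustrate the "vocabulary" idea, the sketch below trains a tiny subword tokenizer over nucleotide strings and attaches a toy token-to-English mapping; both the tokenizer settings and the mapping are illustrative assumptions, since the paper learns its mapping with a transfer model rather than assigning words arbitrarily.

```python
# Hedged sketch: building a DNA "word" vocabulary with BPE and a toy
# DNA-token -> English-word mapping, loosely mirroring the paper's idea.
from tokenizers import Tokenizer, models, trainers

# Train a small BPE vocabulary over raw nucleotide strings ("DNA words").
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(vocab_size=64, special_tokens=["[UNK]"])
corpus = ["ATGCGTACGTTAGCATGCGTACGT", "TTGACATTGACAGGCATGCATGCA"]
tokenizer.train_from_iterator(corpus, trainer)

# Toy mapping; the paper instead learns the DNA-to-English correspondence
# with a model that transfers linguistic capability from English to DNA.
dna_to_english = {tok: f"word{idx}"
                  for tok, idx in tokenizer.get_vocab().items()}

encoding = tokenizer.encode("ATGCGTACGT")
print(encoding.tokens)                               # DNA "words"
print([dna_to_english[t] for t in encoding.tokens])  # toy "English version"
```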