Frontiers in Microbiology, Year: 2024, No. 15
Published: Dec. 23, 2024
dna2bit is an ultra-fast software tool specifically engineered for microbial genome analysis, particularly adept at calculating genome distances within metagenome and single amplified genome datasets. Distinguished from existing software such as Mash and Dashing, dna2bit employs the feature hashing technique and Hamming distance to achieve enhanced speed and memory utilization without sacrificing the accuracy of average nucleotide identity calculations. dna2bit has promising applications in various domains such as average nucleotide identity approximation, metagenomic sequence clustering, and homology querying. dna2bit significantly boosts computational efficiency in handling large datasets including genomes, thereby facilitating a better understanding of population heterogeneity and comparative genomics of microorganisms. dna2bit is available at https://github.com/lijuzeng/dna2bit.
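
To make the combination of feature hashing and Hamming distance concrete, here is a minimal Python sketch. It is not dna2bit's actual implementation: the k-mer length, the number of bits, the use of CRC32 as the hash, and the sign rule are all assumptions chosen for illustration.

```python
import zlib

def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def bit_sketch(seq, k=8, n_bits=64):
    """Feature-hash the k-mers of seq into n_bits signed counters, keep only the signs.

    Hypothetical scheme: the hash modulo n_bits picks a counter, and one
    higher bit of the hash decides whether to add or subtract 1.
    """
    counters = [0] * n_bits
    for kmer in kmers(seq, k):
        h = zlib.crc32(kmer.encode("ascii"))
        bucket = h % n_bits
        sign = 1 if (h >> 16) & 1 else -1
        counters[bucket] += sign
    bits = 0
    for i, c in enumerate(counters):
        if c > 0:
            bits |= 1 << i  # one bit per counter: set when the counter is positive
    return bits

def hamming(a, b):
    """Number of bit positions in which two integer sketches differ."""
    return bin(a ^ b).count("1")
```

Each genome is reduced to a fixed-width bit string, so comparing two genomes costs one XOR and a bit count regardless of genome length. Turning that distance into an average nucleotide identity estimate would need a calibration step that this sketch does not attempt.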

It has been over a decade since the first publication of a method dedicated entirely to mapping long reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework, due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings (http://bcazaux.polytech-lille.net/Minimap2/).

Computational and Structural Biotechnology Journal, Year: 2022, No. 20, pp. 4579-4599
Published: Jan. 1, 2022
We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to the population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low cost yet highly error prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent. The analysis script and data used in our experimental evaluation are available at: https://github.com/CMU-SAFARI/Molecules2Variations

Bioinformatics, Year: 2023, No. 39(Supplement_1), pp. i297-i307
Published: June 1, 2023
Nanopore sequencers generate electrical raw signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either (i) require powerful computational resources that may not be available for portable sequencers or (ii) lack scalability for large genomes, rendering them inaccurate or ineffective. We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures the signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value. We evaluate RawHash on three applications: (i) read mapping, (ii) relative abundance estimation, and (iii) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real-time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides (i) 25.8× and 3.4× better average throughput and (ii) significantly better accuracy for large genomes, respectively. Source code is available at https://github.com/CMU-SAFARI/RawHash.
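
The quantize-then-hash idea in this abstract can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not RawHash's code: the value range, the number of buckets, the window length, and the packed integer key (standing in for a real hash function) are all hypothetical.

```python
def quantize(value, lo=-3.0, hi=3.0, n_buckets=16):
    """Map a normalized signal value to an integer bucket in [0, n_buckets)."""
    clipped = min(max(value, lo), hi)
    bucket = int((clipped - lo) / (hi - lo) * n_buckets)
    return min(bucket, n_buckets - 1)  # the upper edge falls into the last bucket

def window_keys(events, window=4, n_buckets=16):
    """Pack every run of `window` consecutive quantized events into one integer key."""
    q = [quantize(e, n_buckets=n_buckets) for e in events]
    keys = []
    for i in range(len(q) - window + 1):
        packed = 0
        for b in q[i:i + window]:
            packed = packed * n_buckets + b  # base-n_buckets positional encoding
        keys.append(packed)
    return keys

# Two noisy measurements of the same (made-up) signal
a = [0.10, -1.20, 2.05, 0.70, -0.40]
b = [0.12, -1.18, 2.10, 0.68, -0.42]
same = window_keys(a) == window_keys(b)
```

Small differences between `a` and `b` vanish after quantization, so both yield identical keys and can be matched by exact lookup. Values that straddle a bucket boundary would still diverge, which is one reason the choice of quantization matters.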

The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates.
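
For reference, the exact quantity that sketching tools approximate can be computed directly on small inputs. The Python sketch below evaluates the Jaccard similarity on full k-mer sets, with no winnowing or sketching; the function names and the default k are illustrative choices, not MashMap's API.

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b, k=4):
    """Jaccard similarity |A & B| / |A | B| of the k-mer sets of two sequences."""
    set_a, set_b = kmer_set(a, k), kmer_set(b, k)
    union = set_a | set_b
    if not union:
        return 0.0  # both sequences are shorter than k
    return len(set_a & set_b) / len(union)
```

For example, with k = 2 the sequences "ACGT" and "ACGA" share the k-mers AC and CG out of four distinct k-mers in total, giving a similarity of 0.5.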

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps (fundamental bottlenecks to read mapping) for both the human and maize genomes with >96% sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a 37× speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a 410× speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic O(n) pseudochaining algorithm, which improves upon the long-standing O(n log n) bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
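
The seeding idea described above (group k consecutive sampled minimizers into one seed, then index only seeds that occur once) can be sketched in a few lines of Python. This is not mapquik's implementation: the hash-threshold sampling rule, the CRC32 hash, and all parameter names and defaults are assumptions made for illustration.

```python
import zlib
from collections import Counter

def sample_minimizers(seq, l=5, density=0.5):
    """Keep (position, l-mer) pairs whose hash falls below a density threshold.

    Hypothetical sampling rule; with density=1.0 every l-mer is kept.
    """
    threshold = int(density * 2**32)
    return [(i, seq[i:i + l]) for i in range(len(seq) - l + 1)
            if zlib.crc32(seq[i:i + l].encode("ascii")) < threshold]

def k_min_mers(minimizers, k=3):
    """Group every k consecutive sampled minimizers into (start position, tuple of l-mers)."""
    return [(minimizers[i][0], tuple(m for _, m in minimizers[i:i + k]))
            for i in range(len(minimizers) - k + 1)]

def unique_index(reference, l=5, k=3, density=0.5):
    """Index only the k-min-mers that occur exactly once in the reference."""
    seeds = k_min_mers(sample_minimizers(reference, l, density), k)
    counts = Counter(seed for _, seed in seeds)
    return {seed: pos for pos, seed in seeds if counts[seed] == 1}
```

Because each seed spans several minimizers, it covers a long stretch of sequence while storing only a few sampled l-mers. Dropping repeated seeds keeps every index lookup unambiguous, at the cost of leaving repetitive regions unseeded.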

Raw nanopore signals can be analyzed while they are being generated, a process known as real-time analysis. Real-time analysis of raw signals is essential to utilize the unique features that nanopore sequencing provides, enabling the early stopping of the sequencing of a read or the entire sequencing run based on the analysis. The state-of-the-art mechanism, RawHash, offers the first hash-based efficient and accurate similarity identification between raw signals and a reference genome by quickly matching their hash values. In this work, we introduce RawHash2, which provides major improvements over RawHash, including more sensitive quantization and chaining algorithms, weighted mapping decisions, frequency filters to reduce ambiguous seed hits, minimizers for hash-based sketching, and support for the R10.4 flow cell version and the POD5 and SLOW5 file formats. Compared to RawHash, RawHash2 provides better F1 accuracy (on average 10.57% and up to 20.25%) and better throughput (on average 4.0× and up to 9.9×).

The development of long-read sequencing is promising for the high-quality and comprehensive de novo assembly of various species around the world. However, it is still challenging for assemblers to handle thousands of genomes, tens of gigabase-level genome sizes, and terabase-level datasets efficiently, which is a bottleneck for large-scale studies. A major cause is read overlapping graph construction, for which state-of-the-art tools usually require terabyte-level RAM space and days of runtime for large genomes. Such lower performance and scalability are not suited to the numerous samples being sequenced. Herein, we propose xRead, a novel iterative approach that achieves high performance, scalability, and yield simultaneously. Under the guidance of its coverage-based model, xRead converts read overlapping to heuristic read-mapping and incremental tasks with highly controllable RAM space and faster speed. It enables the processing of very large datasets (such as the 1.28 Tb Ambystoma mexicanum dataset) with less than 64 GB of RAM and obviously lower time costs. Moreover, benchmarks suggest that xRead can produce accurate and well-connected graphs, which are also supportive of various kinds of downstream assembly strategies. xRead is able to break through the bottleneck and lays a new foundation for de novo assembly. This tool can handle a large number of datasets from large genomes and may play important roles in many large-scale studies.

Abstract

Motivation: Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences, leading to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080–94. https://doi.org/10.1101/gr.275648.121), has been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated the efficacy.

Results: In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign’s accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes.

Availability and implementation: All methods and evaluation benchmarks are available in a public Github repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.

Nucleic Acids Research, Year: 2024, No. 52(14), pp. e61-e61
Published: June 17, 2024
Horizontal gene transfer (HGT) phenomena pervade the gut microbiome and significantly impact human health. Yet, no current method can accurately identify complete HGT events, including the transferred sequence and the associated deletion and insertion breakpoints, from shotgun metagenomic data. Here, we develop LocalHGT, which facilitates the reliable and swift detection of complete HGT events from shotgun metagenomic data, delivering an accuracy of 99.4% (verified by Nanopore data) across 200 gut microbiome samples, and achieving an average F1 score of 0.99 on 100 simulated samples. LocalHGT enables a systematic characterization of HGT events within the human gut microbiome across 2098 samples, revealing that multiple recipient genome sites can become targets of a transferred sequence, that microhomology is enriched in HGT breakpoint junctions (P-value = 3.3e-58), and that HGTs can function as host-specific fingerprints, as indicated by the higher HGT similarity of intra-personal temporal samples than of inter-personal samples (P-value = 4.3e-303). Crucially, HGTs showed potential contributions to colorectal cancer (CRC) and acute diarrhoea, as evidenced by the enrichment of the butyrate metabolism pathway (P-value = 3.8e-17) and the shigellosis pathway (P-value = 5.9e-13) in the respective HGTs. Furthermore, differential HGTs demonstrated promise as biomarkers for predicting various diseases. Integrating HGTs into a CRC prediction model achieved an AUC of 0.87.

Frontiers in Genetics, Year: 2024, No. 15
Published: Oct. 28, 2024
Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, that is, reads. State-of-the-art basecallers use complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, most reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps in the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall’s key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads, and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. Our thorough experimental evaluations show that TargetCall 1) improves the end-to-end basecalling runtime performance of the state-of-the-art basecaller by 3.31× while maintaining high (98.88%) recall in keeping on-target reads, 2) maintains high accuracy in downstream analysis, and 3) achieves better runtime performance, throughput, recall, precision, and generality than prior works. TargetCall is available at https://github.com/CMU-SAFARI/TargetCall.