2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM),
Journal year: 2023, Issue: unknown, pp. 920-925
Published: Dec. 5, 2023
To accelerate the mapping of vast amounts of long reads to references, a novel mapper, mapquik, achieves an over 30-fold speedup compared with the de facto standard mapper minimap2 while maintaining comparable mapping quality on the human genome. However, mapquik can only accurately map PacBio HiFi reads with sequencing error rates lower than 1%. Since 3rd generation sequencing reads with modest error rates higher than 1% are still widely used, such as reads from Nanopore DNA sequencing technology, a versatile long-read mapper should consider more cases. This paper adopts the mapping-friendly sequence reduction idea to compress reads from different sequencing technologies and boost the seed sensitivity of mapquik. For reads with relatively high error rates, we combine an error-sensitive order with a deep learning algorithm to replace the random universe minimizer in mapquik. An improved ultra-fast long-read mapper named mapquikPLUS is verified to handle most mapping tasks efficiently, and the pipeline of mapquikPLUS followed by [...] shows better performance against other approximate mappers for ANI evaluation.
It has been over a decade since the first publication of a method dedicated entirely to mapping long reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework, due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings (http://bcazaux.polytech-lille.net/Minimap2/).
Cancers,
Journal year: 2024, Issue: 16(7), pp. 1275-1275
Published: March 25, 2024
Cancer is a multifaceted disease arising from numerous genomic aberrations that have been identified as a result of advancements in sequencing technologies. While next-generation sequencing (NGS), which uses short reads, has transformed cancer research and diagnostics, it is limited by read length. Third-generation sequencing (TGS), led by the Pacific Biosciences and Oxford Nanopore Technologies platforms, employs long-read sequences, which have marked a paradigm shift in cancer research. Cancer genomes often harbour complex events, and TGS, with its ability to span large genomic regions, has facilitated their characterisation, providing a better understanding of how complex rearrangements affect cancer initiation and progression. TGS has also characterised the entire transcriptome of various cancers, revealing cancer-associated isoforms that could serve as biomarkers or therapeutic targets. Furthermore, TGS has advanced cancer research by improving genome assemblies, detecting complex variants, and providing a more complete picture of transcriptomes and epigenomes. This review focuses on the growing role of TGS in cancer research. We investigate its advantages and limitations, providing a rigorous scientific analysis of its use in detecting previously hidden aberrations missed by NGS. This promising technology holds immense potential for both research and clinical applications, with far-reaching implications for cancer diagnosis and treatment.
The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates.
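As a point of reference for the quantity being estimated, the exact Jaccard similarity on k-mer sets can be sketched in a few lines. This is an illustrative sketch only, not MashMap's implementation; the function names `kmer_set` and `jaccard` are hypothetical, and no minimizer winnowing or sketching is applied.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int) -> float:
    """Exact Jaccard similarity |A & B| / |A | B| of the two k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    # Two sequences shorter than k both have empty k-mer sets; treat as identical.
    return len(ka & kb) / len(union) if union else 1.0


if __name__ == "__main__":
    # {AC, CG, GT} vs {AC, CG, GA}: 2 shared k-mers out of 4 distinct ones.
    print(jaccard("ACGT", "ACGA", 2))
```

Sketching tools approximate this value from a small sample of each k-mer set rather than from the full sets, which is where estimator bias can enter.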
The exponential increase in sequencing data calls for conceptual and computational advances to extract useful biological insights. One such advance, minimizers, allows reducing the quantity of data handled while maintaining some of its key properties. We provide a basic introduction to minimizers, cover recent methodological developments, and review the diverse applications of minimizers to analyze genomic data, including de novo genome assembly, metagenomics, read alignment, read correction, and pangenomes. We also touch on alternative data sketching techniques, including universal hitting sets, syncmers, or strobemers. Minimizers and their alternatives have rapidly become indispensable tools for handling vast amounts of data.
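The data reduction that minimizers provide can be illustrated with a minimal sketch of window minimizer selection: from every window of w consecutive k-mers, keep only the smallest one under some ordering. The function name `minimizers`, the `key` parameter, and the choice of a BLAKE2b-based ordering are assumptions made for illustration; real tools use faster hash functions and streaming algorithms.

```python
import hashlib
from typing import Callable


def kmer_hash(kmer: str) -> int:
    """Pseudo-random 64-bit ordering value for a k-mer (illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq: str, k: int, w: int,
               key: Callable[[str], object] = kmer_hash) -> list[tuple[int, str]]:
    """Return (position, k-mer) of the minimizer of each window of w k-mers.

    Ties are broken by the leftmost position, and a minimizer shared by
    consecutive windows is reported once.
    """
    ranks = [key(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    picked: list[tuple[int, str]] = []
    for start in range(len(ranks) - w + 1):
        pos = min(range(start, start + w), key=lambda i: (ranks[i], i))
        if not picked or picked[-1][0] != pos:
            picked.append((pos, seq[pos:pos + k]))
    return picked


if __name__ == "__main__":
    # Lexicographic ordering makes the selection easy to follow by hand.
    print(minimizers("ACGTTGCA", k=3, w=3, key=lambda kmer: kmer))
```

Because every window contributes a selected k-mer, two sequences that share a long enough exact match are guaranteed to share at least one minimizer, which is one of the key properties preserved by the reduction.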
The development of long-read sequencing is promising for the high-quality and comprehensive de novo assembly of various species around the world. However, it is still challenging for assemblers to handle thousands of genomes, tens of gigabase-level genome sizes, and terabase-level datasets efficiently, which is a bottleneck to large-scale studies. A major cause is read overlapping graph construction, for which state-of-the-art tools usually cost terabyte-level RAM space and days of runtime on large genomes. Such low performance and scalability are not suited to the numerous samples being sequenced. Herein, we propose xRead, a novel iterative overlapping graph construction approach that achieves high performance, scalability, and yield simultaneously. Under the guidance of its coverage-based model, xRead converts read overlapping to heuristic read-mapping and incremental graph construction tasks with highly controllable RAM space and faster speed. It enables the processing of very large datasets (such as the 1.28 Tb Ambystoma mexicanum dataset) with less than 64 GB RAM and noticeably lower time costs. Moreover, the benchmarks suggest that it can produce accurate and well-connected overlapping graphs, which are also supportive of various kinds of downstream assembly strategies. xRead is able to break through the major bottleneck to graph construction and lays a new foundation for de novo assembly. This tool is suited to handle a large number of datasets from large genomes and may play important roles in many studies.
Abstract
Motivation: Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences, leading to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080–94. https://doi.org/10.1101/gr.275648.121), has been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated their efficacy.
Results: In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign's accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes.
Availability and implementation: All methods and benchmarks are available in a public Github repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis can be found at https://github.com/NBISweden/strobealign-evaluation.
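The role of the construction operator can be made concrete with a minimal sketch of order-2 randstrobe construction: each k-mer (the first strobe) is linked to a second k-mer chosen from a downstream window by minimizing a link function over their hashes. The XOR link function, the BLAKE2b hash, the window truncation at the sequence end, and the names `h64` and `randstrobes` are all assumptions for illustration; they are not taken from the paper or from strobealign, and this naive version costs O(window) per seed.

```python
import hashlib


def h64(s: str) -> int:
    """64-bit hash of a string (illustrative choice of hash function)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")


def randstrobes(seq: str, k: int, w_min: int, w_max: int) -> list[tuple[int, int, str]]:
    """Return (pos1, pos2, strobemer) for order-2 randstrobes.

    The second strobe starts in [pos1 + w_min, pos1 + w_max] and is the
    k-mer minimizing the assumed link function h(s1) XOR h(s2).
    """
    n = len(seq) - k + 1
    hashes = [h64(seq[i:i + k]) for i in range(n)]
    seeds = []
    for i in range(n):
        lo, hi = i + w_min, min(i + w_max, n - 1)
        if lo > hi:  # no room left for a second strobe
            break
        j = min(range(lo, hi + 1), key=lambda p: (hashes[i] ^ hashes[p], p))
        seeds.append((i, j, seq[i:i + k] + seq[j:j + k]))
    return seeds


if __name__ == "__main__":
    for seed in randstrobes("ACGTTGCAACGTGGTACCATGACCAGTTAG", k=4, w_min=5, w_max=10)[:3]:
        print(seed)
```

Swapping the expression inside `key` for a different operator is the kind of change whose effect on sampling uniformity the abstract describes measuring.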
Nucleic Acids Research,
Journal year: 2024, Issue: 52(17), pp. e82-e82
Published: Aug. 16, 2024
Abstract
Viral subgenomic RNA (sgRNA) plays a major role in SARS-CoV-2's replication, pathogenicity, and evolution. Recent sequencing protocols, such as the ARTIC protocol, have been established. However, due to viral-specific biological processes, analyzing sgRNA through read data is a computational challenge. Current methods rely on tools designed for eukaryote genomes, resulting in a gap in tools designed specifically for sgRNA detection. To address this, we make two contributions. Firstly, we present sgENERATE, an evaluation pipeline to study the accuracy and efficacy of sgRNA detection tools using the popular ARTIC protocol. Using sgENERATE, we evaluate periscope, a recently introduced tool that detects sgRNA from ARTIC sequencing data. We find that periscope has biased predictions and high computational costs. Secondly, using the information produced from sgENERATE, we redesign the periscope algorithm to use multiple references from canonical sgRNAs to mitigate alignment issues and improve the detection of sgRNA, including non-canonical sgRNA. We evaluate our algorithm, periscope_multi, on simulated and biological sequencing datasets and demonstrate periscope_multi's enhanced sgRNA detection accuracy. Our contribution advances tools for studying viral sgRNA, paving the way for more accurate and efficient analyses in the context of viral RNA discovery.
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: Nov. 3, 2024
Abstract
A key step in sequence similarity search is to identify seeds that are found in both the query and the reference sequence. A seed is a shorter substring (e.g., a k-mer) or pattern (e.g., a spaced k-mer) constructed from the query and reference sequences. A well-known trade-off in applications such as read mapping is that longer seeds offer fast searches through fewer spurious matches but lower sensitivity in variable regions, as longer seeds are more likely to harbor mutations. Some recent developments on seed constructs have considered approximate (or fuzzy) seeds such as k-min-mers, strobemers, BLEND, SubSeqHash, TensorSketch, and more, that can match over smaller mutations and, thus, suffer less from sensitivity issues in variable regions. Nevertheless, the sensitivity-to-speed trade-off still exists for such constructs. In other applications, such as genome assembly, using multiple sizes of k-mers is effective. While this can be achieved through, e.g., MEM construction from an FM-index, such an index is typically much slower than hash-based constructs. To this end, we introduce multi-context seeds (MCS). In brief, MCS are strobemers where the hashes of individual strobes are partitioned in the hash value representing the seed. Such partitioning enables a cache-friendly approach to search for both full and partial matches of a subset of strobes. For example, both the full strobemer and the first strobe (a k-mer) can be queried. We demonstrate that MCS improves sequence matching statistics over standard strobemers and k-mers without compromising seed uniqueness. We demonstrate the practical applicability of MCS by implementing them in strobealign. Strobealign with MCS comes at no cost in memory and only a little cost in runtime while offering increased mapping accuracy over default strobealign using simulated Illumina reads across genomes of various complexity. We also show that strobealign with MCS outperforms minimap2 in short-read mapping and is comparable to BWA-MEM in accuracy in high-variability regions. MCS provides a fast seed alternative that addresses the trade-offs between seed length and alignment accuracy.
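The partitioning idea described above can be sketched as follows: if the high bits of a seed's hash value come only from the first strobe and the low bits only from the second, then a sorted index supports both an exact lookup of the full seed and a range lookup of everything sharing the first strobe. This is a minimal sketch under stated assumptions, not strobealign's implementation; the 40/24 bit split, the BLAKE2b hash, and the names `mcs_hash`, `build_index`, `query_full`, and `query_partial` are hypothetical, and the strobes are supplied directly rather than selected by a link function.

```python
import bisect
import hashlib

PREFIX_BITS = 40                # assumed number of bits taken from the first strobe
SUFFIX_BITS = 64 - PREFIX_BITS  # remaining bits taken from the second strobe


def h64(s: str) -> int:
    """64-bit hash of a string (illustrative choice of hash function)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")


def mcs_hash(strobe1: str, strobe2: str) -> int:
    """Seed value whose high bits depend only on strobe1, low bits only on strobe2."""
    hi = h64(strobe1) >> (64 - PREFIX_BITS)
    lo = h64(strobe2) >> (64 - SUFFIX_BITS)
    return (hi << SUFFIX_BITS) | lo


def build_index(seeds):
    """Build a sorted index from (strobe1, strobe2, position) triples."""
    entries = sorted((mcs_hash(s1, s2), pos) for s1, s2, pos in seeds)
    return [key for key, _ in entries], [pos for _, pos in entries]


def query_full(index, strobe1: str, strobe2: str) -> list[int]:
    """Positions of seeds whose full hash value matches."""
    keys, positions = index
    target = mcs_hash(strobe1, strobe2)
    return positions[bisect.bisect_left(keys, target):bisect.bisect_right(keys, target)]


def query_partial(index, strobe1: str) -> list[int]:
    """Positions of seeds sharing the first strobe: one contiguous range of the index."""
    keys, positions = index
    hi = h64(strobe1) >> (64 - PREFIX_BITS)
    start = bisect.bisect_left(keys, hi << SUFFIX_BITS)
    end = bisect.bisect_left(keys, (hi + 1) << SUFFIX_BITS)
    return positions[start:end]
```

Because all seeds with the same first strobe are adjacent in the sorted order, a failed full lookup can fall back to the partial range nearby in memory, which is one way to read the cache-friendliness the abstract mentions.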
bioRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: Nov. 28, 2024
ABSTRACT
The rapid advancements in DNA sequencing technology have led to an unprecedented increase in the generation of genomic datasets, with modern sequencers now capable of producing up to ten terabases per run. However, the effective indexing and analysis of this vast amount of data pose significant challenges to the scientific community. K-mer indexing has proven crucial in managing extensive datasets across a wide range of applications, including alignment, compression, dataset comparison, error correction, assembly, and quantification. As a result, developing efficient and scalable k-mer indexing methods has become an increasingly important area of research. Despite the progress made, current state-of-the-art indexing structures are predominantly static, necessitating resource-intensive index reconstruction when integrating new data. Recently, the need for dynamic indexing structures has been recognized. However, many proposed solutions are only pseudo-dynamic, requiring substantial updates to justify the costs of adding new datasets. In practice, applications often rely on standard hash tables to associate data with their k-mers, leading to high k-mer encoding rates exceeding 64 bits per k-mer. In this work, we introduce Brisk, a drop-in replacement for most k-mer dictionary applications. This novel hashmap-like data structure provides high throughput while significantly reducing memory usage compared to existing dynamic associative indexes, particularly for large k-mer sizes. Brisk achieves this by leveraging hierarchical minimizer indexing and a memory-efficient super-k-mer representation. We also introduce techniques for efficiently probing k-mers within a set of super-k-mers and managing duplicated minimizers. We believe that the methodologies developed in this work represent a significant advancement in the creation of efficient and scalable k-mer dictionaries, greatly facilitating their routine use in genomic data analysis.
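The memory saving behind a super-k-mer representation can be seen in a small sketch: consecutive k-mers that share the same minimizer overlap by k-1 bases, so they can be stored as one longer substring instead of separately. This is a simplified illustration, not Brisk's data structure; the names `minimizer` and `super_kmers` are hypothetical, a lexicographic minimizer order is assumed for readability, and k-mers are grouped by minimizer sequence only (not by minimizer position).

```python
def minimizer(kmer: str, m: int) -> str:
    """Lexicographically smallest m-mer inside a k-mer (assumed ordering)."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def super_kmers(seq: str, k: int, m: int) -> list[tuple[str, str]]:
    """Split seq into (minimizer, super-k-mer) pairs.

    Each super-k-mer is the substring spanned by a maximal run of
    consecutive k-mers that share the same minimizer.
    """
    out: list[tuple[str, str]] = []
    current, start = None, 0
    for i in range(len(seq) - k + 1):
        mz = minimizer(seq[i:i + k], m)
        if current is None:
            current, start = mz, i
        elif mz != current:
            # The run covered k-mers start..i-1, i.e. bases start..i-1+k-1.
            out.append((current, seq[start:i - 1 + k]))
            current, start = mz, i
    if current is not None:
        out.append((current, seq[start:]))
    return out


if __name__ == "__main__":
    # Five 5-mers (25 bases stored naively) collapse into three substrings.
    print(super_kmers("ACGTACGGT", k=5, m=3))
```

A dictionary keyed by minimizer can then hold the super-k-mers of each bucket; the same minimizer may appear in several super-k-mers, which is the duplicated-minimizer situation the abstract refers to.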