Frontiers in Microbiology, Journal year: 2024, Issue: 15, Published: Dec. 23, 2024
dna2bit is an ultra-fast software tool specifically engineered for microbial genome analysis, particularly adept at calculating distances within metagenome and single amplified genome datasets. Distinguished from existing tools such as Mash and Dashing, it employs a feature hashing technique and Hamming distance to achieve enhanced speed and memory utilization, without sacrificing accuracy in average nucleotide identity (ANI) calculations. dna2bit has promising applications in various domains, including ANI approximation, metagenomic sequence clustering, and homology querying. It significantly boosts computational efficiency when handling large datasets, including collections of genomes, thereby facilitating a better understanding of population heterogeneity and the comparative genomics of microorganisms. dna2bit is available at https://github.com/lijuzeng/dna2bit.
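To make the sketching idea above concrete, here is a minimal Python sketch of hashing k-mers into a fixed-size bit vector and comparing two such sketches by Hamming distance. It is not dna2bit's actual implementation; the k-mer length, sketch width, and hash function are arbitrary assumptions for illustration.

import hashlib

K = 21              # k-mer length (assumed for illustration)
SKETCH_BITS = 4096  # sketch width in bits (assumed for illustration)

def sketch(seq: str) -> int:
    """Hash every k-mer of seq into a SKETCH_BITS-wide bit vector (stored as an int)."""
    bits = 0
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K].encode()
        h = int.from_bytes(hashlib.blake2b(kmer, digest_size=8).digest(), "big")
        bits |= 1 << (h % SKETCH_BITS)
    return bits

def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where two sketches differ; a proxy for genomic distance."""
    return bin(a ^ b).count("1")

# Similar sequences set mostly the same bits, so their sketches have a small Hamming distance.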
ACM Transactions on Architecture and Code Optimization, Journal year: 2023, Issue: 21(1), pp. 1-29, Published: Dec. 28, 2023
Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures, where states and edges capture modifications (i.e., insertions, deletions, substitutions) by assigning probabilities to them. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these probabilities to optimize the scores. Accurate computation of these probabilities is essential for correct identification of similarities. However, the Baum-Welch algorithm is computationally intensive, and existing solutions offer either software-only or hardware-only approaches with fixed designs. Our analysis of state-of-the-art works shows an urgent need for a flexible, high-performance, energy-efficient hardware-software co-design to address the major inefficiencies of the Baum-Welch algorithm for pHMMs. We introduce ApHMM, the first flexible acceleration framework designed to significantly reduce both the computational and energy overheads associated with the Baum-Welch algorithm for pHMMs. ApHMM tackles these inefficiencies by (1) designing flexible hardware to accommodate different pHMM designs, (2) exploiting predictable data dependency patterns through on-chip memory and memoization techniques, (3) rapidly filtering out unnecessary computations using a hardware-based filter, and (4) minimizing redundant computations. ApHMM achieves substantial speedups of 15.55×–260.03×, 1.83×–5.34×, and 27.97× when compared to CPU, GPU, and FPGA implementations, respectively. It outperforms CPU implementations of three key pHMM-based applications, error correction, protein family search, and multiple sequence alignment, by 1.29×–59.94×, 1.03×–1.75×, and 1.03×–1.95×, respectively, while improving their energy efficiency by 64.24×–115.46×, 1.75×, and 1.96×.
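As background for the probability bookkeeping that the Baum-Welch algorithm iterates over, the following toy Python forward pass scores a sequence against a small HMM. It only illustrates how start, transition, and emission probabilities combine into a similarity score; the model parameters are invented and this is not ApHMM's accelerator logic.

import numpy as np

# Toy two-state model (parameters invented for illustration only).
trans = np.array([[0.9, 0.1],       # P(next state | current state)
                  [0.2, 0.8]])
emit = {"A": np.array([0.7, 0.1]),  # P(symbol | state)
        "C": np.array([0.1, 0.7]),
        "G": np.array([0.1, 0.1]),
        "T": np.array([0.1, 0.1])}
start = np.array([0.5, 0.5])

def forward_score(seq: str) -> float:
    """Log-likelihood of seq under the toy HMM, i.e., its similarity score."""
    alpha = start * emit[seq[0]]
    for sym in seq[1:]:
        alpha = (alpha @ trans) * emit[sym]
    return float(np.log(alpha.sum()))

# Baum-Welch repeatedly runs forward/backward passes like this one to re-estimate the probabilities.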
Nucleic Acids Research, Journal year: 2024, Issue: 52(17), pp. e82-e82, Published: Aug. 16, 2024
Abstract
Viral subgenomic RNA (sgRNA) plays a major role in SARS-CoV-2's replication, pathogenicity, and evolution. Recent sequencing protocols, such as the ARTIC protocol, have been established. However, due to viral-specific biological processes, analyzing sgRNA from read data is a computational challenge. Current methods rely on tools designed for eukaryote genomes, resulting in a gap in tools designed specifically for sgRNA detection. To address this, we make two contributions. Firstly, we present sgENERATE, an evaluation pipeline to study the accuracy and efficacy of sgRNA detection using the popular ARTIC protocol. Using sgENERATE, we evaluate periscope, a recently introduced tool that detects sgRNA from sequencing data. We find that periscope has biased predictions and high computational costs. Secondly, using the information produced by sgENERATE, we redesign the algorithm to use multiple references from canonical sgRNAs to mitigate alignment issues and improve non-canonical sgRNA detection. We evaluate our algorithm, periscope_multi, on simulated datasets and demonstrate periscope_multi's enhanced accuracy. Our contribution advances tools for studying viral sgRNA, paving the way for more accurate and efficient analyses in the context of sgRNA discovery.
bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown, Published: Oct. 6, 2024
Abstract
Measurement of the DNA contents of genomes is valuable for understanding genome biology, including assessments of assemblies, but it is not a trivial problem. Measuring DNA content from shotgun reads is complicated by several factors: biological variation at the species, individual, and tissue or cell levels; laboratory methods; sequencing technology; and computational processing for measurement and assembly. This work compares, and shares, complications with cytometric (Cym) and related molecular measurements of genome size and contents. There is an obvious discrepancy between current long-read assemblies (Asm) and cytometric measurements: assemblies average 12% below Cym-measured sizes, differing in the amounts of duplicated content. This report examines five read types to see if they can be used for more precise and reliable discrimination of major genome sizes. The read types are short, accurate Illumina reads; long Pacific Biosciences reads of low and high accuracy; and Oxford Nanopore Technology reads of low and high accuracy. Gnodes is the tool used, which maps reads to an assembly and measures copy numbers of genes, transposons, repeats, and other content, using as a unit the single copies of unique conserved genes. Public data of well-studied genomes, including human, corn, zebrafish, sorghum and rice, are the primary evidence for this work. Results are mixed and open to interpretations: in broad terms, all read types measure about the same DNA contents, with roughly 90% agreement, a level to which other factors contribute. For precision above that level, the read types differ, supporting larger sizes (low accuracy reads) or smaller assembly sizes (high accuracy reads), with short reads roughly in between. The weight of evidence suggests the larger estimates are the less biased measurement, as the others have a bias toward reduced duplications introduced by averaging and filtering. The complicating factors noted here can produce discrepancies larger than the Cym - Asm difference, a problem to control.
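A back-of-the-envelope Python sketch of the coverage-based size estimate that this kind of analysis relies on: single-copy unique conserved genes calibrate read depth, and total mapped bases divided by that depth approximates genome size. The numbers and function name are illustrative assumptions, not Gnodes itself.

def estimate_genome_size(total_mapped_bases: float, unique_gene_depth: float) -> float:
    """Genome size ~= total mapped bases / read depth over single-copy conserved genes."""
    return total_mapped_bases / unique_gene_depth

# Example: 90 Gb of mapped read bases at 30x depth over unique conserved genes
# implies a ~3 Gb genome; filtering or averaging that inflates the apparent depth
# over duplicated spans deflates the estimate accordingly.
print(estimate_genome_size(90e9, 30.0))  # 3000000000.0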
Briefings in Bioinformatics, Journal year: 2024, Issue: 25(6), Published: Sep. 23, 2024
Abstract
Sequences derived from organisms sharing common evolutionary origins exhibit similarity, while unique sequences, absent in related organisms, act as good diagnostic marker candidates. However, the approach focused on identifying dissimilar regions among closely related organisms poses challenges, as it requires complex multiple sequence alignments, making computation and parsing difficult. To address this, we have developed a biologically inspired, universal NAUniSeq algorithm to find unique sequences for microorganism diagnosis by traveling through the phylogeny of life. Mapping sequences onto the phylogenetic tree ensures a low number of cross-contamination false positives. We downloaded complete taxonomy data from the Taxadb database and sequences from the National Center for Biotechnology Information Reference Sequence Database (NCBI-Refseq) and, with the help of NetworkX, created a phylogenetic tree. Sequences were assigned over the graph nodes, and a k-mer search between target and non-target nodes was performed using a depth-first traversal algorithm. In a memory-efficient alternative NoSQL approach, a collection of Refseq sequences was stored in MongoDB with the tax-id and the path to the FASTA files, and sequences were queried from it. In both approaches, we used an alignment-free, sliding-window, k-mer-based procedure that quickly compares sequences and returns those that are not present in the non-target nodes. We validated our approach with Mycobacterium tuberculosis, Neisseria gonorrhoeae, and Monkeypox virus and generated unique sequences for them. This is a powerful tool for generating diagnostic markers, enabling accurate identification of microbial strains with high precision.
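The following Python sketch illustrates the alignment-free, sliding-window, k-mer screening idea described above: keep windows of a target sequence that share no k-mer with any non-target sequence. The window size, k, and function names are assumptions for illustration, not the NAUniSeq implementation.

def kmers(seq: str, k: int):
    """All k-length substrings of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def unique_windows(target: str, non_targets: list, k: int = 21, window: int = 100):
    """Yield (position, window) pairs of target sharing no k-mer with any non-target."""
    non_target_kmers = set()
    for seq in non_targets:
        non_target_kmers.update(kmers(seq, k))
    for i in range(len(target) - window + 1):
        win = target[i:i + window]
        if all(km not in non_target_kmers for km in kmers(win, k)):
            yield i, win

# Windows returned here are candidate diagnostic markers; in the approach above, the
# non-target set would come from neighboring nodes of the phylogenetic tree.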
Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. Although k-mers and spaced k-mers are likely the most well-known and used seeds, their sensitivity suffers at high error rates, particularly when indels are present. Recently, we developed a pseudorandom seeding construct, strobemers, which was empirically shown to have high sensitivity also at high indel rates. However, that study lacked a deeper understanding of why. In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. Our discovered randomness-sensitivity relationship explains why some seeds perform better than others, and it provides a framework for designing even more sensitive seeds. We present three new strobemer constructs: mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to show that our new constructs improve sequence-matching sensitivity over other strobemers and are useful for read mapping and ANI estimation. For read mapping, we implement strobemers into minimap2 and observe about 30% faster alignment time at 0.2% higher accuracy than when using k-mers on reads with high error rates. As for ANI estimation, we find that higher-entropy seeds give a higher rank correlation between estimated and true ANI.
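A minimal Python sketch of an order-2 randstrobe-style seed, illustrating the pseudorandom linking idea behind strobemers: a first strobe is linked to a second strobe chosen within a downstream window by minimizing a hash of the pair. The hash and window parameters are simplified assumptions, not the exact constructs (mixedstrobes, altstrobes, multistrobes) introduced above.

def randstrobes2(seq: str, k: int = 10, w_min: int = 15, w_max: int = 30):
    """Return (position, seed) pairs, each seed being two linked k-mers (strobes)."""
    seeds = []
    for i in range(len(seq) - k - w_max + 1):
        strobe1 = seq[i:i + k]
        # pick the second strobe inside [i + w_min, i + w_max) minimizing a pair hash
        j = min(range(i + w_min, i + w_max),
                key=lambda j: hash((strobe1, seq[j:j + k])))
        seeds.append((i, strobe1 + seq[j:j + k]))
    return seeds

# Because the second strobe may "slip" within its window, two sequences differing by
# an indel can still share many such seeds, unlike contiguous 2k-mers.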
High-throughput sequencing (HTS) technologies have revolutionized the field of genomics, enabling rapid and cost-effective genome analysis for various applications. However, the increasing volume of genomic data generated by HTS presents significant challenges for computational techniques to effectively analyze genomes. To address these challenges, several algorithm-architecture co-design works have been proposed, targeting different steps of the genome analysis pipeline. These works explore emerging technologies to provide fast, accurate, and low-power genome analysis. This paper provides a brief review of recent advancements in accelerating genome analysis, covering the opportunities associated with accelerating its key steps. Our review highlights the importance of integrating multiple steps using suitable architectures to unlock significant performance improvements and reduce data movement and energy consumption. We conclude by emphasizing the need for novel strategies to meet the growing demands of next-generation genome analysis.
The high execution time of DNA sequence alignment negatively affects many genomic studies that rely on its results. Pre-alignment filtering was introduced as a step before alignment to greatly reduce the execution time of short-read alignment. With its success, i.e., achieving high accuracy and thus removing unnecessary alignments, filtering itself now constitutes a larger portion of the execution time. A significant contributing factor entails the movement of sequences from memory to the processing units, even though the majority of these sequences will be filtered out because they do not result in an acceptable alignment. State-of-the-art (SotA) pre-alignment filtering accelerators suffer from the same overhead for data movements. Furthermore, these accelerators lack support for future filtering algorithms that use operations not supported by the underlying hardware. This paper addresses these shortcomings by introducing SieveMem. SieveMem is an architecture that exploits the Computation-in-Memory paradigm with memristive-based devices to run the shared kernels of pre-alignment filters inside memory (i.e., preventing data movements). SieveMem also provides support for future algorithms. It supports more than 47.6% of the kernels shared among all top-5 SotA filters. Moreover, SieveMem includes a hardware-friendly filtering algorithm called BandedKrait, inspired by a combination of the mentioned kernels. Our evaluations show up to 331.1× and 446.8× improvement for the two most common kernels and BandedKrait at the kernel level. Using SieveMem in a design we call Mem-BandedKrait, one can improve end-to-end pre-alignment filtering irrespective of the dataset, with improvements of up to 91.4× compared to a SotA accelerator and GPU.
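To make the pre-alignment filtering idea concrete, here is a simple q-gram count filter in Python: it cheaply bounds the number of edits between a read and a candidate reference window and lets the expensive aligner run only on pairs that pass. This is a generic illustration, not SieveMem's BandedKrait algorithm or its in-memory kernels.

def qgram_filter(read: str, ref: str, q: int = 5, max_edits: int = 5) -> bool:
    """Return True if read and ref may align within max_edits edits (q-gram lemma)."""
    ref_qgrams = {ref[i:i + q] for i in range(len(ref) - q + 1)}
    missing = sum(1 for i in range(len(read) - q + 1)
                  if read[i:i + q] not in ref_qgrams)
    # a single edit destroys at most q overlapping q-grams, so more than
    # max_edits * q missing q-grams rules out an acceptable alignment
    return missing <= max_edits * q

# Usage: only pairs passing the filter are sent to the (expensive) aligner.
# if qgram_filter(read, ref): score = full_alignment(read, ref)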
Abstract
Motivation
Pairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers of a read are hashed and the minimum hash value, the min-hash, is stored. Similarities can then be estimated by counting the number of min-hash matches between a pair of reads, across many distinct hash functions. The choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments (high precision) but lead to missing alignments (low recall), particularly in the presence of significant noise.
Results
In this work, we introduce LexicHash, a new similarity estimation method that is effectively independent of k and attains the high precision of large-k along with the high sensitivity of small-k MinHash. LexicHash is a variant of MinHash with a carefully designed hash function. When estimating the similarity of two reads, instead of simply checking whether min-hashes match (as in standard MinHash), one checks how "lexicographically similar" the min-hashes are. In our experiments on 40 PacBio datasets, the area under the precision-recall curves obtained by LexicHash had an average improvement of 20.9% over MinHash. Additionally, the LexicHash framework lends itself naturally to an efficient search for the largest alignments, yielding an O(n) time algorithm and circumventing the seemingly fundamental O(n²) scaling associated with pairwise similarity search.
Availability and implementation
LexicHash is available on GitHub at https://github.com/gcgreenberg/LexicHash.
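The MinHash scheme described in the Motivation can be sketched in a few lines of Python: one min-hash per salted hash function, with similarity estimated as the fraction of matching min-hashes. LexicHash's lexicographic comparison of min-hashes is only paraphrased in the comments; the parameters below are illustrative assumptions, not the LexicHash implementation.

import hashlib

def min_hashes(seq: str, k: int = 14, n_hashes: int = 64):
    """One min-hash per salted hash function over all k-mers of seq."""
    sketch = []
    for salt in range(n_hashes):
        salt_bytes = salt.to_bytes(8, "big")
        sketch.append(min(
            int.from_bytes(hashlib.blake2b(seq[i:i + k].encode(),
                                           digest_size=8, salt=salt_bytes).digest(), "big")
            for i in range(len(seq) - k + 1)))
    return sketch

def minhash_similarity(sketch_a, sketch_b) -> float:
    """Fraction of hash functions whose min-hashes agree (standard MinHash estimate).
    LexicHash instead scores how lexicographically similar the two min-hash values are."""
    return sum(a == b for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)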
bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2023, Issue: unknown, Published: Dec. 21, 2023
Abstract
Significant discrepancies in genome sizes measured by cytometric methods versus DNA sequence estimates are frequent, including in recent long-read assemblies of plant and animal genomes. A new measure using a baseline of unique conserved genes, Gnodes, finds the larger measures are often accurate. DNA-informatic estimates of genome size, as well as assembly methods, have errors in methodology that under-measure duplicated spans. Major contents of several model and discrepant genomes are assessed here, including human, corn, chicken, insects, crustaceans, and a plant. Transposons dominate these genomes, while structural repeats form a major portion of the smaller ones. Gene coding sequences are found in similar amounts across the taxonomic spread. The largest contributors to the size discrepancy are higher-order repeats, but there is significant missed content, including transposons, in some examined species. Informatics methods for measuring and producing assemblies, including telomere-to-telomere approaches, are subject to mistakes in operation and/or interpretation that are biased against duplications. Mistaken aspects include alignment methods inaccurate for high-copy spans; misclassification of true repetitive content as heterozygosity artifact; software default settings that exclude duplicated DNA; and overly conservative data processing that reduces genomic content. Re-assemblies with balanced settings recover missing portions of problem genomes of a plant, water fleas and a fire ant.
bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown, Published: March 8, 2024
Abstract
Motivation
Genomic distance estimation is a critical workload since exact computation for whole-genome similarity metrics such as Average Nucleotide Identity (ANI) incurs prohibitive runtime overhead. Genome sketching is a fast and memory-efficient solution to estimate ANI by distilling representative k-mers from the original sequences. In this work, we present HyperGen, which improves accuracy, runtime performance, and memory efficiency for large-scale ANI estimation. Unlike existing genome sketching algorithms that convert large genome files into discrete k-mer hashes, HyperGen leverages the emerging hyperdimensional computing (HDC) paradigm to encode genomes into quasi-orthogonal vectors (Hypervector, HV) in high-dimensional space. The HV is compact and can preserve more information, allowing accurate ANI estimation while reducing the required sketch sizes. In particular, the HV representation allows efficient ANI estimation using vector multiplication, which naturally benefits from highly optimized general matrix multiply (GEMM) routines. As a result, HyperGen enables fast sketching and ANI estimation for massive genome collections.
Results
We evaluate HyperGen's sketching and database search performance on several genome datasets at various scales. HyperGen is able to achieve comparable or superior ANI estimation error and linearity compared to other sketch-based counterparts. The measurement results show that HyperGen is one of the fastest tools for both sketching and database search. Meanwhile, HyperGen produces compact sketches while ensuring high ANI estimation accuracy.
Availability
A Rust implementation of HyperGen is freely available under the MIT license as an open-source software project at https://github.com/wh-xu/Hyper-Gen. The scripts to reproduce the experimental results can be accessed at https://github.com/wh-xu/experiment-hyper-gen.
Contact
[email protected]
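A rough Python sketch of the hyperdimensional encoding idea behind HyperGen: each k-mer maps to a pseudo-random ±1 hypervector, a genome is the normalized sum of its k-mer hypervectors, and similarity is a dot product, which for many genomes at once reduces to a single GEMM. Dimensionality, hashing, and normalization here are illustrative assumptions, not HyperGen's Rust implementation.

import zlib
import numpy as np

D = 4096  # hypervector dimensionality (assumed for illustration)

def kmer_hypervector(kmer: str) -> np.ndarray:
    """Deterministic pseudo-random +/-1 hypervector for a k-mer."""
    rng = np.random.default_rng(zlib.crc32(kmer.encode()))
    return rng.choice([-1.0, 1.0], size=D)

def genome_hypervector(seq: str, k: int = 21) -> np.ndarray:
    """Sum of all k-mer hypervectors, normalized to unit length."""
    hv = np.zeros(D)
    for i in range(len(seq) - k + 1):
        hv += kmer_hypervector(seq[i:i + k])
    return hv / np.linalg.norm(hv)

def similarity(hv_a: np.ndarray, hv_b: np.ndarray) -> float:
    """Cosine similarity of two genome hypervectors; batching many genomes becomes a matrix multiply."""
    return float(hv_a @ hv_b)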