dna2bit: high performance genomic distance estimation software for microbial genome analysis DOI Creative Commons

J. Li, Yuxin Tian, Zhichao Wang

et al.

Frontiers in Microbiology, Year: 2024, Volume 15

Published: Dec. 23, 2024

dna2bit is an ultra-fast software tool specifically engineered for microbial genome analysis, particularly adept at calculating distances within metagenome and single amplified genome datasets. Distinguished from existing tools such as Mash and Dashing, dna2bit employs a feature hashing technique and Hamming distance to achieve enhanced speed and memory utilization, without sacrificing the accuracy of average nucleotide identity calculations. dna2bit has promising applications in various domains, including average nucleotide identity approximation, metagenomic sequence clustering, and homology querying. dna2bit significantly boosts computational efficiency in handling large datasets, including genomes, thereby facilitating a better understanding of population heterogeneity and comparative genomics of microorganisms. dna2bit is available at https://github.com/lijuzeng/dna2bit .
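The combination of feature hashing and Hamming distance named in the abstract can be illustrated with a minimal sketch. This is not dna2bit's implementation: the k-mer length, the number of buckets, the BLAKE2b hash, and the function names `bit_sketch` and `hamming` are all assumptions chosen for illustration.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def bit_sketch(seq, k=4, n_bits=64):
    """Feature-hash k-mers into n_bits signed buckets and keep only the signs.

    Hypothetical parameters: the source does not state k, the bucket
    count, or the hash function used by dna2bit.
    """
    buckets = [0] * n_bits
    for kmer in kmers(seq.upper(), k):
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        index = h % n_bits                   # bucket chosen by the hash
        sign = 1 if (h >> 63) & 1 else -1    # top bit decides +1 or -1
        buckets[index] += sign
    bits = 0
    for i, value in enumerate(buckets):
        if value > 0:                        # one bit per bucket
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits between two integer sketches."""
    return bin(a ^ b).count("1")
```

Comparing two genomes then reduces to one XOR and a bit count on fixed-size integers, which is the kind of operation that makes bit-vector distances cheap in both time and memory.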

Language: English

ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-efficient Genome Analysis DOI Open Access
Can Fırtına, Kamlesh Pillai, Gurpreet S. Kalsi

et al.

ACM Transactions on Architecture and Code Optimization, Year: 2023, Volume 21(1), pp. 1-29

Published: Dec. 28, 2023

Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures, where states and edges capture modifications (i.e., insertions, deletions, and substitutions) by assigning probabilities to them. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these probabilities to optimize and compute similarity scores. Accurate computation of these probabilities is essential for the correct identification of sequence similarities. However, the Baum-Welch algorithm is computationally intensive, and existing solutions offer either software-only or hardware-only approaches with fixed pHMM designs. When we analyze state-of-the-art works, we identify an urgent need for a flexible, high-performance, and energy-efficient hardware-software co-design to address the major inefficiencies in the Baum-Welch algorithm for pHMMs. We introduce ApHMM, the first flexible acceleration framework designed to significantly reduce both the computational and energy overheads associated with the Baum-Welch algorithm for pHMMs. ApHMM employs hardware-software co-design to tackle the major inefficiencies by (1) designing flexible hardware to accommodate various pHMM designs, (2) exploiting predictable data dependency patterns through on-chip memory with memoization techniques, (3) rapidly filtering out unnecessary computations using a hardware-based filter, and (4) minimizing redundant computations. ApHMM achieves substantial speedups of 15.55×–260.03×, 1.83×–5.34×, and 27.97× when compared to CPU, GPU, and FPGA implementations of the Baum-Welch algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations in three key bioinformatics applications: error correction, protein family search, and multiple sequence alignment, by 1.29×–59.94×, 1.03×–1.75×, and 1.03×–1.95×, respectively, while improving their energy efficiency by 64.24×–115.46×, 1.75×, and 1.96×.
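To make the notion of a probability-based similarity score concrete, the following minimal sketch computes the forward pass that Baum-Welch builds on, for a toy two-state HMM. It is an illustration only, not ApHMM's design: a real pHMM also has silent delete states and position-specific parameters, and the states `M` and `I`, the probabilities, and the name `forward_likelihood` are assumptions.

```python
# Toy model (assumed values): a "match" state M and an "insert" state I.
START = {"M": 0.8, "I": 0.2}
TRANS = {"M": {"M": 0.9, "I": 0.1}, "I": {"M": 0.5, "I": 0.5}}
EMIT = {
    "M": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "I": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}


def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) for a non-empty sequence via the forward pass."""
    states = list(start)
    # alpha[s] = probability of the prefix seen so far, ending in state s
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            t: emit[t][symbol] * sum(alpha[s] * trans[s][t] for s in states)
            for t in states
        }
    return sum(alpha.values())
```

Each step depends only on the previous `alpha` values, which is the regular data dependency pattern that memoization and on-chip buffering can exploit.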

Language: English

Cited by: 4

Improved sub-genomic RNA prediction with the ARTIC protocol DOI Creative Commons
Thomas Baudeau, Kristoffer Sahlin

Nucleic Acids Research, Year: 2024, Volume 52(17), pp. e82-e82

Published: Aug. 16, 2024

Abstract: Viral subgenomic RNA (sgRNA) plays a major role in SARS-COV2’s replication, pathogenicity, and evolution. Recent sequencing protocols, such as the ARTIC protocol, have been established. However, due to viral-specific biological processes, analyzing sgRNA through read data is a computational challenge. Current methods rely on tools designed for eukaryote genomes, resulting in a gap in tools designed specifically for sgRNA detection. To address this, we make two contributions. Firstly, we present sgENERATE, an evaluation pipeline to study the accuracy and efficacy of sgRNA detection tools using the popular ARTIC protocol. Using sgENERATE, we evaluate periscope, a recently introduced tool that detects sgRNA from ARTIC data. We find that periscope has biased predictions and high computational costs. Secondly, using the information produced by sgENERATE, we redesign the algorithm in periscope to use multiple references from canonical sgRNAs to mitigate alignment issues and improve sgRNA and non-canonical sgRNA detection. We evaluate our algorithm, periscope_multi, on simulated datasets and demonstrate periscope_multi's enhanced accuracy. Our contribution advances tools for studying viral sgRNA, paving the way for more accurate and efficient analyses in the context of discovery.

Language: English

Cited by: 1

Measuring DNA contents of animal and plant genomes with Gnodes, the long and short of it. DOI Creative Commons
Don Gilbert

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Volume unknown

Published: Oct. 6, 2024

Abstract: Measurement of the DNA contents of genomes is valuable for understanding genome biology, including assessments of genome assemblies, but it is not a trivial problem. Measuring contents from shotgun DNA reads is complicated by several factors: biological (at species, individual, and tissue or cell levels), laboratory methods, sequencing technology, and computational processing for measurement and assembly. This measurement compares with, and shares complications with, cytometric (Cym) and related molecular measurements of genome size and contents. There is an obvious discrepancy between Cym and current long-read assemblies (Asm): assemblies average 12% below Cym-measured sizes, differing in amounts of duplicated content. This report examines five read types to see if they can be used for more precise and reliable discrimination of major contents and sizes. The read types are short, accurate Illumina; long Pacific Biosciences, of low and high accuracy; and Oxford Nanopore Technology, of low and high accuracy. Gnodes is the tool used, which maps reads to an assembly and measures copy numbers of genes, transposons, repeats, and others, using as its unit the single copies of unique conserved genes. Public data of well-studied genomes, including human, corn, zebrafish, sorghum, and rice, are the primary evidence for this work. Results are mixed and open to interpretation. In broad terms, all read types measure about the same contents, at 90% agreement, a level to which other factors contribute. For precision above that level, the read types differ, some supporting larger sizes (low accuracy reads) and others smaller sizes nearer the assembly (high accuracy reads), with short reads roughly in between. The weight of evidence suggests [...] a less biased measurement, [...] have a bias of reduced duplications introduced by averaging and filtering. The complicating factors noted produce discrepancies [...] than the Cym - Asm difference, a problem to control.

Language: English

Cited by: 1

Advancing microbial diagnostics: a universal phylogeny guided computational algorithm to find unique sequences for precise microorganism detection DOI Creative Commons
Gulshan Sharma, R. P. Sharma, Kavita Joshi

et al.

Briefings in Bioinformatics, Year: 2024, Volume 25(6)

Published: Sep. 23, 2024

Abstract: Sequences derived from organisms sharing common evolutionary origins exhibit similarity, while unique sequences, absent in related organisms, act as good diagnostic marker candidates. However, the approach focused on identifying dissimilar regions among closely related organisms poses challenges, as it requires complex multiple sequence alignments, making computation and parsing difficult. To address this, we have developed a biologically inspired universal NAUniSeq algorithm to find unique sequences for microorganism diagnosis by traveling through the phylogeny of life. Mapping through a phylogenetic tree ensures a low number of cross-contamination and false positives. We downloaded complete taxonomy data from the Taxadb database and sequence data from the National Center for Biotechnology Information Reference Sequence Database (NCBI-Refseq) and, with the help of NetworkX, created a phylogenetic tree. Sequences were assigned over the graph nodes, k-mers were created for target and non-target nodes, and the search was performed over the graph using the depth first search algorithm. In a memory efficient alternative NoSQL approach, we created a collection of Refseq sequences in MongoDB using the tax-id and path of FASTA files. We queried the MongoDB collection for the target and non-target sequences. In both approaches, we used an alignment free sliding window k-mer–based procedure that quickly compares k-mers of target and non-target sequences and returns unique sequences that are not present in the non-target. We validated our algorithm with Mycobacterium tuberculosis, Neisseria gonorrhoeae, and Monkeypox and generated unique sequences. This is a powerful tool for generating unique sequences, enabling accurate identification of microbial strains with high precision.
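The alignment-free sliding-window comparison described above can be sketched with plain set operations. This is a simplified illustration, not the NAUniSeq code: the function names and the choice of k are assumptions, and the phylogeny traversal and MongoDB lookup that select the non-target sequences are omitted.

```python
def kmer_set(seq, k):
    """Slide a window of width k over seq and collect the distinct k-mers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(target, non_targets, k):
    """Return the k-mers of target that occur in none of the non-target sequences."""
    candidates = kmer_set(target, k)
    for seq in non_targets:
        candidates -= kmer_set(seq, k)   # drop anything shared with a relative
        if not candidates:               # nothing unique is left, stop early
            break
    return candidates
```

For example, `unique_kmers("ACGTAC", ["ACGT", "GTAA"], 3)` keeps only `TAC`, because the other three k-mers of the target also occur in one of the non-target sequences.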

Language: English

Cited by: 1

Entropy predicts sensitivity of pseudorandom seeds DOI Creative Commons
Benjamin Dominik Maier, Kristoffer Sahlin

Genome Research, Year: 2023, Volume unknown

Published: May 22, 2023

Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. Although k-mers and spaced k-mers are likely the most well-known and used seeds, their sensitivity suffers at high error rates, particularly when indels are present. Recently, we developed a pseudorandom seeding construct, strobemers, which was empirically shown to have high sensitivity also at high indel rates. However, the study lacked a deeper understanding of why. In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. Our discovered seed randomness–sensitivity relationship explains why some seeds perform better than others, and the relationship provides a framework for designing even more sensitive seeds. We present three new strobemer seed constructs: mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to show that our new seed constructs improve sequence-matching sensitivity over other strobemers. We show that the new seed constructs are useful for read mapping and ANI estimation. For read mapping, we implement strobemers into minimap2 and observe 30% faster alignment time and 0.2% higher accuracy [...] using reads [...]. As for ANI estimation, [...] rank correlation between estimated and true ANI.
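A pseudorandom seed of the strobemer kind links a k-mer to a second k-mer picked from a downstream window by a hash, so an indel between the two does not necessarily destroy the seed. The sketch below is a simplified order-2 construction in the spirit of randstrobes, not the authors' implementation: the CRC32 hash, the selection rule, the window offsets, and the name `randstrobes` are assumptions.

```python
import zlib


def _hash(text):
    """Deterministic stand-in hash (assumption: any well-mixed hash would do)."""
    return zlib.crc32(text.encode())


def randstrobes(seq, k=3, w_min=3, w_max=6):
    """Return (i, j, seed) triples for simplified order-2 randstrobe-like seeds.

    The first strobe starts at i. The second strobe starts at the offset j
    in [i + w_min, i + w_max] that minimizes the hash of the joined strobes.
    """
    seeds = []
    last_start = len(seq) - k
    for i in range(last_start + 1):
        first = seq[i:i + k]
        starts = range(i + w_min, min(i + w_max, last_start) + 1)
        if len(starts) == 0:         # window falls off the end of the sequence
            continue
        j = min(starts, key=lambda s: _hash(first + seq[s:s + k]))
        seeds.append((i, j, first + seq[j:j + k]))
    return seeds
```

Because the spacing between the two strobes varies from seed to seed, neighbouring seeds are less correlated than overlapping k-mers, which is the sort of randomness the entropy model in the abstract is meant to quantify.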

Language: English

Cited by: 3

Invited: Accelerating Genome Analysis via Algorithm-Architecture Co-Design DOI
Onur Mutlu, Can Fırtına

Published: July 9, 2023

High-throughput sequencing (HTS) technologies have revolutionized the field of genomics, enabling rapid and cost-effective genome analysis for various applications. However, the increasing volume of genomic data generated by HTS technologies presents significant challenges for computational techniques to effectively analyze genomes. To address these challenges, several algorithm-architecture co-design works have been proposed, targeting different steps of the genome analysis pipeline. These works explore emerging technologies to provide fast, accurate, and low-power genome analysis. This paper provides a brief review of the recent advancements in accelerating genome analysis, covering the opportunities and challenges associated with the acceleration of the key steps of the genome analysis pipeline. Our analysis highlights the importance of integrating multiple steps of genome analysis using suitable architectures to unlock significant performance improvements and reduce data movement and energy consumption. We conclude by emphasizing the need for novel strategies and techniques to address the growing demands of genomic data generation and analysis.

Language: English

Cited by: 3

SieveMem: A Computation-in-Memory Architecture for Fast and Accurate Pre-Alignment DOI
Taha Shahroodi, Michael Miao, Mahdi Zahedi

et al.

Published: July 1, 2023

The high execution time of DNA sequence alignment negatively affects many genomic studies that rely on sequence alignment results. Pre-alignment filtering was introduced as a step before alignment to greatly reduce the execution time of short-read sequence alignment. With its success, i.e., achieving high accuracy and thus removing unnecessary alignments, the filtering itself now constitutes a larger portion of the execution time. A significant contributing factor is the movement of sequences from memory to the processing units, while the majority of sequences will be filtered out because they do not result in an acceptable alignment. State-of-the-art (SotA) pre-alignment filtering accelerators suffer from the same overhead for data movements. Furthermore, these accelerators lack support for future pre-alignment filtering algorithms using the same operations on the underlying hardware. This paper addresses these shortcomings by introducing SieveMem. SieveMem is an architecture that exploits the Computation-in-Memory paradigm with memristive-based devices to support shared kernels of pre-alignment filters inside the memory (i.e., preventing data movements). SieveMem also provides support for future algorithms. SieveMem supports more than 47.6% among all top 5 SotA filters. Moreover, SieveMem includes a hardware-friendly pre-alignment filtering algorithm called BandedKrait, inspired by a combination of the mentioned kernels. Our evaluations show up to 331.1× and 446.8× improvement for the two most-common kernels [...] BandedKrait at [...] level. Using SieveMem, a design we call Mem-BandedKrait, one can improve end-to-end [...] irrespective of the dataset, which can go up to 91.4× compared to [...] accelerator [...] GPU.
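As a minimal illustration of what a pre-alignment filter does, the sketch below rejects read and reference pairs that cannot possibly align within an edit budget, using only base counts. This generic base-count bound is an assumption made for illustration; it is not BandedKrait or any of the kernels SieveMem accelerates, and the function name is hypothetical.

```python
from collections import Counter


def passes_base_count_filter(read, ref, max_edits):
    """Cheap necessary condition for edit distance <= max_edits.

    A substitution changes the base-count difference by at most 2 and an
    insertion or deletion by exactly 1, so a pair whose counts differ by
    more than 2 * max_edits can be discarded without running alignment.
    """
    a, b = Counter(read), Counter(ref)
    diff = sum(abs(a[c] - b[c]) for c in set(a) | set(b))
    return diff <= 2 * max_edits
```

The filter never rejects a pair that is within the budget, but it may let through pairs that full alignment later rejects, which is the usual trade-off for such filters. Pairs that fail can be dropped before they are ever moved to the aligner.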

Language: English

Cited by: 3

LexicHash: sequence similarity estimation via lexicographic comparison of hashes DOI Creative Commons
Grant Greenberg, Aditya Narayan Ravi, Ilan Shomorony

et al.

Bioinformatics, Year: 2023, Volume 39(11)

Published: Oct. 25, 2023

Abstract. Motivation: Pairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers in a read are hashed and the minimum hash value, the min-hash, is stored. Pairwise similarities can then be estimated by counting the number of min-hash matches between a pair of reads, across many distinct hash functions. The choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments (high precision) but can lead to many missing alignments (low recall), particularly in the presence of significant noise. Results: In this work, we introduce LexicHash, a new similarity estimation method that is effectively independent of the choice of k and attains the high precision of large-k and the high sensitivity of small-k MinHash. LexicHash is a variant of MinHash with a carefully designed hash function. When estimating the similarity between two reads, instead of simply checking whether min-hashes match (as in standard MinHash), one checks how “lexicographically similar” the min-hashes are. In our experiments on 40 PacBio datasets, the area under the precision–recall curves obtained by LexicHash had an average improvement of 20.9% over MinHash. Additionally, the LexicHash framework lends itself naturally to an efficient search of the largest alignments, yielding an O(n) time algorithm and circumventing the seemingly fundamental O(n^2) scaling associated with pairwise similarity search. Availability and implementation: LexicHash is available on GitHub at https://github.com/gcgreenberg/LexicHash.
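The difference between checking for an exact min-hash match and checking how lexicographically similar two min-hashes are can be sketched as follows. This is one plausible reading of the idea, not the LexicHash code: the random per-position masks, the 2-bit base encoding, the scoring by longest common prefix, and all names are assumptions.

```python
import random

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def masked_min(seq, mask):
    """Lexicographically smallest masked k-mer of seq, with k = len(mask)."""
    k = len(mask)
    return min(
        tuple(CODE[base] ^ m for base, m in zip(seq[i:i + k], mask))
        for i in range(len(seq) - k + 1)
    )


def common_prefix_len(a, b):
    """Length of the longest common prefix of two tuples."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lexic_score(seq_a, seq_b, k_max=8, n_hashes=16, seed=0):
    """Best prefix agreement between the two min-hashes over all masks.

    Both sequences must be at least k_max long. A standard MinHash check
    would only ask whether the two min-hashes are fully equal.
    """
    rng = random.Random(seed)
    best = 0
    for _ in range(n_hashes):
        mask = [rng.randrange(4) for _ in range(k_max)]
        a, b = masked_min(seq_a, mask), masked_min(seq_b, mask)
        best = max(best, common_prefix_len(a, b))
    return best
```

A score of 8 with `k_max=8` corresponds to a full min-hash match, while lower scores grade partial agreement, so a single sketch covers a range of effective k values instead of committing to one.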

Language: English

Cited by: 2

Measure of major contents in animal and plant genomes, using Gnodes, finds under-assemblies of model plant, Daphnia, fire ant and others DOI Creative Commons
Don Gilbert

bioRxiv (Cold Spring Harbor Laboratory), Year: 2023, Volume unknown

Published: Dec. 21, 2023

Abstract: Significant discrepancies in genome sizes measured by cytometric methods versus DNA sequence estimates are frequent, including in recent long-read assemblies of plant and animal genomes. A new measure using a baseline of unique conserved genes, Gnodes, finds that the larger cytometric measures are often accurate. DNA-informatic measures of size, as well as assembly methods, have errors in methodology that under-measure duplicated spans. Major contents of several model and discrepant genomes are assessed here, including human, corn, chicken, insects, crustaceans, and the model plant. Transposons dominate larger genomes, and structural repeats are a major portion of smaller ones. Gene coding sequences are found in similar amounts across the taxonomic spread. The largest contributors to size discrepancies are higher-order repeats, but there is also significant missed content, including transposons, in some examined species. Informatics of measuring DNA and producing assemblies, including telomere-to-telomere approaches, are subject to mistakes in operation and/or interpretation that are biased against duplications. Mistaken aspects include alignment methods that are inaccurate for high-copy spans; misclassification of true repetitive sequence as heterozygosity and artifact; software default settings that exclude duplicated DNA; and overly conservative data processing that reduces genomic measures. Re-assemblies with balanced methods recover the missing portions of problem genomes of the model plant, water fleas, and fire ant.

Language: English

Cited by: 2

HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors DOI Creative Commons
Weihong Xu, Po‐Kai Hsu, Niema Moshiri

et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Volume unknown

Published: Mar. 8, 2024

Abstract. Motivation: Genomic distance estimation is a critical workload since exact computation for whole-genome similarity metrics such as Average Nucleotide Identity (ANI) incurs prohibitive runtime overhead. Genome sketching is a fast and memory-efficient solution to estimate ANI similarity by distilling representative k-mers from the original sequences. In this work, we present HyperGen, which improves accuracy, runtime performance, and memory efficiency for large-scale ANI estimation. Unlike existing genome sketching algorithms that convert large genome files into discrete k-mer hashes, HyperGen leverages the emerging hyperdimensional computing (HDC) to encode genomes into quasi-orthogonal vectors (Hypervector, HV) in high-dimensional space. HV is compact and can preserve more information, allowing for accurate ANI estimation while reducing required sketch sizes. In particular, the HV sketch representation in HyperGen allows efficient ANI estimation using vector multiplication, which naturally benefits from highly optimized general matrix multiply (GEMM) routines. As a result, HyperGen enables efficient sketching and ANI estimation for massive genome collections. Results: We evaluate HyperGen’s sketching and database search performance using several genome datasets at various scales. HyperGen is able to achieve comparable or superior ANI estimation error and linearity compared to other sketch-based counterparts. The measurement results show that HyperGen is one of the fastest tools for both genome sketching and database search. Meanwhile, HyperGen produces memory-efficient sketch files while ensuring high ANI estimation accuracy. Availability: A Rust implementation of HyperGen is freely available under the MIT license as an open-source software project at https://github.com/wh-xu/Hyper-Gen . The scripts to reproduce the experimental results can be accessed at https://github.com/wh-xu/experiment-hyper-gen . Contact: [email protected]
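A minimal sketch can show how quasi-orthogonal hypervectors turn set overlap into a dot product. This is a generic hyperdimensional construction assumed for illustration, not HyperGen's Rust implementation: the dimension, the CRC32 seeding, the use of every k-mer without sampling, and the function names are all hypothetical.

```python
import random
import zlib


def kmer_hypervector(kmer, dim):
    """Pseudo-random vector of +1/-1 entries, fixed by the k-mer's hash."""
    rng = random.Random(zlib.crc32(kmer.encode()))
    return [1 if rng.getrandbits(1) else -1 for _ in range(dim)]


def sketch(seq, k=5, dim=2048):
    """Sum the hypervectors of all distinct k-mers; also return their count."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    hv = [0] * dim
    for kmer in kmers:
        for d, value in enumerate(kmer_hypervector(kmer, dim)):
            hv[d] += value
    return hv, len(kmers)


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate k-mer Jaccard similarity from two (hypervector, count) sketches.

    Shared k-mers contribute about dim to the dot product, while unrelated
    k-mers contribute noise around zero, so dot / dim approximates the
    intersection size.
    """
    (hv_a, n_a), (hv_b, n_b) = sketch_a, sketch_b
    intersection = sum(x * y for x, y in zip(hv_a, hv_b)) / len(hv_a)
    union = n_a + n_b - intersection
    return intersection / union if union > 0 else 0.0
```

Stacking many such sketches as rows of a matrix would let all pairwise dot products be computed with one matrix multiplication, which is where the GEMM routines mentioned in the abstract would come in. The estimate is noisy for small dimensions, and converting it to ANI needs a further step not shown here.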

Language: English

Cited by: 0