dna2bit: high performance genomic distance estimation software for microbial genome analysis
J. Li, Yuxin Tian, Zhichao Wang

et al.

Frontiers in Microbiology, Journal year: 2024, Number: 15

Published: Dec. 23, 2024

dna2bit is an ultra-fast software specifically engineered for microbial genome analysis, particularly adept at calculating genome distances within metagenome and single amplified genome datasets. Distinguished from existing software such as Mash and Dashing, dna2bit employs a feature hashing technique and Hamming distance to achieve enhanced speed and memory utilization, without sacrifice in the accuracy of average nucleotide identity calculations. dna2bit has promising applications in various domains such as average nucleotide identity approximation, metagenomic sequence clustering, and homology querying. dna2bit significantly boosts computational efficiency in handling large datasets including microbial genomes, thereby facilitating a better understanding of population heterogeneity and comparative genomics of microorganisms. dna2bit is available at https://github.com/lijuzeng/dna2bit.

Language: English
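
The dna2bit entry above mentions feature hashing combined with Hamming distance. The following is a minimal sketch of that general idea, not dna2bit's actual implementation: the function names, the k-mer length, the sketch width, and the sign-based binarization (a SimHash-like choice) are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def bit_sketch(seq, k=8, dims=256):
    """Hash k-mers into `dims` signed buckets, then keep one bit per bucket.

    Hypothetical parameters: k and dims are illustrative defaults only.
    """
    acc = [0] * dims
    for kmer in kmers(seq, k):
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        index = h % dims                        # bucket chosen from the low bits
        sign = 1 if (h >> 63) & 1 else -1       # sign chosen from the top bit
        acc[index] += sign
    bits = 0
    for i, value in enumerate(acc):
        if value > 0:                           # binarize each bucket by its sign
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of bit positions in which two sketches differ."""
    return bin(a ^ b).count("1")


genome_a = "ACGTACGTTGCAACGGTTAACCGGTA"
genome_b = "ACGTACGTTGCAACGGTTAACCGGTT"
distance = hamming(bit_sketch(genome_a), bit_sketch(genome_b))
```

Comparing two fixed-width bit vectors needs only an XOR and a population count, which is why this family of sketches tends to be cheap in both time and memory. How the resulting distance would be mapped to average nucleotide identity is not covered by this sketch.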

A survey of mapping algorithms in the long-reads era
Kristoffer Sahlin, Thomas Baudeau, Bastien Cazaux

et al.

Genome Biology, Journal year: 2023, Number: 24(1)

Published: June 1, 2023

It has been over a decade since the first publication of a method dedicated entirely to mapping long reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings (http://bcazaux.polytech-lille.net/Minimap2/).

Language: English

Cited by: 35

From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures
Mohammed Alser, Joël Lindegger, Can Fırtına

et al.

Computational and Structural Biotechnology Journal, Journal year: 2022, Number: 20, pp. 4579-4599

Published: Jan. 1, 2022

We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to the population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low cost yet highly error prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent. The analysis script and data used in our experimental evaluation are available at: https://github.com/CMU-SAFARI/Molecules2Variations

Language: English

Cited by: 37

RawHash: enabling fast and accurate real-time analysis of raw nanopore signals for large genomes
Can Fırtına, Nika Mansouri Ghiasi, Joël Lindegger

et al.

Bioinformatics, Journal year: 2023, Number: 39(Supplement_1), pp. i297-i307

Published: June 1, 2023

Nanopore sequencers generate electrical raw signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either (i) require powerful computational resources that may not be available for portable sequencers or (ii) lack scalability for large genomes, rendering them inaccurate or ineffective. We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures that signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value. We evaluate RawHash on three applications: (i) read mapping, (ii) relative abundance estimation, and (iii) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real-time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides (i) 25.8× and 3.4× better average throughput and (ii) significantly better accuracy for large genomes, respectively. Source code is available at https://github.com/CMU-SAFARI/RawHash.

Language: English

Cited by: 19
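
The RawHash entry above describes quantizing raw signal values so that slightly different signals share a quantized value and therefore a hash value. A minimal sketch of that principle follows. It is not RawHash's actual quantization scheme: the fixed bucket width `step`, the window length `n`, the function names, and the example signal values are all assumptions chosen for illustration.

```python
import hashlib
import math


def quantize(signal, step=0.25):
    """Map each value to the index of its fixed-width bucket (assumed scheme)."""
    return [math.floor(x / step) for x in signal]


def seed_hashes(signal, n=3, step=0.25):
    """Hash every window of n consecutive quantized values."""
    q = quantize(signal, step)
    hashes = []
    for i in range(len(q) - n + 1):
        window = ",".join(str(v) for v in q[i:i + n]).encode()
        hashes.append(hashlib.blake2b(window, digest_size=8).hexdigest())
    return hashes


# Two noisy measurements of the same (made-up) underlying signal.
signal_a = [0.10, 0.60, 0.90, 0.30]
signal_b = [0.12, 0.58, 0.93, 0.28]

# Both fall into buckets [0, 2, 3, 1], so their seed hashes are identical.
same = seed_hashes(signal_a) == seed_hashes(signal_b)
```

A fixed-width bucket is the simplest possible quantizer and breaks down for values near a bucket boundary, where small noise flips the bucket index. That weakness is one reason a real tool would need a more careful quantization than this sketch.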

Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation
Bryce Kille, Erik Garrison, Todd J. Treangen

et al.

Bioinformatics, Journal year: 2023, Number: 39(9)

Published: Aug. 21, 2023

The Jaccard similarity on k-mer sets has shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these

Language: English

Cited by: 18
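
The minmers entry above contrasts the Jaccard similarity of full k-mer sets with an estimate taken from minimizer-winnowed sets. The sketch below computes both quantities on a toy pair of sequences. It does not implement minmers or MashMap: the lexicographic ordering of k-mers, the function names, and the example sequences are assumptions, and one toy pair cannot demonstrate statistical bias, only that the two quantities can differ.

```python
def kmer_set(seq, k):
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minimizer_set(seq, k, w):
    """Smallest k-mer (lexicographic order, an assumed ordering) in each
    window of w consecutive k-mers."""
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kms[i:i + w]) for i in range(len(kms) - w + 1)}


def jaccard(a, b):
    """|A intersect B| / |A union B|; defined here as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


s1, s2 = "ACGTAC", "ACGTTT"

# Full 3-mer sets share 2 of 6 distinct k-mers.
exact = jaccard(kmer_set(s1, 3), kmer_set(s2, 3))                      # 2/6

# Winnowed sets (w = 2) share 2 of 4 distinct minimizers.
winnowed = jaccard(minimizer_set(s1, 3, 2), minimizer_set(s2, 3, 2))   # 0.5
```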

Efficient mapping of accurate long reads in minimizer space with mapquik
Barış Ekim, Kristoffer Sahlin, Paul Medvedev

et al.

Genome Research, Journal year: 2023, Number: unknown

Published: June 30, 2023

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps (fundamental bottlenecks to read mapping) for both the human and maize genomes with >96% sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a 37× speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a 410× speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic O(n) pseudochaining algorithm, which improves upon the long-standing O(n log n) bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.

Language: English

Cited by: 12
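
The mapquik entry above describes seeds made of k consecutively sampled minimizers (k-min-mers), with only the k-min-mers that occur once in the reference being indexed. The sketch below illustrates that seeding idea only. It is not mapquik's implementation: the window-based lexicographic minimizer sampling is a simplifying assumption that may differ from the tool's sampling scheme, and every function name and parameter (`ell` for minimizer length, `w` for window size) is hypothetical.

```python
from collections import Counter


def sampled_minimizers(seq, ell, w):
    """Ordered (position, l-mer) pairs: leftmost smallest l-mer per window."""
    lmers = [seq[i:i + ell] for i in range(len(seq) - ell + 1)]
    picked = []
    for start in range(len(lmers) - w + 1):
        offset = min(range(w), key=lambda j: lmers[start + j])
        pos = start + offset
        if not picked or picked[-1][0] != pos:   # skip repeats of the same pick
            picked.append((pos, lmers[pos]))
    return picked


def k_min_mers(seq, k, ell, w):
    """Tuples of k consecutive minimizers, each with its start position."""
    mins = sampled_minimizers(seq, ell, w)
    return [
        (tuple(m for _, m in mins[i:i + k]), mins[i][0])
        for i in range(len(mins) - k + 1)
    ]


def unique_index(reference, k, ell, w):
    """Index only the k-min-mers that occur exactly once in the reference."""
    kmms = k_min_mers(reference, k, ell, w)
    counts = Counter(key for key, _ in kmms)
    return {key: pos for key, pos in kmms if counts[key] == 1}


def anchors(read, index, k, ell, w):
    """(read position, reference position) for each k-min-mer hit."""
    return [
        (rpos, index[key])
        for key, rpos in k_min_mers(read, k, ell, w)
        if key in index
    ]


index = unique_index("ACGTTGCA", k=2, ell=2, w=2)
hits = anchors("GTTGC", index, k=2, ell=2, w=2)   # [(0, 2), (2, 4)]
```

In this toy example both anchors imply the same offset of 2 between read and reference, which is the kind of agreement a chaining step would look for. Dropping repeated k-min-mers from the index trades some sensitivity in repetitive regions for unambiguous hits.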

RawHash2: Mapping Raw Nanopore Signals Using Hash-Based Seeding and Adaptive Quantization
Can Fırtına, Melina Soysal, Joël Lindegger

et al.

Bioinformatics, Journal year: 2024, Number: 40(8)

Published: July 30, 2024

Raw nanopore signals can be analyzed while they are being generated, a process known as real-time analysis. Real-time analysis of raw signals is essential to utilize the unique features that nanopore sequencing provides, enabling the early stopping of the sequencing of a read or the entire sequencing run based on the analysis. The state-of-the-art mechanism, RawHash, offers the first hash-based efficient and accurate similarity identification between raw signals and a reference genome by quickly matching their hash values. In this work, we introduce RawHash2, which provides major improvements over RawHash, including more sensitive quantization and chaining algorithms, weighted mapping decisions, frequency filters to reduce ambiguous seed hits, minimizers for hash-based sketching, and support for the R10.4 flow cell version and the POD5 and SLOW5 file formats. RawHash2 provides better F1 accuracy (on average by 10.57% and up to 20.25%) and better throughput (on average by 4.0× and up to 9.9×) than RawHash.

Language: English

Cited by: 4

xRead: a coverage-guided approach for scalable construction of read overlapping graph
Tangchao Kong, Yadong Wang, Bo Liu

et al.

GigaScience, Journal year: 2025, Number: 14

Published: Jan. 1, 2025

The development of long-read sequencing is promising for the high-quality and comprehensive de novo assembly of various species around the world. However, it is still challenging for genome assemblers to handle thousands of genomes, tens of gigabase-level genome sizes, and terabase-level datasets efficiently, which is a bottleneck for large-scale de novo sequencing studies. A major cause is read overlapping graph construction, for which state-of-the-art tools usually cost terabyte-level RAM space and tens of days for large genomes. Such lower performance and scalability are not suited to the numerous samples being sequenced. Herein, we propose xRead, a novel iterative overlapping graph construction approach that achieves high performance, scalability, and yield simultaneously. Under the guidance of its coverage-based model, xRead converts the read-overlapping task into heuristic read-mapping and incremental graph construction tasks with highly controllable RAM space and faster speed. It enables the processing of very large datasets (such as the 1.28 Tb Ambystoma mexicanum dataset) with less than 64 GB RAM and obviously lower time costs. Moreover, benchmarks suggest that it can produce accurate and well-connected overlapping graphs, which are also supportive of various kinds of downstream assembly strategies. xRead is able to break through the major bottleneck of graph construction and lays a new foundation for de novo assembly. This tool is suited to handling a large number of datasets from large genomes and may play important roles in many de novo sequencing studies.

Language: English

Cited by: 0

Designing efficient randstrobes for sequence similarity analyses
Moein Karami, Aryan Soltani Mohammadi, Marcel Martin

et al.

Bioinformatics, Journal year: 2024, Number: 40(4)

Published: March 29, 2024

Motivation: Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences, leading to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080–94. https://doi.org/10.1101/gr.275648.121), has been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated the efficacy. Results: In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign’s accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes. Availability and implementation: All methods and evaluation benchmarks are available in a public Github repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.

Language: English

Cited by: 2
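
The randstrobes entry above refers to construction operators that decide which downstream k-mer is linked to a starting k-mer. The sketch below builds order-2 randstrobes using an XOR of hash values as the link operator. It is an illustration under stated assumptions, not one of the paper's proposed methods: the XOR operator, the hash function, the end-of-sequence handling, and all names and parameters are hypothetical choices.

```python
import hashlib


def kmer_hash(s):
    """Deterministic 64-bit hash of a k-mer."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")


def randstrobes(seq, k, w_min, w_max):
    """Order-2 randstrobes as (first position, second position, joined strobes).

    The second strobe is the k-mer in [i + w_min, i + w_max] that minimizes
    hash(first) XOR hash(candidate). The window is truncated at the sequence
    end, and construction stops once no candidate remains (assumed handling).
    """
    n = len(seq) - k + 1
    hashes = [kmer_hash(seq[i:i + k]) for i in range(n)]
    out = []
    for i in range(n):
        lo, hi = i + w_min, min(i + w_max, n - 1)
        if lo > hi:
            break
        j = min(range(lo, hi + 1), key=lambda p: hashes[i] ^ hashes[p])
        out.append((i, j, seq[i:i + k] + seq[j:j + k]))
    return out


seeds = randstrobes("ACGTACGTTGCAAC", k=3, w_min=3, w_max=5)
```

Because the second strobe is picked by a hash-dependent rule rather than at a fixed offset, an insertion or deletion between the two strobes does not necessarily destroy the seed. Scanning the whole window for every start position costs O(w) per seed, which is the step a tree-based method would aim to speed up.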

Enhancing insights into diseases through horizontal gene transfer event detection from gut microbiome
Shuai Wang, Yiqi Jiang, Lijia Che

et al.

Nucleic Acids Research, Journal year: 2024, Number: 52(14), pp. e61-e61

Published: June 17, 2024

Horizontal gene transfer (HGT) phenomena pervade the gut microbiome and significantly impact human health. Yet, no current method can accurately identify complete HGT events, including the transferred sequence and the associated deletion and insertion breakpoints, from shotgun metagenomic data. Here, we develop LocalHGT, which facilitates the reliable and swift detection of complete HGT events from shotgun metagenomic data, delivering an accuracy of 99.4% (verified by Nanopore data) across 200 gut microbiome samples, and achieving an average F1 score of 0.99 on 100 simulated data. LocalHGT enables a systematic characterization of HGT events within the human gut microbiome across 2098 samples, revealing that multiple recipient genome sites can become targets of a transferred sequence, microhomology is enriched in HGT breakpoint junctions (P-value = 3.3e-58), and HGTs can function as host-specific fingerprints, as indicated by the higher HGT similarity of intra-personal temporal samples than inter-personal samples (P-value = 4.3e-303). Crucially, HGTs showed potential contributions to colorectal cancer (CRC) and acute diarrhoea, as evidenced by the enrichment of the butyrate metabolism pathway (P-value = 3.8e-17) and the shigellosis pathway (P-value = 5.9e-13) in the respective HGTs. Furthermore, differential HGTs demonstrated promise as biomarkers for predicting various diseases. Integrating HGTs into a CRC prediction model achieved an AUC of 0.87.

Language: English

Cited by: 2

TargetCall: eliminating the wasted computation in basecalling via pre-basecalling filtering
Meryem Banu Cavlak, Gagandeep Singh, Mohammed Alser

et al.

Frontiers in Genetics, Journal year: 2024, Number: 15

Published: Oct. 28, 2024

Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, that is, reads. State-of-the-art basecallers use complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, most reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps in the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall’s key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads, and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. Our thorough experimental evaluations show that TargetCall 1) improves the end-to-end basecalling runtime performance of the state-of-the-art basecaller by 3.31× while maintaining high (98.88%) recall in keeping on-target reads, 2) maintains high accuracy in downstream analysis, and 3) achieves better runtime performance, throughput, recall, precision, and generality than prior works. TargetCall is available at https://github.com/CMU-SAFARI/TargetCall.

Language: English

Cited by: 2