Ultra-fast and High-quality Mapping of Error-prone Long Reads

Boyuan Sun

2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Year: 2023, No. unknown, pp. 920-925

Published: December 5, 2023

To accelerate the mapping of vast amounts of long reads to references, the novel mapper mapquik achieves an over 30-fold speedup compared with the de facto standard minimap2 while maintaining comparable quality on the human genome. However, mapquik can only accurately map PacBio HiFi reads with sequencing error rates lower than 1%. Since 3rd generation long reads with modest error rates higher than 1% are still widely used, like reads from Nanopore DNA sequencing technology, a versatile long-read mapper should consider more cases. This paper adopts the mapping-friendly sequence reduction idea to compress long reads from different sequencing technologies and boost the seed sensitivity of mapquik. For relatively high error rates, we combine an error-sensitive order with a deep learning algorithm to replace the random universe minimizer in mapquik. An improved ultra-fast long-read mapper named mapquikPLUS is verified to handle most mapping tasks efficiently, and the pipeline followed by [...] shows better performance against other approximate mappers for ANI evaluation.

Language: English

A survey of mapping algorithms in the long-reads era

Kristoffer Sahlin, Thomas Baudeau, Bastien Cazaux, et al.

Genome Biology, Year: 2023, No. 24(1)

Published: June 1, 2023

License: Creative Commons

It has been over a decade since the first publication of a method dedicated entirely to mapping long reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings (http://bcazaux.polytech-lille.net/Minimap2/).

Language: English

Cited by

35

The Application of Long-Read Sequencing to Cancer

Luca Ermini, Patrick Driguez

Cancers, Year: 2024, No. 16(7), p. 1275

Published: March 25, 2024

Access: Open Access

Cancer is a multifaceted disease arising from numerous genomic aberrations that have been identified as a result of advancements in sequencing technologies. While next-generation sequencing (NGS), which uses short reads, has transformed cancer research and diagnostics, it is limited by read length. Third-generation sequencing (TGS), led by the Pacific Biosciences and Oxford Nanopore Technologies platforms, employs long-read sequences, which have marked a paradigm shift in cancer research. Cancer genomes often harbour complex events, and TGS, with its ability to span large genomic regions, has facilitated their characterisation, providing a better understanding of how complex rearrangements affect cancer initiation and progression. TGS has also characterised the entire transcriptome of various cancers, revealing cancer-associated isoforms that could serve as biomarkers or therapeutic targets. Furthermore, TGS has advanced cancer research by improving genome assemblies, detecting complex variants, and providing a more complete picture of transcriptomes and epigenomes. This review focuses on TGS and its growing role in cancer research. We investigate its advantages and limitations, providing a rigorous scientific analysis of its use in detecting previously hidden aberrations missed by NGS. This promising technology holds immense potential for both research and clinical applications, with far-reaching implications for cancer diagnosis and treatment.

Language: English

Cited by

9

Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation

Bryce Kille, Erik Garrison, Todd J. Treangen, et al.

Bioinformatics, Year: 2023, No. 39(9)

Published: August 21, 2023

License: Creative Commons

The Jaccard similarity on k-mer sets has shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates.

Language: English

Cited by

18
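
To make the quantity in the abstract above concrete, here is a minimal sketch of the exact Jaccard similarity between the k-mer sets of two sequences. It is only an illustration of the definition: the function names `kmers` and `jaccard` are hypothetical, and this is neither MashMap's winnowing-based estimator nor the minmer scheme the paper proposes.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int) -> float:
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B| of the two k-mer sets."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # Both sequences are shorter than k, so there is nothing to compare.
        return 0.0
    return len(ka & kb) / len(union)


if __name__ == "__main__":
    # The two sequences share ACG and CGT out of four distinct 3-mers.
    print(jaccard("ACGTA", "ACGTT", 3))
```

Sketching methods estimate this value from a small sample of each k-mer set instead of computing it on the full sets, which is where the choice of sampling scheme can introduce bias.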

When less is more: sketching with minimizers in genomics

Malick Ndiaye, Silvia Prieto-Baños, Lucy M. Fitzgerald, et al.

Genome Biology, Year: 2024, No. 25(1)

Published: October 14, 2024

License: Creative Commons

The exponential increase in sequencing data calls for conceptual and computational advances to extract useful biological insights. One such advance, minimizers, allows for reducing the quantity of data handled while maintaining some of its key properties. We provide a basic introduction to minimizers, cover recent methodological developments, and review the diverse applications of minimizers to analyze genomic data, including de novo genome assembly, metagenomics, read alignment, read correction, and pangenomes. We also touch on alternative data sketching techniques including universal hitting sets, syncmers, or strobemers. Minimizers and their alternatives have rapidly become indispensable tools for handling vast amounts of data.

Language: English

Cited by

5
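
As a rough illustration of how minimizers reduce the quantity of data handled, the sketch below keeps only the smallest k-mer from every window of w consecutive k-mers. It assumes plain lexicographic ordering with leftmost tie-breaking for readability; practical tools typically order k-mers by a hash value instead. The function name `minimizers` and its parameters are chosen for this example only.

```python
def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return (position, k-mer) pairs, one per distinct window minimizer.

    A window spans w consecutive k-mers; its minimizer is the
    lexicographically smallest k-mer, ties resolved to the leftmost one.
    """
    n_kmers = len(seq) - k + 1
    picked: list[tuple[int, str]] = []
    for start in range(n_kmers - w + 1):
        # Tuples compare by k-mer first, then by position (leftmost wins ties).
        kmer, pos = min((seq[i:i + k], i) for i in range(start, start + w))
        # Adjacent windows often share a minimizer; record each position once.
        if not picked or picked[-1][0] != pos:
            picked.append((pos, kmer))
    return picked


if __name__ == "__main__":
    # Six 3-mers are reduced to three sampled positions with w = 3.
    print(minimizers("ACGTACGT", k=3, w=3))
```

Because two sequences that share a long enough exact stretch select the same minimizers inside it, the reduced set can stand in for the full k-mer set when looking for matches.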

UniAligner: a parameter-free framework for fast sequence alignment

Andrey V. Bzikadze, Pavel A. Pevzner

Nature Methods, Year: 2023, No. 20(9), pp. 1346-1354

Published: August 14, 2023

Language: English

Cited by

11

xRead: a coverage-guided approach for scalable construction of read overlapping graph

Tangchao Kong, Yadong Wang, Bo Liu, et al.

GigaScience, Year: 2025, No. 14

Published: January 1, 2025

License: Creative Commons

The development of long-read sequencing is promising for the high-quality and comprehensive de novo assembly of various species around the world. However, it is still challenging for assemblers to handle thousands of genomes, tens of gigabase-level assembly sizes, and terabase-level datasets efficiently, which is a bottleneck for large-scale de novo sequencing studies. A major cause is read overlapping graph construction, for which state-of-the-art tools usually cost terabyte-level RAM space and tens of days for large genomes. Such low performance and scalability are not suited to the numerous samples being sequenced. Herein, we propose xRead, a novel iterative overlapping graph construction approach that achieves high performance, scalability, and yield simultaneously. Under the guidance of its coverage-based model, xRead converts read overlapping into heuristic read-mapping and incremental graph construction tasks with highly controllable RAM space and faster speed. It enables the processing of very large datasets (such as the 1.28 Tb Ambystoma mexicanum dataset) with less than 64 GB of RAM and obviously lower time costs. Moreover, benchmarks suggest that xRead can produce accurate and well-connected overlapping graphs, which are also supportive of various kinds of downstream assembly strategies. xRead is able to break through the major bottleneck of graph construction and lays a new foundation for de novo assembly. This tool is suited to handling a large number of datasets from large genomes and may play important roles in many de novo sequencing studies.

Language: English

Cited by

0

Designing efficient randstrobes for sequence similarity analyses

Moein Karami, Aryan Soltani Mohammadi, Marcel Martin, et al.

Bioinformatics, Year: 2024, No. 40(4)

Published: March 29, 2024

License: Creative Commons

Motivation: Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences, leading to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080–94. https://doi.org/10.1101/gr.275648.121), has been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated their efficacy. Results: In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction, and we design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign's accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes. Availability and implementation: All methods and evaluation benchmarks are available in a public GitHub repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.

Language: English

Cited by

2

Improved sub-genomic RNA prediction with the ARTIC protocol

Thomas Baudeau, Kristoffer Sahlin

Nucleic Acids Research, Year: 2024, No. 52(17), p. e82

Published: August 16, 2024

License: Creative Commons

Viral subgenomic RNA (sgRNA) plays a major role in SARS-CoV-2's replication, pathogenicity, and evolution. Recent sequencing protocols, such as the ARTIC protocol, have been established. However, due to viral-specific biological processes, analyzing sgRNA through read sequencing data is a computational challenge. Current methods rely on tools designed for eukaryote genomes, resulting in a gap in tools designed specifically for sgRNA detection. To address this, we make two contributions. Firstly, we present sgENERATE, an evaluation pipeline to study the accuracy and efficacy of sgRNA detection tools using the popular ARTIC sequencing protocol. Using sgENERATE, we evaluate periscope, a recently introduced tool that detects sgRNA from ARTIC sequencing data. We find that periscope has biased predictions and high computational costs. Secondly, using the information produced by sgENERATE, we redesign the algorithm in periscope to use multiple references from canonical sgRNAs to mitigate alignment issues and improve sgRNA and non-canonical sgRNA detection. We evaluate periscope and our algorithm, periscope_multi, on simulated and biological sequencing datasets and demonstrate periscope_multi's enhanced sgRNA detection accuracy. Our contribution advances tools for studying viral sgRNA, paving the way for more accurate and efficient analyses in the context of viral RNA discovery.

Language: English

Cited by

1

Multi-context seeds enable fast and high-accuracy read mapping

Ivan Tolstoganov, Marcel Martin, Kristoffer Sahlin, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, No. unknown

Published: November 3, 2024

License: Creative Commons

A key step in sequence similarity search is to identify seeds that are found in both the query and the reference sequence. A seed is a shorter substring (e.g., a k-mer) or pattern (e.g., a spaced k-mer) constructed from the query and reference sequences. A well-known trade-off in applications such as read mapping is that longer seeds offer fast searches through fewer spurious matches but lower sensitivity in variable regions, as longer seeds are more likely to harbor mutations. Some recent developments on seed constructs have considered approximate (or fuzzy) seeds such as k-min-mers, strobemers, BLEND, SubSeqHash, TensorSketch, and more, that can match over smaller mutations and, thus, suffer less from sensitivity issues in variable regions. Nevertheless, the sensitivity-to-speed trade-off still exists for such constructs. In other applications, such as genome assembly, using multiple sizes of k-mers is effective. While this can be achieved through, e.g., MEM construction from an FM-index, such seed constructs are typically much slower than hash-based constructs. To this end, we introduce multi-context seeds (MCS). In brief, MCS are strobemers where the hashes of individual strobes are partitioned in the hash value representing the seed. Such partitioning enables a cache-friendly approach to search for both full and partial matches of a subset of strobes. For example, both a full strobemer and the first strobe (a k-mer) can be queried. We demonstrate that MCS improves sequence matching statistics over standard strobemers and k-mers without compromising seed uniqueness. We demonstrate the practical applicability of MCS by implementing them in strobealign. Strobealign with MCS comes at no cost in memory and only little cost in runtime while offering increased mapping accuracy over default strobealign using simulated Illumina reads across genomes of various complexity. We also show that strobealign with MCS outperforms minimap2 in short-read mapping and is comparable to BWA-MEM in accuracy in high-variability sequences. MCS provides a fast seed alternative that addresses trade-offs between seed length and alignment accuracy.

Language: English

Cited by

0
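
The partitioned-hash idea described in the abstract above can be sketched as follows. This is a toy model under stated assumptions, not strobealign's implementation: the 16-bit field widths, the use of BLAKE2b, the sorted-list index, and every name (`strobe_hash`, `seed_key`, `build_index`, `find_full`, `find_partial`) are hypothetical. The first strobe's hash occupies the high bits of the seed key, so all seeds sharing a first strobe are contiguous in a sorted index and a partial match becomes a prefix range query.

```python
import bisect
import hashlib

PREFIX_BITS = 16  # assumed width of the first strobe's hash field
SUFFIX_BITS = 16  # assumed width of the second strobe's hash field


def strobe_hash(kmer: str, bits: int) -> int:
    """Hash a k-mer and keep the top `bits` bits of a 64-bit digest."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") >> (64 - bits)


def seed_key(strobe1: str, strobe2: str) -> int:
    """Place strobe1's hash in the high bits and strobe2's in the low bits."""
    high = strobe_hash(strobe1, PREFIX_BITS) << SUFFIX_BITS
    return high | strobe_hash(strobe2, SUFFIX_BITS)


def build_index(seeds: list[tuple[str, str, int]]) -> list[tuple[int, int]]:
    """Sort (key, position) pairs so equal prefixes are stored contiguously."""
    return sorted((seed_key(s1, s2), pos) for s1, s2, pos in seeds)


def _range(index: list[tuple[int, int]], lo_key: int, hi_key: int) -> list[int]:
    """Positions of all entries with lo_key <= key < hi_key."""
    lo = bisect.bisect_left(index, (lo_key, -1))
    hi = bisect.bisect_left(index, (hi_key, -1))
    return [pos for _, pos in index[lo:hi]]


def find_full(index: list[tuple[int, int]], strobe1: str, strobe2: str) -> list[int]:
    """Full match: both strobes must agree (up to hash collisions)."""
    key = seed_key(strobe1, strobe2)
    return _range(index, key, key + 1)


def find_partial(index: list[tuple[int, int]], strobe1: str) -> list[int]:
    """Partial match: only the first strobe's hash field must agree."""
    lo_key = strobe_hash(strobe1, PREFIX_BITS) << SUFFIX_BITS
    return _range(index, lo_key, lo_key + (1 << SUFFIX_BITS))


if __name__ == "__main__":
    index = build_index([("ACG", "TTA", 0), ("ACG", "GGC", 7), ("CCC", "TTA", 12)])
    print(find_full(index, "ACG", "TTA"))
    print(find_partial(index, "ACG"))
```

A query can try `find_full` first and fall back to `find_partial` when the second strobe is disrupted by a mutation, and both lookups touch the same neighbourhood of the index.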

Brisk: Exact resource-efficient dictionary for k-mers

Carol Smith, Igor Martayan, Antoine Limasset, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, No. unknown

Published: November 28, 2024

License: Creative Commons

The rapid advancements in DNA sequencing technology have led to an unprecedented increase in the generation of genomic datasets, with modern sequencers now capable of producing up to ten terabases per run. However, the effective indexing and analysis of this vast amount of data pose significant challenges to the scientific community. K-mer indexing has proven crucial in managing extensive datasets across a wide range of applications, including alignment, compression, dataset comparison, error correction, assembly, and quantification. As a result, developing efficient and scalable k-mer indexing methods has become an increasingly important area of research. Despite the progress made, current state-of-the-art indexing structures are predominantly static, necessitating resource-intensive index reconstruction when integrating new data. Recently, the need for dynamic indexing structures has been recognized. However, many proposed solutions are only pseudo-dynamic, requiring substantial updates to justify the costs of adding new datasets. In practice, applications often rely on standard hash tables to associate data with their k-mers, leading to high k-mer encoding rates exceeding 64 bits per k-mer. In this work, we introduce Brisk, a drop-in replacement for most k-mer dictionary applications. This novel hashmap-like data structure provides high throughput while significantly reducing memory usage compared to existing dynamic associative indexes, particularly for large k-mer sizes. Brisk achieves this by leveraging hierarchical minimizer indexing and a memory-efficient super-k-mer representation. We also introduce novel techniques for efficiently probing k-mers within a set of super-k-mers and managing duplicated minimizers. We believe that the methodologies developed in this work represent a significant advancement in the creation of efficient and scalable k-mer dictionaries, greatly facilitating their routine use in genomic data analysis.

Language: English

Cited by

0