Improved pangenomic classification accuracy with chain statistics
Nathaniel K. Brown, Vikram S. Shivakumar, Ben Langmead

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Nov. 2, 2024

Abstract: Compressed full-text indexes enable efficient sequence classification against a pangenome or tree-of-life index. Past work on compressed-index classification used matching statistics or pseudo-matching lengths to capture the fine-grained co-linearity of exact matches. But these fail to capture coarse-grained information about whether seeds appear co-linearly in the reference. We present a novel approach that additionally obtains coarse-grained co-linearity (“chain”) statistics. We do this without using a chaining algorithm, which would require superlinear time in the number of matches. We start with a collection of strings, avoiding the multiple-alignment step required by graph approaches. We rapidly compute multi-maximal unique matches (multi-MUMs) and identify BWT sub-runs that correspond to these multi-MUMs. From these, we select those that can be “tunneled,” and mark these with the corresponding multi-MUM identifiers. This yields an O(r + n/d)-space index for a collection of d sequences having a length-n BWT consisting of r maximal equal-character runs. Using the index, we simultaneously compute fine-grained matching statistics and coarse-grained chain statistics in linear time with respect to query length. We found that this substantially improves classification accuracy compared to past compressed-indexing approaches and reaches the same level of accuracy as less efficient alignment-based methods.
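
The space bound in this abstract is stated in terms of n (the BWT length) and r (the number of maximal equal-character runs in the BWT). As a rough illustration of those two quantities only, the following Python sketch builds a BWT by sorting rotations and counts its runs. The function names `bwt` and `count_runs` and the `$` sentinel are assumptions made for this example; this is not the authors' index construction, which does not sort rotations explicitly.

```python
from itertools import groupby


def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations.

    Quadratic in time and space; meant only to illustrate the definition.
    The sentinel is assumed to sort before every character of `text`.
    """
    if sentinel in text:
        raise ValueError("text must not contain the sentinel character")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The BWT is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters (the quantity r)."""
    return sum(1 for _ in groupby(s))


transformed = bwt("banana")
n = len(transformed)         # BWT length, sentinel included
r = count_runs(transformed)  # maximal equal-character runs
```

For repetitive collections such as pangenomes, r is typically much smaller than n, which is why run-based bounds are attractive.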

Language: English

Uncalled4 improves nanopore DNA and RNA modification detection via fast and accurate signal alignment
Sam Kovaka, Paul W. Hook, Katharine M. Jenike

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: March 11, 2024

Abstract: Nanopore signal analysis enables detection of nucleotide modifications from native DNA and RNA sequencing, providing both accurate genetic/transcriptomic and epigenetic information without additional library preparation. Presently, only a limited set of modifications can be directly basecalled (e.g. 5-methylcytosine), while most others require exploratory methods that often begin with alignment of nanopore signal to a nucleotide reference. We present Uncalled4, a toolkit for nanopore signal alignment, analysis, and visualization. Uncalled4 features an efficient banded signal alignment algorithm, BAM signal alignment file format, statistics for comparing signal alignment methods, and a reproducible de novo training method for k-mer-based pore models, revealing potential errors in ONT’s state-of-the-art DNA model. We apply Uncalled4 to RNA 6-methyladenine (m6A) detection in seven human cell lines, identifying 26% more modifications than Nanopolish using m6Anet, including in several genes where m6A has known implications in cancer. Uncalled4 is available open-source at github.com/skovaka/uncalled4 .

Language: English

Citations

8

RawHash2: Mapping Raw Nanopore Signals Using Hash-Based Seeding and Adaptive Quantization
Can Fırtına, Melina Soysal, Joël Lindegger

et al.

Bioinformatics, Journal Year: 2024, Volume and Issue: 40(8)

Published: July 30, 2024

Raw nanopore signals can be analyzed while they are being generated, a process known as real-time analysis. Real-time analysis of raw signals is essential to utilize the unique features that nanopore sequencing provides, enabling the early stopping of the sequencing of a read or the entire sequencing run based on the analysis. The state-of-the-art mechanism, RawHash, offers the first hash-based efficient and accurate similarity identification between raw signals and a reference genome by quickly matching their hash values. In this work, we introduce RawHash2, which provides major improvements over RawHash, including more sensitive quantization and chaining algorithms, weighted mapping decisions, frequency filters to reduce ambiguous seed hits, minimizers for hash-based sketching, and support for the R10.4 flow cell version and POD5 and SLOW5 file formats. Compared to RawHash, RawHash2 provides better F1 accuracy (on average by 10.57% and up to 20.25%) and better throughput (on average by 4.0× and up to 9.9×) than RawHash.
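
The abstract lists minimizers for hash-based sketching among the improvements. The Python sketch below shows the generic window-minimizer idea over a list of integer hash values: for each window of w consecutive hashes, keep only the smallest one. The function name `minimizers`, the leftmost tie-breaking rule, and the sample values are assumptions chosen for illustration; this is not RawHash2's implementation, which derives its hash values from quantized signal events.

```python
def minimizers(hashes: list[int], w: int) -> list[tuple[int, int]]:
    """Return the distinct (position, hash) pairs that are the minimum of at
    least one window of w consecutive hash values, sorted by position.

    Ties inside a window go to the leftmost position (an assumption made
    for this sketch).
    """
    if w <= 0:
        raise ValueError("window size w must be positive")
    picked = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        # min over indices returns the first index holding the smallest value.
        offset = min(range(w), key=window.__getitem__)
        picked.add((start + offset, window[offset]))
    return sorted(picked)


# Hypothetical hash values for seven consecutive seeds.
sample = [5, 3, 8, 1, 9, 2, 7]
sketch = minimizers(sample, w=3)
```

Because adjacent windows often share the same minimum, the sketch keeps far fewer seeds than the input while still sampling every window, which reduces the number of hash lookups during mapping.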

Language: English

Citations

4

Uncalled4 improves nanopore DNA and RNA modification detection via fast and accurate signal alignment
Sam Kovaka, Paul W. Hook, Katharine M. Jenike

et al.

Nature Methods, Journal Year: 2025, Volume and Issue: unknown

Published: March 28, 2025

Nanopore signal analysis enables detection of nucleotide modifications from native DNA and RNA sequencing, providing both accurate genetic or transcriptomic and epigenetic information without additional library preparation. At present, only a limited set of modifications can be directly basecalled (for example, 5-methylcytosine), while most others require exploratory methods that often begin with alignment of nanopore signal to a nucleotide reference. We present Uncalled4, a toolkit for nanopore signal alignment, analysis and visualization. Uncalled4 features an efficient banded signal alignment algorithm, BAM signal alignment file format, statistics for comparing signal alignment methods and a reproducible de novo training method for k-mer-based pore models, revealing potential errors in Oxford Nanopore Technologies' state-of-the-art DNA model. We apply Uncalled4 to RNA 6-methyladenine (m6A) detection in seven human cell lines, identifying 26% more modifications than Nanopolish using m6Anet, including in several genes where m6A has known implications in cancer. Uncalled4 is available open source at github.com/skovaka/uncalled4 .

Language: English

Citations

0

Improved Pangenomic Classification Accuracy with Chain Statistics
Nathaniel K. Brown, Vikram S. Shivakumar, Ben Langmead

et al.

Lecture notes in computer science, Journal Year: 2025, Volume and Issue: unknown, P. 190 - 208

Published: Jan. 1, 2025

Language: English

Citations

0

Faster Maximal Exact Matches with Lazy LCP Evaluation
Adrián Goga, Lore Depuydt, Nathaniel K. Brown

et al.

Published: March 19, 2024

MONI (Rossi et al.,

Language: English

Citations

3
