Analysis-ready VCF at Biobank scale using Zarr
Eric Czech, Timothy R. Millar, Will Tyler, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown

Published: June 12, 2024

License: Creative Commons

Abstract. Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results: Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for reliable and efficient conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n = 78,195; Our Future Health: n = 651,050; All of Us: n = 245,394) along with whole genome datasets for Norway Spruce (n = 1,063) and SARS-CoV-2 (n = 4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs. Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows. Key Points: VCF is widely supported, and the underlying data model is entrenched in bioinformatics pipelines. The standard row-wise encoding as text (or binary) is inherently inefficient for large-scale data processing. The Zarr format provides an efficient solution, by encoding fields in the VCF separately in chunk-compressed binary format.
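The contrast the abstract draws between row-wise records and separately stored, chunk-compressed fields can be illustrated with a minimal sketch. It uses only the Python standard library and does not use Zarr or the VCF Zarr specification itself. The toy records, the field names (`pos`, `qual`, `dp`), the chunk size, and the helper functions `chunk_compress` and `read_range` are all hypothetical choices made for illustration.

```python
import zlib
from array import array


def chunk_compress(values, chunk_size):
    """Split one field's values into fixed-size chunks and compress each chunk."""
    chunks = []
    for start in range(0, len(values), chunk_size):
        raw = array("i", values[start:start + chunk_size]).tobytes()
        chunks.append(zlib.compress(raw))
    return chunks


def read_range(chunks, chunk_size, start, stop):
    """Return values[start:stop], decompressing only the chunks that overlap it.

    Assumes 0 <= start < stop <= total number of stored values.
    """
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    decoded = array("i")
    for idx in range(first, last + 1):
        decoded.frombytes(zlib.decompress(chunks[idx]))
    offset = first * chunk_size
    return list(decoded[start - offset:stop - offset])


# Hypothetical row-wise records, one dict per variant (toy data, not real VCF).
records = [
    {"pos": 100 + 10 * i, "qual": 30 + (i % 5), "dp": 20 + (i % 3)}
    for i in range(1000)
]

# Column-wise layout: each field is stored and compressed independently.
CHUNK_SIZE = 256
columns = {
    name: chunk_compress([r[name] for r in records], CHUNK_SIZE)
    for name in ("pos", "qual", "dp")
}

# Reading ten values of one field touches two chunks of that field only;
# the other fields are never decompressed.
qual_slice = read_range(columns["qual"], CHUNK_SIZE, 250, 260)
```

In this sketch a query on a single field never reads the other fields, which is the access pattern the abstract describes as inefficient under row-wise encoding.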

Language: English

Inference and applications of ancestral recombination graphs
Rasmus Nielsen, Andrew H. Vaughn, Yun Deng, et al.

Nature Reviews Genetics, Year: 2024, Issue: unknown

Published: Sep. 30, 2024

Language: English

Cited by: 14

A general and efficient representation of ancestral recombination graphs
Yan Wong, Anastasia Ignatieva, Jere Koskela, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2023, Issue: unknown

Published: Nov. 4, 2023

License: Creative Commons

Abstract. As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. This approach is out of step with modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalises these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.
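The idea of describing an ARG through genomes and their intervals of inheritance can be sketched as a small data structure. This is an illustrative assumption, not the authors' formalism or any library's API. The `Edge` class, the `local_tree` helper, the integer genome IDs, and the toy coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Genome `child` inherits the half-open interval [left, right) from `parent`."""
    left: float
    right: float
    parent: int
    child: int


def local_tree(edges, position):
    """Return the child -> parent map of the genealogical tree at one position."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


# Toy ARG over a genome of length 10: sample genomes 0, 1, 2 and
# ancestral genomes 3, 4, 5, with a change of genealogy at position 5.
edges = [
    Edge(0, 5, 3, 0), Edge(0, 5, 3, 1), Edge(0, 5, 5, 3), Edge(0, 5, 5, 2),
    Edge(5, 10, 4, 1), Edge(5, 10, 4, 2), Edge(5, 10, 5, 4), Edge(5, 10, 5, 0),
]

left_tree = local_tree(edges, 2)   # genealogy to the left of the breakpoint
right_tree = local_tree(edges, 7)  # genealogy to the right of the breakpoint
```

In this toy example genome 1 inherits from genome 3 left of position 5 and from genome 4 right of it, so the genealogy differs along the genome without any explicit recombination event being recorded.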

Language: English

Cited by: 15

Analysis-ready VCF at Biobank scale using Zarr
Eric Czech, Timothy R. Millar, Will Tyler, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown

Published: June 12, 2024

License: Creative Commons

Language: English

Cited by: 1