Analysis-ready VCF at Biobank scale using Zarr DOI Creative Commons
Eric Czech, Timothy R. Millar,

Will Tyler

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: June 12, 2024

Abstract Background Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank-scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results Zarr is a format for storing multi-dimensional data that is widely used across the sciences and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than VCF-based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole-genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs. Conclusions Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows. Key Points VCF is widely supported, and the underlying data model is entrenched in bioinformatics pipelines. The standard row-wise encoding as text (or binary) is inherently inefficient for large-scale processing. The VCF Zarr specification provides a solution, by storing fields separately in a chunk-compressed binary format.
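The key idea in the abstract, storing each field as separately chunk-compressed arrays so that per-sample (columnar) access touches only a few chunks, can be illustrated with a minimal stdlib-only sketch. This is a hypothetical toy layout for illustration, not the VCF Zarr specification or the Zarr library itself:

```python
import zlib

# Hypothetical illustration (not the VCF Zarr spec): store a genotype
# matrix in separately compressed chunks laid out per sample, so that
# reading one sample decompresses only that sample's chunks instead of
# scanning every row of a row-wise file.
N_VARIANTS, N_SAMPLES, CHUNK = 1000, 8, 250

# Toy data: genotype of variant v in sample s is (v + s) % 3.
matrix = [[(v + s) % 3 for v in range(N_VARIANTS)] for s in range(N_SAMPLES)]

# One compressed chunk per (sample, variant-block): a columnar layout.
chunks = {
    (s, v0): zlib.compress(bytes(matrix[s][v0:v0 + CHUNK]))
    for s in range(N_SAMPLES)
    for v0 in range(0, N_VARIANTS, CHUNK)
}

def sample_column(s):
    """Reassemble one sample's genotypes, touching only its own chunks."""
    out = []
    for v0 in range(0, N_VARIANTS, CHUNK):
        out.extend(zlib.decompress(chunks[(s, v0)]))
    return out

col = sample_column(3)
print(len(col), col[:5])
```

In a real chunked store the chunks would also be split along the variant dimension for parallel range queries; the point here is only that chunk-level compression makes column access independent of the number of other samples.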

Language: English

Biologically inspired graphs to explore massive genetic datasets DOI
Ryan M. Layer

Nature Computational Science, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 31, 2025

Language: English

Citations

0

IGD: A simple, efficient genotype data format DOI Creative Commons

Drew DeHaas,

Xinzhu Wei

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 8, 2025

Abstract Motivation While there are a variety of file formats for storing reference-sequence-aligned genotype data, many are complex or inefficient. Programming-language support for such formats is often limited. A format that is simple to understand and implement – yet fast and small – is helpful for research on highly scalable bioinformatics. Results We present the Indexable Genotype Data (IGD) format, an uncompressed binary format that can be more than 100 times faster and 3.5 times smaller than vcf.gz for Biobank-scale whole-genome sequence data. The implementation of reading and writing IGD in Python is under 350 lines of code, which reflects the simplicity of the format. Availability A C++ library for IGD, and tooling to convert files, can be found at https://github.com/aprilweilab/picovcf and https://github.com/aprilweilab/pyigd
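The property the name "Indexable" refers to, being able to seek straight to any variant in an uncompressed binary file, can be sketched with a toy fixed-width layout. This is an assumed illustration of the general technique, not the actual IGD on-disk format:

```python
import struct
import io

# Hypothetical fixed-width layout (NOT the actual IGD spec): a 12-byte
# header, then one packed row of genotype bytes per variant.  Because
# every row has the same width, variant i lives at a computed offset,
# so random access needs no decompression and no index scan.
MAGIC = b"TOYG"
N_SAMPLES = 6
ROW = struct.Struct(f"{N_SAMPLES}b")  # one signed byte per sample

def write_toy(buf, rows):
    """Write header (magic, n_variants, n_samples) then fixed-width rows."""
    buf.write(MAGIC + struct.pack("<II", len(rows), N_SAMPLES))
    for row in rows:
        buf.write(ROW.pack(*row))

def read_variant(buf, i):
    """Seek directly to variant i: 12-byte header + i fixed-width rows."""
    buf.seek(12 + i * ROW.size)
    return list(ROW.unpack(buf.read(ROW.size)))

buf = io.BytesIO()
rows = [[(v + s) % 3 for s in range(N_SAMPLES)] for v in range(100)]
write_toy(buf, rows)
print(read_variant(buf, 42))
```

The trade-off the abstract highlights is that leaving the rows uncompressed buys this O(1) addressability, and for sparse Biobank-scale genotypes the file can still come out smaller than vcf.gz.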

Language: English

Citations

0

On ARGs, pedigrees, and genetic relatedness matrices DOI Creative Commons
Brieuc Lehmann, Hanbin Lee, Luke Anderson-Trocmé

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: March 5, 2025

Abstract Genetic relatedness is a central concept in genetics, underpinning studies of population and quantitative genetics in human, animal, and plant settings. It is typically stored as a genetic relatedness matrix (GRM), whose elements are pairwise relatedness values between individuals. Relatedness has been defined in various contexts based on pedigree, genotype, phylogeny, coalescent times, and, most recently, the ancestral recombination graph (ARG). ARG-based GRMs have been found to better capture population structure and to improve association studies relative to the genotype GRM. However, calculating GRMs and performing further operations with them is fundamentally challenging due to their inherent quadratic time and space complexity. Here, we first discuss the different definitions of relatedness in a unifying context, making use of the additive model of a quantitative trait to provide a definition of "branch relatedness" and the corresponding "branch GRM". We explore the relationship between branch relatedness and pedigree relatedness through a case study of French-Canadian individuals with a known pedigree. Through the tree sequence encoding of an ARG, we then derive an efficient algorithm for computing products of the branch GRM with a general vector, without explicitly forming the GRM. This algorithm leverages the sparse tree sequence encoding of genomes and hence enables large-scale computations. We demonstrate the power of this approach by developing a randomized principal components algorithm for tree sequences that easily scales to millions of genomes. All algorithms are implemented in the open-source tskit Python package. Taken together, this work consolidates the different notions of relatedness by leveraging the ARG, and it provides efficient algorithms that enable GRM computations to scale to mega-scale genomic datasets.
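The algorithmic core of the abstract, computing GRM-vector products without ever materialising the quadratic-size GRM, can be sketched in plain Python. This is an assumed toy setup (a dense genotype matrix rather than tskit's tree sequence encoding, which is where the paper's real efficiency comes from), but it shows why an implicit matvec is enough to drive power iteration or randomized PCA:

```python
import math

# Sketch (assumed toy setup, not tskit's implementation): the implicit
# GRM is K = X X^T / m for a genotype matrix X (n samples x m variants).
# K @ vec is computed as X @ (X^T @ vec) / m, so the n x n matrix K is
# never formed; the paper obtains the same product from an ARG instead.
n, m = 5, 40
X = [[float((v * (s + 1)) % 3 - 1) for v in range(m)] for s in range(n)]

def grm_matvec(vec):
    """K @ vec in O(n*m) time without building the n x n GRM."""
    t = [sum(X[s][v] * vec[s] for s in range(n)) for v in range(m)]
    return [sum(X[s][v] * t[v] for v in range(m)) / m for s in range(n)]

# Sanity check against the explicitly formed GRM on this small example.
K = [[sum(X[a][v] * X[b][v] for v in range(m)) / m for b in range(n)]
     for a in range(n)]
v0 = [1.0, -1.0, 2.0, 0.5, -0.5]
dense = [sum(K[a][b] * v0[b] for b in range(n)) for a in range(n)]
maxdiff = max(abs(a - b) for a, b in zip(dense, grm_matvec(v0)))

# Power iteration on the implicit operator finds the top PC direction;
# randomized PCA generalises this to several components at once.
v = [1.0] * n
for _ in range(100):
    w = grm_matvec(v)
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]
eigval = sum(a * b for a, b in zip(v, grm_matvec(v)))
print(maxdiff < 1e-9, eigval > 0)
```

Because every step of the eigensolver only ever asks the operator for `K @ vec`, replacing the dense matvec with a tree-sequence traversal leaves the rest of the algorithm unchanged, which is how the quadratic space barrier is avoided.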

Language: English

Citations

0
