Compressive Pangenomics Using Mutation-Annotated Networks
Sumit Walia, Harsh Motwani, Kyle Smith, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: July 4, 2024

Pangenomics is an emerging field that uses a collection of genomes of a species instead of a single reference genome to overcome reference bias and study the within-species genetic diversity. Future pangenomics applications will require analyzing large and ever-growing collections of genomes. Therefore, the choice of data representation is a key determinant of the scope, as well as the computational and memory performance, of pangenomic analyses. Current pangenome formats, while capable of storing variations across multiple genomes, fail to capture the shared evolutionary and mutational histories among them, thereby limiting their applications. They are also inefficient for storage, and therefore face significant scaling challenges. In this manuscript, we propose PanMAN, a novel data structure that is information-wise richer than all existing pangenome formats: in addition to representing the alignment and genetic variation in a collection of genomes, PanMAN represents the shared mutational and evolutionary histories inferred between those genomes. By using “evolutionary compression”, PanMAN achieves 5.2 to 680-fold compression over other variation-preserving pangenomic formats. PanMAN’s relative performance generally improves with larger datasets, and it is compatible with any method for inferring phylogenies and ancestral nucleotide states. Using SARS-CoV-2 as a case study, we show that PanMAN offers a detailed and accurate portrayal of the pathogen’s evolutionary and mutational history, facilitating the discovery of new biological insights. We present panmanUtils, a software toolkit that supports common pangenomic analyses and makes PanMANs interoperable with existing tools and formats. PanMANs are poised to enhance the scale, speed, resolution, and overall scope of pangenomic analyses and data sharing.

Language: English

The era of the ARG: An introduction to ancestral recombination graphs and their significance in empirical evolutionary genomics
Alexander L. Lewanski, Michael C. Gründler, Gideon S. Bradburd, et al.

PLoS Genetics, Journal Year: 2024, Volume and Issue: 20(1), P. e1011110 - e1011110

Published: Jan. 18, 2024

In the presence of recombination, the evolutionary relationships between a set of sampled genomes cannot be described by a single genealogical tree. Instead, the genomes are related by a complex, interwoven collection of genealogies formalized in a structure called an ancestral recombination graph (ARG). An ARG extensively encodes the ancestry of the genome(s) and thus is replete with valuable information for addressing diverse questions in evolutionary biology. Despite its potential utility, technological and methodological limitations, along with a lack of approachable literature, have severely restricted awareness and application of ARGs in evolution research. Excitingly, recent progress in ARG reconstruction and simulation has made ARG-based approaches feasible for many questions and systems. In this review, we provide an accessible introduction to and exploration of ARGs, survey recent methodological breakthroughs, and describe the potential for ARGs to further existing goals and open avenues of inquiry that were previously inaccessible in evolutionary genomics. Through this discussion, we aim to more widely disseminate the promise of ARGs in evolutionary genomics and encourage the broader development and adoption of ARG-based inference.

Language: English

Citations: 35

A geographic history of human genetic ancestry
Michael C. Gründler, Jonathan Terhorst, Gideon S. Bradburd, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: March 29, 2024

Describing the distribution of genetic variation across individuals is a fundamental goal of population genetics. In humans, traditional approaches for describing population genetic variation often rely on discrete genetic ancestry labels, which, despite their utility, can obscure the complex, multi-faceted nature of human genetic history. These labels risk oversimplifying ancestry by ignoring its temporal depth and geographic continuity, and may therefore conflate notions of race, ethnicity, geography, and genetic ancestry. Here, we present a method that capitalizes on the rich genealogical information encoded in genomic tree sequences to infer the geographic locations of the shared ancestors of a sample of sequenced individuals. We use this method to infer the geographic history of genetic ancestry of a set of human genomes sampled from Europe, Asia, and Africa, accurately recovering major population movements on those continents. Our findings demonstrate the importance of defining the spatial-temporal context of genetic ancestry to describing human genetic variation and caution against the oversimplified interpretations of genetic data prevalent in contemporary discussions of race and ancestry.

Language: English

Citations: 9

A geographic history of human genetic ancestry
Michael C. Gründler, Jonathan Terhorst, Gideon S. Bradburd, et al.

Science, Journal Year: 2025, Volume and Issue: 387(6741), P. 1391 - 1397

Published: March 27, 2025

Describing the distribution of genetic variation across individuals is a fundamental goal of population genetics. We present a method that capitalizes on the rich genealogical information encoded in genomic tree sequences to infer the geographic locations of the shared ancestors of a sample of sequenced individuals. We used this method to infer the geographic history of genetic ancestry of a set of human genomes sampled from Europe, Asia, and Africa, accurately recovering major population movements on those continents. Our findings demonstrate the importance of defining the spatiotemporal context of genetic ancestry when describing human genetic variation and caution against the oversimplified interpretations of genetic data prevalent in contemporary discussions of race and ancestry.

Language: English

Citations: 1

A general and efficient representation of ancestral recombination graphs
Yan Wong, Anastasia Ignatieva, Jere Koskela, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: Nov. 4, 2023

As a result of recombination, adjacent nucleotides can have different paths of genetic inheritance, and therefore the genealogical trees for a sample of DNA sequences vary along the genome. The structure capturing the details of these intricately interwoven paths of inheritance is referred to as an ancestral recombination graph (ARG). Classical formalisms have focused on mapping coalescence and recombination events to the nodes in an ARG. This approach is out of step with modern developments, which do not represent genetic inheritance in terms of these events or explicitly infer them. We present a simple formalism that defines an ARG in terms of specific genomes and their intervals of genetic inheritance, and show how it generalises these classical treatments and encompasses the outputs of recent methods. We discuss nuances arising from this more general structure, and argue that it forms an appropriate basis for a software standard in this rapidly growing field.

Language: English

Citations: 15

Enabling efficient analysis of biobank-scale data with genotype representation graphs
Drew DeHaas, Ziqing Pan, Xinzhu Wei, et al.

Nature Computational Science, Journal Year: 2024, Volume and Issue: unknown

Published: Dec. 5, 2024

Language: English

Citations: 4

Tsbrowse: an interactive browser for Ancestral Recombination Graphs
Savita Karthikeyan, Ben Jeffery, Duncan Mbuli-Robertson, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: April 23, 2025

Ancestral Recombination Graphs (ARGs) represent the interwoven paths of genetic ancestry for a set of recombining sequences. The ability to capture the evolutionary history of samples makes ARGs valuable in a wide range of applications in population and statistical genetics. ARG-based approaches are increasingly becoming part of genetic data analysis pipelines due to breakthroughs enabling ARG inference at biobank-scale. However, there is a lack of visualisation tools, which are crucial for validating inferences and generating hypotheses. We present tsbrowse, an open-source Python web-app for interactive visualisation of the fundamental building-blocks of ARGs, i.e., nodes, edges and mutations. We demonstrate the application of tsbrowse to various data sources and scenarios, and highlight its key features of browsability along the genome, user interactivity, and scalability to very large sample sizes. Availability: Python package: https://pypi.org/project/tsbrowse/ Development version: https://github.com/tskit.dev/tsbrowse Documentation: https://tskit.dev/tsbrowse/docs/

Language: English

Citations: 0

The length of haplotype blocks and signals of structural variation in reconstructed genealogies
Anastasia Ignatieva, Martina Favero, Jere Koskela, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: July 11, 2023

Recent breakthroughs have enabled the inference of genealogies from large sequencing data-sets, accurately reconstructing local trees that describe genetic ancestry at each locus. These genealogies should also capture the correlation structure of local trees along the genome, reflecting historical recombination events and factors like demography and natural selection. However, whether reconstructed genealogies do this has not been rigorously explored. This is important to address, since uncovering regions that depart from expectations can drive the discovery of new biological phenomena, such as suppression of recombination allowing linked selection over broad regions, as evidenced in humans and by adaptive introgression in various species. We use a theoretical framework to characterise the properties of genealogies, such as the distribution of genomic spans of clades and edges, and demonstrate that our results match observations in simulated scenarios. Testing reconstructed genealogies using leading approaches, we find departures from expectations for all methods. For the method Relate, a set of simple corrections results in almost complete recovery of the target distributions. Applying these corrections to Relate genealogies for 2504 human genomes, we observe an excess of clades with unexpectedly long genomic spans (125 clades with p < 1 · 10^-12, clustering into 50 regions), indicating localised suppression of recombination. The strongest signal corresponds to a known inversion on chromosome 17, while the second strongest represents a previously unknown inversion on chromosome 10, which is most common (21%) in S. Asians and correlates with GWAS hits for a range of phenotypes including immunological traits. Other signals suggest additional inversions (4), copy number changes (2), complex rearrangements or other variants (12), as well as 28 regions with strong support but no clear classification. Our approach can be readily applied to other species, and we show that reconstructed genealogies offer untapped potential to study structural variation and its impacts at a population level, revealing new biological phenomena impacting evolution.

Language: English

Citations: 8

Genotype Representation Graphs: Enabling Efficient Analysis of Biobank-Scale Data
Drew DeHaas, Ziqing Pan, Xinzhu Wei, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: April 28, 2024

Computational analysis of a large number of genomes requires a data structure that can represent the dataset compactly while also enabling efficient operations on variants and samples. Current practice is to store large-scale genetic polymorphism data using tabular data structures and file formats, where rows and columns represent samples and genetic variants. However, encoding genetic data in such formats has become unsustainable. For example, the UK Biobank polymorphism data of 200,000 phased whole genomes has exceeded 350 terabytes (TB) in Variant Call Format (VCF), which is cumbersome and inefficient to work with. To mitigate the computational burden, we introduce the Genotype Representation Graph (GRG), an extremely compact data structure to losslessly present phased whole-genome polymorphisms. A GRG is a fully connected hierarchical graph that exploits variant-sharing across samples, leveraging ideas inspired by Ancestral Recombination Graphs. Capturing variant-sharing in a multitree structure compresses biobank-scale human data to the point where it can fit in a typical server’s RAM (5-26 gigabytes (GB) per chromosome), and enables graph-traversal algorithms to trivially reuse computed values, both of which can significantly reduce computation time. We have developed a command-line tool and a library usable via both C++ and Python for constructing and processing GRG files, which scales to a million whole genomes. It takes 160GB of disk space to encode the information in 200,000 UK Biobank phased whole genomes as a GRG, more than 13 times smaller than the size of compressed VCF. We show that summaries of genetic variants, such as allele frequency and association effect, can be computed on GRG via graph traversal that runs significantly faster than all tested alternatives, including vcf.gz, PLINK BED, tree sequence, XSI, and Savvy. Furthermore, GRG is particularly suitable for doing repeated calculations and interactive data analysis. We anticipate that GRG-based algorithms will improve the scalability of various types of computation and generally lower the cost of analyzing large genomic datasets.

Language: English

Citations: 1

Analysis-ready VCF at Biobank scale using Zarr
Eric Czech, Timothy R. Millar, Will Tyler, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: June 12, 2024

Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results: Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs. Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows. Key Points: VCF is widely supported, and the underlying data model is entrenched in bioinformatics pipelines. The standard row-wise encoding as text (or binary) is inherently inefficient for large-scale data processing. The Zarr format provides an efficient solution, by encoding fields in the VCF separately in chunk-compressed binary format.

Language: English

Citations: 1

Compressive Pangenomics Using Mutation-Annotated Networks
Sumit Walia, Harsh Motwani, Kyle Smith, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: July 4, 2024

Language: English

Citations: 0