Biases in ARG-based inference of historical population size in populations experiencing selection
Jacob I. Marsh, Parul Johri

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: April 26, 2024

Abstract: Inferring the demographic history of populations provides fundamental insights into species dynamics and is essential for developing a null model to accurately study selective processes. However, background selection and selective sweeps can produce genomic signatures at linked sites that mimic or mask signals associated with historical population size change. While the theoretical biases introduced by the linked effects of selection have been well established, it is unclear whether ARG-based approaches to demographic inference in typical empirical analyses are susceptible to mis-inference due to these effects. To address this, we developed highly realistic forward simulations of human and Drosophila melanogaster populations, including empirically estimated variability of gene density, mutation rates, recombination rates, purifying and positive selection, across different historical demographic scenarios, to broadly assess the impact of selection on demographic inference using a genealogy-based approach. Our results indicate that the linked effects of selection minimally impact demographic inference for human populations, though it could cause mis-inference in populations with similar genome architecture and population parameters experiencing more frequent recurrent sweeps. We found that accurate demographic inference of D. melanogaster populations by ARG-based methods is compromised by the presence of pervasive background selection alone, leading to spurious inferences of recent population expansion which may be further worsened by recurrent sweeps, depending on the proportion and strength of beneficial mutations. Caution and additional testing with species-specific simulations are needed when inferring population history with non-human populations using ARG-based approaches to avoid mis-inference due to the linked effects of selection.

Language: English

Biases in ARG-based inference of historical population size in populations experiencing selection
Jacob I. Marsh, Parul Johri

Molecular Biology and Evolution, Journal Year: 2024, Volume and Issue: 41(7)

Published: June 14, 2024

Inferring the demographic history of populations provides fundamental insights into species dynamics and is essential for developing a null model to accurately study selective processes. However, background selection and selective sweeps can produce genomic signatures at linked sites that mimic or mask signals associated with historical population size change. While the theoretical biases introduced by the linked effects of selection have been well established, it is unclear whether ancestral recombination graph (ARG)-based approaches to demographic inference in typical empirical analyses are susceptible to misinference due to these effects. To address this, we developed highly realistic forward simulations of human and Drosophila melanogaster populations, including empirically estimated variability of gene density, mutation rates, recombination rates, purifying, and positive selection, across different historical demographic scenarios, to broadly assess the impact of selection on demographic inference using a genealogy-based approach. Our results indicate that the linked effects of selection minimally impact demographic inference for human populations, although it could cause misinference in populations with similar genome architecture and population parameters experiencing more frequent recurrent sweeps. We found that accurate demographic inference of D. melanogaster populations by ARG-based methods is compromised by the presence of pervasive background selection alone, leading to spurious inferences of recent population expansion, which may be further worsened by recurrent sweeps, depending on the proportion and strength of beneficial mutations. Caution and additional testing with species-specific simulations are needed when inferring population history with non-human populations using ARG-based approaches to avoid misinference due to the linked effects of selection.

Language: English

Citations

2

A forest is more than its trees: haplotypes and inferred ARGs
Halley Fritze, Nathaniel S. Pope, Jerome Kelleher, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Dec. 2, 2024

Foreshadowing haplotype-based methods of the genomics era, it is an old observation that the "junction" between two distinct haplotypes produced by recombination is inherited as a Mendelian marker. In a genealogical context, this recombination-mediated information reflects the persistence of ancestral haplotypes across local genealogical trees in which they do not represent coalescences. We show how these non-coalescing haplotypes ("locally-unary nodes") may be inserted into ancestral recombination graphs (ARGs), a compact but information-rich data structure describing the genealogical relationships among recombinant sequences. The resulting ARGs are smaller, faster to compute with, and the additional ancestral information that is inserted is nearly always correct where the initial ARG is correct. We provide efficient algorithms to infer locally-unary nodes within existing ARGs, and explore some consequences for ARGs inferred from real data. To do this, we introduce new metrics of agreement and disagreement between ARGs that, unlike previous methods, consider ARGs as describing relationships between haplotypes rather than just a collection of trees.

Language: English

Citations

2

Simultaneous Inference of Past Demography and Selection from the Ancestral Recombination Graph under the Beta Coalescent
Kevin Korfmann, Thibaut Sellinger, Fabian Freund, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2022, Volume and Issue: unknown

Published: Sept. 30, 2022

Abstract: The reproductive mechanism of a species is a key driver of genome evolution. The standard Wright-Fisher model for the reproduction of individuals in a population assumes that each individual produces a number of offspring negligible compared to the total population size. Yet many species of plants, invertebrates, prokaryotes or fish exhibit a neutrally skewed offspring distribution or strong selection events yielding few individuals to produce a number of offspring of up to the same magnitude as the population size. As a result, the genealogy of a sample is characterized by multiple individuals (more than two) coalescing simultaneously to the same common ancestor. The current methods developed to detect such multiple merger events do not account for complex demographic scenarios or recombination, and require large sample sizes. We tackle these limitations by developing two novel and different approaches to infer multiple merger events from sequence data or the ancestral recombination graph (ARG): a sequentially Markovian coalescent (SMβC) and a graph neural network (GNNcoal). We first give proof of the accuracy of our methods to estimate the multiple merger parameter and past demographic history using simulated data under the β-coalescent model. Secondly, we show that our approaches can also recover the effect of positive selective sweeps along the genome. Finally, we are able to distinguish skewed offspring distribution from selection while simultaneously inferring the past variation of population size. Our findings stress the aptitude of neural networks to leverage information from the ARG for inference but also the urgent need for more accurate ARG inference approaches.

Language: English

Citations

10

Analysis-ready VCF at Biobank scale using Zarr
Eric Czech, Timothy R. Millar, Will Tyler, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: June 12, 2024

Abstract: Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and hundreds of terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results: Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs. Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows. Key Points: VCF is widely supported, and the underlying data model entrenched in bioinformatics pipelines. The standard row-wise encoding as text (or binary) is inherently inefficient for large-scale data processing. The Zarr format provides an efficient solution, by encoding fields in the VCF separately in chunk-compressed binary format.

Language: English

Citations

1
