MalKinID: A Classification Model for Identifying Malaria Parasite Genealogical Relationships Using Identity-by-Descent
Wesley Wong, Lea Wang, S. F. Schaffner

et al.

Genetics, Journal Year: 2024, Volume and Issue: unknown

Published: Nov. 23, 2024

Abstract Pathogen genomics is a powerful tool for tracking infectious disease transmission. In malaria, identity-by-descent (IBD) is used to assess the genetic relatedness between parasites and has been used to study transmission and importation. In theory, IBD can be used to distinguish genealogical relationships to reconstruct transmission history or identify parasites for quantitative-trait-locus experiments. MalKinID (Malaria Kinship Identifier) is a new classification model designed to identify genealogical relationships among malaria parasites based on genome-wide IBD proportions and IBD segment distributions. MalKinID was calibrated to the genomic data from three laboratory-based genetic crosses (yielding 440 parent-child [PC] and 9060 full-sibling [FS] comparisons). MalKinID identified lab-generated F1 progeny with >80% sensitivity and showed that 0.39 (95% CI 0.28, 0.49) of the second-generation progeny of a NF54 and NHP4026 cross were F1s and 0.56 (0.45, 0.67) were backcrosses of an F1 with the parental NF54 strain. In simulated outcrossed importations, MalKinID reconstructs genealogy history with high precision and sensitivity, with F1-scores exceeding 0.84. However, when importation involves inbreeding, such as during serial co-transmission, the precision and sensitivity of MalKinID declined, with F1-scores (the harmonic mean of precision and sensitivity) of 0.76 (0.56, 0.92) and 0.23 (0.0, 0.4) for PC and FS and <0.05 for second-degree and third-degree relatives. Disentangling inbred relationships required adapting MalKinID to perform multi-sample comparisons. Genealogical inference is most powered when 1) outcrossing is the norm or 2) multi-sample comparisons based on a predefined pedigree are used. MalKinID lays the foundations for using IBD to track parasite transmission history and for separating progeny for quantitative-trait-locus experiments.
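The F1-score reported in this abstract is the harmonic mean of precision and sensitivity. As a minimal sketch (the function names and the count-based helper are illustrative assumptions, not part of MalKinID), it can be computed per relationship class as follows:

```python
def precision_and_sensitivity(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and sensitivity from true-positive, false-positive and
    false-negative counts for one relationship class (e.g. parent-child)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, sensitivity


def f1_score(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity; 0.0 when both are zero."""
    if precision + sensitivity == 0:
        return 0.0
    return 2 * precision * sensitivity / (precision + sensitivity)


# Hypothetical counts, chosen only to illustrate the calculation.
p, s = precision_and_sensitivity(tp=8, fp=2, fn=2)
score = f1_score(p, s)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a classifier with high precision but low sensitivity (or the reverse) still receives a low F1-score.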

Language: English

A General Framework for Branch Length Estimation in Ancestral Recombination Graphs
Yun Deng, Yun S. Song, Rasmus Nielsen

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 15, 2025

Inference of Ancestral Recombination Graphs (ARGs) is of central interest in the analysis of genomic variation. ARGs can be specified in terms of topologies and coalescence times. The coalescence times are usually estimated using an informative prior derived from coalescent theory, but this may generate biased estimates and can also complicate downstream inferences based on ARGs. Here we introduce POLEGON, a novel approach for estimating branch lengths for ARGs which uses an uninformative prior. Using extensive simulations, we show that this method provides improved estimates of coalescence times and can lead to more accurate inferences of effective population sizes under a wide range of demographic assumptions. It also improves other downstream inferences, including estimates of mutation rates. We apply the method to data from the 1000 Genomes Project to investigate population size histories and differential mutation signatures across populations. We also estimate coalescence times in the HLA region, and show that they exceed 30 million years in multiple segments.

Language: English

Citations

2

A genealogy-based approach for revealing ancestry-specific structures in admixed populations
Ji Tang, Charleston W. K. Chiang

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 14, 2025

Elucidating ancestry-specific structures in admixed populations is crucial for comprehending population history and mitigating confounding effects in genome-wide association studies. Existing methods for elucidating the ancestry-specific structures generally rely on frequency-based estimates of the genetic relationship matrix (GRM) among admixed individuals after masking segments from ancestry components not being targeted for investigation. However, these approaches disregard linkage information between markers, potentially limiting their resolution in revealing structure within an ancestry component. We introduce the ancestry-specific expected GRM (as-eGRM), a novel framework for relatedness among admixed individuals. The key design of as-eGRM consists of defining ancestry-specific pairwise relatedness between individuals based on the genealogical trees encoded in the Ancestral Recombination Graph (ARG) and local ancestry calls, and computing the expectation of the ancestry-specific relatedness across the genome. Comprehensive evaluations using both simulated stepping-stone models and empirical datasets based on three-way admixed Latino cohorts showed that analysis based on as-eGRM robustly outperforms existing methods in revealing the structure in admixed populations with diverse demographic histories. Taken together, as-eGRM has the promise to better reveal the fine-scale structure within an ancestry component of admixed individuals, which can help improve the robustness and interpretation of findings from association studies of disease or complex traits in understudied populations.

Language: English

Citations

1

Nested likelihood-ratio testing of the nonsynonymous:synonymous ratio suggests greater adaptation in the piRNA machinery of Drosophila melanogaster compared with Drosophila ananassae and Drosophila willistoni, two species with higher repeat content
Justin P. Blumenstiel, Sarah B. Kingan, Daniel Garrigan

et al.

G3 Genes Genomes Genetics, Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 21, 2025

Abstract Numerous studies have revealed a signature of strong adaptive evolution in the piwi-interacting RNA (piRNA) machinery of Drosophila melanogaster, but the cause of this pattern is not understood. Several hypotheses have been proposed. One hypothesis is that transposable element (TE) families and the piRNA machinery are co-evolving under an evolutionary arms race, perhaps due to antagonism by TEs against the piRNA machinery. A related, though not co-evolutionary, hypothesis is that recurrent TE invasion drives the piRNA machinery to adapt to novel TE strategies. A third hypothesis is that ongoing fluctuation in TE abundance leads to adaptation in the piRNA machinery, which must constantly adjust between sensitivity for detecting new elements and specificity to avoid the cost of off-target gene silencing. Rapid evolution of the piRNA machinery may also be driven independently of TEs, and instead arise from other functions such as the role of piRNAs in suppressing sex-chromosome meiotic drive. We sought to evaluate the impact of TE abundance on adaptive evolution of the piRNA machinery in D. melanogaster and 2 species with higher repeat content: Drosophila ananassae and Drosophila willistoni. This comparison was achieved by employing a likelihood-based hypothesis testing framework based on the McDonald–Kreitman test. We show that we can reject a faster rate of adaptive evolution in the piRNA machinery of these 2 species. We propose that the high rate of adaptation in D. melanogaster is driven by either a recent influx of TEs that occurred during range expansion or selection …

Language: English

Citations

1

On ARGs, pedigrees, and genetic relatedness matrices
Brieuc Lehmann, Hanbin Lee, Luke Anderson-Trocmé

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: March 5, 2025

Abstract Genetic relatedness is a central concept in genetics, underpinning studies of population and quantitative genetics in human, animal, and plant settings. It is typically stored as a genetic relatedness matrix (GRM), whose elements are pairwise relatedness values between individuals. This relatedness has been defined in various contexts based on pedigree, genotype, phylogeny, coalescent times, and, recently, the ancestral recombination graph (ARG). ARG-based GRMs have been found to better capture the population structure and improve association studies relative to the genotype GRM. However, calculating ARG-based GRMs and further operations with them is fundamentally challenging due to inherent quadratic time and space complexity. Here, we first discuss the different definitions of relatedness in a unifying context, making use of the additive model of a quantitative trait to provide a definition of “branch relatedness” and the corresponding “branch GRM”. We explore the relationship between branch relatedness and pedigree relatedness through a case study of French-Canadian individuals that have a known pedigree. Through the tree sequence encoding of an ARG, we then derive an efficient algorithm for computing products between the branch GRM and a general vector, without explicitly forming the branch GRM. This algorithm leverages the sparse encoding of genomes with the tree sequence and hence enables large-scale computations with the branch GRM. We demonstrate the power of this algorithm by developing a randomized principal components algorithm for tree sequences that easily scales to millions of genomes. All algorithms are implemented in the open source tskit Python package. Taken together, this work consolidates the different notions of relatedness as branch relatedness, and by leveraging the tree sequence encoding of an ARG it provides efficient algorithms that enable computations with the branch GRM that scale to mega-scale genomic datasets.
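The idea of multiplying a GRM by a vector without forming the matrix can be illustrated on the simpler genotype GRM. The sketch below is not the tree-sequence algorithm of the paper and does not use tskit; it assumes a GRM of the form G = Z Zᵀ / m, where Z is a column-centered n × m genotype matrix, and all function names are hypothetical. Computing Z(Zᵀv)/m costs O(nm) and never stores the n × n matrix.

```python
def center_columns(genotypes):
    """Subtract each variant's (column's) mean so every column sums to zero."""
    n = len(genotypes)
    m = len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    return [[row[j] - means[j] for j in range(m)] for row in genotypes]


def grm_vector_product(z, v):
    """Return (Z Z^T / m) v without ever building the n x n matrix."""
    n, m = len(z), len(z[0])
    # Step 1: u = Z^T v, one entry per variant.
    u = [sum(z[i][j] * v[i] for i in range(n)) for j in range(m)]
    # Step 2: Z u / m, one entry per individual.
    return [sum(z[i][j] * u[j] for j in range(m)) / m for i in range(n)]


def explicit_grm(z):
    """Reference implementation: the full n x n GRM, for comparison only."""
    n, m = len(z), len(z[0])
    return [[sum(z[i][k] * z[j][k] for k in range(m)) / m for j in range(n)]
            for i in range(n)]


# Toy data: 3 individuals, 3 variants, genotypes coded 0/1/2 (illustrative only).
z = center_columns([[0, 1, 2], [1, 1, 0], [2, 0, 1]])
v = [1.0, -2.0, 0.5]
fast = grm_vector_product(z, v)
```

Matrix-vector products of this kind are the only operation a randomized principal components method needs from the GRM, which is why avoiding the explicit matrix matters at scale.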

Language: English

Citations

1

Flax domestication processes as inferred from genome-wide SNP data
Yong‐Bi Fu

Scientific Reports, Journal Year: 2025, Volume and Issue: 15(1)

Published: March 13, 2025

Abstract Flax (Linum usitatissimum L.) is one of the founder crops domesticated for oil and fiber uses in the Near-Eastern Fertile Crescent, but its domestication history remains largely elusive. Genetic inferences so far have expanded our knowledge of several aspects of flax domestication, such as the wild progenitor, the first use of domesticated flax, and domestication events. However, little is known about the domestication processes involving multiple domestication events. This study applied genotyping-by-sequencing to infer flax domestication processes. Ninety-three Linum samples representing four groups of domesticated flax (oilseed, fiber, winter and capsular dehiscence) and its wild progenitor (or pale flax; L. bienne Mill.) were sequenced. SNP calling identified 16,998 SNPs that were widely distributed across 15 flax chromosomes. Diversity analysis found that pale flax had the largest nucleotide diversity, followed by indehiscent, winter, oilseed and fiber cultivated flax. Pale flax seemed to be under population contraction, while the other four groups of flax were under population expansion after a bottleneck. Demographic inferences showed that the five groups of flax carried clear genetic signals of mixture events associated with domestication. Phylogenetic analysis revealed that oilseed, fiber and winter flax formed two separate phylogenetic subclades. One subclade was abundant in … along with some … mainly originating from the Near East and nearby regions. The other … from Europe and other parts of the world. Dating of the divergences, with an assumption of 10,000 years before present (BP) for flax domestication, placed the spread at 5800 BP and the occurrence of winter hardiness at 5100 BP. These findings provide new significant insights into flax domestication processes.

Language: English

Citations

1

Estimating dispersal rates and locating genetic ancestors with genome-wide genealogies
Matthew M. Osmond, Graham Coop

eLife, Journal Year: 2024, Volume and Issue: 13

Published: Nov. 26, 2024

Spatial patterns in genetic diversity are shaped by individuals dispersing from their parents and larger-scale population movements. It has long been appreciated that these patterns of movement shape the underlying genealogies along the genome, leading to geographic patterns of isolation-by-distance in contemporary population genetic data. However, extracting the enormous amount of information contained in genealogies along recombining sequences has, until recently, not been computationally feasible. Here, we capitalize on important recent advances in genome-wide gene-genealogy reconstruction and develop methods to use thousands of trees to estimate per-generation dispersal rates and to locate the genetic ancestors of a sample back through time. We take a likelihood approach in continuous space using a simple approximate model (branching Brownian motion) as our prior distribution of spatial genealogies. After testing our method with simulations we apply it to Arabidopsis thaliana. We estimate a dispersal rate of roughly 60 km²/generation, slightly higher across latitude than across longitude, potentially reflecting a northward post-glacial expansion. Locating ancestors allows us to visualize major geographic movements, alternative geographic histories, and admixture. Our approach highlights the huge amount of information about past dispersal events and population movements contained in genome-wide genealogies.

Language: English

Citations

6

Estimating dispersal rates and locating genetic ancestors with genome-wide genealogies
Matthew M. Osmond, Graham Coop

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2021, Volume and Issue: unknown

Published: July 14, 2021

Abstract Spatial patterns in genetic diversity are shaped by individuals dispersing from their parents and larger-scale population movements. It has long been appreciated that these patterns of movement shape the underlying genealogies along the genome, leading to geographic patterns of isolation by distance in contemporary population genetic data. However, extracting the enormous amount of information contained in genealogies along recombining sequences has, until recently, not been computationally feasible. Here we capitalize on important recent advances in genome-wide gene-genealogy reconstruction and develop methods to use thousands of trees to estimate per-generation dispersal rates and to locate the genetic ancestors of a sample back through time. We take a likelihood approach in continuous space using a simple approximate model (branching Brownian motion) as our prior distribution of spatial genealogies. After testing our method with simulations we apply it to Arabidopsis thaliana. We estimate a dispersal rate of roughly 60 km² per generation, slightly higher across latitude than across longitude, potentially reflecting a northward post-glacial expansion. Locating ancestors allows us to visualize major geographic movements, alternative geographic histories, and admixture. Our approach highlights the huge amount of information about past dispersal events and population movements contained in genome-wide genealogies.

Language: English

Citations

22

Consequences of training data composition for deep learning models in single-cell biology
Ajay Nadig, Akshaya Thoutam, Madeline Hughes

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 24, 2025

Foundation models for single-cell transcriptomics have the potential to augment (or replace) purpose-built tools for a variety of common analyses, especially when data are sparse. Recent work with large language models has shown that training data composition greatly shapes performance; however, to date, single-cell foundation models have ignored this aspect, opting instead to train on the largest possible corpus. We systematically investigate the consequences of training dataset composition on the behavior of deep learning models of single-cell transcriptomics, focusing on human hematopoiesis as a tractable model system and including cells from adult and developing tissues, disease states, and perturbation atlases. We find that (1) these models generalize poorly to unseen cell types, (2) adding malignant cells to a healthy cell training corpus does not necessarily improve modeling of unseen malignant cells, and (3) including an embryonic stem cell differentiation atlas during training improves performance on out-of-distribution tasks. Our results emphasize the importance of diverse training data and suggest strategies to optimize future single-cell foundation models.

Language: English

Citations

0

Bayesian StairwayPlot for Inferring Single Population Demographic Histories From Site Frequency Spectra
Sebastian Höhna, Ana Catalán

Molecular Ecology Resources, Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 26, 2025

The StairwayPlot approach provides an elegant, flexible and powerful method to estimate complex demographic histories of single populations from site frequency spectrum data. It uses expected coalescent times to compute the expected site frequency spectrum within a multinomial likelihood function. Population sizes are allowed to vary freely between coalescent events but are constant within each interval. Here, we implement the StairwayPlot approach in the Bayesian software package RevBayes. We use approaches developed for Bayesian Skyline Plots, which include independent and identically distributed (i.i.d.) population sizes, Gaussian Markov random fields and Horseshoe Markov random fields as prior distributions on population sizes. Furthermore, we implement a recently developed approach for computing the leave-one-out cross-validation probability for efficient model selection. We compare the inference of our implementation to the original Maximum Likelihood implementation, StairwayPlot2. Our results show that the Bayesian StairwayPlot in RevBayes performs comparably to StairwayPlot2 in terms of parameter accuracy, which is expected given that both approaches use the same underlying likelihood function. From the set of prior models, the Gaussian Markov random field performed best for smoothly varying demographic histories, while the Horseshoe Markov random field performed best for abruptly changing demographic histories. We conclude our study by exploring several choices often faced in empirical studies, including the total sequence length, the assumed mutation rate, as well as biases through mis-calling ancestral alleles. We show using an example that as few as 10 diploid individuals are sufficient to infer the demographic histories, but at least 500 k single nucleotide polymorphisms (SNPs) are required.
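The multinomial likelihood over a site frequency spectrum can be sketched in its simplest setting. The example below assumes a constant population size, where the expected number of sites with derived-allele count i is proportional to 1/i under the standard neutral coalescent; StairwayPlot itself lets the size vary between coalescent events, which this sketch does not attempt. The function names are illustrative and are not RevBayes or StairwayPlot2 calls.

```python
import math


def expected_sfs_constant_size(n: int) -> list[float]:
    """Expected unfolded SFS proportions for n sampled chromosomes under a
    constant-size neutral model: frequency class i (1..n-1) is weighted 1/i."""
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


def multinomial_log_likelihood(observed: list[int], probs: list[float]) -> float:
    """Multinomial log-likelihood of observed SFS counts, omitting the
    multinomial coefficient (a constant that does not depend on the model)."""
    return sum(c * math.log(p) for c, p in zip(observed, probs) if c > 0)


# Hypothetical SFS counts for n = 4 chromosomes (classes 1, 2 and 3).
observed = [60, 30, 20]
probs = expected_sfs_constant_size(4)
loglik = multinomial_log_likelihood(observed, probs)
```

Competing demographic models differ only in the expected proportions they produce, so comparing models amounts to comparing this quantity across their predicted spectra.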

Language: English

Citations

0

Likelihoods for a general class of ARGs under the SMC
Gertjan Bisschop, Jerome Kelleher, Peter L. Ralph

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 27, 2025

Ancestral recombination graphs (ARGs) are the focus of much ongoing research interest. Recent progress in inference has made ARG-based approaches feasible across a range of applications, and many new methods using inferred ARGs as input have appeared. This progress on the long-standing problem of ARG inference has proceeded in two distinct directions. First, Bayesian inference of ARGs under the Sequentially Markov Coalescent (SMC) is now practical for tens-to-hundreds of samples. Second, approximate models and heuristics can scale to sample sizes three orders of magnitude larger. Although these heuristic methods are reasonably accurate under many metrics, one significant drawback is that the ARGs they estimate do not have the topological properties required to compute a likelihood under models such as the SMC under present-day formulations. In particular, heuristic inference methods typically do not estimate precise details about recombination events, which are currently required to compute a likelihood. In this paper we present a backwards-time formulation of the SMC and derive a straightforward definition of the likelihood of a general class of ARGs under this model. We show that this formulation does not require precise details of recombination events to be estimated, and is robust to the presence of polytomies. We discuss the possibilities for inference that this opens.

Language: English

Citations

0