
EBioMedicine, Journal Year: 2024, Volume and Issue: 110, P. 105441 - 105441
Published: Nov. 8, 2024
Language: English
medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown
Published: Feb. 10, 2025
Polygenic risk scores (PRS) are becoming increasingly vital for risk prediction and stratification in precision medicine. However, PRS model training presents significant challenges for the broader adoption of PRS, including limited access to computational resources, difficulties in implementing advanced PRS methods, and availability and privacy concerns over individual-level genetic data. Cloud computing provides a promising solution with centralized computing and data resources. Here we introduce PennPRS ( https://pennprs.org ), a scalable cloud computing platform for online PRS model training. We developed novel pseudo-training algorithms for multiple PRS methods and ensemble approaches, enabling model training without requiring individual-level data. These methods were rigorously validated through extensive simulations and large-scale real data analyses involving 6,000 phenotypes across various data sources. PennPRS supports single- and multi-ancestry PRS training with seven methods, allowing users to upload their own data or query from more than 27,000 datasets in the GWAS Catalog, submit jobs, and download trained PRS models. Additionally, we applied our pseudo-training pipeline to train PRS models for 8,000 phenotypes and made their PRS weights publicly accessible. In summary, PennPRS provides a cloud computing solution to improve the accessibility of PRS applications and reduce disparities in computational resources for the global PRS research community.
Language: English
Citations
0
The American Journal of Human Genetics, Journal Year: 2025, Volume and Issue: 112(4), P. 727 - 740
Published: April 1, 2025
Language: English
Citations
0
The American Journal of Human Genetics, Journal Year: 2024, Volume and Issue: unknown
Published: Nov. 1, 2024
Language: English
Citations
3
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: June 12, 2024
Abstract. Background: Variant Call Format (VCF) is the standard file format for interchanging genetic variation data and associated quality control metrics. The usual row-wise encoding of the VCF data model (either as text or packed binary) emphasises efficient retrieval of all data for a given variant, but accessing data on a field or sample basis is inefficient. Biobank scale datasets currently available consist of hundreds of thousands of whole genomes and terabytes of compressed VCF. Row-wise data storage is fundamentally unsuitable and a more scalable approach is needed. Results: Zarr is a format for storing multi-dimensional data that is widely used across the sciences, and is ideally suited to massively parallel processing. We present the VCF Zarr specification, an encoding of the VCF data model using Zarr, along with fundamental software infrastructure for efficient and reliable conversion at scale. We show how this format is far more efficient than standard VCF based approaches, and competitive with specialised methods for storing genotype data in terms of compression ratios and single-threaded calculation performance. We present case studies on subsets of three large human datasets (Genomics England: n=78,195; Our Future Health: n=651,050; All of Us: n=245,394) along with whole genome datasets for Norway Spruce (n=1,063) and SARS-CoV-2 (n=4,484,157). We demonstrate the potential for VCF Zarr to enable a new generation of high-performance and cost-effective applications via illustrative examples using cloud computing and GPUs. Conclusions: Large row-encoded VCF files are a major bottleneck for current research, and storing and processing these files incurs a substantial cost. The VCF Zarr specification, building on widely-used, open-source technologies, has the potential to greatly reduce these costs, and may enable a diverse ecosystem of next-generation tools for analysing genetic variation data directly from cloud-based object stores, while maintaining compatibility with existing file-oriented workflows. Key Points: VCF is widely supported, and the underlying data model is entrenched in bioinformatics pipelines. The standard row-wise encoding as text (or binary) is inherently inefficient for large-scale data processing. The Zarr format provides an efficient solution, by encoding fields in the VCF separately in chunk-compressed binary format.
Language: English
Citations
1
EBioMedicine, Journal Year: 2024, Volume and Issue: 110, P. 105441 - 105441
Published: Nov. 8, 2024
Language: English
Citations
0