Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers
Alexander Karollus, Thomas Mauermeier, Julien Gagneur

et al.

Genome biology, Journal Year: 2023, Volume and Issue: 24(1)

Published: March 27, 2023

Abstract. Background: The largest sequence-based models of transcription control to date are obtained by predicting genome-wide gene regulatory assays across the human genome. This setting is fundamentally correlative, as those models are exposed during training solely to the sequence variation between human genes that arose through evolution, questioning the extent to which those models capture genuine causal signals. Results: Here we confront predictions of state-of-the-art models of transcription regulation against data from two large-scale observational studies and five deep perturbation assays. The most advanced of these sequence-based models, Enformer, by and large, captures causal determinants of human promoters. However, models fail to capture the causal effects of enhancers on expression, notably in medium to long distances and particularly for highly expressed promoters. More generally, the predicted impact of distal elements on gene expression predictions is small, and the ability to correctly integrate long-range information is significantly more limited than the receptive fields of the models suggest. This is likely caused by the escalating class imbalance between actual and candidate regulatory elements as distance increases. Conclusions: Our results suggest that sequence-based models have advanced to the point that in silico study of promoter regions and promoter variants can provide meaningful insights, and we provide practical guidance on how to use them. Moreover, we foresee that it will require significantly more and particularly new kinds of data to train models accurately accounting for distal elements.

Language: English

Systematic differences in discovery of genetic effects on gene expression and complex traits
Hakhamanesh Mostafavi, Jeffrey P. Spence, Sahin Naqvi

et al.

Nature Genetics, Journal Year: 2023, Volume and Issue: 55(11), P. 1866 - 1875

Published: Oct. 19, 2023

Language: English

Citations

155

Incomplete Penetrance and Variable Expressivity: From Clinical Studies to Population Cohorts
Rebecca Kingdom, Caroline F. Wright

Frontiers in Genetics, Journal Year: 2022, Volume and Issue: 13

Published: July 25, 2022

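The same genetic variant found in different individuals can cause a range of diverse phenotypes, from no discernible clinical phenotype to severe disease, even among related individuals. Such variants can be said to display incomplete penetrance, a binary phenomenon where the genotype either causes the expected clinical phenotype or it does not, or they can be said to display variable expressivity, in which the same genotype can cause a wide range of clinical symptoms across a spectrum. Both incomplete penetrance and variable expressivity are thought to be caused by a range of factors, including common variants, variants in regulatory regions, epigenetics, environmental factors, and lifestyle. Many thousands of genetic variants have been identified as the cause of monogenic disorders, mostly determined through small clinical studies, and thus, the penetrance and expressivity of these variants may be overestimated when compared to their effect on the general population. With the wealth of population cohort data currently available, the penetrance and expressivity of such genetic variants can be investigated across a much wider contingent, potentially helping to reclassify variants that were previously thought to be completely penetrant. Research into the penetrance and expressivity of such genetic variants is important for clinical classification, both for determining causative mechanisms of disease in the affected population and for providing accurate risk information through genetic counseling. A genotype-based definition of the causes of rare diseases incorporating information from population cohorts and clinical studies is critical for our understanding of incomplete penetrance and variable expressivity. This review examines our current knowledge of the penetrance and expressivity of genetic variants in rare disease and across populations, as well as looking into the potential causes of the variation seen, including genetic modifiers, mosaicism, and polygenic factors, among others. We also considered the challenges that come with investigating penetrance and expressivity.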

Language: English

Citations

143

Multiple causal variants underlie genetic associations in humans
Nathan S. Abell, Marianne K. DeGorter, Michael J. Gloudemans

et al.

Science, Journal Year: 2022, Volume and Issue: 375(6586), P. 1247 - 1254

Published: March 17, 2022

Associations between genetic variation and traits are often in noncoding regions with strong linkage disequilibrium (LD), where a single causal variant is assumed to underlie the association. We applied a massively parallel reporter assay (MPRA) to functionally evaluate genetic variants in high, local LD for independent cis-expression quantitative trait loci (eQTL). We found that 17.7% of eQTLs exhibit more than one major allelic effect in tight LD. The detected regulatory variants were highly and specifically enriched for activating chromatin structures and allelic transcription factor binding. Integration of MPRA profiles with eQTL/complex trait colocalizations across 114 human traits and diseases identified causal variant sets demonstrating how genetic association signals can manifest through multiple, tightly linked causal variants.

Language: English

Citations

138

CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions
Max Schubach, Thorben Maaß, Lusiné Nazaretyan

et al.

Nucleic Acids Research, Journal Year: 2024, Volume and Issue: 52(D1), P. D1143 - D1154

Published: Jan. 5, 2024

Machine Learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are readily provided via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/ to the community.

Language: English

Citations

129

iPSC-based disease modeling and drug discovery in cardinal neurodegenerative disorders
Hideyuki Okano, Satoru Morimoto

Cell stem cell, Journal Year: 2022, Volume and Issue: 29(2), P. 189 - 208

Published: Feb. 1, 2022

Language: English

Citations

121

AlphaPeptDeep: a modular deep learning framework to predict peptide properties for proteomics
Wen‐Feng Zeng, Xie‐Xuan Zhou, Sander Willems

et al.

Nature Communications, Journal Year: 2022, Volume and Issue: 13(1)

Published: Nov. 24, 2022

Machine learning and in particular deep learning (DL) are increasingly important in mass spectrometry (MS)-based proteomics. Recent DL models can predict the retention time, ion mobility and fragment intensities of a peptide just from the amino acid sequence with good accuracy. However, DL is a very rapidly developing field with new neural network architectures frequently appearing, which are challenging to incorporate for proteomics researchers. Here we introduce AlphaPeptDeep, a modular Python framework built on the PyTorch DL library that learns and predicts the properties of peptides (https://github.com/MannLabs/alphapeptdeep). It features a model shop that enables non-specialists to create models in just a few lines of code. AlphaPeptDeep represents post-translational modifications in a generic manner, even if only the chemical composition is known. Extensive use of transfer learning obviates the need for large data sets to refine models for particular experimental conditions. The AlphaPeptDeep models for predicting retention time, collisional cross sections and fragment intensities are at least on par with existing tools. Additional sequence-based properties can also be predicted by AlphaPeptDeep, as demonstrated with an HLA peptide prediction model to improve HLA peptide identification for data-independent acquisition (https://github.com/MannLabs/PeptDeep-HLA).

Language: English

Citations

105

The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza‐Revilla

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: Jan. 15, 2023

Closing the gap between measurable genetic information and observable traits is a longstanding challenge in genomics. Yet, the prediction of molecular phenotypes from DNA sequences alone remains limited and inaccurate, often driven by the scarcity of annotated data and the inability to transfer learnings between prediction tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named the Nucleotide Transformer, ranging from 50M up to 2.5B parameters and integrating information from 3,202 diverse human genomes, as well as 850 genomes selected across diverse phyla, including both model and non-model organisms. These transformer models yield transferable, context-specific representations of nucleotide sequences, which allow for accurate molecular phenotype prediction even in low-data settings. We show that the developed models can be fine-tuned at low cost, despite the low available data regime, to solve a variety of genomics applications. Despite no supervision, the transformer models learned to focus attention on key genomic elements, including those that regulate gene expression, such as enhancers. Lastly, we demonstrate that utilizing model representations can improve the prioritization of functional genetic variants. The training and application of foundational models in genomics explored in this study provide a widely applicable stepping stone to bridge the gap of accurate molecular phenotype prediction from DNA sequence. Code and weights are available at https://github.com/instadeepai/nucleotide-transformer in Jax and https://huggingface.co/InstaDeepAI in Pytorch. Example notebooks to apply these models to any downstream task are available at https://huggingface.co/docs/transformers/notebooks#pytorch-bio.

Language: English

Citations

105

Generating specificity in genome regulation through transcription factor sensitivity to chromatin
Luke Isbel, Ralph S. Grand, Dirk Schübeler

et al.

Nature Reviews Genetics, Journal Year: 2022, Volume and Issue: 23(12), P. 728 - 740

Published: July 12, 2022

Language: English

Citations

101

DeepConsensus improves the accuracy of sequences with a gap-aware sequence transformer
Gunjan Baid, Daniel E. Cook, Kishwar Shafin

et al.

Nature Biotechnology, Journal Year: 2022, Volume and Issue: unknown

Published: Sept. 1, 2022

Language: English

Citations

94

Applications of transformer-based language models in bioinformatics: a survey
Shuang Zhang, Rui Fan, Yuti Liu

et al.

Bioinformatics Advances, Journal Year: 2023, Volume and Issue: 3(1)

Published: Jan. 1, 2023

Abstract. Summary: The transformer-based language models, including vanilla transformer, BERT and GPT-3, have achieved revolutionary breakthroughs in the field of natural language processing (NLP). Since there are inherent similarities between various biological sequences and natural languages, the remarkable interpretability and adaptability of these models have prompted a new wave of their application in bioinformatics research. To provide a timely and comprehensive review, we introduce key developments of transformer-based language models by describing the detailed structure of transformers and summarize their contribution to a wide range of bioinformatics research from basic sequence analysis to drug discovery. While transformer-based applications in bioinformatics are diverse and multifaceted, we identify and discuss the common challenges, including heterogeneity of training data, computational expense and model interpretability, and opportunities in the context of bioinformatics research. We hope that the broader community of NLP researchers, bioinformaticians and biologists will be brought together to foster future research and development in transformer-based language models, and inspire novel bioinformatics applications that are unattainable by traditional methods. Supplementary information: Supplementary data are available at Bioinformatics Advances online.

Language: English

Citations

94