Nature, Year: 2025, Issue: 637(8047), pp. 965-973
Published: Jan. 8, 2025
Transcriptional regulation, which involves a complex interplay between regulatory sequences and proteins, directs all biological processes. Computational models of transcription lack generalizability to accurately extrapolate to unseen cell types and conditions. Here we introduce GET (general expression transformer), an interpretable foundation model designed to uncover regulatory grammars across 213 human fetal and adult cell types [1,2]. Relying exclusively on chromatin accessibility data and sequence information, GET achieves experimental-level accuracy in predicting gene expression even in previously unseen cell types [3]. GET also shows remarkable adaptability across new sequencing platforms and assays, enabling regulatory inference across a broad range of cell types and conditions, and uncovers universal and cell-type-specific transcription factor interaction networks. We evaluated its performance in prediction of regulatory activity, inference of regulatory elements and regulators, and identification of physical interactions between transcription factors, and found that it outperforms current models [4] in predicting lentivirus-based massively parallel reporter assay readout [5,6]. In fetal erythroblasts [7], we identified distal (greater than 1 Mbp) regulatory regions that were missed by previous models, and, in B cells, we identified a lymphocyte-specific transcription factor-transcription factor interaction that explains the functional significance of a leukaemia risk predisposing germline mutation [8-10]. In sum, we provide a generalizable and accurate model for transcription together with catalogues of gene regulation and transcription factor interactions, all with cell type specificity.
Language: English
Cited by: 4

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown
Published: March 4, 2024
The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
Language: English
Cited by: 12

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown
Published: March 15, 2024
Foundation models have achieved remarkable success in several fields such as natural language processing, computer vision and more recently biology. DNA foundation models in particular are emerging as a promising approach for genomics. However, so far no model has delivered granular, nucleotide-level predictions across a wide range of genomic and regulatory elements, limiting their practical usefulness. In this paper, we build on our previous work on the Nucleotide Transformer (NT) to develop a segmentation model, SegmentNT, that processes input DNA sequences up to 30kb-long to predict 14 different classes of genomic elements at single nucleotide resolution. By utilizing pre-trained weights from NT, SegmentNT surpasses the performance of several ablation models, including convolution networks with one-hot encoded nucleotide sequences and models trained from scratch. SegmentNT can process multiple sequence lengths with zero-shot generalization for sequences of up to 50kb. We show improved detection of splice sites throughout the genome and demonstrate strong nucleotide-level precision. Because it evaluates all gene elements simultaneously, SegmentNT can predict the impact of sequence variants not only on splice site changes but also on exon and intron rearrangements in transcript isoforms. Finally, we show that a SegmentNT model trained on human genomic elements can generalize to plant species and that a multispecies SegmentNT model achieves stronger generalization for genic elements on unseen species. In summary, SegmentNT demonstrates that DNA foundation models can tackle complex, granular tasks in genomics at single-nucleotide resolution. SegmentNT can be easily extended to additional genomic elements and species, thus representing a new paradigm for how we analyze and interpret DNA. We make the SegmentNT-30kb models available on our GitHub repository in JAX and in a HuggingFace space in PyTorch.
Language: English
Cited by: 8

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown
Published: Dec. 25, 2024
Despite extensive mapping of cis-regulatory elements (cREs) across cellular contexts with chromatin accessibility assays, the sequence syntax and genetic variants that regulate transcription factor (TF) binding and chromatin accessibility at context-specific cREs remain elusive. We introduce ChromBPNet, a deep learning DNA sequence model of base-resolution accessibility profiles that detects, learns and deconvolves assay-specific enzyme biases from regulatory sequence determinants of accessibility, enabling robust discovery of compact TF motif lexicons, cooperative motif syntax and precision footprints across assays and sequencing depths. Extensive benchmarks show that ChromBPNet, despite its lightweight design, is competitive with much larger contemporary models at predicting variant effects on chromatin accessibility, pioneer TF binding and reporter activity across assays, cell contexts and ancestry, while providing interpretation of disrupted regulatory syntax. ChromBPNet also helps prioritize and interpret regulatory variants that influence complex traits and rare diseases, thereby providing a powerful lens to decode regulatory DNA and genetic variation.
Language: English
Cited by: 8

Cell Systems, Year: 2025, Issue: unknown, p. 101205
Published: March 1, 2025
Language: English
Cited by: 1

bioRxiv (Cold Spring Harbor Laboratory), Year: 2023, Issue: unknown
Published: Dec. 21, 2023
Regulatory DNA sequences within enhancers and promoters bind transcription factors to encode cell type-specific patterns of gene expression. However, the regulatory effects and programmability of such DNA sequences remain difficult to map or predict because we have lacked scalable methods to precisely edit regulatory DNA and quantify the effects in an endogenous genomic context. Here we present an approach to measure the quantitative effects of hundreds of designed DNA sequence variants on gene expression, by combining pooled CRISPR prime editing with RNA fluorescence
Language: English
Cited by: 14

PLoS Computational Biology, Year: 2025, Issue: 21(2), p. e1012824
Published: Feb. 4, 2025
Interphase mammalian genomes are folded in 3D with complex locus-specific patterns that impact gene regulation. CTCF (CCCTC-binding factor) is a key architectural protein that binds specific DNA sites, halts cohesin-mediated loop extrusion, and enables long-range chromatin interactions. There are hundreds of thousands of annotated CTCF-binding sites in mammalian genomes; disruptions of some result in distinct phenotypes, while others have no visible effect. Despite their importance, the determinants of which CTCF sites are necessary for genome folding and gene regulation remain unclear. Here, we update and utilize Akita, a convolutional neural network model, to extract the sequence preferences and grammar of CTCF contributing to genome folding. Our analyses of individual CTCF sites reveal four predictions: (i) only a small fraction of genomic sites are impactful; (ii) impact is highly dependent on sequences flanking the core CTCF binding motif; (iii) core and flanking nucleotides contribute largely additively to overall site impact; (iv) sites created as combinations of different core and flanking sequences have impacts proportional to the product of their average impacts, i.e. they are broadly compatible. Our analysis of collections of CTCF sites makes two predictions for multi-motif grammar: (i) insulation strength depends on the number of CTCF sites within a cluster, and (ii) pattern formation is governed by the orientation and spacing of these sites, rather than any inherent specialization of the CTCF motifs themselves. In sum, we present a framework for using neural network models to probe the sequences instructing genome folding and provide a guide for future experimental inquiries.
Language: English
Cited by: 0

Journal of Advanced Research, Year: 2025, Issue: unknown
Published: Feb. 1, 2025
Synthetic biology revolutionizes our ability to decode and recode genetic systems. The capability to reconstruct and flexibly manipulate multi-gene systems is critical for understanding cellular behaviors and has significant applications in therapeutics. This study aims to construct a diverse library of synthetic tunable promoters (STPs) to enable flexible control of multi-gene expression in mammalian cells. We designed and constructed synthetic promoters that incorporate both a universal activation site (UAS) and a specific activation site (SAS), enabling multi-level control via the CRISPR activation (CRISPRa) system. To evaluate promoter activity, we utilized Massively Parallel Reporter Assays (MPRA) to assess the basal strengths of the STPs and their activation responses. Next, we constructed a three-gene reporter system to assess the capacity of the STPs for achieving multilevel control of single-gene expression within multi-gene systems. The library contains 24,960 unique non-redundant promoters with distinct sequence characteristics. MPRA revealed a wide range of activities, showing different levels when activated by the CRISPRa system. When regulated by targeting the SAS, the STPs exhibited orthogonality, allowing multilevel control of single-gene expression without cross-interference. Furthermore, combinatorial activation enlarged the scope of expression levels achievable, providing fine-tuned control over gene expression. We provide a collection of promoters, offering a valuable toolkit for the construction and manipulation of multi-gene systems in mammalian cells, with applications in gene therapy and biotechnology.
Language: English
Cited by: 0

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown
Published: May 24, 2024
Engineering regulatory DNA sequences with precise activity levels in specific cell types holds immense potential for medicine and biotechnology. However, the vast combinatorial space of possible sequences and the complex regulatory grammars governing gene regulation have proven challenging for existing approaches. Supervised deep learning models that score sequences proposed by local search algorithms ignore the global structure of functional sequence space. While diffusion-based generative models have shown promise in learning these distributions, their application to regulatory DNA has been limited. Evaluating the quality of generated sequences also remains challenging due to a lack of a unified framework that characterizes key properties of regulatory DNA. Here we introduce DNA Discrete Diffusion (D3), a generative framework for conditionally sampling regulatory sequences with targeted functional activity levels. We develop a comprehensive suite of evaluation metrics that assess the functional similarity, sequence similarity and regulatory composition of generated sequences. Through benchmarking on three high-quality functional genomics datasets spanning human promoters and fly enhancers, we demonstrate that D3 outperforms existing methods in capturing the diversity of cis-regulatory grammars and generating sequences that more accurately reflect the properties of genomic regulatory DNA. Furthermore, we show that D3-generated sequences can effectively augment supervised models and improve their predictive performance, even in data-limited scenarios.
Language: English
Cited by: 2

Cell Metabolism, Year: 2024, Issue: 36(8), pp. 1639-1641
Published: Aug. 1, 2024
Language: English
Cited by: 2