Nature Methods, Journal year: 2024, Issue: unknown
Published: Nov. 28, 2024
Language: English
Trends in Genetics, Journal year: 2025, Issue: unknown
Published: Jan. 1, 2025
Language: English
Cited: 4
bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown
Published: Mar. 15, 2024
Foundation models have achieved remarkable success in several fields such as natural language processing, computer vision and, more recently, biology. DNA foundation models in particular are emerging as a promising approach for genomics. However, so far no model has delivered granular, nucleotide-level predictions across a wide range of genomic and regulatory elements, limiting their practical usefulness. In this paper, we build on our previous work on the Nucleotide Transformer (NT) to develop a segmentation model, SegmentNT, that processes input sequences up to 30kb long to predict 14 different classes of genomic elements at single-nucleotide resolution. By utilizing pre-trained weights from NT, SegmentNT surpasses the performance of ablation models, including convolution networks trained from scratch on one-hot encoded sequences. SegmentNT can process multiple sequence lengths and shows zero-shot generalization to sequences of up to 50kb. We show improved detection of splice sites throughout the genome and demonstrate strong precision. Because it evaluates all gene elements simultaneously, SegmentNT can assess the impact of variants not only on splice site changes but also on exon and intron rearrangements in transcript isoforms. Finally, SegmentNT trained on human genomic elements generalizes to plant species, and a multispecies model achieves stronger performance on genic elements of unseen species. In summary, SegmentNT demonstrates that DNA foundation models can tackle complex, granular tasks in genomics at single-nucleotide resolution and can be easily extended to additional elements and species, thus representing a new paradigm for how we analyze and interpret DNA. We make SegmentNT-30kb available on our github repository in Jax and on a HuggingFace space in Pytorch.
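An illustrative sketch (not the authors' released code) of how one might query a SegmentNT-style checkpoint through the standard `transformers` API: the repository id, output attribute names, and logit shape are assumptions, so the official SegmentNT repository should be consulted for the exact interface.

```python
# Hedged sketch: per-nucleotide segmentation with a SegmentNT-style checkpoint.
# MODEL_ID is a placeholder, not a verified repository name.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "org/segment-nt-30kb"  # placeholder id; check the SegmentNT repo

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

sequence = "ATGC" * 1000  # 4 kb toy sequence; the 30kb model accepts much longer inputs
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Assumed output: per-nucleotide logits for 14 element classes,
# shape (batch, sequence_length, num_classes). A sigmoid turns them into
# independent per-class probabilities at single-nucleotide resolution.
probs = torch.sigmoid(outputs.logits)
print(probs.shape)
```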
Language: English
Cited: 8
bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2025, Issue: unknown
Published: Jan. 8, 2025
Modeling long-range DNA dependencies is crucial for understanding genome structure and function across a wide range of biological contexts. However, effectively capturing these extensive dependencies, which may span millions of base pairs in tasks such as three-dimensional (3D) chromatin folding prediction, remains a significant challenge. Furthermore, a comprehensive benchmark suite for evaluating methods that rely on such dependencies is notably absent. To address this gap, we introduce DNALongBench, a benchmark dataset encompassing five important genomics tasks that consider long-range dependencies of up to 1 million base pairs: enhancer-target gene interaction, expression quantitative trait loci, 3D genome organization, regulatory sequence activity, and transcription initiation signals. To comprehensively assess the benchmark, we evaluate the performance of five methods: a task-specific expert model, a convolutional neural network (CNN)-based model, and three fine-tuned DNA foundation models - HyenaDNA, Caduceus-Ph, and Caduceus-PS. We envision DNALongBench as a standardized resource with the potential to facilitate comparisons and rigorous evaluations of emerging sequence-based deep learning models that account for long-range dependencies.
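A minimal sketch of how such a long-range benchmark task might be scored; this is not the DNALongBench API, and the record fields and baseline are hypothetical placeholders chosen only to show the evaluation loop.

```python
# Hedged sketch: score a model on (sequence, target) pairs with Pearson correlation.
import numpy as np
from scipy.stats import pearsonr

def evaluate(model_fn, examples):
    """model_fn maps a DNA string (up to ~1 Mb) to a scalar or vector prediction."""
    preds, targets = [], []
    for seq, y in examples:
        preds.append(np.asarray(model_fn(seq)).ravel())
        targets.append(np.asarray(y).ravel())
    r, _ = pearsonr(np.concatenate(preds), np.concatenate(targets))
    return r

# Toy example with a trivial GC-content baseline standing in for a real model.
toy_examples = [("ACGT" * 250_000, 0.5), ("AATT" * 250_000, 0.0)]
gc_baseline = lambda s: (s.count("G") + s.count("C")) / len(s)
print(evaluate(gc_baseline, toy_examples))
```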
Language: English
Cited: 1
PLoS Computational Biology, Journal year: 2025, Issue: 21(1), Pages: e1012755 - e1012755
Published: Jan. 10, 2025
Complex deep learning models trained on very large datasets have become key enabling tools for current research in natural language processing and computer vision. By providing pre-trained models that can be fine-tuned for specific applications, they enable researchers to create accurate models with minimal effort and computational resources. Large scale genomics models come in two flavors: the first are trained on DNA sequences in a self-supervised fashion, similar to the corresponding language models; the second are supervised models that leverage data from ENCODE and other sources. We argue that these models are the equivalent of foundation models in their utility, as they encode within them chromatin state in its different aspects, providing useful representations that allow quick deployment to tasks in gene regulation. We demonstrate this premise by leveraging the recently created Sei model to develop a simple, interpretable model of intron retention, with an advantage over models based on DNABERT-2. Our work also demonstrates the impact of chromatin state on the regulation of intron retention. Using the representations learned by Sei, our model is able to discover the involvement of transcription factors and chromatin marks in regulating intron retention, with better accuracy than published custom models developed for this purpose.
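A conceptual sketch of the "foundation model as feature extractor" idea described above: frozen chromatin-state representations (from a Sei-like model) feed a simple, interpretable classifier. The `embed_sequence` function below is a hypothetical stand-in for the real Sei forward pass, and the toy data is illustrative only.

```python
# Hedged sketch: frozen embeddings + interpretable linear classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed_sequence(seq: str, dim: int = 64) -> np.ndarray:
    """Placeholder embedding; a real pipeline would return Sei chromatin-state scores."""
    rng = np.random.default_rng(abs(hash(seq)) % (2**32))
    return rng.normal(size=dim)

# Toy dataset: label 1 if the intron is retained, else 0.
seqs = ["ACGT" * 100, "GGCC" * 100, "ATAT" * 100, "CGCG" * 100]
labels = np.array([1, 0, 1, 0])

X = np.stack([embed_sequence(s) for s in seqs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# The linear coefficients are directly inspectable: large-magnitude weights point to
# the chromatin features (e.g. transcription factors, histone marks) driving the call.
print(clf.coef_.shape, clf.score(X, labels))
```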
Language: English
Cited: 0
Nature Machine Intelligence, Journal year: 2025, Issue: unknown
Published: Mar. 13, 2025
Cited: 0
Nature Methods, Journal year: 2024, Issue: 21(8), Pages: 1374 - 1377
Published: Aug. 1, 2024
Language: English
Cited: 3
bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown
Published: May 24, 2024
Engineering regulatory DNA sequences with precise activity levels in specific cell types holds immense potential for medicine and biotechnology. However, the vast combinatorial space of possible sequences and the complex grammars governing gene regulation have proven challenging for existing approaches. Supervised deep learning models that score sequences proposed by local search algorithms ignore the global structure of functional sequence space. While diffusion-based generative models have shown promise in learning these distributions, their application to regulatory DNA has been limited. Evaluating the quality of generated sequences also remains challenging due to a lack of a unified framework that characterizes the key properties of regulatory DNA. Here we introduce Discrete Diffusion (D3), a generative framework for conditionally sampling regulatory sequences with targeted activity levels. We develop a comprehensive suite of evaluation metrics to assess the similarity, composition and activity of generated sequences. Through benchmarking on three high-quality genomics datasets spanning human promoters and fly enhancers, we demonstrate that D3 outperforms existing methods in capturing the diversity of cis-regulatory grammar and in generating sequences that more accurately reflect genomic properties. Furthermore, we show that D3-generated sequences can effectively augment supervised models and improve predictive performance, even in data-limited scenarios.
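A sketch of one evaluation idea mentioned above: comparing the k-mer composition of generated sequences against real genomic sequences. The specific metric (Jensen-Shannon distance over 3-mer frequencies) is illustrative and not claimed to be the paper's exact evaluation suite.

```python
# Hedged sketch: compare 3-mer composition of generated vs. genomic sequences.
from collections import Counter
from itertools import product
import numpy as np
from scipy.spatial.distance import jensenshannon

def kmer_freqs(seqs, k=3):
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts.values()) or 1
    return np.array([counts[km] / total for km in vocab])

real = ["ACGTGCA" * 20, "TTGACGT" * 20]
generated = ["ACGTGGA" * 20, "TTGACCT" * 20]

# 0 means identical 3-mer composition; larger values mean the generated set drifts
# away from the genomic distribution.
print(jensenshannon(kmer_freqs(real), kmer_freqs(generated)))
```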
Language: English
Cited: 2
bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown
Published: Jul. 17, 2024
Protein aggregation is a pathological hallmark of more than fifty human diseases and a major problem for biotechnology. Methods have been proposed to predict aggregation from sequence, but these have been trained and evaluated on small and biased experimental datasets. Here we directly address this data shortage by experimentally quantifying the amyloid nucleation of >100,000 protein sequences. This unprecedented dataset reveals the limited performance of existing computational methods and allows us to train CANYA, a convolution-attention hybrid neural network that accurately predicts amyloid nucleation from sequence. We adapt genomic interpretability analyses to reveal CANYA's decision-making process and learned grammar. Our results illustrate the power of massive analysis of random sequence-spaces and provide an interpretable and robust model of amyloid nucleation.
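A generic convolution-attention hybrid for sequence-to-scalar prediction, included only to illustrate the model family named above; the layer sizes, depth, and pooling choice are arbitrary and this is not the published CANYA architecture.

```python
# Hedged sketch: convolution (local motifs) followed by self-attention (context),
# pooled into a single nucleation score per sequence.
import torch
import torch.nn as nn

class ConvAttentionRegressor(nn.Module):
    def __init__(self, vocab=20, channels=64, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, channels)                            # amino-acid embedding
        self.conv = nn.Conv1d(channels, channels, kernel_size=5, padding=2)   # local motif detector
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)  # longer-range context
        self.head = nn.Linear(channels, 1)                                    # scalar score

    def forward(self, tokens):                              # tokens: (batch, length)
        x = self.embed(tokens)                              # (B, L, C)
        x = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        x, _ = self.attn(x, x, x)
        return self.head(x.mean(dim=1)).squeeze(-1)         # mean-pool over positions

model = ConvAttentionRegressor()
toy = torch.randint(0, 20, (8, 30))                         # 8 random 30-residue peptides
print(model(toy).shape)                                     # torch.Size([8])
```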
Language: English
Cited: 1
bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown
Published: Nov. 15, 2024
Deep neural networks (DNNs) have advanced predictive modeling for regulatory genomics, but challenges remain in ensuring the reliability of their predictions and understanding the key factors behind their decision making. Here we introduce DEGU (Distilling Ensembles for Genomic Uncertainty-aware models), a method that integrates ensemble learning and knowledge distillation to improve the robustness and explainability of DNN predictions. DEGU distills an ensemble of DNNs into a single model, capturing both the average of the ensemble's predictions and the variability across them, with the latter representing epistemic (or model-based) uncertainty. DEGU also includes an optional auxiliary task to estimate aleatoric, or data-based, uncertainty from experimental replicates. By applying DEGU to various functional genomic prediction tasks, we demonstrate that DEGU-trained models inherit the performance benefits of ensembles, with improved generalization to out-of-distribution sequences and more consistent explanations of cis-regulatory mechanisms through attribution analysis. Moreover, DEGU-trained models provide calibrated uncertainty estimates, with conformal prediction offering coverage guarantees under minimal assumptions. Overall, DEGU paves the way for robust and trustworthy applications of deep learning in genomics research.
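A sketch of the ensemble-distillation idea described above: a student network is trained to reproduce both the ensemble mean and the across-ensemble standard deviation (the epistemic-uncertainty signal). The architecture, loss weighting, and toy teachers are illustrative assumptions, not DEGU's exact implementation.

```python
# Hedged sketch: distill ensemble mean and variability into a two-headed student.
import torch
import torch.nn as nn

def distillation_loss(student_out, ensemble_preds):
    """student_out: (batch, 2) predicted mean and std.
    ensemble_preds: (n_models, batch) raw ensemble predictions."""
    target_mean = ensemble_preds.mean(dim=0)
    target_std = ensemble_preds.std(dim=0)
    pred_mean, pred_std = student_out[:, 0], student_out[:, 1]
    return nn.functional.mse_loss(pred_mean, target_mean) + \
           nn.functional.mse_loss(pred_std, target_std)

# Toy example: 5 "teacher" predictions for 16 examples with 100 input features each.
x = torch.randn(16, 100)
ensemble_preds = torch.stack([torch.randn(16) for _ in range(5)])  # stand-in teachers

student = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 2))
loss = distillation_loss(student(x), ensemble_preds)
loss.backward()
print(float(loss))
```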
Language: English
Cited: 0
bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown
Published: Dec. 5, 2024
Language models applied to protein sequences have become a panacea, enabling therapeutics development, materials engineering, and core biology research. Despite the successes of protein language models, genome language models remain nascent. Recent studies suggest the bottleneck is data volume or modeling context size, since long-range interactions are widely acknowledged but sparsely annotated. However, it may be the case that even short DNA sequences are modeled poorly by existing approaches, with current models unable to represent the wide array of functions encoded in DNA. To study this, we develop AIDO.DNA, a pretrained module for DNA representation in an AI-driven Digital Organism [1]. AIDO.DNA is a seven billion parameter encoder-only transformer trained on 10.6 billion nucleotides from a dataset of 796 species. By scaling model size while maintaining a context length of 4k nucleotides, AIDO.DNA shows substantial improvements across a breadth of supervised, generative, and zero-shot tasks relevant to functional genomics, synthetic biology, and drug development. Notably, AIDO.DNA outperforms prior architectures without new data, suggesting that new scaling laws are needed to achieve compute-optimal DNA language models. Models and code are available through Model-Generator at https://github.com/genbio-ai/AIDO and on Hugging Face at https://huggingface.co/genbio-ai.
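A hedged sketch of pulling an encoder-only DNA model from the genbio-ai Hugging Face organization with the generic `transformers` API. The exact checkpoint name and whether remote code is required are assumptions; the linked Hugging Face page and Model-Generator repository document the supported loading path.

```python
# Hedged sketch: embed a DNA sequence with an encoder-only transformer.
# MODEL_ID is a placeholder; verify the published checkpoint name before use.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_ID = "genbio-ai/AIDO.DNA-7B"  # placeholder, see https://huggingface.co/genbio-ai

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

# An encoder-only model returns one hidden vector per token, which downstream
# tasks can pool or fine-tune.
inputs = tokenizer("ACGTACGTGGCCAATT", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
print(hidden.shape)  # (1, num_tokens, hidden_dim), assuming a standard encoder output
```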
Language: English
Cited: 0