Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale
Caleb N. Ellington, Nian X. Sun, Nicholas Ho, et al.

License: Creative Commons

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Dec. 5, 2024

Abstract: Language models applied to protein sequences have become a panacea, enabling therapeutics development, materials engineering, and core biology research. Despite the successes of protein language models, genome language models remain nascent. Recent studies suggest the bottleneck is data volume or modeling context size, since long-range interactions are widely acknowledged but sparsely annotated. However, it may be the case that even short DNA sequences are modeled poorly by existing approaches, and current models are unable to represent the wide array of functions encoded by DNA. To study this, we develop AIDO.DNA, a pretrained module for DNA representation in an AI-driven Digital Organism [1]. AIDO.DNA is a seven billion parameter encoder-only transformer trained on 10.6 billion nucleotides from a dataset of 796 species. By scaling model size while maintaining a short context length of 4k nucleotides, AIDO.DNA shows substantial improvements across a breadth of supervised, generative, and zero-shot tasks relevant to functional genomics, synthetic biology, and drug development. Notably, AIDO.DNA outperforms prior encoder-only architectures without new data, suggesting that new scaling laws are needed to achieve compute-optimal DNA language models. Models and code are available through ModelGenerator at https://github.com/genbio-ai/AIDO and on Hugging Face at https://huggingface.co/genbio-ai.

Language: English

Genomic language models: opportunities and challenges
Gonzalo Benegas, Chengzhong Ye, Carlos Albors, et al.

Trends in Genetics, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 1, 2025

Language: English

Citations

4

Recipes and ingredients for deep learning models of 3D genome folding
Paulina N. Smaruj, Yao Xiao, Geoffrey Fudenberg, et al.

License: Creative Commons

Current Opinion in Genetics & Development, Journal Year: 2025, Volume and Issue: 91, P. 102308

Published: Jan. 24, 2025

Language: English

Citations

0

From computational models of the splicing code to regulatory mechanisms and therapeutic implications
Charlotte Capitanchik, Oscar G. Wilkins, Nils Wagner, et al.

Nature Reviews Genetics, Journal Year: 2024, Volume and Issue: unknown

Published: Oct. 2, 2024

Language: English

Citations

2

A Large-Scale Foundation Model for RNA Function and Structure Prediction
S. Zou, Tianhua Tao, Parvez Mahbub, et al.

License: Creative Commons

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Nov. 29, 2024

Abstract: Originally marginalized as an intermediate in the information flow from DNA to protein, RNA has become the star of modern biology, holding the key to precision therapeutics, genetic engineering, evolutionary origins, and our understanding of fundamental cellular processes. Yet RNA is as mysterious as it is prolific, serving as an information store, a messenger, and a catalyst, spanning many undercharacterized functional and structural classes. Deciphering the language of RNA is important not only for a mechanistic understanding of its biological functions but also for accelerating drug design. Toward this goal, we introduce AIDO.RNA, a pre-trained module for RNA in an AI-driven Digital Organism [1]. AIDO.RNA contains a scale of 1.6 billion parameters, trained on 42 million non-coding RNA (ncRNA) sequences at single-nucleotide resolution, and achieves state-of-the-art performance on a comprehensive set of tasks, including structure prediction, regulation, molecular function across species, and RNA sequence design. After domain adaptation, AIDO.RNA learns to model essential parts of protein translation that protein language models, which have received widespread attention in recent years, do not. More broadly, AIDO.RNA hints at the generality of biological sequence modeling and the ability to leverage the central dogma to improve biomolecular representations. Models and code are available through ModelGenerator at https://github.com/genbio-ai/AIDO and on Hugging Face.

Language: English

Citations

1

Interpreting deep neural networks for the prediction of translation rates
Frederick Korbel, Ekaterina Eroshok, Uwe Ohler, et al.

License: Creative Commons

BMC Genomics, Journal Year: 2024, Volume and Issue: 25(1)

Published: Nov. 9, 2024

The 5' untranslated region of mRNA strongly impacts the rate of translation initiation. A recent convolutional neural network (CNN) model accurately quantifies the relationship between massively parallel synthetic 5' untranslated regions (5'UTRs) and translation levels. However, the underlying biological features, which drive model predictions, remain elusive. Uncovering sequence determinants predictive of output may allow us to develop a more detailed understanding of translation regulation at the 5'UTR.

Language: English

Citations

0
