The OMG dataset: An Open MetaGenomic corpus for mixed-modality genomic language modeling

Andre Cornman, Jacob West-Roberts, Antônio Pedro Camargo et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: Aug. 17, 2024

Abstract Biological language model performance depends heavily on pretraining data quality, diversity, and size. While metagenomic datasets feature enormous biological diversity, their utilization as pretraining data has been limited due to challenges in accessibility, quality filtering, and deduplication. Here, we present the Open MetaGenomic (OMG) corpus, a genomic pretraining dataset totalling 3.1T base pairs and 3.3B protein coding sequences, obtained by combining the two largest metagenomic repositories (JGI’s IMG and EMBL’s MGnify). We first document the composition of the corpus and describe the quality-filtering steps taken to remove poor-quality data. We make the OMG corpus available as a mixed-modality sequence dataset that represents multi-gene encoding sequences with translated amino acids for protein coding sequences and nucleic acids for intergenic sequences. We train the first mixed-modality genomic language model (gLM2) that leverages genomic context information to learn robust functional representations, as well as coevolutionary signals in protein-protein interfaces and regulatory syntax. Furthermore, we show that deduplication in embedding space can be used to balance the corpus, demonstrating improved performance on downstream tasks. The OMG dataset is publicly hosted on the Hugging Face Hub at https://huggingface.co/datasets/tattabio/OMG and gLM2 is available at https://huggingface.co/tattabio/gLM2_650M .
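Since both the dataset and the model are hosted on the Hugging Face Hub, a few lines of standard `datasets`/`transformers` code are enough to inspect them. A minimal sketch, assuming the corpus exposes a `train` split and that gLM2 ships custom modeling code (hence `trust_remote_code=True`); the record fields printed below are not documented guarantees:

```python
# Minimal sketch: stream the OMG corpus and load gLM2 from the Hub.
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer

# Stream rather than materialize the full 3.1T-bp corpus locally.
omg = load_dataset("tattabio/OMG", split="train", streaming=True)
example = next(iter(omg))
print(example.keys())  # inspect the mixed-modality record layout

# gLM2 is assumed to ship custom modeling code, hence trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained("tattabio/gLM2_650M", trust_remote_code=True)
model = AutoModel.from_pretrained("tattabio/gLM2_650M", trust_remote_code=True)
```

The abstract also reports that deduplication in embedding space is used to balance the corpus. The sketch below illustrates the general technique only (greedy removal of near-duplicate embeddings by cosine similarity), not the paper's exact procedure:

```python
# Hedged illustration of embedding-space deduplication, not the paper's method.
import numpy as np

def dedupe_by_embedding(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Return indices of embeddings to keep (greedy near-duplicate removal)."""
    # L2-normalize so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        # Keep a sequence only if it is not too similar to anything already kept.
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

# Toy usage: 1000 random 256-d "embeddings".
rng = np.random.default_rng(0)
keep = dedupe_by_embedding(rng.standard_normal((1000, 256)))
print(f"kept {len(keep)} of 1000 sequences")
```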

Language: English

Citations

6

Genomic language models: opportunities and challenges
Gonzalo Benegas, Chengzhong Ye, Carlos Albors et al.

Trends in Genetics, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 1, 2025

Language: English

Citations

4

Are protein language models the new universal key?
Konstantin Weißenow, Burkhard Rost

Current Opinion in Structural Biology, Journal Year: 2025, Volume and Issue: 91, P. 102997 - 102997

Published: Feb. 7, 2025

Protein language models (pLMs) capture some aspects of the grammar of the language of life as written in protein sequences. The so-called pLM embeddings implicitly contain this information. Therefore, embeddings can serve as the exclusive input into downstream supervised methods for protein prediction. Over the last 33 years, evolutionary information extracted through simple averaging for specific protein families from multiple sequence alignments (MSAs) has been the most successful universal key to the success of protein prediction. For many applications, MSA-free pLM-based predictions have now become significantly more accurate. The reason is often a combination of two aspects. Firstly, embeddings condense the grammar so efficiently that downstream prediction methods succeed with small models, i.e., they need few free parameters, in particular in the era of exploding deep neural networks. Secondly, embeddings provide protein-specific solutions. As an additional benefit, once pre-training is complete, pLM-based solutions tend to consume much fewer resources than MSA-based solutions. In fact, we appeal to the community to rather optimize existing foundation models than to retrain new ones, and to evolve incentives that require this, even at a loss of accuracy. Although pLMs have not, yet, succeeded in entirely replacing the body of solutions developed over three decades, they clearly are rapidly advancing.
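The workflow this review describes, frozen pLM embeddings as the exclusive input to a small supervised downstream model, can be sketched in a few lines. This is an illustrative example only: it uses a small public ESM2 checkpoint as a stand-in pLM, and the two sequences and binary labels are made up:

```python
# Sketch: mean-pool frozen pLM embeddings, then fit a tiny supervised head.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
plm = EsmModel.from_pretrained("facebook/esm2_t6_8M_UR50D").eval()

def embed(seq: str) -> torch.Tensor:
    """Mean-pooled per-protein embedding from the frozen pLM."""
    inputs = tokenizer(seq, return_tensors="pt")
    with torch.no_grad():
        out = plm(**inputs).last_hidden_state  # (1, length, hidden)
    return out.mean(dim=1).squeeze(0)

# Toy data: two proteins with placeholder labels.
seqs = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQ"]
X = torch.stack([embed(s) for s in seqs]).numpy()
clf = LogisticRegression().fit(X, [0, 1])
print(clf.predict(X))
```

The design point the review makes is visible here: because the embedding already condenses the relevant signal, the trainable downstream model can be as small as a logistic regression.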

Language: English

Citations

1

Teaching AI to speak protein
Michael Heinzinger, Burkhard Rost

Current Opinion in Structural Biology, Journal Year: 2025, Volume and Issue: 91, P. 102986 - 102986

Published: Feb. 21, 2025

Language: English

Citations

1
