Protein Set Transformer: A protein-based genome language model to power high diversity viromics DOI
Karthik Anantharaman, Cody Martin, Anthony Gitter

et al.

Research Square (Research Square), Journal Year: 2024, Volume and Issue: unknown

Published: Sept. 23, 2024

Exponential increases in microbial and viral genomic data demand transformational advances in scalable, generalizable frameworks for their interpretation. Standard homology-based functional analyses are hindered by the rapid divergence of microbial and especially viral genomes and proteins that significantly decreases the volume of usable data. Here, we present Protein Set Transformer (PST), a protein-based genome language model that models genomes as sets of proteins without considering sparsely available functional labels. Trained on >100k viruses, PST outperformed other homology- and language model-based approaches for relating viral genomes based on shared protein content. Further, PST demonstrated protein structural and functional awareness by clustering capsid-fold-containing proteins with known capsid proteins and uniquely clustering late gene proteins within related viruses. Our data establish PST as a valuable method for diverse viral genomics, ecology, and evolutionary applications. We posit that the PST framework can be a foundation model for microbial genomics when trained on suitable data.

Language: English

Genomic language models: opportunities and challenges DOI
Gonzalo Benegas, Chengzhong Ye, Carlos Albors

et al.

Trends in Genetics, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 1, 2025

Language: English

Citations

4

Scientific Large Language Models: A Survey on Biological & Chemical Domains DOI Open Access
Qiang Zhang, Keyan Ding, Tingting Lv

et al.

ACM Computing Surveys, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 26, 2025

Large Language Models (LLMs) have emerged as a transformative power in enhancing natural language comprehension, representing a significant stride toward artificial general intelligence. The application of LLMs extends beyond conventional linguistic boundaries, encompassing specialized linguistic systems developed within various scientific disciplines. This growing interest has led to the advent of scientific LLMs, a novel subclass specifically engineered for facilitating scientific discovery. As a burgeoning area in the community of AI for Science, scientific LLMs warrant comprehensive exploration. However, a systematic and up-to-date survey introducing them is currently lacking. In this paper, we endeavor to methodically delineate the concept of “scientific language”, whilst providing a thorough review of the latest advancements in scientific LLMs. Given the expansive realm of scientific disciplines, our analysis adopts a focused lens, concentrating on the biological and chemical domains. This includes an in-depth examination of LLMs for textual knowledge, small molecules, macromolecular proteins, genomic sequences, and their combinations, analyzing them in terms of model architectures, capabilities, datasets, and evaluation. Finally, we critically examine the prevailing challenges and point out promising research directions along with the advances of LLMs. By offering a comprehensive overview of technical developments in this field, this survey aspires to be an invaluable resource for researchers navigating the intricate landscape of scientific LLMs.

Language: English

Citations

3

Evaluating the representational power of pre-trained DNA language models for regulatory genomics DOI Creative Commons
Ziqi Tang, Nikunj V. Somia, Yiyang Yu

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: March 4, 2024

The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.

Language: English

Citations

12

Synthetic genomes unveil the effects of synonymous recoding DOI Creative Commons
Ákos Nyerges, Anush Chiappino-Pepe, Bogdan Budnik

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: June 16, 2024

Engineering the genetic code of an organism provides the basis for (i) making any organism safely resistant to natural viruses and (ii) preventing genetic information flow into and out of genetically modified organisms while (iii) allowing the biosynthesis of genetically encoded unnatural polymers [1–4]. Achieving these three goals requires the reassignment of multiple of the 64 codons nature uses to encode proteins. However, synonymous codon replacement (recoding) is frequently lethal, and how recoding impacts fitness remains poorly explored. Here, we explore these effects using whole-genome synthesis, multiplexed directed evolution, and genome-transcriptome-translatome-proteome co-profiling on multiple recoded genomes. Using this information, we assemble a synthetic Escherichia coli genome in seven sections using only 57 codons to encode proteins. By discovering the rules responsible for the lethality of synonymous recoding and developing a data-driven multi-omics-based genome construction workflow that troubleshoots synthetic genomes, we overcome the lethal effects of 62,007 synonymous codon swaps and 11,108 additional genomic edits. We show that synonymous recoding induces transcriptional noise including new antisense RNAs, leading to drastic transcriptome and proteome perturbation. As the elimination of select codons from an organism’s genetic code results in the widespread appearance of cryptic promoters, we show that synonymous codon choice may naturally evolve to minimize transcriptional noise. Our work provides the first genome-scale description of how synonymous codon changes influence organismal fitness and paves the way for the construction of functional genomes that provide genetic firewalls from natural ecosystems and safely produce biopolymers, drugs, and enzymes with an expanded chemistry.

Language: English

Citations

5

DeepInterAware: Deep Interaction Interface‐Aware Network for Improving Antigen‐Antibody Interaction Prediction from Sequence Data DOI Creative Commons

Yiben Xia, Zhiwei Wang, Feng Huang

et al.

Advanced Science, Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 11, 2025

Identifying interactions between candidate antibodies and target antigens is a key step in developing effective human therapeutics. The antigen–antibody interaction (AAI) occurs at the structural level, but the limited structure data poses a significant challenge. However, recent studies revealed that structural information can be learned from the vast amount of sequence data, indicating that interaction prediction can benefit from the abundance of antigen and antibody sequences. In this study, DeepInterAware (deep interaction interface‐aware network) is proposed, a framework dynamically incorporating interaction interface information directly learned from sequence data, along with the inherent specificity information of the sequences. Experimental results demonstrate that DeepInterAware outperforms existing methods and exhibits promising inductive capabilities for predicting interactions involving unseen antigens or antibodies, and transfer capabilities for similar tasks. More notably, DeepInterAware has unique advantages that existing methods lack. First, DeepInterAware can dive into the underlying mechanisms of AAIs, offering the ability to identify potential binding sites. Second, it is proficient in detecting mutations within antigens or antibodies, and can be extended for precise predictions of the binding free energy changes upon mutations. The HER2‐targeting antibody screening experiment further underscores DeepInterAware's exceptional capability in identifying binding antibodies for target antigens, establishing DeepInterAware as an important tool for antibody screening.

Language: English

Citations

0

Large language model applications in nucleic acid research DOI
Lei Li, Zhao Cheng

Published: Jan. 1, 2025

Language: English

Citations

0

ABI and generative biology: A new paradigm for gene therapy, genome engineering, and engineered cell therapy DOI

Adrian Woolfson

Molecular Therapy, Journal Year: 2025, Volume and Issue: unknown

Published: March 1, 2025

Language: English

Citations

0

The design and engineering of synthetic genomes DOI
Joshua S. James, Junbiao Dai, Wei Leong Chew

et al.

Nature Reviews Genetics, Journal Year: 2024, Volume and Issue: unknown

Published: Nov. 6, 2024

Language: English

Citations

3

Protein Set Transformer: A protein-based genome language model to power high diversity viromics DOI Creative Commons
Cody Martin, Anthony Gitter, Karthik Anantharaman

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: July 29, 2024

Exponential increases in microbial and viral genomic data demand transformational advances in scalable, generalizable frameworks for their interpretation. Standard homology-based functional analyses are hindered by the rapid divergence of microbial and especially viral genomes and proteins that significantly decreases the volume of usable data. Here, we present Protein Set Transformer (PST), a protein-based genome language model that models genomes as sets of proteins without considering sparsely available functional labels. Trained on >100k viruses, PST outperformed other homology- and language model-based approaches for relating viral genomes based on shared protein content. Further, PST demonstrated protein structural and functional awareness by clustering capsid-fold-containing proteins with known capsid proteins and uniquely clustering late gene proteins within related viruses. Our data establish PST as a valuable method for diverse viral genomics, ecology, and evolutionary applications. We posit that the PST framework can be a foundation model for microbial genomics when trained on suitable data.

Language: English

Citations

0
