The gene function prediction challenge: Large language models and knowledge graphs to the rescue

Rohan Shawn Sunil, Shan Chun Lim, Manoj Itharajula et al.

Current Opinion in Plant Biology, Journal Year: 2024, Volume and Issue: 82, P. 102665 - 102665

Published: Nov. 22, 2024

Language: English

Nucleotide Transformer: building and evaluating robust foundation models for human genomics (Creative Commons)

Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza-Revilla et al.

Nature Methods, Journal Year: 2024, Volume and Issue: unknown

Published: Nov. 28, 2024

The prediction of molecular phenotypes from DNA sequences remains a longstanding challenge in genomics, often driven by limited annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named Nucleotide Transformer, ranging from 50 million up to 2.5 billion parameters and integrating information from 3,202 human genomes and 850 genomes from diverse species. These transformer models yield context-specific representations of nucleotide sequences, which allow for accurate predictions even in low-data settings. We show that the developed models can be fine-tuned at low cost to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements and can be used to improve the prioritization of genetic variants. The training and application of foundational models in genomics provides a widely applicable approach for accurate molecular phenotype prediction from DNA sequence. Nucleotide Transformer is a series of genomics foundation models of different parameter sizes and training datasets that can be applied to various downstream tasks by fine-tuning.

Language: English

Citations

36

Genomic language models: opportunities and challenges

Gonzalo Benegas, Chengzhong Ye, Carlos Albors et al.

Trends in Genetics, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 1, 2025

Language: English

Citations

4

Evaluating the representational power of pre-trained DNA language models for regulatory genomics (Creative Commons)

Ziqi Tang, Nikunj V. Somia, Yiyang Yu et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: March 4, 2024

The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.

Language: English

Citations

12

Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model (Creative Commons)

Jingjing Zhai, Aaron Gokaslan, Yair Schiff et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: June 5, 2024

Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and offer cross-species prediction better than supervised models through fine-tuning on limited labeled data. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a curated dataset of 16 Angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks, including predicting translation initiation/termination sites and splice donor and acceptor sites, demonstrated high transferability to the 160 million year diverged maize, outperforming the best existing DNA LM by 1.45 to 7.23-fold. PlantCaduceus is competitive with state-of-the-art protein LMs in terms of deleterious mutation identification, and is threefold better than PhyloP. Additionally, PlantCaduceus successfully identifies well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.

Language: English

Citations

6

Confronting The Data Deluge: How Artificial Intelligence Can Be Used in the Study of Plant Stress (Creative Commons)

Eugene Koh, Rohan Shawn Sunil, Hilbert Yuen In Lam et al.

Computational and Structural Biotechnology Journal, Journal Year: 2024, Volume and Issue: 23, P. 3454 - 3466

Published: Sept. 17, 2024

Language: English

Citations

5

Artificial intelligence-driven plant bio-genomics research: a new era

Yang Lin, Hao Wang, Meiling Zou et al.

Tropical Plants, Journal Year: 2025, Volume and Issue: 4(1), P. 0 - 0

Published: Jan. 1, 2025

Language: English

Citations

0

Application of machine learning and genomics for orphan crop improvement (Creative Commons)

Tessa R. MacNish, Monica F. Danilevicz, Philipp E. Bayer et al.

Nature Communications, Journal Year: 2025, Volume and Issue: 16(1)

Published: Jan. 24, 2025

Orphan crops are important sources of nutrition in developing regions and many are tolerant to biotic and abiotic stressors; however, modern crop improvement technologies have not been widely applied to orphan crops due to the lack of resources available. There are orphan crop representatives across major crop types, and the conservation of genes between these related species can be used in crop improvement. Machine learning (ML) has emerged as a promising tool for crop improvement. Transferring knowledge from major crops to orphan crops and using machine learning can improve the accuracy and efficiency of orphan crop improvement. Here, the authors review transferring knowledge from major crops to orphan crops and using machine learning to improve orphan crop breeding.

Language: English

Citations

0

Large language model applications in nucleic acid research

Lei Li, Zhao Cheng

Published: Jan. 1, 2025

Language: English

Citations

0

The application progress and research trends of knowledge graphs and large language models in agriculture

Ruizi Gong, Xinxing Li

Computers and Electronics in Agriculture, Journal Year: 2025, Volume and Issue: 235, P. 110396 - 110396

Published: April 19, 2025

Language: English

Citations

0

PDLLMs: A group of tailored DNA large language models for analyzing plant genomes (Creative Commons)

Guanqing Liu, Long Chen, Yuechao Wu et al.

Molecular Plant, Journal Year: 2024, Volume and Issue: unknown

Published: Dec. 1, 2024

Language: English

Citations

2