The gene function prediction challenge: Large language models and knowledge graphs to the rescue

Rohan Shawn Sunil, Shan Chun Lim, Manoj Itharajula, et al.

Current Opinion in Plant Biology, Year: 2024, Volume: 82, pp. 102665-102665

Published: Nov. 22, 2024

Language: English

Nucleotide Transformer: building and evaluating robust foundation models for human genomics (Creative Commons)

Hugo Dalla-Torre, Liam Gonzalez, Javier Mendoza‐Revilla, et al.

Nature Methods, Year: 2024, Volume: unknown

Published: Nov. 28, 2024

The prediction of molecular phenotypes from DNA sequences remains a longstanding challenge in genomics, often driven by limited annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named Nucleotide Transformer, ranging from 50 million up to 2.5 billion parameters and integrating information from 3,202 human genomes and 850 genomes from diverse species. These transformer models yield context-specific representations of nucleotide sequences, which allow for accurate predictions even in low-data settings. We show that the developed models can be fine-tuned at low cost to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements and can be used to improve the prioritization of genetic variants. The training and application of foundational models in genomics provides a widely applicable approach for accurate molecular phenotype prediction from DNA sequence. Nucleotide Transformer is a series of genomics foundation models with different parameter sizes and training datasets that can be applied to various downstream tasks by fine-tuning.

Language: English

Cited by: 36

Genomic language models: opportunities and challenges

Gonzalo Benegas, Chengzhong Ye, Carlos Albors, et al.

Trends in Genetics, Year: 2025, Volume: unknown

Published: Jan. 1, 2025

Language: English

Cited by: 4

Evaluating the representational power of pre-trained DNA language models for regulatory genomics (Creative Commons)

Ziqi Tang, Nikunj V. Somia, Yiyang Yu, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Volume: unknown

Published: March 4, 2024

The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.

Language: English

Cited by: 12

Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model (Creative Commons)

Jingjing Zhai, Aaron Gokaslan, Yair Schiff, et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Volume: unknown

Published: June 5, 2024

Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and offer cross-species prediction better than supervised models through fine-tuning on limited labeled data. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a curated dataset of 16 Angiosperm genomes. Fine-tuning PlantCaduceus on limited labeled Arabidopsis data for four tasks, including predicting translation initiation/termination sites and splice donor and acceptor sites, demonstrated high transferability to maize, which diverged 160 million years ago, outperforming the best existing DNA LM by 1.45 to 7.23-fold. PlantCaduceus is competitive with state-of-the-art protein LMs in terms of deleterious mutation identification, and is threefold better than PhyloP. Additionally, PlantCaduceus successfully identifies well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.

Language: English

Cited by: 6

Confronting The Data Deluge: How Artificial Intelligence Can Be Used in the Study of Plant Stress (Creative Commons)

Eugene Koh, Rohan Shawn Sunil, Hilbert Yuen In Lam, et al.

Computational and Structural Biotechnology Journal, Year: 2024, Volume: 23, pp. 3454-3466

Published: Sep. 17, 2024

Language: English

Cited by: 5

Artificial intelligence-driven plant bio-genomics research: a new era

Yang Lin, Hao Wang, Meiling Zou, et al.

Tropical Plants, Year: 2025, Volume: 4(1), pp. 0-0

Published: Jan. 1, 2025

Language: English

Cited by: 0

Application of machine learning and genomics for orphan crop improvement (Creative Commons)

Tessa R. MacNish, Monica F. Danilevicz, Philipp E. Bayer, et al.

Nature Communications, Year: 2025, Volume: 16(1)

Published: Jan. 24, 2025

Orphan crops are important sources of nutrition in developing regions and many are tolerant to biotic and abiotic stressors; however, modern crop improvement technologies have not been widely applied to orphan crops due to the lack of resources available. There are orphan crop representatives across major crop types, and the conservation of genes between these related species can be used in crop improvement. Machine learning (ML) has emerged as a promising tool for crop improvement. Transferring knowledge from major crops to orphan crops and using machine learning to improve accuracy and efficiency can be used to improve orphan crops. Here, the authors review transferring knowledge from major crops and the use of machine learning for orphan crop breeding.

Language: English

Cited by: 0

Large language model applications in nucleic acid research

Lei Li, Zhao Cheng

Published: Jan. 1, 2025

Language: English

Cited by: 0

The application progress and research trends of knowledge graphs and large language models in agriculture

Ruizi Gong, Xinxing Li

Computers and Electronics in Agriculture, Year: 2025, Volume: 235, pp. 110396-110396

Published: April 19, 2025

Language: English

Cited by: 0

PDLLMs: A group of tailored DNA large language models for analyzing plant genomes (Creative Commons)

Guanqing Liu, Long Chen, Yuechao Wu, et al.

Molecular Plant, Year: 2024, Volume: unknown

Published: Dec. 1, 2024

Language: English

Cited by: 2