How did we get there? AI applications to biological networks and sequences DOI Creative Commons
Marco Anteghini, Francesco Gualdi, Baldo Oliva

et al.

Computers in Biology and Medicine, Journal Year: 2025, Volume and Issue: 190, P. 110064 - 110064

Published: April 5, 2025

The rapidly advancing field of artificial intelligence (AI) has transformed numerous scientific domains, including biology, where a vast and complex volume of data is available for analysis. This paper provides a comprehensive overview of the current state of AI-driven methodologies in genomics, proteomics, and systems biology. We discuss how machine learning algorithms, particularly deep learning models, have enhanced the accuracy and efficiency of embedding sequences, motif discovery, and the prediction of gene expression and protein structure. Additionally, we explore the integration of AI in the analysis of biological networks, including protein-protein interaction networks and multi-layered networks. By leveraging large-scale biological data, AI techniques have enabled unprecedented insights into biological processes and disease mechanisms. This work underlines the potential of applying AI to biology, highlighting current applications and suggesting directions for future research to further explore this rapidly evolving field.

Language: English

Large language models in plant biology DOI
Hilbert Yuen In Lam, Xing Er Ong, Marek Mutwil

et al.

Trends in Plant Science, Journal Year: 2024, Volume and Issue: 29(10), P. 1145 - 1155

Published: May 26, 2024

Language: English

Citations: 16

How to build the virtual cell with artificial intelligence: Priorities and opportunities DOI Creative Commons
Charlotte Bunne, Yusuf Roohani, Yanay Rosen

et al.

Cell, Journal Year: 2024, Volume and Issue: 187(25), P. 7045 - 7063

Published: Dec. 1, 2024

Cells are essential to understanding health and disease, yet traditional models fall short of modeling and simulating their function and behavior. Advances in AI and omics offer groundbreaking opportunities to create an AI virtual cell (AIVC), a multi-scale, multi-modal large-neural-network-based model that can represent and simulate the behavior of molecules, cells, and tissues across diverse states. This Perspective provides a vision on their design and how collaborative efforts to build AIVCs will transform biological research by allowing high-fidelity simulations, accelerating discoveries, and guiding experimental studies, offering new opportunities for understanding cellular functions and fostering interdisciplinary collaborations in open science.

Language: English

Citations: 16

A DNA language model based on multispecies alignment predicts the effects of genome-wide variants DOI
Gonzalo Benegas, Carlos Albors, Alan J. Aw

et al.

Nature Biotechnology, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 2, 2025

Language: English

Citations: 6

Genomic language models: opportunities and challenges DOI
Gonzalo Benegas, Chengzhong Ye, Carlos Albors

et al.

Trends in Genetics, Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 1, 2025

Language: English

Citations: 5

GENA-LM: a family of open-source foundational DNA language models for long sequences DOI Creative Commons
Veniamin Fishman, Yuri Kuratov, Aleksei Shmelev

et al.

Nucleic Acids Research, Journal Year: 2025, Volume and Issue: 53(2)

Published: Jan. 11, 2025

Recent advancements in genomics, propelled by artificial intelligence, have unlocked unprecedented capabilities in interpreting genomic sequences, mitigating the need for exhaustive experimental analysis of complex, intertwined molecular processes inherent in DNA function. A significant challenge, however, resides in accurately decoding genomic sequences, which inherently involves comprehending rich contextual information dispersed across thousands of nucleotides. To address this need, we introduce GENA language model (GENA-LM), a suite of transformer-based foundational DNA language models capable of handling input lengths up to 36 000 base pairs. Notably, integrating the newly developed recurrent memory mechanism allows these models to process even larger DNA segments. We provide pre-trained versions of GENA-LM, including multispecies and taxon-specific models, demonstrating their capability for fine-tuning and addressing a spectrum of complex biological tasks with modest computational demands. While language models have already achieved significant breakthroughs in protein biology, GENA-LM showcases a similarly promising potential for reshaping the landscape of genomics and multi-omics data analysis. All models are publicly available on GitHub (https://github.com/AIRI-Institute/GENA_LM) and on HuggingFace (https://huggingface.co/AIRI-Institute). In addition, we provide a web service (https://dnalm.airi.net/) allowing user-friendly DNA annotation with GENA-LM models.

Language: English

Citations: 3

Recent advances in deep learning and language models for studying the microbiome DOI Creative Commons
Binghao Yan, Yunbi Nam, Lingyao Li

et al.

Frontiers in Genetics, Journal Year: 2025, Volume and Issue: 15

Published: Jan. 7, 2025

Recent advancements in deep learning, particularly large language models (LLMs), have made a significant impact on how researchers study microbiome and metagenomics data. Microbial protein and genomic sequences, like natural languages, form a language of life, enabling the adoption of LLMs to extract useful insights from complex microbial ecologies. In this paper, we review applications of deep learning and language models in analyzing microbiome and metagenomics data. We focus on problem formulations, necessary datasets, and the integration of language modeling techniques. We provide an extensive overview of protein/genomic language models and their contributions to microbiome studies. We also discuss applications such as novel viromics language modeling, biosynthetic gene cluster prediction, and knowledge integration for metagenomics studies.

Language: English

Citations: 2

Evaluating the representational power of pre-trained DNA language models for regulatory genomics DOI Creative Commons
Ziqi Tang, Nikunj V. Somia, Yiyang Yu

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: March 4, 2024

The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.

Language: English

Citations: 12

Toward a comprehensive profiling of alternative splicing proteoform structures, interactions and functions DOI Creative Commons
Élodie Laine, María I. Freiberger

Current Opinion in Structural Biology, Journal Year: 2025, Volume and Issue: 90, P. 102979 - 102979

Published: Jan. 7, 2025

The mRNA splicing machinery has been estimated to generate 100,000 known protein-coding transcripts for 20,000 human genes (Ensembl, Sept. 2024). However, this set is expanding with the massive and rapidly growing data coming from high-throughput technologies, particularly single-cell and long-read sequencing. Yet, the implications of this complexity at the protein level remain largely uncharted. In this review, we describe current advances toward systematically assessing the contribution of alternative splicing to proteome and function diversification. We discuss the potential and challenges of using artificial intelligence-based techniques in identifying alternative splicing proteoforms and characterising their structures, interactions, and functions.

Language: English

Citations: 1

DNALONGBENCH: A Benchmark Suite for Long-Range DNA Prediction Tasks DOI Creative Commons

W. Cheng, Zhenqiao Song, Yang Zhang

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 8, 2025

Modeling long-range DNA dependencies is crucial for understanding genome structure and function across a wide range of biological contexts. However, effectively capturing these extensive dependencies, which may span millions of base pairs in tasks such as three-dimensional (3D) chromatin folding prediction, remains a significant challenge. Furthermore, a comprehensive benchmark suite for evaluating tasks that rely on long-range dependencies is notably absent. To address this gap, we introduce DNALongBench, a benchmark dataset encompassing five important genomics tasks that consider long-range dependencies up to 1 million base pairs: enhancer-target gene interaction, expression quantitative trait loci, 3D genome organization, regulatory sequence activity, and transcription initiation signals. To comprehensively assess DNALongBench, we evaluate the performance of five methods: a task-specific expert model, a convolutional neural network (CNN)-based model, and three fine-tuned DNA foundation models (HyenaDNA, Caduceus-Ph, and Caduceus-PS). We envision DNALongBench as a standardized resource with the potential to facilitate comprehensive comparisons and rigorous evaluations of emerging DNA sequence-based deep learning models that account for long-range dependencies.

Language: English

Citations: 1

Application of machine learning and genomics for orphan crop improvement DOI Creative Commons
Tessa R. MacNish, Monica F. Danilevicz, Philipp E. Bayer

et al.

Nature Communications, Journal Year: 2025, Volume and Issue: 16(1)

Published: Jan. 24, 2025

Orphan crops are important sources of nutrition in developing regions and many are tolerant to biotic and abiotic stressors; however, modern crop improvement technologies have not been widely applied to orphan crops due to the lack of resources available. There are orphan crop representatives across major crop types, and the conservation of genes between these related species can be used in crop improvement. Machine learning (ML) has emerged as a promising tool for crop improvement. Transferring knowledge from major crops to orphan crops and using machine learning to improve accuracy and efficiency can be used to improve orphan crops. Here, the authors review transferring knowledge from major crops and the use of machine learning for orphan crop breeding.

Language: English

Citations: 1