Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision
Aditya Malusare, Harish Kothandaraman, Dipesh Tamboli

et al.

arXiv (Cornell University), Journal year: 2023, Issue: unknown

Published: Jan. 1, 2023

This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, which analyzes DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it to the following downstream tasks: (1) identification of enhancers, promoters, and splice sites; (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze sequences with byte-level precision; (3) identification of biological function annotations of genomic sequences; and (4) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvements compared with existing state-of-the-art results.
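The byte-level masked-language-modeling pre-training described above can be illustrated with a toy sketch. The mask symbol, masking rate, and function names below are illustrative assumptions, not the ENBED implementation:

```python
import random

MASK = "?"  # placeholder mask token for this sketch

def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Mask individual nucleotides (bytes) for MLM-style pre-training.

    Returns the corrupted sequence and the list of (position, original
    base) targets the model would learn to reconstruct. Single-character
    tokenization keeps byte-level precision, unlike k-mer schemes that
    merge several base pairs into one token.
    """
    rng = rng or random.Random(0)
    corrupted, targets = [], []
    for i, base in enumerate(seq):
        if rng.random() < mask_rate:
            corrupted.append(MASK)
            targets.append((i, base))
        else:
            corrupted.append(base)
    return "".join(corrupted), targets

# mask_rate is set high here so the toy example visibly masks several bases;
# MLM pre-training typically uses a much lower rate such as 0.15
corrupted, targets = mask_sequence("ACGTACGTACGTACGT", mask_rate=0.5)
```

Because every masked position keeps its original base in `targets`, the pre-training objective reduces to predicting those bases back from the surrounding context.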

Language: English

Multimodal Large Language Models in Healthcare: Applications, Challenges, and Future Outlook (Preprint)
Rawan AlSaad, Alaa Abd‐Alrazaq, Sabri Boughorbel

et al.

Journal of Medical Internet Research, Journal year: 2024, Issue 26, pp. e59505 - e59505

Published: Aug. 20, 2024

In the complex and multidimensional field of medicine, multimodal data are prevalent and crucial for informed clinical decisions. Multimodal data span a broad spectrum of types, including medical images (eg, MRI and CT scans), time-series data (eg, sensor data from wearable devices and electronic health records), audio recordings (eg, heart and respiratory sounds and patient interviews), text (eg, clinical notes and research articles), videos (eg, of surgical procedures), and omics data (eg, genomics and proteomics). While advancements in large language models (LLMs) have enabled new applications for knowledge retrieval and processing in the medical field, most LLMs remain limited to unimodal data, typically text-based content, and often overlook the importance of integrating the diverse data modalities encountered in clinical practice. This paper aims to present a detailed, practical, and solution-oriented perspective on the use of multimodal LLMs (M-LLMs) in the medical field. Our investigation spanned M-LLM foundational principles, current and potential applications, technical and ethical challenges, and future research directions. By connecting these elements, we aimed to provide a comprehensive framework that links the diverse aspects of M-LLMs, offering a unified vision for their future in health care. Our approach aims to guide both future research and practical implementations of M-LLMs in health care, positioning them as a paradigm shift toward integrated, multimodal data-driven medicine. We anticipate that this work will spark further discussion and inspire the development of innovative approaches in the next generation of medical M-LLM systems.

Language: English

Cited by

20

Genomic language models: opportunities and challenges
Gonzalo Benegas, Chengzhong Ye, Carlos Albors

et al.

Trends in Genetics, Journal year: 2025, Issue: unknown

Published: Jan. 1, 2025

Language: English

Cited by

4

Evaluating the representational power of pre-trained DNA language models for regulatory genomics
Ziqi Tang, Nikunj V. Somia, Yiyang Yu

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown

Published: Mar. 4, 2024

ABSTRACT The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
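The comparison this study makes, a linear probe on frozen gLM embeddings versus a conventional one-hot-encoded baseline, can be sketched minimally. The function names and mean-pooling choice below are illustrative assumptions, not the paper's code:

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """One-hot encode a DNA sequence into an (L, 4) matrix: the
    conventional baseline representation gLM embeddings are
    compared against."""
    idx = {b: i for i, b in enumerate(ALPHABET)}
    mat = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq):
        mat[pos, idx[base]] = 1.0
    return mat

def probe_features(embeddings):
    """Mean-pool frozen per-position embeddings of shape (L, d) into a
    single feature vector for a linear probe; the pre-trained gLM
    weights stay fixed, unlike in fine-tuning."""
    return embeddings.mean(axis=0)

x = one_hot("ACGT")  # here a 4x4 identity: each base activates one channel
```

Either representation would then feed the same downstream linear model, so any performance gap isolates what the pre-trained embeddings add over raw sequence.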

Language: English

Cited by

12

Big data and deep learning for RNA biology
Hyeonseo Hwang, Hyeonseong Jeon, Nagyeong Yeo

et al.

Experimental & Molecular Medicine, Journal year: 2024, Issue 56(6), pp. 1293 - 1321

Published: Jun. 14, 2024

Abstract The exponential growth of big data in RNA biology (RB) has led to the development of deep learning (DL) models that have driven crucial discoveries. As constantly evidenced by DL studies in other fields, the successful implementation of DL in RB depends heavily on the effective utilization of large-scale datasets from public databases. In achieving this goal, data encoding methods, learning algorithms, and techniques that align well with biological domain knowledge have played pivotal roles. In this review, we provide guiding principles for applying these concepts to various problems in RB by demonstrating examples and associated methodologies. We also discuss the remaining challenges in developing DL models for RB and suggest strategies to overcome these challenges. Overall, this review aims to illuminate the compelling potential of DL for RB and ways to apply this powerful technology to investigate the intriguing biology of RNA more effectively.

Language: English

Cited by

10

Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model
Jingjing Zhai, Aaron Gokaslan, Yair Schiff

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown

Published: Jun. 5, 2024

Interpreting function and fitness effects in diverse plant genomes requires transferable models. Language models (LMs) pre-trained on large-scale biological sequences can learn evolutionary conservation and offer cross-species prediction better than supervised models through fine-tuning with limited labeled data. We introduce PlantCaduceus, a plant DNA LM based on the Caduceus and Mamba architectures, pre-trained on a curated dataset of 16 Angiosperm genomes. Fine-tuning PlantCaduceus on limited Arabidopsis labeled data for four tasks, including predicting translation initiation/termination sites and splice donor and acceptor sites, demonstrated high transferability to maize, diverged by 160 million years, outperforming the best existing DNA LMs by 1.45- to 7.23-fold. PlantCaduceus is competitive with state-of-the-art protein LMs in terms of deleterious mutation identification, and is threefold better than PhyloP. Additionally, PlantCaduceus successfully identifies well-known causal variants in both Arabidopsis and maize. Overall, PlantCaduceus is a versatile DNA LM that can accelerate plant genomics and crop breeding applications.

Language: English

Cited by

6

Human Genome Book: Words, Sentences and Paragraphs
Liang Wang

Published: Feb. 19, 2025

Since the completion of the human genome sequencing project in 2001, significant progress has been made in areas such as gene regulation and editing and protein structure prediction. However, given the vast amount of genomic data, the segments that can be fully annotated and understood remain relatively limited. If we consider the genome as a book, constructing its equivalents of words, sentences, and paragraphs has been a long-standing and popular research direction. Recently, studies on transfer learning in large language models have provided a novel approach to this challenge. Multilingual transferability, which assesses how well models fine-tuned in a source language can be applied to other languages, has been extensively studied for multilingual pre-trained models. Similarly, the transfer of natural language capabilities to the "DNA language" has also been validated. Building upon these findings, we first trained a foundational model capable of transferring linguistic knowledge from English to DNA sequences. Using this model, we constructed a vocabulary of DNA words and mapped DNA words to their English equivalents. Subsequently, we used English datasets for paragraphing and sentence segmentation to develop models capable of segmenting DNA sequences into sentences and paragraphs. Leveraging these models, we processed the GRCh38.p14 human genome by segmenting, tokenizing, and organizing it into a "book" comprised of "words," "sentences," and "paragraphs." Additionally, based on the DNA-to-English vocabulary mapping, we created an "English version" of the book. This study offers a novel perspective on understanding the genome and provides exciting possibilities for developing innovative tools for DNA search, generation, and analysis.
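The idea of segmenting a genome into "words" with an English mapping can be illustrated with a toy sketch. The vocabulary, its English glosses, and the greedy longest-match rule below are entirely hypothetical, standing in for the paper's learned models:

```python
# Hypothetical DNA vocabulary with made-up English glosses; single bases
# are included as a fallback so segmentation never gets stuck.
DNA_VOCAB = {
    "ACG": "the", "TTA": "cat", "GGC": "sat",
    "A": "a", "C": "c", "G": "g", "T": "t",
}

def segment(seq, vocab, max_len=3):
    """Greedily match the longest vocabulary entry at each position,
    producing a list of DNA 'words'."""
    words, i = [], 0
    while i < len(seq):
        for size in range(min(max_len, len(seq) - i), 0, -1):
            piece = seq[i:i + size]
            if piece in vocab:
                words.append(piece)
                i += size
                break
    return words

words = segment("ACGTTAGGC", DNA_VOCAB)           # ["ACG", "TTA", "GGC"]
english = " ".join(DNA_VOCAB[w] for w in words)   # "the cat sat"
```

A real vocabulary would come from the trained transfer model rather than a hand-written dictionary, but the word-lookup and "English version" steps compose the same way.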

Language: English

Cited by

0

Beyond digital twins: the role of foundation models in enhancing the interpretability of multiomics modalities in precision medicine
Sakhaa B. Alsaedi, Xin Gao, Takashi Gojobori

et al.

FEBS Open Bio, Journal year: 2025, Issue: unknown

Published: Feb. 24, 2025

Medical digital twins (MDTs) are virtual representations of patients that simulate the biological, physiological, and clinical processes of individuals to enable personalized medicine. With the increasing complexity of omics data, particularly multiomics, there is a growing need for advanced computational frameworks to interpret these data effectively. Foundation models (FMs), large-scale machine learning models pretrained on diverse data types, have recently emerged as powerful tools for improving interpretability and decision-making in precision medicine. This review discusses the integration of FMs into MDT systems and their role in enhancing the interpretability of multiomics data. We examine current challenges, recent advancements, and future opportunities in leveraging FMs for multiomics analysis in MDTs, with a focus on their application.

Language: English

Cited by

0

A Review on the Applications of Transformer-based language models for Nucleotide Sequence Analysis
Nimisha Ghosh, Daniele Santoni, Indrajit Saha

et al.

Computational and Structural Biotechnology Journal, Journal year: 2025, Issue: unknown

Published: Mar. 1, 2025

Language: English

Cited by

0

METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

Ollie Liu, Sami Jaghour, Johannes Hagemann

et al.

Published: Jan. 12, 2025

We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a _metagenomic foundation model_, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats. Website: metagene.ai [https://metagene.ai/] Model Weights: huggingface.co/metagene-ai [https://huggingface.co/metagene-ai] Code Repository: github.com/metagene-ai [https://github.com/metagene-ai]
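The byte-pair encoding step mentioned above can be illustrated with a toy merge loop over single-base tokens. This is a generic BPE sketch under the assumption of frequency-greedy merges, not METAGENE-1's actual tokenizer:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most frequent one,
    as in a single BPE merge step."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def bpe_merge(tokens, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(seq, num_merges):
    """Learn `num_merges` merges over a nucleotide sequence, starting
    from single-base tokens: a toy analogue of fitting a BPE
    vocabulary to genomic data."""
    tokens, merges = list(seq), []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        merges.append(pair)
        tokens = bpe_merge(tokens, pair)
    return tokens, merges

tokens, merges = train_bpe("ACGACGACGTTT", num_merges=2)
```

With only four base symbols, frequent motifs quickly collapse into multi-base tokens, which is why a BPE vocabulary fitted to sequencing reads compresses them far below one token per base.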

Language: English

Cited by

0

Large language model applications in nucleic acid research
Lei Li, Zhao Cheng

Published: Jan. 1, 2025

Language: English

Cited by

0