Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision DOI Creative Commons
Aditya Malusare, Harish Kothandaraman, Dipesh Tamboli

et al.

arXiv (Cornell University), Journal year: 2023, Number: unknown

Published: Jan. 1, 2023

This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. ENBED uses a sub-quadratic implementation of attention to develop an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models with encoder-only or decoder-only architectures. We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it in the following downstream tasks: (1) identification of enhancers, promoters and splice sites, (2) recognition of sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes involving multiple base pairs, which lose the ability to analyze with byte-level precision, (3) identification of biological function annotations of genomic sequences, and (4) generating mutations of the Influenza virus using the encoder-decoder architecture and validating them against real-world observations. In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.

Language: English
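
The ENBED abstract above contrasts byte-level tokens with tokenization schemes that group multiple base pairs. A minimal sketch of that contrast follows, assuming one token per nucleotide byte and non-overlapping 3-mers. The function names and the toy sequence are illustrative assumptions, not the authors' tokenizer or data.

```python
from difflib import SequenceMatcher


def byte_tokens(seq: str) -> list[int]:
    """One token per nucleotide: the raw byte value of each character."""
    return list(seq.encode("ascii"))


def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Non-overlapping k-mers; a trailing fragment shorter than k is kept."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def edited_tokens(a: list, b: list) -> int:
    """Number of tokens in b that are not part of any block aligned with a."""
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return len(b) - matched


# Hypothetical toy read: a single base "A" inserted at index 6.
reference = "ACGTACGTACGT"
inserted = reference[:6] + "A" + reference[6:]

byte_changes = edited_tokens(byte_tokens(reference), byte_tokens(inserted))
kmer_changes = edited_tokens(kmer_tokens(reference), kmer_tokens(inserted))

print(byte_changes)  # 1: only the inserted base is a new token
print(kmer_changes)  # 3: the insertion shifts the frame of every later 3-mer
```

In this toy case the insertion touches a single byte-level token, while every 3-mer after the insertion point becomes a different vocabulary item. That is one way to read the abstract's claim that multi-base tokenization loses byte-level precision.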

The Farm Animal Genotype–Tissue Expression (FarmGTEx) Project DOI
Lingzhao Fang, Jinyan Teng, Qing Lin

et al.

Nature Genetics, Journal year: 2025, Number: unknown

Published: March 17, 2025

Language: English

Cited by

0

DNA sequence analysis landscape: a comprehensive review of DNA sequence analysis task types, databases, datasets, word embedding methods, and language models DOI Creative Commons
Muhammad Nabeel Asim, Muhammad Ali Ibrahim, Alam Zaib

et al.

Frontiers in Medicine, Journal year: 2025, Number: 12

Published: April 8, 2025

Deoxyribonucleic acid (DNA) serves as the fundamental genetic blueprint that governs the development, functioning, growth, and reproduction of all living organisms. DNA can be altered through germline and somatic mutations. Germline mutations underlie hereditary conditions, while somatic mutations can be induced by various factors including environmental influences, chemicals, lifestyle choices, and errors in DNA replication and repair mechanisms, which can lead to cancer. DNA sequence analysis plays a pivotal role in uncovering the intricate information embedded within an organism's DNA and in understanding the factors that can modify it. This analysis helps in the early detection of genetic diseases and the design of targeted therapies. Traditional wet-lab experimental DNA sequence analysis is costly, time-consuming, and prone to errors. To accelerate large-scale DNA sequence analysis, researchers are developing AI applications that complement wet-lab experimental methods. These AI approaches can help generate hypotheses, prioritize experiments, and interpret results by identifying patterns in large genomic datasets. Effective integration of AI methods with experimental validation requires scientists to understand both fields. Considering the need for a comprehensive literature that bridges the gap between both fields, the contributions of this paper are manifold: It presents a diverse range of DNA sequence analysis tasks and AI methodologies. It equips AI researchers with essential biological knowledge of 44 distinct DNA sequence analysis tasks and aligns these tasks with 3 distinct AI-paradigms, namely, classification, regression, and clustering. It streamlines the integration of AI into DNA sequence analysis by consolidating 36 diverse biological databases that can be used to develop benchmark datasets for different DNA sequence analysis tasks. To ensure performance comparisons between new and existing AI predictors, it provides insights into 140 benchmark datasets related to these tasks. It presents word embeddings and language models applications across DNA sequence analysis tasks. It streamlines the development of new predictors by providing a survey of 39 word embeddings and 67 language models based predictive pipeline performance values, as well as top performing traditional sequence encoding-based predictors and their performances.

Language: English

Cited by

0

Artificial intelligence and machine learning applications for cultured meat DOI Creative Commons

Michael E. Todhunter, Sheikh Jubair, Ruchika Verma

et al.

Frontiers in Artificial Intelligence, Journal year: 2024, Number: 7

Published: Sep. 24, 2024

Cultured meat has the potential to provide a complementary meat industry with reduced environmental, ethical, and health impacts. However, major technological challenges remain which require time- and resource-intensive research and development efforts. Machine learning has the potential to accelerate cultured meat technology by streamlining experiments, predicting optimal results, and reducing experimentation time and resources. However, the use of machine learning in cultured meat is in its infancy. This review covers the work available to date on the use of machine learning in cultured meat and explores future possibilities. We address four major areas of cultured meat research and development: establishing cell lines, cell culture media design, microscopy and image analysis, and bioprocessing and food processing optimization. In addition, we have included a survey of datasets relevant to CM research. This review aims to provide the foundation necessary for both cultured meat and machine learning scientists to identify research opportunities at the intersection between cultured meat and machine learning.

Language: English

Cited by

3

Enhancing recognition and interpretation of functional phenotypic sequences through fine-tuning pre-trained genomic models DOI Creative Commons

Duo Du, Fan Zhong, Lei Liu

et al.

Journal of Translational Medicine, Journal year: 2024, Number: 22(1)

Published: Aug. 12, 2024

Decoding human genomic sequences requires comprehensive analysis of DNA sequence functionality. Through computational and experimental approaches, researchers have studied the genotype-phenotype relationship and generated important datasets that help unravel complicated genetic blueprints. Thus, recently developed artificial intelligence methods can be used to interpret the functions of those DNA sequences. This study explores the use of deep learning, particularly pre-trained genomic models like DNA_bert_6 and human_gpt2-v1, in interpreting and representing human genome sequences. Initially, we meticulously constructed multiple datasets linking genotypes and phenotypes to fine-tune those models for precise DNA sequence classification. Additionally, we evaluate the influence of sequence length on classification results and analyze the impact of feature extraction in the hidden layers of our model using the HERV dataset. To enhance our understanding of phenotype-specific patterns recognized by the model, we perform enrichment, pathogenicity and conservation analyses of specific motifs in the human endogenous retrovirus (HERV) sequence with high average local representation weight (ALRW) scores. We have constructed multiple genotype-phenotype datasets displaying commendable classification performance in comparison with random genomic sequences, particularly in the HERV dataset, which achieved binary and multi-classification accuracies and F1 values exceeding 0.935 and 0.888, respectively. Notably, the fine-tuning of the HERV dataset not only improved our ability to identify and distinguish diverse information types within DNA sequences but also successfully identified specific motifs associated with neurological disorders and cancers in regions with high ALRW scores. Subsequent analysis of these motifs shed light on the adaptive responses of species to environmental pressures and their co-evolution with pathogens. These findings highlight the potential of pre-trained genomic models in learning DNA sequence representations, particularly when utilizing the HERV dataset, and provide valuable insights for future research endeavors. This study represents an innovative strategy that combines pre-trained genomic model representations with classical methods for analyzing the functionality of genome sequences, thereby promoting cross-fertilization between genomics and artificial intelligence.

Language: English

Cited by

1

A Benchmark and Chain-of-Thought Prompting Strategy for Large Multimodal Models with Multiple Image Inputs DOI
Daoan Zhang, Junming Yang, Hanjia Lyu

et al.

Lecture Notes in Computer Science, Journal year: 2024, Number: unknown, pp. 226-241

Published: Dec. 2, 2024

Language: English

Cited by

1

Identification, characterization, and design of plant genome sequences using deep learning DOI Open Access

Zhenye Wang, Hao Yuan, Jianbing Yan

et al.

The Plant Journal, Journal year: 2024, Number: unknown

Published: Dec. 12, 2024

SUMMARY Due to its excellent performance in processing large amounts of data and capturing complex non-linear relationships, deep learning has been widely applied in many fields of plant biology. Here we first review the application of deep learning in analyzing genome sequences to predict gene expression, chromatin interactions, and epigenetic features (open chromatin, transcription factor binding sites, and methylation sites) in plants. Then, current motif mining and functional component design and synthesis based on generative adversarial networks, large models, and attention mechanisms are elaborated in detail. The progress of protein structure and function prediction and of genomic model applications is also discussed. Finally, this work provides prospects for the future development of deep learning in plants with regard to multiple omics data, algorithm optimization, large language models, sequence design, and intelligent breeding.

Language: English

Cited by

1

BioLLMBench: A Comprehensive Benchmarking of Large Language Models in Bioinformatics DOI Creative Commons
Varuni Sarwal, Viorel Munteanu, Timur Suhodolschi

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal year: 2023, Number: unknown

Published: Dec. 20, 2023

Abstract Large Language Models (LLMs) have shown great promise in their knowledge integration and problem-solving capabilities, but their ability to assist in bioinformatics research has not been systematically evaluated. To bridge this gap, we present BioLLMBench, a novel benchmarking framework coupled with a scoring metric scheme for comprehensively evaluating LLMs in solving bioinformatics tasks. Through BioLLMBench, we conducted a thorough evaluation of 2,160 experimental runs of the three most widely used models, GPT-4, Bard and LLaMA, focusing on 36 distinct tasks within the field of bioinformatics. The tasks come from six key areas of emphasis within bioinformatics that directly relate to the daily challenges and tasks faced by individuals within the field. These areas are domain expertise, mathematical problem-solving, coding proficiency, data visualization, summarizing research papers, and developing machine learning models. The tasks also span across varying levels of complexity, ranging from fundamental concepts to expert-level challenges. Each key area was evaluated using seven specifically designed task metrics, which were then used to conduct an overall evaluation of the LLM's response. To enhance our understanding of model responses under varying conditions, we implemented a Contextual Response Variability Analysis. Our results reveal a diverse spectrum of model performance, with GPT-4 leading in all tasks except mathematical problem solving. GPT-4 was able to achieve an overall proficiency score of 91.3% in domain knowledge tasks, while Bard excelled in mathematical problem-solving with a 97.5% success rate. While GPT-4 outperformed in machine learning model development tasks with an average accuracy of 65.32%, both Bard and LLaMA were unable to generate executable end-to-end code. All models faced considerable challenges in research paper summarization, with none of them exceeding a 40% Recall-Oriented Understudy for Gisting Evaluation (ROUGE) score, highlighting a significant area for future improvement. We observed an increase in model performance variance when using a new chatting window compared to using the same chat, although the average scores between the two contextual environments remained similar. Lastly, we discuss various limitations of these models and acknowledge the risks associated with their potential misuse.

Language: English

Cited by

3
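
The BioLLMBench abstract above reports summarization quality as a ROUGE score without naming the variant. As a rough illustration only, the sketch below computes unigram recall (ROUGE-1 recall) with whitespace tokenization. The choice of variant, the function name, and the example sentences are assumptions, not the benchmark's actual scoring code.

```python
from collections import Counter


def rouge1_recall(reference: str, candidate: str) -> float:
    """Fraction of reference unigrams that also occur in the candidate.

    Counts are clipped, so a word repeated in the candidate cannot be
    credited more often than it appears in the reference.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum(min(n, cand_counts[tok]) for tok, n in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


# Hypothetical reference summary and model output.
reference_summary = "the model predicts splice sites in dna"
model_summary = "the model finds splice sites"

score = rouge1_recall(reference_summary, model_summary)
print(f"{score:.3f}")  # 4 of 7 reference words recovered
```

A score below 0.4 on a metric of this kind means the generated summary recovers fewer than 40% of the reference's words, which is the scale on which the abstract's summarization result is stated.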

A Sparse and Wide Neural Network Model for DNA Sequences DOI

Tong Yu, Lei Cheng, Ruslan Khalitov

et al.

Published: Jan. 1, 2024

Accurate modeling of DNA sequences requires capturing distant semantic relationships between the nucleotide acid bases. Most existing deep neural network models face two challenges: 1) they are limited to short DNA fragments and cannot capture long-range interactions, and 2) they require many supervised labels, which is often expensive in practice. We propose a new neural network model called SwanDNA to address the above challenges. By using a sparse and wide neural network architecture, our model enables inferences over very long DNA sequences. By incorporating SwanDNA into a self-supervised learning framework, our method can give accurate predictions while requiring fewer labels. We evaluate SwanDNA on three DNA sequence inference tasks: human variant effect, open chromatin regions detection in plant genes, and GenomicBenchmarks. SwanDNA outperforms all competitors in the first two tasks and achieves state-of-the-art results in seven of eight datasets in GenomicBenchmarks. Our code is available at https://github.com/wiedersehne/SwanDNA.

Language: English

Cited by

0

Self-Distillation Improves DNA Sequence Inference DOI

Tong Yu, Lei Cheng, Ruslan Khalitov

et al.

Published: Jan. 1, 2024

Language: English

Cited by

0

Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook (Preprint) DOI
Rawan AlSaad, Alaa Abd‐Alrazaq, Sabri Boughorbel

et al.

Published: April 13, 2024

UNSTRUCTURED In the complex and multidimensional field of medicine, multimodal data are prevalent and crucial for informed clinical decisions. Multimodal data span a broad spectrum of data types, including medical images (eg, MRI and CT scans), time-series data (eg, sensor data from wearable devices and electronic health records), audio recordings (eg, heart and respiratory sounds and patient interviews), text (eg, clinical notes and research articles), videos (eg, surgical procedures), and omics data (eg, genomics and proteomics). While advancements in large language models (LLMs) have enabled new applications for knowledge retrieval and processing in the medical field, most LLMs remain limited to processing unimodal data, typically text-based content, and often overlook the importance of integrating the diverse data modalities encountered in clinical practice. This paper aims to present a detailed, practical, and solution-oriented perspective on the use of multimodal LLMs (M-LLMs) in the medical field. Our investigation spanned M-LLM foundational principles, current and potential applications, technical and ethical challenges, and future research directions. By connecting these elements, we aimed to provide a comprehensive framework that links diverse aspects of M-LLMs, offering a unified vision for their future in health care. This approach aims to guide both future research and practical implementations of M-LLMs in health care, positioning them as a paradigm shift toward integrated, multimodal data-driven medical practice. We anticipate that this work will spark further discussion and inspire the development of innovative approaches in the next generation of medical M-LLM systems.

Language: English

Cited by

0