BiRNA-BERT allows efficient RNA language modeling with adaptive tokenization DOI Creative Commons
Md Toki Tahmid, Haz Sameen Shahgir, Sazan Mahbub

et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Volume: unknown

Published: July 4, 2024

Abstract: Recent advancements in Transformer-based models have spurred interest in their use for biological sequence analysis. However, adapting models like BERT is challenging due to sequence length, often requiring truncation for proteomics and genomics tasks. Additionally, advanced tokenization and relative positional encoding techniques for long contexts in NLP are not directly transferable to DNA/RNA sequences, which require nucleotide or character-level encodings for tasks such as 3D torsion angle prediction. To tackle these challenges, we propose an adaptive dual tokenization scheme for bioinformatics that utilizes both nucleotide-level (NUC) and efficient BPE tokenizations. Building on the dual tokenization, we introduce BiRNA-BERT, a 117M parameter Transformer encoder pretrained with our proposed tokenization on 28 billion nucleotides across 36 million coding and non-coding RNA sequences. The learned representation by BiRNA-BERT generalizes across a range of applications and achieves state-of-the-art results in long-sequence downstream tasks and performance comparable to 6× larger models in short-sequence tasks with 27× less pre-training compute. BiRNA-BERT can dynamically adjust its tokenization strategy based on sequence lengths, utilizing NUC for shorter sequences and switching to BPE for longer ones, thereby offering, for the first time, the capability to efficiently handle arbitrarily long sequences.

Language: English

Pre-trained Language Models in Biomedical Domain: A Systematic Survey DOI Open Access
Benyou Wang, Qianqian Xie, Jiahuan Pei

et al.

ACM Computing Surveys, Year: 2023, Volume: 56(3), pp. 1-52

Published: Aug. 1, 2023

Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing tasks. This also benefits the biomedical domain: researchers from informatics, medicine, and computer science communities propose various PLMs trained on biomedical datasets, e.g., biomedical text, electronic health records, protein, and DNA sequences. However, the cross-discipline characteristics of biomedical PLMs hinder their spreading among communities; some existing works are isolated from each other without comprehensive comparison and discussions. It is nontrivial to make a survey that not only systematically reviews recent advances in biomedical PLMs and their applications but also standardizes terminology and benchmarks. This article summarizes the recent progress of pre-trained language models in the biomedical domain and their applications in downstream biomedical tasks. Particularly, we discuss the motivations of PLMs in the biomedical domain and introduce the key concepts of pre-trained language models. We then propose a taxonomy of existing biomedical PLMs that categorizes them from various perspectives systematically. Plus, their applications in biomedical downstream tasks are exhaustively discussed, respectively. Last, we illustrate various limitations and future trends, which aims to provide inspiration for future research.

Language: English

Cited by: 85

Accelerating the integration of ChatGPT and other large‐scale AI models into biomedical research and healthcare DOI Creative Commons

Ding-Qiao Wang, Long-Yu Feng, Jinguo Ye

et al.

MedComm – Future Medicine, Year: 2023, Volume: 2(2)

Published: May 17, 2023

Abstract: Large-scale artificial intelligence (AI) models such as ChatGPT have the potential to improve performance on many benchmarks and real-world tasks. However, it is difficult to develop and maintain these models because of their complexity and resource requirements. As a result, they are still inaccessible to healthcare industries and clinicians. This situation might soon be changed because of advancements in graphics processing unit (GPU) programming and parallel computing. More importantly, leveraging existing large-scale AIs such as GPT-4 and Med-PaLM and integrating them into multiagent models (e.g., Visual-ChatGPT) will facilitate real-world implementations. This review aims to raise awareness of the potential applications of these models in healthcare. We provide a general overview of several advanced large-scale AI models, including language models, vision-language models, graph learning models, language-conditioned multiagent models, and multimodal embodied models. We discuss their potential medical applications in addition to the challenges and future directions. Importantly, we stress the need to align these models with human values and goals, such as using reinforcement learning from human feedback, to ensure that they provide accurate and personalized insights that support human decision-making and improve healthcare outcomes.

Language: English

Cited by: 80

Advances in AI for Protein Structure Prediction: Implications for Cancer Drug Discovery and Development DOI Creative Commons
Xinru Qiu, H. Li, Greg Ver Steeg

et al.

Biomolecules, Year: 2024, Volume: 14(3), p. 339

Published: March 12, 2024

Recent advancements in AI-driven technologies, particularly in protein structure prediction, are significantly reshaping the landscape of drug discovery and development. This review focuses on the question of how these technological breakthroughs, exemplified by AlphaFold2, are revolutionizing our understanding of the protein structure and function changes underlying cancer and improving our approaches to counter them. By enhancing the precision and speed at which drug targets are identified and drug candidates can be designed and optimized, these technologies are streamlining the entire drug development process. We explore the use of AlphaFold2 in cancer drug development, scrutinizing its efficacy, limitations, and potential challenges. We also compare AlphaFold2 with other algorithms like ESMFold, explaining the diverse methodologies employed in this field and the practical effects of these differences for the application of specific algorithms. Additionally, we discuss the broader applications of these technologies, including the prediction of protein complex structures and the generative AI-driven design of novel proteins.

Language: English

Cited by: 22

Multimodal Large Language Models in Healthcare: Applications, Challenges, and Future Outlook (Preprint) DOI Creative Commons
Rawan AlSaad, Alaa Abd‐Alrazaq, Sabri Boughorbel

et al.

Journal of Medical Internet Research, Year: 2024, Volume: 26, p. e59505

Published: Aug. 20, 2024

In the complex and multidimensional field of medicine, multimodal data are prevalent and crucial for informed clinical decisions. Multimodal data span a broad spectrum of data types, including medical images (eg, MRI and CT scans), time-series data (eg, sensor data from wearable devices and electronic health records), audio recordings (eg, heart and respiratory sounds and patient interviews), text (eg, clinical notes and research articles), videos (eg, surgical procedures), and omics data (eg, genomics and proteomics). While advancements in large language models (LLMs) have enabled new applications for knowledge retrieval and processing in the medical field, most LLMs remain limited to processing unimodal data, typically text-based content, and often overlook the importance of integrating the diverse data modalities encountered in clinical practice. This paper aims to present a detailed, practical, and solution-oriented perspective on the use of multimodal LLMs (M-LLMs) in the medical field. Our investigation spanned M-LLM foundational principles, current and potential applications, technical and ethical challenges, and future research directions. By connecting these elements, we aimed to provide a comprehensive framework that links diverse aspects of M-LLMs, offering a unified vision for their future in health care. This approach aims to guide both future research and practical implementations of M-LLMs in health care, positioning them as a paradigm shift toward integrated, multimodal data-driven medical practice. We anticipate that this work will spark further discussion and inspire the development of innovative approaches in the next generation of medical M-LLM systems.

Language: English

Cited by: 20

Representation learning applications in biological sequence analysis DOI Creative Commons
Hitoshi Iuchi, Taro Matsutani, Keisuke Yamada

et al.

Computational and Structural Biotechnology Journal, Year: 2021, Volume: 19, pp. 3198-3208

Published: Jan. 1, 2021

Although remarkable advances have been reported in high-throughput sequencing, the ability to aptly analyze a substantial amount of rapidly generated biological (DNA/RNA/protein) sequencing data remains a critical hurdle. To tackle this issue, the application of natural language processing (NLP) to biological sequence analysis has received increased attention. In this method, biological sequences are regarded as sentences, while the single nucleic acids/amino acids or k-mers in these sequences represent words. Embedding is an essential step in NLP, which performs the conversion of these words into vectors. Specifically, representation learning is an approach used for this transformation process, which can be applied to biological sequences. Vectorized biological sequences can then be applied for function and structure estimation, or as input for other probabilistic models. Considering the importance and growing trend of applying representation learning to biological research, in the present study, we reviewed the existing knowledge in representation learning for biological sequence analysis.

Language: English

Cited by: 70

Recent trends in RNA informatics: a review of machine learning and deep learning for RNA secondary structure prediction and RNA drug discovery DOI Creative Commons
Kengo Sato, Michiaki Hamada

Briefings in Bioinformatics, Year: 2023, Volume: 24(4)

Published: May 25, 2023

Computational analysis of RNA sequences constitutes a crucial step in the field of RNA biology. As in other domains of the life sciences, the incorporation of artificial intelligence and machine learning techniques into RNA sequence analysis has gained significant traction in recent years. Historically, thermodynamics-based methods were widely employed for the prediction of RNA secondary structures; however, machine learning-based approaches have demonstrated remarkable advancements in recent years, enabling more accurate predictions. Consequently, the precision of sequence analysis pertaining to RNA secondary structures, such as RNA-protein interactions, has also been enhanced, making a substantial contribution to the field of RNA biology. Additionally, artificial intelligence and machine learning are introducing technical innovations in the analysis of RNA-small molecule interactions for RNA-targeted drug discovery and in the design of RNA aptamers, where RNA serves as its own ligand. This review will highlight recent trends in the prediction of RNA secondary structure, RNA aptamers and RNA drug discovery using machine learning, deep learning and related technologies, and will discuss potential future avenues in the field of RNA informatics.

Language: English

Cited by: 39

GenoM7GNet: An Efficient N7-Methylguanosine Site Prediction Approach Based on a Nucleotide Language Model DOI
Chuang Li, Heshi Wang, Yanhua Wen

et al.

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Year: 2024, Volume: 21(6), pp. 2258-2268

Published: Sep. 20, 2024

N

Language: English

Cited by: 12

Large language models in bioinformatics: applications and perspectives DOI Creative Commons
Jiajia Liu, Mengyuan Yang, Yankai Yu

et al.

arXiv (Cornell University), Year: 2024, Volume: unknown

Published: Jan. 1, 2024

Large language models (LLMs) are a class of artificial intelligence models based on deep learning, which have great performance in various tasks, especially in natural language processing (NLP). Large language models typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled input using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we will present a summary of the prominent large language models used in natural language processing, such as BERT and GPT, and focus on exploring the applications of large language models at different omics levels in bioinformatics, mainly including genomics, transcriptomics, proteomics, drug discovery and single cell analysis. Finally, this review summarizes the potential and prospects of large language models in solving bioinformatic problems.

Language: English

Cited by: 8

Explainable artificial intelligence for omics data: a systematic mapping study DOI Creative Commons
Philipp A Toussaint, Florian Leiser, Scott Thiebes

et al.

Briefings in Bioinformatics, Year: 2023, Volume: 25(1)

Published: Nov. 22, 2023

Abstract: Researchers increasingly turn to explainable artificial intelligence (XAI) to analyze omics data and gain insights into the underlying biological processes. Yet, given the interdisciplinary nature of the field, many findings have only been shared in their respective research community. An overview of XAI for omics data is needed to highlight promising approaches and help detect common issues. Toward this end, we conducted a systematic mapping study. To identify relevant literature, we queried Scopus, PubMed, Web of Science, BioRxiv, MedRxiv and arXiv. Based on keywording, we developed a coding scheme with 10 facets regarding the studies' AI methods, explainability methods and omics data. Our mapping study resulted in 405 included papers published between 2010 and 2023. The inspected papers analyze DNA-based (mostly genomic), transcriptomic, proteomic or metabolomic data by means of neural networks, tree-based methods, statistical methods and further AI methods. The preferred post-hoc explainability methods are feature relevance (n = 166) and visual explanation (n = 52), while papers using interpretable approaches often resort to the use of transparent models (n = 83) or architecture modifications (n = 72). With many research gaps still apparent for XAI for omics data, we deduced eight research directions and discuss their potential for the field. We also provide exemplary research questions for each direction. Many problems with the adoption of XAI for omics data in clinical practice are yet to be resolved. This systematic mapping study outlines extant research on the topic and provides research directions for researchers and practitioners.

Language: English

Cited by: 17

Prediction of protein-ATP binding residues using multi-view feature learning via contextual-based co-attention network DOI
Jia‐shun Wu, Yan Liu, Fang Ge

et al.

Computers in Biology and Medicine, Year: 2024, Volume: 172, p. 108227

Published: March 4, 2024

Language: English

Cited by: 7