Biomedical data and AI
Haojie Xu, Shibo Zhou, Zefeng Zhu, et al.

Science China Life Sciences, Journal Year: 2025, Volume and Issue: unknown

Published: March 14, 2025

Human Genome Book: Words, Sentences and Paragraphs
Liang Wang

Published: Feb. 19, 2025

Since the completion of the human genome sequencing project in 2001, significant progress has been made in areas such as gene regulation, editing, and protein structure prediction. However, given the vast amount of genomic data, the segments that can be fully annotated and understood remain relatively limited. If we consider the genome as a book, constructing its equivalents of words, sentences, and paragraphs has been a long-standing and popular research direction. Recently, studies on transfer learning in large language models have provided a novel approach to this challenge. Multilingual transfer ability, which assesses how well models fine-tuned on a source language can be applied to other languages, has been extensively studied in multilingual pre-trained models. Similarly, the transfer of natural language capabilities to “DNA language” has also been validated. Building upon these findings, we first trained a foundational model capable of transferring linguistic capabilities from English to DNA sequences. Using this model, we constructed a vocabulary of DNA words and mapped the DNA words to their English equivalents. Subsequently, we fine-tuned this model using English datasets for paragraphing and sentence segmentation to develop models capable of segmenting DNA sequences into sentences and paragraphs. Leveraging these models, we processed the GRCh38.p14 human genome by segmenting, tokenizing, and organizing it into a “book” comprised of genomic “words,” “sentences,” and “paragraphs.” Additionally, based on the DNA-to-English vocabulary mapping, we created an “English version” of the genomic book. This study offers a novel perspective for understanding the genome and provides exciting possibilities for developing innovative tools for DNA search, generation, and analysis.

Language: English

Citations

0

Small, Open-Source Text-Embedding Models as Substitutes to OpenAI Models for Gene Analysis
Dailin Gan, Jun Li

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 20, 2025

While foundation transformer-based models developed for gene expression data analysis can be costly to train and operate, a recent approach known as GenePT offers a low-cost and highly efficient alternative. GenePT utilizes OpenAI’s text-embedding function to encode background information, which is in textual form, about genes. However, the closed-source, online nature of OpenAI’s text-embedding service raises concerns regarding data privacy, among other issues. In this paper, we explore the possibility of replacing OpenAI’s models with open-source transformer-based text-embedding models. We identified ten models from Hugging Face that are small in size, easy to install, and light in computation. Across all four gene classification tasks we considered, some of these models have outperformed OpenAI’s, demonstrating their potential as viable, or even superior, alternatives. Additionally, we find that fine-tuning these models often does not lead to significant improvements in performance.

Language: English

Citations

0
