
Database, Journal year: 2025, Number 2025
Published: Jan. 1, 2025
Curated resources at centralized repositories provide high-value service to users by enhancing data veracity. Curation, however, comes with a cost, as it requires dedicated time and effort from personnel with deep domain knowledge. In this paper, we investigate the performance of a large language model (LLM), specifically generative pre-trained transformer (GPT)-3.5 and GPT-4, in extracting and presenting data against a human curator. In order to accomplish this task, we used a small set of journal articles on wheat and barley genetics, focusing on traits, such as salinity tolerance and disease resistance, which are becoming more important. The 36 papers were then curated by a professional curator for the GrainGenes database (https://wheat.pw.usda.gov). In parallel, we developed a GPT-based retrieval-augmented generation question-answering system and compared how GPT performed in answering questions about traits and quantitative trait loci (QTLs). Our findings show that, on average, GPT-4 correctly categorized manuscripts 97% of the time, correctly extracted 80% of traits, and extracted 61% of marker-trait associations. Furthermore, we assessed the ability of a GPT-based DataFrame agent to filter and summarize curated wheat genetics data, showing the potential of human and computational curators working side-by-side. In one case study, our agent was able to retrieve up to 91% of disease-related, human-curated QTLs across the whole genome, and up to 96% across a specific genomic region through prompt engineering. Also, we observed that across most tasks, GPT-4 consistently outperformed GPT-3.5 while generating fewer hallucinations, suggesting that improvements in LLM models will make artificial intelligence a much more accurate companion for curators in extracting information from scientific literature. Despite their limitations, LLMs demonstrated a potential to extract and present information to curators and users of biological databases, as long as users are aware of potential inaccuracies and the possibility of incomplete extraction.
Language: English