FuncFetch: An LLM-assisted workflow enables mining thousands of enzyme-substrate interactions from published manuscripts DOI Creative Commons
Nathaniel Smith, Xinyu Yuan, Chesney Melissinos

et al.

Bioinformatics, Year: 2024, Issue: unknown

Published: Dec. 20, 2024

Abstract. Motivation: Thousands of genomes are publicly available; however, most genes in those genomes have poorly defined functions. This is partly due to a gap between previously published, experimentally characterized protein activities and the activities deposited in databases. This activity deposition is bottlenecked by the time-consuming biocuration process. The emergence of large language models (LLMs) presents an opportunity to speed up text-mining for biocuration. Results: We developed FuncFetch, a workflow that integrates NCBI E-Utilities, OpenAI’s GPT-4, and Zotero to screen thousands of manuscripts and extract enzyme activities. Extensive validation revealed high precision and recall in determining whether the abstract of a given paper indicates the presence of a characterized enzyme activity in that paper. Provided the manuscript, FuncFetch extracted data such as species information, enzyme names, sequence identifiers, substrates, and products, which were subjected to extensive quality analyses. Comparison of this workflow against a manually curated dataset of BAHD acyltransferase activities demonstrated a precision/recall of 0.86/0.64 in extracting substrates. We further deployed FuncFetch on nine plant enzyme families. Screening 26,543 papers, FuncFetch retrieved 32,605 entries from 5,459 selected papers. We also identified multiple extraction errors, including incorrect associations, non-target enzymes, and hallucinations, which highlight the need for manual curation. The BAHD activities were verified, resulting in a comprehensive functional fingerprint of this family and revealing that ∼70% of the experimentally characterized enzymes are uncurated in the public domain. FuncFetch represents an advance in biocuration and lays the groundwork for predicting the functions of uncharacterized enzymes. Availability and Implementation: Code and minimally curated activities are available at https://github.com/moghelab/funcfetch and https://tools.moghelab.org/funczymedb. Supplementary information is available at Bioinformatics online.

Language: English

UniProt: the Universal Protein Knowledgebase in 2025 DOI Creative Commons
Alex Bateman, María Martin, Sandra Orchard

et al.

Nucleic Acids Research, Year: 2024, Issue: 53(D1), pp. D609-D617

Published: Nov. 18, 2024

The aim of the UniProt Knowledgebase (UniProtKB; https://www.uniprot.org/) is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication, we describe ongoing changes to our production pipeline to limit the sequences available in UniProtKB to high-quality, non-redundant reference proteomes. We continue to manually curate the scientific literature to add the latest functional data, and we use machine learning techniques. We also encourage community curation to ensure key publications are not missed. We provide an update on the automatic annotation methods used by UniProtKB to predict information for unreviewed entries describing unstudied proteins. Finally, updates to the UniProt website are described, including a new tab linking protein to genomic information. In recognition of its value to the scientific community, the UniProt database has been awarded Global Core Biodata Resource status.

Language: English

Cited by

142

FuncFetch: An LLM-assisted workflow enables mining thousands of enzyme-substrate interactions from published manuscripts DOI Open Access
Nathaniel Smith, Xinyu Yuan, Chesney Melissinos

et al.

bioRxiv (Cold Spring Harbor Laboratory), Year: 2024, Issue: unknown

Published: July 23, 2024

Abstract. Motivation: Thousands of genomes are publicly available; however, most genes in those genomes have poorly defined functions. This is partly due to a gap between previously published, experimentally characterized protein activities and the activities deposited in databases. This activity deposition is bottlenecked by the time-consuming biocuration process. The emergence of large language models (LLMs) presents an opportunity to speed up text-mining for biocuration. Results: We developed FuncFetch, a workflow that integrates NCBI E-Utilities, OpenAI’s GPT-4, and Zotero to screen thousands of manuscripts and extract enzyme activities. Extensive validation revealed high precision and recall in determining whether the abstract of a given paper indicates the presence of a characterized enzyme activity in that paper. Provided the manuscript, FuncFetch extracted data such as species information, enzyme names, sequence identifiers, substrates, and products, which were subjected to extensive quality analyses. Comparison of this workflow against a manually curated dataset of BAHD acyltransferase activities demonstrated a precision/recall of 0.86/0.64 in extracting substrates. We further deployed FuncFetch on nine plant enzyme families. Screening 27,120 papers, FuncFetch retrieved 32,605 entries from 5,547 selected papers. We also identified multiple extraction errors, including incorrect associations, non-target enzymes, and hallucinations, which highlight the need for manual curation. The BAHD activities were verified, resulting in a comprehensive functional fingerprint of this family and revealing that ∼70% of the experimentally characterized enzymes are uncurated in the public domain. FuncFetch represents an advance in biocuration and lays the groundwork for predicting the functions of uncharacterized enzymes. Availability and Implementation: Code and minimally curated activities are available at https://github.com/moghelab/funcfetch and https://tools.moghelab.org/funczymedb

Language: English

Cited by

3
