FuncFetch: An LLM-assisted workflow enables mining thousands of enzyme-substrate interactions from published manuscripts DOI Creative Commons
Nathaniel Smith, Xinyu Yuan, Chesney Melissinos, et al.

Bioinformatics, Journal Year: 2024, Volume and Issue: unknown

Published: Dec. 20, 2024

Abstract. Motivation: Thousands of genomes are publicly available; however, most genes in those genomes have poorly defined functions. This is partly due to a gap between previously published, experimentally characterized protein activities and the activities deposited in databases. This activity deposition is bottlenecked by the time-consuming biocuration process. The emergence of large language models (LLMs) presents an opportunity to speed up the text-mining of protein activities for biocuration. Results: We developed FuncFetch, a workflow that integrates NCBI E-Utilities, OpenAI's GPT-4 and Zotero, to screen thousands of manuscripts and extract enzyme activities. Extensive validation revealed high precision and recall of GPT-4 in determining whether the abstract of a given paper indicates the presence of a characterized enzyme activity in that paper. Provided the manuscript, FuncFetch extracted data such as species information, enzyme names, sequence identifiers, substrates and products, which were subjected to extensive quality analyses. Comparison of this workflow against a manually curated dataset of BAHD acyltransferase activities demonstrated a precision/recall of 0.86/0.64 in extracting substrates. We further deployed FuncFetch on nine plant enzyme families. Screening 26,543 papers, FuncFetch retrieved 32,605 entries from 5,459 selected papers. We also identified multiple extraction errors, including incorrect associations, non-target enzymes and hallucinations, which highlight the need for manual curation. The BAHD activities were verified, resulting in a comprehensive functional fingerprint of this family and revealing that ∼70% of the experimentally characterized enzymes are uncurated in the public domain. FuncFetch represents an advance in biocuration and lays the groundwork for predicting the functions of uncharacterized enzymes. Availability and Implementation: Code and minimally curated activities are available at https://github.com/moghelab/funcfetch and https://tools.moghelab.org/funczymedb. Supplementary information is available at Bioinformatics online.

Language: English

UniProt: the Universal Protein Knowledgebase in 2025 DOI Creative Commons
Alex Bateman, María Martin, Sandra Orchard, et al.

Nucleic Acids Research, Journal Year: 2024, Volume and Issue: 53(D1), P. D609 - D617

Published: Nov. 18, 2024

The aim of the UniProt Knowledgebase (UniProtKB; https://www.uniprot.org/) is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication, we describe ongoing changes to our production pipeline to limit the sequences available in UniProtKB to high-quality, non-redundant reference proteomes. We continue to manually curate the scientific literature to add the latest functional data, and we use machine learning techniques. We also encourage community curation to ensure key publications are not missed. We provide an update on the automatic annotation methods used by UniProtKB to predict information for unreviewed entries describing unstudied proteins. Finally, updates to the UniProt website are described, including a new tab linking protein to genomic information. In recognition of its value to the scientific community, the UniProt database has been awarded Global Core Biodata Resource status.

Language: English

Citations

142

FuncFetch: An LLM-assisted workflow enables mining thousands of enzyme-substrate interactions from published manuscripts DOI Open Access
Nathaniel Smith, Xinyu Yuan, Chesney Melissinos, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown

Published: July 23, 2024

Abstract. Motivation: Thousands of genomes are publicly available; however, most genes in those genomes have poorly defined functions. This is partly due to a gap between previously published, experimentally characterized protein activities and the activities deposited in databases. This activity deposition is bottlenecked by the time-consuming biocuration process. The emergence of large language models (LLMs) presents an opportunity to speed up the text-mining of protein activities for biocuration. Results: We developed FuncFetch, a workflow that integrates NCBI E-Utilities, OpenAI's GPT-4 and Zotero, to screen thousands of manuscripts and extract enzyme activities. Extensive validation revealed high precision and recall of GPT-4 in determining whether the abstract of a given paper indicates the presence of a characterized enzyme activity in that paper. Provided the manuscript, FuncFetch extracted data such as species information, enzyme names, sequence identifiers, substrates and products, which were subjected to extensive quality analyses. Comparison of this workflow against a manually curated dataset of BAHD acyltransferase activities demonstrated a precision/recall of 0.86/0.64 in extracting substrates. We further deployed FuncFetch on nine plant enzyme families. Screening 27,120 papers, FuncFetch retrieved 32,605 entries from 5,547 selected papers. We also identified multiple extraction errors, including incorrect associations, non-target enzymes and hallucinations, which highlight the need for manual curation. The BAHD activities were verified, resulting in a comprehensive functional fingerprint of this family and revealing that ∼70% of the experimentally characterized enzymes are uncurated in the public domain. FuncFetch represents an advance in biocuration and lays the groundwork for predicting the functions of uncharacterized enzymes. Availability and Implementation: Code and minimally curated activities are available at https://github.com/moghelab/funcfetch and https://tools.moghelab.org/funczymedb.

Language: English

Citations

3
