Provenance Information for Biomedical Data and Workflows: Scoping Review
Kerstin Gierend, Frank Krüger, Sascha Genehr, et al.

Journal of Medical Internet Research, Journal Year: 2024, Volume and Issue: 26, P. e51297

Published: Aug. 23, 2024

Background: The record of the origin and history of data, known as provenance, holds importance. Provenance information leads to higher interpretability of scientific results and enables reliable collaboration and data sharing. However, the lack of comprehensive evidence on provenance approaches hinders the uptake of good practice in clinical research. Objective: This scoping review aims to identify criteria for provenance tracking in the biomedical domain. We reviewed the state-of-the-art frameworks, associated artifacts, and methodologies for provenance tracking. Methods: This scoping review followed the methodological framework developed by Arksey and O’Malley. We searched the PubMed and Web of Science databases for English-language articles published from 2006 to 2022. Title and abstract screening were carried out by 4 independent reviewers using the Rayyan tool. A majority vote was required for consent on the eligibility of papers based on the defined inclusion and exclusion criteria. Full-text reading was performed independently by 2 reviewers, and information was extracted into a pretested template for the 5 research questions. Disagreements were resolved by a domain expert. The study protocol has previously been published. Results: The search resulted in a total of 764 papers. Of 624 identified, deduplicated papers, 66 (10.6%) studies fulfilled the inclusion criteria. We identified diverse provenance-tracking approaches ranging from practical provenance processing and managing to theoretical frameworks distinguishing diverse concepts and details of data and metadata models, provenance components, and notations. A substantial majority investigated underlying requirements to varying extents and validation intensities but lacked completeness in provenance coverage. Mostly, the cited requirements concerned knowledge about data integrity and reproducibility. Moreover, these requirements revolved around robust data quality assessments, consistent policies for sensitive data protection, improved user interfaces, and automated ontology development. We found that different stakeholder groups benefit from the availability of provenance information. Thereby, we recognized that the term provenance is subjected to an evolutionary and technical process with multifaceted meanings and roles. Challenges included organizational and technical issues linked to data annotation, provenance modeling, and performance, amplified by subsequent matters such as enhanced provenance information and quality principles.
Conclusions: As data volumes grow and computing power increases, the challenge of scaling provenance systems to handle data efficiently and assist complex queries intensifies, necessitating scalable solutions. With rising legal demands, there is an urgent need for greater transparency in implementing provenance systems in research projects, despite the challenges of unresolved granularity and knowledge bottlenecks. We believe that our recommendations enable quality and guide the implementation of auditable and measurable provenance approaches as well as solutions in the daily tasks of biomedical scientists. International Registered Report Identifier (IRRID): RR2-10.2196/31750

Language: English

The Gene Ontology knowledgebase in 2023
Suzi Aleksander, James P. Balhoff, Seth Carbon, et al.

Genetics, Journal Year: 2023, Volume and Issue: 224(1)

Published: March 3, 2023

Abstract: The Gene Ontology (GO) knowledgebase (http://geneontology.org) is a comprehensive resource concerning the functions of genes and gene products (proteins and noncoding RNAs). GO annotations cover genes from organisms across the tree of life as well as viruses, though most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms. Here, we provide an updated overview of the GO knowledgebase, as well as the efforts of the broad, international consortium of scientists that develops, maintains, and updates the GO knowledgebase. The GO knowledgebase consists of three components: (1) the GO, a computational knowledge structure describing the functional characteristics of genes; (2) GO annotations, evidence-supported statements asserting that a specific gene product has a particular functional characteristic; and (3) GO Causal Activity Models (GO-CAMs), mechanistic models of molecular “pathways” (GO biological processes) created by linking multiple GO annotations using defined relations. Each of these components is continually expanded, revised, and updated in response to newly published discoveries and receives extensive QA checks, reviews, and user feedback. For each of these components, we provide a description of the current contents, recent developments to keep the knowledgebase up to date with new discoveries, and guidance on how users can best make use of the data that we provide. We conclude with future directions for the project.

Language: English

Citations

1285

Annotation of biologically relevant ligands in UniProtKB using ChEBI
Elisabeth Coudert, Sébastien Géhant, Edouard de Castro, et al.

Bioinformatics, Journal Year: 2022, Volume and Issue: 39(1)

Published: Dec. 8, 2022

Abstract. Motivation: To provide high quality, computationally tractable annotation of binding sites for biologically relevant (cognate) ligands in UniProtKB using the chemical ontology ChEBI (Chemical Entities of Biological Interest), to better support efforts to study and predict functionally relevant interactions between protein sequences and structures and small molecule ligands. Results: We structured the data model for cognate ligand binding site annotations in UniProtKB and performed a complete reannotation of all cognate ligand binding sites using stable unique identifiers from ChEBI, which we now use as the reference vocabulary for all such annotations. We developed improved search and query facilities for cognate ligands in the UniProt website, REST API and SPARQL endpoint that leverage the chemical structure data, nomenclature and classification that ChEBI provides. Availability and implementation: Binding site annotations for cognate ligands described using ChEBI are available for UniProtKB protein sequence records in several formats (text, XML and RDF) and are freely available to query and download through the UniProt website (www.uniprot.org), REST API (www.uniprot.org/help/api), SPARQL endpoint (sparql.uniprot.org/) and FTP site (https://ftp.uniprot.org/pub/databases/uniprot/). Supplementary information: Supplementary data are available at Bioinformatics online.

Language: English

Citations

195

WormBase 2024: status and transitioning to Alliance infrastructure
Paul W. Sternberg, Kimberly Van Auken, Qinghua Wang, et al.

Genetics, Journal Year: 2024, Volume and Issue: 227(1)

Published: April 4, 2024

Abstract: WormBase has been the major repository and knowledgebase of information about the genome and genetics of Caenorhabditis elegans and other nematodes of experimental interest for over 2 decades. We have 3 goals: to keep current with the fast-paced C. elegans research, to provide better integration with other resources, and to be sustainable. Here, we discuss the current state of WormBase as well as progress and plans for moving core WormBase infrastructure to the Alliance of Genome Resources (the Alliance). As an Alliance member, WormBase will continue to interact with the C. elegans community, develop new features as needed, and curate key information from the literature and large-scale projects.

Language: English

Citations

78

The Arabidopsis Information Resource in 2024
Leonore Reiser, Erica Bakker, Sabarinath Subramaniam, et al.

Genetics, Journal Year: 2024, Volume and Issue: 227(1)

Published: March 8, 2024

Since 1999, The Arabidopsis Information Resource (www.arabidopsis.org) has been curating data about the Arabidopsis thaliana genome. Its primary focus is integrating experimental gene function information from the peer-reviewed literature and codifying it as controlled vocabulary annotations. Our goal is to produce a "gold standard" functional annotation set that reflects the current state of knowledge about the Arabidopsis genome. At the same time, the resource serves as a nexus for community-based collaborations aimed at improving data quality, access, and reuse. For the past decade, our work has been made possible by subscriptions from our global user base. This update covers our ongoing biocuration work, some of our modernization efforts that contribute to the first major infrastructure overhaul since 2011, the introduction of JBrowse2, and the resource's role in community activities such as organizing the structural reannotation assessment. For gene function assessment, we used Gene Ontology annotations as a metric to evaluate: (1) what is currently known and (2) the "unknown" genes. Currently, 74% of the proteome has been annotated to at least one Gene Ontology term. Of those loci, half have experimental support for at least one of the following aspects: molecular function, biological process, or cellular component. Our work sheds light on the genes for which we have not yet identified any published experimental data and which have no functional annotation. Drawing attention to these unknown genes highlights knowledge gaps and potential sources of novel discoveries.

Language: English

Citations

74

DisProt in 2024: improving function annotation of intrinsically disordered proteins
Maria Cristina Aspromonte, María Victoria Nugnes, Federica Quaglia, et al.

Nucleic Acids Research, Journal Year: 2023, Volume and Issue: 52(D1), P. D434 - D441

Published: Oct. 30, 2023

Abstract: DisProt (URL: https://disprot.org) is the gold standard database for intrinsically disordered proteins and regions, providing valuable information about their functions. The latest version of DisProt brings significant advancements, including a broader representation of functions and an enhanced curation process. These improvements aim to increase both the quality of annotations and their coverage at the sequence level. Higher coverage has been achieved by adopting additional evidence codes. Quality of annotations has been improved by systematically applying Minimum Information About Disorder Experiments (MIADE) principles and reporting all the details of the experimental setup that could potentially influence the structural state of a protein. The DisProt database now includes new thematic datasets and has expanded the adoption of Gene Ontology terms, resulting in an extensive functional repertoire which is automatically propagated to UniProtKB. Finally, we show that DisProt's curated annotations strongly correlate with disorder predictions inferred from AlphaFold2 pLDDT (predicted Local Distance Difference Test) confidence scores. This comparison highlights the utility of DisProt in explaining the apparent uncertainty of certain well-defined predicted structures, which often correspond to folding-upon-binding fragments. Overall, DisProt serves as a comprehensive resource, combining experimental evidence of disorder information to enhance our understanding of intrinsically disordered proteins and their functional implications.

Language: English

Citations

58

Protein function prediction as approximate semantic entailment
Maxat Kulmanov, Francisco J. Guzmán‐Vega, Paula Duek, et al.

Nature Machine Intelligence, Journal Year: 2024, Volume and Issue: 6(2), P. 220 - 228

Published: Feb. 14, 2024

Abstract: The Gene Ontology (GO) is a formal, axiomatic theory with over 100,000 axioms that describe the molecular functions, biological processes and cellular locations of proteins in three subontologies. Predicting the functions of proteins using the GO requires both learning and reasoning capabilities in order to maintain consistency and exploit the background knowledge in the GO. Many methods have been developed to automatically predict protein functions, but effectively exploiting all the axioms in the GO for knowledge-enhanced learning has remained a challenge. We have developed DeepGO-SE, a method that predicts GO functions from protein sequences using a pretrained large language model. DeepGO-SE generates multiple approximate models of GO, and a neural network predicts the truth values of statements about protein functions in these approximate models. We aggregate the truth values over multiple models so that DeepGO-SE approximates semantic entailment when predicting protein functions. We show, using several benchmarks, that the approach effectively exploits background knowledge in the GO and improves protein function prediction compared to state-of-the-art methods.
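
The aggregation step mentioned in this abstract can be illustrated with a small sketch. The abstract does not say how truth values are combined, so the use of a minimum here is an assumption: it mirrors the definition of semantic entailment, under which a statement follows from a theory only if it holds in every model. The function name, the data layout, and the scores are hypothetical and are not taken from DeepGO-SE itself.

```python
from typing import Dict, List


def approximate_entailment(scores_per_model: List[Dict[str, float]]) -> Dict[str, float]:
    """Aggregate per-model truth values by taking the minimum for each statement.

    Assumption: a statement counts as (approximately) entailed only to the
    degree that it holds in the least favourable of the sampled models.
    """
    if not scores_per_model:
        raise ValueError("need at least one approximate model")
    # Only statements scored in every model can be aggregated.
    statements = set(scores_per_model[0])
    for scores in scores_per_model[1:]:
        statements &= set(scores)
    return {s: min(scores[s] for scores in scores_per_model) for s in sorted(statements)}


# Hypothetical truth values for two "protein has function X" statements
# in three approximate models (the numbers are made up for illustration).
example_scores = [
    {"GO:0003824": 0.9, "GO:0005515": 0.8},
    {"GO:0003824": 0.7, "GO:0005515": 0.2},
    {"GO:0003824": 0.8, "GO:0005515": 0.6},
]
entailed = approximate_entailment(example_scores)
```

With these made-up scores, the first statement keeps a fairly high value because every model supports it, while the second drops to its lowest per-model score because one model rejects it.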

Language: English

Citations

21

Annotation of biologically relevant ligands in UniProtKB using ChEBI
Elisabeth Coudert, Sébastien Géhant, Edouard de Castro, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2022, Volume and Issue: unknown

Published: Aug. 22, 2022

Abstract. Motivation: To provide high quality, computationally tractable annotation of binding sites for biologically relevant (cognate) ligands in UniProtKB using the chemical ontology ChEBI (Chemical Entities of Biological Interest), to better support efforts to study and predict functionally relevant interactions between proteins and small molecule ligands. Results: We structured the data model for cognate ligand binding site annotations in UniProtKB and performed a complete reannotation of all cognate ligand binding sites using stable unique identifiers from ChEBI, which we now use as the reference vocabulary for all such annotations. We developed improved search and query facilities for cognate ligands in the UniProt website, REST API and SPARQL endpoint that leverage the chemical structure data, nomenclature, and classification that ChEBI provides. Availability: Binding site annotations for cognate ligands described using ChEBI are available for UniProtKB protein sequence records in several formats (text, XML, and RDF), and are freely available to query and download through the UniProt website (www.uniprot.org), REST API (www.uniprot.org/help/api), SPARQL endpoint (sparql.uniprot.org/), and FTP site (https://ftp.uniprot.org/pub/databases/uniprot/). Contact: [email protected] Supplementary information: Table 1.

Language: English

Citations

39

RegulonDB v12.0: a comprehensive resource of transcriptional regulation in E. coli K-12
Heladia Salgado, Socorro Gama‐Castro, Paloma Lara, et al.

Nucleic Acids Research, Journal Year: 2023, Volume and Issue: 52(D1), P. D255 - D264

Published: Nov. 16, 2023

RegulonDB is a database that contains the most comprehensive corpus of knowledge of the regulation of transcription initiation of Escherichia coli K-12, including data from both classical molecular biology and high-throughput methodologies. Here, we describe biological advances since our last NAR paper of 2019. We explain the changes to satisfy FAIR requirements. We also present a full reconstruction of the RegulonDB computational infrastructure, which has significantly improved data storage, retrieval and accessibility and thus supports a more intuitive and user-friendly experience. The integration of graphical tools provides clear visual representations of genetic regulation data, facilitating data interpretation and knowledge integration. RegulonDB version 12.0 can be accessed at https://regulondb.ccg.unam.mx.

Language: English

Citations

37

An ontology-based rare disease common data model harmonising international registries, FHIR, and Phenopackets
Adam S.L. Graefe, Miriam Hübner, Filip Rehburg, et al.

Scientific Data, Journal Year: 2025, Volume and Issue: 12(1)

Published: Feb. 8, 2025

Abstract: Although rare diseases (RDs) affect over 260 million individuals worldwide, low data quality and scarcity challenge effective care and research. This work aims to harmonise the Common Data Set by the European Rare Disease Registry Infrastructure, the Health Level 7 Fast Healthcare Interoperability Resources Base Resources, and the Global Alliance for Genomics and Health Phenopacket Schema into a novel rare disease common data model (RD-CDM), laying the foundation for developing international RD-CDMs aligned with these data standards. We developed a modular-based GitHub repository and documentation to account for flexibility, extensions and further development. Recommendations on the model’s cardinalities are given, inviting further refinement and international collaboration. An ontology-based approach was selected to find a common denominator between the semantic and syntactic data standards. Our RD-CDM version 2.0.0 comprises 78 data elements, extending the ERDRI-CDS by 62 elements, with previous versions implemented in four German university hospitals capturing real world data for development and evaluation. We identified three categories for evaluation: Medical Granularity, Clinical Reasoning Relevance, Harmonisation.

Language: English

Citations

1

Decoding the Molecular Language of Proteins with Evola
Xibin Zhou, Chenchen Han, Yingqi Zhang, et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 6, 2025

Abstract: Proteins, nature’s intricate molecular machines, are the products of billions of years of evolution and play fundamental roles in sustaining life. Yet, deciphering their molecular language - that is, understanding how protein sequences and structures encode and determine biological functions - remains a cornerstone challenge in modern biology. Here, we introduce Evola, an 80 billion frontier protein-language generative model designed to decode the molecular language of proteins. By integrating information from protein sequences, structures, and user queries, Evola generates precise and contextually nuanced insights into protein function. A key innovation of Evola lies in its training on an unprecedented AI-generated dataset: 546 million protein question-answer pairs and 150 billion word tokens, designed to reflect the immense complexity and functional diversity of proteins. Post-pretraining, Evola integrates Direct Preference Optimization (DPO) to refine the model based on preference signals and Retrieval-Augmented Generation (RAG) for external knowledge incorporation, improving response quality and relevance. To evaluate its performance, we propose a novel framework, Instructional Response Space (IRS), demonstrating that Evola delivers expert-level insights, advancing research in proteomics and functional genomics while shedding light on the molecular logic encoded in proteins. The online demo is available at http://www.chat-protein.com/.

Language: English

Citations

0