Provenance Information for Biomedical Data and Workflows: Scoping Review DOI Creative Commons
Kerstin Gierend, Frank Krüger, Sascha Genehr

et al.

Journal of Medical Internet Research, Journal Year: 2024, Volume and Issue: 26, P. e51297 - e51297

Published: Aug. 23, 2024

Background: The record of the origin and history of data, known as provenance, holds importance. Provenance information leads to higher interpretability of scientific results and enables reliable collaboration and data sharing. However, the lack of comprehensive evidence on provenance approaches hinders the uptake of good practice in clinical research. Objective: This scoping review aims to identify criteria for provenance tracking in the biomedical domain. We reviewed the state-of-the-art frameworks, associated artifacts, and methodologies for provenance tracking. Methods: This scoping review followed the methodological framework developed by Arksey and O’Malley. We searched the PubMed and Web of Science databases for English-language articles published from 2006 to 2022. Title and abstract screening were carried out by 4 independent reviewers using the Rayyan tool. A majority vote was required for consent on the eligibility of papers based on the defined inclusion and exclusion criteria. Full-text reading was performed independently by 2 reviewers, and information was extracted into a pretested template for the 5 research questions. Disagreements were resolved by a domain expert. The study protocol has previously been published. Results: The search resulted in a total of 764 papers. Of the 624 identified, deduplicated papers, 66 (10.6%) studies fulfilled the inclusion criteria. We identified diverse provenance-tracking approaches ranging from practical provenance processing and managing to theoretical frameworks distinguishing concepts and details of data and metadata models, provenance components, and notations. A substantial majority investigated underlying requirements to varying extents and validation intensities but lacked completeness in provenance coverage. Mostly, the cited requirements concerned knowledge about data integrity and reproducibility. Moreover, these revolved around robust data quality assessments, consistent policies for sensitive data protection, improved user interfaces, and automated ontology development. We found that different stakeholder groups benefit from the availability of provenance information. Thereby, we recognized that the term provenance is subjected to an evolutionary and technical process with multifaceted meanings and roles. Challenges included organizational and technical issues linked to data annotation, provenance modeling, and performance, amplified by subsequent matters such as enhanced provenance information and quality principles.
Conclusions: As data volumes grow and computing power increases, the challenge of scaling provenance systems to handle data efficiently and assist complex queries intensifies, necessitating scalable solutions. With rising legal demands, there is an urgent need for greater transparency in implementing provenance systems in clinical research projects, despite the challenges of unresolved granularity and knowledge bottlenecks. We believe that our recommendations enable quality and guide the implementation of auditable and measurable provenance approaches as well as solutions in the daily tasks of biomedical scientists. International Registered Report Identifier (IRRID): RR2-10.2196/31750

Language: English

Characterization and automated classification of sentences in the biomedical literature: a case study for biocuration of gene expression and protein kinase activity DOI Creative Commons
Daniela Raciti, Kimberly Van Auken, Valerio Arnaboldi

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown

Published: Jan. 8, 2025

Biological knowledgebases are essential resources for biomedical researchers, providing ready access to gene function and genomic data. Professional, manual curation of knowledgebases, however, is labor-intensive, and thus high-performing machine learning (ML) methods that improve biocuration efficiency are needed. Here we report on sentence-level classification to identify biocuration-relevant sentences in the full text of published references for two gene function data types: gene expression and protein kinase activity. We performed a detailed characterization of sentences from references in the WormBase bibliography and used this characterization to define three tasks for classifying sentences as either 1) fully curatable, 2) partially curatable, or 3) all language-related. We evaluated various ML models applied to these tasks and found that GPT and BioBERT achieve the highest average performance, resulting in F1 performance scores ranging from 0.89 to 0.99 depending upon the task. Our findings demonstrate the feasibility of extracting biocuration-relevant sentences from full text. Integrating these models into professional biocuration workflows, such as those used by the Alliance of Genome Resources and the ACKnowledge community curation platform, might well facilitate efficient and accurate annotation of the biomedical literature.

Language: English

Citations

0

A compendium of human gene functions derived from evolutionary modelling DOI Creative Commons
Marc Feuermann, Huaiyu Mi, Pascale Gaudet

et al.

Nature, Journal Year: 2025, Volume and Issue: unknown

Published: Feb. 26, 2025

Abstract: A comprehensive, computable representation of the functional repertoire of all macromolecules encoded within the human genome is a foundational resource for biology and biomedical research. The Gene Ontology Consortium has been working towards this goal by generating a structured body of information about gene functions, which now includes experimental findings reported in more than 175,000 publications for human genes and genes in experimentally tractable model organisms 1,2. Here, we describe the results of a large, international effort to integrate all of these findings to create a representation of human gene functions that is as complete and accurate as possible. Specifically, we apply an expert-curated, explicit evolutionary modelling approach to all human protein-coding genes. This approach integrates the available experimental information across families of related genes into models that reconstruct the gain and loss of functional characteristics over evolutionary time. The resulting set of 68,667 integrated gene functions covers approximately 82% of human protein-coding genes. It reveals a marked preponderance of molecular regulatory functions, and the models provide insights into the evolutionary origins of human gene functions. We show that our set of function descriptions can improve the widely used genomic technique of Gene Ontology enrichment analysis. The experimental evidence for each functional characteristic is recorded, thereby enabling the scientific community to help review and improve the resource, which we have made publicly available.

Language: English

Citations

0

Minimum information guidelines for experiments structurally characterizing intrinsically disordered protein regions DOI Open Access
Bálint Mészáros, András Hatos, Nicolás Palópoli

et al.

Nature Methods, Journal Year: 2023, Volume and Issue: 20(9), P. 1291 - 1303

Published: July 3, 2023

Language: English

Citations

10

The Arabidopsis Information Resource in 2024 DOI Creative Commons
Leonore Reiser, Erica Bakker, Sabarinath Subramaniam

et al.

bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown

Published: Nov. 7, 2023

Abstract: Since 1999, The Arabidopsis Information Resource (www.arabidopsis.org) has been curating data about the Arabidopsis thaliana genome. Its primary focus is integrating experimental gene function information from the peer-reviewed literature and codifying it as controlled vocabulary annotations. Our goal is to produce a ‘gold standard’ functional annotation set that reflects the current state of knowledge about the Arabidopsis genome. At the same time, the resource serves as a nexus for community-based collaborations aimed at improving data quality, access and reuse. For the past decade, our work has been made possible by subscriptions from our global user base. This update covers our ongoing biocuration work, some of our modernization efforts that contribute to the first major infrastructure overhaul since 2011, the introduction of JBrowse2, and the resource’s role in community activities such as organizing the structural reannotation of the genome. For gene function assessment, we used Gene Ontology annotations as a metric to evaluate: (1) what is currently known about Arabidopsis gene function, and (2) the set of ‘unknown’ genes. Currently, 74% of the proteome has been annotated to at least one Gene Ontology term. Of those loci, half have experimental support for at least one of the following aspects: molecular function, biological process, or cellular component. Our work sheds light on the genes for which we have not yet identified any published experimental data and which have no functional annotation. Drawing attention to these unknown genes highlights knowledge gaps and potential sources of novel discoveries. Article Summary: The Arabidopsis Information Resource (TAIR, www.arabidopsis.org) is a comprehensive website about Arabidopsis thaliana, a small plant that’s very easy to grow and analyze in the laboratory, used to understand how many other plants function. We share our progress in data collection and organization, tool improvement, and involvement in community projects.

Language: English

Citations

9

EnzChemRED, a rich enzyme chemistry relation extraction dataset DOI Creative Commons
Po‐Ting Lai, Elisabeth Coudert, Lucila Aimo

et al.

Scientific Data, Journal Year: 2024, Volume and Issue: 11(1)

Published: Sept. 9, 2024

Language: English

Citations

3

The Immune Epitope Database (IEDB): 2024 update DOI Creative Commons
Randi Vita, Nina Blazeska, Daniel Marrama

et al.

Nucleic Acids Research, Journal Year: 2024, Volume and Issue: 53(D1), P. D436 - D443

Published: Nov. 18, 2024

Abstract: Over the past 20 years, the Immune Epitope Database (IEDB, iedb.org) has established itself as the foremost resource for immune epitope data. The IEDB catalogs published epitopes and their contextual experimental data in a freely searchable public resource. The IEDB team manually curates data from the literature into a structured format, and the resource spans infectious, allergic, autoimmune, and transplant diseases. Here, we describe the enhancements made since our 2018 paper, capturing user-directed updates to the search interface, advanced data exports, increases in data quality, and improved interoperability across related resources. As we look forward to the next 20 years, we are confident in our ability to meet the needs of our users and to contribute to the broader field of data standardization.

Language: English

Citations

3

Toward FAIR Representations of Microbial Interactions DOI Creative Commons
Alan R. Pacheco, Charlie Pauvert, Dileep Kishore

et al.

mSystems, Journal Year: 2022, Volume and Issue: 7(5)

Published: Aug. 25, 2022

Despite an ever-growing number of data sets that catalog and characterize interactions between microbes in different environments and conditions, many of these data are neither easily accessible nor intercompatible. These limitations present a major challenge to microbiome research by hindering the streamlined drawing of inferences across studies. Here, we propose guiding principles to make microbial interaction data more findable, accessible, interoperable, and reusable (FAIR). We outline specific use cases for interaction data that span the diverse space of microbiome research, and discuss the untapped potential for new insights that can be fulfilled through broader integration of microbial interaction data. These include, among others, the design of intercompatible synthetic communities for environmental, industrial, or medical applications, and the inference of novel interactions from disparate studies. Lastly, we envision potential trajectories for the deployment of FAIR microbial interaction data based on existing resources, reporting standards, and current momentum within the community.

Language: English

Citations

14

Phenopacket-tools: Building and validating GA4GH Phenopackets DOI Creative Commons
Daniel Daniš, Julius O.B. Jacobsen, Alex H. Wagner

et al.

PLoS ONE, Journal Year: 2023, Volume and Issue: 18(5), P. e0285433 - e0285433

Published: May 17, 2023

The Global Alliance for Genomics and Health (GA4GH) is a standards-setting organization that is developing a suite of coordinated standards for genomics. The GA4GH Phenopacket Schema is a standard for sharing disease and phenotype information that characterizes an individual person or biosample. The Phenopacket Schema is flexible and can represent clinical data for any kind of human disease, including rare disease, complex disease, and cancer. It also allows consortia or databases to apply additional constraints to ensure uniform data collection for specific goals. We present phenopacket-tools, an open-source Java library and command-line application for construction, conversion, and validation of phenopackets. Phenopacket-tools simplifies construction of phenopackets by providing concise builders, programmatic shortcuts, and predefined building blocks (ontology classes) for concepts such as anatomical organs, age of onset, biospecimen type, and clinical modifiers. Phenopacket-tools can be used to validate the syntax and semantics of phenopackets as well as to assess adherence to additional user-defined requirements. The documentation includes examples showing how to use the Java library and the command-line tool to create and validate phenopackets. We demonstrate how to create, convert, and validate phenopackets using the library or the command-line application. Source code, API documentation, a comprehensive user guide, and a tutorial can be found at https://github.com/phenopackets/phenopacket-tools. The library can be installed from the public Maven Central artifact repository, and the application is available as a standalone archive. The phenopacket-tools library helps developers implement and standardize the collection and exchange of phenotypic and other clinical data for use in phenotype-driven genomic diagnostics, translational research, and precision medicine applications.

Language: English

Citations

8

Complex portal 2025: predicted human complexes and enhanced visualisation tools for the comparison of orthologous and paralogous complexes DOI Creative Commons
Sucharitha Balu, Susie Huget, Juan Carlos De los Reyes

et al.

Nucleic Acids Research, Journal Year: 2024, Volume and Issue: 53(D1), P. D644 - D650

Published: Nov. 18, 2024

Abstract: The Complex Portal (www.ebi.ac.uk/complexportal) is a manually curated reference database for molecular complexes. It is a unifying web resource linking aggregated data on the composition, topology and function of macromolecular complexes from 28 species. In addition to significantly extending the number of manually curated complexes, we have massively extended the coverage of the human complexome through the incorporation of high confidence assemblies predicted by machine-learning algorithms trained on large-scale experimental data. The current content of human complexes in the portal, comprising 2150 manually curated complexes, has been augmented by 14 964 machine-learning (ML) predicted complexes from hu.MAP3.0. We have refactored the website to enable easy search and filtering of these different classes of protein complexes and have implemented the Complex Navigator, a visualisation tool to facilitate comparison of related complexes in the context of orthology or paralogy. We have embedded the Rhea reaction visualisation tool into the website to enable users to view the catalytic activity of enzyme complexes.

Language: English

Citations

2

The Origin of Discrepancies between Predictions and Annotations in Intrinsically Disordered Proteins DOI Creative Commons
Mátyás Pajkos, Gábor Erdős, Zsuzsanna Dosztányi

et al.

Biomolecules, Journal Year: 2023, Volume and Issue: 13(10), P. 1442 - 1442

Published: Sept. 25, 2023

Disorder prediction methods that can discriminate between ordered and disordered regions have contributed fundamentally to our understanding of the properties and prevalence of intrinsically disordered proteins (IDPs) in proteomes as well as their functional roles. However, a recent large-scale assessment of the performance of these methods indicated that there is still room for further improvements, necessitating novel approaches to understand the strengths and weaknesses of individual methods. In this study, we compared two methods: IUPred, and disorder prediction based on the pLDDT scores derived from AlphaFold2 (AF2) models. We evaluated these methods using a dataset from the DisProt database, consisting of experimentally characterized disordered regions and subsets associated with diverse experimental methods and functions. IUPred and AF2 provided consistent predictions in 79% of cases for long disordered regions; however, for 15% of these cases, they both suggested order, in disagreement with the annotations. These discrepancies arose primarily due to weak experimental support, the presence of intermediate states, or context-dependent behavior, such as binding-induced transitions. Furthermore, AF2 tended to predict helical regions with high pLDDT scores within disordered segments, while IUPred had limitations in identifying linker regions. These results provide valuable insights into the inherent limitations and potential biases of disorder prediction methods.

Language: English

Citations

6