Journal of Medical Internet Research,
Journal Year:
2024,
Volume and Issue:
26, P. e51297 - e51297
Published: Aug. 23, 2024
Background
The record of the origin and history of data, known as provenance, holds importance. Provenance information leads to higher interpretability of scientific results and enables reliable collaboration and data sharing. However, the lack of comprehensive evidence on provenance approaches hinders the uptake of good practice in clinical research.
Objective
This scoping review aims to identify criteria for provenance tracking in the biomedical domain. We reviewed the state-of-the-art frameworks, associated artifacts, and methodologies for provenance tracking.
Methods
This scoping review followed the methodological framework developed by Arksey and O’Malley. We searched the PubMed and Web of Science databases for English-language articles published from 2006 to 2022. Title and abstract screening were carried out by 4 independent reviewers using the Rayyan tool. A majority vote was required for consent on the eligibility of papers based on the defined inclusion and exclusion criteria. Full-text reading was performed independently by 2 reviewers, and information was extracted into a pretested template for the 5 research questions. Disagreements were resolved by a domain expert. The study protocol has previously been published.
Results
The search resulted in a total of 764 papers. Of 624 identified, deduplicated papers, 66 (10.6%) studies fulfilled the inclusion criteria. We identified diverse provenance-tracking approaches ranging from practical provenance processing and managing to theoretical frameworks distinguishing diverse concepts and details of data and metadata models, provenance components, and notations. A substantial majority investigated underlying requirements to varying extents and validation intensities but lacked completeness in provenance coverage. Mostly, cited requirements concerned the knowledge about data integrity and reproducibility. Moreover, these revolved around robust data quality assessments, consistent policies for sensitive data protection, improved user interfaces, and automated ontology development. We found that different stakeholder groups benefit from the availability of provenance information. Thereby, we recognized that the term provenance is subjected to an evolutionary and technical process with multifaceted meanings and roles. Challenges included organizational and technical issues linked to data annotation, provenance modeling, and performance, amplified by subsequent matters such as enhanced provenance information and quality principles.
Conclusions
As data volumes grow and computing power increases, the challenge of scaling provenance systems to handle data efficiently and assist complex queries intensifies, necessitating automated and scalable solutions. With rising legal and scientific demands, there is an urgent need for greater transparency in implementing provenance systems in research projects, despite the challenges of unresolved granularity and knowledge bottlenecks. We believe that our recommendations enable quality and guide the implementation of auditable and measurable provenance approaches as well as solutions in the daily tasks of biomedical scientists.
International Registered Report Identifier (IRRID)
RR2-10.2196/31750
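The screening rule in the Methods of this abstract (4 independent reviewers, majority vote) can be illustrated with a short sketch. It assumes that a majority among 4 reviewers means at least 3 votes to include, so a 2-2 tie does not pass; the abstract does not say how ties were handled. The function names are hypothetical. The second helper simply recomputes the reported inclusion rate from the counts given in the Results.

```python
def is_eligible(votes):
    """Return True when a strict majority of reviewers voted to include.

    `votes` is a list of booleans, one per reviewer. Treating a tie as
    "not eligible" is an assumption; the abstract does not specify it.
    """
    include_votes = sum(votes)
    return include_votes > len(votes) / 2


def inclusion_rate(included, screened):
    """Percentage of screened papers that met the inclusion criteria."""
    return round(100 * included / screened, 1)


if __name__ == "__main__":
    print(is_eligible([True, True, True, False]))   # 3 of 4 reviewers include
    print(is_eligible([True, True, False, False]))  # 2-2 tie
    print(inclusion_rate(66, 624))                  # counts from the Results
```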
Genetics,
Journal Year:
2023,
Volume and Issue:
224(1)
Published: March 3, 2023
Abstract
The Gene Ontology (GO) knowledgebase (http://geneontology.org) is a comprehensive resource concerning the functions of genes and gene products (proteins and noncoding RNAs). GO annotations cover genes from organisms across the tree of life as well as viruses, though most gene function knowledge currently derives from experiments carried out in a relatively small number of model organisms. Here, we provide an updated overview of the GO knowledgebase, as well as the efforts of the broad, international consortium of scientists that develops, maintains, and updates the GO knowledgebase. The GO knowledgebase consists of three components: (1) the GO: a computational structure describing the functional characteristics of genes; (2) GO annotations: evidence-supported statements asserting that a specific gene product has a particular functional characteristic; and (3) GO Causal Activity Models (GO-CAMs): mechanistic models of molecular “pathways” (GO biological processes) created by linking multiple GO annotations using defined relations. Each of these components is continually expanded, revised, and updated in response to newly published discoveries and receives extensive QA checks, reviews, and user feedback. For each of these components, we provide a description of the current contents, recent developments to keep the knowledgebase up to date with new discoveries, and guidance on how users can best make use of the data that we provide. We conclude with future directions for the project.
Bioinformatics,
Journal Year:
2022,
Volume and Issue:
39(1)
Published: Dec. 8, 2022
Abstract
Motivation
To provide high quality, computationally tractable annotation of binding sites for biologically relevant (cognate) ligands in UniProtKB using the chemical ontology ChEBI (Chemical Entities of Biological Interest), to better support efforts to study and predict functionally relevant interactions between protein sequences and structures and small molecule ligands.
Results
We structured the data model for cognate ligand binding site annotations in UniProtKB and performed a complete reannotation of all cognate ligand binding sites using stable unique identifiers from ChEBI, which we now use as the reference vocabulary for all such annotations. We developed improved search and query facilities for cognate ligands in the UniProt website, REST API and SPARQL endpoint that leverage the chemical structure data, nomenclature and classification that ChEBI provides.
Availability and implementation
Binding site annotations for cognate ligands described using ChEBI are available for UniProtKB protein sequence records in several formats (text, XML and RDF) and are freely available to query and download through the UniProt website (www.uniprot.org), REST API (www.uniprot.org/help/api), SPARQL endpoint (sparql.uniprot.org/) and FTP site (https://ftp.uniprot.org/pub/databases/uniprot/).
Supplementary information
Supplementary data are available at Bioinformatics online.
Genetics,
Journal Year:
2024,
Volume and Issue:
227(1)
Published: April 4, 2024
Abstract
WormBase has been the major repository and knowledgebase of information about the genome and genetics of Caenorhabditis elegans and other nematodes of experimental interest for over 2 decades. We have 3 goals: to keep current with the fast-paced C. elegans research, to provide better integration with other resources, and to be sustainable. Here, we discuss the current state of WormBase as well as progress and plans for moving core WormBase infrastructure to the Alliance of Genome Resources (the Alliance). As an Alliance member, WormBase will continue to interact with the C. elegans community, develop new features as needed, and curate key information from the literature and large-scale projects.
Genetics,
Journal Year:
2024,
Volume and Issue:
227(1)
Published: March 8, 2024
Since 1999, The Arabidopsis Information Resource (www.arabidopsis.org) has been curating data about the Arabidopsis thaliana genome. Its primary focus is integrating experimental gene function information from the peer-reviewed literature and codifying it as controlled vocabulary annotations. Our goal is to produce a "gold standard" functional annotation set that reflects the current state of knowledge about the Arabidopsis genome. At the same time, the resource serves as a nexus for community-based collaborations aimed at improving data quality, access, and reuse. For the past decade, our work has been made possible by subscriptions from our global user base. This update covers our ongoing biocuration work, some of our modernization efforts that contribute to the first major infrastructure overhaul since 2011, the introduction of JBrowse2, and the resource's role in community activities such as organizing the structural reannotation of the genome. For gene function assessment, we used gene ontology annotations as a metric to evaluate: (1) what is currently known about Arabidopsis gene function and (2) the set of "unknown" genes. Currently, 74% of the proteome has been annotated to at least one gene ontology term. Of those loci, half have experimental support for at least one of the following aspects: molecular function, biological process, or cellular component. Our work sheds light on the genes for which we have not yet identified any published experimental data and have no functional annotation. Drawing attention to these unknown genes highlights knowledge gaps and potential sources of novel discoveries.
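The coverage metric described in this abstract (share of loci with at least one gene ontology annotation, share of those with experimental support, and the remaining "unknown" set) can be sketched as follows. This is a minimal illustration only: the input structure, the function name `coverage_summary`, and the locus names in the demo are hypothetical and do not reflect the resource's actual data model or pipeline.

```python
ASPECTS = {"molecular_function", "biological_process", "cellular_component"}


def coverage_summary(annotations_by_locus):
    """Summarise GO annotation coverage for a set of loci.

    `annotations_by_locus` maps a locus ID to a list of
    (aspect, has_experimental_support) tuples. This layout is a
    hypothetical simplification chosen for the example.
    """
    total = len(annotations_by_locus)
    if total == 0:
        raise ValueError("no loci supplied")
    # Loci with at least one annotation in any of the three GO aspects.
    annotated = {
        locus for locus, anns in annotations_by_locus.items()
        if any(aspect in ASPECTS for aspect, _ in anns)
    }
    # Annotated loci with experimental support in at least one aspect.
    experimental = {
        locus for locus in annotated
        if any(is_exp for aspect, is_exp in annotations_by_locus[locus]
               if aspect in ASPECTS)
    }
    return {
        "annotated_fraction": len(annotated) / total,
        "experimental_fraction_of_annotated": (
            len(experimental) / len(annotated) if annotated else 0.0
        ),
        "unknown_loci": sorted(set(annotations_by_locus) - annotated),
    }


if __name__ == "__main__":
    toy = {  # made-up loci, for illustration only
        "locus_1": [("molecular_function", True), ("cellular_component", False)],
        "locus_2": [("biological_process", False)],
        "locus_3": [],
        "locus_4": [("cellular_component", True)],
    }
    print(coverage_summary(toy))
```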
Nucleic Acids Research,
Journal Year:
2023,
Volume and Issue:
52(D1), P. D434 - D441
Published: Oct. 30, 2023
Abstract
DisProt (URL: https://disprot.org) is the gold standard database for intrinsically disordered proteins and regions, providing valuable information about their functions. The latest version of DisProt brings significant advancements, including a broader representation of functions and an enhanced curation process. These improvements aim to increase both the quality of annotations and their coverage at the sequence level. Higher coverage has been achieved by adopting additional evidence codes. Quality of annotations has been improved by systematically applying Minimum Information About Disorder Experiments (MIADE) principles and reporting all the details of the experimental setup that could potentially influence the structural state of a protein. The DisProt database now includes new thematic datasets and has expanded the adoption of Gene Ontology terms, resulting in an extensive functional repertoire which is automatically propagated to UniProtKB. Finally, we show that DisProt's curated annotations strongly correlate with disorder predictions inferred from AlphaFold2 pLDDT (predicted Local Distance Difference Test) confidence scores. This comparison highlights the utility of DisProt in explaining apparent uncertainty of certain well-defined predicted structures, which often correspond to folding-upon-binding fragments. Overall, DisProt serves as a comprehensive resource, combining experimental evidence of disorder information to enhance our understanding of intrinsically disordered proteins and their functional implications.
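One way to picture the comparison between curated disorder annotations and pLDDT-derived predictions is a per-residue correlation. The sketch below is an assumption-laden illustration, not the method used for this database: it treats `1 - pLDDT/100` as a disorder score (a hypothetical choice), uses a plain Pearson correlation, and the function names are invented for the example.

```python
import math


def disorder_score_from_plddt(plddt_values):
    """Map pLDDT (0-100) to a 0-1 disorder score.

    Low confidence is read as likely disorder; this linear mapping is an
    assumption made for the example.
    """
    return [1.0 - p / 100.0 for p in plddt_values]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    curated = [1, 1, 1, 0, 0, 0]       # 1 = residue curated as disordered (toy data)
    plddt = [30, 40, 35, 90, 95, 85]   # toy per-residue pLDDT values
    print(pearson(curated, disorder_score_from_plddt(plddt)))
```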
Nature Machine Intelligence,
Journal Year:
2024,
Volume and Issue:
6(2), P. 220 - 228
Published: Feb. 14, 2024
Abstract
The Gene Ontology (GO) is a formal, axiomatic theory with over 100,000 axioms that describe the molecular functions, biological processes and cellular locations of proteins in three subontologies. Predicting the functions of proteins using the GO requires both learning and reasoning capabilities in order to maintain consistency and exploit the background knowledge in the GO. Many methods have been developed to automatically predict protein functions, but effectively exploiting all the axioms in the GO for knowledge-enhanced learning has remained a challenge. We have developed DeepGO-SE, a method that predicts GO functions from protein sequences using a pretrained large language model. DeepGO-SE generates multiple approximate models of GO, and a neural network predicts the truth values of statements about protein functions in these approximate models. We aggregate the truth values over multiple models so that DeepGO-SE approximates semantic entailment when predicting protein functions. We show, using several benchmarks, that the approach effectively exploits background knowledge in the GO and improves protein function prediction compared with state-of-the-art methods.
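The aggregation step in this abstract can be pictured with a small sketch. A statement is semantically entailed when it holds in every model of a theory, so one natural approximation is to take the minimum predicted truth value across the sampled models. Using `min` here is an assumption for illustration; the abstract only says that truth values are aggregated over multiple models. The function name and the numbers are hypothetical.

```python
def approximate_entailment(truth_values_per_model):
    """Aggregate per-model truth values into one score per statement.

    `truth_values_per_model` is a list with one row per approximate
    model; each row holds truth values in [0, 1], one per statement.
    Taking the minimum over models mirrors "true in all models" and is
    an assumed aggregation, not a confirmed detail of the method.
    """
    if not truth_values_per_model:
        raise ValueError("need at least one model")
    n_statements = len(truth_values_per_model[0])
    if any(len(row) != n_statements for row in truth_values_per_model):
        raise ValueError("every model must score the same statements")
    # zip(*rows) regroups the scores by statement instead of by model.
    return [min(scores) for scores in zip(*truth_values_per_model)]


if __name__ == "__main__":
    scores = [
        [0.9, 0.2],   # model 1: scores for statement A and statement B
        [0.8, 0.6],   # model 2
        [0.95, 0.4],  # model 3
    ]
    print(approximate_entailment(scores))
```

A statement scored highly in only some models ends up with a low aggregate, which is the behaviour one would expect from an entailment-style criterion.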
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2022,
Volume and Issue:
unknown
Published: Aug. 22, 2022
Abstract
Motivation
To provide high quality, computationally tractable annotation of binding sites for biologically relevant (cognate) ligands in UniProtKB using the chemical ontology ChEBI (Chemical Entities of Biological Interest), to better support efforts to study and predict functionally relevant interactions between proteins and small molecule ligands.
Results
We structured the data model for cognate ligand binding site annotations in UniProtKB and performed a complete reannotation of all cognate ligand binding sites using stable unique identifiers from ChEBI, which we now use as the reference vocabulary for all such annotations. We developed improved search and query facilities for cognate ligands in the UniProt website, REST API and SPARQL endpoint that leverage the chemical structure data, nomenclature, and classification that ChEBI provides.
Availability
Binding site annotations for cognate ligands described using ChEBI are available for UniProtKB protein sequence records in several formats (text, XML, and RDF), and are freely available to query and download through the UniProt website (www.uniprot.org), REST API (www.uniprot.org/help/api), SPARQL endpoint (sparql.uniprot.org/), and FTP site (https://ftp.uniprot.org/pub/databases/uniprot/).
Contact
[email protected]
Supplementary information
Supplementary Table 1.
Nucleic Acids Research,
Journal Year:
2023,
Volume and Issue:
52(D1), P. D255 - D264
Published: Nov. 16, 2023
RegulonDB is a database that contains the most comprehensive corpus of knowledge of the regulation of transcription initiation of Escherichia coli K-12, including data from both classical molecular biology and high-throughput methodologies. Here, we describe biological advances since our last NAR paper of 2019. We explain the changes to satisfy FAIR requirements. We also present a full reconstruction of the RegulonDB computational infrastructure, which has significantly improved data storage, retrieval and accessibility and thus supports a more intuitive and user-friendly experience. The integration of graphical tools provides clear visual representations of genetic regulation data, facilitating data interpretation and knowledge integration. RegulonDB version 12.0 can be accessed at https://regulondb.ccg.unam.mx.
Scientific Data,
Journal Year:
2025,
Volume and Issue:
12(1)
Published: Feb. 8, 2025
Abstract
Although rare diseases (RDs) affect over 260 million individuals worldwide, low data quality and scarcity challenge effective care and research. This work aims to harmonise the Common Data Set by European Rare Disease Registry Infrastructure, Health Level 7 Fast Healthcare Interoperability Base Resources, and the Global Alliance for Genomics and Health Phenopacket Schema into a novel rare disease common data model (RD-CDM), laying the foundation for developing international RD-CDMs aligned with these data standards. We developed a modular-based GitHub repository and documentation to account for flexibility, extensions and further development. Recommendations on the model’s cardinalities are given, inviting further refinement and international collaboration. An ontology-based approach was selected to find a common denominator between the semantic and syntactic data standards. Our RD-CDM version 2.0.0 comprises 78 data elements, extending the ERDRI-CDS by 62 elements, with previous versions implemented in four German university hospitals capturing real world data for development and evaluation. We identified three categories for evaluation: Medical Data Granularity, Clinical Reasoning and Medical Relevance, and Interoperability and Harmonisation.
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2025,
Volume and Issue:
unknown
Published: Jan. 6, 2025
Abstract
Proteins, nature’s intricate molecular machines, are the products of billions of years of evolution and play fundamental roles in sustaining life. Yet, deciphering their molecular language, that is, understanding how protein sequences and structures encode and determine biological functions, remains a cornerstone challenge in modern biology. Here, we introduce Evola, an 80 billion frontier protein-language generative model designed to decode the molecular language of proteins. By integrating information from protein sequences, structures, and user queries, Evola generates precise and contextually nuanced insights into protein function. A key innovation of Evola lies in its training on an unprecedented AI-generated dataset: 546 million protein question-answer pairs and 150 billion word tokens, designed to reflect the immense complexity and functional diversity of proteins. Post-pretraining, Evola integrates Direct Preference Optimization (DPO) to refine the model based on preference signals and Retrieval-Augmented Generation (RAG) for external knowledge incorporation, improving response quality and relevance. To evaluate its performance, we propose a novel framework, Instructional Response Space (IRS), demonstrating that Evola delivers expert-level insights, advancing research in proteomics and functional genomics while shedding light on the molecular logic encoded in proteins. The online demo is available at http://www.chat-protein.com/.