Journal of Medical Internet Research,
Journal Year:
2024,
Volume and Issue:
26, P. e51297 - e51297
Published: Aug. 23, 2024
Background
The record of the origin and the history of data, known as provenance, holds importance. Provenance information leads to higher interpretability of scientific results and enables reliable collaboration and data sharing. However, the lack of comprehensive evidence on provenance approaches hinders the uptake of good scientific practice in clinical research.
Objective
This scoping review aims to identify approaches and criteria for provenance tracking in the biomedical domain. We reviewed the state-of-the-art frameworks, associated artifacts, and methodologies for provenance tracking.
Methods
This scoping review followed the methodological framework developed by Arksey and O’Malley. We searched the PubMed and Web of Science databases for English-language articles published from 2006 to 2022. Title and abstract screening were carried out by 4 independent reviewers using the Rayyan screening tool. A majority vote was required for consent on the eligibility of papers based on the defined inclusion and exclusion criteria. Full-text reading and screening were performed independently by 2 reviewers, and information was extracted into a pretested template for the 5 research questions. Disagreements were resolved by a domain expert. The study protocol has previously been published.
Results
The search resulted in a total of 764 papers. Of 624 identified, deduplicated papers, 66 (10.6%) studies fulfilled the inclusion criteria. We identified diverse provenance-tracking approaches ranging from practical provenance processing and managing to theoretical frameworks distinguishing diverse concepts and details of data and metadata models, provenance components, and notations. A substantial majority investigated underlying requirements to varying extents and validation intensities but lacked completeness in provenance coverage. Mostly, cited requirements concerned the knowledge about data integrity and reproducibility. Moreover, these revolved around robust data quality assessments, consistent policies for sensitive data protection, improved user interfaces, and automated ontology development. We found that different stakeholder groups benefit from the availability of provenance information. Thereby, we recognized that the term provenance is subjected to an evolutionary and technical process with multifaceted meanings and roles. Challenges included organizational and technical issues linked to data annotation, provenance modeling, and performance, amplified by subsequent matters such as enhanced provenance information and quality principles.
Conclusions
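As data volumes grow and computing power increases, the challenge of scaling provenance systems to handle data efficiently and assist complex queries intensifies, necessitating automated and scalable solutions. With rising legal and scientific demands, there is an urgent need for greater transparency in implementing provenance systems in clinical research projects, despite the challenges of unresolved granularity and knowledge bottlenecks. We believe that our recommendations enable quality and guide the implementation of auditable and measurable provenance approaches as well as solutions in the daily tasks of biomedical scientists.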
International Registered Report Identifier (IRRID): RR2-10.2196/31750
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2025,
Volume and Issue:
unknown
Published: Jan. 8, 2025
Biological knowledgebases are essential resources for biomedical researchers, providing ready access to gene function and genomic data. Professional, manual curation of knowledgebases, however, is labor-intensive, and thus high-performing machine learning methods that improve biocuration efficiency are needed. Here we report on sentence-level classification to identify biocuration-relevant sentences in the full text of published references for two data types: gene expression and protein kinase activity. We performed a detailed characterization of sentences from references in the WormBase bibliography and used this characterization to define three tasks for classifying sentences as either 1) fully curatable, 2) partially curatable, or 3) all language-related. We evaluated various machine learning (ML) models applied to these tasks and found that GPT and BioBERT achieve the highest average performance, resulting in F1 performance scores ranging from 0.89 to 0.99 depending upon the task. Our findings demonstrate the feasibility of extracting biocuration-relevant sentences from full text. Integrating these models into professional biocuration workflows, such as those used by the Alliance of Genome Resources and the ACKnowledge community curation platform, might well facilitate efficient and accurate annotation of the biomedical literature.
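The F1 scores quoted above combine precision and recall for a binary sentence classifier. The following is a minimal sketch of that calculation, not the authors' evaluation code; the function name `f1_score` and the toy labels are assumptions made for illustration (1 marks a curatable sentence, 0 any other sentence).

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the F1 score for binary labels (1 = positive, 0 = negative)."""
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No correct positive predictions: precision or recall is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, not data from the study.
gold = [1, 1, 1, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 1]
print(f1_score(gold, predicted))  # 0.75
```

In this toy case there are 3 true positives, 1 false positive, and 1 false negative, so precision and recall are both 0.75 and so is F1.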
Nature,
Journal Year:
2025,
Volume and Issue:
unknown
Published: Feb. 26, 2025
Abstract
A comprehensive, computable representation of the functional repertoire of all macromolecules encoded within the human genome is a foundational resource for biology and biomedical research. The Gene Ontology Consortium has been working towards this goal by generating a structured body of information about gene functions, which now includes experimental findings reported in more than 175,000 publications for human genes and genes in experimentally tractable model organisms 1,2. Here, we describe the results of a large, international effort to integrate all of these findings to create a representation of human gene functions that is as complete and accurate as possible. Specifically, we apply an expert-curated, explicit evolutionary modelling approach to all human protein-coding genes. This approach integrates available experimental information across families of related genes into models that reconstruct the gain and loss of functional characteristics over evolutionary time. The resulting set of 68,667 integrated gene functions covers approximately 82% of human protein-coding genes and reveals a marked preponderance of molecular regulatory functions, and the models provide insights into the evolutionary origins of human gene functions. We show that our set of function descriptions can improve the widely used genomic technique of Gene Ontology enrichment analysis. The experimental evidence for each functional characteristic is recorded, thereby enabling the scientific community to help review and improve the resource, which we have made publicly available.
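Enrichment analysis, mentioned above, is commonly implemented as an over-representation test: given a gene list, how surprising is the number of genes annotated to a given term? The sketch below uses a one-sided hypergeometric test, which is one common formulation and not necessarily the procedure used in the paper; the function name `enrichment_p_value` and the toy numbers are assumptions for illustration.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """One-sided hypergeometric p-value, P(X >= hits).

    population: number of genes in the background set
    annotated:  background genes annotated to the term
    sample:     number of genes in the study list
    hits:       study-list genes annotated to the term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probabilities of observing `hits` or more annotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical toy numbers: 10 background genes, 4 annotated to the term,
# a study list of 3 genes of which 2 carry the annotation.
print(enrichment_p_value(10, 4, 3, 2))  # 0.333... (40 / 120)
```

A more complete annotation set changes `annotated` and `hits` for each term, which is how better function descriptions can alter enrichment results. Real analyses also correct for testing many terms at once, which this sketch omits.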
bioRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2023,
Volume and Issue:
unknown
Published: Nov. 7, 2023
Abstract
Since 1999, The Arabidopsis Information Resource (www.arabidopsis.org) has been curating data about the Arabidopsis thaliana genome. Its primary focus is integrating experimental gene function information from the peer-reviewed literature and codifying it as controlled vocabulary annotations. Our goal is to produce a ‘gold standard’ functional annotation set that reflects the current state of knowledge about the Arabidopsis genome. At the same time, the resource serves as a nexus for community-based collaborations aimed at improving data quality, access and reuse. For the past decade, our work has been made possible by subscriptions from our global user base. This update covers our ongoing biocuration work, some of our modernization efforts that contribute to the first major infrastructure overhaul since 2011, the introduction of JBrowse2, and the resource’s role in community activities such as organizing the structural reannotation of the genome. For gene function assessment, we used Gene Ontology annotations as a metric to evaluate: (1) what is currently known about Arabidopsis gene function, and (2) the set of ‘unknown’ genes. Currently, 74% of the proteome has been annotated to at least one Gene Ontology term. Of those loci, half have experimental support for at least one of the following aspects: molecular function, biological process, or cellular component. Our work sheds light on the genes for which we have not yet identified any published experimental data and which have no functional annotation. Drawing attention to these unknown genes highlights knowledge gaps and potential sources of novel discoveries.
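Article Summary
The Arabidopsis Information Resource (TAIR, www.arabidopsis.org) is a comprehensive website about Arabidopsis thaliana, a small plant that’s very easy to grow and analyze in the laboratory and that helps us understand how many other plants function. We share our progress in data collection and organization, tool improvement, and involvement in community projects.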
Nucleic Acids Research,
Journal Year:
2024,
Volume and Issue:
53(D1), P. D436 - D443
Published: Nov. 18, 2024
Abstract
Over the past 20 years, the Immune Epitope Database (IEDB, iedb.org) has established itself as the foremost resource for immune epitope data. The IEDB catalogs published epitopes and their contextual experimental data in a freely searchable public resource. The IEDB team manually curates data from the literature into a structured format and spans infectious, allergic, autoimmune, and transplant diseases. Here, we describe the enhancements made since our 2018 paper, capturing user-directed updates to the search interface, advanced data exports, increases in data quality, and improved interoperability across related resources. As we look forward to the next 20 years, we are confident in our ability to meet the needs of our users and to contribute to the broader field of data standardization.
mSystems,
Journal Year:
2022,
Volume and Issue:
7(5)
Published: Aug. 25, 2022
Despite an ever-growing number of data sets that catalog and characterize interactions between microbes in different environments and conditions, many of these data are neither easily accessible nor intercompatible. These limitations present a major challenge to microbiome research by hindering the streamlined drawing of inferences across studies. Here, we propose guiding principles to make microbial interaction data more findable, accessible, interoperable, and reusable (FAIR). We outline specific use cases for interaction data that span the diverse space of microbiome research, and discuss the untapped potential for new insights that can be fulfilled through broader integration of microbial interaction data. These include, among others, the design of intercompatible synthetic communities for environmental, industrial, or medical applications, and the inference of novel interactions from disparate studies. Lastly, we envision potential trajectories for the deployment of FAIR microbial interaction data based on existing resources, reporting standards, and current momentum within the community.
PLoS ONE,
Journal Year:
2023,
Volume and Issue:
18(5), P. e0285433 - e0285433
Published: May 17, 2023
The Global Alliance for Genomics and Health (GA4GH) is a standards-setting organization that is developing a suite of coordinated standards for genomics. The GA4GH Phenopacket Schema is a standard for sharing disease and phenotype information that characterizes an individual person or biosample. The Phenopacket Schema is flexible and can represent clinical data for any kind of human disease including rare disease, complex disease, and cancer. It also allows consortia or databases to apply additional constraints to ensure uniform data collection for specific goals. We present phenopacket-tools, an open-source Java library and command-line application for construction, conversion, and validation of phenopackets. Phenopacket-tools simplifies construction of phenopackets by providing concise builders, programmatic shortcuts, and predefined building blocks (ontology classes) for concepts such as anatomical organs, age of onset, biospecimen type, and clinical modifiers. Phenopacket-tools can be used to validate the syntax and semantics of phenopackets as well as to assess adherence to additional user-defined requirements. The documentation includes examples showing how to use the Java library and the command-line tool to create and validate phenopackets. We demonstrate how to create, convert, and validate phenopackets using the library or the command-line application. Source code, API documentation, a comprehensive user guide and a tutorial can be found at https://github.com/phenopackets/phenopacket-tools. The library can be installed from the public Maven Central artifact repository and the application is available as a standalone archive. The phenopacket-tools library helps developers implement and standardize the collection and exchange of phenotypic and other clinical data for use in phenotype-driven genomic diagnostics, translational research, and precision medicine applications.
Nucleic Acids Research,
Journal Year:
2024,
Volume and Issue:
53(D1), P. D644 - D650
Published: Nov. 18, 2024
Abstract
The Complex Portal (www.ebi.ac.uk/complexportal) is a manually curated reference database for molecular complexes. It is a unifying web resource linking aggregated data on composition, topology and the function of macromolecular complexes from 28 species. In addition to significantly extending the number of manually curated complexes, we have massively extended the coverage of the human complexome through the incorporation of high confidence assemblies predicted by machine-learning algorithms trained on large-scale experimental data. The current content of human complexes in the portal, comprising 2150 complexes, has been augmented by 14 964 machine-learning (ML) predicted complexes from hu.MAP3.0. We have refactored the website to enable easy search and filtering of these different classes of protein complexes and have implemented the Complex Navigator, a visualisation tool to facilitate comparison of related complexes in the context of orthology or paralogy. We have embedded the Rhea reaction visualisation tool into the website to enable users to view the catalytic activity of enzyme complexes.
Biomolecules,
Journal Year:
2023,
Volume and Issue:
13(10), P. 1442 - 1442
Published: Sept. 25, 2023
Disorder prediction methods that can discriminate between ordered and disordered regions have contributed fundamentally to our understanding of the properties and prevalence of intrinsically disordered proteins (IDPs) in proteomes as well as their functional roles. However, a recent large-scale assessment of the performance of these methods indicated that there is still room for further improvements, necessitating novel approaches to understand the strengths and weaknesses of individual methods. In this study, we compared two methods, IUPred and disorder prediction, based on the pLDDT scores derived from AlphaFold2 (AF2) models. We evaluated these methods using a dataset from the DisProt database, consisting of experimentally characterized disordered regions and subsets associated with diverse experimental methods and functions. IUPred and AF2 provided consistent predictions in 79% of cases for long disordered regions; however, for 15% of these cases, they both suggested order in disagreement with annotations. These discrepancies arose primarily due to weak experimental support, the presence of intermediate states, or context-dependent behavior, such as binding-induced transitions. Furthermore, AF2 tended to predict helical regions with high pLDDT scores within disordered segments, while IUPred had limitations in identifying linker regions. These results provide valuable insights into the inherent limitations and potential biases of disorder prediction methods.
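
A pLDDT-based disorder call of the kind described above can be sketched as a simple threshold on per-residue scores followed by a minimum-length filter for "long" regions. This is an illustrative sketch only: the cutoff of 70, the minimum length of 30 residues, and the function name `disordered_regions` are assumptions, not values or code taken from the study.

```python
from typing import List, Tuple


def disordered_regions(
    plddt: List[float],
    cutoff: float = 70.0,  # assumed threshold; residues below it count as disordered
    min_len: int = 30,     # assumed minimum length of a "long" disordered region
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs (0-based, inclusive) of long low-pLDDT runs."""
    regions: List[Tuple[int, int]] = []
    start = None  # start index of the current low-confidence run, if any
    for i, score in enumerate(plddt):
        if score < cutoff:
            if start is None:
                start = i
        else:
            # A confident residue closes the current run; keep it if long enough.
            if start is not None and i - start >= min_len:
                regions.append((start, i - 1))
            start = None
    # Handle a run that extends to the end of the sequence.
    if start is not None and len(plddt) - start >= min_len:
        regions.append((start, len(plddt) - 1))
    return regions


# Hypothetical scores: 10 confident residues, 35 low-confidence residues,
# 5 confident residues, then a short low-confidence tail of 10 residues.
scores = [90.0] * 10 + [40.0] * 35 + [85.0] * 5 + [50.0] * 10
print(disordered_regions(scores))  # [(10, 44)]
```

The short tail is dropped because it falls below the assumed length filter. A confidently predicted helix inside a disordered segment would split a run in the same way, which is one way the helical bias noted above could affect region-level comparisons.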