UniProt: the Universal Protein Knowledgebase in 2025
Nucleic Acids Research, Journal Year: 2024, Volume and Issue: 53(D1), P. D609 - D617
Published: Nov. 18, 2024
The aim of the UniProt Knowledgebase (UniProtKB; https://www.uniprot.org/) is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication, we describe ongoing changes to our production pipeline to limit the sequences available in UniProtKB to high-quality, non-redundant reference proteomes. We continue to manually curate the scientific literature to add the latest functional data and use machine learning techniques. We also encourage community curation to ensure key publications are not missed. We provide an update on the automatic annotation methods used by UniProtKB to predict information for unreviewed entries describing unstudied proteins. Finally, updates to the UniProt website are described, including a new tab linking protein to genomic information. In recognition of its value to the scientific community, the UniProt database has been awarded Global Core Biodata Resource status.
Language: English
FuncFetch: An LLM-assisted workflow enables mining thousands of enzyme-substrate interactions from published manuscripts
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: July 23, 2024
Abstract
Motivation: Thousands of genomes are publicly available, however, most genes in those genomes have poorly defined functions. This is partly due to a gap between previously published, experimentally-characterized protein activities and activities deposited in databases. This activity deposition is bottlenecked by the time-consuming biocuration process. The emergence of large language models (LLMs) presents an opportunity to speed up text-mining of protein activities for biocuration.
Results: We developed FuncFetch, a workflow that integrates NCBI E-Utilities, OpenAI’s GPT-4 and Zotero, to screen thousands of manuscripts and extract enzyme activities. Extensive validation revealed high precision and recall of GPT-4 in determining whether the abstract of a given paper indicates presence of a characterized enzyme activity in that paper. Provided the manuscript, FuncFetch extracted data such as species information, enzyme names, sequence identifiers, substrates and products, which were subjected to extensive quality analyses. Comparison of this workflow against a manually curated dataset of BAHD acyltransferase activities demonstrated a precision/recall of 0.86/0.64 in extracting substrates. We further deployed FuncFetch on nine large plant enzyme families. Screening 27,120 papers, FuncFetch retrieved 32,605 entries from 5547 selected papers. We also identified multiple extraction errors including incorrect associations, non-target enzymes, and hallucinations, which highlight the need for further manual curation. The BAHD activities were verified, resulting in a comprehensive functional fingerprint of this family and revealing that ∼70% of the experimentally characterized enzymes are uncurated in the public domain. FuncFetch represents an advance in biocuration and lays the groundwork for predicting functions of uncharacterized enzymes.
Availability and Implementation: Code and minimally-curated activities are available at: https://github.com/moghelab/funcfetch and https://tools.moghelab.org/funczymedb
Language: English
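For readers unfamiliar with the precision/recall figures quoted in the abstract above, the following is a minimal sketch of how such scores can be computed when comparing extracted substrates against a manually curated set. It is not the FuncFetch evaluation code; the function name `precision_recall` and the substrate names are invented toy values chosen only for illustration.

```python
def precision_recall(extracted: set[str], curated: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of an extracted set against a curated set."""
    true_pos = len(extracted & curated)  # items found in both sets
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(curated) if curated else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the paper.
curated = {"acetyl-CoA", "benzoyl-CoA", "shikimate", "quinate"}
extracted = {"acetyl-CoA", "shikimate", "malonyl-CoA"}

p, r = precision_recall(extracted, curated)
print(f"precision={p:.2f} recall={r:.2f}")  # prints "precision=0.67 recall=0.50"
```

Precision drops when the extraction reports substrates that curators did not record; recall drops when curated substrates are missed.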
FuncFetch: An LLM-assisted workflow enables mining thousands of enzyme-substrate interactions from published manuscripts
Bioinformatics, Journal Year: 2024, Volume and Issue: unknown
Published: Dec. 20, 2024
Abstract
Motivation: Thousands of genomes are publicly available, however, most genes in those genomes have poorly defined functions. This is partly due to a gap between previously published, experimentally-characterized protein activities and activities deposited in databases. This activity deposition is bottlenecked by the time-consuming biocuration process. The emergence of large language models (LLMs) presents an opportunity to speed up text-mining of protein activities for biocuration.
Results: We developed FuncFetch, a workflow that integrates NCBI E-Utilities, OpenAI’s GPT-4 and Zotero, to screen thousands of manuscripts and extract enzyme activities. Extensive validation revealed high precision and recall of GPT-4 in determining whether the abstract of a given paper indicates presence of a characterized enzyme activity in that paper. Provided the manuscript, FuncFetch extracted data such as species information, enzyme names, sequence identifiers, substrates and products, which were subjected to extensive quality analyses. Comparison of this workflow against a manually curated dataset of BAHD acyltransferase activities demonstrated a precision/recall of 0.86/0.64 in extracting substrates. We further deployed FuncFetch on nine large plant enzyme families. Screening 26,543 papers, FuncFetch retrieved 32,605 entries from 5,459 selected papers. We also identified multiple extraction errors including incorrect associations, non-target enzymes, and hallucinations, which highlight the need for further manual curation. The BAHD activities were verified, resulting in a comprehensive functional fingerprint of this family and revealing that ∼70% of the experimentally characterized enzymes are uncurated in the public domain. FuncFetch represents an advance in biocuration and lays the groundwork for predicting functions of uncharacterized enzymes.
Availability and Implementation: Code and minimally-curated activities are available at: https://github.com/moghelab/funcfetch and https://tools.moghelab.org/funczymedb.
Supplementary information: Supplementary data are available at Bioinformatics online.
Language: English
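The abstract above names NCBI E-Utilities as the literature-retrieval component of the workflow. As a rough illustration of what a paper-screening query against that service can look like, the sketch below builds an ESearch request URL for PubMed. It only constructs the URL and makes no network call. The helper name `build_esearch_url` and the example search term are assumptions made for illustration and are not taken from the FuncFetch code.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-Utilities ESearch endpoint.
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, retmax: int = 100) -> str:
    """Build a PubMed ESearch URL for a query term (hypothetical helper)."""
    params = {
        "db": "pubmed",     # database to search
        "term": term,       # query string
        "retmax": retmax,   # maximum number of record IDs to return
        "retmode": "json",  # ask for a JSON response
    }
    return f"{ESEARCH_BASE}?{urlencode(params)}"


# Hypothetical search term; the actual queries used by the authors are not given here.
url = build_esearch_url("BAHD acyltransferase", retmax=20)
print(url)
```

The IDs returned by such a query would then feed the abstract-screening step described in the Results.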