medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown
Published: Dec. 31, 2023
Abstract
Background
The development of clinical practice guidelines requires a meticulous literature search and screening process. This study aims to explore the potential of large language models in the Japanese Clinical Practice Guidelines for Management of Sepsis and Septic Shock (J-SSCG), focusing on enhancing the quality of the literature search and reducing the citation screening workload.
Methods
A prospective study will be conducted to compare the efficiency and accuracy of the conventional method and a novel approach using large language models. We will use a large language model, namely GPT-4, to conduct literature searches for predefined clinical questions. We will objectively measure the time required and compare it with the time taken by the conventional method. Following the screening, we will calculate the sensitivity and specificity of the results obtained from the large language model-assisted process. The total time spent by both approaches will also be compared to assess the workload reduction.
Trial registration
This research is submitted to the University Hospital Medical Information Network clinical trial registry (UMIN-CTR) [UMIN000053091].
Conflicts of interest
All authors declare no conflicts of interest.
Funding
None
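
The protocol's primary measures, sensitivity and specificity of the LLM-assisted screen against the conventional screen, reduce to plain confusion-matrix arithmetic. A minimal illustrative sketch in Python (the counts and IDs below are hypothetical, not study data):

```python
# Sensitivity/specificity of LLM-assisted screening against the conventional
# (reference) screen, computed as confusion-matrix arithmetic over record IDs.

def screening_metrics(llm_included: set, reference_included: set, all_ids: set):
    tp = len(llm_included & reference_included)   # kept by both
    fn = len(reference_included - llm_included)   # wrongly dropped by the LLM
    excluded_ref = all_ids - reference_included
    fp = len(llm_included & excluded_ref)         # wrongly kept by the LLM
    tn = len(excluded_ref - llm_included)         # dropped by both
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity

# Hypothetical example: 1000 citations, 10 truly relevant.
all_ids = set(range(1000))
reference = set(range(10))
llm = set(range(8)) | {42, 99}  # catches 8 of 10, adds 2 false positives
print(screening_metrics(llm, reference, all_ids))  # (0.8, ~0.998)
```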
BACKGROUND
The integration of large language models (LLMs) in mental health care is an emerging field. There is a need to systematically review the application outcomes and delineate the advantages and limitations of LLMs in clinical settings.
OBJECTIVE
This review aims to provide a comprehensive overview of the use of LLMs in mental health care, assessing their efficacy, challenges, and potential for future applications.
METHODS
A systematic search was conducted across multiple databases, including PubMed, Web of Science, Google Scholar, arXiv, medRxiv, and PsyArXiv, in November 2023. All forms of original research, peer-reviewed or not, published or disseminated between October 1, 2019, and December 2, 2023, were included without restrictions if they used LLMs developed after T5 and directly addressed the research questions.
RESULTS
From an initial pool of 313 articles, 34 met the inclusion criteria based on relevance to LLM use and the robustness of reported outcomes. Diverse applications were identified, including diagnosis, therapy, and patient engagement enhancement. Key challenges include data availability and reliability, the nuanced handling of mental states, and effective evaluation methods. Despite successes in accuracy and accessibility improvement, gaps in clinical applicability and ethical considerations were evident, pointing to the need for robust data, standardized evaluations, and interdisciplinary collaboration.
CONCLUSIONS
LLMs hold substantial promise for enhancing mental health care. For their full potential to be realized, emphasis must be placed on developing robust datasets, evaluation and development frameworks, ethical guidelines, and interdisciplinary collaborations to address current limitations.
JAMA Network Open, Journal Year: 2024, Volume and Issue: 7(7), P. e2420496 - e2420496
Published: July 8, 2024
Importance
Large language models (LLMs) are promising as tools for citation screening in systematic reviews. However, their applicability has not yet been determined.
Objective
To evaluate the accuracy and efficiency of an LLM in title and abstract literature screening.
Design, Setting, and Participants
This prospective diagnostic study used data from the title and abstract screening process for 5 clinical questions (CQs) in the development of the Japanese Clinical Practice Guidelines for Management of Sepsis and Septic Shock. The LLM decided to include or exclude citations based on the inclusion and exclusion criteria in terms of patient, population, and problem; intervention; comparison; and study design of the selected CQ, and its performance was compared with that of the conventional method. The study was conducted from January 7 to 15, 2024.
Exposures
LLM (GPT-4 Turbo)-assisted citation screening method.
Main Outcomes and Measures
The sensitivity and specificity of the LLM-assisted screening process were calculated, with the full-text screening result obtained using the conventional method set as the reference standard in the primary analysis. Pooled sensitivity and specificity were also estimated, and the screening times of the 2 methods were compared.
Results
In the conventional citation screening process, 8 of 5634 publications in CQ 1, 4 of 3418 in CQ 2, 4 of 1038 in CQ 3, 17 of 4326 in CQ 4, and 8 of 2253 in CQ 5 were selected. In the primary analysis of the 5 CQs, LLM-assisted citation screening demonstrated an integrated sensitivity of 0.75 (95% CI, 0.43 to 0.92) and specificity of 0.99 (95% CI, 0.99 to 0.99). Post hoc modifications to the command prompt improved the integrated sensitivity to 0.91 (95% CI, 0.77 to 0.97) without substantially compromising specificity (0.98 [95% CI, 0.96 to 0.99]). Additionally, the LLM-assisted method was associated with reduced time for processing 100 studies (1.3 minutes vs 17.2 minutes for the conventional method; mean difference, −15.25 minutes [95% CI, −17.70 to −12.79 minutes]).
Conclusions and Relevance
In this prospective diagnostic study investigating the performance of LLM-assisted citation screening, the model demonstrated acceptable sensitivity and reasonably high specificity with a reduced screening time. This novel method could potentially enhance the efficiency of citation screening and reduce the workload.
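
As a concrete illustration of the kind of pipeline this study evaluates, the sketch below sends one title/abstract record plus PICO-style criteria to GPT-4 Turbo and asks for an include/exclude decision. This is a minimal sketch under assumptions, not the authors' actual prompt; the criteria text and the screen helper are hypothetical.

```python
# Minimal sketch of LLM-assisted title/abstract screening via the OpenAI API.
# The prompt wording and eligibility criteria are hypothetical illustrations.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITERIA = """Population: adults with sepsis or septic shock.
Intervention: vasopressin. Comparison: norepinephrine alone.
Study design: randomized controlled trials."""  # hypothetical CQ criteria

def screen(title: str, abstract: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You screen citations for a systematic review. "
                        "Answer with exactly INCLUDE or EXCLUDE."},
            {"role": "user",
             "content": f"Criteria:\n{CRITERIA}\n\nTitle: {title}\n"
                        f"Abstract: {abstract}\n\nDecision:"},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("INCLUDE")
```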
Journal of Medical Internet Research, Journal Year: 2024, Volume and Issue: 26, P. e52758 - e52758
Published: Aug. 16, 2024
Background
The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers.
Objective
We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract-screening process for systematic reviews. Our goal was to develop a screening method that maximizes sensitivity for identifying relevant records.
Methods
We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted using GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across three layers: (1) research design, (2) target patients, and (3) interventions and controls. The screening was conducted using prompts tailored to each study. During this process, information extraction according to each study's inclusion criteria and optimization of the prompts were carried out using a GPT-4-based flow without manual adjustments. Records were evaluated at each layer, and only those meeting the criteria at all layers were subsequently judged as included.
Results
Both GPT-3.5 and GPT-4 were able to process about 110 records per minute, and the total time required to screen the first and second studies was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of the GPT-3.5 and GPT-4 screenings were 0.900/0.709 and 0.806/0.996, respectively. Both screenings by GPT-3.5 and GPT-4 judged all 6 records used for the meta-analysis as included. In the second study, the sensitivities/specificities were 0.958/0.116 and 0.875/0.855, respectively. These sensitivities align with those of human evaluators: 0.867-1.000 for the first study and 0.776-0.979 for the second, and all 9 records used for the meta-analysis were judged as included. After accounting for records justifiably excluded by GPT-4, the sensitivities/specificities of the GPT-4 screenings were 0.962/0.996 for the first study and 0.943/0.855 for the second. Further investigation indicated that records incorrectly excluded by GPT-3.5 were due to a lack of domain knowledge, while records incorrectly excluded by GPT-4 were due to misinterpretations of the inclusion criteria.
Conclusions
Our 3-layer screening method demonstrated an acceptable level of sensitivity and specificity that supports its practical application in systematic review screenings. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.
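
The layered design reads naturally as a short-circuiting filter chain: a record advances only if each layer's question is answered yes. A minimal sketch under assumptions (the layer questions below are generic stand-ins mirroring the paper's three layers, not the authors' tailored prompts):

```python
# Sketch of a 3-layer screen: a record must pass every layer to be included.
# ask_llm() is a stand-in for a GPT-3.5/GPT-4 call returning YES or NO.
from openai import OpenAI

client = OpenAI()

LAYERS = [  # hypothetical layer questions following the paper's structure
    "Is this a randomized controlled trial?",                 # (1) research design
    "Are the participants patients with bipolar disorder?",   # (2) target patients
    "Does it compare an active treatment against a control?", # (3) interventions/controls
]

def ask_llm(question: str, record: str, model: str = "gpt-4-0125-preview") -> bool:
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": f"{question}\nRecord:\n{record}\nAnswer YES or NO."}],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

def screen(record: str) -> bool:
    # Short-circuits: a NO at any layer excludes the record immediately.
    return all(ask_llm(question, record) for question in LAYERS)
```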
Journal of Medical Internet Research, Journal Year: 2024, Volume and Issue: 26, P. e56780 - e56780
Published: May 31, 2024
Large language models (LLMs) such as ChatGPT have become widely applied in the field of medical research. In the process of conducting systematic reviews, similar tools can be used to expedite various steps, including defining clinical questions, performing the literature search, document screening, information extraction, and language refinement, thereby conserving resources and enhancing efficiency. However, when using LLMs, attention should be paid to transparent reporting, distinguishing between genuine and false content, and avoiding academic misconduct. In this viewpoint, we highlight the potential roles of LLMs in the creation of systematic reviews and meta-analyses, elucidating their advantages, limitations, and future research directions, aiming to provide insights and guidance for authors planning systematic reviews and meta-analyses.
Annals of Internal Medicine, Journal Year: 2025, Volume and Issue: unknown
Published: Feb. 24, 2025
Background
Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.
Objective
To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews.
Design
Diagnostic test accuracy study.
Setting
48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original searches. Prompt development used GPT4-0125-preview (OpenAI).
Participants
None.
Intervention
Large language models were prompted to include or exclude citations based on the SR eligibility criteria. Model outputs were compared with original author decisions after full-text screening to evaluate performance (accuracy, sensitivity, and specificity).
Results
Optimized prompts achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening, and a weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to …) in full-text screening. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) underperformed. Direct costs to screen 10 000 citations differed substantially: where a single human reviewer was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD.
Limitations
Further prompt optimizations may exist. This was a retrospective study on a convenience sample of SRs, and full-text evaluations were limited to freely available PubMed Central articles.
Conclusion
A prompt achieving high sensitivity and specificity in screening, adaptable to other SRs and LLMs, was developed. Our prompting innovations may have value for investigators and researchers conducting criteria-based tasks in the medical sciences.
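
A generic template of the kind described separates the fixed screening instructions from the review-specific eligibility criteria, so one prompt skeleton can be reused across SRs. A minimal sketch with hypothetical placeholder criteria (not the published templates):

```python
# Sketch of a reusable screening-prompt template: the skeleton stays fixed,
# and only the review-specific eligibility criteria are substituted in.
SCREEN_TEMPLATE = """You are screening records for a systematic review.

Eligibility criteria:
{criteria}

Record:
{record}

Apply each criterion in turn. If any criterion clearly fails, answer EXCLUDE;
if the record meets or could meet all criteria, answer INCLUDE.
Answer with a single word: INCLUDE or EXCLUDE."""

# Hypothetical criteria for one review; another SR only swaps this string.
dta_criteria = """1. Adults undergoing assessment for pulmonary embolism.
2. Evaluates D-dimer testing against CT pulmonary angiography.
3. Reports data sufficient to reconstruct a 2x2 accuracy table."""

prompt = SCREEN_TEMPLATE.format(criteria=dta_criteria,
                                record="Title: ... Abstract: ...")
print(prompt)
```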
Journal of the American Medical Informatics Association, Journal Year: 2025, Volume and Issue: unknown
Published: March 22, 2025
Abstract
Objective
Abstract screening is a labor-intensive component of systematic review development, involving the repetitive application of inclusion and exclusion criteria on a large volume of studies. We aimed to validate large language models (LLMs) used to automate abstract screening.
Materials and Methods
LLMs (GPT-3.5 Turbo, GPT-4 Turbo, GPT-4o, Llama 3 70B, Gemini 1.5 Pro, and Claude Sonnet 3.5) were trialed across 23 Cochrane Library systematic reviews to evaluate their accuracy in zero-shot binary classification for abstract screening. Initial evaluation on a balanced development dataset (n = 800) identified optimal prompting strategies, and the best performing LLM-prompt combinations were then validated on a comprehensive dataset of replicated search results (n = 119 695).
Results
On the development dataset, LLMs exhibited superior performance to human researchers in terms of sensitivity (LLMmax = 1.000, humanmax = 0.775), precision (LLMmax = 0.927, humanmax = 0.911), and balanced accuracy (LLMmax = 0.904, humanmax = 0.865). When evaluated on the comprehensive dataset, LLMs maintained consistent sensitivity (range 0.756-1.000) but showed diminished precision (range 0.004-0.096) due to class imbalance. In addition, 66 LLM-human and LLM-LLM ensembles exhibited perfect sensitivity, with a maximal precision of 0.458 on the development dataset decreasing to 0.1450 over the comprehensive dataset, conferring workload reductions ranging between 37.55% and 99.11%.
Discussion
Automated abstract screening can reduce the screening workload while maintaining quality. The variation in performance highlights the importance of domain-specific validation before autonomous deployment. LLM-human ensembles can achieve similar benefits while maintaining human oversight of all records.
Conclusion
LLMs may reduce the labor and cost of systematic review development with maintained or improved accuracy, thereby increasing the efficiency and quality of evidence synthesis.
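
An LLM-LLM (or LLM-human) ensemble with perfect sensitivity is most naturally a union rule: a record is excluded only when every screener excludes it, so no single screener can drop a relevant record on its own. A minimal sketch under that assumption (the toy screeners are hypothetical stand-ins for LLM or human decisions):

```python
# Union-rule ensemble: include a record if ANY screener includes it.
# This trades precision for sensitivity, since a record is excluded
# only when every screener agrees to exclude it.
from typing import Callable, Iterable

Screener = Callable[[str], bool]  # returns True to include a record

def ensemble_include(record: str, screeners: Iterable[Screener]) -> bool:
    return any(screener(record) for screener in screeners)

# Toy stand-ins so the sketch runs; real screeners would call an LLM or a human.
def screener_a(record: str) -> bool:
    return "randomized" in record.lower()

def screener_b(record: str) -> bool:
    return "trial" in record.lower()

print(ensemble_include("A randomized crossover study", [screener_a, screener_b]))  # True
```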
Systematic Reviews, Journal Year: 2024, Volume and Issue: 13(1)
Published: May 16, 2024
Abstract
We aimed to compare the concordance of information extracted and the time taken between a large language model (OpenAI's GPT-3.5 Turbo via API) and conventional human extraction methods in retrieving information from scientific articles on diabetic retinopathy (DR). The extraction was done using GPT-3.5 Turbo as of October 2023. OpenAI's GPT-3.5 Turbo significantly reduced the time taken for extraction. Concordance was highest at 100% for the country of study, followed by 64.7% for significant risk factors of DR, 47.1% for the exclusion and inclusion criteria, and lastly 41.2% for the odds ratio (OR) and 95% confidence interval (CI). The concordance levels seemed to indicate the complexity associated with each prompt. This suggests that the model may be adopted to extract simple information that is easily located in the text, leaving more complex information to be extracted by the researcher. It is crucial to note that foundation models are constantly improving, with new versions being released quickly. Subsequent work can focus on retrieval-augmented generation (RAG), embedding, chunking of the PDF into useful sections, and prompting to improve extraction accuracy.
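
The setup this abstract describes, structured field extraction through the chat API, can be sketched roughly as follows. The field list and prompt wording are hypothetical illustrations, not the study's protocol:

```python
# Sketch of structured data extraction from an article with GPT-3.5 Turbo.
# Fields and prompt are illustrative; JSON mode keeps the output parseable.
import json
from openai import OpenAI

client = OpenAI()

FIELDS = ["country of study",
          "significant risk factors for diabetic retinopathy",
          "inclusion and exclusion criteria",
          "odds ratios with 95% confidence intervals"]

def extract(article_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user",
                   "content": "Extract the following fields from the article "
                              f"and reply as a JSON object: {FIELDS}\n\n"
                              f"Article:\n{article_text}"}],
    )
    return json.loads(response.choices[0].message.content)
```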
medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: June 3, 2024
Abstract
Systematic reviews (SRs) are the highest standard of evidence, shaping clinical practice guidelines, policy decisions, and research priorities. However, their labor-intensive nature, including an initial rigorous article screen by at least two investigators, delays access to reliable information synthesis. Here, we demonstrate that large language models (LLMs) with intentional prompting can match human screening performance. We introduce Framework Chain-of-Thought, a novel prompting approach that directs LLMs to systematically reason against predefined frameworks. We evaluated our prompts across ten SRs covering four common types of SR questions (i.e., prevalence, intervention benefits, diagnostic test accuracy, and prognosis), achieving a mean accuracy of 93.6% (range: 83.3-99.6%) and sensitivity of 97.5% (89.7-100%) in full-text screening. Compared with experienced reviewers (mean accuracy 92.4% [76.8-97.8%], sensitivity 75.1% [44.1-100%]), our prompt demonstrated significantly higher sensitivity in four reviews (p<0.05), higher accuracy in one review, and comparable performance in five (p>0.05). While the traditional screening process for 7000 articles required 530 hours and $10,000 USD, our approach completed screening in under one day for $430 USD. Our results establish that LLMs can perform SR screening with performance matching human experts, setting the foundation for end-to-end automated SRs.
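
Framework Chain-of-Thought, as described, has the model walk through a predefined appraisal framework before committing to a decision. A minimal sketch of such a prompt under assumptions (the framework elements shown are a generic PICO-style stand-in, not the authors' published prompts):

```python
# Sketch of a Framework Chain-of-Thought screening prompt: the model must
# reason through each framework element before giving a final decision.
FRAMEWORK = [  # generic PICO-style elements; the paper's frameworks vary by SR type
    "Population: does the study population match the review's target?",
    "Intervention/Exposure: is the intervention of interest evaluated?",
    "Comparator: is an eligible comparison group present?",
    "Outcomes: are relevant outcomes reported?",
]

def framework_cot_prompt(criteria: str, article_text: str) -> str:
    steps = "\n".join(f"{i}. {item}" for i, item in enumerate(FRAMEWORK, 1))
    return (
        "Screen the article below for a systematic review.\n"
        f"Eligibility criteria:\n{criteria}\n\n"
        "Reason step by step against this framework, quoting evidence "
        f"from the article for each element:\n{steps}\n\n"
        "Then conclude with a final line: DECISION: INCLUDE or DECISION: EXCLUDE.\n\n"
        f"Article:\n{article_text}"
    )
```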
International Journal of Medical Informatics, Journal Year: 2024, Volume and Issue: 189, P. 105531 - 105531
Published: June 26, 2024
PRISMA-based literature reviews require meticulous scrutiny of extensive textual data by multiple reviewers, which is associated with considerable human effort.