Regional Anesthesia & Pain Medicine,
Journal Year:
2025,
Volume and Issue:
unknown, P. rapm - 106358
Published: Feb. 16, 2025
Introduction
Artificial intelligence (AI), particularly large language models like Chat Generative Pre-Trained Transformer (ChatGPT), has demonstrated potential in streamlining research methodologies. Systematic reviews and meta-analyses, often considered the pinnacle of evidence-based medicine, are inherently time-intensive and demand meticulous planning, rigorous data extraction, thorough analysis, and careful synthesis. Despite promising applications of AI, its utility in conducting systematic reviews with meta-analysis remains unclear. This study evaluated ChatGPT’s accuracy in key tasks of a systematic review with meta-analysis.
Methods
This validation study used data from a published systematic review on emotional functioning after spinal cord stimulation. ChatGPT-4o performed title/abstract screening, full-text selection, and data pooling for this review. Comparisons were made against the human-executed steps, which served as the gold standard. Outcomes of interest included accuracy, sensitivity, specificity, positive predictive value, and negative predictive value for the screening tasks. We also assessed discrepancies in pooled effect estimates and forest plot generation.
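As a concrete illustration of the comparison described in the Methods, the sketch below scores automated include/exclude decisions against a human gold standard. It is a minimal example, not the authors' code; the boolean encoding (True = include) and all names are illustrative assumptions.

```python
# Minimal sketch: scoring LLM screening decisions against the human
# gold standard. True = include; names are illustrative assumptions.

def screening_metrics(llm: list[bool], human: list[bool]) -> dict[str, float]:
    """Confusion-matrix metrics for include/exclude screening decisions."""
    tp = sum(l and h for l, h in zip(llm, human))          # both include
    tn = sum(not l and not h for l, h in zip(llm, human))  # both exclude
    fp = sum(l and not h for l, h in zip(llm, human))      # LLM over-includes
    fn = sum(not l and h for l, h in zip(llm, human))      # LLM misses a record
    # Assumes both classes occur, so no denominator is zero.
    return {
        "accuracy": (tp + tn) / len(llm),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "positive_predictive_value": tp / (tp + fp),
        "negative_predictive_value": tn / (tn + fn),
    }
```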
Results
For title and abstract screening, ChatGPT achieved an accuracy of 70.4%, sensitivity of 54.9%, and specificity of 80.1%. In the full-text selection phase, accuracy was 68.4%, sensitivity 75.6%, and specificity 66.8%. ChatGPT successfully generated five forest plots, achieving 100% accuracy in calculating mean differences, 95% CIs, and heterogeneity statistics (I² scores and tau-squared values) for most outcomes, with minor discrepancies in tau-squared values (range 0.01–0.05). Forest plots showed no significant discrepancies.
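For reference, the heterogeneity statistics named in the Results are conventionally computed from Cochran's Q with inverse-variance weights w_i, study estimates θ̂_i, fixed-effect pooled estimate θ̂, and k studies. The abstract does not state which estimator ChatGPT used, so the standard DerSimonian–Laird form shown below is an assumption:

```latex
% Standard heterogeneity statistics; DerSimonian-Laird tau^2 is assumed.
\[
Q = \sum_{i=1}^{k} w_i (\hat{\theta}_i - \hat{\theta})^2, \qquad
I^2 = \max\!\left(0, \frac{Q - (k - 1)}{Q}\right) \times 100\%, \qquad
\hat{\tau}^2 = \max\!\left(0, \frac{Q - (k - 1)}{\sum_i w_i - \sum_i w_i^2 / \sum_i w_i}\right)
\]
```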
Conclusion
ChatGPT demonstrates modest to moderate accuracy in study selection tasks, but performs well in meta-analytic calculations. These findings underscore the potential for AI to augment systematic review methodologies, while emphasizing the need for human oversight to ensure the integrity of these workflows.
Journal of Medical Internet Research,
Journal Year:
2023,
Volume and Issue:
26, P. e48996 - e48996
Published: Sept. 28, 2023
The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of the review and of subsequent health care decisions. Traditional methods rely heavily on human reviewers, requiring significant investment of time and resources.
Systematic Reviews,
Journal Year:
2024,
Volume and Issue:
13(1)
Published: June 15, 2024
Abstract
Background
Systematically screening published literature to determine the relevant publications to synthesize in a review is a time-consuming and difficult task. Large language models (LLMs) are an emerging technology with promising capabilities for the automation of language-related tasks that may be useful for such a purpose.
Methods
LLMs were used as part of an automated system to evaluate the relevance of publications to a certain topic based on defined criteria and on the title and abstract of each publication. A Python script was created to generate structured prompts consisting of text strings for the instruction, title, and abstract, which were provided to the LLM. Each publication was evaluated by the LLM on a Likert scale (low to high relevance). By specifying a threshold, different classifiers for inclusion/exclusion could then be defined. The approach was used with four openly available LLMs on ten data sets from published biomedical reviews and on a newly human-created data set for a hypothetical new systematic review.
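To make the Methods concrete, the sketch below mimics the structured-prompt and threshold-classifier idea; the prompt wording, scale bounds, and default threshold are assumptions for illustration, not the paper's exact templates.

```python
# Illustrative sketch of the Likert-scale relevance classifier; the
# instruction text and threshold are assumptions, not the paper's.

INSTRUCTION = (
    "Rate the relevance of this publication to the review topic on a "
    "Likert scale from 1 (low relevance) to 5 (high relevance). "
    "Answer with a single digit."
)

def build_prompt(criteria: str, title: str, abstract: str) -> str:
    """Assemble the structured prompt string provided to the LLM."""
    return f"{INSTRUCTION}\n\nCriteria: {criteria}\nTitle: {title}\nAbstract: {abstract}"

def include(rating: int, threshold: int = 4) -> bool:
    """Turn a Likert rating into an inclusion decision; varying
    `threshold` yields the different classifiers described above."""
    return rating >= threshold
```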
Results
The performance varied depending on the data set being analyzed. Regarding sensitivity/specificity, the classifiers yielded 94.48%/31.78% for the FlanT5 model, 97.58%/19.12% for the OpenHermes-NeuralChat model, 81.93%/75.19% for the Mixtral model, and 97.58%/38.34% for the Platypus 2 model on the ten published data sets. The same classifiers yielded 100% sensitivity at specificities of 12.58%, 4.54%, 62.47%, and 24.74% on the human-created data set. Changing the standard settings (minor adaptation of the instruction prompt and/or changing the Likert scale range from 1–5 to 1–10) had a considerable impact on performance.
Conclusions
LLMs can be used to evaluate the relevance of scientific publications, and classifiers based on such an approach show some promising results. To date, little is known about how well such systems would perform if used prospectively when conducting systematic reviews and what further implications this might have. However, it is likely that in the future researchers will increasingly use LLMs for evaluating and classifying publications.
Frontiers in Medicine,
Journal Year:
2025,
Volume and Issue:
11
Published: Jan. 10, 2025
Generative artificial intelligence (GenAI) is rapidly transforming various sectors, including healthcare and education. This paper explores the potential opportunities and risks of GenAI in graduate medical education (GME). We review the existing literature and provide commentary on how GenAI could impact GME, including five key areas of opportunity: electronic health record (EHR) workload reduction, clinical simulation, individualized education, research and analytics support, and clinical decision support. We then discuss significant risks, including inaccuracy of and overreliance on AI-generated content, challenges to authenticity and academic integrity, biases in AI outputs, and privacy concerns. As the technology matures, it will likely come to have an important role in the future of GME, but its integration should be guided by a thorough understanding of both its benefits and its limitations.
Journal of Medical Internet Research,
Journal Year:
2024,
Volume and Issue:
26, P. e52758 - e52758
Published: Aug. 16, 2024
Background
The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers.
Objective
We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract screening process for systematic reviews. Our goal was to develop a method that maximizes sensitivity in identifying relevant records.
Methods
We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted using GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across three layers: (1) research design, (2) target patients, and (3) interventions and controls. Screening was conducted using prompts tailored to each study. During this process, information extraction according to each study’s inclusion criteria and prompt optimization were carried out using a GPT-4–based flow without manual adjustments. Records were evaluated at each layer, and those meeting the criteria at all layers were subsequently judged as included.
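A minimal sketch of this layered flow follows, assuming a generic `llm_judge` callable that returns one include/exclude vote per layer; the layer questions are illustrative stand-ins for the authors' tailored prompts.

```python
# Hedged sketch of 3-layer screening; the questions and llm_judge are
# illustrative assumptions, not the authors' prompts or API wrapper.

from typing import Callable

LAYERS = [
    "Does the research design meet the inclusion criteria?",   # layer 1
    "Does the study target the intended patient population?",  # layer 2
    "Do the interventions and controls match the criteria?",   # layer 3
]

def screen_record(record: str, llm_judge: Callable[[str, str], bool]) -> bool:
    """Include a record only if every layer votes to include it."""
    return all(llm_judge(question, record) for question in LAYERS)
```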
Results
With both GPT-3.5 and GPT-4, screening proceeded at about 110 records per minute, and the total time required for the first and second studies was approximately 1 hour and … hours, respectively. In the first study, the sensitivities/specificities of GPT-3.5 and GPT-4 were 0.900/0.709 and 0.806/0.996, respectively, and both models judged all 6 records used in the meta-analysis as included. In the second study, the sensitivities/specificities were 0.958/0.116 and 0.875/0.855, the sensitivities align with those of human evaluators (0.867-1.000 for the first study and 0.776-0.979 for the second), and all 9 records used in the meta-analysis were judged as included. After accounting for records justifiably excluded by GPT-4, its sensitivities/specificities were 0.962/0.996 and 0.943/0.855 in the first and second studies, respectively. Further investigation indicated that cases incorrectly excluded by GPT-3.5 reflected a lack of domain knowledge, while those excluded by GPT-4 stemmed from misinterpretations of the inclusion criteria.
Conclusions
Our method demonstrated an acceptable level of specificity that supports its practical application in systematic review screenings. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.
Journal of Medical Internet Research,
Journal Year:
2024,
Volume and Issue:
26, P. e56780 - e56780
Published: May 31, 2024
Large language models (LLMs) such as ChatGPT have become widely applied in the field of medical research. In the process of conducting systematic reviews, similar tools can be used to expedite various steps, including defining clinical questions, performing the literature search, document screening, information extraction, and language refinement, thereby conserving resources and enhancing efficiency. However, when using LLMs, attention should be paid to transparent reporting, distinguishing between genuine and false content, and avoiding academic misconduct. In this viewpoint, we highlight the potential roles of LLMs in the creation of systematic reviews and meta-analyses, elucidating their advantages, limitations, and future research directions, aiming to provide insights and guidance for authors planning systematic reviews and meta-analyses.
Annals of Internal Medicine,
Journal Year:
2025,
Volume and Issue:
unknown
Published: Feb. 24, 2025
Background
Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.
Objective
To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews.
Design
Diagnostic test accuracy study. 48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used GPT4-0125-preview (OpenAI).
Intervention
None.
Measurements
Large language models were prompted to include or exclude citations based on SR eligibility criteria. Model outputs were compared with original author decisions after full-text screening to evaluate performance (accuracy, sensitivity, specificity).
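As a hedged illustration of this setup, the sketch below shows what a generic include/exclude screening prompt might look like; the template wording and parsing rule are assumptions, not the study's published templates.

```python
# Illustrative generic screening prompt; the wording is an assumption,
# not the published template.

TEMPLATE = """You are screening citations for a systematic review.
Eligibility criteria:
{criteria}

Title: {title}
Abstract: {abstract}

Decide whether this citation meets the criteria.
Answer with exactly one word: INCLUDE or EXCLUDE."""

def render_prompt(criteria: str, title: str, abstract: str) -> str:
    """Fill the generic template for one citation."""
    return TEMPLATE.format(criteria=criteria, title=title, abstract=abstract)

def parse_decision(reply: str) -> bool:
    """Map the model's one-word reply to an include decision."""
    return reply.strip().upper().startswith("INCLUDE")
```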
Results
Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening, and a weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to …) in full-text screening. In contrast, zero-shot prompts had poor sensitivity (49.0% abstract, 49.1% full-text). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) underperformed. Direct costs for screening 10 000 citations differed substantially: where a single human reviewer was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD.
Limitations
Further prompt optimizations may exist. This was a retrospective study using a convenience sample of SRs, with full-text evaluations limited to free PubMed Central articles.
Conclusion
A generic prompt achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. Our prompting innovations may have value to investigators and researchers conducting criteria-based tasks across the medical sciences.
medRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2024,
Volume and Issue:
unknown
Published: June 3, 2024
Abstract
Systematic reviews (SRs) are the highest standard of evidence, shaping clinical practice guidelines, policy decisions, and research priorities. However, their labor-intensive nature, including an initial rigorous article screen by at least two investigators, delays access to reliable information synthesis. Here, we demonstrate that large language models (LLMs) with intentional prompting can match human screening performance. We introduce Framework Chain-of-Thought, a novel approach that directs LLMs to systematically reason against predefined frameworks.
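The sketch below illustrates what "reasoning against a predefined framework" could look like in a prompt, using PICO as an example framework; the elements and wording are assumptions, not the authors' published Framework Chain-of-Thought prompt.

```python
# Illustrative framework-guided reasoning prompt; PICO and the wording
# are assumptions, not the authors' Framework Chain-of-Thought prompt.

FRAMEWORK = ["Population", "Intervention", "Comparator", "Outcome"]

def framework_cot_prompt(criteria: dict[str, str], article_text: str) -> str:
    """Ask the model to reason element by element before a verdict."""
    steps = "\n".join(
        f"- {element}: does the article satisfy '{criteria[element]}'? Explain briefly."
        for element in FRAMEWORK
    )
    return (
        "Screen the article below against each framework element in turn:\n"
        f"{steps}\n"
        "Then conclude with INCLUDE or EXCLUDE.\n\n"
        f"Article:\n{article_text}"
    )
```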
We evaluated our prompts across ten SRs covering four common types of SR questions (i.e., prevalence, intervention benefits, diagnostic test accuracy, and prognosis), achieving a mean accuracy of 93.6% (range: 83.3-99.6%) and sensitivity of 97.5% (89.7-100%) in full-text screening. Compared with experienced reviewers (mean accuracy 92.4% [76.8-97.8%], sensitivity 75.1% [44.1-100%]), our prompt demonstrated significantly higher sensitivity in four reviews (p<0.05), lower in one review, and comparable performance in five (p>0.05). While traditional full-text screening of 7000 articles required 530 hours and $10,000 USD, our approach completed screening in one day for $430 USD. Our results establish that LLMs can perform full-text screening with performance matching human experts, setting the foundation for end-to-end automated SRs.