BioMedInformatics,
Journal year: 2025, Issue 5(1), pp. 15-15
Published: March 11, 2025
Large language models (LLMs) have emerged as powerful tools for (semi-)automating the initial screening of abstracts in systematic reviews, offering the potential to significantly reduce the manual burden on research teams. This paper provides a broad overview of prompt engineering principles and highlights how traditional PICO (Population, Intervention, Comparison, Outcome) criteria can be converted into actionable instructions for LLMs. We analyze the trade-offs between “soft” prompts, which maximize recall by accepting articles unless they explicitly fail an inclusion requirement, and “strict” prompts, which demand explicit evidence for every criterion. Using a periodontics case study, we illustrate how prompt design affects recall, precision, and overall efficiency, and discuss metrics (accuracy, F1 score) to evaluate performance. We also examine common pitfalls, such as overly lengthy prompts or ambiguous instructions, and underscore the continuing need for expert oversight to mitigate the hallucinations and biases inherent in LLM outputs. Finally, we explore emerging trends, including multi-stage pipelines and fine-tuning, while noting ethical considerations related to data privacy and transparency. By applying rigorous evaluation, researchers can optimize LLM-based screening processes, allowing faster and more comprehensive synthesis across biomedical disciplines.
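The soft-versus-strict trade-off described above can be sketched as a pair of prompt templates. This is a minimal illustration assuming a hypothetical PICO specification and wording; it is not the paper's actual prompt.

```python
# Sketch: converting PICO criteria into "soft" vs "strict" screening prompts.
# The PICO fields and all wording below are hypothetical examples.

PICO = {
    "Population": "adults with chronic periodontitis",
    "Intervention": "laser-assisted scaling and root planing",
    "Comparison": "conventional scaling and root planing",
    "Outcome": "change in probing pocket depth",
}

def build_prompt(pico: dict, mode: str) -> str:
    """Render PICO criteria as screening instructions for an LLM."""
    criteria = "\n".join(f"- {k}: {v}" for k, v in pico.items())
    if mode == "soft":
        # Soft rule favors recall: accept unless a criterion explicitly fails.
        rule = ("Include the abstract unless it explicitly fails one of the "
                "criteria below. When in doubt, answer INCLUDE.")
    elif mode == "strict":
        # Strict rule favors precision: demand evidence for every criterion.
        rule = ("Include the abstract only if it provides explicit evidence "
                "for every criterion below. When in doubt, answer EXCLUDE.")
    else:
        raise ValueError(f"unknown mode: {mode}")
    return f"{rule}\nCriteria:\n{criteria}\nAnswer INCLUDE or EXCLUDE."

soft_prompt = build_prompt(PICO, "soft")
strict_prompt = build_prompt(PICO, "strict")
```

The only difference between the two templates is the decision rule prepended to the shared criteria list, which makes the recall/precision trade-off an explicit, auditable design choice.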
Journal of Medical Internet Research,
Journal year: 2023, Issue 26, pp. e48996-e48996
Published: Sep. 28, 2023
The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of subsequent health care decisions. Traditional methods rely heavily on human reviewers, requiring a significant investment of time and resources.
Systematic Reviews,
Journal year: 2024, Issue 13(1)
Published: June 15, 2024
Abstract
Background
Systematically screening published literature to determine the relevant publications to synthesize in a review is a time-consuming and difficult task. Large language models (LLMs) are an emerging technology with promising capabilities for the automation of language-related tasks that may be useful for such a purpose.
Methods
LLMs were used as part of an automated system to evaluate the relevance of publications to a certain topic based on defined criteria and on the title and abstract of each publication. A Python script was created to generate structured prompts consisting of text strings for the instruction, title, and abstract, which were provided to the LLM. The relevance of each publication was evaluated by the LLM on a Likert scale (low to high relevance). By specifying a threshold, different classifiers for inclusion/exclusion could then be defined. The approach was tested with four openly available LLMs on ten data sets from biomedical reviews and on a newly human-created data set for a hypothetical new systematic review.
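The prompt-construction and thresholding steps described in the Methods can be sketched in Python. The helper names and the keyword-based `fake_llm_score` are purely illustrative stand-ins for the real model call.

```python
# Sketch of the described pipeline: one structured prompt per publication,
# a Likert-scale relevance score from the LLM, and a threshold classifier.

def make_prompt(instruction: str, title: str, abstract: str) -> str:
    """Concatenate instruction, title, and abstract into one structured prompt."""
    return (f"{instruction}\n"
            f"Title: {title}\n"
            f"Abstract: {abstract}\n"
            f"Rate the relevance on a scale from 1 (low) to 5 (high).")

def fake_llm_score(prompt: str) -> int:
    # Placeholder: a real system would send `prompt` to an LLM and parse
    # the returned Likert rating. Here we score by simple keyword presence.
    return 5 if "periodontitis" in prompt.lower() else 1

def classify(score: int, threshold: int) -> str:
    """Turn a Likert score into an inclusion decision at a given threshold."""
    return "include" if score >= threshold else "exclude"

prompt = make_prompt(
    "Assess whether this publication is relevant to the review topic.",
    "Laser therapy in periodontitis",
    "A randomized trial of adjunctive laser therapy ...",
)
decision = classify(fake_llm_score(prompt), threshold=4)
```

Because the score is ordinal, sweeping the threshold yields a family of classifiers with different sensitivity/specificity trade-offs, which is exactly what the Results section compares.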
Results
The performance varied depending on the LLM and the data set being analyzed. Regarding sensitivity/specificity, the classifiers yielded 94.48%/31.78% for the FlanT5 model, 97.58%/19.12% for the OpenHermes-NeuralChat model, 81.93%/75.19% for the Mixtral model, and 97.58%/38.34% for the Platypus 2 model on the ten published data sets. The same classifiers yielded 100% sensitivity at a specificity of 12.58%, 4.54%, 62.47%, and 24.74%, respectively, on the human-created data set. Changing the standard settings of the approach (minor adaptation of the instruction prompt and/or changing the Likert scale range from 1–5 to 1–10) had a considerable impact on the performance.
Conclusions
LLMs can be used to evaluate the relevance of scientific publications, and they show some promising results. To date, little is known about how well such systems would perform if used prospectively when conducting a systematic review and what further implications this might have. However, it is likely that in the future researchers will increasingly use LLMs for evaluating and classifying scientific publications.
Frontiers in Medicine,
Journal year: 2025, Issue 11
Published: Jan. 10, 2025
Generative artificial intelligence (GenAI) is rapidly transforming various sectors, including healthcare and education. This paper explores the potential opportunities and risks of GenAI in graduate medical education (GME). We review the existing literature and provide commentary on how GenAI could impact GME, focusing on five key areas of opportunity: electronic health record (EHR) workload reduction, clinical simulation, individualized education, research and analytics support, and decision support. We then discuss significant risks, including inaccuracy of and overreliance on AI-generated content, challenges to authenticity and academic integrity, biases in AI outputs, and privacy concerns. As the technology matures, it will likely come to have an important role in the future of GME, but its integration should be guided by a thorough understanding of both its benefits and limitations.
Journal of Medical Internet Research,
Journal year: 2024, Issue 26, pp. e52758-e52758
Published: Aug. 16, 2024
Background
The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers.
Objective
We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract screening process for systematic reviews. Our goal was to develop a method that maximizes sensitivity for identifying relevant records.
Methods
We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted with GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across three layers: (1) research design, (2) target patients, and (3) interventions and controls. Screening was conducted using prompts tailored to each study. During this process, information extraction according to each study’s inclusion criteria and optimization of the prompts were carried out using a GPT-4-based flow without manual adjustments. Records were evaluated at each layer, and those meeting the criteria at all layers were subsequently judged as included.
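A minimal sketch of the layered flow described above: a record is included only if it passes the design, patient, and intervention/control checks in sequence. The predicate logic here is a hypothetical stand-in for the LLM judgments the Methods describe.

```python
# Hypothetical 3-layer screening flow: each layer is a yes/no gate, and a
# record is included only if every layer accepts it.

def layer_design(record: dict) -> bool:
    # Layer 1: research design check (stand-in rule).
    return record.get("design") == "RCT"

def layer_patients(record: dict) -> bool:
    # Layer 2: target patient check (stand-in rule).
    return "bipolar disorder" in record.get("population", "")

def layer_intervention(record: dict) -> bool:
    # Layer 3: interventions and controls check (stand-in rule).
    return record.get("has_control_arm", False)

LAYERS = [layer_design, layer_patients, layer_intervention]

def screen(record: dict) -> bool:
    """Include a record only if every layer accepts it."""
    return all(layer(record) for layer in LAYERS)

record = {"design": "RCT",
          "population": "adults with bipolar disorder",
          "has_control_arm": True}
```

Structuring the decision as a conjunction of narrow gates keeps each prompt short and focused, which is one plausible reason a layered design can hold sensitivity high while still filtering records.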
Results
In both screenings, GPT-3.5 and GPT-4 were able to process about 110 records per minute, and the total time required for the first and second screenings was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of the GPT-3.5 and GPT-4 screenings were 0.900/0.709 and 0.806/0.996, respectively. Both screenings by GPT-3.5 and GPT-4 judged all 6 records used for the meta-analysis as included. In the second study, the sensitivities/specificities were 0.958/0.116 and 0.875/0.855, respectively. These sensitivities align with those of human evaluators: 0.867-1.000 in the first study and 0.776-0.979 in the second, and all 9 records used for the meta-analysis were retained. After accounting for records justifiably excluded by GPT-4, the sensitivities/specificities of the GPT-4 screening were 0.962/0.996 in the first study and 0.943/0.855 in the second. Further investigation indicated that the cases incorrectly excluded by GPT-3.5 were due to a lack of domain knowledge, while those excluded by GPT-4 were due to misinterpretations of the inclusion criteria.
Conclusions
Our 3-layer screening method demonstrated acceptable sensitivity and a level of specificity that supports its practical application in systematic review screenings. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility.
Journal of Medical Internet Research,
Journal year: 2024, Issue 26, pp. e56780-e56780
Published: May 31, 2024
Large language models (LLMs) such as ChatGPT have become widely applied in the field of medical research. In the process of conducting systematic reviews, similar tools can be used to expedite various steps, including defining clinical questions, performing the literature search, document screening, information extraction, and refinement, thereby conserving resources and enhancing efficiency. However, when using LLMs, attention should be paid to transparent reporting, distinguishing between genuine and false content, and avoiding academic misconduct. In this viewpoint, we highlight the potential roles of LLMs in the creation of systematic reviews and meta-analyses, elucidating their advantages, limitations, and future research directions, aiming to provide insights and guidance for authors planning systematic reviews and meta-analyses.
Annals of Internal Medicine,
Journal year: 2025, Issue unknown
Published: Feb. 24, 2025
Background
Systematic reviews (SRs) are hindered by the initial rigorous article screen, which delays access to reliable information synthesis.
Objective
To develop generic prompt templates for large language model (LLM)-driven abstract and full-text screening that can be adapted to different reviews.
Design
Diagnostic test accuracy study. 48 425 citations were tested for abstract screening across 10 SRs. Full-text screening evaluated all 12 690 freely available articles from the original search. Prompt development used GPT4-0125-preview (OpenAI).
Intervention
None.
Measurements
Large language models were prompted to include or exclude articles based on SR eligibility criteria. Model outputs were compared with author decisions after full-text review to evaluate performance (accuracy, sensitivity, specificity).
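Comparing model include/exclude outputs against author decisions reduces to a confusion matrix. A sketch of that metric computation (function names are illustrative, not from the paper):

```python
# Compute accuracy, sensitivity, and specificity from paired include/exclude
# decisions, where True means "include".

def confusion(model: list, truth: list):
    """Count true/false positives and negatives for paired boolean decisions."""
    tp = sum(m and t for m, t in zip(model, truth))
    tn = sum(not m and not t for m, t in zip(model, truth))
    fp = sum(m and not t for m, t in zip(model, truth))
    fn = sum(not m and t for m, t in zip(model, truth))
    return tp, tn, fp, fn

def metrics(model: list, truth: list) -> dict:
    tp, tn, fp, fn = confusion(model, truth)
    return {
        "accuracy": (tp + tn) / len(truth),
        "sensitivity": tp / (tp + fn),   # recall on truly included records
        "specificity": tn / (tn + fp),   # recall on truly excluded records
    }

m = metrics([True, True, False, False], [True, False, True, False])
```

For screening, sensitivity is the critical metric: a false negative silently removes a relevant study from the evidence base, whereas a false positive only costs extra review time.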
Results
Optimized prompts using GPT4-0125-preview achieved a weighted sensitivity of 97.7% (range, 86.7% to 100%) and specificity of 85.2% (range, 68.3% to 95.9%) in abstract screening, and a weighted sensitivity of 96.5% (range, 89.7% to 100.0%) and specificity of 91.2% (range, 80.7% to …) in full-text screening. In contrast, zero-shot prompts had poor sensitivity (49.0% for abstract screening, 49.1% for full-text screening). Across LLMs, Claude-3.5 (Anthropic) and GPT4 variants had similar performance, whereas Gemini Pro (Google) and GPT3.5 (OpenAI) underperformed. Direct costs of screening 10 000 citations differed substantially: where a single human reviewer was estimated to require more than 83 hours and $1666.67 USD, our LLM-based approach completed screening in under 1 day for $157.02 USD. Further optimizations may exist.
Limitations
Retrospective study. Convenience sample of SRs. Full-text evaluations were limited to freely available PubMed Central articles.
Conclusion
A generic prompt template for abstract and full-text screening achieving high sensitivity and specificity that can be adapted to other SRs and LLMs was developed. Our prompting innovations may have value to investigators and researchers conducting criteria-based tasks across the medical sciences.
medRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue unknown
Published: June 3, 2024
Abstract
Systematic reviews (SRs) are the highest standard of evidence, shaping clinical practice guidelines, policy decisions, and research priorities. However, their labor-intensive nature, including an initial rigorous article screen by at least two investigators, delays access to reliable information synthesis. Here, we demonstrate that large language models (LLMs) with intentional prompting can match human screening performance. We introduce Framework Chain-of-Thought, a novel approach that directs LLMs to systematically reason against predefined frameworks. We evaluated our prompts across ten SRs covering four common types of SR questions (i.e., prevalence, intervention benefits, diagnostic test accuracy, and prognosis), achieving a mean accuracy of 93.6% (range: 83.3-99.6%) and sensitivity of 97.5% (89.7-100%) in full-text screening. Compared with experienced reviewers (mean accuracy 92.4% [76.8-97.8%], sensitivity 75.1% [44.1-100%]), our prompt demonstrated significantly higher sensitivity in four reviews (p<0.05), lower sensitivity in one review, and comparable sensitivity in five reviews (p>0.05). While traditional screening for 7000 articles required 530 hours and $10,000 USD, our approach completed screening in one day for $430 USD. Our results establish that LLMs can perform SR screening with performance matching human experts, setting the foundation for end-to-end automated SRs.
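The "reason against a predefined framework" idea can be sketched as a prompt builder that forces the model through each framework element before a verdict. The framework items below are hypothetical examples for a diagnostic-test-accuracy question, not the paper's templates.

```python
# Hypothetical framework-guided chain-of-thought prompt: the model must state
# reasoning for each framework element, then give a final include/exclude call.

FRAMEWORK = [
    "Population matches the review question",
    "Index test is evaluated against a reference standard",
    "Outcome data allow a 2x2 accuracy table",
]

def framework_cot_prompt(title: str, abstract: str) -> str:
    """Build a prompt that walks the model through each framework criterion."""
    steps = "\n".join(f"{i}. {c} - state your reasoning, then YES/NO."
                      for i, c in enumerate(FRAMEWORK, 1))
    return (f"Title: {title}\nAbstract: {abstract}\n"
            f"Reason step by step against each criterion:\n{steps}\n"
            f"Finally answer INCLUDE only if all answers are YES.")

p = framework_cot_prompt("Ultrasound for appendicitis", "We assessed ...")
```

Requiring an explicit per-criterion verdict makes the model's decision auditable: a reviewer can check which framework element triggered an exclusion rather than trusting a bare yes/no.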