medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown. Published: April 4, 2024
Abstract
Importance
Large language models (LLMs) possess a range of capabilities which may be applied to the clinical domain, including text summarization. As ambient artificial intelligence scribes and other LLM-based tools begin to be deployed within healthcare settings, rigorous evaluations of the accuracy of these technologies are urgently needed.
Objective
To investigate the performance of GPT-4 and GPT-3.5-turbo in generating Emergency Department (ED) discharge summaries and to evaluate the prevalence and type of errors across each section of the summary.
Design
Cross-sectional study.
Setting
University of California, San Francisco ED.
Participants
We identified all adult ED visits from 2012 to 2023 with an ED clinician note. We randomly selected a sample of 100 encounters for GPT-summarization.
Exposure
The potential of two state-of-the-art LLMs, GPT-4 and GPT-3.5-turbo, to summarize the full ED clinician note into a discharge summary.
Main Outcomes and Measures
GPT-4-generated summaries were evaluated by independent Emergency Medicine physician reviewers across three evaluation criteria: 1) inaccuracy of GPT-summarized information; 2) hallucination of information; 3) omission of relevant information. On identifying an error, reviewers were additionally asked to provide a brief explanation of their reasoning, which was manually classified into subgroups of errors.
Results
From 202,059 eligible visits, we sampled 100 for GPT-generated summarization and then expert-driven evaluation. In total, 33% of summaries generated by GPT-4 and 10% of those generated by GPT-3.5-turbo were entirely error-free across all evaluated domains. Summaries were mostly accurate, with inaccuracies found in only a minority of cases; however, 42% exhibited hallucinations and 47% omitted clinically relevant information. Inaccuracies were most commonly found in the Plan sections of the summaries, while omissions were concentrated in sections describing patients' Physical Examination findings or History of Presenting Complaint.
Conclusions and Relevance
In this cross-sectional study of ED encounters, we found that LLMs could generate accurate summaries but were liable to hallucination and omission of clinically relevant information. A comprehensive understanding of the location and type of errors is important to facilitate clinician review of such content and prevent patient harm.
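The three-criterion review scheme above lends itself to a simple tally over reviewer annotations. The sketch below is a hypothetical illustration (the record schema, section names, and sample data are assumptions, not the study's actual dataset): each summary carries the (section, error type) pairs a reviewer flagged, and prevalence is the fraction of summaries with at least one error of a given type.

```python
from collections import Counter

# Hypothetical reviewer annotations: one record per GPT-generated summary,
# listing (section, error_type) pairs flagged by a physician reviewer.
# The three error types mirror the study's criteria.
reviews = [
    {"summary_id": 1, "errors": [("Plan", "hallucination")]},
    {"summary_id": 2, "errors": []},
    {"summary_id": 3, "errors": [("Physical Examination", "omission"),
                                 ("Plan", "inaccuracy")]},
]

def prevalence(reviews, error_type):
    """Fraction of summaries containing at least one error of the given type."""
    hit = sum(any(e == error_type for _, e in r["errors"]) for r in reviews)
    return hit / len(reviews)

def errors_by_section(reviews):
    """Count flagged errors per summary section."""
    return Counter(section for r in reviews for section, _ in r["errors"])

print(round(prevalence(reviews, "hallucination"), 2))
print(errors_by_section(reviews))
```

On the toy data, one of three summaries hallucinates and the Plan section accumulates the most errors, matching the kind of per-section breakdown the study reports.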
Journal of the American Medical Informatics Association, Journal Year: 2024, Volume and Issue: 31(10), P. 2315 - 2327. Published: June 20, 2024
Although supervised machine learning is popular for information extraction from clinical notes, creating large annotated datasets requires extensive domain expertise and is time-consuming. Meanwhile, large language models (LLMs) have demonstrated promising transfer capability. In this study, we explored whether recent LLMs could reduce the need for large-scale data annotations.
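The idea of substituting prompted LLM inference for a hand-annotated training set can be sketched as follows. This is an illustrative outline only: `call_llm` is a hypothetical stand-in for whatever chat-completion client is used (stubbed here so the example runs offline), and the entity schema is invented for the example.

```python
import json

# Stand-in for a real LLM client; stubbed with a canned JSON response
# so the sketch is self-contained.
def call_llm(prompt: str) -> str:
    return '{"medications": ["metformin"], "dosages": ["500 mg"]}'

def extract_entities(note: str) -> dict:
    """Zero-shot information extraction: no annotated training data needed."""
    prompt = (
        "Extract medications and dosages from the clinical note below. "
        "Answer with JSON: {\"medications\": [...], \"dosages\": [...]}\n\n"
        + note
    )
    return json.loads(call_llm(prompt))

result = extract_entities("Patient continues metformin 500 mg twice daily.")
print(result["medications"])
```

The contrast with supervised extraction is that the schema lives in the prompt rather than in thousands of labeled examples, which is precisely the annotation saving the study investigates.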
Current Problems in Diagnostic Radiology, Journal Year: 2024, Volume and Issue: 53(6), P. 728 - 737. Published: July 9, 2024
The rise of transformer-based large language models (LLMs), such as ChatGPT, has captured global attention amid recent advancements in artificial intelligence (AI). ChatGPT demonstrates growing potential in structured radiology reporting, a field where AI has traditionally focused on image analysis.
A comprehensive search of MEDLINE and Embase was conducted from inception through May 2024, and primary studies discussing ChatGPT's role in structured radiology reporting were selected based on their content. Of the 268 articles screened, eight were ultimately included in this review. These studies explored various applications of ChatGPT, including generating structured reports from unstructured reports, extracting data from free text, generating impressions from findings, and creating structured imaging data.
All studies demonstrated optimism regarding ChatGPT's potential to aid radiologists, though common critiques included privacy concerns, reliability, medical errors, and a lack of medical-specific training. As an assistive tool, ChatGPT has significant potential to transform radiology reporting, enhancing accuracy and standardization while optimizing healthcare resources. Future developments may involve integrating dynamic few-shot prompting and Retrieval Augmented Generation (RAG) into diagnostic workflows. Continued research, development, and ethical oversight are crucial to fully realize AI's potential in radiology.
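Dynamic few-shot prompting, as mentioned above, means retrieving the prior report pairs most similar to the incoming report and placing them in the prompt as worked examples. A minimal sketch, with an assumed toy corpus and a deliberately simple word-overlap (Jaccard) similarity standing in for a proper embedding-based retriever:

```python
def jaccard(a: str, b: str) -> float:
    """Crude lexical similarity; a real system would use embeddings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def build_prompt(query: str, corpus: list[tuple[str, str]], k: int = 2) -> str:
    """Pick the k most similar (free-text, structured) pairs as few-shot examples."""
    ranked = sorted(corpus, key=lambda pair: jaccard(query, pair[0]), reverse=True)
    examples = "\n\n".join(f"Report: {free}\nStructured: {structured}"
                           for free, structured in ranked[:k])
    return f"{examples}\n\nReport: {query}\nStructured:"

# Illustrative prior reports already paired with structured output.
corpus = [
    ("Chest CT shows a 4 mm nodule in the right upper lobe.",
     "Finding: pulmonary nodule; Size: 4 mm; Location: RUL"),
    ("MRI brain demonstrates no acute infarct.",
     "Finding: no acute infarct; Modality: MRI brain"),
]
prompt = build_prompt("Chest CT shows a 6 mm nodule in the left lower lobe.", corpus, k=1)
print(prompt)
```

Because the chest CT query overlaps heavily with the first corpus entry, only that example is retrieved; the resulting prompt is what would be sent to the LLM for completion.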
Medicine Advances, Journal Year: 2024, Volume and Issue: 2(3), P. 318 - 322. Published: Aug. 23, 2024
This article performs a literature search to determine the current progress, identify research gaps, and highlight future opportunities of human-in-the-loop (HITL) across the machine learning lifecycle, including data preparation, feature engineering, model development, and model deployment.
Machine learning (ML), particularly deep learning, has emerged as a fundamental analytical tool for various medical tasks in electronic health records (EHRs) [1]. However, purely data-driven methods do not serve as a panacea for all encountered problems, such as annotation. To address these issues, human-in-the-loop (HITL) has increasingly gained prominence. It leverages human expertise to improve ML-based analyses [2].
In this commentary, we perform a literature search to identify current progress (Figure 1), highlight research gaps, and discuss future opportunities of HITL across the ML pipeline for electronic health records.
The first phase in which HITL enhances ML is data preparation. It includes the extraction, preprocessing, and annotation of large-scale raw EHRs into formats operable for downstream modeling [3].
Across the three preparation steps, annotation is the focal point of the latest research, because the traditional paradigm indiscriminately annotates all available samples by default, which places an unnecessary burden on experts in time-urgent settings.
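One common remedy for this indiscriminate-annotation burden is to route only uncertain samples to experts. A minimal sketch under assumed inputs (the probabilities and thresholds below are illustrative, not from any cited study):

```python
# Uncertainty-based sample selection: samples with confident model
# probabilities are auto-labeled; ambiguous ones go to a human annotator.
def needs_expert(prob_positive: float, low: float = 0.3, high: float = 0.7) -> bool:
    """Route a sample to a human annotator only when the model is uncertain."""
    return low <= prob_positive <= high

# Hypothetical model probabilities for four clinical notes.
probs = {"note_a": 0.95, "note_b": 0.52, "note_c": 0.10, "note_d": 0.65}
queue = sorted(name for name, p in probs.items() if needs_expert(p))
print(queue)  # only the ambiguous notes reach the expert
```

Only two of the four notes land in the expert queue, which is the annotation saving the HITL literature above is pursuing.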
Bull et al. [4] designed an interactive platform that enables clinicians to verify or correct labels predicted by ML. Evaluated on two databases, the developed platform quickly generated accurate models with reduced annotation needs.
Similar strategies have been implemented for detecting unauthorized access during data extraction [5] and for deidentifying free text [6] during preprocessing.
Given the powerful ability of foundation models in zero-shot inference [7], future studies may use them, such as GPT-4, to replace homemade computer-aided annotations [8].
Moreover, existing studies predominantly focus on annotation; hence, there remains a vast unexplored blue ocean in data integration, noise filtering, and missing value imputation [9].
Building on well-prepared datasets, subsequent applications of HITL-ML span feature engineering and model development. Feature engineering without HITL relies on either fully automated or manual methods, which demand large amounts of computation resources or expert involvement.
The incorporation of HITL has enabled the generation of novel features of comparable quality at speeds up to 20 times faster than the original methods [10].
In classic ML, feature engineering has long been deemed essential, preceding model development in numerous contexts given its demonstrated efficacy in enhancing performance. In recent years, a notable shift has occurred toward end-to-end learning, gradually rendering feature engineering less pivotal. An exemplification of this trend can be observed in artificial neural networks, where shallow layers undertake the task of feature engineering for deeper layers, thereby enabling automatic and seamless optimization during model development.
Within the end-to-end paradigm, HITL improves both architecture design and parameter optimization.
Sheng et al. [11] invited doctors to modify the structure of causal relationships in a knowledge graph distilled from EHRs, demonstrating elevated performance not only in accuracy but also in interpretability. Rather than adjusting models post-training, Tari et al. [12] applied HITL during training by adding expert preferences to the target gold-standard labels.
These studies [11, 12] augmented model development through HITL; however, the aspect of fairness has not been sufficiently emphasized. Future researchers could introduce post-hoc recalibrations following training [11] or, alternatively, embed a fairness objective in the training process to mitigate potential inequities [13].
Furthermore, present HITL studies on EHRs center on classification and regression. Future endeavors could venture to support privacy-preserving synthetic and pseudo data generation [14].
Once model development is complete, the final phase of the lifecycle is deployment, which encompasses continuous monitoring and updating of trained models. HITL is incorporated to ensure accuracy, interpretability, and compatibility with temporal and spatial shifts.
Doctors are engaged to double-check intervention suggestions from deployed models, such as positive infection cases [15] and medication doses [16].
Instead of seeking clinicians' approval for all decisions, Zheng et al. [17] proposed that models should be able to distinguish difficult cases from simple ones, so that they can be solved by clinicians and models, respectively.
In addition to ensuring accuracy through HITL [15-17], Elshawi et al. [18] and Yuan et al. [19] used clinician-labeled concepts to interpret model behaviors and clarified their advantages over explanations derived solely from models.
Research on HITL in deployment could broaden to consider efficiency and privacy. Resource efficiency should not be neglected when models are executed on mobile devices with limited computational capability [20]. Even with cloud infrastructure, inference time matters in privacy-sensitive applications, where efficient models run on edge devices with low latency benefits.
Finally, most previous deployments were simulated using retrospective data, which prompts researchers to resolve performance deterioration in prospective clinical landscapes [21].
As shown, HITL refines and catalyzes advancements across the ML lifecycle, yielding tools that are accurate and interpretable. Despite this elucidation of existing progress, the full potential of HITL has not yet been harnessed. Figure 2 shows an overview of existing gaps and opportunities.
The synergistic interaction among healthcare professionals, engineers, and high-performance computers is poised to fulfill this potential: enhancement through human-computer collaboration promises efficiency and robustness while fostering impartiality and privacy preservation in medical systems.
Schematic plot of HITL across the ML lifecycle.
Han Yuan: Conceptualization (lead); data curation, formal analysis, investigation, project administration, visualization, writing – original draft, and writing – review & editing (lead). Lican Kang: Data curation (supporting). Yong Li: Investigation (supporting); methodology (supporting). Zhenqian Fan: (supporting). The authors have nothing further to report. All authors declare they have no conflicts of interest. Ethics approval: Not applicable. Data sharing: Not applicable, as no new data was created or analyzed in this study.