medRxiv (Cold Spring Harbor Laboratory),
Journal year: 2024, Issue: unknown
Published: April 4, 2024
Abstract
Importance
Large language models (LLMs) possess a range of capabilities which may be applied to the clinical domain, including text summarization. As ambient artificial intelligence scribes and other LLM-based tools begin to be deployed within healthcare settings, rigorous evaluations of the accuracy of these technologies are urgently needed.
Objective
To investigate the performance of GPT-4 and GPT-3.5-turbo in generating Emergency Department (ED) discharge summaries and to evaluate the prevalence and type of errors across each section of the summary.
Design
Cross-sectional study.
Setting
University of California, San Francisco ED.
Participants
We identified all adult ED visits from 2012 to 2023 with an ED clinician note. We randomly selected a sample of 100 for GPT-summarization.
Exposure
The potential of two state-of-the-art LLMs, GPT-4 and GPT-3.5-turbo, to summarize the full ED clinician note into a discharge summary.
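A minimal sketch of how such note-to-summary prompting could look with the OpenAI chat completions API is shown below; the prompt wording, system message, and parameters are illustrative assumptions rather than the study's actual configuration.

```python
# Illustrative sketch only: the study's actual prompt and parameters are not given
# in the abstract. Assumes the openai Python package and an OPENAI_API_KEY env var.
from openai import OpenAI

client = OpenAI()

def summarize_ed_note(note_text: str, model: str = "gpt-4") -> str:
    """Ask an LLM to condense a full ED clinician note into a discharge summary."""
    response = client.chat.completions.create(
        model=model,  # the study compared GPT-4 and GPT-3.5-turbo
        temperature=0,
        messages=[
            {"role": "system", "content": "You are a clinical documentation assistant."},
            {
                "role": "user",
                "content": (
                    "Summarize the following Emergency Department clinician note "
                    "into a concise discharge summary:\n\n" + note_text
                ),
            },
        ],
    )
    return response.choices[0].message.content
```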
Main Outcomes and Measures
GPT-4-generated summaries were evaluated by independent Emergency Medicine physician reviewers against three evaluation criteria: 1) inaccuracy of GPT-summarized information; 2) hallucination of information; and 3) omission of relevant information. On identifying an error, reviewers were additionally asked to provide a brief explanation of their reasoning, which was manually classified into subgroups of errors.
Results
From 202,059 eligible ED visits, we sampled 100 for GPT-generated summarization and then expert-driven evaluation. In total, 33% of summaries generated by GPT-4 and 10% of those generated by GPT-3.5-turbo were entirely error-free across all evaluated domains. Summaries were mostly accurate, with inaccuracies found in only a small proportion of cases; however, 42% of summaries exhibited hallucinations and 47% omitted clinically relevant information. Inaccuracies were most commonly found in the Plan sections of summaries, while omissions were concentrated in sections describing patients' Physical Examination findings or History of Presenting Complaint.
Conclusions and Relevance
In this cross-sectional study of ED encounters, we found that LLMs could generate accurate discharge summaries but were liable to hallucination and omission of clinically relevant information. A comprehensive understanding of the location and type of these errors is important to facilitate clinician review of such content and prevent patient harm.
Background
The diagnostic abilities of multimodal large language models (LLMs) using direct image inputs, and the impact of the temperature parameter of LLMs, remain unexplored.
Purpose
To investigate the ability of GPT-4V and Gemini Pro Vision to generate differential diagnoses at different temperatures compared with radiologists.
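A minimal sketch of how the temperature parameter might be varied when requesting a differential diagnosis from a vision-capable model is shown below; the model name, prompt, and image handling are assumptions, not details from the study.

```python
# Illustrative sketch only: shows how temperature is set when sending a direct
# image input. Model name, prompt, and image URL are hypothetical.
from openai import OpenAI

client = OpenAI()

def differential_diagnosis(image_url: str, temperature: float) -> str:
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # hypothetical choice of a GPT-4V endpoint
        temperature=temperature,       # e.g. 0.0 (deterministic) vs 1.0 (more varied)
        max_tokens=500,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "List the top three differential diagnoses for this image."},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    )
    return response.choices[0].message.content

# The same request can then be repeated across temperatures, e.g. 0.0, 0.5, 1.0.
```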
Advances in healthcare information systems and administration book series,
Journal year: 2024, Issue: unknown, P. 263 - 287
Published: Feb. 14, 2024
In this chapter, the authors explore the transformation of ultrasound training in the digital era of higher education. As the digital landscape redefines access to information and learning modalities, the chapter critically examines the integration of innovative tools. The focus is on leveraging technologies like extended realities and simulations, alongside the practicality of mobile applications, to enhance the learning experience. The chapter underscores the importance of evolving educational systems to actively engage students with these advanced frameworks. It aims to stimulate a comprehensive discussion on effectively incorporating these technologies at the undergraduate level, evaluating their impact on student outcomes, and preparing future healthcare professionals for a technology-driven medical landscape. This review offers a forward-looking perspective on integrating cutting-edge technologies into ultrasound education, signifying a shift towards more interactive, immersive, and effective learning experiences.
Medicine Plus,
Journal year: 2024, Issue: 1(2), P. 100030 - 100030
Published: May 17, 2024
With the rapid development of artificial intelligence, large language models (LLMs) have shown promising capabilities in mimicking human-level comprehension and reasoning. This has sparked significant interest in applying LLMs to enhance various aspects of healthcare, ranging from medical education to clinical decision support. However, medicine involves multifaceted data modalities and nuanced reasoning skills, presenting challenges for integrating LLMs. This review introduces the fundamental applications of general-purpose and specialized LLMs, demonstrating their utilities in knowledge retrieval, research support, workflow automation, and diagnostic assistance. Recognizing the inherent multimodality of medicine, the review emphasizes multimodal LLMs and discusses their ability to process diverse data types like medical imaging and electronic health records to augment diagnostic accuracy. To address LLMs' limitations regarding personalization and complex reasoning, it further explores emerging LLM-powered autonomous agents for healthcare. Moreover, it summarizes the evaluation methodologies for assessing the reliability and safety of LLMs in medical contexts. LLMs hold transformative potential for medicine; however, there is a pivotal need for continuous optimizations and ethical oversight before these models can be effectively integrated into clinical practice.
JMIR Medical Informatics,
Journal year: 2025, Issue: 13, P. e58457 - e58457
Published: Jan. 2, 2025
Background
In this study, we evaluate the accuracy, efficiency, and cost-effectiveness of large language models in extracting and structuring information from free-text clinical reports, particularly in identifying and classifying patient comorbidities within oncology electronic health records. We specifically compare the performance of gpt-3.5-turbo-1106 and gpt-4-1106-preview against that of specialized human evaluators.
Objective
Methods
We implemented a script using the OpenAI application programming interface to extract, in structured JavaScript Object Notation format, the comorbidities reported in 250 personal history reports. These reports were manually reviewed in batches of 50 by 5 specialists in radiation oncology.
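A minimal sketch of how such a comorbidity-extraction call could be structured with the OpenAI API's JSON mode is shown below; the prompt and JSON schema are illustrative assumptions, not the authors' script.

```python
# Illustrative sketch only: the authors' actual prompt and JSON schema are not
# reproduced in the abstract. Field names below are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

def extract_comorbidities(report_text: str, model: str = "gpt-4-1106-preview") -> dict:
    """Return comorbidities mentioned in a free-text personal history report as JSON."""
    response = client.chat.completions.create(
        model=model,  # the study compared gpt-3.5-turbo-1106 and gpt-4-1106-preview
        temperature=0,
        response_format={"type": "json_object"},  # JSON mode available on the -1106 models
        messages=[
            {"role": "system", "content": "You extract comorbidities from clinical free text."},
            {
                "role": "user",
                "content": (
                    'Return a JSON object {"comorbidities": [{"name": "...", "present": true}]} '
                    "listing every comorbidity in the following report:\n\n" + report_text
                ),
            },
        ],
    )
    return json.loads(response.choices[0].message.content)
```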
We compared the results using metrics such as sensitivity, specificity, precision, F-value, the κ index, and the McNemar test, in addition to examining the common causes of errors in both the humans and the generative pretrained transformer (GPT) models.
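For reference, the named agreement metrics could be computed from per-comorbidity binary labels roughly as follows; scikit-learn and statsmodels are assumed, and the scoring scheme is an assumption rather than the authors' code.

```python
# Illustrative sketch only: computes the metrics named in the Methods from
# per-comorbidity binary labels (1 = present, 0 = absent).
from sklearn.metrics import confusion_matrix, precision_score, f1_score, cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

def agreement_metrics(reference, predicted):
    """Sensitivity, specificity, precision, F-value, and kappa against a reference standard."""
    tn, fp, fn, tp = confusion_matrix(reference, predicted).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": precision_score(reference, predicted),
        "f_value": f1_score(reference, predicted),
        "kappa": cohen_kappa_score(reference, predicted),
    }

def mcnemar_p(gold, rater_a, rater_b):
    """McNemar test on whether two raters (e.g. a GPT model vs physicians) err on the same items."""
    a_ok = [int(g == a) for g, a in zip(gold, rater_a)]
    b_ok = [int(g == b) for g, b in zip(gold, rater_b)]
    table = [[0, 0], [0, 0]]
    for x, y in zip(a_ok, b_ok):
        table[1 - x][1 - y] += 1  # rows: rater A correct/incorrect; cols: rater B
    return mcnemar(table, exact=False).pvalue
```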
Results
The GPT-3.5 model exhibited slightly lower performance than the physicians across all metrics, though the differences were not statistically significant (McNemar test P=.79). GPT-4 demonstrated clear superiority in several key metrics (P<.001). Notably, it achieved a sensitivity of 96.8%, compared with 88.2% for GPT-3.5 and 88.8% for the physicians. However, the physicians marginally outperformed GPT-4 in precision (97.7% vs 96.8%). GPT-4 also showed greater consistency, replicating the exact same results in 76% of 10 repeated analyses, versus 59% for GPT-3.5, indicating more stable and reliable performance. Physicians were more likely to miss explicitly stated comorbidities, while the GPT models frequently inferred nonexplicit comorbidities, sometimes correctly, but this also resulted in false positives.
Conclusions
This study demonstrates that, with well-designed prompts, the examined LLMs can match or even surpass medical specialists in extracting and structuring complex clinical information. Their superior efficiency in time and costs, along with easy integration with databases, makes them a valuable tool for large-scale data mining and real-world evidence generation.