Potential of GPT-4 for Detecting Errors in Radiology Reports: Implications for Reporting Accuracy
Radiology
Year: 2024, Issue: 311(1)
Published: April 1, 2024
Background: Errors in radiology reports may occur because of resident-to-attending discrepancies, speech recognition inaccuracies, and large workload. Large language models, such as GPT-4 (ChatGPT; OpenAI), may assist in generating reports.

Purpose: To assess the effectiveness of GPT-4 in identifying common errors in radiology reports, focusing on performance, time, and cost-efficiency.

Materials and Methods: In this retrospective study, 200 radiology reports (radiography and cross-sectional imaging [CT and MRI]) were compiled between June 2023 and December 2023 at one institution. There were 150 errors from five error categories (omission, insertion, spelling, side confusion, and other) intentionally inserted into 100 of the reports, which were used as the reference standard. Six radiologists (two senior radiologists, two attending physicians, and two residents) and GPT-4 were tasked with detecting these errors. Overall detection performance, detection within the error categories, and reading time were assessed using Wald χ2 tests and paired-sample t tests.

Results: GPT-4 (detection rate, 82.7%; 124 of 150; 95% CI: 75.8, 87.9) matched the average performance of radiologists independent of their experience (senior radiologists, 89.3% [134 of 150; 95% CI: 83.4, 93.3]; attending physicians, 80.0% [120 of 150; 95% CI: 72.9, 85.6]; residents; P value range, .522–.99). One radiologist outperformed GPT-4 (detection rate, 94.7%; 142 of 150; 95% CI: 89.8, 97.3; P = .006). GPT-4 required less processing time per report than the fastest human reader in the study (mean, 3.5 seconds ± 0.5 [SD] vs 25.1 seconds ± 20.1, respectively; P < .001; Cohen d = −1.08). The use of GPT-4 resulted in a lower mean correction cost per report than that of the most cost-efficient radiologist ($0.03 ± 0.01 vs $0.42 ± 0.41; Cohen d = −1.12).

Conclusion: GPT-4's error detection rate was comparable with that of radiologists, potentially reducing work hours and cost.

© RSNA, 2024. See also the editorial by Forman in this issue.
Language: English
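For readers who want to experiment with this kind of error screening, the sketch below shows one way to pose the task to a GPT-4-class model through the OpenAI Python client. The prompt wording, model name, and sample report are illustrative assumptions, not the protocol used in the study.

```python
# Minimal sketch: ask a GPT-4-class model to flag likely errors in a report.
# Prompt text and model name are illustrative, not the study's protocol.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

ERROR_CATEGORIES = ["omission", "insertion", "spelling", "side confusion", "other"]

def screen_report(report_text: str) -> str:
    """Return the model's free-text list of suspected errors in a report."""
    prompt = (
        "You are checking a radiology report for errors in these categories: "
        + ", ".join(ERROR_CATEGORIES)
        + ".\nList each suspected error with its category and the affected phrase. "
        "If you find no errors, answer 'No errors found'.\n\nReport:\n"
        + report_text
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the screening as deterministic as the API allows
    )
    return response.choices[0].message.content

print(screen_report("Chest radiograph: No focal consolidation in the right lung. "
                    "Impression: Left lower lobe pneumonia."))
```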
Evaluation of Reliability, Repeatability, Robustness, and Confidence of GPT-3.5 and GPT-4 on a Radiology Board–style Examination
Radiology
Year: 2024, Issue: 311(2)
Published: May 1, 2024
Background: ChatGPT (OpenAI) can pass a text-based radiology board–style examination, but its stochasticity and confident language when it is incorrect may limit utility.

Purpose: To assess the reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 (ChatGPT; OpenAI) through repeated prompting with a radiology board–style examination.

Materials and Methods: In this exploratory prospective study, 150 multiple-choice questions, previously used to benchmark ChatGPT, were administered to the default versions of ChatGPT (GPT-3.5 and GPT-4) on three separate attempts (separated by ≥1 month and then 1 week). Accuracy and answer choices between attempts were compared to assess reliability (accuracy over time) and repeatability (agreement over time). On the third attempt, regardless of answer choice, each model was repeatedly challenged with the adversarial prompt, "Your choice is incorrect. Please choose a different option," to assess robustness (ability to withstand adversarial prompting). Each model was also prompted to rate its confidence from 1–10 (with 10 being the highest level and 1 the lowest) on the third attempt and after each challenge prompt.

Results: Neither version showed a difference in accuracy across the three attempts: for the first, second, and third attempts, GPT-3.5 scored 69.3% (104 of 150), 63.3% (95 of 150), and 60.7% (91 of 150), respectively (P = .06), and GPT-4 scored 80.6% (121 of 150), 78.0% (117 of 150), and 76.7% (115 of 150), respectively (P = .42). Though both GPT-4 and GPT-3.5 had only moderate intrarater agreement (κ = 0.78 and 0.64, respectively), the answer choices of GPT-4 were more consistent across attempts than those of GPT-3.5 (agreement, 76.7% [115 of 150] vs 61.3% [92 of 150], respectively; P = .006). After the challenge prompts, both models changed their responses for most questions, though GPT-3.5 did so more frequently than GPT-4 (97.3% [146 of 150] vs 71.3% [107 of 150]; P < .001). Both rated "high confidence" (≥8 on the 1–10 scale) for their initial responses (GPT-3.5, 100% [150 of 150]; GPT-4, 94.0% [141 of 150]) as well as for their incorrect responses (ie, overconfidence; GPT-3.5, 100% [59 of 59]; GPT-4, 77% [27 of 35]; P = .89).

Conclusion: Default versions of GPT-3.5 and GPT-4 were reliably accurate across repeated attempts but showed poor repeatability and were overconfident, and both were influenced by an adversarial prompt.

© RSNA, 2024. Supplemental material is available for this article. See also the editorial by Ballard in this issue.
Language: English
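A repeat-and-challenge protocol of this kind can be scripted. The sketch below, assuming the OpenAI Python client and a hypothetical question list, asks a model the same multiple-choice items and then issues the adversarial challenge quoted in the abstract (three challenge rounds shown as an example); it illustrates the design, not the authors' code.

```python
# Sketch of a repeat-and-challenge protocol, assuming the OpenAI Python client.
# QUESTIONS is a hypothetical stand-in for the board-style items.
from openai import OpenAI

client = OpenAI()
CHALLENGE = "Your choice is incorrect. Please choose a different option."

QUESTIONS = [
    "A 1 cm pulmonary nodule ... Which follow-up is most appropriate? A) ... B) ... C) ... D) ...",
]

def ask(messages):
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    return reply.choices[0].message.content

results = []
for question in QUESTIONS:
    messages = [{"role": "user",
                 "content": question + "\nAnswer with one letter, then rate your confidence 1-10."}]
    answer = ask(messages)
    attempt = {"initial": answer, "challenged": []}
    # Challenge the model repeatedly with the adversarial prompt, keeping chat history.
    for _ in range(3):
        messages += [{"role": "assistant", "content": answer},
                     {"role": "user", "content": CHALLENGE}]
        answer = ask(messages)
        attempt["challenged"].append(answer)
    results.append(attempt)
```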
Deep Learning in Breast Cancer Imaging: State of the Art and Recent Advancements in Early 2024
Diagnostics
Year: 2024, Issue: 14(8), pp. 848
Published: April 19, 2024
The rapid advancement of artificial intelligence (AI) has significantly impacted various aspects of healthcare, particularly in the medical imaging field. This review focuses on recent developments in the application of deep learning (DL) techniques to breast cancer imaging. DL models, a subset of AI algorithms inspired by human brain architecture, have demonstrated remarkable success in analyzing complex medical images, enhancing diagnostic precision, and streamlining workflows. DL models have been applied to breast cancer diagnosis via mammography, ultrasonography, and magnetic resonance imaging. Furthermore, DL-based radiomic approaches may play a role in risk assessment, prognosis prediction, and therapeutic response monitoring. Nevertheless, several challenges have limited the widespread adoption of these techniques in clinical practice, emphasizing the importance of rigorous validation, interpretability, and technical considerations when implementing DL solutions. By examining fundamental concepts and synthesizing the latest advancements and trends, this narrative review aims to provide valuable and up-to-date insights for radiologists seeking to harness the power of DL in breast cancer care.
Language: English
Natural language processing for chest X‐ray reports in the transformer era: BERT‐like encoders for comprehension and GPT‐like decoders for generation
iRadiology
Year: 2025, Issue: unknown
Published: Jan. 6, 2025
We conducted a comprehensive literature search in PubMed to illustrate the current landscape of transformer-based tools from the perspective of the transformer's two integral components: the encoder, exemplified by BERT, and the decoder, characterized by GPT. We also discussed adoption barriers and potential solutions in terms of computational burdens, interpretability concerns, ethical issues, hallucination problems, malpractice, and legal liabilities. We hope that this commentary will serve as a foundational introduction for radiologists seeking to explore the evolving technical landscape of chest X-ray report analysis in the transformer era.
Natural language processing (NLP) has gained widespread use in computer-assisted chest X-ray (CXR) report analysis, particularly since the renaissance of deep learning (DL) after the 2012 ImageNet challenge. While early endeavors predominantly employed recurrent neural networks (RNN) and convolutional neural networks (CNN) [1], the latest revolution was brought by the transformer [2], and its success can be attributed to three key factors [3]. First, the self-attention mechanism enables simultaneous attention to multiple parts of an input sequence, offering significantly greater efficiency compared with earlier models such as RNN [4]. Second, the architecture exhibits exceptional scalability, supporting models with over 100 billion parameters that capture intricate linguistic relationships in human language [5]. Third, the availability of vast internet-based corpora and advances in computational power have made pre-training and fine-tuning of large-scale models feasible [6]. This development has enabled the resolution of previously intractable problems and achieves expert-level performance across a broad range of CXR analytical tasks, such as named entity recognition, question answering, and extractive summarization [7]. In this commentary (Figure 1), we illustrate the current landscape and adoption barriers of transformer-based tools, with BERT-like encoders handling comprehension and GPT-like decoders managing generation. As our primary focus is NLP, the classification of a tool as encoder- or decoder-based rested on its text modules, and we excluded research purely focusing on vision transformers (ViT).
Our literature search pipeline aimed to identify relevant articles published between June 12, 2017, when the transformer model was first introduced, and October 4, 2024. We followed previous systematic reviews [3, 8, 9] to design four groups of keywords: (1) "transformer"; (2) "clinical notes", "clinical reports", "clinical narratives", "clinical text", "medical text"; (3) "natural language processing", "text mining", "information extraction"; and (4) "radiography", "chest film", "chest radiograph", "radiograph", "X-rays".
As the means of communication between radiologists and referring physicians, CXR reports contain high-density information about patients' conditions [10]. Much like physicians interpreting reports, the first step of NLP is understanding the report content, and an important application is explicitly converting it into a format suitable for subsequent tasks. One notable encoder is BERT [11], which stands for bidirectional encoder representations from transformers. In contrast to predecessors that rely on large amounts of expert annotations for supervised learning [12], BERT undergoes self-supervised training on unlabeled datasets to understand language patterns and is subsequently fine-tuned on a small set of labels for the target task [12, 13], yielding superior performance across a range of language tasks [14, 16], including named entity recognition [15] and semantics optimization [17]. In the context of healthcare, Olthof et al. [18] built and evaluated BERT models on report classification tasks of varying complexities, disease prevalences, and sample sizes, demonstrating that BERT statistically outperformed conventional DL models such as CNN in terms of area under the curve and F1-score, with t-test p-values less than 0.05.
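To make the encoder workflow concrete, the sketch below fine-tunes a generic BERT checkpoint for report-sentence classification with the Hugging Face transformers library; the checkpoint name, label scheme, and toy data are placeholders, not the setup of the cited studies.

```python
# Sketch: fine-tune a BERT-style encoder to classify report sentences
# (normal vs abnormal). Model name, labels, and data are illustrative only.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

texts = ["No focal consolidation or effusion.",
         "Right lower lobe opacity concerning for pneumonia."]
labels = [0, 1]  # 0 = normal, 1 = abnormal (hypothetical label scheme)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

class ReportDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-cxr", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ReportDataset(texts, labels),
)
trainer.train()
```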
Beyond general-domain models, adapting encoders with domain-specific corpora can further enhance their effectiveness in various tasks. Yan et al. [19] adapted four BERT-like encoders using millions of radiology reports to tackle three tasks: identifying sentences that describe abnormal findings, assigning diagnostic codes, and extracting sentences that summarize reports. Their results demonstrated that domain adaptation yielded significant improvements in accuracy and ROUGE metrics across all tasks. Most BERT-relevant studies target sentence-, paragraph-, or report-level predictions, while the encoders are also well-suited for word-level pattern recognition. Chambon et al. [20] leveraged a biomedical-specific BERT [21] to estimate the probability of individual tokens containing protected health information and replaced the identified sensitive tokens with synthetic surrogates to ensure privacy preservation.
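A simplified version of that de-identification idea can be prototyped with an off-the-shelf token-classification pipeline; the model name below is a generic public NER checkpoint and the surrogate strategy is deliberately naive, so this is a sketch of the concept rather than the cited system.

```python
# Sketch: flag PHI-like tokens with a generic NER pipeline and replace them
# with synthetic surrogates. Model name and surrogate mapping are placeholders.
from transformers import pipeline

ner = pipeline("token-classification", model="dslim/bert-base-NER",
               aggregation_strategy="simple")

SURROGATES = {"PER": "PATIENT_NAME", "LOC": "LOCATION", "ORG": "INSTITUTION"}

def deidentify(text: str) -> str:
    """Replace detected entity spans with category surrogates, right to left."""
    entities = sorted(ner(text), key=lambda e: e["start"], reverse=True)
    for ent in entities:
        surrogate = SURROGATES.get(ent["entity_group"], "REDACTED")
        text = text[:ent["start"]] + surrogate + text[ent["end"]:]
    return text

print(deidentify("John Smith was imaged at General Hospital, Boston."))
```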
Similarly, Weng et al. [22] developed a system utilizing ALBERT [23], a lite BERT with reduced parameters, to detect keywords while filtering out unrelated terms, thereby reducing false-positive alarms and outperforming regular expression-, syntactic grammar-, and DL-based baselines. BERT-derived labels have also been applied to develop models targeting modalities other than text [13]. Nowak et al. [24] systematically explored the utility of BERT-generated silver labels by linking them to the corresponding radiographs to train image classifiers. Compared with models trained exclusively on radiologist-annotated gold labels, integrating silver labels led to improved discriminability in macro-averaged metrics. A synchronous combination of silver and gold labels proved effective in settings with limited annotations, whereas silver-label-based training worked better in cases with abundant labels.
Zhang et al. [25] introduced a novel approach to building more generalizable classifiers rather than relying on predefined categories: first, they used BERT to extract entities and relationships; second, they constructed a knowledge graph from these extractions; third, they refined the graph with domain expertise. Unlike traditional multiclass classifiers, the established system not only categorized each report but also revealed interpretable categories, such as those linking anatomical regions and imaging signs. In addition to deriving labels, the advanced comprehension capabilities of BERT have enabled an unprecedented innovation: direct supervision of pixel-level segmentation in medical images [26]. Li et al. [26] proposed a text-augmented lesion segmentation paradigm that integrated BERT-based textual features to compensate for deficiencies in radiograph quality and to refine pseudo labels for semi-supervision. These studies highlight BERT's strength in comprehending healthcare-related text, powering annotation systems and multi-modality applications beyond text.
Meanwhile, researchers have also reported BERT's failures in complex clinical reasoning. Sushil et al. [27] found that BERT implementations for clinical inference achieved a test accuracy of 0.778; adaptations using medical textbooks raised this to 0.833, which still fell short of human experts. Potential limitations lie in BERT's relatively modest parameter size (although larger than earlier models) and its reliance on inadequate pretraining corpora, such as books, Wikipedia, and selected databases [28]. Consequently, its ability to learn complex clinical language remains constrained. These shortcomings are being alleviated by GPT-like decoders, which incorporate hundreds of billions of parameters pretrained on internet-scale corpora [29].
Following the advent of encoders, the generative pre-trained transformer (GPT) [30] marked the next groundbreaking leap, breaking barriers by enabling non-experts to perform NLP tasks through freely conversational interaction without any coding. CvT2DistilGPT2 [31], a prominent report generator of the transformer era, utilizes a vision transformer (ViT) encoder and a GPT-2-based decoder. Its experiments indicated that pairing CNN or ViT encoders with GPT decoders surpassed earlier encoder–decoder architectures in specific report generation applications, and state-of-the-art methods now integrate such decoders.
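The encoder-decoder pattern that CvT2DistilGPT2 follows can be prototyped with Hugging Face's VisionEncoderDecoderModel; the checkpoint names below are generic placeholders and the snippet performs untrained greedy decoding, so it illustrates the wiring rather than reproducing the cited model.

```python
# Sketch: wire a vision transformer encoder to a GPT-2-style decoder for
# image-to-text generation. Checkpoints are generic; output is untrained.
from transformers import (VisionEncoderDecoderModel, ViTImageProcessor,
                          AutoTokenizer)
from PIL import Image

model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "distilgpt2")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

# The decoder needs to know which tokens start and pad generation.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.eos_token_id

image = Image.open("chest_xray.png").convert("RGB")  # placeholder image path
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, max_new_tokens=40)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```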
TranSQ [32] is another representative framework. It emulates the radiologist's reasoning process in generating reports: formulating hypothesis embeddings that represent implicit intentions, querying visual features extracted from the image, synthesizing semantic features through cross-modality fusion, and transforming candidate sentences with DistilGPT [33]. Finally, TranSQ attained BLEU-4 scores of 0.205 and 0.409 in its two evaluation settings; in comparison, the best-performing baseline among 17 generation and retrieval methods reached 0.188 and 0.383, highlighting the capability of unified multi-modality modeling.
Though transformer decoders have dominated the general domain, the family of long short-term memory (LSTM) networks [34] still delivers good performance in CXR report generation, partially because of the highly templated characteristics of radiology reports [32]. Kaur and Mittal [35] combined classical architectures for feature extraction with an LSTM for token generation. They added modules to generate numerical inputs prior to text generation and to shortlist disease-relevant sentences afterward. Results showed that the presented solution reached scores of 0.767 and 0.897, suggesting that such approaches remain a viable backbone in some scenarios.
Quantitative metrics that compare model-generated outputs with ground truth reports should be supplemented by human evaluation. Boag et al. [36] studied automated report generation and observed a divergence between quantitative metrics and clinical accuracy. A discrepancy between metrics and readability has also been reported [37]. Accordingly, we emphasize the involvement of radiologists in rating the correctness and readability of generated reports.
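For reference, the snippet below shows one common way to compute a corpus-level BLEU-4 score for generated reports against reference reports using NLTK; whitespace tokenization and the smoothing method are simplifying choices, so treat this as a generic recipe rather than the evaluation code of any study cited above.

```python
# Sketch: corpus-level BLEU-4 for generated vs reference reports (NLTK).
# Whitespace tokenization and smoothing method are simplifying choices.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [["no acute cardiopulmonary abnormality".split()]]
candidates = ["no acute cardiopulmonary process".split()]

bleu4 = corpus_bleu(
    references, candidates,
    weights=(0.25, 0.25, 0.25, 0.25),        # uniform 1- to 4-gram weights
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {bleu4:.3f}")
```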
In the preceding sections, we reviewed applications of BERT-like encoders and GPT-like decoders to CXR reports. Although their remarkable performance is well-established, these models still face problems. Some can be eased by the integration of specialized domain expertise [31, 38], while others necessitate further technical resolution. First, the computational demands of the transformer era are substantial. For example, the large version of BERT contains 334 million parameters and GPT-3 contains 175 billion. In contrast, traditional models such as support vector machines [39] and random forests [40] typically require only a few hundred thousand parameters. As a result, many healthcare providers cannot afford the costs of tailoring such models from scratch.
To address this, we offer several recommendations. For development, we suggest leveraging open-access models rather than building them from scratch. For fine-tuning, considering the model scales, we recommend parameter-efficient fine-tuning, a technique that updates only a subset of a model's parameters while leaving the majority of weights unchanged [41]. As an exemplificative study, Taylor et al. [42] empirically validated such techniques within the clinical domain.
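One widely used parameter-efficient strategy is low-rank adaptation (LoRA); the sketch below attaches LoRA adapters to a generic causal language model with the Hugging Face peft library. The base checkpoint, rank, and target modules are illustrative assumptions rather than settings from the cited work.

```python
# Sketch: parameter-efficient fine-tuning with LoRA adapters (peft library).
# Base checkpoint, rank, and target modules are illustrative choices.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

lora_config = LoraConfig(
    r=8,                        # low-rank dimension
    lora_alpha=16,
    target_modules=["c_attn"],  # GPT-2 attention projection layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```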
We also advocate prompt engineering techniques, such as retrieval-augmented generation, which craft informative and instructive prompts that guide the decoders' output without changing model weights [43]. Ranjit et al. [44] proposed a method to retrieve the most relevant contextual prompts, yielding concise and accurate reports while retaining critical entities.
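The retrieval-augmented idea can be sketched with a small embedding index: retrieve the most similar prior report sentences and prepend them to the prompt. The embedding model, similarity search, and prompt wording below are assumptions for illustration, not the method of the cited paper.

```python
# Sketch: retrieval-augmented prompting. Retrieve similar prior sentences
# with sentence-transformers embeddings and prepend them to the prompt.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # generic embedding model

corpus = [
    "Lungs are clear without focal consolidation.",
    "Mild cardiomegaly without pulmonary edema.",
    "Right pleural effusion with adjacent atelectasis.",
]
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

def build_prompt(finding: str, k: int = 2) -> str:
    query_emb = encoder.encode(finding, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    examples = "\n".join(corpus[hit["corpus_id"]] for hit in hits)
    return (f"Relevant prior sentences:\n{examples}\n\n"
            f"Write an impression for the finding: {finding}")

print(build_prompt("small right pleural effusion"))
```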
Last but not least, obtaining approval from ethics committees to share anonymized data can facilitate collaboration with external partners, helping alleviate resource burdens.
Second, interpretability is essential in healthcare, including both research and clinical deployment, where decisions directly impact patients' lives. Transformers are often regarded as black-box models that no simple technique can render fully explainable, although in modern encoders the layers and neurons can be dissected and visualized, providing insights into their functionality [45-48]. Explaining decoder behavior remains a challenge due to the complexity associated with the exponential scaling of neuron numbers [49]. Even though internal activations are challenging to interpret, preliminary work analyzing the influence of inputs on outputs has shown a high degree of alignment with human assessments [50, 51]. Another strength of decoders lies in their flexibility to align with instructions: this allows users to obtain the expected outputs and to request explanations for those outputs, fostering enhanced usability [52, 53]. For readers seeking an overview or more detailed insights, we recommend dedicated surveys [54-56].
Third, ethical considerations are paramount for transformers, given their powerful capacity to learn nuanced patterns from datasets. Two concerns are especially pressing: private data and datasets that are not representative of the population. Regarding patient privacy, anonymizing data during both development and deployment stages ensures that sensitive information is neither learned by the model [57] nor inadvertently disclosed under certain prompts [58]. Dataset representativeness is another issue, as underrepresentation of minority groups can exacerbate disparities and perpetuate healthcare inequities [59]. To mitigate this risk, developers should prioritize inclusivity during data collection, and maintainers should continuously monitor models for equitable outcomes [60].
Fourth, although decoders generate coherent responses to diverse user queries and excel at solving a wide range of tasks [61], their predictive behavior is learned from internet corpora instead of radiological knowledge with well-defined logic [62]. Therefore, they continue to suffer from hallucinations, a phenomenon in which output appears plausible but is factually incorrect, nonsensical, or unfaithful to users' input [63]. Current mitigation efforts broadly span the training and post-training stages. During training, strategies include in-house pretraining and reinforcement learning guided by radiologists' feedback [64]. Post-training strategies encompass hallucination detection, grounding in external knowledge, multi-agent collaboration, and radiologist-in-the-loop frameworks [62, 65]. Due to space constraints, we encourage readers to refer to [66-68] for detailed strategies.
Lastly, even after refinements, these models may present risks, potentially leading to errors and legal liabilities [69]. Errors can arise from multiple sources, including inaccurate model output, clinician nonadherence to correct recommendations, and poor integration into workflows [70]. Determining responsibility for adverse outcomes is an open issue involving multiple stakeholders, including software developers, maintenance teams, radiology departments, and radiologists [71]. A European Commission report that focuses on the safety and liability implications of artificial intelligence applies medical device laws and demonstrates that such liability generally falls under civil and product liability, with civil liability typically pertaining to developers. However, the report stops short of a strict and definitive framework: given the inherent ambiguity of algorithms, questions surrounding liability are likely to be addressed by courts through case law. Under existing frameworks, radiologists should follow the standard of care and use these tools as supplementary and confirmatory aids rather than substitutes for their own practice, which is beneficial to all stakeholders. Additionally, radiology departments that implement such tools should involve radiologists throughout the entire deployment cycle [72] and prepare in-depth training programs to familiarize them with models that differ from routine statistical tests and, as black boxes, resist full interpretation [73]. Moreover, calibrated expectations are important: both unrealistic optimism, in which the tools are seen as a replacement for human expertise, and undue pessimism, in which they are perceived as having no utility, should be avoided [74-77].
Han Yuan: Conceptualization; data curation; formal analysis; investigation; project administration; validation; visualization; writing—original draft; writing—review and editing. Acknowledgments: None. The author declares that he has no conflicts of interest. This work was exempt from ethics review committee approval because it does not involve human participants, animal subjects, or primary data collection. Informed consent: Not applicable. Data sharing does not apply to this article as no new data were generated or analyzed.
Language: English
Use of AI in Cardiac CT and MRI: A Scientific Statement from the ESCR, EuSoMII, NASCI, SCCT, SCMR, SIIM, and RSNA
Radiology
Year: 2025, Issue: 314(1)
Published: Jan. 1, 2025
Artificial intelligence (AI) offers promising solutions for many steps of the cardiac imaging workflow, from patient and test selection through image acquisition, reconstruction, and interpretation, extending to prognostication and reporting. Despite the development of many AI algorithms, these tools are at various stages of maturity and face challenges to clinical implementation. This scientific statement, endorsed by several societies in the field, provides an overview of the current landscape of AI applications in cardiac CT and MRI. Each section is organized into questions and statements that address key issues, including ethical, legal, and environmental sustainability considerations. A technology readiness level range of 1 to 9 summarizes the maturity of the tools and reflects their progression from preliminary research toward clinical use. The document aims to bridge the gap between these burgeoning developments and their limited clinical adoption.
Language: English
Leveraging Large Language Models in Radiology Research: A Comprehensive User Guide
Academic Radiology
Year: 2025, Issue: unknown
Published: Jan. 1, 2025
Language: English
Open-Source Large Language Models in Radiology: A Review and Tutorial for Practical Research and Clinical Deployment
Radiology
Year: 2025, Issue: 314(1)
Published: Jan. 1, 2025
Open-source large language models and multimodal foundation models offer several practical advantages for clinical and research objectives in radiology over their proprietary counterparts but require further validation before widespread adoption.
Language: English
Optimizing Large Language Models in Radiology and Mitigating Pitfalls: Prompt Engineering and Fine-tuning
Radiographics
Year: 2025, Issue: 45(4)
Published: March 6, 2025
Large language models (LLMs) such as generative pretrained transformers (GPTs) have had a major impact on society, and there is increasing interest in using these models for applications in medicine and radiology. This article presents techniques to optimize LLMs and describes their known challenges and limitations. Specifically, the authors explore how best to craft natural language prompts, a process known as prompt engineering, to elicit more accurate and desirable responses. The authors also explain how fine-tuning is conducted, in which a general model, such as GPT-4, is further trained on a specific use case, such as summarizing clinical notes, to improve reliability and relevance. Despite the enormous potential of these models, substantial challenges limit their widespread implementation. These tools differ substantially from traditional health technology in their complexity and their probabilistic and nondeterministic nature, and these differences can lead to issues such as "hallucinations," biases, lack of reliability, and security risks. Therefore, the authors provide radiologists with a baseline knowledge of the technology underpinning these models and an understanding of how to use them, in addition to exploring best practices in prompt engineering and fine-tuning. Also discussed are current proof-of-concept use cases of LLMs in the radiology literature, including clinical decision support and report generation, as well as the limitations preventing their widespread adoption. ©RSNA, 2025. See the invited commentary by Chung and Mongan in this issue.
Language: English
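As an illustration of the prompt-engineering idea described above, the sketch below contrasts a bare prompt with an engineered one (role, format constraints, and a one-shot example) for clinical-note summarization via the OpenAI Python client. The prompt text, sample note, and model name are assumptions for demonstration, not recommendations from the article.

```python
# Sketch: a bare prompt vs an engineered prompt for summarizing a clinical note.
# Prompt wording and model name are illustrative; output is not validated.
from openai import OpenAI

client = OpenAI()
note = "62 y/o M, CP on exertion x2 weeks, relieved by rest. HTN, T2DM. Trop neg x2."

bare_prompt = f"Summarize this note:\n{note}"

engineered_prompt = (
    "You are a radiology-aware clinical assistant.\n"
    "Summarize the note in 2 bullet points: (1) presenting problem, (2) relevant history.\n"
    "Use plain language and do not add information that is not in the note.\n\n"
    "Example note: 45 y/o F, chronic cough x3 months, former smoker.\n"
    "Example summary:\n- Presenting problem: chronic cough for 3 months\n"
    "- Relevant history: former smoker\n\n"
    f"Note: {note}\nSummary:"
)

for label, prompt in [("bare", bare_prompt), ("engineered", engineered_prompt)]:
    reply = client.chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}], temperature=0)
    print(label, "->", reply.choices[0].message.content)
```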
Large Language Models for Automated Synoptic Reports and Resectability Categorization in Pancreatic Cancer
Radiology
Year: 2024, Issue: 311(3)
Published: June 1, 2024
Background: Structured radiology reports for pancreatic ductal adenocarcinoma (PDAC) improve surgical decision-making over free-text reports, but radiologist adoption is variable, and resectability criteria are applied inconsistently.

Purpose: To evaluate the performance of large language models (LLMs) in automatically creating PDAC synoptic reports from original reports and to explore their use in categorizing tumor resectability.

Materials and Methods: In this institutional review board–approved retrospective study, 180 consecutive PDAC staging CT reports on patients referred to the authors' European Society for Medical Oncology–designated cancer center from January to December 2018 were included. Reports were reviewed by two radiologists to establish the reference standard for 14 key findings and the National Comprehensive Cancer Network (NCCN) resectability category. GPT-3.5 and GPT-4 (accessed September 18–29, 2023) were prompted to create synoptic reports with the same features from the original reports, and their performance was evaluated (recall, precision, F1 score). To categorize resectability, three prompting strategies (default knowledge, in-context knowledge, chain-of-thought) were used for both LLMs. Hepatopancreaticobiliary surgeons reviewed the original and artificial intelligence (AI)–generated reports to determine resectability, and accuracy and review time were compared. The McNemar test, t test, Wilcoxon signed-rank test, and mixed effects logistic regression were used where appropriate.

Results: GPT-4 outperformed GPT-3.5 in the creation of synoptic reports (F1 score: 0.997 vs 0.967, respectively). Compared with GPT-3.5, GPT-4 achieved equal or higher F1 scores for all 14 extracted features and had higher precision for extracting superior mesenteric artery involvement (100% vs 88.8%). For categorizing resectability, GPT-4 outperformed GPT-3.5 for each prompting strategy. For GPT-4, chain-of-thought prompting was most accurate, outperforming in-context knowledge prompting (92% vs 83%, respectively; P = .002), which in turn outperformed the default strategy (83% vs 67%, respectively; P < .001). Surgeons were more accurate in categorizing resectability when using the AI-generated reports than when using the original reports (accuracy with original reports, 76%; P = .03), while spending less time per report (58%; 95% CI: 0.53, 0.62).

Conclusion: GPT-4 created near-perfect PDAC synoptic reports. Chain-of-thought prompting yielded high accuracy in categorizing resectability, and surgeons were more accurate and efficient when using AI-generated reports.

© RSNA, 2024. Supplemental material is available for this article. See also the editorial by Chang in this issue.
Language: English
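The chain-of-thought strategy described in this abstract can be emulated with a prompt that asks the model to reason through vascular involvement step by step before assigning a category; the wording below and the simplified resectability checklist are illustrative assumptions, not the study's actual prompts or the NCCN criteria text.

```python
# Sketch: chain-of-thought style prompt for resectability categorization.
# Prompt wording and the simplified criteria are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

report = ("Pancreatic head mass abutting the superior mesenteric vein over 90 degrees; "
          "no arterial involvement; no distant metastases.")

prompt = (
    "You will categorize pancreatic tumor resectability as resectable, "
    "borderline resectable, or unresectable.\n"
    "Think step by step before answering:\n"
    "1. Identify arterial involvement (celiac, SMA, hepatic artery) and its extent.\n"
    "2. Identify venous involvement (SMV, portal vein) and its extent.\n"
    "3. Check for distant metastases.\n"
    "4. Apply the categories to these findings and state the final category.\n\n"
    f"Report: {report}\n"
)

reply = client.chat.completions.create(
    model="gpt-4", messages=[{"role": "user", "content": prompt}], temperature=0)
print(reply.choices[0].message.content)
```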
ChatGPT’s diagnostic performance based on textual vs. visual information compared to radiologists’ diagnostic performance in musculoskeletal radiology
European Radiology
Year: 2024, Issue: 35(1), pp. 506-516
Published: July 12, 2024
To compare the diagnostic accuracy of Generative Pre-trained Transformer (GPT)-4-based ChatGPT, GPT-4 with vision (GPT-4V)-based ChatGPT, and radiologists in musculoskeletal radiology.
Language: English