BMC Medical Informatics and Decision Making, Journal Year: 2025, Volume and Issue: 25(1)
Published: April 14, 2025
The integration of artificial intelligence (AI) in healthcare has rapidly expanded, particularly in clinical decision-making. Large language models (LLMs) such as GPT-4 and GPT-3.5 have shown potential in various medical applications, including diagnostics and treatment planning. However, their efficacy in specialized fields like sports surgery and physiotherapy remains underexplored. This study aims to compare the performance of GPT-4 and GPT-3.5 in clinical decision-making within these domains using a structured assessment approach.
This cross-sectional study included 56 professionals specializing in sports surgery and physiotherapy. Participants evaluated 10 standardized clinical scenarios generated by GPT-4 and GPT-3.5 using a 5-point Likert scale. The scenarios encompassed common musculoskeletal conditions, and assessments focused on diagnostic accuracy, treatment appropriateness, surgical technique detailing, and rehabilitation plan suitability. Data were collected anonymously via Google Forms. Statistical analysis included paired t-tests for direct model comparisons, one-way ANOVA to assess performance across multiple criteria, and Cronbach's alpha to evaluate inter-rater reliability.
GPT-4 significantly outperformed GPT-3.5 across all criteria. Paired t-test results (t(55) = 10.45, p < 0.001) demonstrated that GPT-4 provided more accurate diagnoses, superior treatment plans, and more detailed surgical recommendations. ANOVA results confirmed the higher suitability of GPT-4 in treatment planning (F(1, 55) = 35.22, p < 0.001) and rehabilitation protocols (F(1, 55) = 32.10, p < 0.001). Cronbach's alpha values indicated higher internal consistency for GPT-4 (α = 0.478) compared to GPT-3.5 (α = 0.234), reflecting more reliable performance.
GPT-4 demonstrates superior performance to GPT-3.5 in clinical decision-making for sports surgery and physiotherapy. These findings suggest that advanced AI models can aid in diagnostic accuracy, treatment planning, and rehabilitation strategies. However, AI should function as a decision-support tool rather than a substitute for expert judgment. Future studies should explore the integration of AI into real-world clinical workflows, validate the findings using larger datasets, and compare additional AI models beyond the GPT series.
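The two headline statistics named in this abstract, the paired t-test and Cronbach's alpha, can be sketched in a few lines. This is only an illustration under stated assumptions: the ratings below are made up (they are not the study's data), the helper names `paired_t` and `cronbach_alpha` are hypothetical, and the abstract does not say how the authors arranged raters and items when computing alpha, so the rows-as-raters layout here is a guess.

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


def cronbach_alpha(ratings):
    """Cronbach's alpha; rows are raters, columns are items (assumed layout)."""
    k = len(ratings[0])
    item_vars = [variance(col) for col in zip(*ratings)]
    total_var = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical 5-point Likert ratings: 4 raters x 3 scenarios per model.
model_a = [[4, 5, 4], [3, 4, 3], [5, 5, 4], [2, 3, 2]]
model_b = [[3, 4, 3], [3, 3, 2], [4, 4, 4], [2, 2, 1]]

# Compare each rater's mean score for the two models.
mean_a = [mean(row) for row in model_a]
mean_b = [mean(row) for row in model_b]
t_stat, dof = paired_t(mean_a, mean_b)
alpha_a = cronbach_alpha(model_a)

print(f"t({dof}) = {t_stat:.2f}, alpha = {alpha_a:.3f}")
```

With 56 raters the paired test has 55 degrees of freedom, which matches the t(55) reported above. A p-value would need the t distribution, for example from a statistics library.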
NEJM AI, Journal Year: 2024, Volume and Issue: 1(5)
Published: April 16, 2024
As artificial intelligence (AI) tools become widely accessible, more patients and medical professionals will turn to them for medical information. Large language models (LLMs), a subset of AI, excel in natural language processing tasks and hold considerable promise for clinical use. Fields such as oncology, in which clinical decisions are highly dependent on a continuous influx of new clinical trial data and evolving guidelines, stand to gain immensely from such advancements. It is therefore of critical importance to benchmark these models and describe their performance characteristics to guide their safe application to clinical oncology. Accordingly, the primary objectives of this work were to conduct comprehensive evaluations of LLMs in the field of oncology and to identify and characterize strategies that medical professionals can use to bolster their confidence in a model's response.
Diagnostic and Interventional Imaging, Journal Year: 2024, Volume and Issue: 105(7-8), P. 251 - 265
Published: April 27, 2024
The purpose of this study was to systematically review the reported performances of ChatGPT, identify potential limitations, and explore future directions for its integration, optimization, and ethical considerations in radiology applications.
medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: March 5, 2024
Abstract
The introduction of large language models (LLMs) into clinical practice promises to improve patient education and empowerment, thereby personalizing medical care and broadening access to medical knowledge. Despite the popularity of LLMs, there is a significant gap in systematized information on their use in patient care. Therefore, this systematic review aims to synthesize current applications and limitations of LLMs in patient care using a data-driven convergent synthesis approach. We searched 5 databases for qualitative, quantitative, and mixed methods articles published between 2022 and 2023. From 4,349 initial records, 89 studies across 29 medical specialties were included, primarily examining models based on the GPT-3.5 (53.2%, n=66 of 124 different LLMs examined per study) and GPT-4 (26.6%, n=33/124) architectures in medical question answering, followed by patient information generation, including medical text summarization or translation, and clinical documentation. Our analysis delineates two primary domains of LLM limitations: design and output. Design limitations included 6 second-order and 12 third-order codes, such as lack of medical domain optimization, data transparency, and accessibility issues, while output limitations included 9 second-order and 32 third-order codes, for example, non-reproducibility, non-comprehensiveness, incorrectness, unsafety, and bias. In conclusion, this study is the first review to systematically map LLM applications and limitations in patient care, providing a foundational framework and taxonomy for their implementation and evaluation in healthcare settings.
npj Digital Medicine, Journal Year: 2024, Volume and Issue: 7(1)
Published: Sept. 28, 2024
Abstract
With generative artificial intelligence (GenAI), particularly large language models (LLMs), continuing to make inroads in healthcare, assessing LLMs with human evaluations is essential to assuring safety and effectiveness. This study reviews existing literature on human evaluation methodologies for LLMs in healthcare across various medical specialties and addresses factors such as evaluation dimensions, sample types and sizes, selection and recruitment of evaluators, frameworks and metrics, evaluation process, and statistical analysis type. Our literature review of 142 studies shows gaps in the reliability, generalizability, and applicability of current human evaluation practices. To overcome such significant obstacles to healthcare LLM developments and deployments, we propose QUEST, a comprehensive and practical framework for human evaluation of LLMs covering three phases of workflow: Planning, Implementation and Adjudication, and Scoring and Review. QUEST is designed with five proposed evaluation principles: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence.
Communications Medicine, Journal Year: 2025, Volume and Issue: 5(1)
Published: Jan. 21, 2025
Abstract
Background
The introduction of large language models (LLMs) into clinical practice promises to improve patient education and empowerment, thereby personalizing medical care and broadening access to medical knowledge. Despite the popularity of LLMs, there is a significant gap in systematized information on their use in patient care. Therefore, this systematic review aims to synthesize current applications and limitations of LLMs in patient care.
Methods
We systematically searched 5 databases for qualitative, quantitative, and mixed methods articles published between 2022 and 2023. From 4349 initial records, 89 studies across 29 medical specialties were included. Quality assessment was performed using the Mixed Methods Appraisal Tool 2018. A data-driven convergent synthesis approach was applied for thematic syntheses of LLM applications and limitations using free line-by-line coding in Dedoose.
Results
We show that most studies investigate Generative Pre-trained Transformers (GPT)-3.5 (53.2%, n = 66 of 124 different LLMs examined) and GPT-4 (26.6%, n = 33/124) in answering medical questions, followed by patient information generation, including medical text summarization or translation, and clinical documentation. Our analysis delineates two primary domains of LLM limitations: design and output. Design limitations include 6 second-order and 12 third-order codes, such as lack of medical domain optimization, data transparency, and accessibility issues, while output limitations include 9 second-order and 32 third-order codes, for example, non-reproducibility, non-comprehensiveness, incorrectness, unsafety, and bias.
Conclusions
This review systematically maps LLM applications and limitations in patient care, providing a foundational framework and taxonomy for their implementation and evaluation in healthcare settings.
Journal of Medical Internet Research, Journal Year: 2023, Volume and Issue: 26, P. e51926 - e51926
Published: Nov. 30, 2023
Benefiting from rich knowledge and the exceptional ability to understand text, large language models like ChatGPT have shown great potential in English clinical environments. However, the performance of ChatGPT in non-English clinical settings, as well as its reasoning, has not been explored in great depth.
npj Digital Medicine, Journal Year: 2024, Volume and Issue: 7(1)
Published: April 23, 2024
Reliably processing and interlinking medical information has been recognized as a critical foundation to the digital transformation of medical workflows, and despite the development of medical ontologies, the optimization of these has been a major bottleneck to digital medicine. The advent of large language models has brought great excitement, and maybe a solution to medicine's 'communication problem' is in sight, but how can the known weaknesses of these models, such as hallucination and non-determinism, be tempered? Retrieval Augmented Generation, particularly through knowledge graphs, is an automated approach that can deliver structured reasoning and a model of truth alongside LLMs, relevant to information structuring and therefore also to decision support.
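To make the idea of Retrieval Augmented Generation over a knowledge graph concrete, here is a minimal sketch. It is not the cited paper's method: the triple store, the keyword-matching retriever, and the names `TRIPLES`, `retrieve`, and `build_prompt` are all assumptions chosen for illustration, and the call to an actual LLM is left out.

```python
# Toy knowledge graph as (subject, relation, object) triples.
# Contents are illustrative only, not a clinical reference.
TRIPLES = [
    ("metformin", "treats", "type 2 diabetes"),
    ("metformin", "contraindicated_in", "severe renal impairment"),
    ("lisinopril", "treats", "hypertension"),
]


def retrieve(question, triples):
    """Return triples whose subject or object is mentioned in the question."""
    q = question.lower()
    return [t for t in triples if t[0] in q or t[2] in q]


def build_prompt(question, triples):
    """Ground the question in retrieved facts before it is sent to an LLM."""
    facts = retrieve(question, triples)
    lines = [f"- {s} {r.replace('_', ' ')} {o}" for s, r, o in facts]
    return (
        "Answer using only these facts:\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )


question = "When should metformin be avoided?"
prompt = build_prompt(question, TRIPLES)
print(prompt)
```

The graph supplies the "model of truth": the language model is asked to answer from the retrieved triples instead of from its own recall, which is one way the hallucination problem named in the abstract can be tempered. A production retriever would use entity linking or graph queries in place of substring matching.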