Evaluating the Performance and Safety of Large Language Models in Generating Type 2 Diabetes Mellitus Management Plans: A Comparative Study With Physicians Using Real Patient Records
Agnibho Mondal, Arindam Naskar, Bhaskar Roy Choudhury, et al.
Cureus, 2025. Volume and Issue: unknown
Published: March 17, 2025
Background
The integration of large language models (LLMs) such as GPT-4 into healthcare presents potential benefits and challenges. While LLMs show promise in applications ranging from scientific writing to personalized medicine, their practical utility and safety in clinical settings remain under scrutiny. Concerns about accuracy, ethical considerations, and bias necessitate rigorous evaluation of these technologies against established medical standards.
Methods
This study involved a comparative analysis using anonymized patient records from a healthcare setting in the state of West Bengal, India. Management plans for 50 patients with type 2 diabetes mellitus were generated by GPT-4 and three physicians, who were blinded to each other's responses. These plans were evaluated against a reference management plan based on American Diabetes Society guidelines. Completeness, necessity, and dosage accuracy were quantified, and a Prescribing Error Score was devised to assess the quality of the generated management plans. The safety of the management plans generated by GPT-4 was also assessed.
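The comparison described in the Methods can be illustrated with a minimal sketch. The abstract does not give the formula of the Prescribing Error Score, so everything here is an assumption made for illustration: the names `PlanErrors` and `compare_plan`, the representation of a plan as a mapping from drug name to dose string, and the unweighted sum used as a total. The example data are made up and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanErrors:
    missing: int       # reference drugs absent from the plan (completeness)
    unnecessary: int   # plan drugs absent from the reference (necessity)
    wrong_dose: int    # shared drugs whose dose differs (dosage accuracy)

    @property
    def total(self) -> int:
        # Hypothetical unweighted sum; the abstract does not state the real formula.
        return self.missing + self.unnecessary + self.wrong_dose


def compare_plan(plan: dict[str, str], reference: dict[str, str]) -> PlanErrors:
    """Compare a plan with a reference plan; both map drug name -> dose string."""
    plan_drugs, ref_drugs = set(plan), set(reference)
    shared = plan_drugs & ref_drugs
    return PlanErrors(
        missing=len(ref_drugs - plan_drugs),
        unnecessary=len(plan_drugs - ref_drugs),
        wrong_dose=sum(1 for drug in shared if plan[drug] != reference[drug]),
    )


# Made-up example data, not taken from the study.
reference = {"metformin": "500 mg twice daily", "atorvastatin": "10 mg once daily"}
plan = {"metformin": "1000 mg twice daily", "glimepiride": "1 mg once daily"}

errors = compare_plan(plan, reference)
print(errors, errors.total)
```

In this sketch a plan is penalized once for each omitted drug, each extra drug, and each dose that differs from the reference. A real scoring scheme would likely weight these categories differently and would need clinical judgment to decide when two doses count as equivalent.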
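Results
Results indicated that physicians' management plans had fewer missing medications compared to those generated by GPT-4 (p=0.008). However, GPT-4-generated management plans included fewer unnecessary medications (p=0.003). No significant difference was observed in the accuracy of drug dosages (p=0.975). The overall error scores were comparable between physicians and GPT-4 (p=0.301). Safety issues were noted in 16% of the plans generated by GPT-4, highlighting potential risks associated with AI-generated management plans.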
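Conclusion
The study demonstrates that while GPT-4 can effectively reduce unnecessary drug prescriptions, it does not yet match the performance of physicians in terms of plan completeness. The findings support the use of LLMs as supplementary tools in healthcare, highlighting the need for enhanced algorithms and continuous human oversight to ensure the efficacy and safety of artificial intelligence in clinical settings.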
Language: English
Evaluating the Effectiveness and Safety of Large Language Model in Generating Type 2 Diabetes Mellitus Management Plans: A Comparative Study with Medical Experts Based on Real Patient Records
Agnibho Mondal, Arindam Naskar, Bhaskar Roy Choudhury, et al.
medRxiv (Cold Spring Harbor Laboratory), 2024. Volume and Issue: unknown
Published: May 22, 2024
Abstract
Background
The integration of large language models (LLMs) such as GPT-4 into healthcare presents potential benefits and challenges. While LLMs have shown promise in applications ranging from scientific writing to personalized medicine, their practical utility and safety in clinical settings remain under scrutiny. Concerns about accuracy, ethical considerations and bias necessitate rigorous evaluation of these technologies against established medical standards.
Objective
To compare the completeness, necessity, dosage accuracy and overall safety of type 2 diabetes management plans created by GPT-4 with those devised by medical experts.
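Methods
This study involved a comparative analysis using anonymized patient records from a healthcare setting in West Bengal, India. Management plans for 50 Type 2 diabetes patients were generated by GPT-4 and three blinded medical experts. These plans were evaluated against a reference management plan based on American Diabetes Society guidelines. Completeness, necessity and dosage accuracy were quantified, and an error score was devised to assess the quality of the generated management plans. The safety of the management plans generated by GPT-4 was also assessed.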
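Results
Results indicated that medical experts’ management plans had fewer missing medications compared to those generated by GPT-4 (p=0.008). However, GPT-4 generated management plans included fewer unnecessary medications (p=0.003). No significant difference was observed in the accuracy of drug dosages (p=0.975). The overall error scores were comparable between human experts and GPT-4 (p=0.301). Safety issues were noted in 16% of the plans generated by GPT-4, highlighting potential risks associated with AI-generated management plans.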
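Conclusion
The study demonstrates that while GPT-4 can effectively reduce unnecessary drug prescriptions, it does not yet match the performance of medical experts in terms of plan completeness and safety. The findings support the use of LLMs as supplementary tools in healthcare, underscoring the need for enhanced algorithms and continuous human oversight to ensure the efficacy and safety of AI in clinical settings. Further research is necessary to improve the integration of LLMs into complex healthcare environments.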
Language: English