BMC Medical Informatics and Decision Making,
Journal Year: 2024
Volume and Issue: 24(1)
Published: Nov. 26, 2024
Large language models (LLMs), most notably ChatGPT, released since November 30, 2022, have prompted shifting attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated. We performed a literature review of PubMed to identify publications between December 1, 2022, and April 2024 that discussed assessments of LLM-generated diagnoses or treatment plans. We selected 108 relevant articles from PubMed for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency". The criteria for defining high-quality LLM outputs have been consistently selected by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of qualitative evaluation metrics that assess the quality of LLM outputs can be developed to facilitate research on LLMs in healthcare.
npj Digital Medicine,
Journal Year: 2025
Volume and Issue: 8(1)
Published: April 30, 2025
Large language models (LLMs) show promise in mental health care for handling human-like conversations, but their effectiveness remains uncertain. This scoping review synthesizes existing research on LLM applications in mental health care, reviews model performance and clinical effectiveness, identifies gaps in current evaluation methods following a structured evaluation framework, and provides recommendations for future development. A systematic search identified 726 unique articles, of which 16 met the inclusion criteria. These studies, encompassing applications such as clinical assistance, counseling, therapy, and emotional support, show initial promise. However, the evaluation methods were often non-standardized, with most studies relying on ad-hoc scales that limit comparability and robustness. A reliance on prompt-tuning proprietary models, such as OpenAI's GPT series, also raises concerns about transparency and reproducibility. As current evidence does not fully support their use as standalone interventions, more rigorous development and evaluation guidelines are needed for safe, effective clinical integration.
JMIR Formative Research,
Journal Year: 2023
Volume and Issue: 7, P. e51798 - e51798
Published: Dec. 4, 2023
Refractive surgery research aims to optimally precategorize patients by their suitability for various types of surgery. Recent advances have led to the development of artificial intelligence-powered algorithms, including machine learning approaches, to assess risks and enhance workflow. Large language models (LLMs) like ChatGPT-4 (OpenAI LP) have emerged as potential general artificial intelligence tools that can assist across various disciplines, possibly including refractive surgery decision-making. However, their actual capabilities in precategorizing refractive surgery patients based on real-world parameters remain unexplored.
The Egyptian Journal of Neurology Psychiatry and Neurosurgery,
Journal Year: 2024
Volume and Issue: 60(1)
Published: Jan. 30, 2024
ChatGPT has become a hot topic of discussion since its release in November 2022. The number of publications on the potential applications of ChatGPT in various fields is on the rise. However, viewpoints on the use of ChatGPT in psychiatry are lacking. This article aims to address this gap by examining the promises and pitfalls of using ChatGPT in psychiatric practice. While ChatGPT offers several opportunities, further research is warranted, as the use of chatbots like ChatGPT raises various technical and ethical concerns. Some practical ways of addressing the challenges for the use of ChatGPT in psychiatry are also discussed.