BMC Medical Informatics and Decision Making, Journal Year: 2024, Volume and Issue: 24(1), Published: Nov. 26, 2024
Large language models (LLMs), most notably ChatGPT, released since November 30, 2022, have prompted a shift of attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated. We performed a literature review of PubMed to identify publications between December 1, 2022, and April 2024 that discussed assessments of LLM-generated diagnoses or treatment plans. We selected 108 relevant articles for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency". The criteria defining high-quality outputs have not been applied consistently by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of qualitative evaluation metrics that assess output quality can be developed to facilitate research on LLMs in healthcare.
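The five criteria named in this abstract suggest a simple rubric structure. The sketch below is a hypothetical illustration of how per-output scores on those criteria could be recorded and aggregated; the 0-2 scale, field semantics, and all names are assumptions for illustration, not taken from any of the reviewed studies.

```python
# Hypothetical rubric for scoring one LLM-generated diagnosis or treatment
# plan on the five criteria reported in the review. The 0-2 scale and all
# names are illustrative assumptions, not drawn from any specific study.
from dataclasses import dataclass, fields
from statistics import mean

@dataclass
class OutputScore:
    accuracy: int         # 0 = wrong, 1 = partially correct, 2 = correct
    completeness: int     # coverage of the expected findings/steps
    appropriateness: int  # suitability for the clinical context
    insight: int          # reasoning quality beyond pattern matching
    consistency: int      # stability across repeated generations

def mean_per_criterion(scores: list[OutputScore]) -> dict[str, float]:
    """Average each criterion over a set of scored LLM outputs."""
    return {f.name: mean(getattr(s, f.name) for s in scores)
            for f in fields(OutputScore)}

ratings = [OutputScore(2, 1, 2, 1, 2), OutputScore(1, 1, 1, 0, 2)]
print(mean_per_criterion(ratings))  # e.g. {'accuracy': 1.5, ...}
```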
Acta Neuropsychiatrica, Journal Year: 2024, Volume and Issue: unknown, P. 1 - 14, Published: Nov. 11, 2024
Tools based on generative artificial intelligence (AI) such as ChatGPT have the potential to transform modern society, including the field of medicine. Due to the prominent role of language in psychiatry, e.g., for diagnostic assessment and psychotherapy, these tools may be particularly useful within this medical field. Therefore, the aim of this study was to systematically review the literature on generative AI applications in psychiatry and mental health.
Healthcare, Journal Year: 2024, Volume and Issue: 12(16), P. 1637 - 1637, Published: Aug. 16, 2024
The use of artificial intelligence (AI) in education is dynamically growing, and models such as ChatGPT show potential for enhancing medical education. In Poland, to obtain a medical diploma, candidates must pass the Medical Final Examination, which consists of 200 questions with one correct answer per question, is administered in Polish, and assesses students' comprehensive medical knowledge and readiness for clinical practice. The aim of this study was to determine how ChatGPT-3.5 handles the questions included in this exam. The study considered 980 questions from five examination sessions of the Medical Final Examination conducted by the Medical Examination Center in the years 2022-2024. The analysis accounted for the field of medicine, the difficulty index of the questions, and their type, namely theoretical versus case-study questions. The average rate of correct answers achieved by ChatGPT-3.5 hovered around 60% and was significantly lower (p < 0.001) than the average score of the examinees. The lowest percentage of correct answers was in hematology (42.1%), while the highest was in endocrinology (78.6%). The analysis showed a statistically significant correlation between the difficulty index and the correctness of ChatGPT's answers (p = 0.04); questions that ChatGPT answered incorrectly also tended to receive fewer correct responses from examinees. The type of question analyzed did not significantly affect correctness (p = 0.46). This indicates that ChatGPT-3.5 can be an effective tool for assisting in passing the final medical exam, but the results should be interpreted cautiously. It is recommended to further verify these findings using various AI tools.
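The difficulty index referenced above is conventionally the proportion of examinees who answer a question correctly. As a minimal sketch of one way the reported association between difficulty and ChatGPT correctness could be tested: the data below are fabricated placeholders, and the choice of a point-biserial correlation is an assumption, since the paper's exact procedure is not stated in the abstract.

```python
# Minimal sketch: relate per-question difficulty index (fraction of human
# examinees answering correctly) to whether ChatGPT answered correctly.
# All data below are fabricated placeholders for illustration only.
from scipy.stats import pointbiserialr

difficulty_index = [0.42, 0.55, 0.61, 0.70, 0.78, 0.83, 0.90, 0.35]
chatgpt_correct  = [0,    0,    1,    1,    1,    1,    1,    0]

r, p = pointbiserialr(chatgpt_correct, difficulty_index)
print(f"point-biserial r = {r:.2f}, p = {p:.3f}")
```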
PLOS Digital Health, Journal Year: 2025, Volume and Issue: 4(1), P. e0000711 - e0000711, Published: Jan. 8, 2025
Generative artificial intelligence (genAI) has the potential to improve healthcare by reducing clinician burden and expanding services, among other uses. There is a significant gap between the need for mental health care and the available clinicians in the United States; this makes it an attractive target for improved efficiency through genAI. Among the most sensitive topics is suicide, and demand for crisis intervention has grown in recent years. We aimed to evaluate the quality of genAI tool responses to suicide-related queries. We entered 10 queries into five genAI tools: ChatGPT 3.5, GPT-4, a version of GPT-4 safe for protected health information, Gemini, and Bing Copilot. The response to each query was coded on seven metrics, including the presence of a suicide hotline number, content related to evidence-based suicide interventions, supportive content, and harmful content. Pooling across tools, most responses (79%) were supportive. Only 24% of responses included a hotline number, and only 4% included content consistent with evidence-based suicide prevention interventions. Harmful content was rare (5%); all such instances were delivered by the same tool. Our results suggest that genAI developers have taken a very conservative approach and constrained their models to encouraging support-seeking, but little else. Finding a balance between providing much-needed information and avoiding excessive risk is within the capabilities of genAI developers. At this nascent stage of integrating genAI tools into healthcare systems, ensuring parity with existing standards of care should be a goal for developers and healthcare organizations.
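As a rough illustration of the pooling described above, the sketch below codes each tool response on binary metrics and computes pooled prevalence across all responses. The metric subset, tool rows, and every value are placeholders, not the study's actual coding scheme or data.

```python
# Illustrative pooling of binary response codes across genAI tools.
# Rows are individual (tool, query) responses; the three metrics shown
# are a subset of the seven used in the study, and every value here is
# a fabricated placeholder.
import pandas as pd

coded = pd.DataFrame(
    [
        {"tool": "ChatGPT-3.5",  "supportive": 1, "hotline": 0, "harmful": 0},
        {"tool": "GPT-4",        "supportive": 1, "hotline": 1, "harmful": 0},
        {"tool": "Gemini",       "supportive": 0, "hotline": 0, "harmful": 0},
        {"tool": "Bing Copilot", "supportive": 1, "hotline": 0, "harmful": 1},
    ]
)

# Pooled prevalence of each metric across all coded responses.
print(coded[["supportive", "hotline", "harmful"]].mean())
```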
Global Medical Education, Journal Year: 2025, Volume and Issue: unknown, Published: Jan. 13, 2025
Abstract
Objectives
Artificial intelligence (AI) is being increasingly used in medical education. This narrative review presents a comprehensive analysis of generative AI tools' performance in answering and generating exam questions, thereby providing a broader perspective on AI's strengths and limitations in the medical education context.
Methods
The Scopus database was searched for studies on generative AI and medical examinations from 2022 to 2024. Duplicates were removed, and relevant full texts were retrieved following inclusion and exclusion criteria. Narrative review and descriptive statistics were used to analyze the contents of the included studies.
Results
A total of 70 studies were included in the analysis. The results showed that the performance of generative AI tools varied when answering different types of questions and across specialties, with the best average accuracy in psychiatry, and that performance was influenced by prompts. With well-crafted prompts, the models can efficiently produce high-quality examination questions.
Conclusion
Generative AI possesses the ability to answer medical examination questions using carefully designed prompts. Its potential use in medical assessment is vast, ranging from detecting question errors and aiding exam preparation to facilitating formative assessments and supporting personalized learning. However, it is crucial that educators always double-check AI responses to maintain accuracy and prevent the spread of misinformation.
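The conclusion's emphasis on carefully designed prompts can be made concrete with a template such as the one below; the wording, fields, and structure are illustrative assumptions, not prompts taken from the reviewed studies.

```python
# Hypothetical prompt template for generating a single-best-answer MCQ.
# The structure (topic, difficulty, distractor count) is illustrative only.
MCQ_PROMPT = """\
You are a medical examiner. Write one single-best-answer question on
{topic} at {difficulty} difficulty for final-year medical students.
Provide exactly {n_options} options (A-{last}), mark the correct answer,
and add a one-paragraph explanation citing the key clinical reasoning.
"""

def build_prompt(topic: str, difficulty: str = "moderate",
                 n_options: int = 5) -> str:
    last = chr(ord("A") + n_options - 1)
    return MCQ_PROMPT.format(topic=topic, difficulty=difficulty,
                             n_options=n_options, last=last)

print(build_prompt("diabetic ketoacidosis management"))
```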
The Spanish Journal of Psychology, Journal Year: 2025, Volume and Issue: 28, Published: Jan. 1, 2025
Abstract
Since the publication of "What is the Current and Future Status of Digital Mental Health Interventions?", the exponential growth and widespread adoption of ChatGPT have underscored the importance of reassessing its utility in digital mental health interventions. This review critically examined the potential of ChatGPT, particularly focusing on its application within clinical psychology settings, as the technology has continued evolving through 2023 and 2024. Alongside this, our literature review spanned US Medical Licensing Examination (USMLE) validations, assessments of ChatGPT's capacity to interpret human emotions, and analyses concerning the identification of depression and its determinants at treatment initiation, among other reported findings. Our review evaluated the capabilities of GPT-3.5 and GPT-4.0 separately in clinical settings, highlighting the potential of conversational AI to overcome traditional barriers such as stigma and limited accessibility to treatment. Each model displayed different levels of proficiency, indicating a promising yet cautious pathway for integrating ChatGPT into clinical practices.
Journal of Medical Internet Research, Journal Year: 2025, Volume and Issue: 27, P. e70535 - e70535, Published: March 19, 2025
Chronic diseases are a major global health burden, accounting for nearly three-quarters of deaths worldwide. Large language models (LLMs) are advanced artificial intelligence systems with transformative potential to optimize chronic disease management; however, robust evidence is lacking. This review aims to synthesize evidence on the feasibility, opportunities, and challenges of LLMs across the chronic disease management spectrum, from prevention and screening to diagnosis, treatment, and long-term care. Following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analysis) guidelines, 11 databases (Cochrane Central Register of Controlled Trials, CINAHL, Embase, IEEE Xplore, MEDLINE via Ovid, ProQuest Health & Medicine Collection, ScienceDirect, Scopus, Web of Science Core Collection, China National Knowledge Internet, and SinoMed) were searched on April 17, 2024. Intervention and simulation studies that examined LLMs in chronic disease management were included. The methodological quality of the included studies was evaluated using a rating rubric designed for simulation-based research and the risk of bias in nonrandomized studies of interventions tool for quasi-experimental studies. Narrative analysis with descriptive figures was used to synthesize the study findings. Random-effects meta-analyses were conducted to assess pooled effect estimates of the feasibility of LLMs in chronic disease management. A total of 20 studies examined general-purpose (n=17) and retrieval-augmented generation-enhanced (n=3) LLMs across chronic diseases, including cancer, cardiovascular diseases, and metabolic disorders. LLMs demonstrated feasibility across the management spectrum by generating relevant, comprehensible, and accurate recommendations (pooled accuracy rate 71%, 95% CI 0.59-0.83; I2=88.32%) and by achieving higher accuracy rates than comparators (odds ratio 2.89, 95% CI 1.83-4.58; I2=54.45%). LLMs facilitated equitable access to health information; increased patient awareness regarding ailments, preventive measures, and treatment options; and promoted self-management behaviors in lifestyle modification and symptom coping. Additionally, LLMs may facilitate compassionate emotional support, social connections, and access to care resources to improve health outcomes in chronic diseases. However, LLMs face challenges in addressing privacy, language, and cultural issues, and in undertaking complex tasks such as medication management, comorbidity management, and personalized regimens requiring real-time adjustments and multiple modalities. LLMs have the potential to transform chronic disease management at the individual, social, and systemic levels; however, their direct application in clinical settings is still in its infancy. A multifaceted approach that incorporates data security, domain-specific model fine-tuning, multimodal data integration, and wearables is crucial for the evolution of LLMs into invaluable adjuncts for health professionals. Trial registration: PROSPERO CRD42024545412; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024545412.
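For readers unfamiliar with the pooling behind figures such as the 71% accuracy rate and I2=88.32% above, the sketch below implements a standard DerSimonian-Laird random-effects pooling of proportions on the logit scale. It is a generic illustration with fabricated inputs, not a reproduction of this review's analysis.

```python
# Generic DerSimonian-Laird random-effects pooling of proportions on the
# logit scale. The study data are fabricated; this is an illustration of
# the method, not the review's actual computation.
import math

def dl_pool(events: list[int], totals: list[int]) -> tuple[float, float]:
    """Return (pooled proportion, I^2 in %) via DerSimonian-Laird."""
    # Logit effects with a 0.5 continuity correction and their variances.
    y = [math.log((e + 0.5) / (n - e + 0.5)) for e, n in zip(events, totals)]
    v = [1 / (e + 0.5) + 1 / (n - e + 0.5) for e, n in zip(events, totals)]
    w = [1 / vi for vi in v]                       # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    df = len(y) - 1
    tau2 = max(0.0, (q - df) / (sum(w) - sum(wi**2 for wi in w) / sum(w)))
    wr = [1 / (vi + tau2) for vi in v]             # random-effects weights
    pooled_logit = sum(wi * yi for wi, yi in zip(wr, y)) / sum(wr)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return 1 / (1 + math.exp(-pooled_logit)), i2

print(dl_pool([70, 55, 80], [100, 90, 100]))  # placeholder study data
```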
Journal of Pain Research, Journal Year: 2025, Volume and Issue: Volume 18, P. 1387 - 1405, Published: March 1, 2025
Large language models have been proposed as diagnostic aids across various medical fields, including dentistry. Burning mouth syndrome, characterized by burning sensations in the oral cavity without an identifiable cause, poses diagnostic challenges. This study explores the accuracy of large language models in identifying burning mouth syndrome and hypothesizes their potential limitations. Clinical vignettes of 100 synthesized burning mouth syndrome cases were evaluated using three large language models (ChatGPT-4o, Gemini Advanced 1.5 Pro, and Claude 3.5 Sonnet). Each vignette included patient demographics, symptoms, and medical history. The models were prompted to provide a primary diagnosis, differential diagnoses, and their reasoning. Accuracy was determined by comparing model responses with expert evaluations. ChatGPT-4o achieved an accuracy rate of 99%, while Gemini's was 89% (p < 0.001). Misdiagnoses included Persistent Idiopathic Facial Pain and diagnoses combined with inappropriate conditions. Differences were also observed in reasoning patterns and additional data requests across models. Despite high overall accuracy, the models exhibited variations in diagnostic approaches and occasional errors, underscoring the importance of clinician oversight. Limitations include the synthetic nature of the vignettes, over-reliance on exclusionary criteria, and challenges in differentiating overlapping disorders. The findings demonstrate that large language models hold strong potential as supplementary diagnostic tools, especially in settings lacking specialist expertise. However, their reliability depends on thorough clinical assessment and verification. Integrating these models into routine diagnostics could enhance early detection and management, ultimately improving clinical decision-making for dentists and specialists alike.
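One way to check a 99% versus 89% accuracy difference over 100 vignettes is a Fisher exact test, sketched below. The abstract does not state which test produced its p-value, and truly paired per-vignette outcomes would instead call for McNemar's test; this is an assumed, illustrative analysis.

```python
# Illustrative two-proportion comparison: 99/100 vs 89/100 correct
# diagnoses. The original paper does not specify its test; paired
# per-vignette outcomes would call for McNemar's test instead.
from scipy.stats import fisher_exact

table = [[99, 1],    # model A: correct, incorrect
         [89, 11]]   # model B: correct, incorrect
odds_ratio, p_value = fisher_exact(table)
print(f"OR = {odds_ratio:.1f}, p = {p_value:.4f}")
```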