BMC Medical Informatics and Decision Making, Journal Year: 2024, Volume and Issue: 24(1), Published: Nov. 26, 2024
The large language models (LLMs), most notably ChatGPT, released since November 30, 2022, have prompted shifting attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated. We performed a literature review of PubMed to identify publications between December 1, 2022, and April 2024 that discussed assessments of LLM-generated diagnoses or treatment plans. We selected 108 relevant articles for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency". The criteria for defining high-quality LLM output have not been applied consistently by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of the qualitative evaluation metrics used to assess output quality can be developed to facilitate LLM research in healthcare.
International Journal of Eating Disorders, Journal Year: 2025, Volume and Issue: unknown, Published: Jan. 19, 2025
ABSTRACT
Objective: Artificial intelligence (AI) could revolutionize the delivery of mental health care, helping to streamline clinician workflows and assist with diagnostic and treatment decisions. Yet, before AI can be integrated into practice, it is necessary to understand the perspectives of those who would use these tools, to inform the facilitators of and barriers to their uptake. We gathered data on clinician and community participant perspectives on incorporating AI in the clinical management of eating disorders.
Method: A survey was distributed internationally to clinicians (n = 116) with experience in eating disorder treatment (psychologists, psychiatrists, etc.) and to community participants (n = 155) who reported the occurrence of eating disorder behaviors.
Results: 59% of clinicians reported using AI systems (most commonly ChatGPT) for professional reasons, compared with 18% of community participants using them for help-related purposes. While more than half of clinicians (58%) and community participants (53%) were open to AI tools helping support them, fewer were enthusiastic about their integration (40% and 27%, respectively) or believed that they would significantly improve client outcomes (28% and 13%, respectively). Nine in 10 agreed that AI may be improperly used if individuals are not adequately trained, and that it may pose new privacy and security concerns. Most believed AI will be convenient and beneficial for administrative tasks, and an avenue for continuous support, but will never outperform human relational skills.
Conclusion: While many recognize AI's possible wide-ranging benefits, most remain cautious and uncertain about its implementation.
JAMA Network Open, Journal Year: 2025, Volume and Issue: 8(2), P. e2457879 - e2457879, Published: Feb. 4, 2025
Importance: There is much interest in the clinical integration of large language models (LLMs) in health care. Many studies have assessed the ability of LLMs to provide health advice, but the quality of their reporting is uncertain.
Objective: To perform a systematic review to examine the reporting variability among peer-reviewed studies evaluating the performance of generative artificial intelligence (AI)–driven chatbots for summarizing evidence and providing health advice, to inform the development of the Chatbot Assessment Reporting Tool (CHART).
Evidence Review: A search of MEDLINE via Ovid, Embase via Elsevier, and Web of Science from inception to October 27, 2023, was conducted with the help of a health sciences librarian to yield 7752 articles. Two reviewers screened articles by title and abstract followed by full-text review to identify primary studies evaluating the clinical accuracy of generative AI-driven chatbots (chatbot studies). Two reviewers then performed data extraction for the 137 eligible studies.
Findings: A total of 137 studies were included. Studies examined topics in surgery (55 [40.1%]), medicine (51 [37.2%]), and primary care (13 [9.5%]). Many studies focused on treatment (91 [66.4%]), diagnosis (60 [43.8%]), or disease prevention (29 [21.2%]). Most studies (136 [99.3%]) evaluated inaccessible, closed-source LLMs and did not provide enough information on the version of the LLM under evaluation. All studies lacked a sufficient description of LLM characteristics, including temperature, token length, fine-tuning availability, layers, and other details. Most did not describe the prompt engineering phase of the study. The date of LLM querying was reported in 54 (39.4%) studies. Most studies (89 [65.0%]) used subjective means to define the successful performance of the chatbot, while less than one-third addressed the ethical, regulatory, and patient safety implications of the clinical integration of LLMs.
Conclusions and Relevance: In this systematic review of 137 chatbot studies, reporting was heterogeneous and may inform the development of the CHART reporting standards. Ethical, regulatory, and patient safety considerations are crucial as interest in the clinical integration of these chatbots grows.
Connection Science, Journal Year: 2024, Volume and Issue: 36(1), Published: May 16, 2024
In 2022, OpenAI's unveiling of its generative AI Large Language Model (LLM), ChatGPT, heralded a significant leap forward in human-machine interaction through cutting-edge technologies. With its surging popularity, scholars across various fields have begun to delve into the myriad applications of ChatGPT. While existing literature reviews on LLMs like ChatGPT are available, there is a notable absence of systematic literature reviews (SLRs) and bibliometric analyses assessing the research's multidisciplinary and geographical breadth. This study aims to bridge this gap by synthesising and evaluating how ChatGPT has been integrated into diverse research areas, focussing on the scope and distribution of studies. Through a systematic review of scholarly articles, we chart the global utilisation of ChatGPT across scientific domains, exploring its contribution to advancing research paradigms and adoption trends among different disciplines. Our findings reveal widespread endorsement of ChatGPT across multiple fields, with implementations concentrated in healthcare (38.6%), computer science/IT (18.6%), and education/research (17.3%). Moreover, our demographic analysis underscores ChatGPT's reach and accessibility, indicating participation from 80 unique countries in ChatGPT-related research, with the USA (719), China (181), and India (157) leading contributions by most frequent keyword occurrence. Additionally, the study highlights the roles of institutions such as King Saud University, the All India Institute of Medical Sciences, and Taipei Medical University in pioneering work within the dataset. This study not only sheds light on the vast opportunities and challenges posed by ChatGPT-related pursuits but also acts as a pivotal resource for future inquiries. It emphasises the large language model's (LLM) role in revolutionising nearly every field. The insights provided in this paper are particularly valuable for academics, researchers, and practitioners across disciplines, as well as policymakers looking to grasp the extensive impact of these technologies on the research community.
JMIR Mental Health, Journal Year: 2024, Volume and Issue: 11, P. e57400 - e57400, Published: Sept. 3, 2024
Background: Large language models (LLMs) are advanced artificial neural networks trained on extensive datasets to accurately understand and generate natural language. While they have received much attention and demonstrated potential in digital health, their application in mental health, particularly in clinical settings, has generated considerable debate.
Objective: This systematic review aims to critically assess the use of LLMs in mental health, specifically focusing on their applicability and efficacy in early screening, digital interventions, and clinical settings. By systematically collating and assessing the evidence from current studies, our work analyzes models, methodologies, data sources, and outcomes, thereby highlighting the challenges present and the prospects for use.
Methods: Adhering to PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, this review searched 5 open-access databases: MEDLINE (accessed by PubMed), IEEE Xplore, Scopus, JMIR, and ACM Digital Library. Keywords used were (mental health OR mental illness OR mental disorder OR psychiatry) AND (large language models). This study included articles published between January 1, 2017, and April 30, 2024, and excluded articles in languages other than English.
Results: In total, 40 articles were evaluated, including 15 (38%) on detecting mental health conditions and suicidal ideation through text analysis, 7 (18%) on the use of LLMs as conversational agents, and 18 (45%) on other applications and evaluations of LLMs in mental health. LLMs show good effectiveness in detecting mental health issues and providing accessible, destigmatized eHealth services. However, assessments also indicate that the current risks associated with their application might surpass the benefits. These risks include inconsistencies in generated text; the production of hallucinations; and the absence of a comprehensive, benchmarked ethical framework.
Conclusions: This systematic review examines the applications of LLMs in mental health and their inherent risks. The study identifies several issues: the lack of multilingual datasets annotated by experts, concerns regarding the accuracy and reliability of generated content, challenges in interpretability due to the “black box” nature of LLMs, and ongoing ethical dilemmas. These include the lack of a clear, benchmarked ethical framework; data privacy issues; and the risk of overreliance on LLMs by both physicians and patients, which could compromise traditional medical practices. As a result, LLMs should not be considered substitutes for professional mental health care. However, their rapid development underscores their potential as valuable aids, emphasizing the need for continued research in this area.
Trial Registration: PROSPERO CRD42024508617; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=508617
Frontiers in Psychiatry, Journal Year: 2024, Volume and Issue: 15, Published: June 24, 2024
Background: With their unmatched ability to interpret and engage with human language and context, large language models (LLMs) hint at the potential to bridge AI and human cognitive processes. This review explores the current application of LLMs, such as ChatGPT, in the field of psychiatry.
Methods: We followed PRISMA guidelines and searched through PubMed, Embase, Web of Science, and Scopus, up until March 2024.
Results: From 771 retrieved articles, we included 16 that directly examine LLMs’ use in psychiatry. The LLMs, particularly ChatGPT and GPT-4, showed diverse applications in clinical reasoning, social media, and education within psychiatry. They can assist in diagnosing mental health issues, managing depression, evaluating suicide risk, and supporting education in the field. However, our review also points out their limitations, such as difficulties with complex cases and potential underestimation of suicide risks.
Conclusion: Early research on LLMs in psychiatry reveals versatile applications, from diagnostic support to educational roles. Given the rapid pace of advancement, future investigations are poised to explore the extent to which these models might redefine traditional roles in mental health care.
Informatics, Journal Year: 2025, Volume and Issue: 12(1), P. 9 - 9, Published: Jan. 17, 2025
The rapid advancement of large language models like ChatGPT has significantly impacted natural language processing, expanding its applications across various fields, including healthcare. However, there remains a significant gap in understanding the consistency and reliability of ChatGPT’s performance across different medical domains. We conducted this systematic review according to an LLM-assisted PRISMA setup. The high-recall search term “ChatGPT” yielded 1101 articles from 2023 onwards. Through a dual-phase screening process, initially automated and subsequently conducted manually by human reviewers, 128 studies were included. The studies covered a range of medical specialties, focusing on diagnosis, disease management, and patient education. The assessment metrics varied, but most studies compared ChatGPT’s accuracy against evaluations by clinicians or reliable references. In several areas, ChatGPT demonstrated high accuracy, underscoring its effectiveness; in some contexts, however, the studies revealed lower accuracy. The mixed outcomes across medical domains emphasize both the challenges and the opportunities of integrating AI into healthcare. The high accuracy in certain areas suggests that ChatGPT has substantial utility, yet the inconsistent performance across all domains indicates the need for ongoing evaluation and refinement. This review highlights ChatGPT’s potential to improve healthcare delivery alongside the necessity for continued research to ensure reliability.
JMIR Medical Education, Journal Year: 2024, Volume and Issue: 10, P. e54067 - e54067, Published: April 4, 2024
Abstract
Background: Undergraduate medical studies represent a wide range of learning opportunities served in the form of various teaching-learning modalities for medical learners. A clinical scenario is frequently used as a modality, followed by multiple-choice and open-ended questions, among other teaching methods. As such, script concordance tests (SCTs) can be used to promote a higher level of clinical reasoning. Recent technological developments have made generative artificial intelligence (AI)–based systems such as ChatGPT (OpenAI) available to assist clinician-educators in creating instructional materials.
Objective: The main objective of this project was to explore how SCTs generated by ChatGPT compared to SCTs produced by experts on 3 major elements: the clinical scenario (stem), the questions, and the expert opinion.
Methods: This mixed method study evaluated ChatGPT-generated SCTs compared with expert-created SCTs using a predefined framework. Clinician-educators as well as resident doctors in psychiatry involved in undergraduate medical education in Quebec, Canada, evaluated the SCTs via a web-based survey on 6 criteria covering the clinical scenario, the questions, and the expert opinion. They were also asked to describe the strengths and weaknesses of the SCTs.
Results: A total of 102 respondents assessed the SCTs. There were no significant distinctions between the 2 types of SCTs concerning the clinical scenario (P=.84), the questions (P=.99), and the expert opinion (P=.07), as interpreted by the respondents. Indeed, respondents struggled to differentiate between ChatGPT- and expert-generated SCTs. ChatGPT showcased promise in expediting SCT design, aligning with Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition criteria, albeit with a tendency toward caricatured scenarios and simplistic content.
Conclusions: This study is the first to concentrate on SCT design supported by AI, in a period when medicine is changing swiftly and when technologies derived from AI are expanding much faster. It suggests that ChatGPT can be a valuable tool in creating educational materials, and further validation is essential to ensure its efficacy and accuracy.
Journal of Pediatric Orthopaedics, Journal Year: 2024, Volume and Issue: 44(7), P. e592 - e597, Published: April 30, 2024
Objective: Chat generative pre-trained transformer (ChatGPT) has garnered attention in health care for its potential to reshape patient interactions. As patients increasingly rely on artificial intelligence platforms, concerns about information accuracy arise. In-toeing, a common lower extremity variation, often leads to pediatric orthopaedic referrals despite observation being the primary treatment. Our study aims to assess ChatGPT’s responses to in-toeing questions, contributing to discussions on innovation and technology in patient education.
Methods: We compiled a list of 34 questions from the “Frequently Asked Questions” sections of 9 health care–affiliated websites, identifying 25 as the most frequently encountered. On January 17, 2024, we queried ChatGPT 3.5 with these questions in separate sessions and recorded the responses. These questions were posed again on January 21, 2024, to assess reproducibility. Two pediatric orthopaedic surgeons evaluated the responses using a scale ranging from “excellent (no clarification)” to “unsatisfactory (substantial clarification).” Average ratings were used when the evaluators’ grades were within one level of each other. In discordant cases, the senior author provided a decisive rating.
Results: We found 46% of responses were “excellent” and 44% “satisfactory (minimal clarification).” In addition, 8% of cases were “satisfactory (moderate clarification)” and 2% were “unsatisfactory.” Questions had appropriate readability, with an average Flesch-Kincaid Grade Level of 4.9 (±2.1). However, responses were at a collegiate level, averaging 12.7 (±1.4). No significant differences were observed between question topics. Furthermore, ChatGPT exhibited moderate consistency after repeated queries, evidenced by a Spearman rho coefficient of 0.55 (P = 0.005). The chatbot appropriately described in-toeing as normal or spontaneously resolving in 62% of cases and consistently recommended evaluation by a health care provider in 100%.
Conclusion: ChatGPT presented a serviceable, though not perfect, representation of the diagnosis and management of in-toeing while demonstrating moderate reproducibility. Its utility could be enhanced by improving readability and incorporating evidence-based guidelines. Level of Evidence: IV—diagnostic.
SAGE Open Medicine, Journal Year: 2024, Volume and Issue: 12, Published: Jan. 1, 2024
Objectives: ChatGPT is an advanced chatbot based on a Large Language Model that has the ability to answer questions. Undoubtedly, it is capable of transforming communication, education, and customer support; however, can it play the role of a doctor? In Poland, prior to obtaining a medical diploma, candidates must successfully pass the Medical Final Examination.
Methods: The purpose of this research was to determine how well ChatGPT performed on the Polish Medical Final Examination, the passing of which is required to become a doctor in Poland (an exam is considered passed if at least 56% of the tasks are answered correctly). A total of 2138 categorized Medical Final Examination questions (from 11 examination sessions held between 2013–2015 and 2021–2023) were presented to ChatGPT-3.5 from 19 to 26 May 2023. For further analysis, the questions were divided into quintiles of difficulty and duration, as well as question types (simple A-type or complex K-type). The answers provided by ChatGPT were compared with the official answer key and reviewed for any changes resulting from the advancement of medical knowledge.
Results: ChatGPT correctly answered 53.4%–64.9% of the questions and in 8 out of the 11 sessions achieved passing scores (60% on average). The correlation between the efficacy of the artificial intelligence and the level of question complexity, difficulty, and length was found to be negative. The AI outperformed humans in one category: psychiatry (77.18% vs. 70.25%, p = 0.081).
Conclusions: ChatGPT's performance was deemed satisfactory; however, the scores observed were markedly inferior to those of human graduates in the majority of instances. Despite its potential utility in many areas, ChatGPT is constrained by inherent limitations that prevent it from entirely supplanting human medical expertise.