Journal of Nursing Scholarship,
Journal year: 2024, Issue: unknown
Published: Nov. 24, 2024
Abstract
Aim
The aim of this study was to evaluate and compare artificial intelligence (AI)‐based large language models (LLMs) (ChatGPT‐3.5, Bing, Bard) with human‐based formulations in generating relevant clinical queries, using comprehensive methodological evaluations.
Methods
To interact with the major LLMs ChatGPT‐3.5, Bing Chat, and Google Bard, scripted prompts were designed to formulate PICOT (population, intervention, comparison, outcome, time) questions and search strategies. The quality of the responses was assessed using a descriptive approach and independent assessment by two researchers. To determine the number of hits, results from PubMed, Web of Science, Cochrane Library, and CINAHL Ultimate were imported separately, without restrictions, using the search strings generated by the three LLMs and an additional one formulated by an expert. Hits from the scenarios were also exported for relevance evaluation. The use of a single scenario was chosen to provide a focused analysis. Cronbach's alpha and the intraclass correlation coefficient (ICC) were calculated.
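For orientation only (not part of the study), the two agreement statistics named above can be computed for two raters as in the following Python sketch; the score matrix and the choice of the ICC(2,1) absolute-agreement form are illustrative assumptions, since the abstract does not state which ICC variant was used.

    import numpy as np

    # rows = rated items (e.g., generated queries), columns = the two raters;
    # the values are illustrative, not the study's data
    scores = np.array([
        [4, 5],
        [3, 3],
        [5, 5],
        [4, 4],
        [2, 3],
    ], dtype=float)

    n, k = scores.shape  # n items, k raters

    # Cronbach's alpha: k/(k-1) * (1 - sum of per-rater variances / variance of item totals)
    rater_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    alpha = k / (k - 1) * (1 - rater_var / total_var)

    # ICC(2,1), absolute agreement, from two-way ANOVA mean squares
    grand = scores.mean()
    msr = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # between items
    msc = n * ((scores.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # between raters
    resid = scores - scores.mean(axis=1, keepdims=True) - scores.mean(axis=0, keepdims=True) + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))
    icc21 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

    print(f"Cronbach's alpha = {alpha:.3f}, ICC(2,1) = {icc21:.3f}")

Values near 1 for either statistic indicate strong agreement between the two evaluators.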
Results
In five different scenarios, ChatGPT‐3.5 yielded 11,859 hits, Bing 1,376,854, Bard 16,583, and the expert 5919. We then used the first scenario to assess the obtained results. The human‐formulated search resulted in 65.22% (56/105) relevant articles. The most accurate AI‐based LLM achieved 70.79% (63/89), followed by 21.05% (12/45) and 13.29% (42/316) for the other two models. Based on the evaluators' assessment, the best‐performing formulation received the highest score (M = 48.50; SD = 0.71). The agreement statistics showed a high level of agreement between the evaluators. Although the percentage of relevant hits was lower for some formulations than for others, this reflects the nuanced evaluation criteria, in which the subjective assessment prioritized contextual accuracy and quality over mere relevance.
Conclusion
This study provides valuable insights into the ability of LLMs such as ChatGPT‐3.5, Bing Chat, and Google Bard to generate clinical queries. These models demonstrate significant potential for augmenting clinical workflows, improving query development, and supporting literature searches. However, the findings highlight limitations that necessitate further refinement and continued human oversight.
Clinical Relevance
AI could assist nurses in formulating clinical queries and offer support to healthcare professionals in structuring and enhancing search strategies, thereby significantly increasing the efficiency of information retrieval.
PLoS ONE,
Journal year: 2025, Issue: 20(1), p. e0317423
Published: Jan. 29, 2025
This study aims to evaluate the performance of the latest large language models (LLMs) in answering dental multiple choice questions (MCQs), including both text-based and image-based questions. A total of 1490 MCQs from two board review books for the United States National Board Dental Examination were selected. We evaluated six LLMs available as of August 2024: ChatGPT 4.0 omni (OpenAI), Gemini Advanced 1.5 Pro (Google), Copilot with GPT-4 Turbo (Microsoft), Claude 3.5 Sonnet (Anthropic), Mistral Large 2 (Mistral AI), and Llama 3.1 405b (Meta). χ2 tests were performed to determine whether there were significant differences in the percentages of correct answers among the LLMs for the total sample and for each discipline (p < 0.05). Significant differences were observed in the percentage of accurate answers among the LLMs across the questions (p < 0.001). For the total sample, the three best-performing models (85.5%, 84.0%, and 83.8%) demonstrated the highest accuracy, followed by two models at 78.3% and 77.1%, with the lowest-performing model (72.4%) exhibiting the lowest accuracy. Newer model versions demonstrated superior performance compared with earlier versions. Copilot and Claude achieved high accuracy on image-based questions, while some models showed only limited capability in handling them. Clinicians and students should prioritize the most up-to-date LLMs when using them to support their learning, clinical practice, and research.
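As a side note (not from the paper), the kind of χ2 test described in this abstract operates on a contingency table of correct versus incorrect answer counts per model; the counts and model labels below are purely hypothetical.

    from scipy.stats import chi2_contingency

    # rows = models, columns = [correct, incorrect] answer counts (hypothetical)
    counts = [
        [1274, 216],  # "model A"
        [1252, 238],  # "model B"
        [1079, 411],  # "model C"
    ]

    chi2, p_value, dof, expected = chi2_contingency(counts)
    print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4g}")

A small p-value indicates that the proportion of correct answers is unlikely to be the same across the models.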
BMC Medical Education,
Journal year: 2025, Issue: 25(1)
Published: Feb. 10, 2025
AI-powered chatbots have spread to various fields, including dental education and clinical assistance in treatment planning. The aim of this study is to assess and compare the performances of leading chatbots on the dental specialization exam (DUS) administered in Turkey and to compare them with the best performer of that year. DUS questions for 2020 and 2021 were directed to ChatGPT-4.0 and Gemini Advanced individually. The questions were manually entered in their original form, in Turkish. The results obtained were compared with each other and with each year's best performers. Candidates who score at least 45 points on this centralized exam are deemed to have passed and are eligible to select their preferred department and institution. The data were statistically analyzed using Pearson's chi-squared test (p < 0.05). ChatGPT-4.0 received an 83.3% correct response rate on one exam, while Gemini Advanced received a 65% rate; on the other exam, ChatGPT-4.0 achieved an 80.5% rate, whereas Gemini Advanced achieved 60.2%. ChatGPT-4.0 outperformed Gemini Advanced on both exams, but both chatbots performed worse overall (for 2020: ChatGPT-4.0, 65.5 and Gemini Advanced, 50.1; for 2021: 65.6 and 48.6) when compared with the scores of the best performers of each year (68.5 for 2020 and 72.3 for 2021). This poorer performance also includes the basic sciences sections (p < 0.001). Additionally, periodontology was the specialty in which the chatbots achieved the best results, while the lowest performance was determined in endodontics and orthodontics. Both chatbots, namely ChatGPT-4.0 and Gemini Advanced, passed by exceeding the threshold of 45. However, they still lagged behind the top performers of each year, particularly in the basic sciences score, and exhibited lower performance in some specialties.
Dental Traumatology,
Journal year: 2024, Issue: unknown
Published: Nov. 22, 2024
ABSTRACT
Background/Aim
Artificial intelligence (AI) chatbots have become increasingly prevalent in recent years as potential sources of online healthcare information for patients when making medical/dental decisions. This study assessed the readability, quality, and accuracy of responses provided by three AI chatbots to questions related to traumatic dental injuries (TDIs), either retrieved from popular question‐answer sites or manually created based on hypothetical case scenarios.
Materials and Methods
A total of 59 traumatic dental injury queries were directed at ChatGPT 3.5, ChatGPT 4.0, and Google Gemini. Readability was evaluated using the Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL) scores. To assess response quality and accuracy, the DISCERN tool, the Global Quality Score (GQS), and misinformation scores were used. The understandability and actionability of the responses were analyzed with the Patient Education Materials Assessment Tool for Printable Materials (PEMAT‐P). Statistical analysis included the Kruskal–Wallis test with Dunn's post hoc test for non‐normal variables and one‐way ANOVA with Tukey's post hoc test for normal variables (p < 0.05).
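For reference, the two readability indices used in the Methods are the standard Flesch formulas (general definitions, not taken from this paper):

    \mathrm{FRE} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)

    \mathrm{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59

Lower FRE and higher FKGL values indicate harder text; FRE scores in the 30-50 range correspond roughly to college-level reading material.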
Results
The mean FKGL and FRE scores for ChatGPT 3.5, ChatGPT 4.0, and Google Gemini were 11.2 and 49.25, 11.8 and 46.42, and 10.1 and 51.91, respectively, indicating that the responses were difficult to read and required a college‐level reading ability. ChatGPT 3.5 had the lowest PEMAT‐P scores among the three chatbots (p < 0.001). ChatGPT 4.0 was rated higher (GQS score of 5) compared with the other chatbots.
Conclusions
In this study, although the chatbots are widely used, some of their responses were misleading and inaccurate regarding TDIs. In contrast, the better‐performing models generated more accurate and comprehensive answers, making them reliable auxiliary information sources. However, for complex issues like TDIs, no chatbot can replace the dentist in diagnosis, treatment, and follow‐up care.
Pediatric Pulmonology,
Journal year: 2025, Issue: 60(3)
Published: March 1, 2025
To evaluate the accuracy and comprehensiveness of eight free, publicly available large language model (LLM) chatbots in addressing common questions related to chronic neonatal lung disease (CNLD) and home oxygen therapy (HOT). Twenty CNLD and HOT-related questions were curated across nine domains. Responses from ChatGPT-3.5, Google Bard, Bing Chat, Claude 3.5 Sonnet, ERNIE Bot 3.5, and GLM-4 were generated and evaluated by three experienced neonatologists using Likert scales for accuracy and comprehensiveness. Updated LLM models (ChatGPT-4o mini and Gemini 2.0 Flash Experimental) were later incorporated to assess rapid technological advancement. Statistical analyses included ANOVA, Kruskal-Wallis tests, and intraclass correlation coefficients. Bing Chat and Claude 3.5 Sonnet demonstrated superior performance, with the highest mean scores (5.78 ± 0.48 and 5.75 ± 0.54, respectively) and competence ratings (2.65 ± 0.58 and 2.80 ± 0.41, respectively). In subsequent testing, Gemini 2.0 Flash Experimental and ChatGPT-4o achieved comparably high performance. Performance varied across domains, with all models excelling in "equipment safety protocols" and "caregiver support." The models showed self-correction capabilities when prompted. LLMs show promise in providing accurate CNLD/HOT information. However, performance variability and the risk of misinformation necessitate expert oversight and continued refinement before widespread clinical implementation.
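As an aside (not from the paper), the parametric and non-parametric group comparisons named in this abstract can be sketched in Python as follows; the Likert ratings and chatbot labels are hypothetical.

    from scipy import stats

    # hypothetical 1-6 Likert accuracy ratings pooled over questions and raters
    ratings = {
        "chatbot_a": [6, 5, 6, 6, 5, 6, 5, 6],
        "chatbot_b": [5, 6, 6, 5, 6, 5, 6, 6],
        "chatbot_c": [4, 3, 5, 4, 4, 3, 5, 4],
    }
    groups = list(ratings.values())

    # parametric comparison (assumes approximately normal data)
    f_stat, p_anova = stats.f_oneway(*groups)

    # non-parametric alternative for ordinal or non-normal data
    h_stat, p_kw = stats.kruskal(*groups)

    print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")
    print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")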
International Dental Journal,
Journal year: 2024, Issue: unknown
Published: Oct. 1, 2024
Infective endocarditis (IE) is a serious, life-threatening condition requiring antibiotic prophylaxis for high-risk individuals undergoing invasive dental procedures. As LLMs are rapidly adopted by professionals for their efficiency and accessibility, assessing their accuracy in answering critical questions about IE prevention is crucial.
Dental Traumatology,
Journal year: 2025, Issue: unknown
Published: Jan. 23, 2025
ABSTRACT
Background/Aim
The use of AI‐driven chatbots for accessing medical information is increasingly popular among educators and students. This study aims to assess two different ChatGPT models—ChatGPT 3.5 and 4.0—regarding their responses to queries about traumatic dental injuries, specifically for students and professionals.
Material and Methods
A total of 40 questions were prepared, divided equally between those concerning definitions and diagnosis and those on treatment and follow‐up. Responses from both versions were evaluated against several criteria: quality, reliability, similarity, and readability. These evaluations were conducted using the Global Quality Scale (GQS), the Reliability Scoring System (adapted from DISCERN), the Flesch Reading Ease Score (FRES), the Flesch–Kincaid Grade Level (FKRGL), and the Similarity Index. Normality was checked with the Shapiro–Wilk test, and variance homogeneity was assessed with the Levene test.
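As an illustration (not from the paper), the normality and variance-homogeneity checks named above can be run as follows before deciding between parametric and non-parametric comparisons; the GQS ratings are hypothetical.

    from scipy import stats

    gqs_gpt35 = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]  # illustrative ratings
    gqs_gpt40 = [5, 5, 4, 5, 4, 5, 5, 4, 5, 4]

    # Shapiro-Wilk test of normality for each group
    _, p_35 = stats.shapiro(gqs_gpt35)
    _, p_40 = stats.shapiro(gqs_gpt40)

    # Levene test of equal variances across the two groups
    _, p_levene = stats.levene(gqs_gpt35, gqs_gpt40)

    print(f"Shapiro-Wilk p-values: GPT-3.5 = {p_35:.3f}, GPT-4.0 = {p_40:.3f}")
    print(f"Levene p-value: {p_levene:.3f}")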
Results
The analysis revealed that ChatGPT 3.5 provided more original responses compared with 4.0. According to the FRES scores, both models were challenging to read, with 3.5 having a higher score (39.732 ± 9.713) than 4.0 (34.813 ± 9.356), indicating relatively better readability. There were no significant differences between the models regarding the GQS, DISCERN, and FKRGL scores. However, in the definition section, 4.0 had statistically higher quality than 3.5. In contrast, 3.5 provided better answers in the follow‐up section. For 4.0, readability and similarity rates differed by section, while no differences were observed in 3.5's FRES, FKRGL, and Similarity Index measurements by topic.
Conclusions
Both models offer high‐quality information, though they present challenges in terms of reliability. They are valuable resources for students and professionals but should be used in conjunction with additional sources for a comprehensive understanding.