Journal of Evaluation in Clinical Practice,
Journal Year:
2024,
Volume and Issue:
30(6), P. 1017 - 1023
Published: May 19, 2024
ChatGPT, a large-scale language model, is a notable example of AI's potential in health care. However, its effectiveness in clinical settings, especially when compared to human physicians, is not fully understood. This study evaluates ChatGPT's capabilities and limitations in answering questions designed for Japanese internal medicine specialists, aiming to clarify its accuracy and the tendencies of both its correct and incorrect responses.
European Journal Of Dental Education,
Journal Year:
2025,
Volume and Issue:
unknown
Published: Jan. 31, 2025
This study aimed to simulate diverse scenarios of students employing LLMs to prepare for the CDLE examination, providing a detailed evaluation of their performance in medical education. A stratified random sampling strategy was implemented to select, and subsequently revise, 200 questions from the CDLE. Seven LLMs recognised for exceptional performance in the Chinese-language domain were selected as test subjects. Three distinct testing scenarios were constructed: answering questions, explaining answers, and adversarial testing. The metrics included accuracy, agreement rate, and teaching effectiveness score. Wald χ² tests and Kruskal-Wallis tests were employed to determine whether differences among the models, across the various scenarios and before and after adversarial testing, were statistically significant.
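As a rough illustration of these comparisons, the sketch below applies a Kruskal-Wallis test to Likert-style teaching-effectiveness scores and a chi-squared contingency test to accuracy counts before and after adversarial rewording. All data and groupings are invented, and a standard Pearson chi-squared test stands in for the Wald χ² test reported by the authors.

```python
# Hypothetical illustration of the statistical comparisons described above.
from scipy.stats import kruskal, chi2_contingency

# Invented Likert scores (1-5) awarded to three models' explanations
scores_model_a = [5, 4, 4, 5, 3, 4]
scores_model_b = [3, 4, 3, 2, 4, 3]
scores_model_c = [4, 4, 5, 4, 4, 5]

h_stat, p_likert = kruskal(scores_model_a, scores_model_b, scores_model_c)
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_likert:.3f}")

# Invented correct/incorrect counts before vs. after adversarial rewording
#                 correct  incorrect
contingency = [[162, 38],   # before adversarial testing
               [130, 70]]   # after adversarial testing
chi2, p_acc, dof, _ = chi2_contingency(contingency)
print(f"Chi-squared: chi2={chi2:.2f}, dof={dof}, p={p_acc:.3f}")
```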
The majority of the tested LLMs met the passing threshold on the benchmark, with Doubao-pro 32k and Qwen2-72b (81%) achieving the highest accuracy rates. The models demonstrated up to 98% agreement with the reference answers when generating explanations. Although significant differences existed in teaching-effectiveness scores based on a Likert scale, all these models showed a commendable ability to deliver comprehensible and effective instructional content. In adversarial testing, GPT-4 exhibited the smallest decline in accuracy (2%, p = 0.623), while ChatGLM-4 showed the greatest reduction (14.6%, p = 0.001). LLMs trained on Chinese corpora, such as Doubao-pro 32k, achieved superior accuracy compared with other models, although some comparisons showed no significant difference. During adversarial testing, however, their performance diminished, with GPT-4 displaying comparatively greater robustness. Future research should further investigate the interpretability of LLM outputs and develop strategies to mitigate hallucinations in generated content.
European Journal of Therapeutics,
Journal Year:
2025,
Volume and Issue:
31(1), P. 35 - 43
Published: Feb. 28, 2025
Objective: Large language models (LLMs), such as ChatGPT, Gemini, and Copilot, have garnered significant attention across various domains, including education. Their application is becoming increasingly prevalent, particularly in medical education, where rapid access to accurate, up-to-date information is imperative. This study aimed to assess the validity, accuracy, and comprehensiveness of utilizing LLMs for the preparation of lecture notes for medical school anatomy courses.
Methods: The study evaluated the performance of four large language models—ChatGPT-4o, ChatGPT-4o-Mini, Gemini, and Copilot—in generating anatomy lecture notes for medical students. In the first phase, lecture notes produced by these models using identical prompts were compared with a widely used anatomy textbook through thematic analysis, assessing relevance and alignment with standard educational materials. In the second phase, the generated content was evaluated using content validity index (CVI) analysis. The threshold values for S-CVI/Ave and S-CVI/UA were set at 0.90 and 0.80, respectively, to determine the acceptability of the content.
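For readers unfamiliar with these metrics, the minimal sketch below shows how S-CVI/Ave and S-CVI/UA are conventionally computed from expert relevance ratings. The rating matrix is hypothetical and not taken from the study.

```python
# Minimal sketch of the content validity index (CVI) metrics used above.
# Rows = items, columns = expert ratings on the usual 4-point relevance
# scale, where a rating of 3 or 4 counts as "relevant".
ratings = [
    [4, 4, 3, 4],  # item 1
    [3, 4, 4, 3],  # item 2
    [2, 4, 3, 4],  # item 3
    [4, 3, 4, 4],  # item 4
]

# I-CVI: proportion of experts rating the item relevant (>= 3)
i_cvis = [sum(r >= 3 for r in item) / len(item) for item in ratings]

# S-CVI/Ave: mean of the item-level I-CVIs
s_cvi_ave = sum(i_cvis) / len(i_cvis)

# S-CVI/UA: proportion of items with universal agreement (I-CVI == 1.0)
s_cvi_ua = sum(v == 1.0 for v in i_cvis) / len(i_cvis)

print(f"S-CVI/Ave = {s_cvi_ave:.3f} (threshold 0.90)")
print(f"S-CVI/UA  = {s_cvi_ua:.3f} (threshold 0.80)")
```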
Results: ChatGPT-4o demonstrated the highest performance, achieving a theme success rate of 94.6% and a subtheme success rate of 76.2%. ChatGPT-4o-Mini followed, with rates of 89.2% and 62.3%, respectively. Copilot achieved moderate results, with 91.8% and 54.9%, while Gemini showed the lowest rates, 86.4% and 52.3%. In the Content Validity Index analysis, ChatGPT-4o again outperformed the other models, exceeding both thresholds with an S-CVI/Ave value of 0.943 and an S-CVI/UA value of 0.857. ChatGPT-4o-Mini approached but fell slightly short of the thresholds (S-CVI/Ave 0.800; S-CVI/UA 0.714). The remaining two models, Gemini and Copilot, exhibited significantly lower CVI results (0.486 and 0.286 for one; 0.286 and 0.143 for the other).
Conclusion: The LLMs were assessed using two distinct methods, revealing that ChatGPT-4o performed best in both evaluations. These results suggest that educators and students could benefit from adopting ChatGPT-4o as a supplementary tool for lecture-note generation. Conversely, models like Gemini and Copilot require further improvements to meet the standards necessary for reliable educational use.
MedEdPublish,
Journal Year:
2025,
Volume and Issue:
15, P. 11 - 11
Published: March 26, 2025
Background Generative AI (GenAI) such as ChatGPT can take over tasks that previously could only be done by humans. Although GenAI provides many educational opportunities, it also poses risks of invalid assessments and irrelevant learning outcomes. This article presents a broadly applicable method to (1) determine current assessment validity, (2) assess which learning outcomes are impacted by student use of GenAI, and (3) decide whether to alter assessment formats and/or learning outcomes. The method is exemplified by a case study on our medical informatics curriculum.
Methods We developed a five-step method to evaluate and address the impact of GenAI. In a collaborative manner, courses in the curriculum were analysed and their assessment plans were adapted, together with the teachers, to GenAI usage.
Results 57% of assessments, especially those involving writing and programming, were at risk of reduced validity and relevance. Risk was more closely related to assessment content and structure than to complexity according to Bloom’s taxonomy. During retreats, lecturers discussed the relevance of learning outcomes and whether students should be able to achieve them with or without GenAI. Furthermore, the results led to a plan to increase GenAI literacy across the years of study. Subsequently, course coordinators were asked to either adjust assessments to preclude GenAI use, or to include GenAI literacy. For 64% of the at-risk assessments the format was adjusted, and for 36% the learning outcomes were adapted.
Conclusion The majority of assessments were at risk of reduced validity and relevance of learning outcomes, leading us to adapt assessment formats and learning outcomes. Our five-step method offers a potential blueprint for institutions facing similar challenges.
Cureus,
Journal Year:
2025,
Volume and Issue:
unknown
Published: April 2, 2025
Introduction Artificial intelligence (AI) chatbots have been widely tested on their performance on various examinations, but with limited data on clinical scenarios. The role of Chat Generative Pre-Trained Transformer (ChatGPT) (OpenAI, San Francisco, California, United States) and Gemini Advanced (Google LLC, Mountain View, California, United States) in multiple aspects of gastroenterology, including answering patient questions, providing medical advice, and serving as tools to potentially assist healthcare providers, has shown some promise, though with many associated limitations. We aimed to study the performance of ChatGPT-4.0, ChatGPT-3.5, and Gemini Advanced across 20 clinicopathologic scenarios in the relatively unexplored realm of gastrointestinal pathology.
Materials and methods Twenty clinicopathological gastrointestinal pathology scenarios were provided to these three large language models. Two fellowship-trained pathologists independently assessed the responses, evaluating both diagnostic accuracy and confidence; these results were then compared using the chi-squared test. We also evaluated each model's ability in four key areas, namely, to (1) provide differential diagnoses, (2) interpret immunohistochemical stains, (3) deliver a concise final diagnosis, and (4) explain its thought process, using a five-point scoring system. The mean, the median score±standard deviation (SD), and the interquartile ranges were calculated. A comparative analysis of these parameters was conducted using the Mann-Whitney U test; a p-value <0.05 was considered statistically significant. Other parameters assessed included tumor, node, metastasis (TNM) staging and the incidence of pseudo-references, or "hallucinations," while citing reference material.
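The sketch below illustrates the kind of comparison described: summary statistics and a Mann-Whitney U test on two models' five-point ratings. All scores are invented for illustration.

```python
# Hypothetical illustration of the rating comparison described above.
import numpy as np
from scipy.stats import mannwhitneyu

# Invented five-point scores for one rated area (e.g., IHC interpretation)
model_a = np.array([4, 3, 4, 5, 3, 4, 4, 3, 5, 4])
model_b = np.array([3, 2, 3, 3, 4, 2, 3, 3, 2, 3])

# Mean, median, SD, and IQR, as reported in the study
for name, scores in {"model A": model_a, "model B": model_b}.items():
    q1, q3 = np.percentile(scores, [25, 75])
    print(f"{name}: mean={scores.mean():.2f}, median={np.median(scores):.1f}, "
          f"SD={scores.std(ddof=1):.2f}, IQR={q1:.1f}-{q3:.1f}")

# Nonparametric comparison of the two score distributions
u_stat, p_value = mannwhitneyu(model_a, model_b, alternative="two-sided")
print(f"Mann-Whitney U={u_stat:.1f}, p={p_value:.4f}")
```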
Results Gemini Advanced (diagnostic accuracy: p=0.01; final diagnosis: p=0.03) and ChatGPT-4.0 (interpretation of immunohistochemistry (IHC) stains: p=0.001; final diagnosis: p=0.002) performed significantly better in certain realms than ChatGPT-3.5, indicating continuously improving training sets. However, the mean performances ranged between 3.0 and 3.7, at best classified as average. None of the models could provide accurate TNM staging for the scenarios, and 25-50% of the cited references did not exist (hallucinations).
Conclusion This study indicated that while these models are evolving, they need human supervision and definite improvements before being used in clinical medicine. To our knowledge, this study is the first of its kind.
JAMA Network Open,
Journal Year:
2025,
Volume and Issue:
8(4), P. e256359 - e256359
Published: April 22, 2025
Importance Large language models (LLMs) are being implemented in health care. Enhanced accuracy and methods to maintain that accuracy over time are needed to maximize LLM benefits.
Objective To evaluate whether LLM performance on the US Medical Licensing Examination (USMLE) can be improved by including formally represented semantic clinical knowledge.
Design, Setting, and Participants This comparative effectiveness research study was conducted between June 2024 and February 2025 at the Department of Biomedical Informatics, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, New York, using sample questions from USMLE Steps 1, 2, and 3.
Intervention Semantic clinical artificial intelligence (SCAI) was developed to insert formally represented semantic clinical knowledge into LLMs using retrieval augmented generation (RAG).
Main Outcomes and Measures The SCAI method was evaluated by comparing 3 Llama LLMs (13B, 70B, and 405B; Meta) with and without RAG-supplied text-based semantic knowledge for answering USMLE sample questions. Accuracy was determined by comparing LLM output with the answer key.
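As a minimal sketch of the general RAG pattern described here, the code below retrieves the knowledge statements most similar to a question and prepends them to the prompt. It assumes a simple TF-IDF retriever and a placeholder LLM call; the knowledge snippets, question, and function names are hypothetical and do not reproduce the authors' SCAI implementation.

```python
# Toy retrieval-augmented generation (RAG) pipeline. All content is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "Beta-blockers reduce mortality after myocardial infarction.",
    "Metformin is first-line therapy for type 2 diabetes mellitus.",
    "Warfarin requires INR monitoring due to a narrow therapeutic window.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k knowledge statements most similar to the question."""
    vectorizer = TfidfVectorizer().fit(knowledge_base + [question])
    kb_vecs = vectorizer.transform(knowledge_base)
    q_vec = vectorizer.transform([question])
    sims = cosine_similarity(q_vec, kb_vecs).ravel()
    top = sims.argsort()[::-1][:k]
    return [knowledge_base[i] for i in top]

def build_prompt(question: str) -> str:
    """Prepend retrieved knowledge to the question (the core RAG step)."""
    context = "\n".join(retrieve(question))
    return f"Use the following facts:\n{context}\n\nQuestion: {question}"

question = "Which drug is first-line for a newly diagnosed type 2 diabetic?"
print(build_prompt(question))  # an ask_llm(prompt) call would follow here
```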
Results The models were tested on 87 Step 1, 103 Step 2, and 123 Step 3 sample questions. The 13B model enhanced with RAG was associated with significantly improved performance on Step 1 but met the 60% passing threshold only for Step 3 (74 correct [60.2%]). The 70B and 405B models passed all steps with RAG: the 70B model scored 80 (92.0%) correctly on Step 1, 82 (79.6%) on Step 2, and 112 (91.1%) on Step 3, while the 405B model scored 79 (90.8%), 87 (84.5%), and 117 (95.1%), respectively. Significant improvements were also found for Step 3, where the 70B-parameter model performed better than the 13B model but not the 405B model.
Conclusions and Relevance In this study, LLM scores improved with semantic clinical knowledge supplied through RAG, performing as well as or better than text-based augmentation. Adding forms of semantic reasoning to LLMs, akin to clinical reasoning, has the potential to improve performance on important medical questions. Improving the clinical knowledge of LLMs with targeted, up-to-date information is an important step toward their implementation and acceptance.
Journal of Medical Internet Research,
Journal Year:
2025,
Volume and Issue:
27, P. e64486 - e64486
Published: April 30, 2025
Large language models (LLMs) have flourished and have gradually become an important research and application direction in the medical field. However, due to the high degree of specialization, complexity, and specificity of medicine, which results in extremely high accuracy requirements, controversy remains about whether LLMs can be used in clinical practice. More and more studies have evaluated the performance of various types of LLMs, but the conclusions are inconsistent. This study uses a network meta-analysis (NMA) to assess the accuracy of LLMs when answering clinical questions and to provide high-level evidence-based support for their future development and application.
In this systematic review and NMA, we searched PubMed, Embase, Web of Science, and Scopus from inception until October 14, 2024. Studies on LLM performance in answering clinical questions were included and screened by reading the published reports. The NMA was conducted to compare the accuracy of different LLMs when answering various types of clinical questions, including objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification. The analysis was performed using Bayesian frequency theory methods. Indirect comparisons between models were performed using a grading scale. A larger surface under the cumulative ranking curve (SUCRA) value indicates higher accuracy of the corresponding LLM.
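To make the SUCRA metric concrete, the sketch below computes SUCRA values from a rank-probability matrix using the standard formula (the mean cumulative rank probability over ranks 1 to K-1). The matrix of rank probabilities is invented.

```python
# Minimal sketch of how a SUCRA value is computed from an NMA's rank
# probabilities. Rows = models; columns = probability of holding rank
# 1..4 (best..worst). All probabilities are hypothetical.
import numpy as np

models = ["LLM A", "LLM B", "LLM C", "LLM D"]
rank_probs = np.array([
    [0.70, 0.20, 0.07, 0.03],
    [0.20, 0.50, 0.20, 0.10],
    [0.08, 0.20, 0.50, 0.22],
    [0.02, 0.10, 0.23, 0.65],
])

k = rank_probs.shape[1]
# Cumulative probability of being at rank j or better, for j = 1..K-1
cum = np.cumsum(rank_probs, axis=1)[:, : k - 1]
sucra = cum.sum(axis=1) / (k - 1)  # SUCRA = mean cumulative rank probability

for name, s in zip(models, sucra):
    print(f"{name}: SUCRA = {s:.4f}")
```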
We examined 168 articles encompassing 35,896 questions and 3063 cases. Of these studies, 40 (23.8%) were considered to have a low risk of bias, 128 (76.2%) had a moderate risk, and none were rated as having a high risk. ChatGPT-4o (SUCRA=0.9207) demonstrated strong performance on objective questions, followed by Aeyeconsult (SUCRA=0.9187) and ChatGPT-4 (SUCRA=0.8087). ChatGPT-4 (SUCRA=0.8708) excelled at open-ended questions. For top 1 and top 3 diagnosis of cases, human experts (SUCRA=0.9001 and SUCRA=0.7126, respectively) ranked highest, while Claude Opus (SUCRA=0.9672) performed well at top 5 diagnosis. Gemini (SUCRA=0.9649) achieved the highest SUCRA value in the triage area.
Our findings indicate that ChatGPT-4o has an advantage in answering objective questions. For open-ended questions, ChatGPT-4 may be more credible. Humans are more accurate at top 1 and top 3 diagnosis, Claude Opus performs better at top 5 diagnosis, and in triage and classification, Gemini is advantageous. Our analysis offers valuable insights for clinicians and medical practitioners, empowering them to effectively leverage LLMs for improved decision-making, learning, and management scenarios.
Trial Registration: PROSPERO CRD42024558245; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024558245.
JMIR Formative Research,
Journal Year:
2024,
Volume and Issue:
9, P. e51319 - e51319
Published: Sept. 3, 2024
The integration of large language models (LLMs), as seen with the generative pretrained transformer series, into health care education and clinical management represents a transformative potential. The practical use of current LLMs in health care sparks great anticipation for new avenues, yet its embracement also elicits considerable concerns that necessitate careful deliberation. This study aims to evaluate the application of state-of-the-art LLMs in health care education, highlighting the following shortcomings as areas requiring significant and urgent improvements: (1) threats to academic integrity, (2) dissemination of misinformation and risks of automation bias, (3) challenges with information completeness and consistency, (4) inequity of access, (5) algorithmic bias, (6) exhibition of moral instability, (7) technological limitations in plugin tools, and (8) lack of regulatory oversight in addressing legal and ethical challenges. Future research should focus on strategically addressing the persistent challenges highlighted in this paper, opening the door for effective measures that can improve the use of LLMs in health care education.
Journal of Multidisciplinary Healthcare,
Journal Year:
2024,
Volume and Issue:
Volume 17, P. 3917 - 3929
Published: Aug. 1, 2024
Chatbots, which are based on large language models, are increasingly being used in public health. However, the effectiveness of chatbot responses has been debated, and their performance in myopia prevention and control has not been fully explored. This study aimed to evaluate three well-known chatbots (ChatGPT, Claude, and Bard) in responding to public health questions about myopia.
Data & Metadata,
Journal Year:
2024,
Volume and Issue:
3
Published: Jan. 1, 2024
In the last decade, the advancement of artificial intelligence has transformed multiple sectors, with natural language processing standing out as one of the most dynamic and promising areas. This study focused on comparing the GPT-3.5, GPT-4, and GPT-4o models, evaluating their efficiency and performance in Natural Language Processing tasks such as text generation, machine translation, and sentiment analysis. Using a controlled experimental design, the response speed and the quality of the outputs generated by each model were measured.
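A minimal sketch of such a latency measurement appears below, with a stubbed-out model call standing in for a real API; the model names, prompt, and trial count are illustrative only.

```python
# Hypothetical latency benchmark: time each model on the same prompt over
# repeated trials and report the median wall-clock latency.
import time
import statistics

def query_model(model: str, prompt: str) -> str:
    """Placeholder for a real chat-completion API call."""
    time.sleep(0.01)  # simulate network/inference latency
    return f"{model} response"

def benchmark(model: str, prompt: str, trials: int = 5) -> float:
    """Return the median latency in seconds over several trials."""
    latencies = []
    for _ in range(trials):
        start = time.perf_counter()
        query_model(model, prompt)
        latencies.append(time.perf_counter() - start)
    return statistics.median(latencies)

prompt = "Summarize the plot of Don Quixote in two sentences."
for model in ["gpt-3.5-turbo", "gpt-4", "gpt-4o"]:
    print(f"{model}: {benchmark(model, prompt):.3f} s")
```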
The results showed that GPT-4o significantly outperforms GPT-4 in terms of speed, completing tasks 25% faster in text generation and 20% faster in machine translation. In sentiment analysis, GPT-4o was 30% faster than GPT-4. Additionally, the analysis of output quality, assessed using human reviews, showed that while GPT-3.5 delivers fast and consistent responses, GPT-4 and GPT-4o produce higher-quality and more detailed content. These findings suggest that GPT-3.5 is ideal for applications that require speed and consistency, whereas GPT-4, although slower, might be preferred in contexts where accuracy and detail are most important. The study highlights the need to balance speed and quality in the selection of models and suggests implementing additional automatic evaluations in future research to complement the current human reviews.