Cureus, Journal Year: 2024, Volume and Issue: unknown, Published: Nov. 4, 2024
Objective
With the rapid advancement of artificial intelligence (AI) technologies, models like Chat Generative Pre-Trained Transformer (ChatGPT) are increasingly being evaluated for their potential applications in healthcare. The Prescribing Safety Assessment (PSA) is a standardised test taken by junior physicians in the UK to evaluate prescribing competence. This study aims to assess ChatGPT's ability to pass the PSA and its performance across different exam sections.
Methodology
ChatGPT (version GPT-4) was tested on four official practice papers, each containing 30 questions, with three independent trials per paper, and answers were marked using the official mark schemes. Performance was measured by calculating overall percentage scores and comparing them with the pass marks provided for each paper. Subsection performance was also analysed to identify strengths and weaknesses.
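The abstract describes the scoring only at a high level; the following is a minimal sketch of how mean raw scores over three trials could be converted to percentages and compared with a paper's pass mark. The trial scores and pass marks below are illustrative placeholders, not figures from the study.

```python
# Minimal sketch: convert raw PSA practice-paper scores to percentages
# and compare them with each paper's pass mark.
# All numbers below are illustrative placeholders, not study data.

from statistics import mean

MAX_MARK = 300  # each paper is treated as marked out of 300 in this sketch

# Hypothetical raw scores from three independent trials per paper.
trial_scores = {
    "paper_1": [250, 255, 260],
    "paper_2": [230, 235, 240],
}

# Hypothetical pass marks, used only to illustrate the comparison step.
pass_marks = {"paper_1": 216, "paper_2": 216}

for paper, scores in trial_scores.items():
    mean_score = mean(scores)
    percentage = 100 * mean_score / MAX_MARK
    passed = mean_score >= pass_marks[paper]
    print(f"{paper}: mean {mean_score:.0f}/{MAX_MARK} ({percentage:.2f}%), "
          f"pass mark {pass_marks[paper]} -> {'pass' if passed else 'fail'}")
```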
Results
ChatGPT achieved mean scores of 257/300 (85.67%), 236/300 (78.67%), 199/300 (66.33%), and 233/300 (77.67%) across the four papers, consistently surpassing the pass marks where available. It performed well in sections requiring factual recall, such as "Adverse Drug Reactions", scoring 63/72 (87.50%), and "Communicating Information" (88.89%). However, it struggled with "Data Interpretation", scoring 32/72 (44.44%) and showing variability, indicating limitations in handling more complex clinical reasoning tasks.
Conclusion
While ChatGPT demonstrated strong overall performance, passing the PSA practice papers and excelling in factual knowledge, its weaker results in data interpretation highlight current gaps in AI's ability to fully replicate human judgement. ChatGPT shows promise in supporting safe prescribing, particularly in areas prone to error, such as drug interactions and communicating correct information. However, due to its limitations in complex reasoning tasks, it is not yet ready to replace prescribers and should instead serve as a supplemental tool in clinical practice.
Cureus, Journal Year: 2024, Volume and Issue: unknown, Published: Aug. 25, 2024
Advances in artificial intelligence (AI), particularly large language models (LLMs) like ChatGPT (versions 3.5 and 4.0) and Google Gemini, are transforming healthcare. This study explores the performance of these AI models in solving diagnostic quizzes from "Neuroradiology: A Core Review" to evaluate their potential as tools in radiology.
Canadian Journal of Ophthalmology, Journal Year: 2025, Volume and Issue: unknown, Published: Feb. 1, 2025
To evaluate the performance of large language models (LLMs), specifically Microsoft Copilot, GPT-4 (GPT-4o and GPT-4o mini), and Google Gemini (Gemini and Gemini Advanced), in answering ophthalmological questions and to assess the impact of prompting techniques on their accuracy.
Prospective qualitative study.
A total of 300 questions from StatPearls were tested, covering a range of subspecialties and image-based tasks. Each question was evaluated using 2 prompting techniques: zero-shot forced prompting (prompt 1) and combined role-based plan-and-solve+ prompting (prompt 2).
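The abstract names the two prompting strategies but not their wording. The templates below are a hypothetical sketch of what a zero-shot forced-choice prompt (prompt 1) and a combined role-based plan-and-solve+ prompt (prompt 2) could look like for a multiple-choice question; the exact phrasing used in the study is an assumption here.

```python
# Hypothetical prompt templates for the two techniques described above.
# The wording is an assumption for illustration; the study's exact prompts
# are not given in the abstract.

def zero_shot_forced(question: str, options: list[str]) -> str:
    """Prompt 1: zero-shot forced choice -- answer directly with one letter."""
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (
        f"{question}\n{opts}\n"
        "Respond with only the letter of the single best answer."
    )

def role_based_plan_and_solve(question: str, options: list[str]) -> str:
    """Prompt 2: role-based + plan-and-solve+ -- reason step by step, then answer."""
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(options))
    return (
        "You are an experienced ophthalmologist answering a board-style question.\n"
        f"{question}\n{opts}\n"
        "First, understand the question and extract the relevant clinical details. "
        "Then devise a step-by-step plan, carry it out, and check your reasoning. "
        "Finally, state the letter of the single best answer."
    )

# Example usage with a placeholder question.
print(zero_shot_forced("Which structure produces aqueous humour?",
                       ["Ciliary body", "Cornea", "Lens", "Retina"]))
```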
With prompting, GPT-4o demonstrated significantly superior overall performance, correctly answering 72.3% of questions and outperforming all other models, including Copilot (53.7%), GPT-4o mini (62.0%), Gemini (54.3%), and Gemini Advanced (62.0%) (p < 0.0001). Notable improvements were observed with prompt 2 over prompt 1, elevating Copilot's accuracy from the lowest (53.7%) to the second highest (72.3%) among the LLMs.
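The abstract reports p < 0.0001 for the between-model comparison but does not name the statistical test. One plausible approach for comparing correct-answer counts out of 300 questions is a chi-square test on the contingency table of correct versus incorrect responses, sketched below with the reported percentages converted to approximate counts; the choice of test is an assumption for illustration only.

```python
# Sketch: comparing model accuracies (correct vs. incorrect out of 300 questions)
# with a chi-square test. The abstract does not name the test actually used;
# this is one common choice, shown for illustration only.

from scipy.stats import chi2_contingency

N_QUESTIONS = 300
reported_accuracy = {
    "GPT-4o": 0.723,
    "Copilot": 0.537,
    "GPT-4o mini": 0.620,
    "Gemini": 0.543,
    "Gemini Advanced": 0.620,
}

# Build a models x (correct, incorrect) contingency table from the percentages.
table = []
for model, acc in reported_accuracy.items():
    correct = round(acc * N_QUESTIONS)
    table.append([correct, N_QUESTIONS - correct])

chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.2e}")
```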
While newer iterations of LLMs, such as Gemini Advanced, outperformed their less advanced counterparts (e.g., Gemini), this study emphasizes the need for caution in clinical applications of these models. The choice of prompting technique influences accuracy, highlighting the necessity of further research to refine LLM capabilities, particularly in visual data interpretation, to ensure safe integration into medical practice.
Cureus, Journal Year: 2025, Volume and Issue: unknown, Published: Feb. 24, 2025
Introduction: Large language models (LLMs) like Gemini 2.0 Advanced and ChatGPT-4o are increasingly applied in medical contexts. This study assesses their accuracy in answering cataract-related questions from Brazilian ophthalmology board exams, evaluating their potential for clinical decision support.
Methods: A retrospective analysis was conducted using 221 multiple-choice questions. Responses from both LLMs were evaluated by two independent ophthalmologists against the official answer key. Accuracy rates and inter-evaluator agreement (Cohen's kappa) were analyzed.
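Cohen's kappa quantifies agreement between the two evaluators beyond chance, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's marginal frequencies. The sketch below computes it for two lists of correct/incorrect ratings; the ratings themselves are placeholders, not data from the study.

```python
# Sketch of Cohen's kappa for two evaluators rating each LLM response
# as correct (1) or incorrect (0). Ratings below are placeholders.

from collections import Counter

def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    n = len(rater_a)
    # Observed agreement: proportion of items where the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in set(rater_a) | set(rater_b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings for 10 responses (1 = judged correct, 0 = incorrect).
evaluator_1 = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
evaluator_2 = [1, 0, 0, 1, 1, 0, 1, 1, 0, 0]
print(f"kappa = {cohens_kappa(evaluator_1, evaluator_2):.3f}")
```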
Results: One model achieved 85.45% and 80.91% accuracy across the two evaluators, while the other scored 80.00% and 84.09%. Inter-evaluator agreement was moderate (κ = 0.514 and 0.431, respectively). Performance varied across exam years.
Conclusion: Both models demonstrated high accuracy on cataract-related questions, supporting their use as educational tools. However, performance variability indicates the need for further refinement and validation.
medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown, Published: July 16, 2024
Previous studies have evaluated the ability of large language models (LLMs) in medical disciplines; however, few have focused on image analysis, and none specifically on cardiovascular imaging or nuclear cardiology.