From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance
Information, Journal Year: 2024, Volume and Issue: 15(9), P. 543 - 543, Published: Sept. 5, 2024
ChatGPT is a large language model trained on increasingly large datasets to perform diverse language-based tasks. It is capable of answering multiple-choice questions, such as those posed by medical examinations, and has been generating considerable attention in both academic and non-academic domains in recent months. In this study, we aimed to assess GPT's performance on anatomical questions retrieved from medical licensing examinations in Germany. Two different versions were compared. GPT-3.5 demonstrated moderate accuracy, correctly answering 60–64% of the questions in the autumn 2022 and spring 2021 exams. In contrast, GPT-4.o showed significant improvement, achieving 93% accuracy on one exam and 100% on the other. When tested on 30 unique questions not available online, GPT-4.o maintained a 96% accuracy rate. Furthermore, GPT-4.o consistently outperformed medical students across six state exams, with a statistically significant mean score of 95.54% compared to the students' 72.15%. The study demonstrates that GPT-4.o outperforms its predecessor, GPT-3.5, as well as a cohort of medical students, indicating its potential as a powerful tool in medical education and assessment. This improvement highlights the rapid evolution of LLMs and suggests that AI could play an important role in supporting and enhancing medical training, potentially offering supplementary resources for medical professionals. However, further research is needed to assess the limitations and practical applications of such systems in real-world clinical practice.
Language: English
Which current chatbot is more competent in urological theoretical knowledge? A comparative analysis by the European board of urology in-service assessment
World Journal of Urology, Journal Year: 2025, Volume and Issue: 43(1), Published: Feb. 11, 2025
Abstract
Introduction
The European Board of Urology (EBU) In-Service Assessment (ISA) test evaluates urologists' knowledge and interpretation. Artificial Intelligence (AI) chatbots are being used widely by physicians for theoretical information. This research compares five existing chatbots' performances on EBU ISA questions.
Materials and methods
GPT-4o, Copilot Pro, Gemini Advanced, Claude 3.5, and Sonar Huge solved 596 questions from 6 exams administered between 2017 and 2022. Questions were divided into two categories: those that measure knowledge and those that require data analysis, and exam performances were compared.
Results
Overall, all chatbots except Claude 3.5 passed the examinations with a percentage above the 60% overall score. Copilot Pro scored best, and the difference between its score and Claude 3.5's score was statistically significant (71.6% vs. 56.2%, p = 0.001). When the total of 444 knowledge questions and 152 data analysis questions were compared, Copilot Pro offered the greatest amount of correct information, whereas Claude 3.5 provided the least (72.1% vs. 57.4%); the same was also true for analytical skills (70.4% vs. 52.6%, p = 0.019).
Conclusions
Four out of the five chatbots passed the exams, achieving scores exceeding 60%, while only one did not pass the EBU examination. Copilot Pro performed best in the ISA examinations and Claude 3.5 the worst. Chatbots performed worse on data analysis questions than on knowledge questions. Thus, although they are successful in terms of theoretical knowledge, their competence in analyzing data is questionable.
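As a rough illustration of the kind of comparison reported above (not the authors' actual analysis), the sketch below runs a two-proportion z-test on the best and worst overall scores. The correct-answer counts are reconstructed from the reported percentages of 596 questions, so the exact p-value will differ from the published one.

```python
# Minimal sketch, not the authors' code: two-proportion z-test comparing the
# best (71.6%) and worst (56.2%) overall scores on the 596 EBU ISA questions.
# Correct-answer counts are approximated from the reported percentages.
from statistics import NormalDist

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: the two proportions are equal."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                 # pooled proportion under H0
    se = (pooled * (1 - pooled) * (1 / n1 + 1 / n2)) ** 0.5
    z = (p1 - p2) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value

n = 596                              # questions answered by each chatbot (6 exams, 2017-2022)
best_correct = round(0.716 * n)      # ~427 correct answers (best overall score)
worst_correct = round(0.562 * n)     # ~335 correct answers (worst overall score)

z, p = two_proportion_z_test(best_correct, n, worst_correct, n)
print(f"z = {z:.2f}, two-sided p = {p:.2g}")
```

The published p = 0.001 presumably comes from the authors' own test; the sketch only shows that a difference of this size over 596 questions is clearly significant.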
Language: English
Evaluating ChatGPT’s role in urological counseling and clinical decision support
World Journal of Urology, Journal Year: 2025, Volume and Issue: 43(1), Published: Feb. 13, 2025
Language: English
Comparative analysis of the effectiveness of Microsoft Copilot artificial intelligence chatbot and Google Search in answering patient inquiries about infertility: evaluating readability, understandability, and actionability
International Journal of Impotence Research, Journal Year: 2025, Volume and Issue: unknown, Published: April 22, 2025
Abstract
Failure to achieve spontaneous pregnancy within 12 months despite unprotected intercourse is called infertility. The rapid development of digital health data has led more people to search for healthcare-related topics on the Internet. Many infertile individuals and couples use the Internet as their primary source of information on infertility diagnosis and treatment. However, it is important to assess the readability, understandability, and actionability of the information provided by these sources to patients. There is a gap in the literature addressing this aspect. This study aims to compare responses generated by Microsoft Copilot (MC), an AI chatbot, and Google Search (GS), an internet search engine, to infertility-related queries. Prospectively, a Google Trends analysis was conducted to identify the top 20 queries related to infertility in February 2024. The queries were then entered into GS and MC in May 2024. Answers from both platforms were recorded for further analysis. Outputs were assessed using automated readability tools, and readability scores were calculated. Understandability and actionability of the answers were evaluated with the Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P).
MC outputs were found to have significantly higher Automated Readability Index (ARI) and Flesch-Kincaid Grade Level (FKGL) scores than GS (p = 0.044), while no significant differences were observed in Flesch Reading Ease, Gunning Fog Index, Simplified Measure of Gobbledygook (SMOG), and Coleman-Liau scores.
Both platforms' outputs had readability above the 8th-grade level, indicating advanced reading levels. According to PEMAT-P, MC outperformed GS in terms of understandability (68.65 ± 11.99 vs. 54.50 ± 15.09, p = 0.001) and actionability (29.85 ± 17.8 vs. 1 ± 4.47, p = 0.000). MC provides more understandable and actionable answers to infertility-related queries, suggesting that it might have great potential for patient education.
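For readers unfamiliar with the readability indices named above, the sketch below computes ARI and FKGL from their standard formulas using a naive tokenizer and syllable counter. It is illustrative only and is not the automated tooling the study used, so its scores will differ slightly from dedicated readability tools.

```python
# Illustrative sketch of the ARI and FKGL formulas, not the study's tooling.
import re

def _sentences(text: str) -> int:
    # count sentence-ending punctuation runs; assume at least one sentence
    return max(1, len(re.findall(r"[.!?]+", text)))

def _words(text: str) -> list[str]:
    return re.findall(r"[A-Za-z0-9']+", text)

def _syllables(word: str) -> int:
    # crude vowel-group count; dedicated tools use dictionaries or better heuristics
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def automated_readability_index(text: str) -> float:
    words = _words(text)
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / _sentences(text) - 21.43

def flesch_kincaid_grade(text: str) -> float:
    words = _words(text)
    syllables = sum(_syllables(w) for w in words)
    return 0.39 * len(words) / _sentences(text) + 11.8 * syllables / len(words) - 15.59

sample = ("Infertility is the failure to achieve a spontaneous pregnancy after "
          "twelve months of regular unprotected intercourse.")
print(f"ARI:  {automated_readability_index(sample):.1f}")
print(f"FKGL: {flesch_kincaid_grade(sample):.1f}")
```

Both indices estimate the school grade level needed to understand the text, which is why scores above 8 are described as exceeding the recommended reading level for patient materials.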
Language: English
Letter to the editor for the article “Performance of ChatGPT-3.5 and ChatGPT-4 on the European Board of Urology (EBU) exams: a comparative analysis”
Yuxuan Song, Tao Xu
World Journal of Urology, Journal Year: 2024, Volume and Issue: 42(1), Published: Oct. 3, 2024
Language: English
Editorial Comment on Can artificial intelligence pass the Japanese urology board examinations?
International Journal of Urology, Journal Year: 2024, Volume and Issue: unknown, Published: Oct. 15, 2024
The study titled "Can artificial intelligence pass the Japanese urology board examinations?" by Okada et al. provides an insightful and timely exploration of the potential for large language models (LLMs) such as GPT-4 and Claude3 to succeed in highly specialized medical examinations.1 As artificial intelligence (AI) continues to advance, its applications in medical education and certification processes are expanding, making this work particularly relevant. This research demonstrates that one of the tested LLMs achieved the highest accuracy, with passing scores in three of the four prompt conditions. The study effectively highlights the strengths of LLMs in handling complex, domain-specific questions within the context of the Japanese Urology Board Examinations. The ability to surpass a 60% threshold in multiple scenarios indicates that LLMs are nearing a level of proficiency that could complement medical professionals in educational and evaluative settings. For instance, Nakao et al. evaluated GPT-4V's performance on the Japanese National Medical Licensing Examination, highlighting its ability to interpret complex visual data, a crucial component of medical diagnostics.2 Despite these promising results, the literature also underscores the limitations of LLMs. Hager et al. noted that while LLMs perform well in examination settings, they often struggle with clinical decision-making and adherence to guidelines, which are essential in real-world practice.3 Agerri et al. further identified issues with outdated knowledge and hallucinations in AI-generated content, posing risks when relying on AI in clinical contexts.4 Moreover, Schoch et al. conducted a comparative analysis of ChatGPT-3.5 and ChatGPT-4 on the European Board of Urology examinations, revealing inconsistencies across different test settings.5 Okada et al.'s study1 is a valuable contribution to the ongoing discourse on the integration of AI in medical education and certification. By demonstrating the capabilities of LLMs, it paves the way for the development of tools that can support and enhance the expertise of medical professionals.
Language: English
Evaluating the Performance of ChatGPT in the Prescribing Safety Assessment: Implications for Artificial Intelligence-Assisted Prescribing
D.R. Bull, Dide Okaygoun
Cureus, Journal Year: 2024, Volume and Issue: unknown, Published: Nov. 4, 2024
Objective
With the rapid advancement of artificial intelligence (AI) technologies, models like Chat Generative Pre-Trained Transformer (ChatGPT) are increasingly being evaluated for their potential applications in healthcare. The Prescribing Safety Assessment (PSA) is a standardised test for junior physicians in the UK to evaluate prescribing competence. This study aims to assess ChatGPT's ability to pass the PSA and its performance across different exam sections.
Methodology
ChatGPT (version GPT-4) was tested on four official practice papers, each containing 30 questions, with three independent trials per paper and with answers marked using the official mark schemes. Performance was measured by calculating overall percentage scores and comparing them to the pass marks provided with each paper. Subsection performance was also analysed to identify strengths and weaknesses.
Results
ChatGPT achieved mean scores of 257/300 (85.67%), 236/300 (78.67%), 199/300 (66.33%), and 233/300 (77.67%) across the four papers, consistently surpassing the pass marks where available. It performed well in sections requiring factual recall, such as "Adverse Drug Reactions", scoring 63/72 (87.50%), and "Communicating Information" (88.89%). However, it struggled with "Data Interpretation", scoring 32/72 (44.44%), showing variability and indicating limitations in handling more complex clinical reasoning tasks.
Conclusion
While ChatGPT demonstrated strong performance, passing all four practice papers and excelling in factual knowledge, its weaknesses in data interpretation highlight current gaps in AI's ability to fully replicate human clinical judgement. ChatGPT shows promise in supporting safe prescribing, particularly in areas prone to error, such as drug interactions and communicating correct information. However, due to its limitations in complex reasoning tasks, it is not yet ready to replace human prescribers and should instead serve as a supplemental tool in clinical practice.
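As a small worked example of the score arithmetic in the Results, the sketch below converts the reported mean totals into percentages and checks them against a pass mark. The 300-mark total and the 63% pass mark are assumptions for illustration only; the abstract does not state the per-paper pass marks.

```python
# Minimal sketch, not the study's marking pipeline: turn the mean totals
# reported in the Results into percentages and check them against a pass mark.
TOTAL_MARKS = 300        # assumed total per paper across the three trials
PASS_MARK_PCT = 63.0     # hypothetical pass mark, not a value from the paper

paper_totals = {         # mean totals reported in the abstract
    "Paper 1": 257,
    "Paper 2": 236,
    "Paper 3": 199,
    "Paper 4": 233,
}

for paper, total in paper_totals.items():
    pct = 100 * total / TOTAL_MARKS
    verdict = "pass" if pct >= PASS_MARK_PCT else "fail"
    print(f"{paper}: {total}/{TOTAL_MARKS} = {pct:.2f}% -> {verdict}")
```

Running it reproduces the percentages quoted in the abstract (85.67%, 78.67%, 66.33%, 77.67%); only the pass/fail column depends on the assumed pass mark.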
Language: English