Journal of Hand Surgery Global Online, Journal Year: 2024, Volume and Issue: 7(1), P. 23-28. Published: Nov. 13, 2024
Artificial intelligence advancements have the potential to transform medical education and patient care. The increasing popularity of large language models has raised important questions regarding their accuracy and agreement with human users. The purpose of this study was to evaluate the performance of Chat Generative Pre-Trained Transformer (ChatGPT), versions 3.5 and 4, as well as Microsoft Copilot, which is powered by ChatGPT-4, on the self-assessment examination for hand surgery and to compare results between versions.
Input included 1,000 questions across 5 years (2015-2019) of self-assessment examinations provided by the American Society for Surgery of the Hand. The primary outcomes were correctness, percentage concordance relative to other users, and whether an additional prompt was required. Secondary outcomes were correctness according to question type and difficulty. All question formats, including image-based questions, were used in the analysis.
ChatGPT-3.5 correctly answered 51.6% of questions and ChatGPT-4 answered 63.4% correctly, a statistically significant difference. Copilot answered 59.9% correctly, outperforming ChatGPT-3.5 but scoring significantly lower than ChatGPT-4.
However, concordance with other users was consistently higher when the models answered correctly than when they answered incorrectly, averaging 72.2% versus 62.1%, 67.0% versus 53.2%, and 79.7% versus 52.1% across the three models.
The highest-scoring subject was Miscellaneous and the lowest was Neuromuscular across all models.
In this study, ChatGPT-4 and Copilot performed better on the subspecialty examinations than did ChatGPT-3.5; Copilot was more accurate than ChatGPT-3.5 but less accurate than ChatGPT-4. The models were able to "pass" the 2015-2019 American Society for Surgery of the Hand self-assessment examinations.
While holding promise within medical education, caution should be used, and detailed evaluation of consistency is needed. Future studies should explore how these models perform across multiple trials and contexts to truly assess their reliability.
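As a rough illustration of how such a correctness gap is typically tested for statistical significance, the sketch below runs a chi-square test on a 2x2 contingency table built from the reported percentages; the assumption that each model answered all 1,000 questions is made only for this example and is not stated in the abstract.

```python
# Illustrative significance check of the reported accuracy gap
# (ChatGPT-3.5: 51.6% vs. ChatGPT-4: 63.4%). Assumes 1,000 answered
# questions per model, an assumption made only for this sketch.
from scipy.stats import chi2_contingency

n = 1000
correct_35 = round(0.516 * n)   # 516
correct_4 = round(0.634 * n)    # 634

table = [
    [correct_35, n - correct_35],  # ChatGPT-3.5: correct, incorrect
    [correct_4,  n - correct_4],   # ChatGPT-4:   correct, incorrect
]

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.1e}")  # p << 0.05
```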
Cureus, Journal Year: 2024, Volume and Issue: unknown. Published: April 24, 2024
This study aims to compare the performance of ChatGPT-3.5 (GPT-3.5) and ChatGPT-4 (GPT-4) on the American Society for Surgery of the Hand (ASSH) Self-Assessment Examination (SAE) and to determine their potential as educational tools.
Journal of Clinical Medicine, Journal Year: 2024, Volume and Issue: 13(10), P. 2832. Published: May 11, 2024
Background: OpenAI's ChatGPT (San Francisco, CA, USA) and Google's Gemini (Mountain View, CA, USA) are two large language models that show promise in improving and expediting medical decision making in hand surgery. Evaluating the applications of these models within the field of hand surgery is warranted. This study aims to evaluate ChatGPT-4 and Gemini in classifying hand injuries and recommending treatment.
Methods: The models were given 68 fictionalized clinical vignettes twice. The models were asked to use a specific classification system and to recommend surgical or nonsurgical treatment. Classifications were scored based on correctness. Results were analyzed using descriptive statistics, a paired two-tailed t-test, and sensitivity testing.
Results: Gemini, correctly classifying 70.6% of injuries, demonstrated a superior classification ability over ChatGPT (mean score 1.46 vs. 0.87, p-value < 0.001).
For management, the model recommending surgical intervention at the higher rate (98.0% vs. 88.8%) showed the lower specificity (68.4% vs. 94.7%).
When compared with ChatGPT, Gemini demonstrated greater response replicability.
Conclusions: Large language models like ChatGPT and Gemini show promise in assisting medical decision making, particularly in hand surgery, with Gemini generally outperforming ChatGPT. These findings emphasize the importance of considering the strengths and limitations of different models when integrating them into clinical practice.
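For readers less familiar with the management metrics quoted above, sensitivity and specificity are simple ratios over a 2x2 table of recommended versus correct management; the sketch below uses hypothetical counts chosen only to show the arithmetic, not data from the study.

```python
# Sensitivity/specificity for "recommend surgery" decisions.
# Counts are hypothetical and serve only to illustrate the formulas.

def sensitivity(tp: int, fn: int) -> float:
    """Share of truly surgical cases for which surgery was recommended."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Share of truly nonsurgical cases kept nonsurgical."""
    return tn / (tn + fp)

tp, fn, tn, fp = 45, 5, 15, 3   # hypothetical counts over 68 vignettes
print(f"sensitivity = {sensitivity(tp, fn):.1%}")   # 90.0%
print(f"specificity = {specificity(tn, fp):.1%}")   # 83.3%
```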
BMC Medical Informatics and Decision Making, Journal Year: 2024, Volume and Issue: 24(1). Published: Nov. 26, 2024
The large language models (LLMs), most notably ChatGPT, released since November 30, 2022, have prompted shifting attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated.
We performed a literature review of PubMed to identify publications between December 1, 2022, and April 2024 that discussed assessments of LLM-generated diagnoses or treatment plans.
We selected 108 relevant articles from the review for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat.
The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency".
These criteria for defining high-quality outputs have been used consistently by researchers over the past 1.5 years.
We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting and qualitative evaluation metrics that assess the quality of LLM outputs can be developed to facilitate research on LLMs in healthcare.
Information, Journal Year: 2024, Volume and Issue: 15(9), P. 543. Published: Sept. 5, 2024
ChatGPT is a large language model trained on increasingly large datasets to perform diverse language-based tasks. It is capable of answering multiple-choice questions, such as those posed by medical examinations. ChatGPT has been generating considerable attention in both academic and non-academic domains in recent months. In this study, we aimed to assess GPT's performance on anatomical questions retrieved from medical licensing examinations in Germany. Two different versions of the model were compared.
GPT-3.5 demonstrated moderate accuracy, correctly answering 60–64% of questions from the autumn 2022 and spring 2021 exams. In contrast, GPT-4.o showed significant improvement, achieving 93% accuracy on one exam and 100% on the other. When tested on 30 unique questions not available online, GPT-4.o maintained a 96% accuracy rate.
Furthermore, GPT-4.o consistently outperformed medical students across six state exams, with a statistically significant mean score of 95.54% compared with the students' 72.15%. The study demonstrates that GPT-4.o outperforms its predecessor, GPT-3.5, and a cohort of medical students, indicating its potential as a powerful tool in medical education and assessment. This improvement highlights the rapid evolution of LLMs and suggests that AI could play an important role in supporting and enhancing medical training, potentially offering supplementary resources for students and professionals. However, further research is needed to assess the limitations and practical applications of such AI systems in real-world practice.
Global Medical Education, Journal Year: 2025, Volume and Issue: unknown. Published: Jan. 13, 2025
Abstract Objectives Artificial intelligence (AI) is being increasingly used in medical education. This narrative review presents a comprehensive analysis of generative AI tools' performance in answering and generating medical exam questions, thereby providing a broader perspective on AI's strengths and limitations in the medical education context.
Methods The Scopus database was searched for studies on generative AI and medical examinations published from 2022 to 2024. Duplicates were removed, and relevant full texts were retrieved following inclusion and exclusion criteria. Narrative analysis and descriptive statistics were used to analyze the contents of the included studies.
Results A total of 70 studies were included in the analysis. The results showed that performance varied when answering different types of questions and questions from different specialties, with the best average accuracy in psychiatry, and was influenced by the prompts used. With well-crafted prompts, the models can efficiently produce high-quality examination questions.
Conclusion Generative AI possesses the ability to answer examination questions using carefully designed prompts. Its potential use in medical assessment is vast, ranging from detecting question errors, aiding exam preparation, and facilitating formative assessments to supporting personalized learning. However, it's crucial that educators always double-check AI responses to maintain accuracy and prevent the spread of misinformation.
Hand, Journal Year: 2025, Volume and Issue: unknown. Published: March 20, 2025
Background: The integration of artificial intelligence (AI) into health care has witnessed significant advancements, particularly with AI-driven tools like ChatGPT. Initial evaluations indicated that ChatGPT 3.5 did not perform as well as humans on specialized hand surgery self-assessment examinations. The purpose of this study is to evaluate the performance of ChatGPT 4o on American Society for Surgery of the Hand (ASSH) self-assessment questions and to determine whether enhanced techniques such as better prompts and file search improve accuracy.
Methods: Using data from ASSH self-assessment examinations (2008-2013), we explored the impact of model version, prompt, and file search on the accuracy of AI-generated responses. We used OpenAI's application programming interface to automate question input and response scoring. Statistical analysis was conducted using a one-way analysis of variance. The KR-20 was used to assess the reliability of the test.
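The abstract does not detail the automation pipeline; a minimal sketch of how question input and answer scoring can be wired up through OpenAI's Python SDK is shown below. The model identifier, prompt wording, and answer parsing are assumptions made for illustration, not the authors' code.

```python
# Minimal sketch: automating multiple-choice grading via the OpenAI API.
# Model name, prompts, and parsing are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask(question: str, choices: dict[str, str]) -> str:
    options = "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model identifier
        messages=[
            {"role": "system", "content": "Reply with a single answer letter."},
            {"role": "user", "content": f"{question}\n{options}"},
        ],
    )
    return resp.choices[0].message.content.strip()[:1].upper()

def score(items: list[dict]) -> float:
    """items: [{'question': str, 'choices': dict, 'answer': 'B'}, ...]"""
    correct = sum(ask(q["question"], q["choices"]) == q["answer"] for q in items)
    return correct / len(items)
```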
Results: Results indicate that the latest AI models, with enhanced prompting and access to peer-reviewed literature, can achieve accuracy levels comparable to human examinees on text-based questions. ChatGPT 4o performed significantly better than ChatGPT 3.5 and showed a marked improvement in capabilities. The KR-20 for the 2013 examination was 0.946, indicating a very reliable test.
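KR-20 (Kuder-Richardson Formula 20) measures the internal consistency of a test composed of dichotomously scored items, so the reported 0.946 indicates that examinees who do well on some items tend to do well on the rest. A short sketch of the calculation on a synthetic 0/1 response matrix (the data here are simulated, not from the examination):

```python
# KR-20 reliability for dichotomously (0/1) scored test items.
import numpy as np

def kr20(scores: np.ndarray) -> float:
    """scores: examinees x items matrix of 0/1 responses."""
    k = scores.shape[1]                         # number of items
    p = scores.mean(axis=0)                     # per-item proportion correct
    q = 1.0 - p
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

# Synthetic data: a common per-examinee "ability" makes items correlate,
# which is what drives KR-20 upward on real examinations.
rng = np.random.default_rng(0)
ability = rng.uniform(0.3, 0.95, size=(200, 1))
scores = (rng.random((200, 100)) < ability).astype(int)
print(f"KR-20 = {kr20(scores):.3f}")
```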
Conclusions: These findings highlight AI's potential to support medical education and practice, demonstrating performance at a human-equivalent level on text-based questions. Our results suggest AI's utility as a supplementary tool in educational settings and as a supportive resource in clinical practice.
Cureus, Journal Year: 2025, Volume and Issue: unknown. Published: April 3, 2025
In recent years, the integration of artificial intelligence (AI) in surgical education has been prominent, as evidenced by recent publications. Given the unique requirements and challenges associated with orthopaedic training, we conducted a systematic scoping review that examined applications of AI in this setting only.
A comprehensive search was conducted across four databases: Embase, CENTRAL, Medline, and Scopus. Original research articles that utilised an AI model within a specific educational context were considered for inclusion. Data from the included studies were extracted onto a bespoke form, followed by thematic analysis to detect patterns in the data. Our findings were then summarised descriptively.
A total of 21 studies were included in the review, encompassing 273 participants. In relation to AI applications, two overarching themes were identified: refinement of surgical competencies and enhancement of knowledge acquisition. All studies, with the exception of one, were conducted within the last five years. Twelve distinct AI models were utilised, with large language models accounting for over half of the applications.
Multiple promising interventions were highlighted, particularly the use of personalised, automated feedback for evaluating performance on surgical tasks. AI holds major potential to revolutionise orthopaedic training. However, the evidence supporting its use in this field remains limited. Further studies, preferably randomised controlled trials with larger sample sizes, are required to strengthen the evidence base.