Advances in Medical Education and Practice, Journal year: 2024, Issue: Volume 15, pp. 857-871. Published: Sep. 1, 2024
Artificial intelligence (AI) chatbots excel in language understanding and generation. These models can transform healthcare education and practice. However, it is important to assess the performance of such AI models on various topics to highlight their strengths and possible limitations. This study aimed to evaluate ChatGPT (GPT-3.5 and GPT-4), Bing, and Bard compared with human students at a postgraduate master's level in Medical Laboratory Sciences.
Medical Education, Journal year: 2024, Issue: unknown. Published: April 19, 2024
Abstract
Introduction: In the past year, the use of large language models (LLMs) has generated significant interest and excitement because of their potential to revolutionise various fields, including medical education for aspiring physicians. Although students undergo a demanding educational process to become competent health care professionals, the emergence of LLMs presents a promising solution to challenges like information overload, time constraints and pressure on clinical educators. However, integrating LLMs into medical education raises critical concerns for educators, professionals and students. This systematic review aims to explore LLM applications in medical education, specifically their impact on students' learning experiences.
Methods: A search was performed in PubMed, Web of Science and Embase for articles discussing LLMs, using selected keywords related to medical education, from ChatGPT's debut until February 2024. Only articles available in full text and in English were reviewed. The credibility of each study was critically appraised by two independent reviewers.
Results: The search identified 166 studies, of which 40 were found to be relevant to this review. Among the key themes were LLM capabilities, benefits such as personalised learning, and concerns regarding content accuracy. Importantly, 42.5% of these studies evaluated LLMs, most notably ChatGPT, in a novel way, in contexts such as medical exams and clinical/biomedical information, highlighting their potential for replicating human-level performance in medical knowledge. The remaining studies broadly discussed the prospective role of LLMs in medical education, reflecting a keen interest in their future use despite current constraints.
Conclusions: The responsible implementation of LLMs offers an opportunity to enhance medical education, but ensuring content accuracy, emphasising skill-building and maintaining ethical safeguards are crucial. Continuous evaluation and interdisciplinary collaboration are essential for the appropriate integration of LLMs into medical education.
Journal of Medical Internet Research, Journal year: 2024, Issue: 26, pp. e60807. Published: June 15, 2024
Over the past 2 years, researchers have used various medical licensing examinations to test whether ChatGPT (OpenAI) possesses accurate medical knowledge. The performance of each version on each examination and in multiple testing environments showed remarkable differences. At this stage, there is still a lack of comprehensive understanding of the variability in ChatGPT's performance on different examinations.
JMIR Medical Education, Journal year: 2024, Issue: 10, pp. e54393. Published: March 12, 2024
Previous research applying large language models (LLMs) to medicine was focused on text-based information. Recently, multimodal variants of LLMs have acquired the capability of recognizing images.
Japanese Journal of Radiology, Journal year: 2024, Issue: 42(8), pp. 918-926. Published: May 11, 2024
Abstract
Purpose: To assess the performance of GPT-4 Turbo with Vision (GPT-4 TV), OpenAI's latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4 T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE).
Materials and methods: The dataset comprised questions from JDRBE 2021 and 2023. A total of six board-certified diagnostic radiologists discussed and provided ground-truth answers, consulting relevant literature as necessary. The following questions were excluded: those lacking associated images, those without unanimous agreement on the answers, and those including images rejected by the OpenAI application programming interface. The input for GPT-4 TV included both text and images, whereas that for GPT-4 T was entirely text. Both models were deployed on the dataset, and their performance was compared using McNemar's exact test. The radiological credibility of the responses was assessed by two radiologists through the assignment of legitimacy scores on a five-point Likert scale. These scores were subsequently used to compare model performance with Wilcoxon's signed-rank test.
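As a rough illustration of the paired analysis described in this abstract, the Python sketch below shows how per-question correctness of two models could be compared with McNemar's exact test, and how paired Likert legitimacy scores could be compared with Wilcoxon's signed-rank test. The data arrays are hypothetical placeholders, not the study's data.

```python
# Sketch: paired comparison of two models answering the same question set.
# All values are hypothetical placeholders, not taken from the study.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Per-question correctness (True/False) for each model on identical questions.
correct_tv = np.array([True, True, False, True, False, True, False, True])
correct_t  = np.array([True, False, False, True, True, True, False, False])

# 2x2 table of paired outcomes: rows = model TV correct/incorrect,
# columns = model T correct/incorrect.
table = np.array([
    [np.sum( correct_tv &  correct_t), np.sum( correct_tv & ~correct_t)],
    [np.sum(~correct_tv &  correct_t), np.sum(~correct_tv & ~correct_t)],
])
res = mcnemar(table, exact=True)
print(f"McNemar exact test: statistic={res.statistic}, p={res.pvalue:.3f}")

# Paired five-point Likert legitimacy scores for the same responses.
likert_tv = [4, 3, 2, 5, 3, 4, 2, 3]
likert_t  = [4, 4, 3, 5, 4, 4, 3, 4]
stat, p = wilcoxon(likert_tv, likert_t)
print(f"Wilcoxon signed-rank: statistic={stat}, p={p:.3f}")
```

McNemar's test is the natural choice here because both models answer the same questions, so only the discordant pairs (one model right, the other wrong) carry information about a difference in accuracy.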
Results: The final dataset comprised 139 questions. GPT-4 TV correctly answered 62 questions (45%) and GPT-4 T correctly answered 57 (41%). The statistical analysis found no significant difference in performance between the two models (P = 0.44). The GPT-4 TV responses received significantly lower legitimacy scores than the GPT-4 T responses.
Conclusion: No enhancement in accuracy was observed when image input was added to the text input.
BioMedInformatics, Journal year: 2024, Issue: 4(2), pp. 1097-1143. Published: April 16, 2024
Recent advances in the field of large language models (LLMs) underline their high potential for applications in a variety of sectors. Their use in healthcare, in particular, holds out promising prospects for improving medical practices. As we highlight in this paper, LLMs have demonstrated remarkable capabilities in language understanding and generation that could indeed be put to good use in this field. We also present the main architectures of these models, such as GPT, Bloom, or LLaMA, composed of billions of parameters. We then examine recent trends in the datasets used to train these models. We classify them according to different criteria, such as size, source, and subject (patient records, scientific articles, etc.). We also mention how LLMs could help improve patient care, accelerate medical research, and optimize the efficiency of healthcare systems through assisted diagnosis. However, several technical and ethical issues need to be resolved before LLMs can be used extensively in healthcare. Consequently, we propose a discussion of the capabilities offered by these new generations of linguistic models and of their limitations when deployed in the domain of healthcare.
BMC Medical Education, Journal year: 2024, Issue: 24(1). Published: June 26, 2024
Abstract
Background: Artificial intelligence (AI) chatbots are emerging educational tools for students in healthcare science. However, assessing their accuracy is essential prior to their adoption in educational settings. This study aimed to assess the accuracy of predicting correct answers of three AI chatbots (ChatGPT-4, Microsoft Copilot and Google Gemini) on the Italian entrance standardized examination test for healthcare science degrees (CINECA test). Secondarily, we assessed the narrative coherence of the chatbots' responses (i.e., text output) based on qualitative metrics: the logical rationale behind the chosen answer, the presence of information internal to the question, and the presence of information external to the question.
Methods: An observational cross-sectional design was performed in September 2023. Accuracy was evaluated on the CINECA test, where questions were formatted using a multiple-choice structure with a single best answer. The outcome was binary (correct or incorrect). Chi-squared tests and post hoc analysis with Bonferroni correction assessed differences in performance accuracy among the chatbots. A p-value < 0.05 was considered statistically significant. A sensitivity analysis was performed, excluding questions that were not applicable (e.g., those containing images). Narrative coherence was analyzed by computing the absolute and relative frequencies of errors.
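For readers unfamiliar with the post hoc procedure mentioned above, the sketch below shows one way such an analysis could look in Python: an omnibus chi-squared test on the chatbots' correct/incorrect counts, followed by pairwise two-proportion comparisons against a Bonferroni-adjusted threshold. The counts are hypothetical placeholders, not the study's results.

```python
# Sketch: omnibus chi-squared test plus Bonferroni-corrected pairwise comparisons.
# Counts are hypothetical placeholders, not taken from the study.
from itertools import combinations
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest

# (correct, incorrect) counts per chatbot (hypothetical)
counts = {
    "ChatGPT-4": (700, 108),
    "Copilot":   (640, 180),
    "Gemini":    (560, 240),
}

# Omnibus test: does accuracy differ among the three chatbots?
table = [list(v) for v in counts.values()]
chi2, p, dof, _ = chi2_contingency(table)
print(f"Omnibus chi-squared: chi2={chi2:.2f}, dof={dof}, p={p:.4g}")

# Post hoc pairwise comparisons with a Bonferroni-adjusted threshold.
pairs = list(combinations(counts, 2))
alpha_adj = 0.05 / len(pairs)  # Bonferroni correction for 3 comparisons
for a, b in pairs:
    successes = [counts[a][0], counts[b][0]]
    totals = [sum(counts[a]), sum(counts[b])]
    z, p_pair = proportions_ztest(successes, totals)
    flag = "significant" if p_pair < alpha_adj else "not significant"
    print(f"{a} vs {b}: p={p_pair:.4g} ({flag} at alpha={alpha_adj:.4f})")
```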
Results: Overall, 820 questions were inputted into all chatbots, with 20 questions not imported into ChatGPT-4 (n = 808) and Gemini due to technical limitations. We found statistically significant differences in accuracy in the pairwise chatbot comparisons (p < 0.001). The narrative coherence analysis revealed "Logical reasoning" as the prevalent pattern behind correct answers (n = 622, 81.5%) and "Logical error" as the prevalent pattern behind incorrect answers (n = 40, 88.9%).
Conclusions: Our main findings reveal that: (A) the AI chatbots performed well on the test; (B) ChatGPT-4 and Copilot performed better than Gemini; and (C) the rationale behind their responses was primarily logical. Although the AI chatbots showed promising accuracy on the university entrance test, we encourage candidates to cautiously incorporate this new technology to supplement their learning rather than use it as a primary resource.
Trial registration: Not required.
Mesopotamian Journal of Artificial Intelligence in Healthcare, Journal year: 2024, Issue: 2024, pp. 1-7. Published: Jan. 10, 2024
Background: The role of artificial intelligence (AI) is increasingly recognized in enhancing digital health literacy. This is of particular importance with the widespread availability and popularity of AI chatbots such as ChatGPT and their possible impact on digital health literacy, which involves the need to understand the models' performance across different languages, dialects, and cultural contexts. This study aimed to evaluate the performance of ChatGPT in response to prompting in two Arabic dialects, namely Tunisian and Jordanian.
Methods: This descriptive study followed the METRICS checklist for the design and reporting of AI-based studies in healthcare. Ten general health queries were translated into the Tunisian and Jordanian dialects by bilingual native speakers. The responses of two models, ChatGPT-3.5 and ChatGPT-4, to the Tunisian, Jordanian, and English queries were evaluated using the CLEAR tool tailored for the assessment of health information generated by AI models.
Results: ChatGPT-3.5 performance was categorized as average in one Arabic dialect, with an overall score of 2.83, compared to a score above 3.40 in the other Arabic dialect. ChatGPT-4 showed a similar pattern with marginally better outcomes, with one dialect scoring 3.20 and the other rated 3.53. The CLEAR components were consistently superior in the same dialect for both models despite the lack of statistical significance. Using the English content as a reference, the Arabic dialect responses were significantly inferior (P<.001).
Conclusion: The findings highlight a critical dialectical gap in ChatGPT performance, underlining the importance of linguistic diversity in model development, particularly for health-related content. Collaborative efforts among developers, linguists, and healthcare professionals are needed to improve performance across Arabic dialects. Future studies are recommended to broaden the scope to an extensive range of languages and dialects, which would help in achieving equitable access to AI-generated health information for various communities.
Scientific Reports, Journal year: 2024, Issue: 14(1). Published: April 12, 2024
Abstract
Health equity and accessing Spanish kidney transplant information continue being a substantial challenge facing the Hispanic community. This study evaluated ChatGPT's capabilities in translating 54 English frequently asked questions (FAQs) into Spanish using two versions of the AI model, GPT-3.5 and GPT-4.0. The FAQs included 19 from the Organ Procurement and Transplantation Network (OPTN), 15 from the National Health Service (NHS), and 20 from the National Kidney Foundation (NKF). Two native Spanish-speaking nephrologists, both of whom are of Mexican heritage, scored the translations for linguistic accuracy and cultural sensitivity tailored to Hispanics using a 1-5 rubric. The inter-rater reliability of the evaluators, measured by Cohen's Kappa, was 0.85. Overall linguistic accuracy was 4.89 ± 0.31 for GPT-3.5 versus 4.94 ± 0.23 for GPT-4.0 (non-significant, p = 0.23). Both models scored 4.96 ± 0.19 for cultural sensitivity (p = 1.00). By FAQ source, linguistic accuracy scores ranged from 4.84 ± 0.37 to 4.95 ± 0.22 across the two models, while cultural sensitivity reached 5.00 ± 0.00 for the NKF questions. These high scores demonstrate that ChatGPT effectively translated the FAQs across the different source organizations. The findings suggest ChatGPT's potential to promote health equity by improving access to essential kidney transplant information. Additional research should evaluate its medical translation capabilities in diverse contexts/languages. English-to-Spanish translation may increase access to vital information for underserved patients.
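As a side note on the inter-rater reliability reported in this abstract, the short Python sketch below shows how Cohen's Kappa could be computed from two raters' rubric scores; the score lists are hypothetical, not the study's data.

```python
# Sketch: inter-rater reliability of two raters' 1-5 rubric scores via Cohen's Kappa.
# Scores are hypothetical placeholders, not the study's data.
from sklearn.metrics import cohen_kappa_score

rater_1 = [5, 4, 5, 5, 3, 4, 5, 5, 4, 5]
rater_2 = [5, 4, 5, 4, 3, 4, 5, 5, 4, 5]

# Unweighted kappa treats the 1-5 rubric as nominal categories;
# weights="quadratic" would credit near-misses on the ordinal scale.
kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's Kappa: {kappa:.2f}")
```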
Introduction: ChatGPT has been tested in many disciplines, but only a few studies have involved hearing diagnosis and none have related to hearing physiology or audiology more generally. The consistency of the chatbot's responses to the same question posed multiple times has not been well investigated either. This study aimed to assess the accuracy and repeatability of ChatGPT 3.5 and ChatGPT 4 on test questions concerning objective measures of hearing. Of particular interest was short-term repeatability, which was assessed here on four separate days extended over one week.
Methods: We used 30 single-answer, multiple-choice exam questions from a one-year course on objective methods of hearing testing. The questions were posed five times to both ChatGPT 3.5 (the free version) and ChatGPT 4 (the paid version), spread over four days (two in one week and two in the following week). The responses were evaluated in terms of agreement with the response key. To evaluate repeatability over time, percent agreement and Cohen's Kappa were calculated.
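To make the repeatability measure concrete, here is a minimal Python sketch of percent agreement between two runs of the same multiple-choice questions; the answer lists are hypothetical. Cohen's Kappa, also reported in the study, additionally corrects this raw agreement for agreement expected by chance.

```python
# Sketch: percent agreement between two runs of the same multiple-choice test.
# Answer letters are hypothetical placeholders.
run_day1 = ["A", "C", "B", "D", "A", "C", "B", "B", "D", "A"]
run_day2 = ["A", "C", "B", "C", "A", "C", "B", "D", "D", "A"]

matches = sum(a == b for a, b in zip(run_day1, run_day2))
percent_agreement = 100 * matches / len(run_day1)
print(f"Percent agreement between runs: {percent_agreement:.1f}%")
```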
Results: The overall accuracy of ChatGPT 3.5 was 48-49%, while that of ChatGPT 4 was 65-69%. ChatGPT 3.5 consistently failed to pass the threshold of 50% correct responses. Within a single day, response agreement was 76-79% for ChatGPT 3.5 and 87-88% for ChatGPT 4 (Cohen's Kappa 0.67-0.71 and 0.81-0.84, respectively). Agreement between different days was 75-79% and 85-88% (Cohen's Kappa 0.65-0.69 and 0.80-0.85, respectively).
Conclusion: ChatGPT 4 outperforms ChatGPT 3.5, with higher accuracy and higher consistency over time. However, the great variability of the responses casts doubt on possible professional applications of both versions.