ACM Transactions on Computing Education, Journal Year: 2024, Volume and Issue: 24(3), P. 1 - 56, Published: June 20, 2024
The recent integration of visual capabilities into Large Language Models (LLMs) has the potential to play a pivotal role in science and technology education, where visual elements such as diagrams, charts, and tables are commonly used to improve the learning experience. This study investigates the performance of ChatGPT-4 Vision, OpenAI’s most advanced visual model at the time the study was conducted, on the Bachelor in Computer Science section of Brazil’s 2021 National Undergraduate Exam (ENADE). By presenting the model with the exam’s open and multiple-choice questions in their original image format, and allowing for reassessment in response to differing answer keys, we were able to evaluate the model’s reasoning and self-reflecting capabilities in a large-scale academic assessment involving textual and visual content. ChatGPT-4 Vision significantly outperformed the average exam participant, positioning itself within the top 10 best score percentile. While it excelled in questions that incorporated visual elements, it also encountered challenges with question interpretation, logical reasoning, and visual acuity. A positive correlation between the model’s score distribution and that of the human participants suggests that multimodal LLMs can provide a useful tool for question testing and refinement. However, the involvement of an independent expert panel to review cases of disagreement between the model and the answer key revealed some poorly constructed questions containing vague or ambiguous statements, calling attention to the critical need for improved question design in future exams. Our findings suggest that while ChatGPT-4 Vision shows promise in visually rich academic evaluations, human oversight remains crucial for verifying its accuracy and ensuring the fairness of high-stakes educational exams. The paper’s research materials are publicly available at https://github.com/nabormendonca/gpt-4v-enade-cs-2021.
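The reported correlation between the model’s per-question results and those of human examinees could be computed along the following lines. This is a minimal sketch with hypothetical placeholder values, not code from the paper’s repository, and a rank correlation is only one reasonable choice of statistic.

```python
# Illustrative sketch only (not from the paper's repository): one way to
# correlate per-question model results with human performance on the exam.
# All values below are hypothetical placeholders.
from scipy.stats import spearmanr

# Hypothetical fraction of human ENADE participants answering each question
# correctly, and whether the model answered it correctly (1) or not (0).
human_success_rate = [0.62, 0.35, 0.48, 0.71, 0.22, 0.55]
model_correct = [1, 0, 1, 1, 0, 1]

rho, p_value = spearmanr(human_success_rate, model_correct)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```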
npj Digital Medicine, Journal Year: 2024, Volume and Issue: 7(1), Published: Feb. 20, 2024
Abstract
The use of large language models (LLMs) in clinical medicine is currently thriving. Effectively transferring LLMs’ pertinent theoretical knowledge from computer science to their clinical application is crucial. Prompt engineering has shown potential as an effective method in this regard. To explore the application of prompt engineering in LLMs and to examine the reliability of LLMs, different styles of prompts were designed and used to ask LLMs about their agreement with the American Academy of Orthopedic Surgeons (AAOS) osteoarthritis (OA) evidence-based guidelines. Each question was asked 5 times. We compared the consistency of the findings with the guidelines across evidence levels for the different prompts and assessed their reliability by asking the same question repeatedly. gpt-4-Web with ROT prompting had the highest overall consistency (62.9%) and a significant performance for strong recommendations, with a total consistency of 77.5%. The reliability of the responses was not stable (Fleiss kappa ranged from −0.002 to 0.984). This study revealed that prompts had variable effects across various models, with gpt-4-Web with ROT prompting being the most consistent. An appropriate prompt could improve the accuracy of responses to professional medical questions.
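The stability analysis described above, in which each question is asked 5 times and agreement is summarized with Fleiss’ kappa, could be reproduced roughly as follows. This is a sketch under assumed data, not the authors’ code; the binary agree/disagree coding and the example responses are placeholders.

```python
# Minimal sketch (not the authors' code) of quantifying response stability
# across 5 repeated prompts with Fleiss' kappa, as reported in the abstract.
# The binary coding and example responses are hypothetical placeholders.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = guideline statements, columns = the 5 repeated askings.
# 1 = the model agrees with the AAOS recommendation, 0 = it disagrees.
responses = np.array([
    [1, 1, 1, 1, 1],  # perfectly stable answer
    [1, 0, 1, 0, 1],  # unstable answer
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
])

table, _ = aggregate_raters(responses)  # per-statement counts for each category
print("Fleiss' kappa:", round(fleiss_kappa(table), 3))
```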
Japanese Journal of Radiology, Journal Year: 2024, Volume and Issue: 42(7), P. 685 - 696, Published: March 29, 2024
Abstract
The advent of Deep Learning (DL) has significantly propelled the field of diagnostic radiology forward by enhancing image analysis and interpretation. The introduction of the Transformer architecture, followed by the development of Large Language Models (LLMs), has further revolutionized this domain. LLMs now possess the potential to automate and refine the radiology workflow, extending from report generation to assistance in diagnostics and patient care. The integration of multimodal technology with LLMs could potentially leapfrog these applications to unprecedented levels. However, these models come with unresolved challenges such as information hallucinations and biases, which can affect their clinical reliability. Despite these issues, legislative and guideline frameworks have yet to catch up with the technological advancements. Radiologists must acquire a thorough understanding of these technologies to leverage LLMs to the fullest while maintaining medical safety and ethics. This review aims to aid in that endeavor.
Japanese Journal of Radiology, Journal Year: 2024, Volume and Issue: 42(11), P. 1231 - 1235, Published: July 1, 2024
Abstract
Purpose
Large language models (LLMs) are rapidly advancing and demonstrating high performance in understanding textual information, suggesting potential applications in interpreting patient histories and documented imaging findings. As LLMs continue to improve, their diagnostic abilities are expected to be enhanced further. However, there is a lack of comprehensive comparisons between LLMs from different manufacturers. In this study, we aimed to test the diagnostic performance of the three latest major LLMs (GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro) using Radiology Diagnosis Please Cases, a monthly quiz series for radiology experts.
Materials and methods
Clinical history and imaging findings, provided textually by the case submitters, were extracted from 324 quiz questions originating from cases published between 1998 and 2023. The top three differential diagnoses were generated by GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro through their respective application programming interfaces. A comparative analysis among these models was conducted using Cochran’s Q test with post hoc McNemar’s tests.
Results
The accuracies of GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro for the primary diagnosis were 41.0%, 54.0%, and 33.9%, respectively, which further improved to 49.4% and 62.0% for GPT-4o and Claude 3 Opus when considering the accuracy of any of the top three differential diagnoses. Significant differences were observed between all pairs of models.
Conclusion
Claude 3 Opus outperformed GPT-4o and Gemini 1.5 Pro in solving radiology quiz cases. These models appear capable of assisting radiologists when supplied with accurately evaluated and appropriately worded descriptions of the imaging findings.
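The Cochran’s Q test with post hoc McNemar’s tests described above could be run along the following lines; this is an illustration of the method with a hypothetical per-case correctness matrix, not the study’s actual analysis script.

```python
# Illustrative sketch, not the study's analysis script: Cochran's Q across
# three models' per-case correctness, followed by a post hoc McNemar test.
# The correctness matrix is a hypothetical placeholder.
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

# Rows = quiz cases; columns = GPT-4o, Claude 3 Opus, Gemini 1.5 Pro.
# 1 = primary diagnosis correct, 0 = incorrect.
correct = np.array([
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
])

q = cochrans_q(correct)
print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.3f}")

# Post hoc pairwise comparison, e.g. GPT-4o vs Claude 3 Opus.
a, b = correct[:, 0], correct[:, 1]
table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
         [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
m = mcnemar(table, exact=True)
print(f"McNemar (GPT-4o vs Claude 3 Opus): p = {m.pvalue:.3f}")
```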
Diagnostic and Interventional Imaging, Journal Year: 2024, Volume and Issue: 105(7-8), P. 251 - 265, Published: April 27, 2024
The purpose of this study was to systematically review the reported performances of ChatGPT, identify potential limitations, and explore future directions for its integration, optimization, and ethical considerations in radiology applications.
Japanese Journal of Radiology, Journal Year: 2024, Volume and Issue: 42(8), P. 918 - 926, Published: May 11, 2024
Abstract
Purpose
To assess the performance of GPT-4 Turbo with Vision (GPT-4TV), OpenAI’s latest multimodal large language model, by comparing its ability to process both text and image inputs with that of the text-only GPT-4 Turbo (GPT-4 T) in the context of the Japan Diagnostic Radiology Board Examination (JDRBE).
Materials and methods
The dataset comprised questions from JDRBE 2021 and 2023. A total of six board-certified diagnostic radiologists discussed the questions and provided ground-truth answers, consulting relevant literature as necessary. The following questions were excluded: those lacking associated images, those without unanimous agreement on the answers, and those including images rejected by the OpenAI application programming interface. The input for GPT-4TV included both text and images, whereas the input for GPT-4 T was entirely text. Both models were deployed on the dataset, and their performance was compared using McNemar’s exact test. The radiological credibility of the responses was assessed by two diagnostic radiologists through the assignment of legitimacy scores on a five-point Likert scale. These scores were subsequently used to compare model performance with Wilcoxon's signed-rank test.
Results
The dataset comprised 139 questions. GPT-4TV correctly answered 62 questions (45%), whereas GPT-4 T correctly answered 57 (41%). The statistical analysis found no significant difference in performance between the two models (P = 0.44). The GPT-4TV responses received significantly lower legitimacy scores than the GPT-4 T responses.
Conclusion
No enhancement in accuracy was observed with the addition of image input.
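The Likert-scale comparison reported above lends itself to a paired nonparametric test; a minimal sketch with made-up legitimacy scores (the per-question ratings are not given in the abstract) might look like this.

```python
# Minimal sketch with made-up data (the per-question ratings are not given
# in the abstract): comparing paired five-point legitimacy scores for the
# two model configurations with Wilcoxon's signed-rank test.
from scipy.stats import wilcoxon

# Hypothetical paired Likert legitimacy scores for the same questions.
scores_gpt4tv = [3, 4, 2, 5, 3, 4, 2, 3, 4, 2]
scores_gpt4t = [4, 4, 3, 5, 4, 4, 3, 4, 4, 3]

stat, p = wilcoxon(scores_gpt4tv, scores_gpt4t)
print(f"Wilcoxon W = {stat}, p = {p:.3f}")
```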
BMC Medical Education, Journal Year: 2024, Volume and Issue: 24(1), Published: June 26, 2024
Abstract
Background
Artificial intelligence (AI) chatbots are emerging educational tools for students in healthcare science. However, assessing their accuracy is essential prior to their adoption in educational settings. This study aimed to assess the accuracy of three AI chatbots (ChatGPT-4, Microsoft Copilot and Google Gemini) in predicting the correct answers to the Italian entrance standardized examination test for healthcare science degrees (CINECA test). Secondarily, we assessed the narrative coherence of the chatbots’ responses (i.e., text output) based on three qualitative metrics: the logical rationale behind the chosen answer, the presence of information internal to the question, and the presence of information external to the question.
Methods
An observational cross-sectional design was performed in September 2023. Accuracy was evaluated on the CINECA test, where questions were formatted using a multiple-choice structure with a single best answer. The outcome was binary (correct or incorrect). A chi-squared test and a post hoc analysis with Bonferroni correction assessed differences among the chatbots’ performance in accuracy. A p-value < 0.05 was considered statistically significant. A sensitivity analysis was performed, excluding questions that were not applicable (e.g., those based on images). Narrative coherence was analyzed by absolute and relative frequencies of errors.
Results
Overall, 820 questions were inputted into all chatbots, while 20 questions could not be imported into ChatGPT-4 (n = 808) and Google Gemini due to technical limitations. We found significant differences in accuracy for the ChatGPT-4 vs Google Gemini and Microsoft Copilot vs Google Gemini comparisons (p < 0.001). The narrative coherence analysis revealed “Logical reasoning” as the most prevalent category among correct answers (n = 622, 81.5%) and “Logical error” as the most prevalent among incorrect answers (n = 40, 88.9%).
Conclusions
Our main findings reveal that: (A) AI chatbots performed well; (B) ChatGPT-4 and Microsoft Copilot performed better than Google Gemini; and (C) their narrative coherence was primarily logical. Although AI chatbots showed promising accuracy in a university entrance standardized examination, we encourage candidates to cautiously incorporate this new technology as a supplement to learning rather than a primary resource.
Trial registration
Not required.
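The chi-squared analysis with Bonferroni-corrected post hoc comparisons described in the Methods could be implemented roughly as follows; the correct/incorrect counts are hypothetical placeholders, not the study’s data.

```python
# Illustrative sketch (hypothetical counts, not the study's data): overall
# chi-squared test on a chatbot-by-outcome table, then Bonferroni-corrected
# pairwise post hoc comparisons.
from itertools import combinations
from scipy.stats import chi2_contingency

# Correct / incorrect counts per chatbot (placeholder values).
counts = {
    "ChatGPT-4": [700, 108],
    "Copilot": [690, 130],
    "Gemini": [600, 208],
}

chi2, p, dof, _ = chi2_contingency(list(counts.values()))
print(f"Overall: chi2 = {chi2:.1f}, df = {dof}, p = {p:.4f}")

# Three pairwise comparisons, so the Bonferroni-adjusted alpha is 0.05 / 3.
alpha = 0.05 / 3
for (name_a, a), (name_b, b) in combinations(counts.items(), 2):
    chi2, p, _, _ = chi2_contingency([a, b])
    verdict = "significant" if p < alpha else "not significant"
    print(f"{name_a} vs {name_b}: p = {p:.4f} ({verdict} at Bonferroni alpha)")
```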
The Oncologist, Journal Year: 2024, Volume and Issue: 29(5), P. 407 - 414, Published: Feb. 3, 2024
Abstract
Background
The capability of large language models (LLMs) to understand and generate human-readable text has prompted the investigation of their potential as educational and management tools for patients with cancer and their healthcare providers.
Materials and Methods
We conducted a cross-sectional study aimed at evaluating the ability of ChatGPT-4, ChatGPT-3.5, and Google Bard to answer questions related to 4 domains of immuno-oncology (Mechanisms, Indications, Toxicities, and Prognosis). We generated 60 open-ended questions (15 for each section). Questions were manually submitted to the LLMs, and responses were collected on June 30, 2023. Two reviewers evaluated the answers independently.
Results
ChatGPT-4 and ChatGPT-3.5 answered all questions, whereas Google Bard answered only 53.3% (P < .0001). The number of reproducible answers was higher for ChatGPT-4 (95%) and ChatGPT-3.5 (88.3%) than for Google Bard (50%). In terms of accuracy, answers were deemed fully correct in 75.4%, 58.5%, and 43.8% of cases for ChatGPT-4, ChatGPT-3.5, and Google Bard, respectively (P = .03). Furthermore, answers were deemed highly relevant in 71.9% and 77.4% of cases for ChatGPT-4 and ChatGPT-3.5, respectively (P = .04). Regarding readability, answers were deemed readable for ChatGPT-4 (98.1%) and ChatGPT-3.5 (100%), compared with Google Bard (87.5%) (P = .02).
Conclusion
ChatGPT-4 and ChatGPT-3.5 are potentially powerful tools in immuno-oncology, whereas Google Bard demonstrated relatively poorer performance. However, the risk of inaccuracy or incompleteness was evident in all 3 LLMs, highlighting the importance of expert-driven verification of the outputs returned by these technologies.
Japanese Journal of Radiology, Journal Year: 2024, Volume and Issue: 42(12), P. 1392 - 1398, Published: July 20, 2024
Abstract
Purpose
The performance of vision-language models (VLMs) with image interpretation capabilities, such as GPT-4 omni (GPT-4o), GPT-4 vision (GPT-4V), and Claude-3, has not been compared and remains unexplored in specialized radiological fields, including nuclear medicine and interventional radiology. This study aimed to evaluate and compare the diagnostic accuracy of various VLMs, including GPT-4 + GPT-4V, GPT-4o, Claude-3 Sonnet, and Claude-3 Opus, using Japanese diagnostic radiology, nuclear medicine, and interventional radiology (JDR, JNM, and JIR, respectively) board certification tests.
Materials and methods
In total, 383 questions from the JDR test (358 with images), 300 from the JNM test (92 with images), and 322 from the JIR test (96 with images) from 2019 to 2023 were consecutively collected. The accuracy rates of GPT-4 + GPT-4V, GPT-4o, Claude-3 Sonnet, and Claude-3 Opus were calculated for all questions and for questions with images, and the VLMs were compared using McNemar’s test.
Results
GPT-4o demonstrated the highest accuracy rates across all evaluations on the JDR test (all questions, 49%; questions with images, 48%), the JNM test (all questions, 64%; questions with images, 59%), and the JIR test (all questions, 43%; questions with images, 34%). The next-highest accuracy rates were 40% (38% with images) on the JDR test, 42% (43% with images) on the JNM test, and 30% on the JIR test. McNemar’s tests showed that GPT-4o significantly outperformed the other VLMs in most comparisons (P < 0.007 and P < 0.001), with some exceptions.
Conclusion
GPT-4o had the highest success rates for questions with images from the JDR,