Journal of Hospital Medicine, Journal Year: 2025, Volume and Issue: unknown. Published: May 4, 2025
Abstract
Generative Artificial Intelligence (Gen AI) shows significant promise as a technology that could improve healthcare delivery, but its implementation will be influenced by the spheres in which it is studied and by the limited resources of hospitals. The Point authors argue that we should focus on the cognitive abilities of GenAI or risk being left out of a technological leap that could change the way doctors practice. The Counterpoint argues for using GenAI to ease system burdens and address workflow issues, focusing our efforts on fixing the problems that would improve doctors' quality of life and increase time spent with patients.
medRxiv (Cold Spring Harbor Laboratory), Journal Year: 2025, Volume and Issue: unknown. Published: Jan. 15, 2025
ABSTRACT
Background
Large language models (LLMs) are increasingly evaluated in medical education and clinical decision support, but their performance in highly specialized fields, such as nephrology, is not well established. We compared two advanced LLMs, GPT-4 and the newly released o1 pro, on comprehensive nephrology board renewal examinations.
Methods
We administered 209 Japanese Self-Assessment Questions for Nephrology Board Renewal from 2014–2023 to GPT-4 and o1 pro using ChatGPT pro. Each question, including those with images, was presented in separate chat sessions to prevent contextual carryover. Questions were classified by taxonomy (recall/interpretation/problem-solving), question type (general/clinical), image inclusion, and subspecialty. We calculated the proportion of correct answers and compared performances using chi-square or Fisher's exact tests.
Results
Overall, o1 pro scored 81.3% (170/209), significantly higher than GPT-4's 51.2% (107/209; p<0.001). o1 pro exceeded the 60% passing criterion in every year, while GPT-4 achieved this in only some of the ten years. Across taxonomy levels, question types, and image presence, o1 pro consistently outperformed GPT-4 (p<0.05 for multiple comparisons). Performance differences were also significant in several subspecialties, including chronic kidney disease, confirming o1 pro's broad superiority.
Conclusion
o1 pro substantially outperformed GPT-4 on a comprehensive nephrology board renewal examination, demonstrating stronger reasoning and integration of specialized knowledge. These findings highlight the potential of next-generation LLMs as valuable tools in specialty medical education and possibly clinical decision support, warranting further careful validation.
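
The headline comparison above is a two-proportion test on 170/209 vs 107/209 correct answers. As a minimal sketch (assuming scipy as the tooling, which the paper does not specify), the reported result can be checked from a 2x2 contingency table, with Fisher's exact test as the alternative the authors name for small expected counts:

```python
# Re-derive the reported o1 pro vs GPT-4 comparison from the abstract's counts.
from scipy.stats import chi2_contingency, fisher_exact

n = 209
table = [
    [170, n - 170],  # o1 pro: correct / incorrect (81.3%)
    [107, n - 107],  # GPT-4:  correct / incorrect (51.2%)
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.1e}")  # p < 0.001, matching the abstract

# Fisher's exact test is the usual fallback when expected cell counts are small.
odds_ratio, p_exact = fisher_exact(table)
print(f"Fisher's exact p = {p_exact:.1e}")
```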
BMC Medical Education, Journal Year: 2025, Volume and Issue: 25(1). Published: March 19, 2025
To assess the ability of General Practice (GP) Trainees to detect AI-generated hallucinations in simulated clinical practice, ChatGPT-4o was utilized. The hallucinations were categorized into three types based on the accuracy of the answers and explanations: (1) correct answers with incorrect or flawed explanations, (2) explanations that contradict factual evidence, and (3) entirely incorrect explanations.
This multi-center, cross-sectional survey study involved 142 GP Trainees, all of whom were undergoing Specialist Training and volunteered to participate. The study evaluated the consistency of ChatGPT-4o, as well as the Trainees' response time, accuracy, sensitivity (d'), and response tendencies (β). Binary regression analysis was used to explore the factors affecting the Trainees' ability to identify errors generated by ChatGPT-4o.
A total of 137 participants were included, with a mean age of 25.93 years. Half were unfamiliar with AI, and 35.0% had never used it. ChatGPT-4o's overall accuracy was 80.8%, which slightly decreased to 80.1% after human verification. However, its accuracy for professional practice (Subject 4) was only 57.0%, and after human verification it dropped further to 44.2%. In total, 87 hallucinations were identified, primarily occurring at the application and evaluation levels. The Trainees' accuracy in detecting these hallucinations was 55.0%, with a sensitivity (d') of 0.39. Regression analysis revealed that shorter response times (OR = 0.92, P = 0.02), higher self-assessed AI understanding (OR = 0.16, P = 0.04), and more frequent AI use (OR = 10.43, P = 0.01) were associated with stricter error detection criteria.
The study concluded that GP trainees faced challenges in identifying AI-generated errors, particularly in professional practice scenarios. This highlights the importance of improving AI literacy and critical thinking skills to ensure the effective integration of AI into medical education.
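
The sensitivity (d') and response tendency (β) above are signal detection theory indices derived from hit and false-alarm rates via the inverse normal CDF. A minimal sketch follows; the rates below are invented placeholders chosen to land near the reported d' of 0.39 (the abstract does not give the underlying rates), and scipy is an assumed dependency:

```python
# Signal detection indices: d' (discriminability) and beta (response bias).
import math
from scipy.stats import norm

hit_rate = 0.60          # hypothetical: true AI error correctly flagged
false_alarm_rate = 0.45  # hypothetical: correct AI answer flagged as error

z_hit = norm.ppf(hit_rate)         # inverse normal CDF of hit rate
z_fa = norm.ppf(false_alarm_rate)  # inverse normal CDF of false-alarm rate

d_prime = z_hit - z_fa                     # ~0.38 with these placeholder rates
beta = math.exp((z_fa**2 - z_hit**2) / 2)  # likelihood-ratio criterion
print(f"d' = {d_prime:.2f}, beta = {beta:.2f}")
```

A β near 1 reflects a neutral criterion, with higher values indicating a stricter one, which is the sense in which the regression findings describe "stricter error detection criteria".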
Movement Disorders Clinical Practice, Journal Year: 2025, Volume and Issue: unknown. Published: March 20, 2025
Intelligence is the ability to think logically, to conceptualize and to abstract from reality.1 Its companion, wisdom, is the capacity to grasp human nature, which is paradoxical, contradictory, and subject to continual change.1 Both of these constructs are key to the practice of medicine and often improve with age and clinical experience.
Artificial "intelligence" (AI) refers to machines that recognize and "learn" patterns from complex data, predict outcomes and help in decision-making.2 AI has been heralded as a new era in medicine, one that will take over medical diagnosis and management. With this background, we draw attention to the shortcomings and potential dangers of AI in our own specialty of movement disorders.
The term was coined in 1956, but in the last few years AI has made considerable progress and is no longer science fiction.3 A quick Google search of even reputable and generally trustworthy sources reveals frequent promotional slogans such as "AI is revolutionizing healthcare as we know it" and "2023: the year of groundbreaking advances in AI computing".
Meta launched "Galactica", a large language model (LLM) based on a training dataset of 48 million examples from scientific articles, textbooks, websites, lecture notes, and encyclopedias. The purpose behind Galactica was to have a single AI-based tool that could summarize all academic articles, write code and annotate molecules. It lasted a total of 3 days online after it was found to be unable to distinguish truth from fiction and could "hallucinate" data.4
Rapid adoption has led to increased concerns about the absence of published negative results, and some top researchers are concerned about companies adopting "shiny products" over safety.5, 6 Leading voices have expressed alarm at the low regulatory bar for transparency.7 The skill of writers has been questioned as hiring emphasis shifts toward personnel with an aptitude for AI, mentions of which saw a 142X increase on LinkedIn.8
In medicine, diagnostic accuracy carries great importance because the consequence of error is harm to patients. Research on "Artificial intelligence" was funded by the NIH to the tune of ~$1.1 billion in 2023. Correspondingly, the number of Medline publications on the topic has greatly increased. Yet a recent study showed that only 20% of studies using neuroimaging in Parkinson's disease passed minimal quality criteria, only 8% used external test sets, and performance on those sets was lower.9
A systematic review of fifty-five relevant studies on the use of AI in PD found that only three were validated on external data and only five had a low risk of bias.10 The field of movement disorders relies on a detailed history and a focused neurological examination to arrive at a diagnosis. The gold standard diagnostic criteria for the most common diagnoses, such as Parkinson's disease and essential tremor, depend on clinical acumen, in part on what we observe in clinic, and on intuition and tacit knowledge.
Key documented evidence consists of case reports, case series, and videos with small sample sizes, given the rarity of many diagnoses. As such, these may not serve as appropriately comprehensive datasets for training a model. Different forms of chance have played a major role in the field, including the story of the introduction of L-DOPA into medicine. Dr. Langston encountered MPTP-induced parkinsonism while he was enjoying his coffee, annoyed at being interrupted by residents.11 Amantadine was originally introduced and utilized as an antiviral medication.12
While entertaining the thought of an algorithm replacing the neurologist, we must remember that diagnosis is the first and easiest aspect of care. Telling the diagnosis, however, requires nuance, grace and empathy. Perhaps the question that matters with regard to machine learning is: would we trust it with diagnosing our family members? The answer is highlighted by a study that compared AI, AI + physician, and physician alone: comprehensibility was similar across the groups, but empathy, reliability and willingness to follow the advice were significantly better for the physician.13 A survey of 1400 US adults revealed that 69% of them were uncomfortable with AI.14
We need to continue investing in and supporting meticulous, caring physicians, not just state-of-the-art technology. The demise of expertise in fields like pathology and radiology at the hands of AI has long been predicted. However, certain pitfalls make this unlikely. Missing data can lead to bias. Rare conditions may be missed or overcompensated for by the model, leading to overdiagnosis or misdiagnosis. LLMs may reinforce outdated practices.15, 16 Unlike physicians, AI is insensitive to the impact of its decisions and, therefore, may not demonstrate safety behavior and/or recognize its limitations. There is also the question of accountability should an AI-led calculation be in error.
Perhaps the most voiced concern is the mismatch between performance in testing and in the real world, with known and unknown errors.15, 16 Such a discrepancy was noted in a study conducted at Stanford in Gastroenterology.17 Preliminary data indicated that AI could detect polyps during colonoscopy, but a subsequent trial was negative. The authors acknowledged the difference and added that the messier real world can introduce noise that alters the efficacy of AI tools.17 Concerns about and disappointment with opacity have been voiced by computational scientists at the University of Toronto, noting that the trend of excitement around AI feels like an "advertisement for a cool technology" instead of having a basis in science.16 There has been substantial movement towards the incorporation of AI into electronic health records to alleviate the documentation burden.18 This is not without risks.
In addition to hallucination, AI can misinterpret recorded text. In one example, a patient's issues with hands, feet and mouth were scribed as hand, foot and mouth disease. AI can also create chart bloat, requiring additional time for screening out errors. Overall, when used to document a discussion, there was greater accuracy in summarizing information than in synthesizing data.19 As we look at options to expand care, we must ensure that AI tools are not biased or inadequately tested, thereby introducing inequity between resource-rich and resource-poor nations. Recognizing these issues, the World Health Organization has called for careful evaluation of AI tools, especially for biases, before they are adopted in low-resource settings, with the intent of reducing inequity.
The FDA has called for "nimble regulation" to avoid being "swept up in something we hardly understand". Its commissioner recognizes that models will likely "evolve" after implementation and require "continuous adjustment to remain accurate".20, 21 Artificial "intelligence" is a poor surrogate for human intelligence and interaction, and AI performs poorly on cognitive tests.22 It is clear that AI will never replace the physician, while paradoxically it could improve the patient-doctor relationship. Over the years, a practice environment with a high administrative burden has stifled innovation and fostered burnout and job dissatisfaction.23-25 Several reports cite it as a primary contributor to the physician exodus.26 What has been termed "Boring AI" offers hope for lowering these burdens and making practice sustainable and pleasurable.27
Quality measures cost hospitals 5 billion dollars annually, along with precious manpower. In one report, it took less than an hour to draft chart abstractions with >90% accuracy using AI.28 AI, along with video filming, does offer an interesting future for data collection and synthesis for review by the neurologist.29 This approach does not seek to supplant history taking or the physical examination, but to complement good patient care and streamline workflows, reducing costs 17-fold while preserving reliability.30 Once ready and introduced with caution, smart systems could potentially help with scheduling, stratification and resource allotment. Many approved medications require prior authorization for coverage; AI could help draft letters of medical necessity and create reading material for patients.31, 32 Physicians could therefore spend more time with the patient.
The patient must always be at the center of any decision related to AI, and it is important to acknowledge the narrative, subjective nature of medicine. Despite its potential, AI is currently neither consistent nor accountable enough for independent diagnostics. As we understand it, generative AI is, at best, predictive, and its errors require expert judgment.33 Its accuracy further drops substantially when making diagnoses from simulated conversations, highlighting its limitations as an independent entity.34 The demise of neurologists was once predicted with the advent of brain imaging.35 Instead, neurologists embraced its use, generated evidence, and incorporated it into multiple aspects of practice. It is imperative to partner with clinicians in development and exhaustive testing rather than rushing premature and dangerous deployment direct to patients.
Author Roles: (1) Research Project: A. Conception, B. Organization, C. Execution; (2) Statistical Analysis: A. Design, B. Execution, C. Review and Critique; (3) Manuscript: A. Writing of the First Draft, B. Review and Critique.
A.M.: 1A, 1B, 1C, 3A
A.J.L.: 3B
Ethical Compliance Statement: This document was written following ethical guidelines, without the need for IRB approval. Informed consent was not necessary for this work. All authors have read and complied with the Journal's Publication Guidelines. We confirm that we have read the Journal's position on issues involved in ethical publication and affirm that this work is consistent with those guidelines.
Funding Sources and Conflicts of Interest: The authors declare that no funding was received for this effort. There are no competing financial interests or personal relationships that appeared to influence the work reported in this paper.
Financial Disclosures From the Last 12 Months: AM has served on an advisory board for Adaptive Biosciences. He serves as an associate editor of MDCP. AJL reports consultancies for Britannia Pharmaceuticals and BIAL Portela. He has received honoraria from Britannia Pharmaceuticals, BIAL, Convatec and FMQ Brazil.
Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.
BACKGROUND
Perception-based studies are susceptible to bias introduced through the design of the instruments used. We demonstrate the need to shift from perception-based to usage-based trust evaluation, emphasizing that trust must be earned through demonstrated reliability rather than assumed from pre-adoption surveys. Our findings suggest that successful AI implementation requires a proactive approach that addresses the complex interplay of human, technical, and organizational factors, grounded in real-world usage data rather than theoretical, perception-driven acceptance measures.
Our
findings
suggest
successful
AI
implementation
requires
a
proactive
approach
addresses
complex
interplay
human,
technical,
and
organizational
factors,
grounded
in
real-world
usage
data
theoretical,
perception-driven
acceptance
measures.
OBJECTIVE
To examine the disconnect between expectations and post-implementation realities in healthcare systems.
METHODS
We assessed the key perception-driven models, namely the Unified Theory of Acceptance and Use of Technology (UTAUT), the Technology Acceptance Model (TAM), and the Diffusion of Innovation (DOI) theory, with regards to healthcare. We then matched predictions made using these models against real-world results and post-usage evidence.
RESULTS
Through empirical and anecdotal evidence, this paper demonstrates the gap between technology adoption frameworks and actual usage, focusing on the human factors that influence adoption and the shortcomings of current perception-focused research.
CONCLUSIONS
Real-world results often fall short of the hype, which may underlie the reluctance or resistance of providers to fully adopt AI.
NEJM AI, Journal Year: 2025, Volume and Issue: 2(4). Published: March 25, 2025
Improved performance of large language models (LLMs) on traditional reasoning assessments has led to benchmark saturation. This has spurred efforts to develop new benchmarks, including synthetic computational simulations of clinical practice involving multiple AI agents. We argue that it is crucial to ground such benchmarks in extensive human validation. We conclude by providing four recommendations for researchers to better evaluate LLMs for clinical practice.
Research Square (Research Square), Journal Year: 2025, Volume and Issue: unknown. Published: April 15, 2025
Abstract
Large language models (LLMs) possess extensive medical knowledge and demonstrate impressive performance in answering diagnostic questions. However, responding to such questions differs significantly from actual clinical procedures. Real-world diagnostics involve a dynamic, iterative process that includes hypothesis refinement and targeted data collection. This complex task is both challenging and time-consuming, demanding a significant portion of physicians' workload and healthcare resources.
Therefore, evaluating and enhancing LLM performance in real-world diagnostic procedures is crucial for deployment. In this study, a framework was developed to assess LLMs' capability to complete diagnostic encounters, including history taking, physical examination, diagnostic tests and diagnosis. A benchmark dataset of 4,421 cases was curated, covering rare and common diseases across 32 specialties. Clinical evaluation methods were used to comprehensively evaluate the models GPT-4o-mini, GPT-4o, Claude-3-Haiku, Qwen2.5-72b, Qwen2.5-34b, and Qwen2.5-14b. Although these models performed well on diagnostic questions, they consistently underperformed in diagnostic procedures and exhibited a number of errors. To address these challenges, ClinDiag-GPT was trained on over 8,000 cases. It emulates physicians' diagnostic reasoning, collects information in line with clinical practice, and recommends key tests and definitive diagnoses. ClinDiag-GPT outperformed the other LLMs in diagnostic accuracy and procedural performance. We further compared ClinDiag-GPT alone, ClinDiag-GPT in collaboration with physicians, and physicians alone. Collaboration between ClinDiag-GPT and physicians enhanced diagnostic accuracy and efficiency, demonstrating ClinDiag-GPT's potential as a valuable diagnostic assistant.
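
The encounter-completion task evaluated here is inherently a multi-turn loop rather than single-shot question answering. The sketch below shows one plausible shape for such an evaluation loop, under stated assumptions: query_llm stands in for any chat-completion client, answer_lookup for a case database that returns the requested finding, and the stop phrase is invented; none of this is the ClinDiag authors' actual interface.

```python
# Hypothetical iterative diagnostic encounter: the model requests history,
# exam findings, or tests one step at a time, then commits to a diagnosis.
from typing import Callable

Message = dict[str, str]

def run_encounter(case_intro: str,
                  answer_lookup: Callable[[str], str],
                  query_llm: Callable[[list[Message]], str],
                  max_turns: int = 10) -> str:
    messages: list[Message] = [
        {"role": "system", "content": (
            "You are the treating physician. Request ONE item of history, "
            "physical examination, or testing per turn. When confident, "
            "answer 'FINAL DIAGNOSIS: <diagnosis>'.")},
        {"role": "user", "content": case_intro},
    ]
    for _ in range(max_turns):
        reply = query_llm(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.strip().upper().startswith("FINAL DIAGNOSIS"):
            return reply  # scored offline against the gold diagnosis
        # Reveal only the requested finding, mimicking a real encounter.
        messages.append({"role": "user", "content": answer_lookup(reply)})
    return "FINAL DIAGNOSIS: undetermined (turn limit reached)"
```

Scoring both the final answer and the sequence of requests is what separates procedural performance from plain question-answering accuracy.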
npj Digital Medicine, Journal Year: 2025, Volume and Issue: 8(1). Published: March 7, 2025
Red teaming, the practice of adversarially exposing unexpected or undesired model behaviors, is critical towards improving the equity and accuracy of large language models, but non-model creator-affiliated red teaming is scant in healthcare. We convened teams of clinicians, medical and engineering students, and technical professionals (80 participants in total) to stress-test models with real-world clinical cases and categorize inappropriate responses along the axes of safety, privacy, hallucinations/accuracy, and bias. Six medically-trained reviewers re-analyzed the prompt-response pairs and added qualitative annotations. Of 376 unique prompts (1504 responses), 20.1% of responses were inappropriate (GPT-3.5: 25.8%; GPT-4.0: 16%; GPT-4.0 with Internet: 17.8%). Subsequently, we show the utility of our benchmark by testing GPT-4o, a model released after the event (20.4% inappropriate). 21.5% of responses that were appropriate with GPT-3.5 became inappropriate in updated models. We share insights for constructing red teaming prompts, and present our benchmark for iterative model assessments.
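
The per-model rates quoted here are simple tallies over labeled prompt-response pairs. As a minimal sketch (toy records and assumed field names, not the study's data), the bookkeeping could look like this:

```python
# Tally inappropriate-response rates per model from labeled pairs.
from collections import Counter

# Toy records of (model, response_was_appropriate); illustrative only.
labels = [
    ("GPT-3.5", False), ("GPT-3.5", True), ("GPT-3.5", True),
    ("GPT-4.0", True), ("GPT-4.0", True),
    ("GPT-4.0 Internet", True), ("GPT-4.0 Internet", False),
]

totals, inappropriate = Counter(), Counter()
for model, ok in labels:
    totals[model] += 1
    if not ok:
        inappropriate[model] += 1

for model in sorted(totals):
    pct = 100 * inappropriate[model] / totals[model]
    print(f"{model}: {pct:.1f}% inappropriate "
          f"({inappropriate[model]}/{totals[model]})")
```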