npj Digital Medicine,
Journal year: 2025. Issue: 8(1). Published: May 3, 2025
Large language models generate plausible text responses to medical questions, but inaccurate responses pose significant risks in clinical decision-making. Grading LLM outputs to determine the best model or answer is time-consuming and impractical in clinical settings; therefore, we introduce EVAL (Expert-of-Experts Verification Alignment) to streamline this process and enhance safety for upper gastrointestinal bleeding (UGIB). We evaluated OpenAI's GPT-3.5/4/4o/o1-preview, Anthropic's Claude-3-Opus, Meta's LLaMA-2 (7B/13B/70B), and Mistral AI's Mixtral (7B) across 27 configurations, including zero-shot baseline, retrieval-augmented generation, and supervised fine-tuning. EVAL uses similarity-based ranking and a reward model trained on human-graded responses for rejection sampling. Among the employed similarity metrics, Fine-Tuned ColBERT achieved the highest alignment with human performance across three separate datasets (ρ = 0.81-0.91). The reward model replicated human grading in 87.9% of cases across temperature settings and significantly improved accuracy through rejection sampling, by 8.36% overall. EVAL thus offers a scalable way to assess LLM outputs in high-stakes settings.
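The rejection-sampling step described above can be sketched in a few lines: sample several candidate answers, score each against a trusted reference, and keep the best. This is only an illustration: a toy Jaccard token-overlap similarity stands in for the study's Fine-Tuned ColBERT scorer, and the reference and candidate strings are invented.

```python
# Hypothetical sketch of similarity-based rejection sampling. A crude
# token-overlap (Jaccard) similarity stands in for a learned scorer.

def jaccard_similarity(a: str, b: str) -> float:
    """Token-set overlap between two texts, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def select_best_answer(candidates: list[str], reference: str) -> str:
    """Rejection sampling: keep the candidate closest to the reference."""
    return max(candidates, key=lambda c: jaccard_similarity(c, reference))

# Invented example data, for illustration only.
reference = "urgent endoscopy within 24 hours after resuscitation"
candidates = [
    "schedule outpatient follow-up in two weeks",
    "perform urgent endoscopy within 24 hours after resuscitation",
    "start antibiotics and discharge",
]
best = select_best_answer(candidates, reference)
print(best)  # the candidate sharing the most tokens with the reference
```

In the paper's setting the similarity scorer would compare each sampled model answer against human-vetted answers; the selection logic itself stays this simple.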
Scientific Reports,
Journal year: 2025. Issue: 15(1). Published: Jan. 28, 2025
Recent advancements of large language models (LLMs) like generative pre-trained transformer 4 (GPT-4) have generated significant interest among the scientific community. Yet, the potential of these models to be utilized in clinical settings remains largely unexplored. In this study, we investigated the abilities of multiple LLMs and traditional machine learning models to analyze emergency department (ED) reports and determine whether the corresponding visits were due to symptomatic kidney stones. Leveraging a dataset of manually annotated ED reports, we developed strategies to enhance performance, including prompt optimization, zero- and few-shot prompting, fine-tuning, and prompt augmentation. Further, we implemented fairness assessment and bias mitigation methods to investigate disparities by the LLMs with respect to race and gender. An expert assessed the explanations given by GPT-4 for its predictions, judging whether they were sound, factually correct, unrelated to the input prompt, or potentially harmful. The best results were achieved by GPT-4 (macro-F1 = 0.833, 95% confidence interval [CI] 0.826–0.841), followed by GPT-3.5 (macro-F1 = 0.796, 95% CI 0.796–0.796). Ablation studies revealed that the initial model benefits from fine-tuning. Adding demographic information and prior disease history to the prompts allows the models to make better decisions. Bias assessment found that GPT-4 exhibited no racial or gender disparities, in contrast to GPT-3.5, which failed to handle diversity effectively.
European Journal of Cancer,
Journal year: 2025. Issue: 218, p. 115274. Published: Feb. 4, 2025
Recent advancements in large language models (LLMs) enable real-time web search, improved referencing, and multilingual support, yet ensuring they provide safe health information remains crucial. This perspective evaluates seven publicly accessible LLMs (ChatGPT, Co-Pilot, Gemini, MetaAI, Claude, Grok, Perplexity) on three simple cancer-related queries across eight languages (336 responses: English, French, Chinese, Thai, Hindi, Nepali, Vietnamese, Arabic). None of the 42 English responses contained clinically meaningful hallucinations, whereas 7 of 294 non-English responses did. 48% (162/336) of responses included valid references, but 39% of the references were .com links, reflecting quality concerns. Reading levels frequently exceeded an eighth-grade level, and many outputs were also overly complex. These findings reflect substantial progress over the past 2 years but reveal persistent gaps in accuracy, reliable reference inclusion, referral practices, and readability. Ongoing benchmarking is essential to ensure LLMs safely support users globally and meet online standards.
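The readability gap flagged above is commonly quantified with the Flesch-Kincaid grade-level formula; a rough sketch shows how outputs could be screened against an eighth-grade threshold. The syllable counter is a crude vowel-group heuristic, so scores are approximate, and the sample texts are invented.

```python
# Approximate Flesch-Kincaid grade-level screening of generated text.
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of vowels as syllables, minimum one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    # Standard Flesch-Kincaid grade-level coefficients.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (n / sentences) + 11.8 * (syllables / n) - 15.59

simple = "The test was normal. You are fine. Drink water."
complex_text = ("Esophagogastroduodenoscopy demonstrated nonbleeding "
                "erosive gastropathy necessitating pharmacologic acid suppression.")
print(fk_grade(simple) < 8.0)        # True: short sentences, short words
print(fk_grade(complex_text) > 8.0)  # True: polysyllabic medical jargon
```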
Frontiers in Artificial Intelligence,
Journal year: 2025. Issue: 8. Published: Jan. 27, 2025
Large Language Models (LLMs) offer considerable potential to enhance various aspects of healthcare, from aiding with administrative tasks to clinical decision support. However, despite the growing use of LLMs in healthcare, a critical gap persists in clear, actionable guidelines available to healthcare organizations and providers to ensure their responsible and safe implementation. In this paper, we propose a practical, step-by-step approach to bridge this gap and support the warranted implementation of LLMs into healthcare. The recommendations in this manuscript include protecting patient privacy, adapting models to healthcare-specific needs, adjusting hyperparameters appropriately, ensuring proper medical prompt engineering, distinguishing between clinical decision support (CDS) and non-CDS applications, systematically evaluating LLM outputs using a structured approach, and implementing a solid model governance structure. We furthermore propose the ACUTE mnemonic for assessing LLM responses based on Accuracy, Consistency, semantically Unaltered outputs, Traceability, and Ethical considerations. Together, these recommendations aim to provide a clear pathway for safe and responsible LLM implementation into practice.
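As an illustration only (not taken from the paper), the ACUTE mnemonic can be operationalized as a structured checklist in which a reviewer scores each dimension 0 or 1 and the scores are averaged so responses can be compared on a common scale:

```python
# Hypothetical checklist scorer for the ACUTE dimensions; the dimension
# names follow the mnemonic, the scoring scheme is an assumption.
ACUTE_DIMENSIONS = ("Accuracy", "Consistency", "Unaltered", "Traceability", "Ethics")

def acute_score(ratings: dict[str, int]) -> float:
    """Average the 0/1 reviewer ratings over all five ACUTE dimensions."""
    missing = [d for d in ACUTE_DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"unrated dimensions: {missing}")
    return sum(ratings[d] for d in ACUTE_DIMENSIONS) / len(ACUTE_DIMENSIONS)

# Example review of one LLM response (invented values).
review = {"Accuracy": 1, "Consistency": 1, "Unaltered": 1,
          "Traceability": 0, "Ethics": 1}
print(acute_score(review))  # 0.8
```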
Canadian Association of Radiologists Journal,
Journal year: 2025. Issue: unknown. Published: Feb. 27, 2025
Purpose: Large language models (LLMs) have the potential to support clinical decision-making but often lack training on the latest guidelines. Retrieval-augmented generation (RAG) may enhance guideline adherence by dynamically integrating external information. This study evaluates the performance of two LLMs, GPT-4o and o1-mini, with and without RAG, in adhering to Canadian radiology guidelines for incidental hepatobiliary findings. Methods: A customized RAG architecture was developed to integrate guideline-based recommendations into LLM prompts. Clinical cases were curated and used to prompt each LLM with and without RAG. Primary analyses assessed the guideline adherence rate, with comparisons made between LLMs with and without RAG. Secondary analyses evaluated the reading ease, grade level, and response times of the generated outputs. Results: A total of 319 cases were evaluated. Adherence rates were 81.7% and 97.2% for GPT-4o, and 79.3% and 95.1% for o1-mini (without and with RAG, respectively). Model performance differed significantly across groups, with RAG-enabled configurations outperforming their non-RAG counterparts (P < .05). RAG-enabled configurations demonstrated improved reading ease and lower grade-level scores; however, all model outputs remained at advanced comprehension levels. Response times increased slightly due to the additional retrieval processing but remained clinically acceptable. Conclusions: These findings indicate that RAG improves guideline adherence without compromising readability or response times. The approach holds promise for advancing evidence-based care and warrants further validation in broader settings.
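The retrieval step of such a RAG pipeline can be sketched as follows, assuming a simple keyword-overlap retriever and invented guideline snippets in place of the study's actual corpus and architecture:

```python
# Minimal RAG-style prompt assembly: retrieve the guideline snippets most
# relevant to a case (here by naive keyword overlap) and prepend them to
# the model prompt. All texts below are invented placeholders.
def retrieve(case: str, guidelines: list[str], k: int = 2) -> list[str]:
    """Return the k guideline snippets sharing the most tokens with the case."""
    case_tokens = set(case.lower().split())
    scored = sorted(guidelines,
                    key=lambda g: len(case_tokens & set(g.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(case: str, guidelines: list[str]) -> str:
    context = "\n".join(f"- {g}" for g in retrieve(case, guidelines))
    return (f"Guideline excerpts:\n{context}\n\n"
            f"Case: {case}\nRecommend follow-up per the guidelines above.")

guidelines = [
    "Simple hepatic cyst under 3 cm requires no follow-up imaging.",
    "Gallbladder polyp over 10 mm warrants surgical consultation.",
    "Incidental splenic lesion follow-up depends on patient risk factors.",
]
prompt = build_prompt("Incidental gallbladder polyp measuring 12 mm", guidelines)
print(prompt)
```

A production retriever would use dense or late-interaction embeddings rather than token overlap, but the prompt-assembly pattern is the same.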
Journal of Surgical Oncology,
Journal year: 2024. Issue: unknown. Published: Aug. 19, 2024
Large Language Models (LLM; e.g., ChatGPT) may be used to assist clinicians and form the basis of future clinical decision support (CDS) for colon cancer. The objectives of this study were to (1) evaluate the response accuracy of two LLM-powered interfaces in identifying guideline-based care for simulated clinical scenarios and (2) define the response variation between and within LLMs.