JCO Clinical Cancer Informatics, Journal Year: 2024, Volume and Issue: 8, Published: Dec. 1, 2024
We examined the effectiveness of proprietary and open large language models (LLMs) in detecting disease presence, location, and treatment response in pancreatic cancer from radiology reports.
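As an illustration of this kind of report-level extraction, the following is a minimal sketch that prompts an LLM for structured labels; the OpenAI client, model name, prompt wording, and label set are assumptions for illustration, not the pipeline used in the study.

```python
# Minimal illustrative sketch only: prompting an LLM for structured labels from a
# pancreatic-cancer radiology report. The client, model name, prompt wording, and
# label set are assumptions, not the study's actual pipeline.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are reviewing a radiology report for a patient with suspected pancreatic cancer.\n"
    "Report:\n{report}\n\n"
    "Answer in JSON with keys: disease_present (yes/no), "
    "location (free text or 'not stated'), "
    "treatment_response (improved/stable/worse/not stated)."
)

def label_report(report_text: str, model: str = "gpt-4o") -> dict:
    """Ask the model for the three labels and parse its JSON reply."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # low output variability is preferable for extraction tasks
        messages=[{"role": "user", "content": PROMPT.format(report=report_text)}],
    )
    # Assumes the model returns well-formed JSON; real pipelines need error handling.
    return json.loads(response.choices[0].message.content)
```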
The Lancet Digital Health, Journal Year: 2024, Volume and Issue: 6(9), P. e662 - e672, Published: Aug. 23, 2024
Amid the rapid integration of artificial intelligence in clinical settings, large language models (LLMs), such as Generative Pre-trained Transformer-4, have emerged as multifaceted tools that hold potential for health-care delivery, diagnosis, and patient care. However, the deployment of LLMs raises substantial regulatory and safety concerns. Due to their high output variability, poor inherent explainability, and the risk of so-called AI hallucinations, LLM-based applications that serve a medical purpose face challenges in gaining approval as medical devices under US and EU laws, including the recently passed Artificial Intelligence Act. Despite unaddressed risks to patients, including misdiagnosis and unverified advice, such applications are available on the market. The ambiguity surrounding these applications creates an urgent need for regulatory frameworks that accommodate their unique capabilities and limitations. Alongside the development of such frameworks, existing regulations should be enforced. If regulators fear enforcing them in a market dominated by supply from technology companies, the consequences of layperson harm will force belated action, damaging the potential of LLM-based advice.
Scientific Reports, Journal Year: 2025, Volume and Issue: 15(1), Published: Jan. 28, 2025
Recent advancements of large language models (LLMs) like Generative Pre-trained Transformer 4 (GPT-4) have generated significant interest among the scientific community. Yet, the potential for these models to be utilized in clinical settings remains largely unexplored. In this study, we investigated the abilities of multiple LLMs and traditional machine learning models to analyze emergency department (ED) reports and determine whether the corresponding visits were due to symptomatic kidney stones. Leveraging a dataset of manually annotated ED reports, we developed strategies to enhance performance, including prompt optimization, zero- and few-shot prompting, fine-tuning, and augmentation. Further, we implemented fairness assessment and bias mitigation methods to investigate disparities with respect to race and gender. A clinical expert assessed the explanations generated by GPT-4 for its predictions to determine whether they were sound, factually correct, unrelated to the input prompt, or potentially harmful. The best results were achieved by GPT-4 (macro-F1 = 0.833, 95% confidence interval [CI] 0.826–0.841), followed by GPT-3.5 (macro-F1 = 0.796, 95% CI 0.796–0.796). Ablation studies revealed that the initial model benefits from fine-tuning. Adding demographic information and prior disease history to the prompts allows the models to make better decisions. Bias assessment found that GPT-4 exhibited no racial or gender disparities, in contrast to GPT-3.5, which failed to handle racial and gender diversity effectively.
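For reference, the reported metric (macro-F1 with a 95% CI) can be computed in form, though not in value, with a simple bootstrap; the labels and resampling scheme below are illustrative assumptions, not the authors' exact procedure.

```python
# Illustrative sketch of the reported metric: macro-F1 with a bootstrapped 95% CI.
# Labels and the resampling scheme are assumptions, not the authors' exact setup.
import numpy as np
from sklearn.metrics import f1_score

def macro_f1_with_ci(y_true, y_pred, n_boot=1000, seed=0):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    point = f1_score(y_true, y_pred, average="macro")
    rng = np.random.default_rng(seed)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample reports with replacement
        boot.append(f1_score(y_true[idx], y_pred[idx], average="macro"))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return point, (lo, hi)

# Toy example: 1 = visit due to symptomatic kidney stones, 0 = other cause.
print(macro_f1_with_ci([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0]))
```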
Mayo Clinic Proceedings Digital Health, Journal Year: 2024, Volume and Issue: 3(1), P. 100184 - 100184, Published: Nov. 29, 2024
Large language models (LLMs) are a type of artificial intelligence which operate by predicting and assembling sequences of words that are statistically likely to follow from a given text input. With this basic ability, LLMs are able to answer complex questions and follow extremely detailed instructions. Products created using LLMs, such as ChatGPT by OpenAI and Claude by Anthropic, have gained a huge amount of traction and user engagement and have revolutionized the way we interact with technology, bringing a new dimension to human-computer interaction.
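A minimal sketch of the next-word-prediction step described above, assuming the Hugging Face transformers library; GPT-2 is used only because it is small, not because the review discusses it.

```python
# Sketch of next-token prediction: score every vocabulary token given the prompt
# and inspect the most likely continuations. Model choice (gpt2) is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The patient was admitted with acute"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits           # a score per vocabulary token at each position

probs = torch.softmax(logits[0, -1], dim=-1)  # probability distribution over the next token
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r:>15}  p={p.item():.3f}")
```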
Fine-tuning is a process in which a pretrained model, such as an LLM, is further trained on a custom data set to adapt it for specialized tasks or domains. In this review, we outline some major methodologic approaches and techniques that can be used to fine-tune LLMs for specialized use cases and enumerate the general steps required for carrying out LLM fine-tuning. We then illustrate a few of these by describing several specific fine-tuning use cases across medical subspecialties. Finally, we close with a consideration of the benefits and limitations associated with these use cases, with an emphasis on concerns specific to the field of medicine.
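As a concrete illustration of the workflow outlined above, here is a minimal parameter-efficient fine-tuning sketch using LoRA adapters; the base model, data file, and hyperparameters are placeholders, and the review's own methodologic recommendations may differ.

```python
# Minimal LoRA fine-tuning sketch via the peft library. "gpt2" and
# "clinical_notes.txt" are placeholders, not recommendations.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"                                   # stand-in for any pretrained causal LLM
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token       # gpt2 has no dedicated pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Wrap the pretrained weights with small trainable LoRA adapters instead of
# updating every parameter.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

# Custom domain text (hypothetical file); tokenize it for causal-LM training.
data = load_dataset("text", data_files={"train": "clinical_notes.txt"})
data = data.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-llm", num_train_epochs=1,
                           per_device_train_batch_size=2, learning_rate=2e-4),
    train_dataset=data["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```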
European Journal of Cancer, Journal Year: 2025, Volume and Issue: 218, P. 115274 - 115274, Published: Feb. 4, 2025
Recent advancements in large language models (LLMs) enable real-time web search, improved referencing, and multilingual support, yet ensuring they provide safe health information remains crucial. This perspective evaluates seven publicly accessible LLMs (ChatGPT, Co-Pilot, Gemini, MetaAI, Claude, Grok, and Perplexity) on three simple cancer-related queries across eight languages (336 responses: English, French, Chinese, Thai, Hindi, Nepali, Vietnamese, and Arabic). None of the 42 English responses contained clinically meaningful hallucinations, whereas 7 of the 294 non-English responses did. Overall, 48 % (162/336) of responses included valid references, but 39 references were .com links, reflecting quality concerns. Readability frequently exceeded an eighth-grade level, and many outputs were also complex. These findings reflect substantial progress over the past two years but reveal persistent gaps in accuracy, reliable reference inclusion, referral practices, and readability. Ongoing benchmarking is essential to ensure that LLMs safely support a global audience and meet online standards.
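The readability gap noted above can be checked programmatically; the sketch below uses the Flesch-Kincaid grade level via the textstat package as one possible proxy, which is an assumption rather than the authors' stated method.

```python
# Illustrative readability check against the eighth-grade target mentioned above.
# The textstat package and the sample answer are assumptions, not the authors' method.
import textstat

answer = ("Pancreatic cancer is often found late because early symptoms are vague. "
          "Doctors may use scans and blood tests to look for it.")

grade = textstat.flesch_kincaid_grade(answer)
print(f"Flesch-Kincaid grade level: {grade:.1f}")
print("within eighth-grade target" if grade <= 8 else "exceeds eighth-grade target")
```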
Critical Care, Journal Year: 2025, Volume and Issue: 29(1), Published: Feb. 10, 2025
Abstract
Background
Large language models (LLMs) show increasing potential for use in healthcare, for administrative support and clinical decision making. However, reports on their performance in critical care medicine are lacking.
Methods
This study evaluated five LLMs (GPT-4o, GPT-4o-mini, GPT-3.5-turbo, Mistral Large 2407, and Llama 3.1 70B) on 1181 multiple choice questions (MCQs) from the gotheextramile.com database, a comprehensive database of questions at European Diploma in Intensive Care examination level. Their performance was compared with random guessing and with 350 human physicians on a 77-MCQ practice test. Metrics included accuracy, consistency, and domain-specific performance. Costs, as a proxy for energy consumption, were also analyzed.
Results
GPT-4o achieved the highest accuracy at 93.3%, followed by Llama 3.1 70B (87.5%), Mistral Large 2407 (87.9%), GPT-4o-mini (83.0%), and GPT-3.5-turbo (72.7%). Random guessing yielded 41.5% (p < 0.001). On the practice test, all models surpassed the human physicians, scoring 89.0%, 80.9%, 84.4%, 80.3%, and 66.5%, respectively, compared with 42.7% for random guessing (p < 0.001) and 61.9% for the physicians. In contrast to the other models (p < 0.001), GPT-3.5-turbo's score did not significantly outperform that of the physicians (p = 0.196). Despite high overall accuracy, all models gave consistently incorrect answers to some questions. The most expensive model was GPT-4o, costing over 25 times more than the least expensive model, GPT-4o-mini.
Conclusions
LLMs exhibit exceptional performance on critical care MCQs, with four models outperforming human physicians on a European-level practice exam. GPT-4o led in accuracy but raised concerns about costs and energy consumption. Despite their strong performance in critical care, the models produced consistently incorrect answers to some questions, highlighting the need for thorough and ongoing evaluations to guide responsible implementation in clinical settings.
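The comparison against random guessing reported above can be illustrated with a simple binomial test; the counts below are stand-ins derived from the reported percentages, and the study's actual statistical procedure may differ.

```python
# Illustrative comparison of model accuracy against the reported random-guessing
# rate using a binomial test. The correct-answer count is derived from the reported
# percentage for illustration only.
from scipy.stats import binomtest

n_questions = 1181
n_correct = round(0.933 * n_questions)   # GPT-4o-level accuracy on the full MCQ set
chance_rate = 0.415                      # empirical random-guessing accuracy reported

result = binomtest(n_correct, n_questions, p=chance_rate, alternative="greater")
print(f"accuracy = {n_correct / n_questions:.3f}, p-value = {result.pvalue:.3g}")
```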