Evaluating large language models and agents in healthcare: key challenges in clinical applications
Xiaolan Chen, Jie Xiang, Shanfu Lu, et al.
Intelligent Medicine, 2025
Published: March 1, 2025
Language: English
A Systematic Review of Large Language Models in Medical Specialties: Applications, Challenges and Future Directions
Research Square, 2025
Published: April 16, 2025
Abstract
Background: Large Language Models (LLMs) are one of the artificial intelligence (AI) technologies used to understand and generate text, summarize information, and comprehend contextual cues. LLMs have been increasingly used by researchers in various medical applications, but their effectiveness and limitations are still uncertain, especially across medical specialties.
Objective: This review evaluates recent literature on how LLMs are utilized in research studies across 19 medical specialties. It also explores the challenges involved and suggests areas for future research focus.
Methods: Two researchers performed searches in PubMed, Web of Science, and Scopus to identify studies published from January 2021 to March 2024. The review included studies on the usage of LLMs in performing medical tasks. Data was extracted and analyzed by five reviewers. To assess the risk of bias, quality assessment was performed using the revised tool for artificial intelligence-centered diagnostic accuracy studies (QUADAS-AI).
Results: Results were synthesized through categorical analysis of evaluation metrics, impact types, and validation approaches across medical specialties. A total of 84 studies were included in this review, mainly originating from two countries: the USA (35/84) and China (16/84). Although the reviewed LLM applications are spread across various medical specialties, multi-specialty applications were demonstrated in 22 studies. Various aims include clinical natural language processing (31/84), supporting clinical decision-making (20/84), education (15/84), diagnoses, and patient management and engagement (3/84). GPT-based and BERT-based LLMs were the most used (83/84). Despite the reported positive impacts such as improved efficiency and accuracy, challenges related to reliability and ethics remain. The overall risk of bias was low in 72 studies, high in 11 studies, and not clear in 3 studies.
Conclusion: GPT-based and BERT-based LLMs dominate medical specialty applications, with over 98.8% of the reviewed studies using these models. Despite the potential benefits in process efficiency and diagnostics, a key finding was the substantial variability in performance among LLMs. For instance, LLMs' accuracy ranged from 3% in clinical decision support to 90% in some NLP tasks. Heterogeneity in the utilization of LLMs across diverse medical tasks and contexts prevented meaningful meta-analysis, as studies lacked standardized methodologies, outcome measures, and implementation approaches. Therefore, room for improvement remains wide in developing domain-specific data and establishing standards to ensure reliability and effectiveness.
Language: English