BMC Medical Informatics and Decision Making, Journal year: 2024, Issue: 24(1). Published: Nov. 29, 2024
Abstract
Background
Owing to the rapid growth in popularity of Large Language Models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs.
Objective
This study reviews studies on LLM evaluations in the medical field and analyzes the research methods used in these studies. It aims to provide a reference for future researchers designing LLM evaluation studies.
Methods & materials
We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of evaluation methods, the number of questions (queries), the evaluators, repeat measurements, additional analyses, the use of prompt engineering, and metrics other than accuracy.
Results
A total of 142 articles met the inclusion criteria. LLM evaluation was primarily categorized as either providing test examinations (n = 53, 37.3%) or being evaluated by a medical professional (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) and combinations of the two (n = 4, 2.8%). Among the test examination studies, most had 100 or fewer questions (n = 18, 29.0%), 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For assessment by medical professionals, most studies used 50 or fewer queries (n = 54, 64.3%), most had two or more evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering.
Conclusions
More research is required regarding the application of LLMs in healthcare. Although previous studies have focused on evaluating performance, future studies will likely focus on improving performance. A well-structured methodology is required for such studies to be conducted systematically.
Future Internet, Journal year: 2023, Issue: 15(12), P. 375 - 375. Published: Nov. 23, 2023
Large language models (LLMs) excel in providing natural responses that sound authoritative, reflect knowledge of the context area, and can present from a range of varied perspectives. Agent-based simulations consist of simulated agents that interact within a simulated environment to explore societal, social, and ethical, among other, problems. Simulated agents generate large volumes of data, and discerning useful and relevant content is an onerous task. LLMs can help in communicating agents' perspectives on key life events by providing natural language narratives. However, these narratives should be factual, transparent, and reproducible. Therefore, we present a structured narrative prompt for sending queries to LLMs, we experiment with the narrative generation process using OpenAI's ChatGPT, and we assess statistically significant differences across 11 Positive and Negative Affect Schedule (PANAS) sentiment levels between the generated narratives and real tweets using chi-squared tests and Fisher's exact tests. The narrative prompt structure effectively yields narratives with the desired components from ChatGPT. In four out of forty-four categories, ChatGPT-generated narratives had sentiment scores that were not discernibly different, in terms of statistical significance (alpha level α = 0.05), from the sentiment expressed in real tweets. Three outcomes are provided: (1) a list of benefits and challenges for LLMs in narrative generation; (2) a structured prompt for requesting narratives from an LLM chatbot based on simulated agents' information; and (3) an assessment of statistical significance in the sentiment prevalence of the generated narratives compared to real tweets. This indicates promise in the utilization of LLMs for helping to connect a simulated agent's experiences with real people.
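To make the abstract's statistical comparison concrete, here is a minimal sketch of a chi-squared test and Fisher's exact test on a 2x2 contingency table of sentiment prevalence; the counts are invented for illustration and are not the paper's data.

```python
# Illustrative comparison of sentiment prevalence between ChatGPT-generated
# narratives and real tweets, in the spirit of the abstract's chi-squared and
# Fisher's exact tests. Counts below are invented for demonstration only.
from scipy.stats import chi2_contingency, fisher_exact

# Rows: generated narratives, real tweets.
# Columns: messages expressing a given PANAS sentiment, messages not expressing it.
table = [[37, 63],   # hypothetical counts for generated narratives
         [52, 148]]  # hypothetical counts for real tweets

chi2, p_chi2, dof, _ = chi2_contingency(table)
odds_ratio, p_fisher = fisher_exact(table)  # exact test for small expected counts

alpha = 0.05
print(f"chi-squared p = {p_chi2:.4f}, Fisher exact p = {p_fisher:.4f}")
print("discernibly different" if min(p_chi2, p_fisher) < alpha
      else "not discernibly different")
```

Fisher's exact test is the usual fallback when expected cell counts are too small for the chi-squared approximation, which is presumably why the paper reports both.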
medRxiv (Cold Spring Harbor Laboratory), Journal year: 2024, Issue: unknown. Published: July 22, 2024
Large language models (LLMs) show promise in supporting differential diagnosis, but their performance is challenging to evaluate due to the unstructured nature of their responses. To assess the current capabilities of LLMs to diagnose genetic diseases, we benchmarked these models on 5,213 case reports using the Phenopacket Schema, the Human Phenotype Ontology, and the Mondo disease ontology. Prompts generated from each phenopacket were sent to three generative pretrained transformer (GPT) models. The same phenopackets were used as input to a widely used diagnostic tool, Exomiser, in phenotype-only mode. The best LLM ranked the correct diagnosis first in 23.6% of cases, whereas Exomiser did so in 35.5% of cases. While LLM performance for differential diagnosis has been improving, it has not reached the level commonly seen with traditional bioinformatics tools. Future research is needed to determine the best approach for incorporating LLMs into diagnostic pipelines.
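As a rough illustration of the benchmarking setup described above, the sketch below turns a phenopacket's HPO-coded features into a diagnostic prompt and scores rank-1 (top-k) accuracy over a set of cases. The field names follow the GA4GH Phenopacket Schema JSON layout, but the prompt wording and helper functions are hypothetical, not the authors' code.

```python
# Minimal sketch (not the authors' pipeline) of turning a phenopacket's HPO
# terms into a diagnostic prompt and scoring whether the correct diagnosis
# is ranked first. Field names follow the GA4GH Phenopacket Schema JSON.
import json

def phenopacket_to_prompt(path: str) -> str:
    with open(path) as f:
        pp = json.load(f)
    # phenotypicFeatures holds HPO-coded observations, e.g. HP:0001250 "Seizure".
    labels = [feat["type"]["label"] for feat in pp.get("phenotypicFeatures", [])
              if not feat.get("excluded", False)]
    return ("A patient presents with the following clinical features: "
            + "; ".join(labels)
            + ". List the most likely genetic diagnoses, ranked from most "
              "to least likely.")

def top_k_accuracy(ranked_ids: list[list[str]], truths: list[str], k: int = 1) -> float:
    # ranked_ids[i] is the model's ranked list of disease IDs for case i.
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_ids, truths))
    return hits / len(truths)
```

Under this scheme, the abstract's headline numbers correspond to top_k_accuracy(..., k=1): 0.236 for the best LLM versus 0.355 for Exomiser.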
medRxiv (Cold Spring Harbor Laboratory), Journal year: 2025, Issue: unknown. Published: Feb. 28, 2025
Large language models (LLMs) are increasingly used in the medical field for diverse applications including differential diagnostic support. The estimated training data used to create LLMs such as the Generative Pretrained Transformer (GPT) models predominantly consist of English-language texts, but the models could be used across the globe to support diagnostics if language barriers could be overcome. Initial pilot studies on the utility of LLMs for differential diagnosis in languages other than English have shown promise, but a large-scale assessment of the relative performance of these models in a variety of European and non-European languages on a comprehensive corpus of challenging rare-disease cases is lacking. We created 4967 clinical vignettes using structured data captured with Human Phenotype Ontology (HPO) terms in the Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema. These vignettes span a total of 378 distinct genetic diseases and 2618 associated phenotypic features. We used translations of the HPO together with language-specific templates to generate prompts in English, Chinese, Czech, Dutch, German, Italian, Japanese, Spanish, and Turkish. We applied GPT-4o, version gpt-4o-2024-08-06, to the task of delivering a ranked differential diagnosis in response to a zero-shot prompt. An ontology-based approach using the Mondo disease ontology was applied to map synonyms and disease subtypes to the correct diagnoses in order to automate the evaluation of the LLM responses. For English, GPT-4o placed the correct diagnosis at the first rank in 19·8% of cases and within the top-3 ranks 27·0% of the time. In comparison, for the eight non-English languages tested here, the correct diagnosis was placed at rank 1 in between 16·9% and 20·5% of cases and within the top-3 ranks in between 25·3% and 27·7% of cases. Performance was consistent across the nine languages tested. This suggests that GPT-4o may be useful for diagnostic support in multilingual settings. This work was supported by NHGRI grants 5U24HG011449 and 5RM1HG010860. P.N.R. was supported by a Professorship of the Alexander von Humboldt Foundation; P.L. was supported by a National Grant (PMP21/00063 ONTOPREC-ISCIII, Fondos FEDER).
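The ontology-based evaluation step can be pictured as follows: a free-text answer counts as correct if it matches the ground-truth Mondo term, one of its synonyms, or a descendant (subtype). The sketch below uses a hand-made toy ontology for illustration; the real pipeline would query the full Mondo ontology.

```python
# Illustrative sketch (not the authors' pipeline) of ontology-based answer
# matching: an LLM's free-text diagnosis counts as correct if it matches the
# ground-truth Mondo term, a synonym, or a descendant (subtype).
# The toy ontology below is hand-made for demonstration.
TOY_MONDO = {
    "MONDO:0007739": {
        "label": "Huntington disease",
        "synonyms": {"huntington's disease", "huntington chorea"},
        "children": {"MONDO:0024237"},  # hypothetical subtype link
    },
    "MONDO:0024237": {
        "label": "juvenile Huntington disease",
        "synonyms": {"juvenile-onset huntington disease"},
        "children": set(),
    },
}

def descendants(mondo_id: str) -> set[str]:
    out, stack = set(), [mondo_id]
    while stack:
        cur = stack.pop()
        for child in TOY_MONDO[cur]["children"]:
            if child not in out:
                out.add(child)
                stack.append(child)
    return out

def matches_truth(llm_answer: str, truth_id: str) -> bool:
    answer = llm_answer.strip().lower()
    for node in {truth_id} | descendants(truth_id):
        entry = TOY_MONDO[node]
        if answer == entry["label"].lower() or answer in entry["synonyms"]:
            return True
    return False

assert matches_truth("Juvenile Huntington disease", "MONDO:0007739")
```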
npj Digital Medicine, Journal year: 2025, Issue: 8(1). Published: March 22, 2025
Abstract
While generative artificial intelligence (AI) has shown potential in medical diagnostics, comprehensive evaluation of its diagnostic performance and comparison with physicians has not been extensively explored. We conducted a systematic review and meta-analysis of studies validating generative AI models for diagnostic tasks published between June 2018 and June 2024. Analysis of 83 studies revealed an overall diagnostic accuracy of 52.1%. No significant performance difference was found between generative AI and physicians overall (p = 0.10) or between generative AI and non-expert physicians (p = 0.93). However, generative AI performed significantly worse than expert physicians (p = 0.007). Several models demonstrated slightly higher accuracy compared to non-expert physicians, although the differences were not significant. Generative AI demonstrates promising diagnostic capabilities, with accuracy varying by model. Although it has not yet achieved expert-level reliability, these findings suggest potential for enhancing healthcare delivery and medical education when implemented with an appropriate understanding of its limitations.
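For readers unfamiliar with how a single overall accuracy such as 52.1% is pooled across dozens of heterogeneous studies, below is a hedged sketch of a DerSimonian-Laird random-effects model on logit-transformed proportions, a common choice for meta-analysis of proportions; the study counts are invented, and the paper's exact method may differ.

```python
# Hedged sketch of pooling per-study accuracies with a DerSimonian-Laird
# random-effects model on logit-transformed proportions. The study data
# below are invented; this is not the paper's code.
import math

studies = [(41, 80), (55, 100), (30, 65)]  # (correct, total) per hypothetical study

logits, variances = [], []
for correct, total in studies:
    p = correct / total
    logits.append(math.log(p / (1 - p)))
    variances.append(1 / correct + 1 / (total - correct))  # variance of the logit

# Fixed-effect weights, then DerSimonian-Laird between-study variance tau^2.
w = [1 / v for v in variances]
mean_fe = sum(wi * yi for wi, yi in zip(w, logits)) / sum(w)
q = sum(wi * (yi - mean_fe) ** 2 for wi, yi in zip(w, logits))
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(studies) - 1)) / c)

# Random-effects pooled logit, back-transformed to a proportion.
w_re = [1 / (v + tau2) for v in variances]
pooled_logit = sum(wi * yi for wi, yi in zip(w_re, logits)) / sum(w_re)
pooled_accuracy = 1 / (1 + math.exp(-pooled_logit))
print(f"pooled accuracy = {pooled_accuracy:.1%}")
```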
Research Square (Research Square), Journal year: 2025, Issue: unknown. Published: April 15, 2025
Abstract
Large language models (LLMs) possess extensive medical knowledge and demonstrate impressive performance in answering diagnostic questions. However, responding to such questions differs significantly from actual clinical procedures. Real-world diagnostics involve a dynamic, iterative process that includes hypothesis refinement and targeted data collection. This complex task is both challenging and time-consuming, demanding a significant portion of the clinical workload and healthcare resources. Therefore, evaluating and enhancing LLM performance in real-world diagnostic procedures is crucial for their deployment. In this study, a framework was developed to assess LLMs' capability to complete clinical encounters, including taking a history, recommending physical examinations and tests, and reaching a diagnosis. A benchmark dataset of 4,421 cases was curated, covering rare and common diseases across 32 specialties. Clinical evaluation methods were used to comprehensively evaluate the performance of GPT-4o-mini, GPT-4o, Claude-3-Haiku, Qwen2.5-72b, Qwen2.5-34b, and Qwen2.5-14b. Although these models performed well on diagnostic questions, they consistently underperformed in complete clinical encounters and exhibited a number of errors. To address these challenges, ClinDiag-GPT was trained on over 8,000 clinical cases. It emulates physicians' diagnostic reasoning, collects information in line with clinical practice, and recommends key tests and definitive diagnoses. ClinDiag-GPT outperformed the other LLMs in both diagnostic accuracy and procedural performance. We further compared ClinDiag-GPT alone, ClinDiag-GPT in collaboration with physicians, and physicians alone. Collaboration between ClinDiag-GPT and physicians enhanced diagnostic accuracy and efficiency, demonstrating ClinDiag-GPT's potential as a valuable diagnostic assistant.
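A framework like the one described evaluates multi-turn behavior rather than single-shot answers. The sketch below shows the general shape of such an encounter loop, assuming an OpenAI-compatible chat API; the system prompt, case record, and stopping rule are invented for illustration and are not the study's protocol.

```python
# Minimal sketch of a multi-turn diagnostic encounter loop of the kind such a
# framework evaluates: the model iteratively requests information until it
# commits to a diagnosis. Assumes an OpenAI-compatible chat API; the prompts,
# case record, and stopping rule are invented.
from openai import OpenAI

client = OpenAI()
case = {"chief complaint": "progressive dyspnea over 3 months",
        "history": "smoker, 40 pack-years", "spirometry": "FEV1/FVC 0.55"}

messages = [{"role": "system", "content":
             "You are a diagnostician. Ask for ONE item of history, exam, or "
             "testing per turn, or reply 'FINAL: <diagnosis>' when confident."}]
messages.append({"role": "user",
                 "content": f"Chief complaint: {case['chief complaint']}"})

for _ in range(10):  # cap the encounter length
    reply = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    if reply.startswith("FINAL:"):
        print("Diagnosis:", reply.removeprefix("FINAL:").strip())
        break
    # Simulated patient/examiner: answer from the case record if available.
    answer = next((v for k, v in case.items() if k in reply.lower()),
                  "Not available.")
    messages.append({"role": "user", "content": answer})
```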
JAMA Network Open, Journal year: 2025, Issue: 8(4), P. e256359 - e256359. Published: April 22, 2025
Importance
Large language models (LLMs) are being implemented in health care. Enhanced accuracy and methods to maintain accuracy over time are needed to maximize LLM benefits.
Objective
To evaluate whether LLM performance on the US Medical Licensing Examination (USMLE) can be improved by including formally represented semantic clinical knowledge.
Design, Setting, and Participants
This comparative effectiveness research study was conducted between June 2024 and February 2025 at the Department of Biomedical Informatics, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, New York, using sample questions from USMLE Steps 1, 2, and 3.
Intervention
Semantic clinical artificial intelligence (SCAI) was developed to insert formally represented semantic clinical knowledge into LLMs using retrieval augmented generation (RAG).
Main Outcomes and Measures
The SCAI method was evaluated by comparing 3 Llama LLMs (13B, 70B, and 405B; Meta) with and without RAG-supplied text-based semantic clinical knowledge for answering the sample questions. Accuracy was determined by comparing LLM output with the answer key.
Results
The models were tested on 87 Step 1, 103 Step 2, and 123 Step 3 sample questions. The 13B model enhanced with RAG was associated with significantly improved performance on Step 1, but it met the 60% passing threshold only on Step 3 (74 correct answers [60.2%]). The 70B and 405B models enhanced with RAG passed all steps. The 70B model answered 80 Step 1 questions (92.0%), 82 Step 2 questions (79.6%), and 112 Step 3 questions (91.1%) correctly; the 405B model answered 79 Step 1 questions (90.8%), 87 Step 2 questions (84.5%), and 117 Step 3 questions (95.1%) correctly. Significant improvements with RAG were also found on Step 3, where the larger parameter models performed better than the 13B model, although the 405B model did not perform better than the 70B model.
Conclusions and Relevance
In this comparative effectiveness study, LLM scores on the USMLE improved with RAG-supplied semantic clinical knowledge, performing as well as or better than the models without augmentation. Formal, semantic forms of reasoning in LLMs, like human reasoning, have the potential to improve performance on important medical questions. Improving the accuracy of LLMs in health care with targeted, up-to-date clinical knowledge is an important step toward implementation and acceptance.
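As a hedged sketch of the retrieval augmented generation idea behind SCAI, the code below retrieves the most relevant knowledge statements for a question and prepends them to the prompt. TF-IDF retrieval stands in for the paper's formally represented semantic knowledge base, and the facts and question are invented.

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# relevant clinical knowledge statements for a question and prepend them to
# the prompt. TF-IDF retrieval is a stand-in for the paper's semantic
# knowledge base; the statements and question are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge = [
    "Beta-blockers reduce mortality after myocardial infarction.",
    "Metformin is first-line pharmacotherapy for type 2 diabetes.",
    "ACE inhibitors can cause hyperkalemia in renal impairment.",
]
question = "Which drug class improves survival following myocardial infarction?"

vectorizer = TfidfVectorizer().fit(knowledge + [question])
scores = cosine_similarity(vectorizer.transform([question]),
                           vectorizer.transform(knowledge))[0]
top_facts = [knowledge[i] for i in scores.argsort()[::-1][:2]]  # top-2 passages

prompt = ("Use the following facts when answering.\n"
          + "\n".join(f"- {fact}" for fact in top_facts)
          + f"\nQuestion: {question}\nAnswer:")
print(prompt)  # this augmented prompt would then be sent to the LLM
```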
Hepatology Communications, Journal year: 2025, Issue: 9(5). Published: April 30, 2025
Large language models like ChatGPT have demonstrated potential in medical image interpretation, but their efficacy in liver histopathological analysis remains largely unexplored. This study aims to assess ChatGPT-4-vision's diagnostic accuracy, compared with pathologists' performance, in evaluating fibrosis (stage) in metabolic dysfunction-associated steatohepatitis. Digitized Sirius Red-stained images for 59 steatohepatitis tissue biopsy specimens were evaluated by ChatGPT-4 and 4 pathologists using the NASH-CRN staging system. Fields of view at increasing magnification levels, either extracted by a senior pathologist or randomly selected, were shown to ChatGPT-4, asking it to assign the fibrosis staging. The accuracy of ChatGPT-4 was compared with the pathologists' evaluations and correlated with the collagen proportionate area for additional insights. All cases were further analyzed with an in-context learning approach, where the model learns from exemplary cases provided during prompting. ChatGPT-4's accuracy was 81% when fields of view were selected by the pathologist, while it decreased to 54% with randomly cropped fields of view. By employing in-context learning, accuracy increased to 88% and 77% for pathologist-selected and random fields of view, respectively. This method enabled ChatGPT-4 to fully correctly identify structures characteristic of F4 stages that were previously misclassified. The analysis also highlighted a moderate to strong correlation between ChatGPT-4's evaluations and the collagen proportionate area. ChatGPT-4 showed remarkable results, overlapping those of expert pathologists. In-context learning analysis, applied here for the first time to collagen deposition in histological samples, was crucial for accurately identifying key features of cirrhotic cases, critical for early therapeutic decision-making. These findings suggest promise for integrating large language models as supportive tools in pathology.
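The in-context learning setup can be sketched as a few-shot vision prompt: exemplar fields of view with known stages precede the query image. The code below assumes an OpenAI-compatible vision chat API; the file names, exemplar stages, and instruction text are invented, and this is not the authors' protocol.

```python
# Hedged sketch of few-shot in-context prompting for image staging: exemplar
# images with known fibrosis stages are included in the prompt before the
# query image. File names and labels are hypothetical.
import base64
from openai import OpenAI

def image_part(path: str) -> dict:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}

content = [{"type": "text", "text":
            "Stage liver fibrosis on Sirius Red-stained fields of view using "
            "the NASH-CRN system (F0-F4). Examples follow, then the query."}]
for path, stage in [("example_f1.png", "F1"), ("example_f4.png", "F4")]:
    content.append(image_part(path))                      # exemplar image
    content.append({"type": "text", "text": f"Stage: {stage}"})
content.append(image_part("query_case.png"))              # case to stage
content.append({"type": "text", "text": "Stage:"})

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": content}])
print(reply.choices[0].message.content)
```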