Clinics and Practice, Journal Year: 2024, Volume and Issue: 14(6), P. 2376 - 2384, Published: Nov. 5, 2024
In November 2022, OpenAI launched ChatGPT for public use through a free online platform. ChatGPT is an artificial intelligence (AI) chatbot trained on a broad dataset encompassing a wide range of topics, including medical literature. The usability of ChatGPT in the medical field and the quality of AI-generated responses are a widely discussed subject of current investigations. Patellofemoral pain is one of the most common conditions among young adults, often prompting patients to seek advice. This study examines ChatGPT as a source of information regarding patellofemoral surgery, hypothesizing that there will be differences in the evaluation of responses generated by ChatGPT between populations with different levels of expertise in patellofemoral disorders.
Research Square (Research Square), Journal Year: 2025, Volume and Issue: unknown, Published: Jan. 8, 2025
Abstract
Purpose: Large Language Models (LLMs), GPT in particular, have demonstrated near human-level performance in the medical domain, from summarizing clinical notes and passing medical licensing examinations to predictive tasks such as disease diagnoses and treatment recommendations. However, there is currently little research on their efficacy for medical coding, a pivotal component of health informatics, clinical trials, and reimbursement management. This study proposes a prompt framework and investigates its effectiveness in medical coding tasks.
Methods: First, a prompt framework is proposed. It aims to improve the accuracy of complex coding tasks by leveraging state-of-the-art (SOTA) prompt techniques, including meta prompt, multi-shot learning, and dynamic in-context learning, to extract task-specific knowledge (the in-context step is sketched after this abstract). The framework is implemented with a combination of the commercial GPT-4o and an open-source LLM. It is then evaluated on three different coding tasks. Finally, ablation studies are presented to validate and analyze the contribution of each module in the proposed framework.
Results: On the MIMIC-IV dataset, the prediction accuracy is 68.1% over the 30 most frequent MS-DRG codes. This result is comparable to the SOTA of 69.4% achieved by fine-tuning a LLaMA model, and the best top-5 accuracy of our method is 90.0%. On the clinical trial criteria task, the framework achieves a macro F1 score of 68.4 on the CHIP-CTC test dataset in Chinese, close to the 70.9 of a supervised model training method used for comparison. For the less semantic CHIP-STS task, the framework scores 79.7, which is not competitive with supervised methods.
Conclusion: This study demonstrates that, for complex medical coding tasks, carefully designed prompt-based methods can achieve performance similar to that of supervised approaches. Currently, such a framework can be very helpful as an assistant, but it does not replace human specialists. With the rapid advancement of LLMs, their potential to reliably automate medical coding in the future cannot be underestimated.
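The abstract names the framework's modules but not their implementation. Below is a minimal sketch of the dynamic in-context learning step under stated assumptions: a sentence-transformers encoder, a small hand-labeled example pool, and the prompt wording are all hypothetical, not taken from the paper.

```python
# Sketch of dynamic in-context learning for DRG coding: retrieve the labeled
# examples most similar to the incoming clinical note and format them as
# multi-shot demonstrations. Model name and example pool are assumptions.
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical labeled pool: (clinical note summary, MS-DRG code).
EXAMPLES = [
    ("Admitted with acute STEMI, PCI with drug-eluting stent ...", "247"),
    ("Sepsis secondary to UTI, treated with IV antibiotics ...", "872"),
    # ... more labeled examples
]
example_vecs = encoder.encode([text for text, _ in EXAMPLES])

def build_prompt(note: str, k: int = 3) -> str:
    """Select the k nearest labeled notes and build a multi-shot prompt."""
    q = encoder.encode([note])[0]
    sims = example_vecs @ q / (
        np.linalg.norm(example_vecs, axis=1) * np.linalg.norm(q))
    shots = [EXAMPLES[i] for i in np.argsort(-sims)[:k]]
    demos = "\n\n".join(f"Note: {t}\nMS-DRG: {c}" for t, c in shots)
    return ("You are a hospital coding assistant. Assign the MS-DRG code.\n\n"
            f"{demos}\n\nNote: {note}\nMS-DRG:")
```

The returned string would then be sent to GPT-4o or an open-source LLM; the meta-prompt wording above is illustrative only.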
Turk Kardiyoloji Dernegi Arsivi-Archives of the Turkish Society of Cardiology, Journal Year: 2025, Volume and Issue: unknown, P. 35 - 43, Published: Jan. 1, 2025
Objective: Coronary artery disease (CAD) is the leading cause of morbidity and mortality globally. The growing interest in natural language processing chatbots (NLPCs) has driven their inevitable widespread adoption in healthcare. The purpose of this study was to evaluate the accuracy and reproducibility of responses provided by NLPCs, such as ChatGPT, Gemini, and Bing, to frequently asked questions about CAD.
Methods: Fifty frequently asked questions about CAD were asked twice, with a one-week interval, on ChatGPT, Gemini, and Bing. Two cardiologists independently scored the answers into four categories: comprehensive/correct (1), incomplete/partially correct (2), a mix of accurate and inaccurate/misleading (3), and completely inaccurate/irrelevant (4). The reproducibility of each NLPC's responses was assessed (see the sketch after this abstract).
Results: ChatGPT's responses were 14% incomplete/partially correct and 86% comprehensive/correct. In contrast, Gemini gave 68% comprehensive/correct responses, 30% incomplete/partially correct, and 2% with a mix of accurate and inaccurate/misleading information. Bing delivered 60% comprehensive/correct responses, 26% incomplete/partially correct, and 8% with a mix of accurate and inaccurate/misleading information. Reproducibility scores were 88% for ChatGPT, 84% for Gemini, and 70% for Bing.
Conclusion: ChatGPT demonstrates significant potential to improve patient education on coronary artery disease by providing more sensitive responses compared to Bing and Gemini.
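The abstract does not spell out how the reproducibility score was computed. One plausible reading, sketched below with illustrative data rather than the study's, is the proportion of questions that received the same category in both rounds.

```python
# Reproducibility as round-to-round category agreement (an assumption,
# not the paper's stated definition). Data values are illustrative.
def reproducibility(round1: list[int], round2: list[int]) -> float:
    """Fraction of questions scored in the same 1-4 category both times."""
    assert len(round1) == len(round2)
    same = sum(a == b for a, b in zip(round1, round2))
    return same / len(round1)

chatgpt_week1 = [1, 1, 2, 1, 1]   # category per question, week 1
chatgpt_week2 = [1, 1, 2, 1, 2]   # same questions, one week later
print(f"reproducibility: {reproducibility(chatgpt_week1, chatgpt_week2):.0%}")
```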
BACKGROUND
Stroke is a leading cause of disability and death worldwide, with home-based rehabilitation playing a crucial role in improving patient prognosis and quality of life. Traditional health education models often fall short in terms of precision, personalization, and accessibility. In contrast, large language models (LLMs) are gaining attention for their potential in medical health education, owing to their advanced natural language processing capabilities. However, the effectiveness of LLMs in home-based stroke rehabilitation education remains uncertain.
OBJECTIVE
This study evaluates four LLMs (ChatGPT-4, MedGo, Qwen, and ERNIE Bot) in home-based stroke rehabilitation education. The aim is to offer patients more precise and secure rehabilitation pathways while exploring the feasibility of using LLMs to guide health education.
METHODS
In the first phase of this study, a literature review and expert interviews identified 15 common questions and 2 clinical cases relevant to home-based stroke rehabilitation. These were input into the four LLMs in simulated consultations. Six experts (2 clinicians, 2 nursing specialists, 2 therapists) evaluated the LLM-generated responses on a 5-point Likert scale, assessing accuracy, completeness, readability, safety, and humanity. In the second phase, the top two performing LLMs from phase one were selected and thirty patients undergoing stroke rehabilitation were recruited. Each patient asked both LLMs 3 questions and rated their satisfaction; responses were also assessed for text length and recommended reading age using a Chinese readability analysis tool. Data were analyzed using one-way ANOVA, post hoc Tukey HSD tests, and paired t-tests (see the sketch after this abstract).
RESULTS
The results revealed significant differences across five dimensions: accuracy (P = .002), completeness (P < .001), readability (P = .04), safety (P = .007), and humanity (P < .001). ChatGPT-4 outperformed the other LLMs in each dimension, with scores for accuracy (M 4.28, SD 0.84), completeness (M 4.35, SD 0.75), readability (SD 0.85), safety (M 4.38, SD 0.81), and user-friendliness (M 4.65, SD 0.66). MedGo also excelled, with scores of (M 4.06, SD 0.78) and (SD 0.74) on its strongest dimensions. Qwen and ERNIE Bot scored significantly lower across dimensions compared with ChatGPT-4 and MedGo. ChatGPT-4 generated the longest texts (M 1338.35, SD 236.03) and had the highest recommended reading age (M 12.88). ChatGPT-4 performed best overall, while MedGo provided the clearest responses.
CONCLUSIONS
ChatGPT-4 and MedGo have shown strong performance in home-based stroke rehabilitation education, demonstrating potential for real-world applications. However, further improvements are needed in professionalism and oversight.
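As a companion to the reported statistics, here is a minimal sketch of the stated analysis pipeline (one-way ANOVA followed by post hoc Tukey HSD) using scipy and statsmodels; the ratings are illustrative, not the study's data.

```python
# One-way ANOVA across the four models' expert ratings, then Tukey HSD.
# Rating values below are illustrative placeholders.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

ratings = {                       # 5-point accuracy ratings per model
    "ChatGPT-4": [4, 5, 4, 5, 4, 4],
    "MedGo":     [4, 4, 4, 5, 4, 3],
    "Qwen":      [3, 4, 3, 3, 4, 3],
    "ERNIE Bot": [3, 3, 4, 3, 3, 3],
}

f_stat, p_val = stats.f_oneway(*ratings.values())
print(f"one-way ANOVA: F={f_stat:.2f}, P={p_val:.3f}")

# Flatten to (score, group) pairs for the post hoc pairwise comparison.
scores = np.concatenate(list(ratings.values()))
groups = np.repeat(list(ratings.keys()), [len(v) for v in ratings.values()])
print(pairwise_tukeyhsd(scores, groups))
```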
npj Digital Medicine, Journal Year: 2025, Volume and Issue: 8(1), Published: March 25, 2025
Abstract
Symptom-Assessment Applications (SAAs, e.g., NHS 111 online) that assist laypeople in deciding if and where to seek care (self-triage) are gaining popularity, and Large Language Models (LLMs) are increasingly used for self-triage too. However, there is no evidence synthesis on the self-triage accuracy of LLMs, and no review has contextualized the accuracy of SAAs and LLMs. This systematic review evaluates both SAAs and LLMs and compares them with laypeople. A total of 1549 studies were screened and 19 were included. The accuracy of SAAs was moderate but highly variable (11.5–90.0%), while the accuracy of LLMs (57.8–76.0%) and laypeople (47.3–62.4%) was moderate with low variability. Based on the available evidence, the use of SAAs or LLMs should neither be universally recommended nor discouraged; rather, we suggest that their utility be assessed based on the specific use case and user group under consideration.
Health Data Science, Journal Year: 2025, Volume and Issue: 5, Published: Jan. 1, 2025
Background: Multimodal large language models (LLMs) have shown potential in various health-related fields. However, many healthcare studies have raised concerns about the reliability and biases of LLMs in healthcare applications.
Methods: To explore the practical application of multimodal LLMs in skin disease identification, and to evaluate sex and age biases, we tested the performance of 2 popular LLMs, ChatGPT-4 and LLaVA-1.6, across diverse demographic groups (a subgroup check is sketched after this abstract) using a subset of a dermatoscopic dataset containing around 10,000 images of 3 skin diseases (melanoma, melanocytic nevi, and benign keratosis-like lesions).
Results: In comparison to 3 deep learning models based on convolutional neural networks (CNNs) (VGG16, ResNet50, Model Derm) and one vision transformer model (Swin-B), we found that LLaVA-1.6 and ChatGPT-4 demonstrated overall accuracies that were 3% and 23% higher (and F1-scores 4% and 34% higher), respectively, than the best performing CNN-based baseline, while remaining 38% and 26% lower in accuracy (and 19% lower in F1-score) than Swin-B. Meanwhile, LLM performance is generally unbiased in identifying these diseases across sex and age groups, in contrast to Swin-B, which is biased in identifying melanocytic nevi.
Conclusions: This study suggests the usefulness and fairness of multimodal LLMs in dermatological applications, aiding physicians and practitioners with diagnostic recommendations and patient screening. To further verify the reliability of LLMs in healthcare, experiments on larger and more diverse datasets need to be performed in the future.
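A minimal sketch of the kind of subgroup bias check the study describes: per-group accuracy for each model, with the max-min gap as a crude bias signal. The column names and file are assumptions, not taken from the paper.

```python
# Per-group accuracy as a simple fairness probe. Assumed columns:
# 'sex', 'age_group', 'label', plus one prediction column per model.
import pandas as pd

def subgroup_accuracy(df: pd.DataFrame, pred_col: str,
                      group_col: str) -> pd.Series:
    """Accuracy of one model within each demographic group."""
    return (df[pred_col] == df["label"]).groupby(df[group_col]).mean()

# df = pd.read_csv("dermatoscopy_predictions.csv")   # hypothetical file
# acc_by_sex = subgroup_accuracy(df, "chatgpt4_pred", "sex")
# print(acc_by_sex)
# print("accuracy gap:", acc_by_sex.max() - acc_by_sex.min())
```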
Journal of Mid-life Health, Journal Year: 2025, Volume and Issue: 16(1), P. 45 - 50, Published: Jan. 1, 2025
ABSTRACT
Background: The use of large language model (LLM) chatbots for health-related queries is growing due to their convenience and accessibility. However, concerns about the accuracy and readability of the information they provide persist. Many individuals, including patients and healthy adults, may rely on chatbots for midlife health information instead of consulting a doctor. In this context, we evaluated responses from six LLM chatbots to questions on midlife health in men and women.
Methods: Twenty questions were asked to six different LLM chatbots: ChatGPT, Claude, Copilot, Gemini, Meta artificial intelligence (AI), and Perplexity. Each chatbot's responses were collected and scored for accuracy, relevancy, fluency, and coherence by three independent expert physicians. An overall score was also calculated by taking the average of the four criteria. In addition, the responses were analyzed using the Flesch-Kincaid Grade Level (sketched after this abstract) to determine how easily they could be understood by the general population.
Results: In terms of accuracy, Perplexity scored highest (4.3 ± 1.78), followed by Meta AI (4.26 ± 0.16); for relevancy, Meta AI scored highest (4.35 ± 0.24). Overall, Perplexity scored highest (4.28), followed by ChatGPT (4.22 ± 0.21), whereas Copilot had the lowest overall score (3.72 ± 0.19) (P < 0.0001). The most readable chatbot showed a readability score of 41.24 ± 10.57 and a grade level of 11.11 ± 1.93, meaning its text was the easiest to read and required a lower level of education.
Conclusion: LLM chatbots can answer midlife health questions with variable capabilities. We found Perplexity to be the highest-scoring chatbot for addressing men's and women's midlife health questions, as it offers highly accurate and accessible information. Hence, chatbots can be used as educational tools, with the appropriate chatbot selected according to its capability.
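The Flesch-Kincaid Grade Level used in the study is a fixed formula over word, sentence, and syllable counts. A minimal sketch follows; the syllable heuristic is approximate, and libraries such as textstat implement more careful versions.

```python
# Flesch-Kincaid Grade Level:
#   FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade_level(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / sentences
            + 11.8 * syllables / len(words) - 15.59)

sample = ("Midlife health benefits from regular exercise. "
          "It also improves with balanced nutrition.")
print(round(fk_grade_level(sample), 2))
```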
Nursing Reports, Journal Year: 2025, Volume and Issue: 15(4), P. 130 - 130, Published: April 14, 2025
Background/Objectives: The advent of large language models (LLMs), available through platforms such as ChatGPT and capable of generating quick, interactive answers to complex questions, opens the way for new approaches to training healthcare professionals, enabling them to acquire up-to-date and specialised information easily. In nursing, LLMs have proven useful in supporting clinical decision making, continuing education, the development of care plans and the management of clinical cases, as well as in writing academic reports and scientific articles. Furthermore, their ability to provide rapid access to information can improve the quality of care and promote evidence-based practice. However, their applicability in clinical practice requires thorough evaluation. This study evaluated the accuracy, completeness and safety of responses generated by ChatGPT-4 on pressure injuries (PIs) in infants.
Methods: In January 2025, we analysed responses to 60 queries, subdivided into 12 main topics, on PIs in infants. The questions were developed through consultation of authoritative documents and selected based on their relevance to nursing and their clinical potential. A panel of five experts, using a 5-point Likert scale, assessed the responses generated by ChatGPT.
Results: Overall, over 90% of the responses generated by ChatGPT-4o received relatively high ratings on the three criteria, with 4 as the most frequent value. However, when analysing the topics individually, we observed that Medical Device Management and Technological Innovation received the lowest accuracy scores. At the same time, Scientific Evidence also had among the lowest scores. No response was rated completely incorrect.
Conclusions: ChatGPT-4 has shown a good level of accuracy, completeness and safety in addressing questions about PIs in infants; however, ongoing updates and the integration of high-quality sources are essential for ensuring its reliability as a decision-support tool.
Research Square (Research Square), Journal Year: 2025, Volume and Issue: unknown, Published: April 16, 2025
Abstract
Background: Large Language Models (LLMs) are one of the artificial intelligence (AI) technologies used to understand and generate text, summarize information, and comprehend contextual cues. LLMs have been increasingly used by researchers in various medical applications, but their effectiveness and limitations are still uncertain, especially across medical specialties.
Objective: This review evaluates the recent literature on how LLMs are utilized in research studies across 19 medical specialties. It also explores the challenges involved and suggests areas for future focus.
Methods: Two reviewers performed searches in PubMed, Web of Science and Scopus to identify studies published from January 2021 to March 2024. The included studies reported the usage of LLMs in performing medical tasks. Data was extracted and analyzed by five reviewers. To assess the risk of bias, a quality assessment was performed using the revised quality assessment tool for artificial intelligence-centered diagnostic accuracy studies (QUADAS-AI).
Results: Results were synthesized through categorical analysis of evaluation metrics, impact types, and validation approaches. A total of 84 studies were included in this review, mainly originating from two countries: the USA (35/84) and China (16/84). Although the reviewed LLM applications spread across specialties, multi-specialty applications were demonstrated in 22 studies. The various study aims include clinical natural language processing (31/84), supporting clinical decision making (20/84), medical education (15/84), disease diagnoses (15/84), and patient management and engagement (3/84). GPT-based and BERT-based models were the most used (83/84). Despite reported positive impacts such as improved efficiency and accuracy, challenges related to reliability and ethics remain. The overall risk of bias was low in 72 studies, high in 11, and not clear in 3.
Conclusion: GPT-based models dominate medical specialty applications, with over 98.8% of the reviewed studies using these models. Despite the potential benefits in process efficiency and diagnostics, a key finding regards the substantial variability in performance among LLMs. For instance, LLMs' accuracy ranged from 3% in some decision support tasks to 90% in some NLP tasks. Heterogeneity in LLM utilization across diverse tasks and contexts prevented a meaningful meta-analysis, as studies lacked standardized methodologies, outcome measures, and implementation approaches. Therefore, wide room for improvement remains in developing domain-specific models and data and in establishing standards to ensure reliability and effectiveness.