Research on Intelligent Grading of Physics Problems Based on Large Language Models
Education Sciences,
Journal year: 2025
Issue: 15(2), pp. 116–116
Published: Jan. 21, 2025
The automation of educational and instructional assessment plays a crucial role in enhancing the quality of teaching management. In physics education, calculation problems with intricate problem-solving ideas pose challenges to intelligent grading tasks. This study explores automatic grading through the combination of large language models and prompt engineering. By comparing the performance of four prompting strategies (one-shot, few-shot, chain of thought, tree of thought) within two model frameworks, namely ERNIEBot-4-turbo and GPT-4o, the study finds that the thought-based strategies can better assess complex problems (N = 100, ACC ≥ 0.9, kappa > 0.8) and reduce the gap between different models. The research provides valuable insights for automated assessments in education.
Language: English
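The agreement metrics quoted in this abstract (ACC and kappa) are standard classification measures. Below is a minimal Python sketch, assuming hypothetical human- and model-assigned grades on a 0-5 point scale and an exact-match definition of accuracy; the data and rubric are placeholders, not the paper's:

from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical grades for N = 100 physics problems (0-5 point scale);
# placeholder data, not taken from the study.
human_grades = [3, 5, 0, 4, 2, 5, 1, 3, 4, 5] * 10
model_grades = [3, 5, 0, 4, 2, 4, 1, 3, 4, 5] * 10

acc = accuracy_score(human_grades, model_grades)       # exact-match rate
kappa = cohen_kappa_score(human_grades, model_grades)  # chance-corrected agreement

print(f"ACC = {acc:.2f}, kappa = {kappa:.2f}")  # paper's thresholds: ACC >= 0.9, kappa > 0.8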
A Comparative Study on the Question-Answering Proficiency of Artificial Intelligence Models in Bladder-Related Conditions: An Evaluation of Gemini and ChatGPT 4.o
Medical Records,
Journal year: 2025
Issue: 7(1), pp. 201–205
Published: Jan. 10, 2025
Aim: The rapid evolution of artificial intelligence (AI) has revolutionized medicine, with tools like ChatGPT and Google Gemini enhancing clinical decision-making. ChatGPT's advancements, particularly GPT-4, show promise in diagnostics and education. However, variability in accuracy and limitations in complex scenarios emphasize the need for further evaluation of these models in medical applications. This study aimed to assess the agreement between ChatGPT 4.o and Gemini AI in identifying bladder-related conditions, including neurogenic bladder, vesicoureteral reflux (VUR), and posterior urethral valve (PUV). Material and Method: The study, conducted in October 2024, compared the two AI models' answers on 51 questions about neurogenic bladder, VUR, and PUV. Questions, randomly selected from pediatric surgery and urology materials, were evaluated using accuracy metrics and statistical analysis, highlighting the models' performance and agreement. Results: The models demonstrated similar accuracy across the neurogenic bladder, VUR, and PUV questions, with true response rates of 66.7% and 68.6%, respectively, and no statistically significant differences (p>0.05). Combined accuracy across all topics was 67.6%. Strong inter-rater reliability (κ=0.87) highlights their agreement. Conclusion: Gemini and ChatGPT-4.o showed comparable results on key performance measures.
Language: English
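The reported lack of a significant difference (p>0.05) between the two models can be checked with a chi-square test on the correct/incorrect counts. In this sketch the counts are back-calculated assumptions (34/51 ≈ 66.7%, 35/51 ≈ 68.6%), not figures taken from the paper:

from scipy.stats import chi2_contingency

# Rows: models; columns: correct, incorrect (assumed counts over 51 questions).
table = [[34, 51 - 34],   # model A, true response rate ~66.7%
         [35, 51 - 35]]   # model B, true response rate ~68.6%

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")  # p > 0.05 -> no significant difference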
Enhancing ophthalmology students’ awareness of retinitis pigmentosa: assessing the efficacy of ChatGPT in AI-assisted teaching of rare diseases—a quasi-experimental study
Frontiers in Medicine,
Journal year: 2025
Issue: 12
Published: Mar. 18, 2025
Retinitis pigmentosa (RP) is a rare retinal dystrophy often underrepresented in ophthalmology education. Despite advancements in diagnostics and treatments like gene therapy, RP knowledge gaps persist. This study assesses the efficacy of AI-assisted teaching using ChatGPT compared to traditional methods in educating students about RP. A quasi-experimental study was conducted with 142 medical students randomly assigned to control (traditional review materials) and ChatGPT-assisted groups. Both groups attended a lecture on RP and completed pre- and post-tests. Statistical analyses compared learning outcomes, study times, and response accuracy. Both groups significantly improved their post-test scores (p < 0.001), but the ChatGPT group required less study time (24.29 ± 12.62 vs. 42.54 ± 20.43 min, p < 0.0001). The ChatGPT group also performed better on complex questions regarding advanced treatments, demonstrating AI's potential to deliver accurate and current information efficiently. AI-assisted teaching enhances the efficiency and comprehension of rare disease education; a hybrid educational model combining AI with traditional methods can address knowledge gaps, offering a promising approach for modern education.
Language: English
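The study-time difference reported above can be verified from the summary statistics alone using Welch's t-test. The sketch below assumes an even 71/71 split of the 142 students; the exact group sizes and the paper's test choice are assumptions here:

from scipy.stats import ttest_ind_from_stats

# Means and SDs as reported; group sizes assumed (142 students split evenly).
result = ttest_ind_from_stats(
    mean1=24.29, std1=12.62, nobs1=71,   # ChatGPT-assisted group
    mean2=42.54, std2=20.43, nobs2=71,   # traditional-materials group
    equal_var=False,                     # Welch's t-test (unequal variances)
)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.2e}")  # p < 0.0001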
Is artificial intelligence successful in the Turkish neurology board exam?
Neurological Research,
Journal year: 2025
Issue: unknown, pp. 1–4
Published: Mar. 20, 2025
Objectives: OpenAI declared that GPT-4 performed better in academic and certain specialty areas. Medical licensing exams assess the clinical competence of doctors. We aimed to investigate for the first time how ChatGPT will perform on the Turkish Neurology Proficiency Exam.
Language: English
Comparative Evaluation of Advanced AI Reasoning Models in Korean National Licensing Examination: OpenAI vs DeepSeek (Preprint)
Published: Mar. 27, 2025
Artificial intelligence (AI) has advanced in natural language processing and reasoning, with large language models (LLMs) increasingly assessed for medical education and licensing exams. Given the growing use of AI in examinations, evaluating their performance on non-Western, region-specific tests like the Korean Medical Licensing Examination (KMLE) is crucial for assessing real-world applicability. This study compared five LLMs—GPT-4o, o1, o3-mini (OpenAI), DeepSeek-V3, and DeepSeek-R1 (DeepSeek)—on the KMLE. A total of 150 multiple-choice questions from the 2024 KMLE were extracted and categorized into three domains: Local Health & Laws, Preventive Medicine, and Clinical Medicine. Graph-based questions were excluded. Each model completed independent runs via API, with accuracy scored against the official answers. Statistical differences were analyzed using ANOVA, and consistency was measured with Fleiss' kappa coefficient. o1 achieved the highest overall accuracy (94.3%), excelling in Clinical Medicine (97.5%) and Local Health & Laws (81.0%), while another model led in Preventive Medicine (92.6%). Despite domain-specific variations, all models surpassed the passing criteria. In terms of consistency, agreement across runs was high (97.1%), with DeepSeek-V3 reaching 97.5%. Performance declined in the Law domain, likely due to legal complexities and limited Korean-language training data. This is the first study to compare OpenAI and DeepSeek models on this exam; the models demonstrated strong performance, ranking within the top 10% of human candidates. While the OpenAI models were the most accurate, the DeepSeek models provided a cost-effective alternative. Future research should optimize LLMs for non-English exams and develop Korea-specific models to improve performance in weaker domains.
Language: English
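Run-to-run consistency as described above (Fleiss' kappa over independent API runs) can be computed by treating each run as a rater over the same question set. A minimal sketch with placeholder answers and an assumed three runs per model:

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows: questions; columns: independent runs; entries: chosen option (1-5).
# Placeholder for the 150 KMLE items; three runs per model is an assumption.
answers = np.array([
    [2, 2, 2],
    [4, 4, 5],
    [1, 1, 1],
    [3, 3, 3],
    [5, 5, 5],
])

table, _ = aggregate_raters(answers)  # counts per (question, option) pair
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")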
Comparative analysis of a standard (GPT-4o) and reasoning-enhanced (o1 pro) large language model on complex clinical questions from the Japanese orthopaedic board examination
Joe Hasei,
Ryuichi Nakahara,
Koichi Takeuchi
et al.
Journal of Orthopaedic Science,
Journal year: 2025
Issue: unknown
Published: Apr. 1, 2025
Language: English
A Brief Review on Benchmarking for Large Language Models Evaluation in Healthcare
Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery,
Journal year: 2025
Issue: 15(2)
Published: Apr. 9, 2025
ABSTRACT
This paper reviews benchmarking methods for evaluating large language models (LLMs) in healthcare settings. It highlights the importance of rigorous evaluation to ensure LLMs' safety, accuracy, and effectiveness in clinical applications. The review also discusses challenges in developing standardized benchmarks and metrics tailored to healthcare-specific tasks such as medical text generation, disease diagnosis, and patient management. Ethical considerations, including privacy, data security, and bias, are addressed, underscoring the need for multidisciplinary collaboration to establish robust frameworks that facilitate the reliable and ethical use of LLMs in healthcare. Evaluation of LLMs remains challenging due to the lack of comprehensive datasets. Key concerns include model bias and the need for better explainability, all of which impact the overall trustworthiness of LLMs in healthcare.
Language: English
Overview of the Lymphoma Information Extraction and Automatic Coding Evaluation Task in CHIP 2024
Communications in Computer and Information Science,
Journal year: 2025
Issue: unknown, pp. 75–84
Published: Jan. 1, 2025
Language: English
The Effectiveness of Local Fine-Tuned LLMs: Assessment of the Japanese National Examination for Pharmacists
Research Square,
Journal year: 2025
Issue: unknown
Published: Apr. 15, 2025
Abstract
Large Language Models (LLMs) offer great potential for applications in healthcare and pharmaceutical fields. While cloud-based implementations are commonly used, they present challenges related to privacy and cost. This study examined the performance of locally executable LLMs on the Japanese National Examination for Pharmacists (JNEP). Additionally, we explore the feasibility of creating specialized pharmacy models through fine-tuning with Low-Rank Adaptation (LoRA). Text-based questions from the 97th to 109th JNEP were utilized, comprising 2,421 questions for training and 165 for testing. Four distinct models were evaluated, including Microsoft phi-4 and the DeepSeek R1 Distill Qwen series. Baseline performance was initially assessed, followed by fine-tuning using LoRA on the training dataset. Model performance was evaluated based on the accuracy scores achieved on the test questions. In the baseline evaluation against the JNEP, accuracy ranged from 55.15% to 76.36%. Notably, the CyberAgent 32B model exceeded the passing threshold (approximately 61%). Following fine-tuning, the models exhibited an increase in accuracy to 60.61–66.06%. The results showed that locally executable LLMs are capable of handling knowledge tasks comparable to those in the national pharmacist examination. Moreover, we found that techniques like LoRA can significantly enhance model performance, demonstrating the feasibility of robust AI models specifically designed for pharmacological applications. These findings contribute to the understanding of implementing secure, high-performing LLM solutions tailored for pharmaceutical use.
Language: English
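The LoRA step described above can be set up with the Hugging Face peft library. A minimal sketch follows; the base checkpoint, rank, and target modules are illustrative assumptions, not the paper's exact configuration:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "microsoft/phi-4"  # one of the locally executable models evaluated
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Low-Rank Adaptation: small rank-r update matrices are trained on the
# attention projections while the base weights stay frozen.
config = LoraConfig(
    r=8,                                  # rank of the update matrices (assumed)
    lora_alpha=16,                        # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction of weights train
# Fine-tuning on the 2,421 JNEP training questions would then proceed with a
# standard causal-LM training loop (e.g., transformers.Trainer).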
NDDRF 2.0: An update and expansion of risk factor knowledge base for personalized prevention of neurodegenerative diseases
Alzheimer's & Dementia,
Journal year: 2025
Issue: 21(5)
Published: May 1, 2025
Abstract
INTRODUCTION
Neurodegenerative diseases (NDDs) are chronic diseases caused by brain neuron degeneration, requiring systematic integration of risk factors to address their heterogeneity. Established in 2021, the Knowledgebase of Risk Factors for Neurodegenerative Diseases (NDDRF) was the first knowledge base to consolidate NDD risk factors. NDDRF 2.0 expands its focus to modifiable, lifestyle-related risk factors, enhancing its utility for personalized prevention.
METHODS
Data from the past 4 years were comprehensively updated, while lifestyle risk factors were manually collected and filtered from 1975 to 2024. Each risk factor was embedded with International Classification of Diseases codes and clinical stage annotations, then re-standardized, classified, and annotated in accordance with the Unified Medical Language System Semantic Network.
RESULTS
NDDRF 2.0 encompasses 1971 risk factors classified under 151 subcategories across 20 NDDs, including 536 lifestyle factors covering six major categories, and is freely accessible at http://sysbio.org.cn/NDDRF/.
DISCUSSION
As a lifestyle-specific and holistic knowledge base, NDDRF 2.0 offers structured, deep phenotype information, enabling personalized prevention strategies and decision support.
Highlights
An enhanced knowledge base (Knowledgebase of Risk Factors for Neurodegenerative Diseases [NDDRF] 2.0) was built for neurodegenerative diseases (NDDs). It provides detailed categorization of risk factors and phenotypes to support targeted prevention. It is a knowledge-driven resource that facilitates risk assessment and proactive health management. It helps clinicians, researchers, and at-risk populations develop and implement effective prevention strategies. It can be used to build chatbots with large language models in the future.
Language: English