Recent advancements in large language models (LLMs) hold significant promise for improving physics education research that uses machine learning. In this study, we compare the application of various models for conducting a large-scale analysis of written text, grounded in a physics education research classification problem: identifying skills in students' typed lab notes through sentence-level labeling. Specifically, we use training data to fine-tune two different LLMs, BERT and LLaMA, and compare the performance of these models to both a traditional bag-of-words approach and a few-shot LLM approach (without fine-tuning). We evaluate the models based on their resource use, performance metrics, and research outcomes when identifying skills in students' lab notes. We find that higher-resource models often, but not necessarily, perform better than lower-resource models. We also find that all models report similar trends in research outcomes, although the absolute values of the estimated measurements are not always within uncertainties of each other. We use our results to discuss relevant considerations for researchers seeking to select a model type to use as a classifier.
Published by the American Physical Society 2025
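The sentence-level labeling task described above can be illustrated with a minimal bag-of-words sketch. This is not the study's actual pipeline or coding scheme: the sentences, skill labels, and word-overlap scoring below are invented placeholders, using only the Python standard library.

```python
# Minimal sketch of sentence-level bag-of-words classification.
# All sentences and skill labels here are invented placeholders,
# not the study's lab-note data or coding scheme.
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    """Lowercase, split on whitespace, and count word occurrences."""
    return Counter(sentence.lower().split())

def overlap(a: Counter, b: Counter) -> int:
    """Number of shared word occurrences between two bags (multiset intersection)."""
    return sum((a & b).values())

# Tiny labeled training set: one (sentence, skill label) pair per skill.
training = [
    ("we recorded the period three times for each length", "collect_data"),
    ("our value of g agrees with the accepted value within uncertainty", "compare_results"),
    ("next time we would reduce timing error with a photogate", "propose_iteration"),
]

def classify(sentence: str) -> str:
    """Label a sentence with the skill whose training sentence overlaps it most."""
    bag = bag_of_words(sentence)
    return max(training, key=lambda pair: overlap(bag, bag_of_words(pair[0])))[1]

print(classify("we repeated the timing measurement three times"))  # collect_data
```

A real bag-of-words classifier would fit a statistical model (e.g., logistic regression) over word-count vectors rather than raw overlap, but the representation, discarding word order and keeping only counts, is the same.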
npj Science of Learning, Journal Year: 2025, Volume and Issue: 10(1), Published: Feb. 25, 2025
Abstract
Recently, the option to use large language models as a middleware connecting various AI tools and other software has led to the development of so-called multimodal foundation models, which have the power to process spoken text, music, images, and videos. In this overview, we explain a new set of opportunities and challenges that arise from the integration of multimodal foundation models in education.
British Journal of Educational Technology, Journal Year: 2025, Volume and Issue: unknown, Published: Feb. 24, 2025
Effective educational measurement relies heavily on the curation of well-designed item pools. However, item calibration is time-consuming and costly, requiring a sufficient number of respondents to estimate the psychometric properties of items. In this study, we explore the potential of six different large language models (LLMs; GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro, and Cohere Command R Plus) to generate responses comparable to those of human respondents. Results indicate that some LLMs exhibit proficiency in College Algebra similar to or exceeding that of college students. We find that the LLMs used in this study have narrow ability distributions, limiting their ability to fully mimic the variability observed in human respondents, but an ensemble of LLMs can better approximate the broader ability distribution typical of human respondents. Utilizing item response theory, the item parameters calibrated from LLM responses show high correlations (e.g., >0.8 for GPT-3.5) with their human counterparts. Several augmentation strategies are evaluated for their relative performance, with resampling methods proving most effective, enhancing the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).
Practitioner notes

What is already known about this topic
- The collection of candidate test items is a common practice when designing an assessment tool.
- Large language models (LLMs) have been found to rival human abilities in a variety of subject areas, making them a low-cost option for testing the efficacy of test items.
- Data augmentation using AI has been an effective strategy to improve machine learning model performance.

What this paper adds
- This study provides the first analysis of open-source and proprietary LLMs as test respondents compared with humans.
- It finds that item parameters produced from LLM responses correlate highly with those produced by 50 undergraduate students.
- Using LLMs to augment human response data yields mixed results.

Implications for practice and/or policy
- The moderate performance of LLMs as respondents suggests they could provide assistance in curating quality items for low-stakes formative and summative assessments.
- The methodology offers a scalable way to evaluate vast amounts of generative AI-produced test items.
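The item calibration referenced in this entry rests on an item response theory model. As a hedged illustration, assuming the common two-parameter logistic (2PL) form (the abstract does not specify which IRT model was fit), the probability of a correct response is a logistic function of ability minus difficulty:

```python
# Minimal sketch of the two-parameter logistic (2PL) IRT model often used
# for item calibration. Parameter values below are illustrative only.
import math

def p_correct(theta: float, a: float, b: float) -> float:
    """Probability that a respondent with ability theta answers correctly
    an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A respondent of average ability (theta = 0) on an item of average
# difficulty (b = 0) has a 50% chance of answering correctly:
print(round(p_correct(theta=0.0, a=1.0, b=0.0), 2))  # 0.5
```

Calibration runs this model in reverse: given a matrix of respondent-by-item correct/incorrect responses (human or LLM-generated), it estimates each item's a and b, which are the parameters the study correlates across calibration conditions.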