Behavior Research Methods,
Journal Year:
2024,
Volume and Issue:
57(1)
Published: Dec. 18, 2024
Abstract
The study of large language models (LLMs) and LLM-powered chatbots has gained significant attention in recent years, with researchers treating LLMs as participants in psychological experiments. To facilitate this research, we developed an R package called “MacBehaviour” (https://github.com/xufengduan/MacBehaviour), which interacts with over 100 LLMs, including OpenAI's GPT family, the Claude family, Gemini, Llama, and other open-weight models. The package streamlines the processes of LLM behavioural experimentation by providing a comprehensive set of functions for experiment design, stimuli presentation, model behaviour manipulation, and logging of responses and token probabilities. With a few lines of code, researchers can seamlessly set up and conduct experiments, making LLM studies highly accessible. To validate the utility and effectiveness of “MacBehaviour”, we conducted three experiments on GPT-3.5 Turbo, Llama-2-7b-chat-hf, and Vicuna-1.5-13b, replicating the sound-gender association in LLMs. The results consistently demonstrated that these models exhibit human-like tendencies to infer gender from novel personal names based on their phonology, as previously shown by Cai et al. (2024). In conclusion, “MacBehaviour” is a user-friendly R package that simplifies and standardises the experimental process for machine behaviour studies, offering a valuable tool for the field.
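To give a sense of the kind of trial loop the package automates, the sketch below presents a stimulus to a chat model and logs the response along with token-level probabilities. It is a minimal Python illustration using the OpenAI client, not MacBehaviour's R interface; the model name and stimulus are placeholders.

```python
# Minimal Python illustration of an LLM trial loop (not MacBehaviour's R API):
# present a stimulus, then log the response and token-level log probabilities.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder stimulus in the spirit of the sound-gender experiments.
stimuli = ["Is the novel name 'Pelido' more likely male or female? Answer in one word."]

for trial, stimulus in enumerate(stimuli):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # one of the models validated in the paper
        messages=[{"role": "user", "content": stimulus}],
        max_tokens=5,
        logprobs=True,           # request token-level log probabilities
        top_logprobs=5,          # plus the 5 most likely alternatives per position
    )
    answer = response.choices[0].message.content
    token_logprobs = [(t.token, t.logprob) for t in response.choices[0].logprobs.content]
    print(trial, answer, token_logprobs)
```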
Cognitive Science,
Journal Year:
2025,
Volume and Issue:
49(5)
Published: May 1, 2025
Abstract
Humor is an essential aspect of human experience, yet surprisingly little is known about how we recognize and understand humorous utterances. Most theories of humor emphasize the role of incongruity detection and resolution (e.g., frame‐shifting), as well as cognitive capacities like Theory of Mind and pragmatic reasoning. In multiple preregistered experiments, we ask whether and to what extent exposure to purely linguistic input can account for the ability to understand one‐line jokes and identify their entailments. We find that GPT‐3, a large language model (LLM) trained on linguistic data only, exhibits above‐chance performance in tasks designed to test its ability to detect, appreciate, and comprehend jokes. In exploratory work, we also examine joke comprehension in several open‐source LLMs, such as Llama‐3 and Mixtral. Although all LLMs tested fall short of human performance, both humans and LLMs show a tendency to misclassify nonjokes with surprising endings as jokes. Results suggest that LLMs are remarkably adept at some tasks involving jokes, but also reveal key limitations of distributional approaches to meaning.
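The joke-detection task can be pictured as a simple forced-choice classification. The sketch below is our own illustration of such a probe, not the authors' materials or protocol; the model name and items are placeholders.

```python
# Hypothetical joke-detection probe (not the authors' materials or protocol).
from openai import OpenAI

client = OpenAI()

# Matched one-liners: a joke and a literal control with the same setup.
items = [
    ("I used to be a banker, but I lost interest.", "joke"),
    ("I used to be a banker, but I changed careers.", "non-joke"),
]

correct = 0
for text, label in items:
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the paper tested GPT-3 and open LLMs
        messages=[{
            "role": "user",
            "content": f"Is the following sentence a joke? Answer 'joke' or 'non-joke'.\n\n{text}",
        }],
        max_tokens=3,
        temperature=0,
    )
    prediction = reply.choices[0].message.content.strip().lower()
    correct += int(prediction.startswith("non") == label.startswith("non"))

print(f"accuracy: {correct / len(items):.2f}")
```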
Research Square (Research Square),
Journal Year:
2024,
Volume and Issue:
unknown
Published: April 3, 2024
Abstract
The potential of multimodal generative artificial intelligence (mAI) to replicate human grounded language understanding, including the pragmatic, context-rich aspects of communication, remains to be clarified. Humans are known to use salient features, such as visual cues, to facilitate the processing of upcoming words. Correspondingly, multimodal computational models can integrate visual and linguistic data using a cross-modal attention mechanism to assign next-word probabilities. To test whether these processes align, we tasked both human participants (N = 200) as well as several state-of-the-art multimodal models with evaluating the predictability of forthcoming words after viewing short audio-only or audio-visual clips of speech. During the task, each model's attention weights were recorded and human attention was indexed via eye tracking. Results show that predictability estimates from humans aligned more closely with scores generated by the multimodal models vs. their unimodal counterparts. Furthermore, this alignment doubled when the visual context facilitated predictions. In these cases, the models' attention patches and human eye tracking significantly overlapped. Our results indicate that improved modeling of naturalistic language processing in mAI does not merely depend on the training diet but is also driven by multimodality in combination with attention-based architectures. Humans and mAI alike leverage the predictive constraints of multimodal information by attending to relevant features in the input.
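For the unimodal baseline condition, next-word predictability can be read directly off a text-only language model. Below is a minimal sketch using GPT-2 via Hugging Face transformers; the study's multimodal models and attention-weight recording are not reproduced, and the context sentence is illustrative.

```python
# Unimodal baseline: next-word probabilities from a text-only LM.
# (The study's multimodal models and attention-weight logging are not shown.)
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

context = "She sliced the bread with a sharp"  # illustrative context
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token
probs = torch.softmax(logits, dim=-1)

# The five most predictable continuations and their probabilities.
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([idx.item()])!r}: {p.item():.3f}")
```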
medRxiv (Cold Spring Harbor Laboratory),
Journal Year:
2024,
Volume and Issue:
unknown
Published: Jan. 24, 2024
Withdrawal
Statement
The authors have withdrawn their manuscript owing to the need for additional internal review. Therefore, the authors do not wish this work to be cited as a reference for the project. If you have any questions, please contact the corresponding author.
Computational Linguistics,
Journal Year:
2024,
Volume and Issue:
unknown, P. 1291 - 1355
Published: July 30, 2024
Abstract
Large language models (LLMs) have garnered a great deal of attention for their exceptional generative performance on commonsense and reasoning tasks. In this work, we investigate LLMs' capabilities for generalization using a particularly challenging type of statement: generics. Generics express generalizations (e.g., birds can fly) but do so without explicit quantification. They are notable because they generalize over their instantiations (e.g., sparrows can fly) yet hold true even in the presence of exceptions (e.g., penguins do not). For humans, these generic generalizations play a fundamental role in cognition, concept acquisition, and intuitive reasoning. We investigate how LLMs respond to and reason about generics. To this end, we first propose a framework grounded in pragmatics to automatically generate both exceptions and instantiations – collectively, exemplars. We make use of focus—a pragmatic phenomenon that highlights meaning-bearing elements in a sentence—to capture the full range of interpretations of generics across different contexts of use. This allows us to derive precise logical definitions for exemplars and operationalize them to generate exemplars from LLMs. Using our system, we create a dataset of ∼370k exemplars across ∼17k generics and conduct a human validation of a sample of the generated data to arrive at the final dataset. Humans have a documented tendency to conflate universally quantified statements (e.g., all birds can fly) with generics. Therefore, we probe whether LLMs exhibit similar overgeneralization behavior in terms of quantification and property inheritance. We find that LLMs do show evidence of overgeneralization, although they sometimes struggle to reason about exceptions. Furthermore, LLMs may exhibit similar non-logical behavior to humans when considering property inheritance from generics.
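One way to picture the quantification probe: ask whether a model endorses the universally quantified reading of a generic despite known exceptions. The sketch below is a hypothetical illustration, not the authors' framework; the model name and prompt wording are assumptions.

```python
# Hypothetical overgeneralization probe (not the authors' framework):
# given a generic, ask whether the universally quantified reading holds.
from openai import OpenAI

client = OpenAI()

generic = "Birds can fly."
probe = (
    f"Consider the statement: '{generic}'\n"
    "Does it follow that ALL birds can fly, without exception? Answer yes or no."
)
reply = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": probe}],
    max_tokens=3,
    temperature=0,
)
# A "yes" here would be consistent with the overgeneralization the paper probes.
print(reply.choices[0].message.content)
```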
Proceedings of the National Academy of Sciences,
Journal Year:
2024,
Volume and Issue:
121(36)
Published: Aug. 26, 2024
Large volumes of liquid water transiently existed on the surface of Mars more than 3 billion years ago. Much of this water is hypothesized to have been sequestered in the subsurface or lost to space. We use rock physics models and Bayesian inversion ...
Transactions of the Association for Computational Linguistics,
Journal Year:
2024,
Volume and Issue:
12, P. 1755 - 1779
Published: Jan. 1, 2024
Abstract
Robust, faithful, and harm-free pronoun use for individuals is an important goal for language model development as their use increases, but prior work tends to study only one or two of these characteristics at a time. To measure progress towards the combined goal, we introduce the task of pronoun fidelity: given a context introducing a co-referring entity and pronoun, the task is to reuse the correct pronoun later. We present RUFF, a carefully designed dataset of over 5 million instances, to measure robust pronoun fidelity in English, and we evaluate 37 model variants from nine popular families, across architectures (encoder-only, decoder-only, and encoder-decoder) and scales (11M-70B parameters). When an individual is introduced with a pronoun, models can mostly faithfully reuse this pronoun in the next sentence, but they are significantly worse with she/her/her, singular they, and neopronouns. Moreover, models are easily distracted by non-adversarial sentences discussing other people; even one sentence with a distractor pronoun causes accuracy to drop by 34 percentage points on average. Our results show that pronoun fidelity is not robust, in a simple, naturalistic setting where humans achieve nearly 100% accuracy. We encourage researchers to bridge the gaps we find, and to carefully evaluate reasoning in settings where superficial repetition might inflate perceptions of model performance.
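Concretely, pronoun fidelity in a decoder-only model can be probed by comparing the probabilities the model assigns to candidate pronouns at the point of reuse. The sketch below is a minimal illustration with GPT-2, not the RUFF evaluation code; the context and candidate set are placeholders.

```python
# Minimal pronoun-fidelity probe (illustration only, not the RUFF codebase):
# compare the probabilities of candidate pronouns at the reuse position.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Placeholder context: the individual was introduced with "they".
context = "The accountant hummed a tune. They are a good singer, and I heard"
candidates = [" they", " she", " he"]  # each is a single GPT-2 token

inputs = tokenizer(context, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]
log_probs = torch.log_softmax(logits, dim=-1)

# A faithful model should rank " they" highest here.
for cand in candidates:
    tok_id = tokenizer.encode(cand)[0]
    print(f"{cand!r}: {log_probs[tok_id].item():.2f}")
```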
We present a corpus of 8400 Dutch sentence pairs, intended for the grammatical evaluation of language models. Each pair consists of a grammatical and a minimally different ungrammatical sentence. The corpus covers 84 paradigms, classified into 22 syntactic phenomena. Nine sentences from each paradigm are rated for acceptability by at least 30 participants each, and for the same 9 sentences reading times are recorded per word through self-paced reading. Ten sentence pairs per paradigm were created by hand, while the remaining ninety were generated semi-automatically with the help of ChatGPT. Here, we report on the construction of the dataset, the measured ratings and reading times, as well as the extent to which a variety of language models can be used to predict both ground-truth grammaticality and human ratings.
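A standard way to use such minimal pairs is to check whether a language model assigns a higher probability to the grammatical member of each pair. The sketch below shows this with summed token log-probabilities; the Dutch checkpoint named is an assumption, and the example pair is ours, not from the corpus.

```python
# Minimal-pair evaluation sketch: does the LM prefer the grammatical sentence?
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "GroNLP/gpt2-small-dutch"  # assumed Dutch checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

def sentence_logprob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token;
    # multiply back to get the total sentence log probability.
    return -out.loss.item() * (ids.shape[1] - 1)

good = "De kinderen spelen in de tuin."  # grammatical (our example, not corpus data)
bad = "De kinderen speelt in de tuin."   # agreement violation

print(sentence_logprob(good) > sentence_logprob(bad))  # expected: True
```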
Frontiers in Human Neuroscience,
Journal Year:
2024,
Volume and Issue:
18
Published: Sept. 30, 2024
Psycholinguistic literature has consistently shown that humans rely on a rich and organized understanding of event knowledge to predict the forthcoming linguistic input during online sentence comprehension. As comprehenders, we expect sentences to maintain coherence with the preceding context, making congruent sequences easier to process than incongruent ones. It is widely known that discourse relations between sentences (e.g., temporal, contingency, comparison) are generally made explicit through specific particles, such as connectives (e.g., and, but, because, after). However, some relations that are easily accessible to speakers, given their event knowledge, can also be left implicit. The goal of this paper is to investigate the importance of connectives in the prediction of events in human language processing and in pretrained language models, with a focus on concessives and contrastives, which signal to comprehenders that event-related predictions have to be reversed. Inspired by previous work, we built a comprehensive set of story stimuli in Italian and Mandarin Chinese that differ in the plausibility of the situation being described and in the presence or absence of a connective. We collected plausibility judgments and reading times from native speakers for the stimuli. Moreover, we correlated the results of the experiments with those of computational modeling, using Surprisal scores obtained via Transformer-based language models. The judgements were collected on a seven-point Likert scale and analyzed with cumulative link mixed models (CLMM), while reading times and model surprisal were analyzed with linear mixed-effects regression (LMER). We found that NLMs are sensitive to connectives, although they struggle to reproduce the expectation reversal due to a connective changing the described scenario; reading times were even less aligned with the models, showing no correlation with either the judgments or Surprisal.
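Surprisal is the negative log probability of each token given its preceding context. The sketch below shows how such scores are typically obtained from a causal Transformer; it uses English GPT-2 for illustration rather than the Italian and Mandarin models and stimuli of the study.

```python
# Per-token surprisal from a causal Transformer LM (illustration only).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

sentence = "The vase fell, but it did not break."  # illustrative stimulus
ids = tokenizer(sentence, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits
log_probs = torch.log_softmax(logits, dim=-1)

# Surprisal of token t is -log P(token_t | tokens_<t), in nats.
for t in range(1, ids.shape[1]):
    surprisal = -log_probs[0, t - 1, ids[0, t]].item()
    print(f"{tokenizer.decode([ids[0, t].item()])!r}: {surprisal:.2f}")
```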
Transactions of the Association for Computational Linguistics,
Journal Year:
2024,
Volume and Issue:
12, P. 1616 - 1647
Published: Jan. 1, 2024
Abstract
We introduce Holmes, a new benchmark designed to assess language models' (LMs') linguistic competence—their unconscious understanding of linguistic phenomena. Specifically, we use classifier-based probing to examine LMs' internal representations regarding distinct linguistic phenomena (e.g., part-of-speech tagging). As a result, we meet recent calls to disentangle LMs' linguistic competence from other cognitive abilities, such as following instructions in prompting-based evaluations. Composing Holmes, we review over 270 probing studies and include more than 200 datasets to assess syntax, morphology, semantics, reasoning, and discourse phenomena. Analyzing over 50 LMs reveals that, aligned with known trends, their linguistic competence correlates with model size. However, surprisingly, model architecture and instruction tuning also significantly influence performance, particularly in morphology and syntax. Finally, we propose FlashHolmes, a streamlined version of the benchmark that reduces the computation load while maintaining high-ranking precision.
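Classifier-based probing of the kind Holmes relies on trains a lightweight classifier on frozen model representations. The sketch below is our own minimal illustration with a toy part-of-speech task, not the Holmes codebase; Holmes instead draws on its 200+ curated datasets.

```python
# Classifier-based probing sketch (ours, not the Holmes codebase):
# train a linear probe for part of speech on frozen LM representations.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").eval()

# Toy word/POS data; Holmes instead uses 200+ curated probing datasets.
words = ["dog", "run", "blue", "quickly", "cat", "eat", "red", "slowly"]
tags = ["NOUN", "VERB", "ADJ", "ADV", "NOUN", "VERB", "ADJ", "ADV"]

def embed(word: str) -> torch.Tensor:
    ids = tokenizer(word, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids).last_hidden_state
    return out[0, 1]  # representation of the first wordpiece (after [CLS])

X = torch.stack([embed(w) for w in words]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, tags)
print(probe.score(X, tags))  # accuracy of the linear probe on the toy data
```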