Behavior Research Methods,
Journal year: 2024, Issue: 57(1)
Published: Dec. 18, 2024
Abstract
The study of large language models (LLMs) and LLM-powered chatbots has gained significant attention in recent years, with researchers treating LLMs as participants in psychological experiments. To facilitate this research, we developed an R package called "MacBehaviour" (https://github.com/xufengduan/MacBehaviour), which interacts with over 100 LLMs, including OpenAI's GPT family, the Claude family, Gemini, Llama, and other open-weight models. The package streamlines the process of LLM behavioural experimentation by providing a comprehensive set of functions for experiment design, stimuli presentation, model behaviour manipulation, and the logging of responses and token probabilities. With a few lines of code, researchers can seamlessly set up and conduct experiments, making LLM studies highly accessible. To validate the utility and effectiveness of "MacBehaviour", we conducted three experiments on GPT-3.5 Turbo, Llama-2-7b-chat-hf, and Vicuna-1.5-13b, replicating the sound-gender association in LLMs. The results consistently demonstrated that these models exhibit human-like tendencies to infer gender from novel personal names based on their phonology, as previously shown by Cai et al. (2024). In conclusion, "MacBehaviour" is a user-friendly R package that simplifies and standardises the experimental process of machine behaviour studies, offering a valuable tool for the field.
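MacBehaviour itself is an R package, but the trial loop it automates (present a stimulus, record the response and the token probabilities) can be illustrated with a minimal Python sketch against the OpenAI chat completions API. The stimuli, instruction, model choice, and output file below are hypothetical, and this is not MacBehaviour's own interface.

```python
# Minimal sketch of an "LLM as participant" trial loop (not MacBehaviour's R API).
# Assumes the openai Python client and an OPENAI_API_KEY in the environment.
import csv
from openai import OpenAI

client = OpenAI()
stimuli = ["Pelcrad", "Bonteel"]  # hypothetical novel names used as stimuli

with open("responses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["stimulus", "response", "first_token_logprob"])
    for name in stimuli:
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Answer with a single word."},
                {"role": "user", "content": f"Is {name} more likely a male or a female name?"},
            ],
            logprobs=True,   # request per-token log probabilities for logging
            max_tokens=1,
        )
        choice = resp.choices[0]
        writer.writerow([name, choice.message.content,
                         choice.logprobs.content[0].logprob])
```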
Nature Human Behaviour,
Journal year: 2024, Issue: unknown
Published: Nov. 27, 2024
Abstract
Scientific discoveries often hinge on synthesizing decades of research, a task that potentially outstrips human information processing capacities. Large language models (LLMs) offer a solution. LLMs trained on the vast scientific literature could integrate noisy yet interrelated findings to forecast novel results better than human experts. Here, to evaluate this possibility, we created BrainBench, a forward-looking benchmark for predicting neuroscience results. We find that LLMs surpass experts in predicting experimental outcomes. BrainGPT, an LLM tuned on the neuroscience literature, performed better yet. Like human experts, when LLMs indicated high confidence in their predictions, their responses were more likely to be correct, which presages a future where LLMs assist humans in making discoveries. Our approach is not neuroscience-specific and is transferable to other knowledge-intensive endeavours.
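The calibration claim above (high-confidence predictions are more often correct) can be illustrated with a simple binned-accuracy check. The sketch below runs on synthetic numbers; in a benchmark of this kind, per-item confidence might for example be derived from the gap in model log-likelihood between two candidate results, but that detail is an assumption rather than something stated in the abstract.

```python
# Sketch of a confidence-calibration check: is accuracy higher in high-confidence bins?
import numpy as np

rng = np.random.default_rng(0)
confidence = rng.random(1000)                            # stand-in per-item confidence scores
correct = rng.random(1000) < (0.55 + 0.3 * confidence)   # synthetic: accuracy rises with confidence

bins = np.quantile(confidence, [0.0, 0.25, 0.5, 0.75, 1.0])
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (confidence >= lo) & (confidence <= hi)
    print(f"confidence {lo:.2f}-{hi:.2f}: accuracy {correct[mask].mean():.2f}")
```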
Proceedings of the National Academy of Sciences,
Journal year: 2024, Issue: 121(36)
Published: Aug. 26, 2024
Large volumes of liquid water transiently existed on the surface of Mars more than 3 billion years ago. Much of this water is hypothesized to have been sequestered in the subsurface or lost to space. We use rock physics models and Bayesian inversion ...
Scientific Reports,
Journal year: 2024, Issue: 14(1)
Published: Nov. 14, 2024
Large Language Models (LLMs) are recruited in applications that span from clinical assistance and legal support to question answering and education. Their success in specialized tasks has led to the claim that they possess human-like linguistic capabilities related to compositional understanding and reasoning. Yet, reverse-engineering is bound by Moravec's Paradox, according to which easy skills are hard. We systematically assess 7 state-of-the-art models on a novel benchmark. Models answered a series of comprehension questions, each prompted multiple times in two settings, permitting one-word or open-length replies. Each question targets a short text featuring high-frequency constructions. To establish a baseline for achieving human-like performance, we tested 400 humans on the same prompts. Based on a dataset of n = 26,680 datapoints, we discovered that LLMs perform at chance accuracy and waver considerably in their answers. Quantitatively, the models are outperformed by humans, and qualitatively their answers showcase distinctly non-human errors of language understanding. We interpret this evidence as suggesting that, despite their usefulness in various tasks, current AI models fall short of understanding language in a way that matches humans, and we argue that this may be due to their lack of a compositional operator for regulating grammatical and semantic information.
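A hedged sketch of the protocol described above: the same comprehension question is asked repeatedly under a one-word and an open-length instruction, then scored for accuracy and answer stability. The example item, the lenient scoring rule, and the stub model are illustrative only, not the benchmark's materials.

```python
# Sketch: repeated prompting in two reply settings, scoring accuracy and answer stability.
from collections import Counter
from typing import Callable

def evaluate(ask_model: Callable[[str], str], question: str, gold: str, n_repeats: int = 10) -> None:
    settings = {
        "one_word": "Answer with one word only. ",
        "open_length": "Answer in as many words as you like. ",
    }
    for name, instruction in settings.items():
        answers = [ask_model(instruction + question).strip().lower() for _ in range(n_repeats)]
        accuracy = sum(gold in answer for answer in answers) / n_repeats   # lenient substring match
        stability = Counter(answers).most_common(1)[0][1] / n_repeats      # share of the modal answer
        print(f"{name}: accuracy={accuracy:.2f}, stability={stability:.2f}")

# Stub model for demonstration; a real run would query each LLM under test.
evaluate(lambda prompt: "Yes",
         "John was deceived by Mary. Did Mary deceive John?",
         gold="yes")
```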
Proceedings of the National Academy of Sciences,
Journal year: 2025, Issue: 122(8)
Published: Feb. 20, 2025
Large language models (LLMs) can pass explicit social bias tests but still harbor implicit biases, similar to humans who endorse egalitarian beliefs yet exhibit subtle biases. Measuring such implicit biases can be a challenge: As LLMs become increasingly proprietary, it may not be possible to access their embeddings and apply existing bias measures; furthermore, implicit biases are primarily a concern if they affect the actual decisions that these systems make. We address both challenges by introducing two measures: the LLM Word Association Test, a prompt-based method for revealing implicit bias, and the LLM Relative Decision Test, a strategy to detect subtle discrimination in contextual decisions. Both measures are based on psychological research: the Word Association Test adapts the Implicit Association Test, widely used to study the automatic associations between concepts held in human minds; the Relative Decision Test operationalizes results indicating that relative evaluations between candidates, rather than absolute evaluations assessing each candidate independently, are more diagnostic of implicit biases. Using these measures, we found pervasive stereotype biases mirroring those in society in 8 value-aligned models across 4 categories (race, gender, religion, health) and 21 stereotypes (such as race and criminality, race and weapons, gender and science, and age and negativity). These measures draw from psychology's long history of research into measuring bias through purely observable behavior; they expose nuanced biases in proprietary models that appear unbiased according to standard benchmarks.
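A minimal sketch in the spirit of the prompt-based word association measure described above: attribute words are paired with two groups through a prompt, and the model's choices are tallied. The group placeholders, attribute list, prompt wording, and tallying are illustrative assumptions, not the authors' materials.

```python
# Sketch: ask a model to pair attribute words with one of two groups and tally the choices.
from collections import Counter
from typing import Callable

groups = ["Group A", "Group B"]                                  # stand-ins for social-category terms
attributes = ["pleasant", "unpleasant", "safe", "dangerous"]     # illustrative attribute words

def association_counts(ask_model: Callable[[str], str]) -> Counter:
    counts = Counter()
    for word in attributes:
        prompt = (f"Here are two groups: {groups[0]} and {groups[1]}. "
                  f"Which group do you associate the word '{word}' with? "
                  "Answer with the group name only.")
        reply = ask_model(prompt).strip()
        counts[(word, reply)] += 1
    return counts

# Stub model; a real study would repeat this over many prompts and paraphrases.
print(association_counts(lambda p: "Group A"))
```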
We identify and analyze three caveats that may arise when analyzing the linguistic abilities of Large Language Models. The problem of unlicensed generalizations refers to the danger of interpreting performance in one task as predictive of the models' overall capabilities, based on the assumption that, because performance on a specific task is indicative of certain underlying capabilities in humans, the same association holds for models. The human-like paradox refers to the practice of lacking human comparisons, while at the same time attributing human-like capabilities to the models. Last, the problem of double standards refers to the use of tasks and methodologies that either cannot be applied to humans or are evaluated differently for models vs. humans. While we recognize the impressive linguistic abilities of LLMs, we conclude that claims about their human-likeness in the grammatical domain are premature.
Computational Linguistics,
Journal year: 2024, Issue: unknown, pp. 1-36
Published: July 30, 2024
Abstract
How should we compare the capabilities of language models (LMs) and humans? In this article, I draw inspiration from comparative psychology to highlight challenges in these comparisons. I focus on a case study: processing of recursively nested grammatical structures. Prior work suggests that LMs cannot process these structures as reliably as humans can. However, the humans were provided with instructions and substantial training, while the LMs were evaluated zero-shot. I therefore match the evaluation conditions more closely. Providing large LMs with a simple prompt (with substantially less content than the human training) allows them to consistently outperform the human results, even in more deeply nested conditions than those tested with humans. Furthermore, the effects of prompting are robust to the particular vocabulary used in the prompt. Finally, reanalyzing the existing human data suggests that humans may not perform above chance at the most difficult structures initially. Thus, large LMs may indeed process recursively nested grammatical structures as reliably as humans, when evaluated comparably. This case study highlights how discrepancies in evaluation methods can confound comparisons of language models and humans. I conclude by reflecting on the broader challenge of comparing human and model capabilities, and on an important difference between evaluating cognitive models and foundation models.
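To make the prompting manipulation concrete, here is a hedged sketch of a few-example prompt for judging recursively nested (center-embedded) sentences. The sentences and wording are illustrative and are not the article's materials or its exact task.

```python
# Sketch: few-example prompt for judging recursively nested (center-embedded) sentences.
examples = [
    ("The dog the cat chased barked.", "grammatical"),
    ("The dog the cat chased the mouse.", "ungrammatical"),
]
query = "The report the analyst the manager hired wrote was long."

prompt = "Decide whether each sentence is grammatical or ungrammatical.\n"
for sentence, label in examples:
    prompt += f"Sentence: {sentence}\nJudgement: {label}\n"
prompt += f"Sentence: {query}\nJudgement:"

print(prompt)  # a real evaluation would send this to an LM and read its next-token judgement
```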
Heliyon,
Journal year: 2025, Issue: 11(2), pp. e42083
Published: Jan. 1, 2025
Sentence stimuli pervade psycholinguistics research. Yet, limited attention has been paid to the automatic construction of sentence stimuli. Given their linguistic capabilities, this study investigated the efficacy of ChatGPT in generating sentence stimuli and of AI tools in producing auditory stimuli. In three psycholinguistic experiments, we examined the acceptability and validity of AI-formulated sentences written in one of two languages: English and Arabic. In Experiments 1 and 3, participants gave AI-generated sentences similar or higher ratings than human-composed ones; in Experiment 2, AI-generated Arabic sentences received lower ratings than their human-composed counterparts. The validity of the AI-developed stimuli relied on the experimental design, with only Experiment 2 demonstrating the target effect. These results highlight the promising role of ChatGPT as a stimulus developer, which could facilitate psycholinguistic research and increase its diversity. Implications for psycholinguistic research were discussed.
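As a rough sketch of automatic stimulus construction of the kind evaluated above, the snippet below asks a chat model to draft candidate sentence stimuli under simple constraints. The constraints, prompt, and model name are hypothetical; the study's actual generation instructions may differ.

```python
# Sketch: prompting a chat model to draft sentence stimuli under simple constraints.
from openai import OpenAI

client = OpenAI()
constraints = ("Write 5 semantically plausible English sentences, each 8-10 words long, "
               "in simple past tense, each containing exactly one animate subject.")

resp = client.chat.completions.create(
    model="gpt-4o-mini",   # any chat-capable model; the choice here is illustrative
    messages=[{"role": "user", "content": constraints}],
)
candidate_stimuli = resp.choices[0].message.content.splitlines()
print(candidate_stimuli)   # candidates would still be screened and normed by researchers
```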
In this paper, we focus on the challenging task of reliably estimating the factual knowledge that is embedded inside large language models (LLMs). To avoid reliability concerns with prior approaches, we propose to eliminate prompt engineering when probing LLMs for knowledge. Our approach, called the Zero-Prompt Latent Knowledge Estimator (ZP-LKE), leverages the in-context learning ability of LLMs to communicate both the question as well as the expected answer format. Our estimator is conceptually simpler (i.e., it doesn't depend on meta-linguistic judgments by LLMs) and easier to apply (i.e., it is not LLM-specific), and we demonstrate that it can surface more latent knowledge in LLMs. We also investigate how different design choices affect the performance of ZP-LKE. Using the proposed estimator, we perform a large-scale evaluation of a variety of open-source LLMs, like OPT, Pythia, Llama(2), Mistral, Gemma, etc., over a set of relations and facts from the Wikidata knowledge base. We observe differences between model families and sizes: some relations are consistently better known than others, but models differ in the precise facts they know, as do base models and their finetuned counterparts.
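A minimal sketch of the zero-prompt, in-context probing idea described above: example subject-object pairs alone communicate both the relation being asked about and the expected answer format, and the model is scored on whether it completes the final pair with the correct object. The relation, facts, and completion stub are illustrative; the full estimator evaluates many Wikidata relations and facts.

```python
# Sketch: zero-prompt, in-context factual probing via example (subject, object) pairs.
from typing import Callable

facts = [("France", "Paris"), ("Italy", "Rome"), ("Japan", "Tokyo"), ("Canada", "Ottawa")]

def probe(complete: Callable[[str], str], query_subject: str, gold_object: str,
          examples=facts) -> bool:
    # No natural-language instruction: the examples alone convey the relation
    # being queried and the expected answer format.
    context = "\n".join(f"{s} {o}" for s, o in examples)
    prediction = complete(f"{context}\n{query_subject} ").strip().split("\n")[0]
    return prediction == gold_object

# Stub completion function; a real run would read an open-weight LLM's next tokens.
print(probe(lambda prompt: "Berlin", "Germany", "Berlin"))
```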
Proceedings of the National Academy of Sciences,
Journal year: 2025, Issue: 122(19)
Published: May 9, 2025
What mechanisms underlie linguistic generalization in large language models (LLMs)? This question has attracted considerable attention, with most studies analyzing the extent to which the linguistic skills of LLMs resemble rules. As yet, it is not known whether generalization in LLMs could equally well be explained as the result of analogy. A key shortcoming of prior research is its focus on regular phenomena, for which rule-based and analogical approaches make the same predictions. Here, we instead examine derivational morphology, specifically English adjective nominalization, which displays notable variability. We introduce a method for investigating the generalization mechanisms of LLMs: Focusing on GPT-J, we fit cognitive models that instantiate rule-based and analogical learning to the LLM training data and compare their predictions on a set of nonce adjectives with those of the LLM, allowing us to draw direct conclusions regarding the underlying mechanisms. As expected, both approaches explain GPT-J's predictions for adjectives with regular nominalization patterns. However, for adjectives with variable nominalization patterns, the analogical model provides a much better match. Furthermore, GPT-J's behavior is sensitive to individual word frequencies, even for regular forms, consistent with an analogical account but not a rule-based one. These findings refute the hypothesis that GPT-J's linguistic generalization involves rules, suggesting analogy as the underlying mechanism. Overall, our study suggests that analogical processes play a bigger role in the linguistic generalization of LLMs than previously thought.
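As a toy contrast between the two kinds of account, the sketch below predicts a nominalization suffix for a nonce adjective either by a single categorical rule or by a similarity-weighted vote over stored exemplars. It is only illustrative; the study fits established cognitive models of both kinds to the LLM's training data rather than using raw string similarity.

```python
# Toy contrast: rule-based vs analogical prediction of an English nominalization suffix.
from difflib import SequenceMatcher

# Illustrative exemplar lexicon: adjective -> attested nominalization suffix.
lexicon = {"dense": "-ity", "scarce": "-ity", "obese": "-ity",
           "polite": "-ness", "remote": "-ness", "savage": "-ness", "strange": "-ness"}

def rule_based(adjective: str) -> str:
    # A single categorical rule that ignores the specific word: default to -ness.
    return "-ness"

def analogical(adjective: str) -> str:
    # Similarity-weighted vote over stored exemplars (string similarity as a crude proxy).
    votes = {"-ity": 0.0, "-ness": 0.0}
    for word, suffix in lexicon.items():
        votes[suffix] += SequenceMatcher(None, adjective, word).ratio()
    return max(votes, key=votes.get)

nonce = "cremese"   # made-up adjective
print(rule_based(nonce), analogical(nonce))
```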
Free associations have been extensively used in psychology and linguistics for studying how conceptual knowledge is organized. Recently, the potential of applying a similar approach for investigating the knowledge encoded in LLMs has emerged, specifically as a method for studying LLM biases. However, the absence of large-scale LLM-generated free association norms that are comparable with human-generated norms is an obstacle to this research direction. To address this, we create a new dataset modeled after the "Small World of Words" (SWOW) norms, covering nearly 12,000 cue words. We prompt three LLMs (Mistral, Llama3, and Haiku) with the same cues as those in SWOW to generate novel datasets, the "LLM World of Words" (LWOW). From these datasets, we construct network models of semantic memory that represent the conceptual knowledge possessed by humans and LLMs. We validate the datasets by simulating semantic priming within the network models, and we briefly discuss how they can be used to investigate implicit biases in humans and LLMs.
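A minimal sketch, under made-up data, of the pipeline the abstract describes: cue-response pairs become a semantic network, and network distance between a prime and a target serves as a crude stand-in for simulated priming. The associations and the priming proxy are illustrative; the authors' network construction and priming simulation may differ.

```python
# Sketch: build a free-association network and use path length as a crude priming proxy.
import networkx as nx

# Illustrative cue -> responses pairs (real norms contain thousands of cues).
associations = {
    "doctor": ["nurse", "hospital", "medicine"],
    "nurse": ["hospital", "care"],
    "bread": ["butter", "food"],
    "butter": ["toast", "food"],
}

G = nx.Graph()
for cue, responses in associations.items():
    for response in responses:
        G.add_edge(cue, response)

def priming_proxy(prime: str, target: str) -> float:
    # Shorter association paths ~ stronger expected priming (illustrative measure only).
    if not nx.has_path(G, prime, target):
        return 0.0
    return 1.0 / nx.shortest_path_length(G, prime, target)

print(priming_proxy("doctor", "hospital"))   # related pair -> higher score
print(priming_proxy("doctor", "toast"))      # unrelated pair -> 0.0 here (no path)
```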