bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: Nov. 28, 2024
Abstract
Real-world choices often involve balancing decisions that are optimized for the short- vs. long-term. Here, we reason that apparently sub-optimal single-trial decisions in macaques may in fact reflect long-term, strategic planning. We demonstrate that macaques freely navigating in VR toward sequentially presented targets will strategically abort offers, forgoing more immediate rewards on individual trials to maximize session-long returns. This behavior is highly specific to the individual, demonstrating knowledge about their own long-run performance. Reinforcement-learning (RL) models suggest this is algorithmically supported by modular actor-critic networks, with a policy module not only optimizing long-term value functions but also informed of state-action values, allowing rapid policy optimization. The artificial network suggests that changes tied to the matched offer ought to be evident as soon as offers are made, even if aborting occurs much later. We confirm this prediction, showing that single units and population dynamics in macaque dorsolateral prefrontal cortex (dlPFC), but not parietal area 7a or the dorsomedial superior temporal area (MSTd), reflect the upcoming reward-maximizing choice upon offer presentation. These results cast the dlPFC as a specialized policy module and stand in contrast to recent work emphasizing the distributed and recurrent nature of belief networks.
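The modular actor-critic description above is algorithmic; as a rough illustration of the idea that a policy module can be informed by state-action values in addition to the TD error, here is a minimal tabular sketch. All task sizes, learning rates, and the way Q enters the policy are assumptions for illustration, not the paper's model.

```python
# Minimal sketch (assumptions, not the paper's model): an actor-critic whose
# policy logits also read out state-action values (Q), so the actor can
# re-optimize quickly when the value of an offer changes.
import numpy as np

n_states, n_actions = 10, 2              # hypothetical task size
gamma, alpha = 0.95, 0.1                 # discount factor and learning rate

V = np.zeros(n_states)                   # critic: state values
Q = np.zeros((n_states, n_actions))      # auxiliary state-action values
theta = np.zeros((n_states, n_actions))  # actor: policy preferences

def policy(s):
    # Softmax over preferences plus current Q estimates.
    logits = theta[s] + Q[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def step_update(s, a, r, s_next):
    delta = r + gamma * V[s_next] - V[s]            # TD error
    V[s] += alpha * delta                           # critic update
    Q[s, a] += alpha * (r + gamma * V[s_next] - Q[s, a])
    p = policy(s)
    grad = -p
    grad[a] += 1.0                                  # d log pi / d theta[s]
    theta[s] += alpha * delta * grad                # actor update

# One illustrative transition: in state 0, sample an action, observe reward.
s = 0
a = int(np.random.choice(n_actions, p=policy(s)))
step_update(s, a, r=1.0, s_next=1)
```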
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2024, Volume and Issue: unknown
Published: Feb. 6, 2024
Associative learning depends on contingency, the degree to which a stimulus predicts an outcome. Despite its importance, the neural mechanisms linking contingency and behavior remain elusive. Here we examined dopamine activity in the ventral striatum - a signal implicated in associative learning - in a Pavlovian contingency degradation task in mice. We show that both anticipatory licking and dopamine responses to the conditioned stimulus decreased when additional rewards were delivered uncued, but remained unchanged if they were cued. These results conflict with contingency-based accounts, whether using the traditional definition of contingency or a novel causal learning model (ANCCR), but can be explained by temporal difference (TD) learning models equipped with an appropriate inter-trial-interval (ITI) state representation. Recurrent neural networks trained within the TD framework develop state representations like our best 'handcrafted' model. Our findings suggest that the TD error measure describes dopaminergic activity well.
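As a rough illustration of what an "appropriate ITI state representation" buys a TD model, here is a minimal tabular TD(0) sketch with an explicit ITI state; uncued rewards are credited to the ITI state and raise the prediction against which the cue-onset response is computed. The state layout, parameters, and reward probabilities are toy assumptions, not the models fitted in the paper.

```python
# Minimal sketch (toy assumptions): tabular TD(0) with one explicit ITI state.
import numpy as np

N_CUE = 5                      # cue/delay sub-states leading to the cued reward
ITI = N_CUE                    # single ITI state (index N_CUE)
gamma, alpha = 0.98, 0.05
p_start = 0.1                  # per-step probability of leaving the ITI
V = np.zeros(N_CUE + 1)
rng = np.random.default_rng(0)

def simulate(n_steps, p_uncued_reward=0.0):
    """Run the chain and return the average TD error at cue onset."""
    s, cue_deltas = ITI, []
    for _ in range(n_steps):
        if s == ITI:
            s_next = 0 if rng.random() < p_start else ITI
            r = 1.0 if rng.random() < p_uncued_reward else 0.0  # uncued reward in ITI
        else:
            s_next = s + 1 if s + 1 < N_CUE else ITI
            r = 1.0 if s == N_CUE - 1 else 0.0                  # cued reward at chain end
        delta = r + gamma * V[s_next] - V[s]
        if s == ITI and s_next == 0:                            # ITI -> cue onset
            cue_deltas.append(delta)
        V[s] += alpha * delta
        s = s_next
    return np.mean(cue_deltas)

# Uncued ITI rewards inflate the ITI state value and shrink the cue-onset error;
# extra rewards preceded by their own cue would be credited to that cue's
# states instead (not modeled in this sketch).
baseline = simulate(50_000)
degraded = simulate(50_000, p_uncued_reward=0.05)
```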
Nature Communications, Journal Year: 2024, Volume and Issue: 15(1)
Published: July 12, 2024
Abstract
The dominant theoretical framework to account for reinforcement learning in the brain is temporal difference (TD) learning, whereby certain units signal reward prediction errors (RPE). The TD algorithm has traditionally been mapped onto the dopaminergic system, as the firing properties of dopamine neurons can resemble RPEs. However, certain TD predictions are inconsistent with experimental results, and previous implementations have made unscalable assumptions regarding stimulus-specific fixed temporal bases. We propose an alternate framework to describe dopamine signaling in the brain, FLEX (Flexibly Learned Errors in Expected Reward). In FLEX, dopamine release is similar, but not identical, to RPE, leading to predictions that contrast with those of TD. While FLEX is itself a general framework, we describe a specific, biophysically plausible implementation, the results of which are consistent with a preponderance of both existing and reanalyzed data.
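For reference, the conventional TD reward-prediction error that FLEX is contrasted against is the textbook rule below (this is the standard TD formulation, not FLEX's own learning rule):

```latex
\delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t),
\qquad
V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t .
```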
Science Advances, Journal Year: 2024, Volume and Issue: 10(29)
Published: July 19, 2024
The brain may have evolved a modular architecture for daily tasks, with circuits featuring functionally specialized modules that match the task structure. We hypothesize that this architecture enables better learning and generalization than architectures with less specialized modules. To test this, we trained reinforcement learning agents with various neural architectures on a naturalistic navigation task. We found that the modular agent, with an architecture that segregates the computations of state representation, value, and action into specialized modules, achieved better learning and generalization. Its learned state representation combines prediction and observation, weighted by their relative uncertainty, akin to recursive Bayesian estimation. This agent's behavior also resembles macaques' behavior more closely. Our results shed light on a possible rationale for the brain's modularity and suggest that artificial systems can use this insight from neuroscience to improve learning and generalization in natural tasks.
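The uncertainty-weighted combination of prediction and observation described above is, in the scalar Gaussian case, the familiar recursive Bayesian (Kalman-style) update; a minimal sketch of that idea follows (scalar and Gaussian assumptions, not the agents' actual network computation):

```python
# Minimal sketch (scalar Gaussian assumption): fuse a prediction with an
# observation, each weighted by its relative uncertainty.
def fuse(pred, var_pred, obs, var_obs):
    """Posterior mean and variance after combining prediction and observation."""
    gain = var_pred / (var_pred + var_obs)   # more weight on the less uncertain source
    mean = pred + gain * (obs - pred)
    var = (1.0 - gain) * var_pred
    return mean, var

# Example: a confident prediction (low variance) is nudged only slightly by a
# noisy observation.
mean, var = fuse(pred=1.0, var_pred=0.1, obs=2.0, var_obs=0.9)
```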
How the external/internal 'state' is represented in the brain is crucial, since an appropriate state representation enables goal-directed behavior. Recent studies suggest that state representation and value can be simultaneously learnt through reinforcement learning (RL) using reward-prediction-error-based updates of a recurrent neural network (RNN) and its downstream weights. However, how such learning can be neurally implemented remains unclear, because training of the RNN by the 'backpropagation' method requires the downstream weights, which are biologically unavailable at the upstream RNN. Here we show that using random feedback instead of the downstream weights still works through 'feedback alignment', a mechanism originally demonstrated for supervised learning. We further show that if the feedback is constrained to be non-negative, learning occurs even without strict alignment, because the non-negative constraint ensures loose alignment. These results suggest neural mechanisms for RL of state representation and value, and the power of biological constraints.
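A minimal supervised toy of the feedback-alignment idea referenced above: a fixed random matrix replaces the transposed downstream weights in the backward pass, and the non-negative variant simply constrains that matrix to be non-negative. The paper's setting is RL of state representation with an RNN; this sketch, including network sizes and the toy target, is only an assumption-laden illustration.

```python
# Minimal sketch (assumptions, not the paper's setup): two-layer network
# trained with feedback alignment, using a fixed random non-negative feedback
# matrix B instead of W2.T in the backward pass.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 8, 16, 1
W1 = rng.normal(scale=0.1, size=(n_hid, n_in))          # upstream weights
W2 = rng.normal(scale=0.1, size=(n_out, n_hid))         # downstream readout weights
B = np.abs(rng.normal(scale=0.1, size=(n_hid, n_out)))  # fixed, random, non-negative feedback
lr = 0.05

for _ in range(1000):
    x = rng.normal(size=(n_in,))
    target = np.array([0.1 * x.sum()])                  # arbitrary toy target
    h = np.tanh(W1 @ x)
    y = W2 @ h
    err = target - y
    # Backward pass: B carries the error to the hidden layer instead of W2.T,
    # which would be biologically unavailable at the upstream network.
    dh = (B @ err) * (1.0 - h ** 2)
    W2 += lr * np.outer(err, h)
    W1 += lr * np.outer(dh, x)
```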
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown
Published: Dec. 28, 2023
Hippocampal place cells fire in sequences that span spatial environments and non-spatial modalities, suggesting that hippocampal activity can anchor to the most behaviorally salient aspects of experience. As reward is a highly salient event, we hypothesized that place cell sequences may anchor to rewards. To test this, we performed two-photon imaging of hippocampal CA1 neurons as mice navigated virtual environments with changing hidden reward locations. When the reward moved, the firing fields of a subpopulation of cells moved to the same relative position with respect to reward, constructing a sequence of reward-relative cells that spanned the entire task structure. The density of these reward-relative sequences increased with experience, as additional neurons were recruited to the reward-relative population. Conversely, a largely separate subpopulation maintained a spatially-based place code. These findings thus reveal that hippocampal ensembles flexibly encode multiple reference frames, reflecting the structure of experience.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown
Published: July 21, 2023
The orbitofrontal cortex (OFC) and hippocampus (HC) are both implicated in forming the cognitive or task maps that support flexible behavior. Previously, we used dopamine neurons as a sensor or tool to measure the functional effects of OFC lesions (Takahashi et al., 2011). We recorded midbrain dopamine neurons as rats performed an odor-based choice task, in which errors in the prediction of reward were induced by manipulating the number or timing of expected rewards across blocks of trials. We found that OFC lesions ipsilateral to the recording electrodes caused these prediction errors to be degraded, consistent with a loss of resolution in the representation of task states, particularly under conditions where hidden information was critical for sharpening the predictions. Here we have repeated this experiment, along with computational modeling of the results, in rats with HC lesions. The results show that the HC also shapes the map of our task; however, unlike the OFC, which provides information local to the trial, the HC appears necessary for estimating upper-level states based on information that is discontinuous or separated by longer timescales. This contrast in their respective roles in cognitive mapping adds to evidence that dopamine neurons have access to a rich set of information from distributed regions regarding the predictive structure of the environment, potentially enabling this powerful teaching signal to support complex learning.
bioRxiv (Cold Spring Harbor Laboratory), Journal Year: 2023, Volume and Issue: unknown
Published: Nov. 13, 2023
Abstract
Dopamine release in the nucleus accumbens has been hypothesized to signal reward prediction error, the difference between observed and predicted reward, suggesting a biological implementation for reinforcement learning. Rigorous tests of this hypothesis require assumptions about how the brain maps sensory signals to reward predictions, yet this mapping is still poorly understood. In particular, the mapping is non-trivial when sensory signals provide ambiguous information about the hidden state of the environment. Previous work using classical conditioning tasks has suggested that reward predictions are generated conditional on probabilistic beliefs about the hidden state, such that dopamine implicitly reflects these beliefs. Here we test this hypothesis in the context of an instrumental task (a two-armed bandit), where the hidden state switches repeatedly. We measured choice behavior and recorded dLight signals reflecting dopamine release in the nucleus accumbens core. Model comparison among a wide set of cognitive models based on the behavioral data favored models that used Bayesian updating of beliefs about the hidden state. These same models also quantitatively matched the dopamine measurements better than non-Bayesian alternatives. We conclude that probabilistic belief computation contributes to task performance in mice and is reflected in mesolimbic dopamine signaling.
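A minimal sketch of Bayesian belief updating in a two-armed bandit with a switching hidden state, of the general kind compared in the model set above (the reward probabilities and hazard rate here are hypothetical, not the fitted values):

```python
# Minimal sketch (hypothetical parameters): posterior belief that the left arm
# is currently the "good" arm, updated after each outcome and adjusted for a
# possible hidden-state switch before the next trial.
P_GOOD, P_BAD = 0.8, 0.2     # assumed reward probabilities of good vs. bad arm
HAZARD = 0.05                # assumed per-trial probability of a state switch

def update_belief(b_left_good, choice, rewarded):
    """Return the updated probability that the left arm is the good one."""
    # Likelihood of the observed outcome under each hidden state
    p_if_left_good = P_GOOD if choice == "left" else P_BAD
    p_if_right_good = P_BAD if choice == "left" else P_GOOD
    if not rewarded:
        p_if_left_good = 1.0 - p_if_left_good
        p_if_right_good = 1.0 - p_if_right_good
    # Bayes rule
    post = (b_left_good * p_if_left_good) / (
        b_left_good * p_if_left_good + (1.0 - b_left_good) * p_if_right_good
    )
    # Account for a possible switch before the next trial
    return post * (1.0 - HAZARD) + (1.0 - post) * HAZARD

b = 0.5                                              # start uncertain
b = update_belief(b, choice="left", rewarded=True)   # belief shifts toward "left is good"
```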