CONFIDENTIAL. Limited circulation. For review only.
Modeling and analysis of dynamic decision making
in sequential two-choice tasks
Linh Vu and Kristi A. Morgansen
Department of Aeronautics and Astronautics
University of Washington, Seattle
Email: {linhvu,morgansn}@u.washington.edu
Abstract—The focus of the work in this paper is the
construction and analysis of a dynamical system model for
human decision making in sequential two-choice tasks. In
these tasks, a human subject makes a series of interrelated
decisions between two choices in order to maximize the
reward, which depends on the past choices. Such experiments are common in behavioral and cognitive sciences
and have been previously studied using prediction-error
models and drift-diffusion models. Previous works have
mainly examined average behaviors of humans over the
entire task duration. In this work, we analyze asymptotic
behaviors of human decision making in such tasks, using
a nominal decision making policy based on behavioral
aspects of humans. A hybrid system is used to model the
closed loop of the decision making process in sequential
two-choice tasks. Our work presents a control theory
oriented perspective to explain observations in experiments
carried out by cognitive scientists.
I. INTRODUCTION
Operation of mixed teams of humans and robots has recently attracted new research interest, with the goal of ensuring that the entire human/robot system maintains certain required performance even when the performance of the humans is affected by factors such as fatigue or stress. This issue is especially important in large scale human-robot systems, such as platoons of soldiers commanding fleets of UAVs (Unmanned Aerial Vehicles)/UGVs (Unmanned Ground Vehicles) [10], [11], air-traffic control systems [7], or emergency rescue systems [9].
One way to address the above issue is to incorporate human decision making dynamics into autonomous robot control design. By doing so, one can make robots behave in a way that accommodates (sometimes) error-prone and non-optimal human decisions so that the performance of the overall team of humans and robots is achieved and maintained. This line of research requires knowledge of both human decision making and autonomous control and has lately attracted collaborative research between cognitive researchers and control researchers.
This work is supported in part by AFOSR grant FA9550-07-10528.
In this paper, we use control tools to study one particular type of human decision making, namely decision making in sequential two-choice tasks. This type of task is devised by cognitive scientists to study cognitive and behavioral aspects of human decision making [5], [8] (see also [3] on how human decision making in mixed teams of humans and robots can be mapped to this type of task). In a sequential two-choice task, a participant
is presented with two choices A and B . At each time,
the participant chooses either A or B and receives a
reward, which depends on the current choice and the
number of A choices among the last (fixed number of)
decisions. The goal for the participant is to maximize the
reward. The participant does not know how the reward
is calculated, and thus needs to explore through a series
of A and B using rewards as feedback. In this type of
tasks, a primary question being addressed by cognitive
scientists is how optimal decision making is related to
human factors such as personality.
Human decision making in sequential two-choice
tasks has been modeled using the prediction-error model
[8], which is a neural network/reinforcement learning
model with two weights, one for each choice. The
probability of the next choice being biased towards
one choice is a function of the difference between the
weights. Closely related to the sequential two-choice
task is the two alternative forced-choice task in which
participants need to determine the direction of moving
dots embedded in a noisy screen. The two alternative
forced-choice task has been modeled using drift diffusion
models (see, e.g. [1], [2], [12] and the references therein),
in which the difference between the amounts of evidence
supporting one choice over the other is integrated over
time until it crosses a threshold. Experimental data are
used to determine the parameters of these two models. The criterion for fitting data for the prediction-error model is the number of A choices over the entire experiment, and
Preprint submitted to 47th IEEE Conference on Decision and Control.
Received March 13, 2008.
for drift-diffusion models, the average reaction time and
error rate over the entire experiment.
The aim of this work is to bring a control engineering perspective to modeling and analyzing sequential two-choice decision tasks. In particular, we focus on asymptotic behaviors of human decision making, a feature which has not been addressed adequately in [1], [2], [8].
While human decision making is stochastic in nature,
as a first step, we model human decision making as a
deterministic policy. The policy is motivated by human
behavioral characteristics. We then extend the result to
include probability of human deviation from the nominal
policy.
Compared to the prediction-error model and the drift-diffusion models, our model is somewhat simpler and
enables us to analyze asymptotic behaviors for various
types of reward structures in sequential two-choice tasks.
Moreover, by directly modeling human characteristics
(such as conservativeness and riskiness), our model can
capture behavioral aspects of human decision making
(cf. the prediction-error model always behaves as a conservative player [8]). Our work here lies between the
areas of cognitive science research and control theory.
With respect to the cognitive research community, this
work brings another perspective and explanation of the
dynamics of human decision making. To the control
community, this work introduces a particular class of
dynamical systems coming from cognitive science which
is quite different from the classical control system theme
in that the dynamics to be controlled are completely unknown to the controller, and the controller has limited resources (i.e., the system is a black box and the controller is the human, who has limited memory and computation capability); the system behavior is then assessed via the structure of the system.
II. SEQUENTIAL TWO-CHOICE TASK
A. The task
A sequential two-choice task is as follows [5]: a
participant is presented with two choices (or two buttons
on a screen), A and B . At each time, the participant can
choose either A or B and receives a reward afterward.
The goal is to maximize the reward. The reward is calculated as follows: if the participant chooses A, and the fraction of A among the last fixed number of (e.g., 20) decisions is x, the reward is ϕA(x); if the participant chooses B, and the fraction of A among the last 20 decisions is x, the reward is ϕB(x) (see Fig. 1). The fixed number (20 here) is called the window of the task; it can be chosen differently but should be sufficiently large. The pair (ϕA, ϕB) is called a reward structure. Examples of common reward structures [8] are plotted in Fig. 1, where the horizontal axis is the percentage of allocation to A in the last N decisions, and the vertical axis is the reward (a non-negative number) for a choice of A or B at the given percentage of A.
[Fig. 1. Common reward structures [8]. (a) Matching shoulder task: ϕA (reward to A) and ϕB (reward to B) cross at w1. (b) Rising optimum task: crossing points at w1 and w2. Horizontal axis: allocation to A in [0, 1]; vertical axis: reward.]
The difficulty for a participant in the task is that the
participant knows neither how the reward is calculated
nor the reward functions ϕA , ϕB . All information available to the participant is contained in the rewards. By
exploring through a series of A and B choices and using
the rewards as feedback, a participant aims to find the
sequence of A or B that achieves the maximum reward.
The difficulty depends on the reward structure, and cognitive scientists are interested in finding out whether a human can find the global optimum (in Fig. 1, the optimum is near the point 1 of the horizontal axis) and how human characteristics (such as personality) influence performance in these tasks.
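The sliding-window reward mechanics above can be sketched in a few lines of Python. The linear reward functions and the window length N = 4 below are illustrative placeholders, not the experimental choices; following the model in Section III, the reward is evaluated on the window before the current choice enters it:

```python
from collections import deque

def make_task(phi_A, phi_B, N):
    """Sequential two-choice task: the reward for a choice depends on the
    fraction of A choices in the sliding window of the last N decisions."""
    window = deque([0] * N, maxlen=N)  # 1 encodes choice A, 0 encodes choice B

    def step(choice):
        x = sum(window) / N                        # fraction of A in the window
        window.append(1 if choice == "A" else 0)   # the window slides by one decision
        return phi_A(x) if choice == "A" else phi_B(x)

    return step

# Illustrative (assumed) reward structure: phi_A decreasing, phi_B increasing.
task = make_task(lambda x: 1.0 - x, lambda x: x, N=4)
r1 = task("A")   # window was all B, so x = 0 and the reward is phi_A(0) = 1.0
r2 = task("A")   # now x = 0.25, so the reward is phi_A(0.25) = 0.75
```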
B. Experimental observations
In experiments using the matching shoulder task (Fig.
1a) and a window N = 40, cognitive scientists have
observed that for a majority of the human test subjects
(18 out of 24), the human decisions are such that the
average of A over the entire experiment is biased towards
the point w1 of the horizontal axis [5]. The point w1 is
known as a crossing point in the cognitive literature [6].
In the rising optimum task (Fig. 1b), for about half of
all the participants (14 out of 25), the average of A over
the entire experiment is biased toward the point w1 . A
smaller percentage of the participants are biased toward
the point 1 of the horizontal axis. These particular biases suggest that there is some mechanism underlying human decision making in sequential two-choice tasks.
III. MODELING DECISION MAKING IN SEQUENTIAL TWO-CHOICE TASKS
Human decision making in sequential two-choice
tasks belongs to a class of dynamic decision making
processes in which a human makes a sequence of interdependent decisions in order to achieve some objective. The decisions are interdependent because the human receives feedback after every decision, and also because the environment within which the human operates can change autonomously and/or as a result of the human decisions.
Denote by tk ∈ R, k = 1, 2, . . ., the times at which the human makes decisions (by convention, t0 is the initial time), and by u(tk) the corresponding decisions, where u(tk) ∈ {A, B}. Denote by x the fraction of A in the last N choices, where N is the window of the task. Between tk and tk+1, x does not change, and so the dynamics of x are ẋ(t) = 0 for t ∈ (tk, tk+1). At time tk, x is changed according to the following impulsive map:
changed according to the following impulsive map:
x(t+
k ) = g(x(tk ), u(tk ))
(
min{x(tk ) + N1 , 1}
=
max{x(tk ) − N1 , 0}
if u(tk ) = A
if u(tk ) = B.
(1)
In general, sequential decision tasks can be modeled
as feedback hybrid dynamical systems. Denote by U
the set of decisions a human can make; U is a discrete
set. Denote by x ∈ Rn the environment state. Between
two consecutive decision times, the environment evolves
according to ẋ = f (x) for some function f . At a decision
time, an impulsive jump occurs due to the human action:
x(t+
k ) = g(x(tk ), u(tk )), where g is the impulsive map.
The feedback to the human at time tk is denoted by
y(tk ) = h(x(tk ), u(tk )) for some function h. Finally,
the human decision at time tk is u(tk ) = ρ(I(tk )) for
some function ρ. Thus, for a general sequential decsion
task, the closed-loop system is a feedback system of the
following form (see Fig. 2; the dash line is to indicate
that communication happens only at times tk ):
ẋ(t) = f (x(t)), t 6= tk
+
x(tk ) = g(x(tk ), u(tk ))
(5)
y(tk ) = h(x(tk ), u(tk ))
u(tk ) = ρ(I(tk ))
I(tk ) = I(tk−1 ) ∪ y(tk−1 ).
ẋ = f (x)
x(t+
)
k = g(x(tk ), u(tk ))
The min and max functions are used to capture the fact
that x cannot go below 0 or above 1. Let (ϕA , ϕB ) be
a reward structure, where {ϕA , ϕB } : [0, 1] → [0, Rmax ]
for some number Rmax . The reward to the participant is
y(tk ) = ϕu(tk ) (x(tk )).
x(tk ) = ρ(I(tk ))
I(tk ) = I(tk−1) ∪ y(tk )
(2)
In control terminology, the decision u is the control
signal and y is the output. In general, the control signal
is of the form
u(tk ) = ρ(I(tk ))
System (the environment)
(3)
for some function ρ (which could be probabilistic
and vary from person to person), and I(tk ) :=
{u(ti ), y(ti )}, i < k} is the information available to
the controller (i.e. the human in this case) at time tk .
The function ρ reminiscences a nonlinear dynamical
system/filter in general. The closed loop system of the
sequential two choice task is written as
ẋ(t) = (
0 t 6= tk
1
min{x(t−
if u(tk ) = A
k ) + N , 1}
x(t
)
=
k
−
1
max{x(tk ) − N , 0} if u(tk ) = B.
y(tk ) = ϕu(tk ) (x(tk ))
u(tk ) = ρ(I(tk ))
I(t ) = I(t ) ∪ y(t ).
k
k−1
k−1
(4)
Controller (the human)
Fig. 2.
The closed loop system of a sequential decision task
The closed loop system (5) is a hybrid system because
the closed loop comprises both variables taking continuous values (i.e. the variable x) and variables taking
discrete values (i.e. the variable u).
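A minimal sketch of the impulsive map, with the clamping to [0, 1] made explicit (the function name g mirrors the notation in (1)):

```python
def g(x, u, N):
    """Impulsive map (1): an A choice raises the A-fraction x by 1/N,
    a B choice lowers it by 1/N, clamped to the interval [0, 1]."""
    return min(x + 1.0 / N, 1.0) if u == "A" else max(x - 1.0 / N, 0.0)

# The clamping keeps x in [0, 1] regardless of the decision sequence.
```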
A. The γ-decision making policy
We do not solve for a control law ρ that optimally plays the sequential two-choice task, but rather try to understand human decision making in such tasks (which could be non-optimal). In this paper, we seek a static control law ρ, which is termed a policy in this context.
A static ρ is of interest for the following reason: a
static ρ can be seen as an input-output characterization
of human decision making dynamics. Such a static ρ
makes analysis of asymptotic behaviors of the closed
loop control system (4) possible. Also, this input-output
approach allows us to incorporate human behavioral
characteristics (such as exploring or hedging) directly
into the model (further discussions are in Section VI).
We now present one simple decision making rule—
termed the γ -policy—in sequential two-choice tasks. The
proposed policy works as follows: If the last action
does not decrease the reward, then continue with the
current choice, otherwise switch to the other choice.
Mathematically, at time tk, the decision u(tk) is

u(tk) := { u(tk−1) if y(tk−1) ≥ y(tk−2); switch(u(tk−1)) otherwise,   (6)

where switch(a) = B if a = A, and switch(a) = A if a = B. By convention, y(t0) = y(t1).
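The γ-policy is a two-reward lookup; a direct transcription in Python (function names are ours):

```python
def switch(u):
    """The switch map: switch(A) = B and switch(B) = A."""
    return "B" if u == "A" else "A"

def gamma_policy(u_prev, y_prev, y_prev2):
    """gamma-policy (6): keep the previous choice when the last reward did not
    decrease (y(t_{k-1}) >= y(t_{k-2})); otherwise switch to the other choice."""
    return u_prev if y_prev >= y_prev2 else switch(u_prev)
```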
The rationale behind the γ-policy is as follows. In sequential two-choice tasks, where the goal is to seek the maximum reward, we postulate that humans largely follow two courses of action:
• A1: If the last reward increases or does not decrease, one keeps the current choice.
• A2: If the last reward decreases, one immediately switches the decision.
These courses of action are motivated by human behavioral characteristics. If the outcome is as expected, i.e., when the last choice does not decrease the reward, there is no incentive to change from the last decision (alternatively, there is an incentive to continue with the same decision). If the outcome is not as expected, i.e., when the last choice decreases the reward, there is an incentive to switch the decision (in order to avoid suffering further potential losses). In cognitive psychology terminology, actions of Type A1 are known as exploitation and actions of Type A2 are known as exploration, and relationships between exploitation and exploration in human decision making are a major research theme in cognitive science [4]. Using this terminology, the first case of (6) encodes exploitation, and the second case encodes exploration. Switching between the two is triggered by whether expectation is met or not (i.e., by the logical condition y(tk−1) ≥ y(tk−2)).
The actions A1 and A2 are idealistic. Some humans may try to explore by switching decisions even if there is no decrease in rewards (for example, in gambling types of activities), and some may stay with a decision even if there is a temporary decline in rewards (for example, such traits can be found in long term stock investors). We discuss modifications of the policy (6) to incorporate these deviations later in Section VI, but for the moment, for the sake of conveying our idea and for analysis, we use the (ideal) policy (6).
IV. ANALYSIS
A. Reward structures
In sequential two-choice tasks under the γ -policy,
asymptotic behaviors of u and x depend solely on the
reward structure. To simplify the study here, we assume that no plateau is present in the reward structure (i.e., there is no interval of nonzero length on which ϕA or ϕB is constant), but the approach in this paper can be extended to include this case.
We identify eight basic types of reward structures based on the monotonicity of ϕA and ϕB and their relative positions. We write ϕA > ϕB if ϕA(x) > ϕB(x) ∀x ∈ (0, 1). Similarly, we write ϕA < ϕB if ϕA(x) < ϕB(x) ∀x ∈ (0, 1). The definitions and examples of the basic reward structures are plotted in Fig. 3; the horizontal axis is the percentage of A choices in the last N decisions, and the vertical axis is the reward for a choice of A or B at the given percentage of A in the last N decisions. The range of the rewards is [0, Rmax] for some Rmax, and without loss of generality, Rmax can be taken as 1 after some scaling. For a structure Γ = (ϕA, ϕB), denote by Type(Γ) the type of Γ; for example, if Γ is of Type 1, we write Type(Γ) = 1.
A regular reward function is a continuous function whose derivative can only change sign a finite number of times in any finite interval (an example of an irregular reward function is ϕ(x) = sin(1/(x − a)), which oscillates infinitely fast near x = a). A reward structure is regular if both ϕA and ϕB are regular reward functions and, moreover, if ϕA and ϕB intersect, the number of crossing points is finite. Regularity of reward structures is not a restriction, but rather a formal technicality for the proof. All the reward structures implemented in experiments [5], [8] have been regular.
To facilitate analysis, we allow the domains of reward structures to be general intervals D ⊂ R other than the interval [0, 1]. The definition of the basic reward structures in Fig. 3 does not change when we replace [0, 1] by an interval D. We say that a reward structure (ϕA, ϕB) on a domain D1 is a substructure of a reward structure (fA, fB) on a domain D2 if D1 ⊂ D2, ϕA is a portion of fA, and ϕB is a portion of fB on the same interval D1. A reward structure is a serial composition of substructures if it is the same as the concatenation of the substructures. We have the following lemma (due to space limitations, the proof is omitted; see [13] for details).
[Fig. 3. The basic reward structures for sequential two-choice tasks. Each panel plots ϕA and ϕB on [0, 1]:
Type 1: sgn ϕ̇A > 0, sgn ϕ̇B > 0, ϕA > ϕB;  Type 2: sgn ϕ̇A > 0, sgn ϕ̇B < 0, ϕA > ϕB;  Type 3: sgn ϕ̇A < 0, sgn ϕ̇B > 0, ϕA > ϕB;  Type 4: sgn ϕ̇A < 0, sgn ϕ̇B < 0, ϕA > ϕB;
Type 5: sgn ϕ̇A > 0, sgn ϕ̇B > 0, ϕA < ϕB;  Type 6: sgn ϕ̇A < 0, sgn ϕ̇B > 0, ϕA < ϕB;  Type 7: sgn ϕ̇A > 0, sgn ϕ̇B < 0, ϕA < ϕB;  Type 8: sgn ϕ̇A < 0, sgn ϕ̇B < 0, ϕA < ϕB.]
Lemma 1 Any regular reward structure can be uniquely
decomposed into a serial composition of substructures of
the basic types.
Example 1 The reward structure in Fig. 1a can be uniquely decomposed into two substructures: Γ1 on [0, w1] and Γ2 on [w1, 1], where w1 is the intersection of ϕA and ϕB. We have Type(Γ1) = 3 and Type(Γ2) = 6.
The reward structure in Fig. 1b can be uniquely decomposed into three substructures: Γ1, Γ2, and Γ3 on [0, w1], [w1, w2], and [w2, 1], respectively. We have Type(Γ1) = 3, Type(Γ2) = 6, and Type(Γ3) = 5.
⊳
B. Asymptotic behavior

The state x is bounded because the impulsive map g is bounded, and the dynamics of x are stable between decision times (recall that in the sequential two-choice task, ẋ = 0). We can further show that under the γ-policy, the decisions u exhibit asymptotic behaviors. As in the previous section, for generality, we allow the domain of the reward structure to be an arbitrary interval [a, b], and further, we allow a general step change ∆ instead of 1/N in the impulsive map g in (1):

ḡ(x(tk), u(tk)) = { min{x(tk) + ∆, b} if u(tk) = A; max{x(tk) − ∆, a} if u(tk) = B }.   (7)

Let z(tk) := (u(tk−1), u(tk)) be the ordered pair of the two consecutive decisions at time tk; it is clear that z(tk) ∈ {AA, AB, BA, BB}. We write x → x∗ if there exists T < ∞ such that x(t) = x∗ for all t ≥ T. We have the following lemma (due to space limitations, the proof is omitted; see [13] for details).

Lemma 2 Consider the system (4) with the impulsive map (7). Let Γ be a basic reward structure on [a, b]. Under the γ-policy (6),
• if Type(Γ) ∈ {1, 3}, then x → b for all z(t0);
• if Type(Γ) ∈ {2, 4}, then x → a if z(t0) = BB, and x → b if z(t0) ≠ BB;
• if Type(Γ) ∈ {5, 7}, then x → b if z(t0) = AA, and x → a if z(t0) ≠ AA;
• if Type(Γ) ∈ {6, 8}, then x → a for all z(t0).
Corollary 1 Consider the system (4) with the impulsive map (7). Let Γ be a basic reward structure on [a, b]. Under the γ-policy (6), for every initial state x(t0) ∈ [a, b], there exist a finite time T and x∗ ∈ [a, b] such that z(tk) ∈ {BB, AA} and x(t) = x∗ for all tk ≥ T.
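Lemma 2 can be spot-checked numerically. The sketch below simulates the closed loop under the γ-policy with step 1/N; the linear ϕA and ϕB are illustrative stand-ins for Type 1 and Type 6 structures, not the experimental rewards:

```python
def simulate(phi_A, phi_B, x0, u0, N=10, steps=200):
    """Closed loop (4) under the gamma-policy (6): the reward is evaluated at
    the pre-jump state x(t_k), then the impulsive map (1) fires."""
    switch = {"A": "B", "B": "A"}
    x, u = x0, u0
    y0 = (phi_A if u == "A" else phi_B)(x)
    y_prev2, y_prev = y0, y0                       # convention y(t0) = y(t1)
    x = min(x + 1/N, 1.0) if u == "A" else max(x - 1/N, 0.0)
    for _ in range(steps):
        u = u if y_prev >= y_prev2 else switch[u]  # gamma-policy (6)
        y = (phi_A if u == "A" else phi_B)(x)      # reward (2)
        x = min(x + 1/N, 1.0) if u == "A" else max(x - 1/N, 0.0)  # map (1)
        y_prev2, y_prev = y_prev, y
    return x

# Type 1 (both increasing, phi_A > phi_B): Lemma 2 predicts x -> b = 1.
x_type1 = simulate(lambda x: 0.5 + 0.5 * x, lambda x: 0.5 * x, x0=0.5, u0="B")

# Type 6 (phi_A decreasing, phi_B increasing, phi_A < phi_B): x -> a = 0,
# up to the jump size 1/N (the state can cycle within one jump of a).
x_type6 = simulate(lambda x: 0.4 * (1 - x), lambda x: 0.5 + 0.4 * x, x0=0.5, u0="B")
```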
For reward structures that are not one of the eight basic types, we can decompose the structure into substructures of the basic types using Lemma 1 and then examine the behavior of the overall structure using Lemma 2. For a reward structure Γ on the domain [0, 1], let {Γ1, . . . , Γm} be the unique substructure decomposition of Γ. Denote by w1, . . . , wm−1 the corresponding partition points in [0, 1], so that the domain of Γ1 is [0, w1], . . . , and the domain of Γm is [wm−1, 1]. By convention, w0 = 0 and wm = 1. Lemma 2 tells us that within [wk, wk+1], x will go towards either wk or wk+1. At a point wk, x can jump from one interval to another. With respect to the behavior of x around a point wk, we have the following concept of attracting points:
Definition 1 A point wk is called an attracting point if there exist numbers T < ∞ and α > 0 such that if |x(t0) − wk| ≤ α, then for all z(t0), |x(t) − wk| ≤ ∆ for all t ≥ t0 + T, where t0 is the initial time.
The following lemma gives a condition for identifying attracting points of reward structures.

Lemma 3 The point x = 0 is an attracting point if and only if Type(Γ1) ∈ {6, 8}. The point x = 1 is an attracting point if and only if Type(Γm) ∈ {1, 3}. For wk ≠ 0, 1, wk is an attracting point if and only if Type(Γk) ∈ {1, 3} and Type(Γk+1) ∈ {6, 8}.
We have the following main result.
Theorem 1 Consider the system (4) under the γ-policy (6) with a reward structure Γ on [0, 1] and a window N. Let W be the set of attracting points of Γ. For any initial state x(t0) and any initial decision u(t0), there are a time T < ∞ and x∗ ∈ W ∪ {0, 1} such that |x(t) − x∗| ≤ 1/N for all t ≥ T.
Proof: From Lemma 2 and the definition of attracting points, x must asymptotically converge to 0, 1, or an attracting point wk ∈ (0, 1). The bound |x(t) − x∗| ≤ 1/N follows from the fact that the jump step size of the impulsive map g is 1/N.
Example 2 Consider the matching shoulder task in Fig. 1a. The reward structure is uniquely decomposed into two substructures: Γ1 on [0, w1] and Γ2 on [w1, 1], where w1 is the intersection of ϕA and ϕB. Since Type(Γ1) = 3 and Type(Γ2) = 6, the point w1 is an attracting point by Lemma 3. Also by Lemma 3, the points 0 and 1 are not attracting points. Theorem 1 tells us that under the γ-policy, for every initial state, x asymptotically converges to a ball of radius 1/N centered at w1. This prediction matches the observation that human decisions are biased towards the crossing point (see Section II-B). Also, the magnitude of the difference between ϕA and ϕB did not affect the human behaviors in the tests [5], and this feature is also captured in our model.
For the rising optimum task in Fig. 1b, the reward structure comprises three substructures: Γ1, Γ2, and Γ3 on [0, w1], [w1, w2], and [w2, 1], respectively, with Type(Γ1) = 3, Type(Γ2) = 6, and Type(Γ3) = 5. The point w1 is an attracting point; w2 and 1 are not attracting points. Depending on the initial state x(t0) and the initial decisions z(t0), x can converge towards w1 or 1, but it cannot stay at w2. This prediction matches the observation that human decisions are biased towards the crossing point w1 or the optimum point 1.
⊳
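The matching shoulder prediction in Example 2 can be reproduced numerically. The structure below (ϕA(x) = 1 − 0.7x, ϕB(x) = 0.3 + 0.7x, crossing at w1 = 0.5) is an illustrative stand-in for Fig. 1a, not the experimental reward functions:

```python
def run(phi_A, phi_B, x0, u0, N=10, steps=300):
    """Closed loop (4) under the gamma-policy (6) with jump size 1/N."""
    switch = {"A": "B", "B": "A"}
    x, u = x0, u0
    y0 = (phi_A if u == "A" else phi_B)(x)
    y_prev2, y_prev = y0, y0                       # convention y(t0) = y(t1)
    x = min(x + 1/N, 1.0) if u == "A" else max(x - 1/N, 0.0)
    for _ in range(steps):
        u = u if y_prev >= y_prev2 else switch[u]  # gamma-policy (6)
        y = (phi_A if u == "A" else phi_B)(x)      # reward (2)
        x = min(x + 1/N, 1.0) if u == "A" else max(x - 1/N, 0.0)  # map (1)
        y_prev2, y_prev = y_prev, y
    return x

# Illustrative matching shoulder: phi_A decreasing, phi_B increasing, w1 = 0.5.
phi_A = lambda x: 1.0 - 0.7 * x
phi_B = lambda x: 0.3 + 0.7 * x
finals = [run(phi_A, phi_B, x0, u0)
          for x0 in (0.1, 0.3, 0.7, 0.9) for u0 in ("A", "B")]
# Every run should end within 1/N of the attracting point w1 = 0.5.
```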
Theorem 1 enables us to predict outcomes of the γ-policy in sequential two-choice tasks for a wide variety of reward structures. Examples of other reward structures are plotted in Fig. 4 (axis labels and legends are the same as before and are omitted due to space limitations). Below each figure, the first line is the decomposition of the reward structure, and the second line is the set of points to which x can asymptotically converge under the γ-policy as predicted by Theorem 1. The reward structure in Fig. 4a has only recently been tested with humans. The reward structures in Fig. 4b, Fig. 4c, and Fig. 4d have not been tried with human subjects; human subject tests with reward structures of this form are a subject of future work and would demonstrate the flexibility of the methods presented here.
V. NONDETERMINISTIC γ-POLICY
A human may not always follow the γ-policy, since one is not consciously aware of it. However, as we discussed above, humans may unconsciously follow the γ-policy due to common human psychology. In this section, we relax the condition that a human follows the γ-policy at all times and allow the human to choose the decision dictated by the γ-policy only with a probability p. We call 1 − p the error probability.
[Fig. 4. Other reward structures and predictions by Theorem 1 with the γ-policy (axis labels and legends as in Fig. 1):
a. decomposition (1,3,6,8), predicted set {w2};
b. decomposition (3,6,7,2), predicted set {w1, 1};
c. decomposition (7,2,1,3), predicted set {0, 1};
d. decomposition (7,2), predicted set {0, 1}.]

In particular, a probabilistic γ-policy is

u(tk) := { u(tk−1) with prob. p, switch(u(tk−1)) with prob. 1 − p, if y(tk−1) ≥ y(tk−2);
switch(u(tk−1)) with prob. p, u(tk−1) with prob. 1 − p, if y(tk−1) < y(tk−2).   (8)

Because only Types 1, 3, 6, and 8 exhibit asymptotic behaviors for all initial decisions, we focus only on these types of reward structures in the probabilistic case (otherwise, one needs to further impose a probability distribution on the initial states).

Lemma 4 Consider the system (4) with the impulsive map (7). Let Γ be a basic reward structure on [a, b] such that Type(Γ) ∈ {1, 3, 6, 8}. Under the probabilistic γ-policy (8), for every initial state x(t0) and initial decision z(t0), we have P(z(t) ∈ {BB, AA} ∀t ≥ T for some T) = 1, and ∃x∗ ∈ [a, b] such that P(x(t) = x∗ ∀t ≥ T for some T) = 1.

Due to space limitations, the proof is omitted; see [13] for details. From Lemma 4 and Theorem 1, we arrive at the following result.

Theorem 2 Consider the system (4) under the probabilistic γ-policy (8) with a reward structure Γ on [0, 1] and a window N. Suppose that all the substructures of Γ are of Type 1, 3, 6, or 8. Let W be the set of attracting points of Γ. There exists p∗ ∈ (0, 1) such that if the error probability in (8) is less than p∗, then for any initial state x(t0) and any initial decision u(t0), ∃x∗ ∈ W ∪ {0, 1} such that P(|x(t) − x∗| ≤ 1/N ∀t ≥ T for some T) = 1.

Theorem 2 demonstrates that even if a human participant does not follow the γ-policy all the time in experiments, the sequence of decisions still converges to the steady state predicted in the deterministic case, provided the human indeed unconsciously follows the γ-policy most of the time (which is plausible given the nature of human psychology as discussed in the assumptions A1 and A2 in Section III-A).
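A sketch of the probabilistic γ-policy (8) in simulation. The reward functions are illustrative, and the deviation model (taking the opposite of the prescribed decision with probability 1 − p) follows (8):

```python
import random

def simulate_prob(phi_A, phi_B, x0, u0, p, N=10, steps=500, seed=0):
    """Closed loop (4) under the probabilistic gamma-policy (8): the decision
    prescribed by the gamma-policy is taken with probability p, and the
    opposite decision with error probability 1 - p."""
    rng = random.Random(seed)
    switch = {"A": "B", "B": "A"}
    x, u = x0, u0
    y0 = (phi_A if u == "A" else phi_B)(x)
    y_prev2, y_prev = y0, y0                              # convention y(t0) = y(t1)
    x = min(x + 1/N, 1.0) if u == "A" else max(x - 1/N, 0.0)
    for _ in range(steps):
        u_gamma = u if y_prev >= y_prev2 else switch[u]   # gamma-policy (6)
        u = u_gamma if rng.random() < p else switch[u_gamma]  # deviation, prob. 1 - p
        y = (phi_A if u == "A" else phi_B)(x)
        x = min(x + 1/N, 1.0) if u == "A" else max(x - 1/N, 0.0)
        y_prev2, y_prev = y_prev, y
    return x

# p = 1 recovers the deterministic policy; for a Type 6 structure the state
# ends within 1/N of a = 0, as in the deterministic analysis.
x_det = simulate_prob(lambda x: 0.4 * (1 - x), lambda x: 0.5 + 0.4 * x,
                      x0=0.5, u0="B", p=1.0)
# A small error probability keeps the state in [0, 1]; Lemma 4 predicts
# convergence with probability 1 when 1 - p is small enough.
x_noisy = simulate_prob(lambda x: 0.4 * (1 - x), lambda x: 0.5 + 0.4 * x,
                        x0=0.5, u0="B", p=0.95, seed=1)
```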
VI. DISCUSSION
The prediction-error model [8] and drift-diffusion models [1], [12] view two-choice tasks as a type of decision making which seeks a definitive answer of one choice over the other. These models use accumulated amounts of evidence supporting the choices as a basis for making decisions. The models also have links to neuroscience, in which the accumulated amounts of evidence relate to dopamine levels in the brain. However, fewer results have been reported in terms of behavioral characteristics of these models. For example, the prediction-error model has not been able to capture the behaviors of a certain class of decision making (the risky type) in sequential two-choice tasks [8].
On the other hand, behavioral science has established exploration and exploitation as the two main types of human behaviors in decision making (see, e.g., [4]). Using this fact as a basis, the γ-policy directly encodes the exploration and exploitation behavioral aspects of human decision making in sequential two-choice tasks. This approach first allows us to address asymptotic behaviors in sequential decision making, a topic relevant to control of mixed teams of humans and robots that has not been addressed much in the prediction-error model and drift-diffusion models. Further, human factors can be incorporated into the γ-policy as discussed below.
Extensions
Personality: A γ-policy with a threshold δ is:

u(tk) := { u(tk−1) if y(tk−1) ≥ y(tk−2) + δ; switch(u(tk−1)) otherwise.   (9)
If δ < 0, the modified strategy is more exploitative: the action does not change unless the reward degrades by an amount of at least |δ| (this can be seen as risky behavior because the player does not react to a negative development). On the other hand, if δ > 0, the strategy is more explorative: the decision is switched unless the reward improves by an amount of at least δ (this can be seen as conservative in sequential two-choice tasks because the goal is to seek the optimum, and so it is not risky to explore). Thus, in essence, the variable δ may capture the risk attitude of humans in sequential two-choice tasks: δ = 0 means risk neutral (most people), δ > 0 means conservative behavior (few people), and δ < 0 means risky behavior (few people).
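The thresholded policy (9) differs from (6) only in the comparison; a sketch (function names are ours):

```python
def switch(u):
    """The switch map: switch(A) = B and switch(B) = A."""
    return "B" if u == "A" else "A"

def gamma_policy_delta(u_prev, y_prev, y_prev2, delta):
    """Thresholded gamma-policy (9): keep the previous choice only if the last
    reward beats the one before it by at least delta; otherwise switch.
    delta < 0 tolerates small losses (exploitative/risky); delta > 0 demands a
    clear improvement before staying (explorative/conservative)."""
    return u_prev if y_prev >= y_prev2 + delta else switch(u_prev)

# With delta = 0 this reduces to the gamma-policy (6).
```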
Response time: The effect of response time under pressure (e.g., a deadline) can be included as a degradation of the reward after a certain time (alternatively, one can view this as a problem concerning a changing environment). For example, a participant could be told that one has τ units of time to make a decision, after which the reward deteriorates, i.e., the feedback to the human is

y(tk) = { ϕu(tk)(x(tk)) if tk − tk−1 < τ; ϕu(tk)(x(tk)) e^(−λ(tk − tk−1)) if tk − tk−1 ≥ τ,   (10)
for some λ > 0. This scenario can be examined under the probabilistic γ-policy framework in which the error probability pe = 1 − p is a function of τ. One needs to find reasonable relationships between pe and τ such that pe is larger when τ is smaller, pe is smaller when τ is larger, and, plausibly, p → 1 as τ → ∞. Different types of relationships between p and τ could exist depending on human personality. This structure also captures the tradeoff between decision time and error rate (speed vs. accuracy) in human decision making.
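The deadline feedback (10) is a piecewise reward; a direct transcription (the reward function below is an illustrative assumption):

```python
import math

def decayed_reward(phi, x, dt, tau, lam):
    """Feedback (10): the reward phi(x) is delivered undamped if the decision
    took less than tau time units (dt = t_k - t_{k-1}), and is discounted by
    exp(-lam * dt) otherwise."""
    r = phi(x)
    return r if dt < tau else r * math.exp(-lam * dt)

# Illustrative reward function (assumed, not from the paper):
phi = lambda x: 1.0 - x
fast = decayed_reward(phi, 0.2, dt=0.5, tau=1.0, lam=0.1)  # within the deadline
slow = decayed_reward(phi, 0.2, dt=2.0, tau=1.0, lam=0.1)  # past the deadline
```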
Longer memory: Another aspect is to include more
memory in the γ -policy (for example, chess players can
remember a larger number of steps than average people).
Instead of using the last two rewards in the γ -policy, one
can have a policy with three or more past rewards.
VII. CONCLUSION
In this paper, we modeled human decision making
in sequential two-choice tasks using a decision making
policy known as the γ -policy, which is derived based
on human behavioral characteristics. We presented a
closed loop hybrid model for the sequential two-choice
task under the γ-policy. We provided an analysis of asymptotic behaviors of the closed loop and showed that under the deterministic γ-policy, the state x asymptotically converges to an attracting point of the reward structure or to the end points. For the probabilistic γ-policy, convergence to the same steady state as in the deterministic case is guaranteed with probability 1 if the error probability is less than a certain threshold. The predictions of our results agree with observations in
reported experiments with human subjects. Future work
is to further explore the framework presented here to
include other aspects of human characteristics in decision
making as outlined in Section VI and to validate the
theoretical results with human subject experiments.
REFERENCES
[1] R. Bogacz, E. Brown, J. Moehlis, P. Holmes, and J. D. Cohen.
The physics of optimal decision making: A formal analysis of
models of performance in two-alternative forced choice tasks.
Psychological Review, 113(4):700–765, 2006.
[2] R. Bogacz, S. M. McClure, J. Li, J. D. Cohen, and P. R.
Montague. Short-term memory traces for action bias in human
reinforcement learning. Brain Research, 1153:111–121, 2007.
[3] M. Cao, A. Stewart, and N. E. Leonard. Integrating human
and robot decision-making dynamics with feedback: Models
and convergence analysis. Preprint. Submitted to the 47th IEEE
Conf. Decision and Control, 2008.
[4] J. D. Cohen, S. M. McClure, and A. J. Yu. Should I stay or
should I go? How the human brain manages the tradeoff between exploitation and exploration. Philosophical Transactions
of the Royal Society B: Biological Sciences, 362:933–942, 2007.
[5] D.M. Egelman, C. Person, and P.R. Montague. A computational
role for dopamine delivery in human decision-making. J. Cogn.
Neurosci, 10:623–630, 1998.
[6] R. J. Herrnstein. Rational choice theory: necessary but not
sufficient. American Psychologist, 45:356–367, 1990.
[7] B. Hilburn, P.G. Jorna, E.A. Byrne, and R. Parasuraman. The
effect of adaptive air traffic control (atc) decision aiding on
controller mental workload. In M. Mouloua and J. Koonce,
editors, Human-automation interaction: Research and practice,
pages 84–91. Lawrence Erlbaum Associates, 1997.
[8] P. R. Montague and G. S. Berns. Neural economics and the
biological substrates of valuation. Neuron, 36:265–284, 2002.
[9] R. R. Murphy. Human-robot interaction in rescue robotics.
IEEE Trans. on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 34:138–153, 2004.
[10] National Research Council. Technology Development for Army Unmanned Ground Vehicles. National Academies Press, Washington, D.C., 2002.
[11] National Research Council. Autonomous Vehicles in Support of
Naval Operations. National Academies Press, 2005.
[12] R. Ratcliff. A theory of memory retrieval. Psychological Review, 83:59–108, 1978.
[13] L. Vu and K. A. Morgansen. Modeling and analysis of dynamic decision making in sequential two-choice decision tasks. Preprint. http://vger.aa.washington.edu/~linhvu/research/two choice tasks.pdf, 2008.