
Stochastic Dynamic Programming
Fatih Cavdur
[email protected]
Example
Example: For a price of $1/gallon, the Safeco Supermarket chain has purchased 6 gallons of milk from a local dairy. Each gallon of milk is sold in the chain's three stores for $2/gallon. The dairy must buy back, for 50¢/gallon, any milk that is left at the end of the day. Unfortunately for Safeco, demand at each of the chain's three stores is uncertain. Past data indicate that the daily demand at each store is as shown in the table below. Safeco wants to allocate the 6 gallons of milk to the three stores so as to maximize the expected net daily profit (revenues less costs) earned from milk. Use dynamic programming to determine how Safeco should allocate the 6 gallons of milk among the three stores.
[email protected]
Example
Table: Problem Data

           Daily Demand   Probability
Store 1         1             .6
                2             .0
                3             .4
Store 2         1             .5
                2             .1
                3             .4
Store 3         1             .4
                2             .3
                3             .3
[email protected]
Example
๐‘Ÿ๐‘ก ๐‘”๐‘ก = expected revenue earned from ๐‘”๐‘ก gallons assigned to store ๐‘ก
๐‘“๐‘ก ๐‘ฅ = maximum expected revenue earned from ๐‘ฅ gallons assigned to stores
๐‘ก, ๐‘ก + 1, โ€ฆ ,3
We have,
๐‘“3 ๐‘ฅ = ๐‘Ÿ3 (๐‘ฅ)
๐‘“๐‘ก ๐‘ฅ = max ๐‘Ÿ๐‘ก ๐‘”๐‘ก + ๐‘“๐‘ก+1 (๐‘ฅ โˆ’ ๐‘”๐‘ก ) , ๐‘ก = 1,2
๐‘”๐‘ก
๐‘Ÿ3 0 = $0 ๐‘Ÿ3 1 = $2.00 ๐‘Ÿ3 2 = $3.40 ๐‘Ÿ3 3 = $4.35
๐‘Ÿ2 0 = $0 ๐‘Ÿ2 1 = $2.00 ๐‘Ÿ2 2 = $3.25 ๐‘Ÿ2 3 = $4.35
๐‘Ÿ1 0 = $0 ๐‘Ÿ1 1 = $2.00 ๐‘Ÿ1 2 = $3.10 ๐‘Ÿ1 3 = $4.20
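Because the recursion is small, it is easy to check numerically. Below is a minimal Python sketch (not part of the original slides); it assumes r_t(g) = 2·E[min(D_t, g)] + 0.5·E[max(g - D_t, 0)], which reproduces the r-values above, and it treats the $6 purchase cost as sunk.

```python
# Minimal sketch of the milk-allocation DP (illustrative; assumes the r_t(g)
# formula stated above and ignores the sunk $6 purchase cost).
demand = {                       # store -> {demand: probability}
    1: {1: .6, 2: .0, 3: .4},
    2: {1: .5, 2: .1, 3: .4},
    3: {1: .4, 2: .3, 3: .3},
}

def r(store, g):
    """Expected revenue from assigning g gallons to a store ($2 sale, 50c buy-back)."""
    return sum(p * (2 * min(d, g) + 0.5 * max(g - d, 0))
               for d, p in demand[store].items())

GALLONS = 6
f = {3: {x: r(3, x) for x in range(GALLONS + 1)}}   # f_3(x) = r_3(x)
g_best = {3: {x: x for x in range(GALLONS + 1)}}
for t in (2, 1):                                    # backward recursion for t = 2, 1
    f[t], g_best[t] = {}, {}
    for x in range(GALLONS + 1):
        vals = {g: r(t, g) + f[t + 1][x - g] for g in range(x + 1)}
        g_best[t][x] = max(vals, key=vals.get)
        f[t][x] = vals[g_best[t][x]]

print(f[1][GALLONS], g_best[1][GALLONS])            # 9.75 and an optimal g_1(6)
```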
[email protected]
Example
Stage 3 Computations:
๐‘“3 0 = ๐‘Ÿ3 0 = 0.00 โ‡’ ๐‘”3 0 = 0
๐‘“3 1 = ๐‘Ÿ3 1 = 2.00 โ‡’ ๐‘”3 1 = 1
๐‘“3 2 = ๐‘Ÿ3 2 = 3.40 โ‡’ ๐‘”3 2 = 2
๐‘“3 3 = ๐‘Ÿ3 3 = 4.35 โ‡’ ๐‘”3 3 = 3
[email protected]
Example
๐‘“2 0 = ๐‘Ÿ2 0 + ๐‘“3 0 โˆ’ 0 = 0.00 โ‡’ ๐‘”2 0 = 0
๐‘“2 1 = max
๐‘Ÿ2 0 + ๐‘“3 1 โˆ’ 0 = 2.00
โ‡’ ๐‘”2 1 = 0 โˆจ 1
๐‘Ÿ2 1 + ๐‘“3 1 โˆ’ 1 = 2.00
๐‘Ÿ2 0 + ๐‘“3 2 โˆ’ 0 = 0.00 + 3.40 = 3.40
๐‘“2 2 = max ๐‘Ÿ2 1 + ๐‘“3 2 โˆ’ 1 = 2.00 + 2.00 = 4.00 โ‡’ ๐‘”2 2 = 1
๐‘Ÿ2 2 + ๐‘“3 2 โˆ’ 2 = 3.25 + 0.00 = 3.25
๐‘Ÿ2
๐‘Ÿ
๐‘“2 3 = max 2
๐‘Ÿ2
๐‘Ÿ2
0
1
2
3
+ ๐‘“3
+ ๐‘“3
+ ๐‘“3
+ ๐‘“3
3โˆ’0
3โˆ’1
3โˆ’2
3โˆ’3
= 0.00 + 4.35 = 4.35
= 2.00 + 3.40 = 5.40
โ‡’ ๐‘”2 3 = 1
= 3.25 + 2.00 = 5.25
= 4.35 + 0.00 = 4.35
[email protected]
Example
f_2(4) = max { r_2(1) + f_3(4 - 1) = 2.00 + 4.35 = 6.35
               r_2(2) + f_3(4 - 2) = 3.25 + 3.40 = 6.65
               r_2(3) + f_3(4 - 3) = 4.35 + 2.00 = 6.35 } = 6.65 ⇒ g_2(4) = 2

f_2(5) = max { r_2(2) + f_3(5 - 2) = 3.25 + 4.35 = 7.60
               r_2(3) + f_3(5 - 3) = 4.35 + 3.40 = 7.75 } = 7.75 ⇒ g_2(5) = 3

f_2(6) = r_2(3) + f_3(6 - 3) = 4.35 + 4.35 = 8.70 ⇒ g_2(6) = 3
[email protected]
Example
Stage 1 Computations:
f_1(6) = max { r_1(0) + f_2(6 - 0) = 0.0 + 8.70 = 8.70
               r_1(1) + f_2(6 - 1) = 2.0 + 7.75 = 9.75
               r_1(2) + f_2(6 - 2) = 3.1 + 6.65 = 9.75
               r_1(3) + f_2(6 - 3) = 4.2 + 5.40 = 9.60 } = 9.75 ⇒ g_1(6) = 1 ∨ 2

Tracing back, g_1(6) = 1 gives the allocation (1, 3, 2) to stores 1, 2, 3, and g_1(6) = 2 gives (2, 2, 2); either way the maximum expected revenue is $9.75, so the maximum expected net daily profit is 9.75 - 6 = $3.75.
[email protected]
A Stochastic Inventory Model
Consider the following three-period inventory problem. At the beginning of each period, a firm must determine how many units should be produced during the current period. During a period in which x units are produced, a production cost c(x) is incurred, where c(0) = 0 and, for x > 0, c(x) = 3 + 2x. Production during each period is limited to at most 4 units. After production occurs, the period's random demand is observed. Each period's demand is equally likely to be 1 or 2 units. After meeting the current period's demand out of current production and inventory, the firm's end-of-period inventory is evaluated, and a holding cost of $1 per unit is assessed. Because of limited capacity, the inventory at the end of each period cannot exceed 3 units. It is required that all demand be met on time. Any inventory on hand at the end of period 3 can be sold at $2 per unit. At the beginning of period 1, the firm has 1 unit of inventory. Use dynamic programming to determine a production policy that minimizes the expected net cost incurred during the three periods.
[email protected]
A Stochastic Inventory Model
We let f_t(i) denote the minimum expected net cost incurred during periods t, t+1, …, 3 when the inventory at the beginning of period t is i units, and x the number of units produced during period t. We have, for t = 3,

f_3(i) = min_x { c(x) + (1/2)(i + x - 1) + (1/2)(i + x - 2) - (1/2)[2(i + x - 1)] - (1/2)[2(i + x - 2)] }

and, for t = 2, 1,

f_t(i) = min_x { c(x) + (1/2)(i + x - 1) + (1/2)(i + x - 2) + (1/2) f_{t+1}(i + x - 1) + (1/2) f_{t+1}(i + x - 2) }

where x must satisfy 0 ≤ x ≤ 4, i + x ≥ 2 (so that demand is always met), and i + x - 1 ≤ 3 (so that ending inventory cannot exceed 3 units).
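The recursion can also be checked with a short script. The following is a minimal Python sketch (not from the slides) of the backward recursion under the assumptions just stated; it reproduces the stage tables below and ends with f_1(1) = 65/4 and x_1(1) = 3.

```python
# Minimal sketch of the three-period inventory DP (illustrative).
def c(x):                                   # production cost
    return 0 if x == 0 else 3 + 2 * x

def stage_cost(i, x, t, f_next):
    """Expected cost of producing x units in state i during period t."""
    total = c(x)
    for d in (1, 2):                        # demand is 1 or 2, each with prob 1/2
        end = i + x - d                     # end-of-period inventory
        total += 0.5 * end                  # $1/unit expected holding cost
        if t == 3:
            total -= 0.5 * 2 * end          # $2/unit expected salvage after period 3
        else:
            total += 0.5 * f_next[end]      # expected future cost
    return total

f, policy = {}, {}
for t in (3, 2, 1):                         # backward recursion
    f[t], policy[t] = {}, {}
    for i in range(4):                      # beginning inventory 0..3
        # feasible x: capacity 4, demand of 2 always met, ending inventory <= 3
        costs = {x: stage_cost(i, x, t, f.get(t + 1))
                 for x in range(5) if 2 <= i + x <= 4}
        policy[t][i] = min(costs, key=costs.get)
        f[t][i] = costs[policy[t][i]]

print(f[1][1], policy[1][1])                # 16.25 (= 65/4), produce 3 units
```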
[email protected]
A Stochastic Inventory Model
Stage 3 Computations:
๐‘“3 ๐‘– = min ๐‘ ๐‘ฅ +
๐‘ฅ
๐‘–
๐‘ฅ
3
3
2
2
2
1
1
1
0
0
0
0
1
0
1
2
1
2
3
2
3
4
๐‘ ๐‘ฅ
0
5
0
5
7
5
7
9
7
9
11
๐‘–+๐‘ฅโˆ’1
๐‘–+๐‘ฅโˆ’2
2 ๐‘–+๐‘ฅโˆ’1
2 ๐‘–+๐‘ฅโˆ’2
+
โˆ’
โˆ’
2
2
2
2
Expected
Holding Cost
3
๐‘–+๐‘ฅโˆ’
2
3/2
5/2
1/2
3/2
5/2
1/2
3/2
5/2
1/2
3/2
5/2
Expected
Salvage Value
2๐‘– + 2๐‘ฅ โˆ’ 3
Expected
Total
Cost
3
5
1
3
5
1
3
5
1
3
5
-3/2
5/2
-1/2
7/2
9/2
9/2
11/2
13/2
13/2
15/2
17/2
[email protected]
๐‘“3 ๐‘– ; ๐‘ฅ3 ๐‘–
๐‘“3 3 = โˆ’3/2; ๐‘ฅ3 0 = 0
๐‘“3 2 = โˆ’1/2; ๐‘ฅ3 0 = 0
๐‘“3 1 = 9/2; ๐‘ฅ3 0 = 1
๐‘“3 0 = 13/2; ๐‘ฅ3 0 = 2
Stage 2 Computations:
๐‘“2 ๐‘– = min ๐‘ ๐‘ฅ +
๐‘ฅ
๐‘–+๐‘ฅโˆ’1
๐‘–+๐‘ฅโˆ’2
๐‘“3 ๐‘– + ๐‘ฅ โˆ’ 1
๐‘“3 ๐‘– + ๐‘ฅ โˆ’ 2
+
+
+
2
2
2
2
[email protected]
A Stochastic Inventory Model
๐‘–
๐‘ฅ
๐‘ ๐‘ฅ
Expected
Holding
Cost
3
๐‘–+๐‘ฅโˆ’
2
Expected
Future Cost
๐‘“3 ๐‘– + ๐‘ฅ โˆ’ 1
๐‘“3 ๐‘– + ๐‘ฅ โˆ’ 2
+
2
2
Expected
Total
Cost
๐‘“2 ๐‘– ; ๐‘ฅ2 ๐‘–
7
2
๐‘ฅ2 3 = 0
3
0
0
3/2
2
7/2
3
1
5
5/2
-1
13/2
2
0
0
1/2
11/2
6
2
2
โ€ฆ
0
1
2
โ€ฆ
2
5
7
โ€ฆ
7
3/2
5/2
โ€ฆ
1/2
2
-1
โ€ฆ
11/2
17/2
17/2
โ€ฆ
13
0
3
9
3/2
2
25/2
0
4
11
5/2
-1
25/2
[email protected]
๐‘“2 3 =
๐‘“2 2 = 6
๐‘ฅ2 2 = 0
โ€ฆ
25
2
๐‘ฅ2 0 = 3
25
๐‘“2 0 =
2
๐‘ฅ2 0 = 4
๐‘“2 0 =
Stage 1 Computations:
๐‘“1 ๐‘– = min ๐‘ ๐‘ฅ +
๐‘ฅ
๐‘–
๐‘ฅ
1
1
1
2
1
3
๐‘ ๐‘ฅ
7
9
11
๐‘–+๐‘ฅโˆ’1
๐‘–+๐‘ฅโˆ’2
๐‘“2 ๐‘– + ๐‘ฅ โˆ’ 1
๐‘“2 ๐‘– + ๐‘ฅ โˆ’ 2
+
+
+
2
2
2
2
Expected
Holding Cost
3
๐‘–+๐‘ฅโˆ’
2
1/2
3/2
Expected
Future Cost
๐‘“2 ๐‘– + ๐‘ฅ โˆ’ 1
๐‘“2 ๐‘– + ๐‘ฅ โˆ’ 2
+
2
2
23/2
33/4
5/2
19/4
[email protected]
Expected
Total Cost
17
67/4
65/4
๐‘“1 ๐‘– ; ๐‘ฅ1 ๐‘–
65
4
๐‘ฅ1 1 = 3
๐‘“1 1 =
Gambler's Problem
A gambler has $2. She is allowed to play a game of chance four times, and her goal is to maximize her probability of ending up with at least $6. If the gambler bets b dollars on a play of the game, then with probability .40 she wins the game and increases her capital position by b dollars; with probability .60 she loses the game and decreases her capital by b dollars. On any play of the game, the gambler may not bet more money than she has available. Determine a betting strategy that will maximize the gambler's probability of attaining a wealth of at least $6 by the end of the fourth game. We assume that bets of zero dollars (that is, not betting) are permissible.
[email protected]
Gamblerโ€™s Problem
We let f_t(d) be the probability that the gambler will have at least $6 by the end of the fourth game, given that she has d dollars immediately before the game is played for the t-th time and given that she acts optimally.

If the gambler is playing the game for the 4th and final time, her optimal strategy is clear:
- If she has $6 or more, don't bet anything.
- If she has less than $6, bet enough money to ensure (if possible) that she will have $6 if she wins the last game.
[email protected]
Gamblerโ€™s Problem
We hence have the following:

f_4(0) = .0 ⇒ b_4(0) = $0
f_4(1) = .0 ⇒ b_4(1) = $0, $1
f_4(2) = .0 ⇒ b_4(2) = $0, $1, $2
f_4(3) = .4 ⇒ b_4(3) = $3
f_4(4) = .4 ⇒ b_4(4) = $2, $3, $4
f_4(5) = .4 ⇒ b_4(5) = $1, $2, $3, $4, $5

For d ≥ 6,

f_4(d) = 1 ⇒ b_4(d) = $0, $1, …, $(d - 6)

We can write,

f_t(d) = max_{b ∈ {0, …, d}} [ .4 f_{t+1}(d + b) + .6 f_{t+1}(d - b) ]
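The recursion is easy to run numerically. Below is a minimal Python sketch (not from the slides) of the backward recursion; it reproduces the stage computations that follow and ends with f_1(2) = .1984.

```python
# Minimal sketch of the gambler's-problem DP (illustrative).
P_WIN, TARGET, PLAYS, START = 0.4, 6, 4, 2
MAX_D = START * 2 ** PLAYS        # capital can at most double on each play

# f[t][d] = max probability of ending with >= TARGET, given d dollars before play t
f = {PLAYS + 1: [1.0 if d >= TARGET else 0.0 for d in range(MAX_D + 1)]}
bet = {}
for t in range(PLAYS, 0, -1):     # backward recursion over the four plays
    f[t], bet[t] = [0.0] * (MAX_D + 1), [0] * (MAX_D + 1)
    for d in range(MAX_D + 1):
        for b in range(0, min(d, MAX_D - d) + 1):   # cannot bet more than she has
            v = P_WIN * f[t + 1][d + b] + (1 - P_WIN) * f[t + 1][d - b]
            if v > f[t][d]:
                f[t][d], bet[t][d] = v, b

print(round(f[1][START], 4), bet[1][START])   # 0.1984, achieved by betting $1 (or $2)
```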
[email protected]
Gamblerโ€™s Problem
Stage 3 Computations:
๐‘“3 0 = 0 โ‡’ ๐‘3 0 = 0
๐‘“3 1 = max
๐‘
. 4๐‘“4 1 + .6๐‘“4 1 = 0
โ‡’ ๐‘3 1 = 0 โˆจ 1
. 4๐‘“4 2 + .6๐‘“4 0 = 0
. 4๐‘“4 2 + .6๐‘“4 2 = .00
๐‘“3 2 = max . 4๐‘“4 3 + .6๐‘“4 1 = .16 โ‡’ ๐‘3 2 = 1 โˆจ 2
๐‘
. 4๐‘“4 4 + .6๐‘“4 0 = .16
If we continue similarly,
. 4๐‘“4 5
. 4๐‘“4 6
. 4๐‘“4 7
๐‘“3 5 = max
๐‘
. 4๐‘“4 8
. 4๐‘“4 1
. 4๐‘“4 10
+ .6๐‘“4
+ .6๐‘“4
+ .6๐‘“4
+ .6๐‘“4
+ .6๐‘“4
+ .6๐‘“4
5
4
3
2
9
0
= .40
= .64
= .64
โ‡’ ๐‘3 5 = 1 โˆจ 2
= .40
= .40
= .40
[email protected]
Gamblerโ€™s Problem
Stage 2 Computations:
๐‘“2 0 = 0 โ‡’ ๐‘2 0 = 0
๐‘“2 1 = max
๐‘
. 4๐‘“3 1 + .6๐‘“3 1 = .000
โ‡’ ๐‘2 1 = 1
. 4๐‘“3 2 + .6๐‘“3 0 = .064
. 4๐‘“3 2 + .6๐‘“3 2 = .16
๐‘“2 2 = max . 4๐‘“3 3 + .6๐‘“3 1 = .16 โ‡’ ๐‘2 2 = 1 โˆจ 2 โˆจ 3
๐‘
. 4๐‘“3 4 + .6๐‘“3 0 = .16
If we continue similarly,
. 4๐‘“3 5
. 4๐‘“3 6
. 4๐‘“3 7
๐‘“2 5 = max
๐‘
. 4๐‘“3 8
. 4๐‘“3 1
. 4๐‘“3 10
+ .6๐‘“3
+ .6๐‘“3
+ .6๐‘“3
+ .6๐‘“3
+ .6๐‘“3
+ .6๐‘“3
5
4
3
2
9
0
= .640
= .640
= .640
โ‡’ ๐‘3 5 = 0 โˆจ 1 โˆจ 2
= .496
= .400
= .400
[email protected]
Gamblerโ€™s Problem
Stage 1 Computations:
f_1(2) = max { .4 f_2(2) + .6 f_2(2) = .1600   (b = 0)
               .4 f_2(3) + .6 f_2(1) = .1984   (b = 1)
               .4 f_2(4) + .6 f_2(0) = .1984   (b = 2) } = .1984 ⇒ b_1(2) = 1 ∨ 2

Hence, the gambler has a .1984 chance of reaching $6, attained by betting $1 or $2 on the first play.
[email protected]
Gamblerโ€™s Problem
[email protected]
Tennis Player
A tennis player has two types of serves: a hard (H) serve and a soft (S) serve. The probability that her hard serve lands in bounds is p_H, and the probability that her soft serve lands in bounds is p_S. If her hard serve lands in bounds, she wins the point with probability w_H; if her soft serve lands in bounds, she wins the point with probability w_S. We assume that p_H < p_S and w_H > w_S. Her goal is to maximize the probability of winning the point. Use DP to help her!
[email protected]
Tennis Player
We let f_t be the probability that she wins the point if she is about to take her t-th serve, t = 1, 2. Then

f_2 = max_x { p_H w_H , p_S w_S }

f_1 = max_x { p_H w_H + (1 - p_H) f_2 , p_S w_S + (1 - p_S) f_2 }
[email protected]
Tennis Player
If we assume p_S w_S > p_H w_H, then f_2 = p_S w_S and x_2 = S, and she should serve hard on her first serve if

p_H w_H + (1 - p_H) f_2 ≥ p_S w_S + (1 - p_S) f_2
⇒ p_H w_H + (1 - p_H) p_S w_S ≥ p_S w_S + (1 - p_S) p_S w_S
⇒ p_H w_H ≥ p_S w_S (1 + p_H - p_S)

We now assume p_S w_S ≤ p_H w_H. We then have

f_2 = max_x { p_H w_H , p_S w_S } = p_H w_H ⇒ x_2 = H

She should serve hard on the first serve if

p_H w_H + (1 - p_H) f_2 ≥ p_S w_S + (1 - p_S) f_2
⇒ p_H w_H + (1 - p_H) p_H w_H ≥ p_S w_S + (1 - p_S) p_H w_H
⇒ p_S w_S ≤ p_H w_H (1 + p_S - p_H)

Since p_S w_S ≤ p_H w_H and p_S > p_H, this condition always holds, so in this case she should serve hard on both serves.
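As a quick numeric illustration (the serve probabilities below are hypothetical values chosen for illustration, not data from the slides), the two-stage recursion can be evaluated directly:

```python
# Numeric check of the serve analysis with hypothetical probabilities.
pH, wH = 0.55, 0.80   # hard serve: in-bounds probability, win probability if in
pS, wS = 0.90, 0.50   # soft serve (pH < pS and wH > wS, as assumed)

f2, x2 = max((pH * wH, 'H'), (pS * wS, 'S'))             # best second serve
f1, x1 = max((pH * wH + (1 - pH) * f2, 'H'),
             (pS * wS + (1 - pS) * f2, 'S'))             # best first serve
print(x1, x2, round(f1, 4))   # hard first, soft second, win probability 0.6425
```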
[email protected]
Markov Decision Processes (MDP)
An MDP is described by the following information:
- State Space
- Decision Set
- Transition Probabilities
- Expected Rewards
[email protected]
Markov Decision Processes (MDP)
At the beginning of each week, a machine is in one of four conditions (states):
excellent (E), good (G), average (A), or bad (B). The weekly revenue earned by
a machine in each type of condition is as follows: excellent, $100; good, $80;
average, $50; bad, $10. After observing the condition of a machine at the
beginning of the week, we have the option of instantaneously replacing it with
an excellent machine, which costs $200. The quality of a machine deteriorates
over time, as given in the following matrix. For this situation, determine the
state space, decision sets, transition probabilities, and expected rewards.
[email protected]
Markov Decision Processes (MDP)
          E    G    A    B
     E   .7   .3   .0   .0
P =  G   .0   .7   .3   .0
     A   .0   .0   .6   .4
     B   .0   .0   .0  1.0
State Set: S = {E, G, A, B}
Decision Set: {R, K}, replace (R) or do not replace (keep, K)

We have

D(E) = {K} and D(G) = D(A) = D(B) = {R, K}

We are given the following transition probabilities:

p(E|E, K) = .7   p(G|E, K) = .3   …   p(B|E, K) = 0.0
      ⋮                ⋮           ⋱         ⋮
p(E|B, K) = .0   p(G|B, K) = .0   …   p(B|B, K) = 1.0
If we replace a machine with an excellent machine, the transition probabilities will be the same as if we had begun the week with an excellent machine:

p(E|G, R) = p(E|A, R) = p(E|B, R) = .7
p(G|G, R) = p(G|A, R) = p(G|B, R) = .3
p(A|G, R) = p(A|A, R) = p(A|B, R) = .0
p(B|G, R) = p(B|A, R) = p(B|B, R) = .0
[email protected]
Markov Decision Processes (MDP)
If the machine is not replaced, then, during the week, we receive the revenues given in the problem:

r_{E,K} = $100, r_{G,K} = $80, r_{A,K} = $50, r_{B,K} = $10

If we replace the machine, we pay $200 and earn the excellent-machine revenue of $100 during the week, so

r_{E,R} = r_{G,R} = r_{A,R} = r_{B,R} = -$100
[email protected]
Markov Decision Processes (MDP)
Definition:
A policy is a rule that specifies how each period's decision is chosen.

Definition:
A policy δ is a stationary policy if, whenever the process is in state i, it chooses (independently of the period) the same decision δ(i).
[email protected]
Markov Decision Processes (MDP)
δ: an arbitrary policy
Δ: the set of all policies
X_t: random variable for the state of the MDP at the beginning of period t
X_1: given state of the process at the beginning of period 1 (the initial state)
d_t: decision chosen during period t
V_δ(i): expected discounted reward earned during an infinite number of periods, given that the state at the beginning of period 1 is i and the stationary policy δ is followed
[email protected]
Markov Decision Processes (MDP)
We can then write

V_δ(i) = E_δ [ Σ_{t=1}^{∞} β^{t-1} r_{X_t, d_t} | X_1 = i ]

In a max problem,

V(i) = max_{δ ∈ Δ} V_δ(i)

In a min problem,

V(i) = min_{δ ∈ Δ} V_δ(i)
[email protected]
Markov Decision Processes (MDP)
Definition:
If a policy δ* has the property that, for all i ∈ S,

V(i) = V_{δ*}(i)

then δ* is an optimal policy.
[email protected]
Markov Decision Processes (MDP)
We can use the following approaches to find the optimal stationary policy:
- Policy Iteration
- Linear Programming
- Value Iteration (Successive Approximations)
[email protected]
Policy Iteration
V_δ(i) can be found by solving the following system of linear equations:

V_δ(i) = r_{i,δ(i)} + β Σ_{j=1}^{N} p(j|i, δ(i)) V_δ(j),   i = 1, …, N

Consider the following stationary policy in the machine replacement example (with discount factor β = .9):

δ(E) = δ(G) = K;  δ(A) = δ(B) = R

We then have

V_δ(E) = 100 + .9 [.7 V_δ(E) + .3 V_δ(G)]
V_δ(G) = 80 + .9 [.7 V_δ(G) + .3 V_δ(A)]
V_δ(A) = -100 + .9 [.7 V_δ(E) + .3 V_δ(G)]
V_δ(B) = -100 + .9 [.7 V_δ(E) + .3 V_δ(G)]

By solving these,

V_δ(E) = 687.81, V_δ(G) = 572.19, V_δ(A) = 487.81, and V_δ(B) = 487.81
[email protected]
Howard's Policy Iteration
Step 1) Policy Evaluation: Choose a stationary policy δ and use the equations above to find V_δ(i), i = 1, …, N.

Step 2) Policy Improvement: For all states i = 1, …, N, compute

T_δ(i) = max_{d ∈ D(i)} [ r_{i,d} + β Σ_{j=1}^{N} p(j|i, d) V_δ(j) ]

If T_δ(i) = V_δ(i) for all i, then δ is an optimal policy.
If T_δ(i) > V_δ(i) for some i, then δ is not an optimal policy; modify δ to obtain δ′ for which V_{δ′}(i) ≥ V_δ(i) for all i, and return to Step 1 with policy δ′.
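A minimal Python sketch of the two-step procedure for the machine replacement example (not from the slides; β = .9 assumed) is shown below. Started from δ = (K, K, R, R), it reproduces the iterations worked out next.

```python
# Minimal sketch of Howard's policy iteration for the machine replacement MDP.
import numpy as np

beta = 0.9
states = ['E', 'G', 'A', 'B']
keep_row = {'E': [.7, .3, .0, .0], 'G': [.0, .7, .3, .0],
            'A': [.0, .0, .6, .4], 'B': [.0, .0, .0, 1.]}
replace_row = np.array([.7, .3, .0, .0])
revenue = {'E': 100., 'G': 80., 'A': 50., 'B': 10.}

def p_row(i, d):                     # transition probabilities p(.|i, d)
    return np.array(keep_row[i]) if d == 'K' else replace_row

def reward(i, d):                    # one-period reward r_{i,d}
    return revenue[i] if d == 'K' else -100.

def actions(i):
    return ['K'] if i == 'E' else ['K', 'R']

delta = {'E': 'K', 'G': 'K', 'A': 'R', 'B': 'R'}     # initial stationary policy
while True:
    # Step 1: policy evaluation, solve (I - beta * P_delta) V = r_delta
    P = np.array([p_row(i, delta[i]) for i in states])
    r = np.array([reward(i, delta[i]) for i in states])
    V = np.linalg.solve(np.eye(4) - beta * P, r)
    # Step 2: policy improvement
    improved = {i: max(actions(i), key=lambda d: reward(i, d) + beta * p_row(i, d) @ V)
                for i in states}
    if improved == delta:            # T_delta(i) = V_delta(i) for all i: optimal
        break
    delta = improved

print(delta)                         # {'E': 'K', 'G': 'K', 'A': 'K', 'B': 'R'}
print(np.round(V, 2))                # [690.23 575.5  492.35 490.23]
```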
[email protected]
Howardโ€™s Policy Iteration
Machine Replacement Example:
Consider the following stationary policy:

δ(E) = δ(G) = K and δ(A) = δ(B) = R

We have found that

V_δ(E) = 687.81, V_δ(G) = 572.19, V_δ(A) = 487.81, and V_δ(B) = 487.81
[email protected]
Howardโ€™s Policy Iteration
Now,

T_δ(E) = V_δ(E) = 687.81

T_δ(G) = max { 80 + .9 [.7 V_δ(G) + .3 V_δ(A)] = V_δ(G) = 572.19
               -100 + .9 [.7 V_δ(E) + .3 V_δ(G)] = 487.81 } = 572.19 ⇒ δ(G) = K

T_δ(A) = max { 50 + .9 [.6 V_δ(A) + .4 V_δ(B)] = 489.03
               -100 + .9 [.7 V_δ(E) + .3 V_δ(G)] = 487.81 } = 489.03 ⇒ δ(A) = K

T_δ(B) = max { 10 + .9 V_δ(B) = 449.03
               -100 + .9 [.7 V_δ(E) + .3 V_δ(G)] = V_δ(B) = 487.81 } = 487.81 ⇒ δ(B) = R
[email protected]
Howardโ€™s Policy Iteration
Since T_δ(i) > V_δ(i) for i = A, the policy δ is not optimal. So replace it with δ′ given as

δ′(E) = δ′(G) = δ′(A) = K and δ′(B) = R

We now return to Step 1 and compute V_{δ′}(E) = 690.23, V_{δ′}(G) = 575.50, V_{δ′}(A) = 492.35, and V_{δ′}(B) = 490.23 by solving the following system. Note that V_{δ′}(i) > V_δ(i) for all i.

V_{δ′}(E) = 100 + .9 [.7 V_{δ′}(E) + .3 V_{δ′}(G)]
V_{δ′}(G) = 80 + .9 [.7 V_{δ′}(G) + .3 V_{δ′}(A)]
V_{δ′}(A) = 50 + .9 [.6 V_{δ′}(A) + .4 V_{δ′}(B)]
V_{δ′}(B) = -100 + .9 [.7 V_{δ′}(E) + .3 V_{δ′}(G)]
[email protected]
Howardโ€™s Policy Iteration
Now apply the policy iteration procedure again:

T_{δ′}(E) = V_{δ′}(E) = 690.23

T_{δ′}(G) = max { 80 + .9 [.7 V_{δ′}(G) + .3 V_{δ′}(A)] = V_{δ′}(G) = 575.50
                  -100 + .9 [.7 V_{δ′}(E) + .3 V_{δ′}(G)] = 490.23 } = 575.50 ⇒ δ′(G) = K

T_{δ′}(A) = max { 50 + .9 [.6 V_{δ′}(A) + .4 V_{δ′}(B)] = 492.35
                  -100 + .9 [.7 V_{δ′}(E) + .3 V_{δ′}(G)] = 490.23 } = 492.35 ⇒ δ′(A) = K

T_{δ′}(B) = max { 10 + .9 V_{δ′}(B) = 451.21
                  -100 + .9 [.7 V_{δ′}(E) + .3 V_{δ′}(G)] = V_{δ′}(B) = 490.23 } = 490.23 ⇒ δ′(B) = R

Since T_{δ′}(i) = V_{δ′}(i) for all i, the policy δ′ is an optimal stationary policy.
[email protected]
Linear Programming
It can be shown that an optimal stationary policy for a maximization problem can be found by solving the following LP:

min Σ_{j=1}^{N} V_j
s.t. V_i - β Σ_{j=1}^{N} p(j|i, d) V_j ≥ r_{i,d},   ∀i, ∀d ∈ D(i)

An optimal stationary policy for a minimization problem can be found by solving the following LP:

max Σ_{j=1}^{N} V_j
s.t. V_i - β Σ_{j=1}^{N} p(j|i, d) V_j ≤ r_{i,d},   ∀i, ∀d ∈ D(i)
[email protected]
Linear Programming
min V_E + V_G + V_A + V_B
s.t.
V_E ≥ 100 + .9 (.7 V_E + .3 V_G)    (K in E)
V_G ≥ 80 + .9 (.7 V_G + .3 V_A)     (K in G)
V_G ≥ -100 + .9 (.7 V_E + .3 V_G)   (R in G)
V_A ≥ 50 + .9 (.6 V_A + .4 V_B)     (K in A)
V_A ≥ -100 + .9 (.7 V_E + .3 V_G)   (R in A)
V_B ≥ 10 + .9 V_B                   (K in B)
V_B ≥ -100 + .9 (.7 V_E + .3 V_G)   (R in B)

By solving the LP, we obtain V_E = 690.23, V_G = 575.50, V_A = 492.35 and V_B = 490.23.

Note that the 1st, 2nd, 4th, and 7th constraints are binding, which corresponds to the optimal policy of keeping the machine in states E, G, and A and replacing it in state B.
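For reference, a minimal SciPy sketch (not from the slides) that sets up and solves this LP is shown below; each "≥" constraint is negated into the "≤" form that scipy.optimize.linprog expects.

```python
# Minimal sketch: solve the discounted-reward LP for the machine MDP with linprog.
import numpy as np
from scipy.optimize import linprog

beta = 0.9
states = ['E', 'G', 'A', 'B']
keep_row = {'E': [.7, .3, 0., 0.], 'G': [0., .7, .3, 0.],
            'A': [0., 0., .6, .4], 'B': [0., 0., 0., 1.]}
replace_row = [.7, .3, 0., 0.]
revenue = {'E': 100., 'G': 80., 'A': 50., 'B': 10.}

A_ub, b_ub = [], []
for k, i in enumerate(states):
    rows = [(keep_row[i], revenue[i])]            # decision K
    if i != 'E':
        rows.append((replace_row, -100.))         # decision R
    for p, r in rows:
        e = np.zeros(4)
        e[k] = 1.0
        # V_i - beta * sum_j p(j|i,d) V_j >= r_{i,d}, multiplied by -1 for linprog
        A_ub.append(-(e - beta * np.array(p)))
        b_ub.append(-r)

res = linprog(c=np.ones(4), A_ub=np.array(A_ub), b_ub=b_ub,
              bounds=[(None, None)] * 4)
print(dict(zip(states, np.round(res.x, 2))))
# V_E = 690.23, V_G = 575.50, V_A = 492.35, V_B = 490.23
```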
[email protected]
Value Iteration
๐‘
๐‘‰๐‘ก ๐‘– = max
๐‘‘โˆˆ๐ท ๐‘–
๐‘Ÿ๐‘–,๐‘‘ + ๐›ฝ
๐‘ ๐‘—|๐‘–, ๐‘‘ ๐‘‰๐‘กโˆ’1 ๐‘—
, ๐‘กโ‰ฅ1
๐‘— =1
๐‘‰0 ๐‘– = 0
Let ๐‘‘๐‘ก ๐‘– be the decision that must be chosen during period 1 in state ๐‘– to
attain ๐‘‰๐‘ก ๐‘– . For an MDP with finite state space and each ๐ท ๐‘– containing a
finite number of elements, we have,
๐‘‰๐‘ก ๐‘– โˆ’ ๐‘‰ ๐‘–
๐›ฝ๐‘ก
โ‰ค
max ๐‘Ÿ ,
1 โˆ’ ๐›ฝ ๐‘–,๐‘‘ ๐‘–๐‘‘
For an optimal stationary policy ๐›ฟ โˆ— ๐‘– , we have
lim ๐‘‘๐‘ก ๐‘– = ๐›ฟ โˆ— ๐‘–
๐‘กโ†’โˆž
[email protected]
๐‘– = 1, โ€ฆ , ๐‘
Stage 1:
๐‘‰1 ๐ธ = 100
(๐พ)
๐‘‰1 ๐บ = max
80
โˆ’100
(๐พ)
= 80
(๐‘…)
๐‘‰1 ๐ด = max
50
โˆ’100
(๐พ)
= 50
(๐‘…)
๐‘‰1 ๐ต = max
10
โˆ’100
(๐พ)
= 10
(๐‘…)
[email protected]
Value Iteration
Stage 2:
๐‘‰2 ๐ธ = 100 + .9 . 7๐‘‰1 ๐ธ + .3๐‘‰1 ๐บ
= 184.6
(๐พ)
๐‘‰2 ๐บ = max
80 + .9 . 7๐‘‰1 ๐บ + .3๐‘‰1 ๐ด = 143.9
(๐พ)
= 143.9
โˆ’100 + .9 . 7๐‘‰1 ๐ธ + .3๐‘‰1 ๐บ = โˆ’15.4 (๐‘…)
๐‘‰2 ๐ด = max
50 + .9 . 6๐‘‰1 ๐ด + .4๐‘‰1 ๐ต = 80.6
(๐พ)
= 80.6
โˆ’100 + .9 . 7๐‘‰1 ๐ธ + .3๐‘‰1 ๐บ = โˆ’15.4 (๐‘…)
๐‘‰2 ๐ต = max
10 + .9๐‘‰1 ๐ต = 19
โˆ’100 + .9 . 7๐‘‰1 ๐ธ + .3๐‘‰1 ๐บ = โˆ’15.4
[email protected]
(๐พ)
= 19
(๐‘…)
Observe that after two iterations of successive approximations, we have not yet come close to the actual values of V(i) and have not found it optimal to replace even a bad machine. In general, if we want to ensure that all the V_t(i) are within ε of the corresponding V(i), we would perform t* iterations of successive approximations, where t* satisfies

(β^{t*} / (1 - β)) max_{i,d} |r_{i,d}| < ε
[email protected]
Maximizing Average Reward
๐‘
max
๐œ‹๐‘–๐‘‘ ๐‘Ÿ๐‘–๐‘‘
๐‘–=1 ๐‘‘โˆˆ๐ท ๐‘–
๐‘
๐œ‹๐‘–๐‘‘ = 1
๐‘–=1 ๐‘‘โˆˆ๐ท ๐‘–
๐‘
๐œ‹๐‘—๐‘‘ =
๐‘‘โˆˆ๐ท ๐‘—
๐œ‹๐‘–๐‘‘ ๐‘ ๐‘—|๐‘–, ๐‘‘ ,
๐‘‘โˆˆ๐ท ๐‘– ๐‘–=1
๐œ‹๐‘–๐‘‘ โ‰ฅ 0,
โˆ€๐‘–, ๐‘‘
[email protected]
โˆ€๐‘—
Machine Replacement
max z = 100 π_{E,K} + 80 π_{G,K} + 50 π_{A,K} + 10 π_{B,K} - 100 (π_{G,R} + π_{A,R} + π_{B,R})
s.t.
π_{E,K} + π_{G,K} + π_{A,K} + π_{B,K} + π_{G,R} + π_{A,R} + π_{B,R} = 1
π_{E,K} = .7 (π_{E,K} + π_{G,R} + π_{A,R} + π_{B,R})
π_{G,K} + π_{G,R} = .3 (π_{E,K} + π_{G,R} + π_{A,R} + π_{B,R}) + .7 π_{G,K}
π_{A,K} + π_{A,R} = .3 π_{G,K} + .6 π_{A,K}
π_{B,K} + π_{B,R} = .4 π_{A,K} + π_{B,K}

By solving, we obtain z = 60 and π_{E,K} = .35, π_{G,K} = .50, π_{A,R} = .15 (all other variables equal 0).
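A minimal SciPy sketch (not from the slides) that solves this LP is shown below; the variables are ordered (E,K), (G,K), (A,K), (B,K), (G,R), (A,R), (B,R).

```python
# Minimal sketch: average-reward LP for the machine MDP via linprog.
import numpy as np
from scipy.optimize import linprog

# variable order: (E,K), (G,K), (A,K), (B,K), (G,R), (A,R), (B,R)
r = np.array([100., 80., 50., 10., -100., -100., -100.])   # objective coefficients

# p(next state | column's state-decision pair), one row per next state E, G, A, B
next_prob = np.array([
    [.7, .0, .0, .0, .7, .7, .7],   # -> E
    [.3, .7, .0, .0, .3, .3, .3],   # -> G
    [.0, .3, .6, .0, .0, .0, .0],   # -> A
    [.0, .0, .4, 1., .0, .0, .0],   # -> B
])
# which columns belong to each state (left-hand side of the balance equations)
occupancy = np.array([
    [1, 0, 0, 0, 0, 0, 0],          # state E
    [0, 1, 0, 0, 1, 0, 0],          # state G
    [0, 0, 1, 0, 0, 1, 0],          # state A
    [0, 0, 0, 1, 0, 0, 1],          # state B
])

A_eq = np.vstack([np.ones((1, 7)), occupancy - next_prob])  # normalization + balance
b_eq = np.array([1., 0., 0., 0., 0.])

res = linprog(c=-r, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 7)   # maximize r @ pi
print(round(-res.fun, 2), np.round(res.x, 2))
# 60.0 and pi_EK = .35, pi_GK = .50, pi_AR = .15 (others 0)
```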
[email protected]
The End
[email protected]