Stochastic Dynamic Programming
Fatih Cavdur

Example
For a price of $1/gallon, the Safeco Supermarket chain has purchased 6 gallons of milk from a local dairy. Each gallon of milk is sold in the chain's three stores for $2/gallon. The dairy must buy back for 50¢/gallon any milk that is left at the end of the day. Unfortunately for Safeco, demand at each of the chain's three stores is uncertain. Past data indicate that the daily demand at each store is as shown in the table below. Safeco wants to allocate the 6 gallons of milk to the three stores so as to maximize the expected net daily profit (revenues less costs) earned from milk. Use dynamic programming to determine how Safeco should allocate the 6 gallons of milk among the three stores.

Table: Problem Data
Store   Daily Demand (gallons)   Probability
  1              1                   .60
  1              2                   .00
  1              3                   .40
  2              1                   .50
  2              2                   .10
  2              3                   .40
  3              1                   .40
  3              2                   .30
  3              3                   .30

Formulation
$r_t(g_t)$ = expected revenue earned from $g_t$ gallons assigned to store $t$
$f_t(x)$ = maximum expected revenue earned from $x$ gallons assigned to stores $t, t+1, \dots, 3$

We have
$$f_3(x) = r_3(x)$$
$$f_t(x) = \max_{g_t} \left\{ r_t(g_t) + f_{t+1}(x - g_t) \right\}, \quad t = 1, 2$$

The expected revenues are
$r_3(0) = \$0$, $r_3(1) = \$2.00$, $r_3(2) = \$3.40$, $r_3(3) = \$4.35$
$r_2(0) = \$0$, $r_2(1) = \$2.00$, $r_2(2) = \$3.25$, $r_2(3) = \$4.35$
$r_1(0) = \$0$, $r_1(1) = \$2.00$, $r_1(2) = \$3.10$, $r_1(3) = \$4.20$

Stage 3 Computations
$f_3(0) = r_3(0) = 0.00 \Rightarrow g_3(0) = 0$
$f_3(1) = r_3(1) = 2.00 \Rightarrow g_3(1) = 1$
$f_3(2) = r_3(2) = 3.40 \Rightarrow g_3(2) = 2$
$f_3(3) = r_3(3) = 4.35 \Rightarrow g_3(3) = 3$

Stage 2 Computations
$f_2(0) = r_2(0) + f_3(0-0) = 0.00 \Rightarrow g_2(0) = 0$

$f_2(1) = \max \begin{cases} r_2(0) + f_3(1-0) = 2.00 \\ r_2(1) + f_3(1-1) = 2.00 \end{cases} = 2.00 \Rightarrow g_2(1) = 0 \vee 1$

$f_2(2) = \max \begin{cases} r_2(0) + f_3(2-0) = 0.00 + 3.40 = 3.40 \\ r_2(1) + f_3(2-1) = 2.00 + 2.00 = 4.00 \\ r_2(2) + f_3(2-2) = 3.25 + 0.00 = 3.25 \end{cases} = 4.00 \Rightarrow g_2(2) = 1$

$f_2(3) = \max \begin{cases} r_2(0) + f_3(3-0) = 0.00 + 4.35 = 4.35 \\ r_2(1) + f_3(3-1) = 2.00 + 3.40 = 5.40 \\ r_2(2) + f_3(3-2) = 3.25 + 2.00 = 5.25 \\ r_2(3) + f_3(3-3) = 4.35 + 0.00 = 4.35 \end{cases} = 5.40 \Rightarrow g_2(3) = 1$

$f_2(4) = \max \begin{cases} r_2(1) + f_3(4-1) = 2.00 + 4.35 = 6.35 \\ r_2(2) + f_3(4-2) = 3.25 + 3.40 = 6.65 \\ r_2(3) + f_3(4-3) = 4.35 + 2.00 = 6.35 \end{cases} = 6.65 \Rightarrow g_2(4) = 2$

$f_2(5) = \max \begin{cases} r_2(2) + f_3(5-2) = 3.25 + 4.35 = 7.60 \\ r_2(3) + f_3(5-3) = 4.35 + 3.40 = 7.75 \end{cases} = 7.75 \Rightarrow g_2(5) = 3$

$f_2(6) = r_2(3) + f_3(6-3) = 4.35 + 4.35 = 8.70 \Rightarrow g_2(6) = 3$

Stage 1 Computations
$f_1(6) = \max \begin{cases} r_1(0) + f_2(6-0) = 0.00 + 8.70 = 8.70 \\ r_1(1) + f_2(6-1) = 2.00 + 7.75 = 9.75 \\ r_1(2) + f_2(6-2) = 3.10 + 6.65 = 9.75 \\ r_1(3) + f_2(6-3) = 4.20 + 5.40 = 9.60 \end{cases} = 9.75 \Rightarrow g_1(6) = 1 \vee 2$

Tracing the decisions back, Safeco should allocate either 1 gallon to store 1, 3 gallons to store 2 and 2 gallons to store 3, or 2 gallons to each store; either allocation attains the maximum expected revenue of $9.75, that is, an expected net daily profit of $9.75 - $6.00 = $3.75 after the purchase cost.
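The recursion is small enough to check numerically. The Python sketch below is not from the slides; the names (`expected_revenue`, `f`) and the convention that store 3 receives whatever milk is left are my own assumptions. It rebuilds $r_t(g)$ from the demand table and runs the backward recursion, reproducing $f_1(6) = 9.75$.

```python
# Backward-recursion sketch for the Safeco milk-allocation example.
# Assumptions: sell price $2/gal, buy-back $0.50/gal, demand table as above.
from functools import lru_cache

demand_probs = {  # store -> {demand: probability}
    1: {1: .6, 2: .0, 3: .4},
    2: {1: .5, 2: .1, 3: .4},
    3: {1: .4, 2: .3, 3: .3},
}

def expected_revenue(store, g):
    """r_t(g): expected revenue from assigning g gallons to a store."""
    return sum(p * (2 * min(d, g) + 0.5 * max(g - d, 0))
               for d, p in demand_probs[store].items())

@lru_cache(maxsize=None)
def f(t, x):
    """f_t(x): max expected revenue from x gallons split among stores t..3."""
    if t == 3:
        return expected_revenue(3, x)   # store 3 receives whatever is left
    return max(expected_revenue(t, g) + f(t + 1, x - g) for g in range(x + 1))

print(f(1, 6))  # ≈ 9.75
```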
A Stochastic Inventory Model
Consider the following three-period inventory problem. At the beginning of each period, a firm must determine how many units should be produced during the current period. During a period in which $x$ units are produced, a production cost $c(x)$ is incurred, where $c(0) = 0$ and, for $x > 0$, $c(x) = 3 + 2x$. Production during each period is limited to at most 4 units. After production occurs, the period's random demand is observed. Each period's demand is equally likely to be 1 or 2 units. After meeting the current period's demand out of current production and inventory, the firm's end-of-period inventory is evaluated, and a holding cost of $1 per unit is assessed. Because of limited capacity, the inventory at the end of each period cannot exceed 3 units. It is required that all demand be met on time. Any inventory on hand at the end of period 3 can be sold at $2 per unit. At the beginning of period 1, the firm has 1 unit of inventory. Use dynamic programming to determine a production policy that minimizes the expected net cost incurred during the three periods.

Let $f_t(i)$ denote the minimum expected net cost incurred during periods $t, \dots, 3$ when the inventory at the beginning of period $t$ is $i$ units. We have, for $t = 3$,
$$f_3(i) = \min_x \left\{ c(x) + \frac{(i+x-1) + (i+x-2)}{2} - \frac{2(i+x-1) + 2(i+x-2)}{2} \right\}$$
and, for $t = 2, 1$,
$$f_t(i) = \min_x \left\{ c(x) + \frac{(i+x-1) + (i+x-2)}{2} + \frac{f_{t+1}(i+x-1) + f_{t+1}(i+x-2)}{2} \right\}$$
where $x$ must satisfy $0 \le x \le 4$, $i + x \ge 2$ (demand must be met even if it equals 2) and $i + x \le 4$ (end-of-period inventory cannot exceed 3).

Stage 3 Computations
$$f_3(i) = \min_x \left\{ c(x) + \left(i + x - \tfrac{3}{2}\right) - (2i + 2x - 3) \right\}$$

 i   x   c(x)   Expected Holding Cost   Expected Salvage Value   Expected Total Cost
                i + x - 3/2             2i + 2x - 3
 3   0    0          3/2                       3                      -3/2
 3   1    5          5/2                       5                       5/2
 2   0    0          1/2                       1                      -1/2
 2   1    5          3/2                       3                       7/2
 2   2    7          5/2                       5                       9/2
 1   1    5          1/2                       1                       9/2
 1   2    7          3/2                       3                      11/2
 1   3    9          5/2                       5                      13/2
 0   2    7          1/2                       1                      13/2
 0   3    9          3/2                       3                      15/2
 0   4   11          5/2                       5                      17/2

$f_3(3) = -3/2$, $x_3(3) = 0$
$f_3(2) = -1/2$, $x_3(2) = 0$
$f_3(1) = 9/2$, $x_3(1) = 1$
$f_3(0) = 13/2$, $x_3(0) = 2$

Stage 2 Computations
$$f_2(i) = \min_x \left\{ c(x) + \left(i + x - \tfrac{3}{2}\right) + \frac{f_3(i+x-1) + f_3(i+x-2)}{2} \right\}$$

 i   x   c(x)   Expected Holding Cost   Expected Future Cost              Expected Total Cost
                i + x - 3/2             [f_3(i+x-1) + f_3(i+x-2)]/2
 3   0    0          3/2                       2                               7/2
 3   1    5          5/2                      -1                              13/2
 2   0    0          1/2                      11/2                              6
 2   1    5          3/2                       2                              17/2
 2   2    7          5/2                      -1                              17/2
 ...
 0   2    7          1/2                      11/2                             13
 0   3    9          3/2                       2                              25/2
 0   4   11          5/2                      -1                              25/2

$f_2(3) = 7/2$, $x_2(3) = 0$
$f_2(2) = 6$, $x_2(2) = 0$
$f_2(0) = 25/2$, $x_2(0) = 3 \vee 4$
(The omitted rows give $f_2(1) = 21/2$ with $x_2(1) = 2 \vee 3$, which is needed in stage 1 below.)

Stage 1 Computations
$$f_1(i) = \min_x \left\{ c(x) + \left(i + x - \tfrac{3}{2}\right) + \frac{f_2(i+x-1) + f_2(i+x-2)}{2} \right\}$$

 i   x   c(x)   Expected Holding Cost   Expected Future Cost              Expected Total Cost
                i + x - 3/2             [f_2(i+x-1) + f_2(i+x-2)]/2
 1   1    5          1/2                      23/2                             17
 1   2    7          3/2                      33/4                            67/4
 1   3    9          5/2                      19/4                            65/4

$f_1(1) = 65/4$, $x_1(1) = 3$

Thus, with 1 unit of initial inventory, the firm should produce 3 units in period 1; the minimum expected net cost over the three periods is $65/4 = \$16.25$.
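As a check on the tables above, here is a minimal backward-recursion sketch (my own code, not from the slides), with the feasibility condition $2 \le i + x \le 4$ taken from the problem statement. It reproduces $f_1(1) = 16.25$.

```python
# Backward-recursion sketch for the three-period stochastic inventory model.
# Assumptions: c(0)=0, c(x)=3+2x, demand 1 or 2 with prob 1/2 each, holding
# cost $1/unit, end-of-period inventory <= 3, salvage $2/unit after period 3.
from functools import lru_cache

def c(x):
    return 0 if x == 0 else 3 + 2 * x

@lru_cache(maxsize=None)
def f(t, i):
    """f_t(i): minimum expected cost for periods t..3 with i units on hand."""
    best = float("inf")
    for x in range(0, 5):                      # produce at most 4 units
        if i + x < 2 or i + x > 4:             # meet demand; end inventory <= 3
            continue
        holding = i + x - 1.5                  # E[end-of-period inventory]
        if t == 3:
            salvage = 2 * (i + x - 1.5)        # E[2 * end-of-period inventory]
            total = c(x) + holding - salvage
        else:
            future = 0.5 * f(t + 1, i + x - 1) + 0.5 * f(t + 1, i + x - 2)
            total = c(x) + holding + future
        best = min(best, total)
    return best

print(f(1, 1))  # 16.25
```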
Gambler's Problem
A gambler has $2. She is allowed to play a game of chance four times, and her goal is to maximize her probability of ending up with at least $6. If the gambler bets $b$ dollars on a play of the game, then with probability .40 she wins the game and increases her capital by $b$ dollars; with probability .60 she loses the game and decreases her capital by $b$ dollars. On any play of the game, the gambler may not bet more money than she has available. Determine a betting strategy that will maximize the gambler's probability of attaining a wealth of at least $6 by the end of the fourth game. We assume that bets of zero dollars (that is, not betting) are permissible.

We let $f_t(d)$ be the probability that the gambler will have at least $6 by the end of the fourth game, given that she has $d$ dollars immediately before the game is played for the $t$th time and given that she acts optimally. If the gambler is playing the game for the 4th and final time, her optimal strategy is clear:
- If she has $6 or more, don't bet anything.
- If she has less than $6, bet enough money to ensure (if possible) that she will have $6 if she wins the last game.

We hence have the following:
$f_4(0) = .0 \Rightarrow b_4(0) = \$0$
$f_4(1) = .0 \Rightarrow b_4(1) = \$0, \$1$
$f_4(2) = .0 \Rightarrow b_4(2) = \$0, \$1, \$2$
$f_4(3) = .4 \Rightarrow b_4(3) = \$3$
$f_4(4) = .4 \Rightarrow b_4(4) = \$2, \$3, \$4$
$f_4(5) = .4 \Rightarrow b_4(5) = \$1, \$2, \$3, \$4, \$5$
For $d \ge 6$, $f_4(d) = 1 \Rightarrow b_4(d) = \$0, \$1, \dots, \$(d - 6)$

We can write
$$f_t(d) = \max_{b \in \{0, \dots, d\}} \left\{ .4 f_{t+1}(d + b) + .6 f_{t+1}(d - b) \right\}$$

Stage 3 Computations
$f_3(0) = 0 \Rightarrow b_3(0) = 0$

$f_3(1) = \max \begin{cases} .4 f_4(1) + .6 f_4(1) = 0 \\ .4 f_4(2) + .6 f_4(0) = 0 \end{cases} = 0 \Rightarrow b_3(1) = 0 \vee 1$

$f_3(2) = \max \begin{cases} .4 f_4(2) + .6 f_4(2) = .00 \\ .4 f_4(3) + .6 f_4(1) = .16 \\ .4 f_4(4) + .6 f_4(0) = .16 \end{cases} = .16 \Rightarrow b_3(2) = 1 \vee 2$

If we continue similarly,
$f_3(5) = \max \begin{cases} .4 f_4(5) + .6 f_4(5) = .40 \\ .4 f_4(6) + .6 f_4(4) = .64 \\ .4 f_4(7) + .6 f_4(3) = .64 \\ .4 f_4(8) + .6 f_4(2) = .40 \\ .4 f_4(9) + .6 f_4(1) = .40 \\ .4 f_4(10) + .6 f_4(0) = .40 \end{cases} = .64 \Rightarrow b_3(5) = 1 \vee 2$

Stage 2 Computations
$f_2(0) = 0 \Rightarrow b_2(0) = 0$

$f_2(1) = \max \begin{cases} .4 f_3(1) + .6 f_3(1) = .000 \\ .4 f_3(2) + .6 f_3(0) = .064 \end{cases} = .064 \Rightarrow b_2(1) = 1$

$f_2(2) = \max \begin{cases} .4 f_3(2) + .6 f_3(2) = .16 \\ .4 f_3(3) + .6 f_3(1) = .16 \\ .4 f_3(4) + .6 f_3(0) = .16 \end{cases} = .16 \Rightarrow b_2(2) = 0 \vee 1 \vee 2$

If we continue similarly,
$f_2(5) = \max \begin{cases} .4 f_3(5) + .6 f_3(5) = .640 \\ .4 f_3(6) + .6 f_3(4) = .640 \\ .4 f_3(7) + .6 f_3(3) = .640 \\ .4 f_3(8) + .6 f_3(2) = .496 \\ .4 f_3(9) + .6 f_3(1) = .400 \\ .4 f_3(10) + .6 f_3(0) = .400 \end{cases} = .640 \Rightarrow b_2(5) = 0 \vee 1 \vee 2$

Stage 1 Computations
$f_1(2) = \max \begin{cases} .4 f_2(2) + .6 f_2(2) = .1600 \\ .4 f_2(3) + .6 f_2(1) = .1984 \\ .4 f_2(4) + .6 f_2(0) = .1984 \end{cases} = .1984 \Rightarrow b_1(2) = 1 \vee 2$

Hence, the gambler has a probability of .1984 of reaching $6.
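A short value-recursion sketch (mine, not from the slides) reproduces the .1984 figure; the boundary treatment, where a capital of $6 or more after the fourth play counts as success, follows the stage-4 values above.

```python
# Value-recursion sketch for the gambler's problem (4 plays, win prob .4).
# Goal: probability of finishing with at least $6, starting from $2.
from functools import lru_cache

P_WIN, TARGET, PLAYS = 0.4, 6, 4

@lru_cache(maxsize=None)
def f(t, d):
    """f_t(d): best probability of reaching TARGET by the end of play PLAYS."""
    if t > PLAYS:
        return 1.0 if d >= TARGET else 0.0
    return max(P_WIN * f(t + 1, d + b) + (1 - P_WIN) * f(t + 1, d - b)
               for b in range(d + 1))

print(f(1, 2))  # ≈ 0.1984
```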
Tennis Player
A tennis player has two types of serves, a hard (H) and a soft (S) one. The probability that her hard serve will land in bounds is $p_H$, and the probability that her soft serve will land in bounds is $p_S$. If her hard serve lands in bounds, there is a probability $w_H$ that she will win the point. If her soft serve lands in bounds, there is a probability $w_S$ that she will win the point. We assume that $p_H < p_S$ and $w_H > w_S$. Her goal is to maximize the probability of winning the point. Use DP to help her!

We let $f_t$ be the probability that she wins the point if she is about to take her $t$th serve, $t = 1, 2$. Since a fault on the second serve loses the point,
$$f_2 = \max \begin{cases} p_H w_H & (H) \\ p_S w_S & (S) \end{cases} \qquad\qquad f_1 = \max \begin{cases} p_H w_H + (1 - p_H) f_2 & (H) \\ p_S w_S + (1 - p_S) f_2 & (S) \end{cases}$$

If we assume $p_S w_S > p_H w_H$, then $f_2 = p_S w_S$ and she should serve soft on the second serve ($x_2 = S$). She should serve hard on the first serve if and only if
$$p_H w_H + (1 - p_H) f_2 \ge p_S w_S + (1 - p_S) f_2
\iff p_H w_H + (1 - p_H) p_S w_S \ge p_S w_S + (1 - p_S) p_S w_S
\iff p_H w_H \ge p_S w_S (1 + p_H - p_S)$$

We now assume $p_S w_S \le p_H w_H$, in which case she should serve hard on both serves. Indeed, $f_2 = p_H w_H$ with $x_2 = H$, and she should serve hard on the first serve if and only if
$$p_H w_H + (1 - p_H) f_2 \ge p_S w_S + (1 - p_S) f_2
\iff p_H w_H + (1 - p_H) p_H w_H \ge p_S w_S + (1 - p_S) p_H w_H
\iff p_S w_S \le p_H w_H (1 + p_S - p_H)$$
which always holds here, since $p_S w_S \le p_H w_H$ and $p_S > p_H$.

Markov Decision Processes (MDP)
An MDP is described by the following information:
- State Space
- Decision Set
- Transition Probabilities
- Expected Rewards

At the beginning of each week, a machine is in one of four conditions (states): excellent (E), good (G), average (A), or bad (B). The weekly revenue earned by a machine in each condition is as follows: excellent, $100; good, $80; average, $50; bad, $10. After observing the condition of a machine at the beginning of the week, we have the option of instantaneously replacing it with an excellent machine, which costs $200. The quality of a machine deteriorates over time, as given in the following matrix. For this situation, determine the state space, decision sets, transition probabilities, and expected rewards.

         E    G    A    B
    E   .7   .3   .0   .0
P = G   .0   .7   .3   .0
    A   .0   .0   .6   .4
    B   .0   .0   .0  1.0

State Set: $S = \{E, G, A, B\}$
Decision Set: $D = \{R, K\}$, replace and do not replace (keep). We have $D(E) = \{K\}$ and $D(G) = D(A) = D(B) = \{R, K\}$.

We are given the following transition probabilities for the keep decision:
$p(E|E,K) = .7$, $p(G|E,K) = .3$, ..., $p(B|E,K) = 0.0$
...
$p(E|B,K) = .0$, $p(G|B,K) = .0$, ..., $p(B|B,K) = 1.0$

If we replace a machine with an excellent machine, the transition probabilities are the same as if we had begun the week with an excellent machine:
$p(E|G,R) = p(E|A,R) = p(E|B,R) = .7$
$p(G|G,R) = p(G|A,R) = p(G|B,R) = .3$
$p(A|G,R) = p(A|A,R) = p(A|B,R) = .0$
$p(B|G,R) = p(B|A,R) = p(B|B,R) = .0$

If the machine is not replaced, then during the week we receive the revenues given in the problem:
$r_{E,K} = \$100$, $r_{G,K} = \$80$, $r_{A,K} = \$50$, $r_{B,K} = \$10$
If we replace the machine,
$r_{E,R} = r_{G,R} = r_{A,R} = r_{B,R} = -\$100$

Definition: A policy is a rule that specifies how each period's decision is chosen.
Definition: A policy $\delta$ is a stationary policy if, whenever the process is in state $i$, the policy $\delta$ chooses (independently of the period) the same decision $\delta(i)$.

Notation:
$\delta$: an arbitrary policy
$\Delta$: the set of all policies
$X_t$: random variable for the state of the MDP at the beginning of period $t$
$X_1 = i$: the given state of the process at the beginning of period 1 (initial state)
$d_t$: decision chosen during period $t$
$V_\delta(i)$: expected discounted reward earned during an infinite number of periods, given that at the beginning of period 1 the state is $i$ and the stationary policy $\delta$ is followed

We can then write
$$V_\delta(i) = E_\delta \left[ \sum_{t=1}^{\infty} \beta^{t-1} r_{X_t, d_t} \,\middle|\, X_1 = i \right]$$
In a max problem, $V(i) = \max_{\delta \in \Delta} V_\delta(i)$; in a min problem, $V(i) = \min_{\delta \in \Delta} V_\delta(i)$.

Definition: If a policy $\delta^*$ has the property that, for all $i \in S$,
$$V(i) = V_{\delta^*}(i)$$
then $\delta^*$ is an optimal policy.

We can use the following approaches to find an optimal stationary policy:
- Policy Iteration
- Linear Programming
- Value Iteration or Successive Approximations

Policy Iteration
$V_\delta(i)$ can be found by solving the following system of linear equations:
$$V_\delta(i) = r_{i, \delta(i)} + \beta \sum_{j=1}^{N} p(j \mid i, \delta(i)) \, V_\delta(j), \quad i = 1, \dots, N$$

Consider the following stationary policy in the machine replacement example (with $\beta = .9$):
$$\delta(E) = \delta(G) = K; \quad \delta(A) = \delta(B) = R$$
We then have
$V_\delta(E) = 100 + .9(.7 V_\delta(E) + .3 V_\delta(G))$
$V_\delta(G) = 80 + .9(.7 V_\delta(G) + .3 V_\delta(A))$
$V_\delta(A) = -100 + .9(.7 V_\delta(E) + .3 V_\delta(G))$
$V_\delta(B) = -100 + .9(.7 V_\delta(E) + .3 V_\delta(G))$
By solving these, $V_\delta(E) = 687.81$, $V_\delta(G) = 572.19$, $V_\delta(A) = 487.81$, and $V_\delta(B) = 487.81$.

Howard's Policy Iteration
Step 1) Policy Evaluation: Choose a stationary policy $\delta$ and use the equations above to find $V_\delta(i)$, $i = 1, \dots, N$.
Step 2) Policy Improvement: For all states $i = 1, \dots, N$, compute
$$T_\delta(i) = \max_{d \in D(i)} \left\{ r_{i,d} + \beta \sum_{j=1}^{N} p(j \mid i, d) \, V_\delta(j) \right\}$$
If $T_\delta(i) = V_\delta(i)$ for all $i$, then $\delta$ is an optimal policy. If $T_\delta(i) > V_\delta(i)$ for at least one $i$, then $\delta$ is not an optimal policy; modify $\delta$ to obtain $\delta'$ for which $V_{\delta'}(i) \ge V_\delta(i)$ for all $i$, and return to Step 1 with policy $\delta'$.
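The value-determination equations form a small linear system, so the evaluation step and one improvement step are easy to verify numerically. The sketch below is my own (the array layout and names such as `evaluate` are not from the slides); it reproduces the values 687.81, 572.19, 487.81, 487.81 for the policy above and flags state A as the state where the policy can be improved.

```python
# Policy-evaluation sketch for the machine-replacement MDP (beta = 0.9).
import numpy as np

beta = 0.9
states = ["E", "G", "A", "B"]
revenue = {"E": 100, "G": 80, "A": 50, "B": 10}    # reward for "keep"
P_keep = np.array([[.7, .3, .0, .0],
                   [.0, .7, .3, .0],
                   [.0, .0, .6, .4],
                   [.0, .0, .0, 1.]])
p_replace = np.array([.7, .3, .0, .0])             # row used by any "replace"

def evaluate(policy):
    """Solve V = r_delta + beta * P_delta V for a stationary policy."""
    P = np.array([P_keep[i] if policy[s] == "K" else p_replace
                  for i, s in enumerate(states)])
    r = np.array([revenue[s] if policy[s] == "K" else -100. for s in states])
    return np.linalg.solve(np.eye(4) - beta * P, r)

delta = {"E": "K", "G": "K", "A": "R", "B": "R"}
V = evaluate(delta)
print(dict(zip(states, np.round(V, 2).tolist())))
# ≈ {'E': 687.81, 'G': 572.19, 'A': 487.81, 'B': 487.81}

# One policy-improvement step: T_delta(i) = max_d { r_id + beta * sum_j p(j|i,d) V(j) }
for i, s in enumerate(states):
    keep = revenue[s] + beta * P_keep[i] @ V
    rep = -100 + beta * p_replace @ V
    best = "K" if (s == "E" or keep >= rep) else "R"     # E can only be kept
    print(s, round(keep if s == "E" else max(keep, rep), 2), best)
```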
Howard's Policy Iteration
Machine Replacement Example: Consider the stationary policy $\delta(E) = \delta(G) = K$ and $\delta(A) = \delta(B) = R$. We have found that $V_\delta(E) = 687.81$, $V_\delta(G) = 572.19$, $V_\delta(A) = 487.81$, and $V_\delta(B) = 487.81$.

Now,
$T_\delta(E) = V_\delta(E) = 687.81$

$T_\delta(G) = \max \begin{cases} 80 + .9(.7 V_\delta(G) + .3 V_\delta(A)) = 572.19 & (K) \\ -100 + .9(.7 V_\delta(E) + .3 V_\delta(G)) = 487.81 & (R) \end{cases} = 572.19 \Rightarrow \delta(G) = K$

$T_\delta(A) = \max \begin{cases} 50 + .9(.6 V_\delta(A) + .4 V_\delta(B)) = 489.03 & (K) \\ -100 + .9(.7 V_\delta(E) + .3 V_\delta(G)) = 487.81 & (R) \end{cases} = 489.03 \Rightarrow \delta(A) = K$

$T_\delta(B) = \max \begin{cases} 10 + .9 V_\delta(B) = 449.03 & (K) \\ -100 + .9(.7 V_\delta(E) + .3 V_\delta(G)) = 487.81 & (R) \end{cases} = 487.81 \Rightarrow \delta(B) = R$

Since $T_\delta(i) > V_\delta(i)$ for $i = A$, the policy $\delta$ is not optimal. So replace it with $\delta'$ given as
$$\delta'(E) = \delta'(G) = \delta'(A) = K \quad \text{and} \quad \delta'(B) = R$$
We now return to Step 1 and compute $V_{\delta'}(E) = 690.23$, $V_{\delta'}(G) = 575.50$, $V_{\delta'}(A) = 492.35$, and $V_{\delta'}(B) = 490.23$ by solving the following system. Note that $V_{\delta'}(i) \ge V_\delta(i)$ for all $i$.
$V_{\delta'}(E) = 100 + .9(.7 V_{\delta'}(E) + .3 V_{\delta'}(G))$
$V_{\delta'}(G) = 80 + .9(.7 V_{\delta'}(G) + .3 V_{\delta'}(A))$
$V_{\delta'}(A) = 50 + .9(.6 V_{\delta'}(A) + .4 V_{\delta'}(B))$
$V_{\delta'}(B) = -100 + .9(.7 V_{\delta'}(E) + .3 V_{\delta'}(G))$

Now apply the policy-improvement step again:
$T_{\delta'}(E) = V_{\delta'}(E) = 690.23$

$T_{\delta'}(G) = \max \begin{cases} 80 + .9(.7 V_{\delta'}(G) + .3 V_{\delta'}(A)) = 575.50 & (K) \\ -100 + .9(.7 V_{\delta'}(E) + .3 V_{\delta'}(G)) = 490.23 & (R) \end{cases} = 575.50 \Rightarrow \delta'(G) = K$

$T_{\delta'}(A) = \max \begin{cases} 50 + .9(.6 V_{\delta'}(A) + .4 V_{\delta'}(B)) = 492.35 & (K) \\ -100 + .9(.7 V_{\delta'}(E) + .3 V_{\delta'}(G)) = 490.23 & (R) \end{cases} = 492.35 \Rightarrow \delta'(A) = K$

$T_{\delta'}(B) = \max \begin{cases} 10 + .9 V_{\delta'}(B) = 451.21 & (K) \\ -100 + .9(.7 V_{\delta'}(E) + .3 V_{\delta'}(G)) = 490.23 & (R) \end{cases} = 490.23 \Rightarrow \delta'(B) = R$

Since $T_{\delta'}(i) = V_{\delta'}(i)$ for all $i$, the policy $\delta'$ is an optimal stationary policy.

Linear Programming
It can be shown that an optimal stationary policy for a maximization problem can be found by solving the following LP:
$$\min \sum_{j=1}^{N} V_j$$
$$\text{s.t.} \quad V_i - \beta \sum_{j=1}^{N} p(j \mid i, d) \, V_j \ge r_{i,d}, \quad \forall i, \ \forall d \in D(i)$$
An optimal stationary policy for a minimization problem can be found by solving the following LP:
$$\max \sum_{j=1}^{N} V_j$$
$$\text{s.t.} \quad V_i - \beta \sum_{j=1}^{N} p(j \mid i, d) \, V_j \le r_{i,d}, \quad \forall i, \ \forall d \in D(i)$$

For the machine replacement example:
min  $V_E + V_G + V_A + V_B$
s.t. $V_E \ge 100 + .9(.7 V_E + .3 V_G)$    (K in E)
     $V_G \ge 80 + .9(.7 V_G + .3 V_A)$     (K in G)
     $V_G \ge -100 + .9(.7 V_E + .3 V_G)$   (R in G)
     $V_A \ge 50 + .9(.6 V_A + .4 V_B)$     (K in A)
     $V_A \ge -100 + .9(.7 V_E + .3 V_G)$   (R in A)
     $V_B \ge 10 + .9 V_B$                  (K in B)
     $V_B \ge -100 + .9(.7 V_E + .3 V_G)$   (R in B)
By solving the LP, we obtain $V_E = 690.23$, $V_G = 575.50$, $V_A = 492.35$ and $V_B = 490.23$. Note that the 1st, 2nd, 4th and 7th constraints are binding, which corresponds to keeping in states E, G and A and replacing in state B.

Value Iteration
$$V_t(i) = \max_{d \in D(i)} \left\{ r_{i,d} + \beta \sum_{j=1}^{N} p(j \mid i, d) \, V_{t-1}(j) \right\}, \quad t \ge 1, \qquad V_0(i) = 0$$
Let $d_t(i)$ be the decision that must be chosen during period 1 in state $i$ to attain $V_t(i)$. For an MDP with a finite state space and each $D(i)$ containing a finite number of elements, we have
$$\left| V_t(i) - V(i) \right| \le \frac{\beta^t}{1 - \beta} \max_{i,d} |r_{i,d}|, \quad i = 1, \dots, N$$
For an optimal stationary policy $\delta^*(i)$, we have $\lim_{t \to \infty} d_t(i) = \delta^*(i)$.

Stage 1:
$V_1(E) = 100$ $(K)$
$V_1(G) = \max\{\, 80 \ (K), \ -100 \ (R) \,\} = 80$ $(K)$
$V_1(A) = \max\{\, 50 \ (K), \ -100 \ (R) \,\} = 50$ $(K)$
$V_1(B) = \max\{\, 10 \ (K), \ -100 \ (R) \,\} = 10$ $(K)$

Stage 2:
$V_2(E) = 100 + .9(.7 V_1(E) + .3 V_1(G)) = 184.6$ $(K)$

$V_2(G) = \max \begin{cases} 80 + .9(.7 V_1(G) + .3 V_1(A)) = 143.9 & (K) \\ -100 + .9(.7 V_1(E) + .3 V_1(G)) = -15.4 & (R) \end{cases} = 143.9$

$V_2(A) = \max \begin{cases} 50 + .9(.6 V_1(A) + .4 V_1(B)) = 80.6 & (K) \\ -100 + .9(.7 V_1(E) + .3 V_1(G)) = -15.4 & (R) \end{cases} = 80.6$

$V_2(B) = \max \begin{cases} 10 + .9 V_1(B) = 19 & (K) \\ -100 + .9(.7 V_1(E) + .3 V_1(G)) = -15.4 & (R) \end{cases} = 19$

Observe that after two iterations of successive approximations, we have not yet come close to the actual values of $V(i)$ and have not found it optimal to replace even a bad machine. In general, if we want to ensure that all the $V_t(i)$ are within $\epsilon$ of the corresponding $V(i)$, we would perform $t^*$ iterations of successive approximations, where
$$\frac{\beta^{t^*}}{1 - \beta} \max_{i,d} |r_{i,d}| < \epsilon$$
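A minimal successive-approximations sketch (my own code, not from the slides) reproduces the stage 1 and stage 2 values above and, run long enough, approaches the optimal values found by policy iteration and the LP.

```python
# Value-iteration sketch for the machine-replacement MDP (beta = 0.9).
import numpy as np

beta = 0.9
states = ["E", "G", "A", "B"]
revenue = np.array([100., 80., 50., 10.])          # reward for "keep"
P_keep = np.array([[.7, .3, .0, .0],
                   [.0, .7, .3, .0],
                   [.0, .0, .6, .4],
                   [.0, .0, .0, 1.]])
p_replace = np.array([.7, .3, .0, .0])             # row used by any "replace"

V = np.zeros(4)                                    # V_0(i) = 0
for t in range(1, 201):                            # enough iterations for the error bound to be tiny
    keep = revenue + beta * P_keep @ V
    rep = -100 + beta * p_replace @ V
    rep_allowed = np.array([-np.inf, rep, rep, rep])   # replacement not allowed in state E
    V_new = np.maximum(keep, rep_allowed)
    decisions = np.where(keep >= rep_allowed, "K", "R")
    if t in (1, 2):
        print(t, V_new.round(1), decisions)        # reproduces the stage 1 and 2 values
    V = V_new

print(V.round(2), decisions)  # ≈ (690.23, 575.50, 492.35, 490.23) with policy K, K, K, R
```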
Maximizing Average Reward
An optimal policy under the average-reward criterion can be found from the following LP, where $\pi_{id}$ is the steady-state probability that the process is in state $i$ and decision $d$ is chosen:
$$\max \sum_{i=1}^{N} \sum_{d \in D(i)} r_{i,d} \, \pi_{id}$$
$$\text{s.t.} \quad \sum_{i=1}^{N} \sum_{d \in D(i)} \pi_{id} = 1$$
$$\sum_{d \in D(j)} \pi_{jd} = \sum_{i=1}^{N} \sum_{d \in D(i)} \pi_{id} \, p(j \mid i, d), \quad \forall j$$
$$\pi_{id} \ge 0, \quad \forall i, d$$

Machine Replacement
max  $100\pi_{EK} + 80\pi_{GK} + 50\pi_{AK} + 10\pi_{BK} - 100(\pi_{GR} + \pi_{AR} + \pi_{BR})$
s.t. $\pi_{EK} + \pi_{GK} + \pi_{AK} + \pi_{BK} + \pi_{GR} + \pi_{AR} + \pi_{BR} = 1$
     $\pi_{EK} = .7(\pi_{EK} + \pi_{GR} + \pi_{AR} + \pi_{BR})$
     $\pi_{GK} + \pi_{GR} = .3(\pi_{EK} + \pi_{GR} + \pi_{AR} + \pi_{BR}) + .7\pi_{GK}$
     $\pi_{AK} + \pi_{AR} = .3\pi_{GK} + .6\pi_{AK}$
     $\pi_{BK} + \pi_{BR} = \pi_{BK} + .4\pi_{AK}$
     all $\pi_{id} \ge 0$
By solving, we obtain $z = 60$ and $\pi_{EK} = .35$, $\pi_{GK} = .50$, $\pi_{AR} = .15$, with all other variables equal to 0. The optimal stationary policy therefore keeps the machine in states E and G and replaces it in state A (state B is never visited).
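Assuming SciPy's `linprog` is available, this LP can be checked numerically as below. The variable ordering and the constraint-matrix encoding are my own, not from the slides; the sketch should reproduce $z = 60$ with $\pi_{EK} = .35$, $\pi_{GK} = .50$, $\pi_{AR} = .15$.

```python
# LP sketch for the average-reward machine-replacement model, solved with SciPy.
# Variable order (my convention): [pi_EK, pi_GK, pi_AK, pi_BK, pi_GR, pi_AR, pi_BR]
import numpy as np
from scipy.optimize import linprog

#              EK    GK   AK   BK    GR    AR    BR
c = -np.array([100,  80,  50,  10, -100, -100, -100])   # linprog minimizes, so negate

A_eq = np.array([
    [1,    1,   1,  1,   1,    1,    1],    # probabilities sum to 1
    [.3,   0,   0,  0, -.7,  -.7,  -.7],    # pi_EK = .7(pi_EK + pi_GR + pi_AR + pi_BR)
    [-.3, .3,   0,  0,  .7,  -.3,  -.3],    # balance equation for state G
    [0,  -.3,  .4,  0,   0,    1,    0],    # balance equation for state A
    [0,    0, -.4,  0,   0,    0,    1],    # balance equation for state B
])
b_eq = np.array([1, 0, 0, 0, 0])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 7, method="highs")
print(-res.fun)          # ≈ 60.0
print(res.x.round(2))    # ≈ [0.35, 0.50, 0.00, 0.00, 0.00, 0.15, 0.00]
```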