
ARTIFICIAL INTELLIGENCE
CH 17
MAKING COMPLEX DECISIONS
GROUP (9)

Team Members :
Ahmed Helal Eid
 Mina Victor William


Supervised by :
 Dr. Nevin M. Darwish
AGENDA
Introduction
 Sequential Decision Problems
 Optimality in sequential decision problems
 Value Iteration
 The value iteration algorithm
 Policy Iteration

INTRODUCTION
PREVIOUSLY IN CH 16

MAKING SIMPLE DECISIONS

Concerned with episodic decision problems, in which the utility of each action's outcome was well known.

Episodic environment: the agent's experience is divided into atomic episodes, each consisting of the agent perceiving and then performing a single action.
INTRODUCTION
IN THIS CHAPTER

The computational issues involved in making decisions in a stochastic environment.

Sequential decision problems, in which the agent's utility depends on a sequence of decisions.

Sequential decision problems, which include utilities, uncertainty, and sensing, generalize the search and planning problems as special cases.

SEQUENTIAL DECISION PROBLEMS

What if the environment was deterministic? Unfortunately, the environment does not go along with this situation.

[Figure: the 4×3 grid world, rows 1–3 and columns 1–4, with the start state in the lower-left corner and the +1 (row 3) and -1 (row 2) terminal states in the rightmost column.]

Actions A(s) in every state are (Up, Down, Left, Right).
SEQUENTIAL DECISION PROBLEMS

Model for stochastic motion: the intended move succeeds with probability 0.8; with probability 0.1 each, the agent slips perpendicular to the intended direction.

[Figure: the 4×3 grid world annotated with the 0.8 / 0.1 / 0.1 motion model.]

[ Up, Up, Right, Right, Right ]: 0.8^5 = 0.32768

[ Right, Right, Up, Up, Right ]: 0.1^4 × 0.8 = 0.00008
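A minimal Python sketch (not from the slides) of where these numbers come from, assuming the 0.8 / 0.1 / 0.1 motion model shown above; step_outcome_prob is a hypothetical helper name:

from math import prod

def step_outcome_prob(intended, actual):
    """Probability that action `intended` produces movement in direction `actual`."""
    if actual == intended:
        return 0.8  # the intended move succeeds
    perpendicular = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
                     "Left": ("Up", "Down"), "Right": ("Up", "Down")}
    return 0.1 if actual in perpendicular[intended] else 0.0

# [Up, Up, Right, Right, Right] executed exactly as intended:
plan = ["Up", "Up", "Right", "Right", "Right"]
print(prod(step_outcome_prob(a, a) for a in plan))   # 0.8**5 = 0.32768

# The same plan reaching +1 the long way round: four perpendicular slips, then one intended move.
print(0.1**4 * 0.8)                                  # 0.00008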
SEQUENTIAL DECISION PROBLEMS

Transition model T(S, a, S'): the probability of reaching state S' if action a is done in state S.

Markovian transitions: the probability of reaching S' from S depends only on S, not on the history of earlier states.

Utility function: depends on a sequence of states (the environment history).

Reward R(s): the agent receives a reward in each state (positive or negative).

[Figure: the 4×3 grid world with R(s) = -0.04 in every nonterminal state.]

Utility of reaching the goal in 10 steps = (-0.04 × 10) + 1 = 0.6.
MARKOV DECISION PROCESS (MDP)

We use MDPs to solve sequential decision problems.

We eventually want to find the best choice of action for each state.

An MDP consists of (sketched in code below):
 a set of actions A(s) available in each state s
 a transition model P(s' | s, a)
  describing the probability of reaching s' using action a in s
  transitions are Markovian: they depend only on s, not on previous states
 a reward function R(s)
  the reward an agent receives for arriving in state s
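A minimal sketch of these components in Python, assuming the 4×3 world from the earlier figures (start in the lower-left corner, exits in the rightmost column, and the usual blocked square at (2,2), which is an assumption here). The names STATES, ACTIONS, R, A, T and move are illustrative, not from the chapter:

# The 4x3 grid world as an MDP: states, actions A(s), transition model T(s, a, s'),
# and reward function R(s). Coordinates are (column, row) as in the grid figure.
WALL = (2, 2)
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]
ACTIONS = ["Up", "Down", "Left", "Right"]
MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}

def R(s):
    """Reward for being in state s: +/-1 at the exits, -0.04 elsewhere."""
    return TERMINALS.get(s, -0.04)

def A(s):
    """Actions available in s (none in terminal states)."""
    return [] if s in TERMINALS else ACTIONS

def move(s, direction):
    """Deterministic motion; bumping into the wall or the edge leaves s unchanged."""
    c, r = s
    dc, dr = MOVES[direction]
    s2 = (c + dc, r + dr)
    return s2 if s2 in STATES else s

def T(s, a, s2):
    """Transition model: 0.8 for the intended move, 0.1 for each perpendicular slip."""
    perpendicular = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
                     "Left": ("Up", "Down"), "Right": ("Up", "Down")}
    p = 0.8 if move(s, a) == s2 else 0.0
    for slip in perpendicular[a]:
        if move(s, slip) == s2:
            p += 0.1
    return p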

SEQUENTIAL DECISION PROBLEMS

What does a solution to the problem look like?

Policy (π): a solution must specify what the agent should do for any state that the agent might reach.

π(s): the action recommended by the policy π for state s.

Optimal policy (π*): the policy that yields the highest expected utility.
CONTINUE……

[Figure: optimal policies for the 4×3 world when R(s) < -1.6284 and when -0.4278 < R(s) < -0.0850.]

CONTINUE……

[Figure: optimal policies for the 4×3 world when -0.0221 < R(s) < 0 and when R(s) > 0.]
THE HORIZON

Is there a finite or an infinite horizon for decision making?

Finite horizon: there is a fixed time N after which nothing matters (the game is over).
 The optimal policy is non-stationary.

EXAMPLE OF FINITE HORIZON

N = 3

[Figure: the 4×3 grid world with a horizon of N = 3 steps.]

The optimal action in a given state could change over time.
OPTIMALITY IN SEQUENTIAL DECISION PROBLEMS

Is there a finite or an infinite horizon for decision making?

Infinite horizon: there is no fixed deadline (the time spent at a state doesn't matter).
 The optimal policy is stationary.

EXAMPLE OF INFINITE HORIZON

N = 100

[Figure: the 4×3 grid world with a long horizon of N = 100 steps.]

The optimal action in a given state does not change over time.
OPTIMALITY IN SEQUENTIAL DECISION PROBLEMS

Is there a finite or an infinite horizon for decision making?

We will mainly use infinite-horizon utility functions, because then there is no reason to behave differently in the same state at different times. Hence, the optimal action depends only on the current state, and the optimal policy is stationary.
OPTIMALITY IN SEQUENTIAL DECISION PROBLEMS

How do we calculate the utility of a state sequence?

Additive rewards:
 U_h([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + ...

Discounted rewards:
 U_h([s0, s1, s2, ...]) = R(s0) + γ R(s1) + γ² R(s2) + ...
 The discount factor γ is between 0 and 1.
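A small sketch of this calculation in Python, reusing the -0.04 step reward from the earlier slide; the value γ = 0.9 is an assumption chosen only for illustration:

def discounted_utility(rewards, gamma):
    """U_h = sum over t of gamma**t * R(s_t)."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Ten steps at -0.04 followed by the +1 exit (the sequence used earlier):
rewards = [-0.04] * 10 + [1.0]
print(discounted_utility(rewards, 1.0))   # additive rewards (gamma = 1): 0.6
print(discounted_utility(rewards, 0.9))   # discounted rewards: about 0.09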
OPTIMALITY IN SEQUENTIAL DECISION PROBLEMS

What if there is no terminal state, or the agent never reaches one?

If the environment does not contain a terminal state, or if the agent never reaches one, then all environment histories will be infinitely long, and utilities with additive rewards will generally be infinite.
OPTIMALITY IN SEQUENTIAL DECISION PROBLEMS

What if there is no terminal state, or the agent never reaches one?

Solution

With discounted rewards, the utility of an infinite sequence is finite, provided rewards are bounded by Rmax and γ < 1:

 U_h([s0, s1, ...]) = Σ_{t=0}^{∞} γ^t R(s_t) ≤ Σ_{t=0}^{∞} γ^t Rmax = Rmax / (1 - γ)
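A quick numeric check of this bound; γ = 0.9 and Rmax = 1 are assumed values used only for illustration:

# Partial sums of sum_t gamma**t * Rmax stay below, and approach, Rmax / (1 - gamma).
gamma, Rmax = 0.9, 1.0
partial_sum = sum(gamma**t * Rmax for t in range(1000))
print(partial_sum)             # about 10.0
print(Rmax / (1 - gamma))      # the bound: 10.0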
OPTIMAL POLICIES FOR UTILITIES OF STATES

Expected utility for some policy π starting in state s:
 U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(S_t) ], where the state sequence starts at S_0 = s and is generated by executing π.

The optimal policy π* has the highest expected utility and is given by:
 π* = argmax_π U^π(s)

This sets π*(s) to the action a in A(s) that gives the highest expected utility:
 π*(s) = argmax_{a ∈ A(s)} Σ_{s'} T(s, a, s') U(s')
OPTIMAL POLICIES FOR UTILITIES OF STATES

The optimal policy is actually independent of the start state:
 the actions taken will differ, but the policy itself never changes
 this follows from the nature of a Markovian decision problem with discounted utilities over infinite horizons

Likewise, U(s) does not depend on the start state; it is a property of the state s itself.
OPTIMAL POLICIES FOR UTILITIES OF STATES

The utilities are higher for states closer to the +1 exit, because fewer steps are required to reach the exit.
ALGORITHMS FOR CALCULATING THE OPTIMAL POLICY
 Value iteration
 Policy iteration
VALUE ITERATION ALGORITHM

The utilities are hard to calculate directly, because the Bellman equations are non-linear (they contain a max), so we use an iterative algorithm.

Basic idea:
 start with an initial value for every state, then
 update each state from its neighbours' values until they reach equilibrium.
VALUE ITERATION ALGORITHM
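A minimal sketch of value iteration in Python, assuming the STATES, A, T, R helpers sketched after the MDP slide; γ and ε are free parameters, and the stopping test matches the one discussed below. This is an illustration, not the book's pseudocode verbatim:

def value_iteration(states, A, T, R, gamma=0.9, epsilon=1e-3):
    """Apply the Bellman update U(s) <- R(s) + gamma * max_a sum_s' T(s,a,s') U(s') until convergence."""
    U = {s: 0.0 for s in states}
    while True:
        U_next, delta = {}, 0.0
        for s in states:
            if not A(s):                       # terminal state: its utility is just its reward
                U_next[s] = R(s)
            else:
                U_next[s] = R(s) + gamma * max(
                    sum(T(s, a, s2) * U[s2] for s2 in states) for a in A(s))
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next
        if delta < epsilon * (1 - gamma) / gamma:   # Bellman update is small enough
            return U

# Example (assuming the 4x3 world helpers from earlier):
# U = value_iteration(STATES, A, T, R)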
USING VALUE ITERATION ON THE EXAMPLE
VALUE ITERATION ALGORITHM

When to terminate?
 When the Bellman update is small, so that the error compared with the true utility function is small:

 If ||U_{i+1} - U_i|| < ε(1 - γ)/γ, then ||U_{i+1} - U|| < ε.

Why the factor (1 - γ)/γ?
 Recall: if γ < 1, then over an infinite horizon Σ_t γ^t Rmax = Rmax / (1 - γ), so utilities are bounded by Rmax / (1 - γ).
POLICY ITERATION

The policy iteration algorithm alternates two steps:

 Policy evaluation: given a policy πi, calculate Ui = U^πi, the utility of each state if πi were to be executed.

 Policy improvement: calculate a new policy πi+1 using one-step look-ahead based on Ui:

 πi+1(s) = argmax_a Σ_{s'} T(s, a, s') Ui(s')
POLICY ITERATION

Algorithm (sketched in code below):
 start with a policy π0
 repeat
  Policy evaluation: for each state, calculate Ui given the policy πi
   uses a simplified version of the Bellman update equation, with no max, since the action in each state is fixed by πi:
   Ui(s) = R(s) + γ Σ_{s'} T(s, πi(s), s') Ui(s')
  Policy improvement: for each state,
   if the best utility over the available actions is better than that of πi(s),
    set π(s) to that better action and mark the policy as changed
 until the policy is unchanged
POLICY ITERATION ALGORITHM
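A minimal sketch of policy iteration along the same lines, again assuming the STATES, A, T, R helpers from the MDP sketch. Policy evaluation is done here with a fixed number of simplified Bellman sweeps (no max, since π fixes the action) rather than by solving the linear equations exactly:

import random

def policy_iteration(states, A, T, R, gamma=0.9, eval_sweeps=20):
    pi = {s: random.choice(A(s)) for s in states if A(s)}   # arbitrary initial policy pi_0
    U = {s: 0.0 for s in states}
    while True:
        # Policy evaluation: U(s) <- R(s) + gamma * sum_s' T(s, pi(s), s') U(s')
        for _ in range(eval_sweeps):
            U = {s: R(s) if not A(s) else
                 R(s) + gamma * sum(T(s, pi[s], s2) * U[s2] for s2 in states)
                 for s in states}
        # Policy improvement: one-step look-ahead with a max over actions.
        unchanged = True
        for s in states:
            if not A(s):
                continue
            best = max(A(s), key=lambda a: sum(T(s, a, s2) * U[s2] for s2 in states))
            if (sum(T(s, best, s2) * U[s2] for s2 in states) >
                    sum(T(s, pi[s], s2) * U[s2] for s2 in states)):
                pi[s], unchanged = best, False
        if unchanged:
            return pi, U

# Example (assuming the 4x3 world helpers from earlier):
# pi, U = policy_iteration(STATES, A, T, R)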