Introduction to Artificial Intelligence
Week 2: Problem Solving and Optimization
Professor Wei-Min Shen
Week 9.2
Utility, Policy, Iterations
• Probabilistic Decision Making
– Action and sensor models
– Utility theory
– Decision networks
– Value iteration
– Policy iteration
– Markov decision processes (MDP)
– Partially observable MDP (POMDP)
2
Key Ideas
Readings: AIMA Chapters 16-17 and ALFE 4-5
• Models by states, actions, and sensors (review)
• States have utility U(s)
• Maximum Expected Utility
– EU(a|e) = Σ_s P(s|a,e) U(s)
• Markov Decision Process (MDP):
states, actions, transitions P(s'|s,a), rewards: s→r, or (s,a)→r
– Solutions are represented as a policy: s→a.
– Total discounted expected reward (Bellman):
• U(s) = R(s) + γ max_a Σ_{s'} P(s'|s,a) U(s')
• Equation 17.5, or ALFE page 167.
• Partially Observable MDP (POMDP)
– States cannot be completely observed
– Need a sensor model P(z|s)
• Objective: finding the optimal policy based on utilities
3
Action and Sensor Models (review)
1. Actions
2. Percepts (observations)
3. States
4. Appearance: states → observations
5. Transitions: (states, actions) → states
6. Current state
[Figure: state-transition diagram with states s23, s94, s77, action a1, and observations z1, z2, z3]
What about the goals?
4
Utility or Value of States
Utility ⇔ Goal Information
1. Actions
2. Percepts (observations)
3. States
4. Appearance: states → observations
5. Transitions: (states, actions) → states
6. Current state
7. Value/Utility of states U(s) (usefulness for goals)
[Figure: same diagram as before, with utilities U(s23) = 0.4, U(s94) = 0.2, U(s77) = 0.4]
5
Reward and Action Example
[Figure: 4×3 grid world with a start state and two terminal states, +1 and -1]
Rewards as shown:
– Two terminal states: -1 and +1 (goal)
– -0.04 for all nonterminal states
Actions: ←, →, ↑, ↓
Outcome probability: 0.8 for the intended direction, 0.1 for each of the two sideways directions (0.2 total)
6
Little Prince Example (you can add utility)
• (action) Transition probabilities Φ for the three actions Forward (F), Back (B), and Turn (T); rows are the current state, columns the next state (S0 S1 S2 S3):

  F:  S0: 0.1 0.1 0.1 0.7
      S1: 0.1 0.1 0.7 0.1
      S2: 0.7 0.1 0.1 0.1
      S3: 0.1 0.7 0.1 0.1

  B:  S0: 0.1 0.1 0.7 0.1
      S1: 0.1 0.1 0.1 0.7
      S2: 0.1 0.7 0.1 0.1
      S3: 0.7 0.1 0.1 0.1

  T:  S0: 0.7 0.1 0.1 0.1
      S1: 0.1 0.7 0.1 0.1
      S2: 0.1 0.1 0.1 0.7
      S3: 0.1 0.1 0.7 0.1

• (sensor) Appearance probabilities θ (columns: Rose, Volcano, Nothing):

      S0: 0.8 0.1 0.1
      S1: 0.1 0.8 0.1
      S2: 0.1 0.1 0.8
      S3: 0.1 0.1 0.8

• Initial state probabilities π:

      S0: 0.25   S1: 0.25   S2: 0.25   S3: 0.25

You can add utility to states (e.g., prefer to see the rose)
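For concreteness, the three model components above can be written down as arrays, as in the sketch below. The variable names Phi, theta, and pi, and the row = current state / column = next state convention, are illustrative assumptions, not from the slides.

import numpy as np

# Transition model Phi[a][s, s'] for the actions Forward, Back, Turn
# (assumed: row = current state, column = next state, matching the tables above).
Phi = {
    "F": np.array([[0.1, 0.1, 0.1, 0.7],
                   [0.1, 0.1, 0.7, 0.1],
                   [0.7, 0.1, 0.1, 0.1],
                   [0.1, 0.7, 0.1, 0.1]]),
    "B": np.array([[0.1, 0.1, 0.7, 0.1],
                   [0.1, 0.1, 0.1, 0.7],
                   [0.1, 0.7, 0.1, 0.1],
                   [0.7, 0.1, 0.1, 0.1]]),
    "T": np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.1, 0.7, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 0.7],
                   [0.1, 0.1, 0.7, 0.1]]),
}

# Sensor model theta[s, z] for the observations Rose, Volcano, Nothing.
theta = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.1, 0.1, 0.8]])

# Initial state distribution pi over S0..S3.
pi = np.array([0.25, 0.25, 0.25, 0.25])

# Sanity check: every row of every model should sum to 1.
assert all(np.allclose(Phi[a].sum(axis=1), 1.0) for a in Phi)
assert np.allclose(theta.sum(axis=1), 1.0) and np.isclose(pi.sum(), 1.0)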
7
Little Prince in Action (POMDP)
[Figure: trellis of hidden states S1 S2 S3 S4 unrolled over time; the states are only partially observable]
Example experience: {rose}, forward, {none}, …, turn, {rose}, back, {volcano}
• Given: "experience" E1:T (time 1 through T)
• Definitions:
– States S and actions A (Forward, Back, Turn)
– (Action model) Transition probabilities Φ
– (Sensor model) Appearance probabilities θ (rose, volcano, none)
– (Localization) Initial/current state probabilities π
8
State Utility and Decision on Actions
• Combine the following together:
– A partially observable action/perception model
• E.g., the Little Prince example
– State values derived from goals
• E.g., the Dynamic Programming example (next slide)
– A policy (choose actions from state values)
9
State Values in Dynamic Programming (ALFE 6.1.1)
Backward recursion: compute the future cost by working backward from the goal state, stage by stage.
R(s,a): the reward from taking an action in a state
Backward recursion equation: (see ALFE 6.1.1)
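A minimal sketch of the backward-recursion idea, not the ALFE notation: starting from the final stage, the value of a state at stage t is the best immediate reward plus the expected value at stage t+1. It is stated here as reward maximization; a cost-minimization version just swaps max for min. The function name and the tiny example MDP are illustrative only.

import numpy as np

def backward_recursion(R, T, horizon):
    # R[s, a]     : reward for taking action a in state s
    # T[a, s, sp] : probability of moving from s to sp under action a
    # Returns V[t, s]: best expected total reward from stage t to the horizon.
    n_states, n_actions = R.shape
    V = np.zeros((horizon + 1, n_states))          # value is 0 beyond the horizon
    for t in range(horizon - 1, -1, -1):           # sweep backward, stage by stage
        for s in range(n_states):
            V[t, s] = max(R[s, a] + T[a, s] @ V[t + 1]
                          for a in range(n_actions))
    return V

# Tiny illustrative 2-state, 2-action problem (all numbers made up).
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
T = np.array([[[0.9, 0.1], [0.2, 0.8]],            # action 0
              [[0.5, 0.5], [0.1, 0.9]]])           # action 1
print(backward_recursion(R, T, horizon=3))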
10
Reward and Action Example
[Figure: 4×3 grid world with a start state and two terminal states, +1 and -1]
Rewards as shown:
– Two terminal states: -1 and +1 (goal)
– -0.04 for all nonterminal states
Actions: ←, →, ↑, ↓
Outcome probability: 0.8 for the intended direction, 0.1 for each of the two sideways directions (0.2 total)
11
Partially Observable Markov Decision Process
• A POMDP consists of
– States S (s, …) and actions A (a, …)
– An initial-state probability distribution π over s0
– A transition model Φ(s'|s, a)
– A sensor model θ(z|s)
– A reward function R(s), i.e., utility
12
Markov Decision Process (MDP)
• An MDP consists of
– States S and actions A
– An initial-state probability distribution π over s0
– A transition model Φ(s'|s, a)
– A sensor model θ(z|s)
– A reward function R(s)
Note: the sensors have no uncertainty, so the state is fully observed
13
Maximum Expected Utility (MEU)
• Every state has a utility U(s)
• The expected utility of an action a, given the evidence or observation e, is the average utility value of the outcomes, weighted by the probability that each outcome occurs:
EU(a | e) = Σ_{s'} P(Result(a) = s' | a, e) U(s')
• The principle of maximum expected utility (MEU) is that a rational agent should choose the action that maximizes the agent's expected utility:
action = argmax_a EU(a | e)
[Figure: small state diagram with utilities U(s2) = 0.4, U(s9) = 0.2, U(s7) = 0.4]
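A minimal sketch of the MEU rule: score each action by its expected utility under the outcome distribution, then pick the argmax. The dictionary names and the numbers below are illustrative, not from the slides.

def expected_utility(action, P_outcome, U):
    # EU(a | e) = sum_s' P(s' | a, e) * U(s')
    return sum(p * U[s] for s, p in P_outcome[action].items())

def meu_action(actions, P_outcome, U):
    # Choose the action that maximizes expected utility.
    return max(actions, key=lambda a: expected_utility(a, P_outcome, U))

# Illustrative numbers only, loosely based on the diagram's utilities.
U = {"s2": 0.4, "s9": 0.2, "s7": 0.4}
P_outcome = {"a1": {"s2": 0.5, "s9": 0.3, "s7": 0.2},
             "a2": {"s2": 0.1, "s9": 0.8, "s7": 0.1}}
print(meu_action(["a1", "a2"], P_outcome, U))   # -> "a1"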
14
Policy
• The solution of a (PO)MDP can be represented as
– A policy π(s) = a
– Each execution of the policy from s0 may yield a different history or path
– An optimal policy π*(s) is a policy that yields the highest expected utility
15
Utilities of States (example)
[Figure: the 4×3 grid world labeled with state utilities]
0.812  0.868  0.918    +1
0.762  (wall) 0.660    -1
0.705  0.655  0.611  0.388
Suppose we have the utilities of the states as above.
What path would an optimal policy choose?
16
Optimal Policy Examples
depending on the reward distribution R(s)
When the reward for the nonterminal states is evenly distributed at -0.04, the path chosen will depend on the transition model.
[Figure: optimal policy arrows on the 4×3 grid for nonterminal R(s) = -0.04]
Caution: because of the nondeterministic actions, from state (3,2) or (4,1) you may "accidentally" end up in (4,2). So there is a "risk" in this policy.
17
Optimal Policy Examples
depending on the reward distribution R(s)
[Figure: optimal policies on the 4×3 grid for different ranges of nonterminal reward R(s)]
• Nonterminal R(s) = -0.04: Why not go up? Hint: consider the nature of the actions, deterministic or not.
• R(s) < -1.6284: suicide (life is painful, death is good)
• -0.4278 < R(s) < -0.0850: risky (life is OK, willing to risk)
• -0.0221 < R(s) < 0: no risk, end nicely (life is good; minimize risks, willing to end nicely)
• R(s) > 0: no risk, don't end (life is rewarding, no end please)
18
Compute Utilities Over Time
• Additive rewards
U_h([s0, s1, s2, ...]) = R(s0) + R(s1) + R(s2) + …
• Discounted rewards
U_h([s0, s1, s2, ...]) = R(s0) + γ R(s1) + γ² R(s2) + …
• The expected utility U^π(s) obtained by executing policy π starting in s:
U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t) ]
Optimal Policy
π*_s = argmax_π U^π(s)
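A small sketch of the discounted-return formula above; the reward sequence is made up for illustration.

def discounted_return(rewards, gamma):
    # U_h = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return([-0.04, -0.04, -0.04, 1.0], gamma=0.9))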
19
Compute Utilities (example)
[Figure: the 4×3 grid world labeled with the resulting state utilities]
0.812  0.868  0.918    +1
0.762  (wall) 0.660    -1
0.705  0.655  0.611  0.388
Reward in every nonterminal state is -0.04
Discount γ = 1
20
State Utility Value Iteration
(improving U(s) every step)
• U(s): the expected sum of maximum rewards achievable starting at a particular state
• Bellman equations:
– Many equations must be solved simultaneously
U*(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U*(s')
• Bellman iteration:
– Converges to U*(s) step by step
U_{i+1}(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U_i(s')
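A compact sketch of the Bellman iteration above for a generic MDP given as arrays; the function and variable names are illustrative, and convergence of this simple stopping rule assumes γ < 1 (or a model with absorbing terminal states).

import numpy as np

def value_iteration(R, T, gamma, eps=1e-6):
    # R[s]        : reward of state s
    # T[a, s, sp] : transition probability P(sp | s, a)
    # Repeats U_{i+1}(s) = R(s) + gamma * max_a sum_sp T(s,a,sp) * U_i(sp)
    # until the largest change is below eps.
    U = np.zeros(len(R))
    while True:
        Q = T @ U                     # Q[a, s] = sum_sp T[a, s, sp] * U[sp]
        U_new = R + gamma * Q.max(axis=0)
        if np.max(np.abs(U_new - U)) < eps:
            return U_new
        U = U_new

# Tiny illustrative example (numbers made up): 2 states, 2 actions.
R = np.array([0.0, 1.0])
T = np.array([[[0.9, 0.1], [0.0, 1.0]],
              [[0.5, 0.5], [0.2, 0.8]]])
print(value_iteration(R, T, gamma=0.9))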
21
Bellman Iteration Example
Reward R(s):
-0.04  -0.04  -0.04    +1
-0.04  (wall) -0.04    -1
-0.04  -0.04  -0.04  -0.04
Initial utilities U_0(s): 0.0 for every state
Transition probability T(s, a, s'): 0.8 for the intended direction, 0.1 for each sideways direction
Gamma = 1.0
Iterate: U_{i+1}(s) ← R(s) + γ max_a Σ_{s'} T(s, a, s') U_i(s')
Converged utilities:
0.812  0.868  0.918    +1
0.762  (wall) 0.660    -1
0.705  0.655  0.611  0.388
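For concreteness, below is a self-contained sketch that runs this Bellman iteration on the 4×3 grid. The coordinate convention, helper names, and convergence threshold are my own choices; under these assumptions it should reproduce the utilities shown above to roughly three decimal places.

import numpy as np

# 4x3 grid, columns x = 1..4, rows y = 1..3; (2,2) is a wall.
states = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
terminals = {(4, 3): +1.0, (4, 2): -1.0}
R = {s: terminals.get(s, -0.04) for s in states}
gamma = 1.0
moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
side = {"up": ["left", "right"], "down": ["left", "right"],
        "left": ["up", "down"], "right": ["up", "down"]}

def step(s, a):
    # Deterministic move; bumping into the wall or the edge leaves the state unchanged.
    nxt = (s[0] + moves[a][0], s[1] + moves[a][1])
    return nxt if nxt in states else s

def outcomes(s, a):
    # 0.8 for the intended direction, 0.1 for each of the two sideways directions.
    return [(0.8, step(s, a))] + [(0.1, step(s, b)) for b in side[a]]

U = {s: 0.0 for s in states}
while True:
    U_new = {}
    for s in states:
        if s in terminals:
            U_new[s] = R[s]                       # terminal states keep their reward
        else:
            U_new[s] = R[s] + gamma * max(
                sum(p * U[sp] for p, sp in outcomes(s, a)) for a in moves)
    delta = max(abs(U_new[s] - U[s]) for s in states)
    U = U_new
    if delta < 1e-6:
        break

for y in (3, 2, 1):                               # print the top row first
    print(["{:+.3f}".format(U[(x, y)]) if (x, y) in U else " wall "
           for x in range(1, 5)])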
22
Optimal Policy
• You can compute it once U*(s) is known:
π*(s) = argmax_a Σ_{s'} T(s, a, s') U*(s')
• Or, you can compute it incrementally at every iteration when U_i(s) is updated
– This is called "Policy Iteration" (see the next slide)
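A one-step sketch of extracting the greedy policy from U*, in the same array form used in the value-iteration sketch above (names illustrative):

import numpy as np

def greedy_policy(U, T):
    # pi*(s) = argmax_a sum_sp T(s, a, sp) * U(sp)
    # T[a, s, sp] : transition probabilities; U[sp] : converged utilities.
    return (T @ U).argmax(axis=0)      # one action index per state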
23
Policy Iteration
(improving π(s) every step)
• Start with a randomly chosen initial policy π0
• Iterate until there is no change in utilities:
• Policy evaluation: given a policy π_i, calculate the utility U_i(s) of every state s under π_i by solving the system of equations:
U_i(s) = R(s) + γ Σ_{s'} T(s, π_i(s), s') U_i(s')
• Policy improvement: calculate the new policy π_{i+1} using one-step look-ahead based on U_i(s):
π_{i+1}(s) = argmax_a Σ_{s'} T(s, a, s') U_i(s')
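A sketch of the two alternating steps for an MDP in the same array form, with exact policy evaluation done by solving the linear system (names illustrative; assumes γ < 1 or a proper model so the system is nonsingular):

import numpy as np

def policy_iteration(R, T, gamma):
    # R[s] : reward; T[a, s, sp] : transition probabilities.
    n_actions, n_states, _ = T.shape
    pi = np.zeros(n_states, dtype=int)             # arbitrary initial policy
    while True:
        # Policy evaluation: solve U = R + gamma * T_pi U, i.e. (I - gamma*T_pi) U = R
        T_pi = T[pi, np.arange(n_states)]          # T_pi[s, sp] = T[pi(s), s, sp]
        U = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
        # Policy improvement: one-step look-ahead on U
        pi_new = (T @ U).argmax(axis=0)
        if np.array_equal(pi_new, pi):
            return pi, U
        pi = pi_new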
24
Policy Iteration Comments
• Each step of policy iteration is guaranteed to strictly improve the policy at some state when improvement is possible
• Converges to the optimal policy
• Gives the exact value of the optimal policy
25
Policy Iteration Example
26
Policy Iteration Example
27
Policy Iteration Example
28
Policy Iteration Example
29
Policy Iteration Example
30
Policy Iteration Example
31
Policy Iteration Example
• The new policy π2 after an iteration from π1:
– π2(Hungry) → Eat
– π2(Full) → Sleep
32
Utilities Over Time
• States have utility U(s)
• Maximum Expected Utility
– EU(a|e) = Σ_s P(s|a,e) U(s)
• Markov Decision Process (MDP): states, actions, transitions P(s'|s,a), rewards: s→r, or (s,a)→r
– Solutions are represented as a policy: s→a.
– Total discounted expected reward (Bellman): Equation 17.5, or ALFE page 167.
• Partially Observable MDP (POMDP)
– States cannot be completely observed
– It is necessary to have a sensor model P(z|s)
• Objective: finding the optimal policy based on utilities
33
Utility or Value of States
1. Actions
2. Percepts (observations)
3. States
4. Appearance: states → observations
5. Transitions: (states, actions) → states
6. Current state
7. Value/Utility of states U(s) (useful for goals)
[Figure: same diagram as before, with utilities U(s23) = 0.4, U(s94) = 0.2, U(s77) = 0.4]
34
So far…
• Given an MDP model, we know how to find optimal policies
– Value Iteration or Policy Iteration
• But what if we only know the model's states s, and not the transitions T(s,a,s') and reward R(s)?
– Like when we were babies...
– All we can do is wander around the world observing what happens, getting rewarded and punished
– This is Reinforcement Learning
35
POMDP, Transitions, Belief States
• A POMDP consists of
– A set of states S (with an initial state s0)
– A set Actions(s) of actions available in each state
– A transition model P(s'|s, a), or T(s, a, s')
– A reward function R(s)
– A sensor model P(e|s)
– A belief of what the current state is
• If b(s) was the previous belief state, and the robot does action a and then perceives evidence e, then the new belief state is
b'(s') = α P(e|s') Σ_s P(s'|s, a) b(s)
where α is a normalization constant that makes the belief state sum to 1.
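A sketch of that belief update as code. The arrays below are illustrative, written in the spirit of the Little Prince model from earlier (row = current state, column = next state is an assumed convention):

import numpy as np

def belief_update(b, T_a, P_e_given_s):
    # b'(s') = alpha * P(e | s') * sum_s P(s' | s, a) * b(s)
    # b            : current belief over states
    # T_a[s, sp]   : transition matrix for the action actually taken
    # P_e_given_s  : likelihood of the observed evidence for each state
    b_pred = b @ T_a                   # predict: sum_s b(s) * P(sp | s, a)
    b_new = P_e_given_s * b_pred       # weight by the sensor model
    return b_new / b_new.sum()         # alpha: renormalize to sum to 1

b = np.array([0.25, 0.25, 0.25, 0.25])             # uniform initial belief (pi)
T_forward = np.array([[0.1, 0.1, 0.1, 0.7],
                      [0.1, 0.1, 0.7, 0.1],
                      [0.7, 0.1, 0.1, 0.1],
                      [0.1, 0.7, 0.1, 0.1]])
P_rose = np.array([0.8, 0.1, 0.1, 0.1])            # theta[:, Rose]
print(belief_update(b, T_forward, P_rose))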
36
Value Iteration in POMDP
• The optimal action depends only on the robot's current belief state
α_p(s) = R(s) + γ ( Σ_{s'} T(s, a, s') Σ_e P(e|s') α_{p.e}(s') )
Compare to the MDP value iteration below; the difference is the last term, where the state is uncertain and must be averaged over all possible evidence:
U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')
37
Extra Slides
38
Rational Decisions
• Rational preferences
• Utilities
• Money
• Multi-attribute utilities
• Decision networks
• Value of information
39
Preferences
• An agent chooses among prizes (A, B, etc.) and lotteries, i.e., situations with uncertain prizes
• Lottery L = [p, A; (1-p), B]
• Notation:
A ≻ B : A preferred to B
A ∼ B : indifference between A and B
A ≿ B : B not preferred to A
40
Rational preferences
• Idea: the preferences of a rational agent must obey constraints.
• Rational preferences ⇒ behavior describable as maximization of expected utility
• Constraints:
Orderability: (A ≻ B) ∨ (B ≻ A) ∨ (A ∼ B)
Transitivity: (A ≻ B) ∧ (B ≻ C) ⇒ (A ≻ C)
Continuity: A ≻ B ≻ C ⇒ ∃p [p, A; 1-p, C] ∼ B
Substitutability: A ∼ B ⇒ [p, A; 1-p, C] ∼ [p, B; 1-p, C]
Monotonicity: A ≻ B ⇒ (p ≥ q ⇔ [p, A; 1-p, B] ≿ [q, A; 1-q, B])
Decomposability: [p, A; 1-p, [q, B; (1-q), C]] ∼ [p, A; (1-p)q, B; (1-p)(1-q), C]
41
Decomposability
• A "complex" hierarchical lottery can be collapsed into a single multi-outcome lottery:
[Figure: a two-stage lottery L with branches p → A and (1-p) → (q → B, (1-q) → C), collapsed into a single lottery with branches p → A, (1-p)q → B, (1-p)(1-q) → C]
[p, A; 1-p, [q, B; (1-q), C]] ∼ [p, A; (1-p)q, B; (1-p)(1-q), C]
42
Rational preferences cont'd
• Violating the constraints leads to self-evident irrationality
• For example: an agent with intransitive preferences can be induced to give away all its money
• If B ≻ C, then an agent who has C would pay (say) 1 cent to get B
• If A ≻ B, then an agent who has B would pay (say) 1 cent to get A
• If C ≻ A, then an agent who has A would pay (say) 1 cent to get C
43
Maximizing Expected Utility (MEU)
• Theorem (Ramsey, 1931; von Neumann and Morgenstern, 1944):
– Given preferences satisfying the constraints, there exists a real-valued function U such that
U(A) > U(B) ⇔ A ≻ B
U(A) = U(B) ⇔ A ∼ B
U(A) ≥ U(B) ⇔ A ≿ B
U([p1, S1; …; pn, Sn]) = Σi pi U(Si)
• MEU Principle: choose the action that maximizes expected utility
• Note: an agent can be entirely rational (consistent with MEU) without ever representing or manipulating utilities and probabilities
– E.g., a lookup table for perfect tic-tac-toe
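A tiny sketch of the lottery-utility identity U([p1, S1; …; pn, Sn]) = Σi pi U(Si); the numbers are illustrative only.

def lottery_utility(lottery, U):
    # lottery: list of (probability, outcome) pairs; U: utility of each outcome.
    return sum(p * U[s] for p, s in lottery)

U = {"A": 1.0, "B": 0.4, "C": 0.0}
print(lottery_utility([(0.7, "A"), (0.3, "C")], U))   # 0.7*1.0 + 0.3*0.0 = 0.7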
44
Utilities
• Utilities map states to real numbers. Which numbers?
• Standard approach to assessment of human utilities: compare a given state A to a standard lottery L_p that has
– "best possible prize" u_⊤ with probability p
– "worst possible catastrophe" u_⊥ with probability (1-p)
– adjust the lottery probability p until A ∼ L_p
45
Utility scales
• Normalized utilities: u_⊤ = 1.0, u_⊥ = 0.0
• Micromorts: a one-millionth chance of death
– useful for Russian roulette, paying to reduce product risks, etc.
• QALYs: quality-adjusted life years
– useful for medical decisions involving substantial risk
• Note: behavior is invariant w.r.t. positive linear transformation
U'(x) = k1 U(x) + k2, where k1 > 0
• With deterministic prizes only (no lottery choices), only ordinal utility can be determined, i.e., a total order on prizes
46
Money
• Money does not behave as a utility function
• Given a lottery L with expected monetary value EMV(L), usually U(L) < U(EMV(L)), i.e., people are risk-averse
• Utility curve: for what probability p am I indifferent between a prize x and a lottery [p, $M; (1-p), $0] for large M?
• Typical empirical data, extrapolated with risk-prone behavior:
[Figure: empirical utility-of-money curve]
47
Decision Networks
• MEU: choose the action which maximizes the expected utility given the evidence
• We can directly operationalize this with decision diagrams
– Bayesian nets with nodes for utility and actions
– Lets us calculate the expected utility for each action
• New node types:
– Chance nodes (just like BNs)
– Actions (rectangles; must be parents; act as observed evidence)
– Utilities (depend on action and chance nodes)
[Figure: decision network with chance nodes Weather and Forecast, action node Umbrella, and utility node U]
48
Decision Networks
• MEU action selection:
– Instantiate all evidence
– Calculate the posterior over the parents of the utility node
– Set the action node each possible way
– Calculate the expected utility for each action
– Choose the maximizing action
[Figure: the same Umbrella / Weather / Forecast / U decision network]
49
Example: Decision Networks
[Figure: decision network with action node Umbrella, chance node Weather, and utility node U]
Weather prior P(W): sun 0.7, rain 0.3
Utility U(A, W):
leave, sun : 100
leave, rain:   0
take,  sun :  20
take,  rain:  70
Umbrella = leave: EU(leave) = Σ_w P(w) U(leave, w) = 0.7·100 + 0.3·0 = 70
Umbrella = take:  EU(take)  = Σ_w P(w) U(take, w)  = 0.7·20 + 0.3·70 = 35
Optimal = leave:  MEU(∅) = max_a EU(a) = 70
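A small sketch of this calculation directly from the two tables (the dictionary names are illustrative):

P_W = {"sun": 0.7, "rain": 0.3}
U = {("leave", "sun"): 100, ("leave", "rain"): 0,
     ("take", "sun"): 20, ("take", "rain"): 70}

def expected_utility(action, P_W, U):
    # EU(a) = sum_w P(w) * U(a, w)
    return sum(p * U[(action, w)] for w, p in P_W.items())

eus = {a: expected_utility(a, P_W, U) for a in ("leave", "take")}
print(eus)                      # {'leave': 70.0, 'take': 35.0}
print(max(eus, key=eus.get))    # 'leave'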
50
Example: Decision Networks
[Figure: the same decision network, now with Forecast = bad observed]
Posterior P(W | F=bad): sun 0.34, rain 0.66
Utility U(A, W):
leave, sun : 100
leave, rain:   0
take,  sun :  20
take,  rain:  70
Umbrella = leave: EU(leave | bad) = Σ_w P(w | bad) U(leave, w) = 0.34·100 + 0.66·0 = 34
Umbrella = take:  EU(take | bad)  = Σ_w P(w | bad) U(take, w)  = 0.34·20 + 0.66·70 = 53
Optimal = take:   MEU(F = bad) = max_a EU(a | bad) = 53
51
Decision Networks cont'd
• Add action nodes and utility nodes to belief networks to enable rational decision making
• Algorithm:
For each value of the action node:
compute the expected value of the utility node given the action and evidence
Return the MEU action
52
Multi-Attribute Utility
• How can we handle utility functions of many variables X1 … Xn?
– E.g., what is U(Deaths, Noise, Cost)?
• How can complex utility functions be assessed from preference behavior?
• Idea 1: identify conditions under which decisions can be made without complete identification of U(x1, …, xn)
• Idea 2: identify various types of independence in preferences and derive consequent canonical forms for U(x1, …, xn)
53
Strict dominance
• Typically we define attributes such that U is monotonic in each
• Strict dominance: choice B strictly dominates choice A iff
∀i Xi(B) ≥ Xi(A) (and hence U(B) ≥ U(A))
• Strict dominance seldom holds in practice
54
Stochastic dominance
• Distribution p1 stochastically dominates distribution p2 iff
∀t  ∫_{-∞}^{t} p1(x) dx ≤ ∫_{-∞}^{t} p2(x) dx
• If U is monotonic in x, then A1 with outcome distribution p1 stochastically dominates A2 with outcome distribution p2:
∫_{-∞}^{∞} p1(x) U(x) dx ≥ ∫_{-∞}^{∞} p2(x) U(x) dx
• Multiattribute case: stochastic dominance on all attributes ⇒ optimal
55
Stochastic dominance cont'd
• Stochastic dominance can often be determined without exact distributions, using qualitative reasoning
• E.g., construction cost increases with distance from the city:
if S1 is closer to the city than S2, then S1 stochastically dominates S2 on cost
• E.g., injury increases with collision speed
56
Preference structure: Deterministic
• X1 and X2 are preferentially independent of X3 iff the preference between ⟨x1, x2, x3⟩ and ⟨x'1, x'2, x3⟩ does not depend on x3
• E.g., ⟨Noise, Cost, Safety⟩:
⟨20,000 suffer, $4.6 billion, 0.06 deaths/mpm⟩ vs.
⟨70,000 suffer, $4.2 billion, 0.06 deaths/mpm⟩
• Theorem (Leontief, 1947): if every pair of attributes is P.I. of its complement, then every subset of attributes is P.I. of its complement: mutual P.I.
• Theorem (Debreu, 1960): mutual P.I. ⇒ there exists an additive value function:
V(S) = Σi Vi(Xi(S))
• Hence assess n single-attribute functions; this is often a good approximation
57
Preference structure: Stochastic
• Need to consider preferences over lotteries:
X is utility-independent of Y iff preferences over lotteries in X do not depend on y
• Mutual U.I.: each subset is U.I. of its complement
⇒ there exists a multiplicative utility function:
U = k1 U1 + k2 U2 + k3 U3
  + k1 k2 U1 U2 + k2 k3 U2 U3 + k3 k1 U3 U1
  + k1 k2 k3 U1 U2 U3
where Ui = Ui(xi)
• Routine procedures and software packages exist for generating preference tests to identify various canonical families of utility functions
58
Value of information
• Idea: compute the value of acquiring each possible piece of evidence
• Can be done directly from the decision network
• Example: buying oil drilling rights
– Two blocks A and B; exactly one has oil, worth k
– Prior probabilities 0.5 each
– Current price of each block is k/2
– MEU = 0 (either action is a maximizer)
– A "consultant" offers an accurate survey of A.
– What is a fair price?
[Figure: decision network with action node DrillLoc (D), chance node OilLoc (O), and utility node U]
P(O = a) = 0.5, P(O = b) = 0.5
U(D = a, O = a) =  k/2    U(D = a, O = b) = -k/2
U(D = b, O = a) = -k/2    U(D = b, O = b) =  k/2
59
Value of information (cont'd)
• Solution: compute the expected value of information, i.e., the expected gain in MEU from observing the new information
[Figure: the same DrillLoc / OilLoc / U decision network and utility table]
• The probe gives an accurate survey of A
– Fair price?
– The survey may say "oil in A" or "no oil in A", each with probability 0.5
– If we know O, MEU is k/2 (either way)
– Initial MEU = 0
– Gain in MEU = k/2 - 0 = k/2
– VPI(O) = k/2
– Fair price: k/2
60
Value of information (cont'd)
• Current evidence E = e; the utility depends on S = s
MEU(e) = max_a Σ_s P(s | e) · U(s, a)
• Potential new evidence E': suppose we knew E' = e'
MEU(e, e') = max_a Σ_s P(s | e, e') · U(s, a)
• But E' is a random variable whose value is currently unknown, so:
– We must compute the expected gain over all its possible values
VPI_e(E') = Σ_{e'} P(e' | e) · ( MEU(e, e') - MEU(e) )
• VPI = value of perfect information
61
VPI Example
[Figure: the Umbrella / Weather / Forecast / U decision network]
Utility U(A, W): leave/sun 100, leave/rain 0, take/sun 20, take/rain 70
• MEU with no evidence:
MEU(∅) = max_a EU(a) = 70
• MEU if the forecast is bad:
MEU(F = bad) = max_a EU(a | bad) = 53
• MEU if the forecast is good:
MEU(F = good) = max_a EU(a | good) = 95
• Forecast distribution: P(F = good) = 0.59, P(F = bad) = 0.41
VPI_e(F) = Σ_{e'} P(e' | e) ( MEU(e, e') - MEU(e) )
         = 0.59 · (95 - 70) + 0.41 · (53 - 70)
         = 0.59 · (+25) + 0.41 · (-17) = +7.78
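A sketch of the same VPI computation in code, using the numbers from this slide (variable names are illustrative):

P_F = {"good": 0.59, "bad": 0.41}
MEU_no_evidence = 70
MEU_given = {"good": 95, "bad": 53}

# VPI_e(F) = sum_f P(f) * (MEU(e, f) - MEU(e))
vpi = sum(p * (MEU_given[f] - MEU_no_evidence) for f, p in P_F.items())
print(round(vpi, 2))   # 7.78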
62
Properties of VPI
• Nonnegative (in expectation, not post hoc)
∀E', e : VPI_e(E') ≥ 0
• Nonadditive (consider, e.g., obtaining Ej twice)
VPI_e(Ej, Ek) ≠ VPI_e(Ej) + VPI_e(Ek)
• Order-independent
VPI_e(Ej, Ek) = VPI_e(Ej) + VPI_{e,Ej}(Ek) = VPI_e(Ek) + VPI_{e,Ek}(Ej)
• Note: when more than one piece of evidence can be gathered, maximizing VPI for each to select one is not always optimal
⇒ evidence-gathering becomes a sequential decision problem
63
Qualitative behaviors
Imagine actions 1 and 2, for which U1 > U2.
How much will information about Ej be worth?
[Figure: three cases for the distributions of U1 and U2]
– Little: we're sure action 1 is better
– A lot: either action could be much better
– Little: the information is likely to change our action but not our utility
64