Learning Theory of Optimal Decision Making
Part II: Batch Learning in Markovian Decision Processes

Csaba Szepesvári
Department of Computing Science, University of Alberta

Machine Learning Summer School, Île de Ré, France, 2008

With thanks to: the RLAI group, the SZTAKI group, Jean-Yves Audibert, Rémi Munos
Outline

1. High-level overview of the talks
2. Motivation
   - What is it?
   - Why should we care?
   - The challenge
3. Markovian Decision Problems
   - Basics
   - Dynamic programming
4. Regression
5. Batch RL
6. Conclusion
High-Level Overview of the Talks

Day 1: Online learning in stochastic environments
Day 2: Batch learning in Markovian Decision Processes
Day 3: Online learning in adversarial environments
Refresher

Union bound: P(A ∪ B) ≤ P(A) + P(B)
Inclusion: A ⊂ B ⇒ P(A) ≤ P(B)
Inversion: If P(X > t) ≤ F(t) holds for all t, then for any 0 < δ < 1, w.p. 1 − δ, X ≤ F^{-1}(δ)
Expectation: If X ≥ 0, E[X] = ∫_0^∞ P(X ≥ t) dt
Jensen: If f is convex, then f(E[X]) ≤ E[f(X)]
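As a worked instance of the inversion step, here is how it combines with a Hoeffding-type tail bound, the form that reappears in the regression part of the talk; the constants below are illustrative, not taken from the slides.

```latex
% Inversion applied to a Hoeffding-type tail bound (illustrative example).
% Suppose P(X > t) \le F(t) := \exp(-2 n t^2 / b^2) for all t > 0.
% Solving F(t) = \delta for t gives F^{-1}(\delta) = b \sqrt{\log(1/\delta)/(2n)},
% so by the inversion rule, with probability at least 1 - \delta,
\[
  X \;\le\; b \sqrt{\frac{\log(1/\delta)}{2n}} .
\]
```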
Refresher II

Linearity: E[X + Y] = E[X] + E[Y]
Law of total expectation:
  E[Z] = ∑_x E[Z | X = x] P(X = x)
  E[Z | U = u] = ∑_x E[Z | U = u, X = x] P(X = x | U = u)
Tower rule: let E[Z | Y] = f(Y), where f(y) := E[Z | Y = y]. Then
  E[Z] = E[E[Z | X]],   E[Z | U] = E[E[Z | U, X] | U].
A corollary of the Markov property: let X0, X1, X2, ... be a Markov process. Then
  E[f(X1, X2, ...) | X1 = y, X0 = x] = E[f(X1, X2, ...) | X1 = y]
What Is It?

Protocol of learning
Concepts: Experimenter, Learner, Environment; states, actions, rewards.

One-shot learning:
1. The Experimenter generates training data D = {(X1, A1, X'1, R1), ..., (Xn, An, X'n, Rn)} by following some policy πb in the Environment
2. The Learner computes a policy π based on D
3. The policy is implemented in the Environment (its performance Vπ is compared with that of πb)

Goal: maximize Vπ
Why Should We Care?

Batch – a frequent situation
Learning – you know!
In Markovian Decision Processes – convenience, ...
The Challenge

The state space X may be
- large
- infinite
- continuous

The action space A may be
- large
- infinite
- continuous

Model-based approaches require
- efficient learners (of the Markov kernel?)
- efficient planners

Direct approach?

Note
In this talk, A is finite.
Markovian Decision Problems

An MDP is given by a tuple (X, A, p, r, γ):
- X – set of states
- A – set of actions
- p – transition kernel:
  p(·|x, a) – next-state distribution;
  p(y|x, a) – probability of ending up in y after taking a in state x
- r – reward function: r(x, a, y), or r(x, a), or r(x)
- 0 < γ ≤ 1 – discount factor
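To make the objects concrete, here is a minimal Python sketch of a finite MDP in the tuple form above. The state/action counts, the transition array P[a, x, y], the reward array R[a, x, y] and γ are made-up illustrative values, not anything from the slides.

```python
# A hypothetical toy MDP with |X| = 3 states and |A| = 2 actions; all numbers are illustrative.
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9  # discount factor

# p(y | x, a) stored as P[a, x, y]; each row P[a, x, :] is a probability distribution over next states.
P = np.array([
    [[0.8, 0.2, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]],
    [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]],
])

# r(x, a, y) stored as R[a, x, y]: reward 1 for landing in state 2, 0 otherwise.
R = np.zeros((n_actions, n_states, n_states))
R[:, :, 2] = 1.0

assert np.allclose(P.sum(axis=2), 1.0)  # each p(.|x, a) sums to one
```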
The Process View

(Xt, At, Rt) – a controlled Markov process:
- Xt ∈ X – state at time t
- At ∈ A – action at time t
- Rt ∈ R – reward at time t

Laws:
- Xt+1 ∼ p(·|Xt, At)
- At ∼ π(·|Xt, At−1, Rt−1, ..., A1, R1, X0)
  π – policy, mapping histories to M(A)
- Rt = r(Xt, At, Xt+1) + Wt
  Wt – reward noise (can depend on the transition)
The Control Problem

Value function of a policy π:
  Vπ(x) = Eπ[ ∑_{t=0}^∞ γ^t Rt | X0 = x ],   x ∈ X.

Optimal value function:
  V*(x) = sup_π Vπ(x),   x ∈ X.

Optimal policy π*:
  Vπ*(x) = V*(x),   x ∈ X.
Applications of MDPs

Operations research, econometrics, control, statistics, games, AI, ...

- Optimal investments
- Replacement problems
- Option pricing
- Logistics, inventory management
- Active vision
- Production scheduling
- Dialogue control
- Bioreactor control
- Robotics (e.g., RoboCup Soccer)
- Driving
- Real-time load balancing
- Design of experiments (medical tests)
Bellman Operators

Definition (Supremum norm)
‖V‖∞ = max_{x∈X} |V(x)| (or ‖V‖∞ = sup_{x∈X} |V(x)| in infinite spaces).
B(X) = {V : X → R | ‖V‖∞ < +∞} – the space of bounded value functions, equipped with ‖·‖∞.

Let π : X → A be a stationary policy.
Tπ : B(X) → B(X) – the Bellman operator of π:
  (Tπ V)(x) = ∑_{y∈X} p(y|x, π(x)) {r(x, π(x), y) + γV(y)}.

Theorem
Vπ is the unique fixed point of Tπ:  Tπ Vπ = Vπ.
Bellman Operators II

Note
Tπ Vπ = Vπ is a linear system of equations!

E.g., X = {1, 2, ..., n}:
- Pπ ∈ R^{n×n}:  (Pπ)_{ij} = p(j|i, π(i))
- rπ ∈ R^n:  (rπ)_i = ∑_j p(j|i, π(i)) r(i, π(i), j)

View Vπ as a vector in R^n. Then Tπ Vπ ∈ R^n, (Tπ Vπ)_i = (Tπ Vπ)(i), and
  (Tπ Vπ)_i = ∑_{j=1}^n p(j|i, π(i)) {r(i, π(i), j) + γVπ(j)} = (rπ)_i + γ(Pπ Vπ)_i,
hence
  Tπ Vπ = Vπ   ⇔   rπ + γPπ Vπ = Vπ.
Also: Vπ = (I − γPπ)^{-1} rπ.
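A minimal sketch of exact policy evaluation through this linear system, assuming the toy arrays P[a, x, y], R[a, x, y] and γ from the earlier sketch; the function name and array layout are illustrative choices.

```python
# Policy evaluation via V_pi = (I - gamma P_pi)^{-1} r_pi on a finite MDP.
import numpy as np

def evaluate_policy(P, R, gamma, pi):
    """pi[x] is the action taken in state x (deterministic stationary policy)."""
    n = P.shape[1]
    P_pi = P[pi, np.arange(n), :]                                 # (P_pi)_{xy} = p(y | x, pi(x))
    r_pi = np.einsum('xy,xy->x', P_pi, R[pi, np.arange(n), :])    # (r_pi)_x = sum_y p(y|x,pi(x)) r(x,pi(x),y)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)        # V_pi = (I - gamma P_pi)^{-1} r_pi

# Example: the policy that always takes action 0.
# V = evaluate_policy(P, R, gamma, pi=np.zeros(3, dtype=int))
```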
Proof of the Fixed Point Equation

Define R_{t:∞} = ∑_{s=0}^∞ γ^s R_{t+s}. Then R_{0:∞} = R0 + γR_{1:∞}.

  Vπ(x) = Eπ[R_{0:∞} | X0 = x]
        = ∑_{y∈X} P(X1 = y | x, π(x)) Eπ[R_{0:∞} | X0 = x, X1 = y]
        = ∑_{y∈X} p(y|x, π(x)) Eπ[R_{0:∞} | X0 = x, X1 = y]
        = ∑_{y∈X} p(y|x, π(x)) Eπ[R0 + γR_{1:∞} | X0 = x, X1 = y]
        = ∑_{y∈X} p(y|x, π(x)) {r(x, π(x), y) + γVπ(y)}
        = (Tπ Vπ)(x).
The Banach Fixed-Point Theorem

Definition (Contraction)
Let T : R^n → R^n. T is a contraction if ∃γ < 1 s.t. for any U, V ∈ R^n,
  ‖TU − TV‖ ≤ γ‖U − V‖.

Theorem (Banach fixed-point theorem)
Let T : R^n → R^n be a contraction with factor γ. Then ∃! V ∈ R^n s.t. TV = V. Further, for any V0 ∈ R^n, the sequence Vk+1 = TVk converges to V and ‖Vk − V‖ = O(γ^k).

Note
All of this holds when (R^n, ‖·‖) is replaced by a Banach space (e.g., (R^X, ‖·‖∞)).
Policy Evaluations are Contractions

Let ‖·‖ = ‖·‖∞.

Theorem
Let π be any stationary policy. Then Tπ is a γ-contraction.

Corollary
Vπ is the unique fixed point of Tπ, and Vk+1 = Tπ Vk → Vπ for any V0 ∈ B(X), with ‖Vk − Vπ‖ = O(γ^k).
The Bellman Optimality Operator

Definition (Bellman optimality operator)
Let T : B(X) → B(X),
  (TV)(x) = max_{a∈A} ∑_{y∈X} p(y|x, a) {r(x, a, y) + γV(y)}.

Definition (Greedy policy)
A policy π : X → A is greedy w.r.t. V if Tπ V = TV, i.e.,
  ∑_{y∈X} p(y|x, π(x)) {r(x, π(x), y) + γV(y)} = max_{a∈A} ∑_{y∈X} p(y|x, a) {r(x, a, y) + γV(y)}.

Proposition
T is a γ-contraction.
Bellman Optimality Equation

Theorem
TV* = V*.

Definition
Let T1, T2 : B(X) → B(X). We say that T1 ≤ T2 if ∀V ∈ B(X), T1 V ≤ T2 V.

Proof.
Let V be the fixed point of T.
For any policy π, Tπ ≤ T, hence Vπ ≤ V; taking the supremum over π gives V* ≤ V.
Let π be greedy w.r.t. V: Tπ V = TV = V, so V is the fixed point of Tπ, i.e., Vπ = V.
Then V = Vπ ≤ sup_π Vπ = V*, hence V = V*.
Value Iteration

Theorem
For any V0 ∈ B(X), let Vk+1 = TVk, k = 0, 1, .... Then Vk → V* and ‖Vk − V*‖ = O(γ^k).

Theorem
Let V ∈ B(X) be arbitrary and let π be greedy w.r.t. V. Then
  ‖Vπ − V*‖ ≤ 2‖TV − V‖ / (1 − γ).

Proof.
‖Vπ − V*‖ ≤ ‖Vπ − V‖ + ‖V − V*‖.
Tπ V = TV ⇒ Vπ − V = Tπ Vπ − Tπ V + TV − V, so ‖Vπ − V‖ ≤ γ‖Vπ − V‖ + ‖TV − V‖.
Similarly, V* − V = TV* − TV + TV − V, so ‖V* − V‖ ≤ γ‖V* − V‖ + ‖TV − V‖.
Combine the three inequalities.
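The theorems above translate directly into the following sketch of value iteration with a greedy policy read off at the end. It assumes the same illustrative array layout P[a, x, y], R[a, x, y] as before; the sup-norm tolerance and iteration cap are arbitrary choices.

```python
# Value iteration V_{k+1} = T V_k on a finite MDP, plus greedy policy extraction.
import numpy as np

def bellman_optimality_operator(V, P, R, gamma):
    """Apply T to V; also return a greedy action in each state."""
    Q = np.einsum('axy,axy->ax', P, R + gamma * V[None, None, :])  # Q(x, a) = sum_y p(y|x,a){r + gamma V(y)}
    return Q.max(axis=0), Q.argmax(axis=0)                         # (T V)(x), greedy action

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    V = np.zeros(P.shape[1])
    for _ in range(max_iter):
        TV, _ = bellman_optimality_operator(V, P, R, gamma)
        if np.max(np.abs(TV - V)) < tol:   # sup-norm stopping rule
            V = TV
            break
        V = TV
    return V, bellman_optimality_operator(V, P, R, gamma)[1]       # V ~ V*, and a greedy policy w.r.t. it
```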
Policy Iteration (Howard, 1960)

Definition
For U, V ∈ B(X) we say that U ≥ V if ∀x ∈ X, U(x) ≥ V(x).
For U, V ∈ B(X) we say that U > V if U ≥ V and ∃x ∈ X s.t. U(x) > V(x).

Theorem (Policy improvement)
Let π be any policy and let π' be greedy w.r.t. Vπ. Then Vπ' ≥ Vπ. If TVπ > Vπ, then Vπ' > Vπ.
Policy Iteration Algorithm

Policy-Iteration(π)
1. V := Vπ
2. do
   1. V' := V
   2. Let π be greedy w.r.t. V
   3. Evaluate π: V := Vπ
3. while (V > V')
4. return π

Theorem
Consider a finite MDP. Policy-Iteration stops after a finite number of steps, returning an optimal policy. Further, it is at least as fast as value iteration.
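A minimal sketch of Howard's policy iteration on a finite MDP, combining exact policy evaluation (the linear system from earlier) with the greedy improvement step. It assumes the toy array layout used above, and it stops when the greedy policy no longer changes, a common stand-in for the V > V' test in the pseudocode.

```python
# Policy iteration: exact evaluation + greedy improvement, until the policy is stable.
import numpy as np

def policy_iteration(P, R, gamma):
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)                 # arbitrary initial (deterministic) policy
    while True:
        # Policy evaluation: V_pi = (I - gamma P_pi)^{-1} r_pi.
        P_pi = P[pi, np.arange(n_states), :]
        r_pi = np.einsum('xy,xy->x', P_pi, R[pi, np.arange(n_states), :])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: a greedy policy w.r.t. V_pi.
        Q = np.einsum('axy,axy->ax', P, R + gamma * V[None, None, :])
        new_pi = Q.argmax(axis=0)
        if np.array_equal(new_pi, pi):                 # greedy policy unchanged: pi is optimal
            return pi, V
        pi = new_pi
```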
Value Iteration on Large State Spaces

Approximate value iteration
  Vk+1 = TVk + εk.

Theorem
Let ε = max_k ‖εk‖. Then
  lim sup_{k→∞} ‖Vk − V*‖ ≤ ε / (1 − γ).

Proof.
Let ak = ‖Vk − V*‖. Then
  ak+1 = ‖Vk+1 − V*‖ = ‖TVk − TV* + εk‖ ≤ γ‖Vk − V*‖ + ε = γak + ε.
Hence, (ak) is bounded. Take lim sup of both sides ⇒ a ≤ γa + ε, and reorder.
Value Iteration on Large State Spaces

Definition (Non-expansion)
A : B(X) → B(X) is a non-expansion if ∀U, V ∈ B(X), ‖AU − AV‖ ≤ ‖U − V‖.

Fitted value iteration [Gordon, 1995]
  Vk+1 = A T Vk.

Theorem
Let U, V ∈ B(X) be such that ATU = U and TV = V. Then
  ‖U − V‖ ≤ ‖AV − V‖ / (1 − γ).

Proof.
Let U' be the fixed point of TA. Then ‖U' − V‖ ≤ γ‖AV − V‖ / (1 − γ).
Since AU' = AT(AU'), AU' is a fixed point of AT, thus U = AU'.
Hence, ‖U − V‖ = ‖AU' − V‖ ≤ ‖AU' − AV‖ + ‖AV − V‖ ≤ ‖U' − V‖ + ‖AV − V‖ ≤ ‖AV − V‖ / (1 − γ).
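A minimal sketch of fitted value iteration Vk+1 = A T Vk where A is a state-aggregation (averaging) operator, a standard example of a non-expansion in the sup norm. The MDP arrays and the grouping of states into clusters are illustrative assumptions, not part of the slides.

```python
# Fitted value iteration with a state-aggregation operator A (an averager, hence a non-expansion).
import numpy as np

def aggregation_operator(groups, n_states):
    """(A V)(x) is the average of V over the group containing x; groups must partition the states."""
    A = np.zeros((n_states, n_states))
    for g in groups:
        A[np.ix_(g, g)] = 1.0 / len(g)
    return A

def fitted_value_iteration(P, R, gamma, groups, n_iter=200):
    n_states = P.shape[1]
    A = aggregation_operator(groups, n_states)
    V = np.zeros(n_states)
    for _ in range(n_iter):
        Q = np.einsum('axy,axy->ax', P, R + gamma * V[None, None, :])
        V = A @ Q.max(axis=0)        # V_{k+1} = A T V_k
    return V

# e.g. groups = [[0, 1], [2]] lumps states 0 and 1 together.
```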
Action-Value Functions

Definition (Action values)
Let A0 = a and, from step 1 on, follow policy π to obtain R0, R1, .... Then
  Qπ(x, a) = E[R_{0:∞} | X0 = x, A0 = a].

Definition (Optimal action-value function)
Q* ∈ B(X × A):
  Q*(x, a) = sup_π Qπ(x, a),   (x, a) ∈ X × A.

Definition (Operators)
Tπ, T : B(X × A) → B(X × A):
  (Tπ Q)(x, a) = ∑_{y∈X} p(y|x, a) {r(x, a, y) + γQ(y, π(y))},
  (T Q)(x, a) = ∑_{y∈X} p(y|x, a) {r(x, a, y) + γ max_{a'∈A} Q(y, a')}.
Action-Value Functions II

Definition (Greedy policy)
A policy π : X → A is greedy w.r.t. Q if Q(x, π(x)) = max_{a∈A} Q(x, a).

Algorithms
- Value iteration for policy evaluation: Qk+1 = Tπ Qk → Qπ
- Value iteration for computing Q*: Qk+1 = T Qk → Q*
- Policy iteration: πk+1 greedy w.r.t. Qπk
- The Policy Iteration Theorem continues to hold.

Note
Bounds for the approximate procedures still hold.
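A minimal sketch of value iteration on action values, Qk+1 = T Qk → Q*, with the greedy policy obtained by an argmax over actions; the array layout P[a, x, y], R[a, x, y] follows the same illustrative convention as before.

```python
# Q-value iteration on a finite MDP; the greedy policy falls out as an argmax over actions.
import numpy as np

def q_value_iteration(P, R, gamma, n_iter=500):
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_actions, n_states))
    for _ in range(n_iter):
        V = Q.max(axis=0)                                               # max_{a'} Q(y, a')
        Q = np.einsum('axy,axy->ax', P, R + gamma * V[None, None, :])   # (T Q)(x, a)
    return Q, Q.argmax(axis=0)                                          # Q* estimate, greedy policy
```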
Regression

Problem
Let (Xi, Yi) ∼ P*(·, ·), Xi ∈ X, Yi ∈ R, i = 1, 2, ..., n.
Find f : X → R s.t.
  L(f) = E[(f(X) − Y)²]
is minimal, where (X, Y) ∼ P*(·, ·).

Theorem (Optimal solution ≡ conditional expectation)
Let f*(x) = E[Y | X = x]. Then for any f, L(f*) ≤ L(f).
Procedures

Definition (Empirical risk functional)
  Ln(f) = (1/n) ∑_{i=1}^n (f(Xi) − Yi)².

Note (Law of large numbers)
Fix f. As n → ∞, Ln(f) → L(f) = E[(f(X) − Y)²].

Empirical risk minimization
Fix F ⊂ R^X, fn := argmin_{f∈F} Ln(f).
Note: if F = R^X, Ln(fn) = 0: "overfitting".

Structural risk minimization
Fix (Fd)_{d∈N}, an increasing sequence of function classes.
  fn = argmin_{f∈Fd, d∈N} Ln(f) + pen(n, d).

Regularization
Fix F ⊂ R^X.  fn = argmin_{f∈F} Ln(f) + λ‖f‖.
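A minimal sketch of the three procedures on simulated data with a polynomial function class. The data-generating distribution, the feature map and the penalty are illustrative assumptions, and the regularized variant uses the common squared-norm (ridge) penalty rather than the λ‖f‖ written above.

```python
# ERM, regularization and SRM for least-squares regression over polynomial classes F_d.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.uniform(-1, 1, size=n)
Y = np.sin(3 * X) + 0.3 * rng.standard_normal(n)       # Y = f*(X) + noise (made-up P*)

def poly_features(x, d):
    return np.vander(x, d + 1, increasing=True)          # F_d: polynomials of degree <= d

def erm(x, y, d):
    Phi = poly_features(x, d)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)           # argmin_{f in F_d} L_n(f)
    return w

def regularized(x, y, d, lam):
    Phi = poly_features(x, d)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d + 1), Phi.T @ y)  # L_n(f) + lam ||w||^2

def srm(x, y, degrees, pen):
    """Pick the degree minimizing empirical risk plus a penalty pen(n, d)."""
    scores = []
    for d in degrees:
        w = erm(x, y, d)
        Ln = np.mean((poly_features(x, d) @ w - y) ** 2)
        scores.append(Ln + pen(len(x), d))
    return degrees[int(np.argmin(scores))]
```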
Issues

Definition (Consistency)
An algorithm A is consistent if for any probability distribution P*, E[L(fn^A)] → L*.

Theorem (Slow rates; Theorem 3.1 of [Györfi et al., 2002])
Pick any consistent A and any sequence (an)_n with an → 0. Then there exists P* s.t. L* = 0 and, asymptotically, E[L(fn^A)] ≥ an.

Theorem (Lower bound, "curse of dimensionality"; Theorem 3.3 of [Györfi et al., 2002])
Let X = [0, 1]^d and let D be the class of (p, C)-smooth distributions over X × R. Then for any A and any slowly decaying sequence (bn)_{n∈N}, bn → 0, there exists a distribution P ∈ D s.t. A on data from P gives, asymptotically,
  E[L(fn^A)] ≥ bn n^{−2p/(2p+d)}.
Bounds

Error decompositions
  L(fn) = Ln(fn) + {L(fn) − Ln(fn)}.
Best regressor in class F: fF* := argmin_{f∈F} L(f).
  L(fn) − L* = {L(fF*) − L*} + {L(fn) − L(fF*)}
  loss to best = approximation error + estimation error.

Estimation error bound:
  E[L(fn)] ≤ Ln(fn) + B(n, F)
Oracle bound:
  E[L(fn)] ≤ inf_{f∈F} L(f) + B(n, F)
Error bound relative to the minimum loss:
  E[L(fn)] ≤ L* + B(n, F)
Empirical Processes

Definition (Loss class)
  L = {ℓf ∈ R^{X×R} : ℓf(x, y) = (f(x) − y)², f ∈ F}.

Empirical measure
  L(f) = E[(f(X) − Y)²] = E[ℓf(X, Y)] = Pℓf.
  Ln(f) = (1/n) ∑_{i=1}^n (f(Xi) − Yi)² = En[ℓf(X, Y)] = Pn ℓf.

Hoeffding bound
Assume ℓf(Xi, Yi) ∈ [a, a + b]. Then, w.p. 1 − δ,
  |Pℓf − Pn ℓf| ≤ b √(log(2/δ) / (2n)).
Empirical Processes II

For any fixed f ∈ F, w.p. 1 − δ,
  Pℓf ≤ Pn ℓf + b √(log(2/δ) / (2n)).

Empirical processes: uniform deviations
  L(fn) − Ln(fn) ≤ sup_{f∈F} (L(f) − Ln(f)).
Let F = {f1, ..., fN}.
When is sup_{f∈F} (L(f) − Ln(f)) large (bad event)? If it is large for some fi:
  P( sup_{f∈F} (L(f) − Ln(f)) > ε ) ≤ ∑_{i=1}^N P( L(fi) − Ln(fi) > ε ) ≤ N exp(−2nε²/b²).
Empirical Processes: Uniform Deviations II

Inversion: for any δ > 0, w.p. 1 − δ, for all f ∈ F simultaneously,
  L(f) ≤ Ln(f) + b √((log N + log(1/δ)) / (2n)).
Estimation error bound:
  L(fn) ≤ L(fF*) + 2b √((log N + log(1/δ)) / (2n)).
Perspective: the key is how Pℓf − Pn ℓf varies as f ranges over F. For infinite F, cover F with balls (plus tons of tricks) to get
  L(f) ≤ Ln(f) + C √((DF + log(1/δ)) / (2n)),
where DF is a characteristic of the size (metric entropy ≈ dimension) of F.
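A small Monte Carlo sanity check of the finite-class bound above, assuming a made-up distribution and a finite class of threshold predictors; the "true" risks are themselves estimated on a large auxiliary sample, so this only illustrates the Hoeffding + union bound + inversion argument.

```python
# Empirically check that sup_{f in F} (L(f) - L_n(f)) exceeds the bound with frequency <= delta.
import numpy as np

rng = np.random.default_rng(0)
n, N, delta, b = 500, 50, 0.05, 1.0                 # losses lie in [0, b]

thresholds = np.linspace(-1, 1, N)                   # F = {f_1, ..., f_N}, f_i(x) = 1{x > t_i}

def losses(x, y):
    """Squared loss of every f in F on the sample; shape (N, len(x))."""
    preds = (x[None, :] > thresholds[:, None]).astype(float)
    return (preds - y[None, :]) ** 2

# "True" risks L(f), estimated once on a very large sample (Monte Carlo approximation).
x_big = rng.uniform(-1, 1, 100_000)
y_big = (x_big > 0).astype(float)
L_true = losses(x_big, y_big).mean(axis=1)

bound = b * np.sqrt((np.log(N) + np.log(1 / delta)) / (2 * n))
exceed = 0
for _ in range(1000):                                # repeat the experiment to estimate the failure rate
    x = rng.uniform(-1, 1, n)
    y = (x > 0).astype(float)
    L_emp = losses(x, y).mean(axis=1)
    exceed += np.max(L_true - L_emp) > bound
print(f"empirical failure rate {exceed / 1000:.3f}  (should be <= delta = {delta})")
```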
What Is It? (Recap)

Protocol of learning
Concepts: Experimenter, Learner, Environment; states, actions, rewards.

One-shot learning:
1. The Experimenter generates training data D = {(X1, A1, X'1, R1), ..., (Xn, An, X'n, Rn)} by following some policy πb in the Environment
2. The Learner computes a policy π based on D
3. The policy is implemented in the Environment (its performance Vπ is compared with that of πb)

Goal: maximize Vπ
Evaluating a Policy with Fitted Value Iteration

Training data
D = {(X1, A1, R1, X2, A2, R2, ..., Xn, An, Rn, Xn+1)}, generated by following some policy πb.

Fact
For all Q ∈ B(X × A) and π : X → A,
  E[ Rt + γQ(Xt+1, π(Xt+1)) | Xt = x, At = a ] = (Tπ Q)(x, a).

Fitted value iteration for policy evaluation
Let Ln(f; Q) = ∑_{t=1}^n {Rt + γQ(Xt+1, π(Xt+1)) − f(Xt, At)}².
1. Pick Q0, m := 0.
2. do
3.   Qm+1 := argmin_{Q∈F} Ln(Q; Qm)
4.   m := m + 1
5. while (‖Qm − Qm−1‖ > ε)
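A minimal sketch of this loop for a linear function class F = {f(x, a) = φ(x, a)ᵀw}, so that each argmin step is an ordinary least-squares problem. The data arrays and the feature map φ are assumptions of the sketch, not part of the slides.

```python
# Fitted value iteration for policy evaluation with a linear function class.
import numpy as np

def fitted_policy_evaluation(states, actions, rewards, next_states, pi, phi, gamma, n_iter=50):
    """states, actions, rewards, next_states: arrays from the batch D; pi(x) is the evaluated
    policy's action in state x; phi(x, a) returns a feature vector. Q_m(x, a) = phi(x, a) . w."""
    Phi = np.array([phi(x, a) for x, a in zip(states, actions)])        # features of (X_t, A_t)
    Phi_next = np.array([phi(xn, pi(xn)) for xn in next_states])        # features of (X_{t+1}, pi(X_{t+1}))
    w = np.zeros(Phi.shape[1])                                          # Q_0 = 0
    for _ in range(n_iter):
        targets = rewards + gamma * Phi_next @ w                        # R_t + gamma Q_m(X_{t+1}, pi(X_{t+1}))
        w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)               # regression step defining Q_{m+1}
    return w
```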
Fitted Value Iteration II

  Ln(f; Q) = ∑_{t=1}^n {Rt + γQ(Xt+1, π(Xt+1)) − f(Xt, At)}²,
  Qm+1 = argmin_{Q∈F} Ln(Q; Qm).

Error analysis
Define εm = Qm+1 − Tπ Qm. Then Qm+1 = Tπ Qm + εm.
Plan:
- Show that εm is small if n is big and F is rich enough.
- Show that ‖QM − Qπ‖²_ν is small if all the errors εm are small ("error propagation").
Error Propagation I

Let Qm+1 = Tπ Qm + εm, and ε−1 = Q0 − Qπ. With Um+1 = Qm+1 − Qπ,
  Um+1 = Qm+1 − Qπ = Tπ Qm − Qπ + εm = Tπ Qm − Tπ Qπ + εm = γPπ Um + εm.
Hence
  UM = ∑_{m=0}^M (γPπ)^{M−m} ε_{m−1}.
Error Propagation II

Notation: for a measure ρ, ρf = ∫ f(x) ρ(dx); (Pf)(x) = ∫ f(y) P(dy|x).

  UM = ∑_{m=0}^M (γPπ)^{M−m} ε_{m−1}.

  µ|UM|² ≤ (1/(1−γ))² · (1−γ)/(1−γ^{M+1}) · ∑_{m=0}^M γ^m µ((Pπ)^m ε_{M−m−1})²
               (Jensen twice)
         ≤ C1 (1/(1−γ))² · (1−γ)/(1−γ^{M+1}) · ∑_{m=0}^M γ^m ν|ε_{M−m−1}|²
               (Jensen for operators; ∀ρ : ρPπ ≤ C1 ν, ν ≤ C1 µ)
         ≤ C1 (1/(1−γ))² · (1−γ)/(1−γ^{M+1}) · ( ε ∑_{m=0}^{M−1} γ^m + γ^M ν|ε_{−1}|² )
               (ε := max_m ν|εm|²)
         ≤ C1 (1/(1−γ))² ε + C1 γ^M ν|ε_{−1}|² / ((1−γ)(1−γ^{M+1})).
Summary

Result: if the regression errors ν|εm|², with εm = Qm+1 − Tπ Qm, are small, the system is noisy (∀ρ : ρPπ ≤ C1 ν), and the test and training distributions are similar (ν ≤ C1 µ), then µ|UM|² is small.

How to make the regression errors small?
Regression error decomposition:
  ‖Qm+1 − Tπ Qm‖₂² = ‖Qm+1 − ΠF Tπ Qm‖₂² + ‖ΠF Tπ Qm − Tπ Qm‖₂².
Bias-variance dilemma:
- If F is big, the estimation error will be big.
- If F is small, the approximation error will be big.
More Algorithms

Algorithms
- Policy iteration
  - Evaluate policies with fitted value iteration
  - Solve the projected fixed-point equation ΠF Tπ Q = Q [Lagoudakis and Parr, 2003, Antos et al., 2008]
  - Modified Bellman-error minimization [Antos et al., 2008]
- Value iteration [Munos and Szepesvári, 2008]

Results
- Consistency
- Oracle inequalities
- Rates of convergence
- Regularization
Conclusions

- Merging regression and RL ⇒ efficient algorithms
- Works in practice! [Ernst et al., 2005, Riedmiller, 2005]
- Works in theory!
- Difficulty: How can you prove performance improvement?
- How to take advantage of other regularities?
  - Factored dynamics
  - Relevant features
  - ...
References
Antos, A., Szepesvári, C., and Munos, R. (2008).
Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single
sample path.
Machine Learning, 71:89–129.
Ernst, D., Geurts, P., and Wehenkel, L. (2005).
Tree-based batch mode reinforcement learning.
Journal of Machine Learning Research, 6:503–556.
Gordon, G. (1995).
Stable function approximation in dynamic programming.
In Prieditis, A. and Russell, S., editors, Proceedings of the Twelfth International Conference on Machine
Learning, pages 261–268, San Francisco, CA. Morgan Kaufmann.
Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002).
A distribution-free theory of nonparametric regression.
Springer-Verlag, New York.
Lagoudakis, M. and Parr, R. (2003).
Least-squares policy iteration.
Journal of Machine Learning Research, 4:1107–1149.
Munos, R. and Szepesvári, C. (2008).
Finite-time bounds for fitted value iteration.
Journal of Machine Learning Research, 9:815–857.
Riedmiller, M. (2005).
Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method.
In 16th European Conference on Machine Learning, pages 317–328.