A Semiparametric Statistics Approach
to Model-Free Policy Evaluation
Tsuyoshi UENO (1), Motoaki KAWANABE (2),
Takeshi MORI (1), Shin-ichi MAEDA (1), Shin ISHII (1),(3)
(1) Kyoto University
(2) Fraunhofer FIRST
1
Summary of This Talk
• We discuss LSTD-based policy evaluation from the
viewpoint of semiparametric statistics and
estimating functions.
1. How good is LSTD?
LSTD is a type of estimating function method, so we can
evaluate its asymptotic estimation variance.
2. Can we improve LSTD?
We derive an optimal estimating function with the minimum
asymptotic estimation variance.
We propose a new policy evaluation algorithm (gLSTD).
2
Model-Free Reinforcement Learning
[Diagram: agent-environment loop. The policy $\pi$ selects an action $a$; the environment returns a reward $r$ and a state $s$.]
Goal: Obtain an optimal policy $\pi^*$
that maximizes the sum of future rewards
3
Policy Iteration [Sutton & Barto, 1998]
Policy Evaluation
(Estimate the value function)
Policy Improvement
(Update the policy)
If the value function can be correctly estimated,
policy iteration converges to the optimal policy $\pi^*$.
Value function estimation is the key to policy iteration!
4
Policy Evaluation Method: LSTD
[Bradtke & Barto, 1996]
• Least Squares Temporal Difference (LSTD)
– LSTD-based policy iteration algorithms have shown good
practical performance.
• Least Squares Policy Iteration (LSPI) [Lagoudakis & Parr, 2003]
• Natural Actor-Critic (NAC) [Peters et al., 2003, 2005]
• Representation Policy Iteration (RPI) [Mahadevan & Maggioni, 2007]
LSTD is one of the important algorithms in the RL field.
5
Least Squares Temporal Difference (LSTD)
• Assumption
$V^\pi(s) := E_\pi\left[ \sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s \right]$
$V^\pi(s_t) := \phi(s_t)^T \theta = \phi_t^T \theta$
($\phi_t$: feature, $\theta$: parameter)
We assume that the linear function ‘completely’ represents the value function.
(There is no bias.)
• Bellman equation
[Bellman, 1966]
$\phi_t^T \theta = E_\pi[r_{t+1} \mid s_t] + \gamma E_\pi[\phi_{t+1} \mid s_t]^T \theta$
6
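Least Squares Temporal Difference (LSTD)
• Linearly approximated Bellman equation
$\{(\phi_t - \gamma\phi_{t+1}) + \gamma(\phi_{t+1} - E_\pi[\phi_{t+1} \mid s_t])\}^T \theta + (r_{t+1} - E_\pi[r_{t+1} \mid s_t]) = r_{t+1}$
Input: $x_t = \phi_t - \gamma\phi_{t+1}$
Parameter: $\theta$
Noise (input): $\gamma(\phi_{t+1} - E_\pi[\phi_{t+1} \mid s_t])$
Noise (observation): $r_{t+1} - E_\pi[r_{t+1} \mid s_t]$
Output: $y_t = r_{t+1}$
The input noise and the observation noise are
mutually dependent!
$x_t^T \theta + e_t = y_t$
Just a linear regression problem
(errors-in-variables problem [Young, 1984])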
7
Linear Regression with Errors in Variables
• Ordinary least squares method (OLS):
$\hat{\theta}_{OLS} = \left[ \sum_{t=1}^{N} x_t x_t^T \right]^{-1} \left[ \sum_{t=1}^{N} x_t y_t \right]$
[Figure: $y$ plotted against $x$, with the OLS fit.]
Because the observation noise depends on
the input variable,
the OLS estimator $\hat{\theta}_{OLS}$ is biased.
8
Instrumental Variable Method
[Söderström and Stoica, 2002]
• Introduce the instrumental variable: $z_t$
$\hat{\theta}_{OLS} = \left[ \sum_{t=1}^{N} x_t x_t^T \right]^{-1} \left[ \sum_{t=1}^{N} x_t y_t \right]$
$\hat{\theta}_{IV} = \left[ \sum_{t=1}^{N} z_t x_t^T \right]^{-1} \left[ \sum_{t=1}^{N} z_t y_t \right]$
The instrumental variable is
correlated with the input $x$ but
uncorrelated with the noise.
[Figure: output $y$ plotted against input $x$.]
$\hat{\theta}_{IV}$ is an asymptotically unbiased estimator.
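The contrast between the OLS and IV estimators can be illustrated with a minimal simulation. The sketch below assumes a hypothetical scalar errors-in-variables model that is not part of the talk: the observed input is x = z + u, and the regression noise contains the term -theta * u, so it depends on the input noise. The function name `simulate` and all noise levels are illustrative choices.

```python
import random

def simulate(n=20000, theta=2.0, seed=0):
    """Return (OLS, IV) estimates of theta for a scalar errors-in-variables model."""
    rng = random.Random(seed)
    sxx = sxy = szx = szy = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)   # instrument: noise-free signal (assumed observable)
        u = rng.gauss(0.0, 1.0)   # input noise
        w = rng.gauss(0.0, 0.1)   # independent observation noise
        x = z + u                 # observed, noisy input
        y = theta * z + w         # equals theta * x + e with e = w - theta * u
        sxx += x * x
        sxy += x * y
        szx += z * x
        szy += z * y
    # OLS: sum(x y) / sum(x x);  IV: sum(z y) / sum(z x)
    return sxy / sxx, szy / szx

theta_ols, theta_iv = simulate()
print(f"OLS: {theta_ols:.3f}  IV: {theta_iv:.3f}  (true value: 2.0)")
```

In this assumed model the OLS estimate concentrates near theta * Var(z) / Var(x) = 1.0 instead of 2.0, while the IV estimate concentrates near the true value.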
9
Least Squares Temporal Difference (LSTD)
• LSTD = Instrumental variable method.
– Instrumental variable: $z_t = \phi_t$
$\hat{\theta}_{LSTD} = \left[ \sum_{t=0}^{N-1} \phi_t (\phi_t - \gamma\phi_{t+1})^T \right]^{-1} \sum_{t=0}^{N-1} \phi_t r_{t+1}$
For example,
$z_t = \phi_t + c, \quad z_t = \phi_{t-k}, \quad \ldots, \quad z_t = a\phi_t + c$
are also instrumental variables.
It is important to choose
an appropriate instrumental variable.
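The LSTD estimator with $z_t = \phi_t$ can be sketched in a few lines. The example below is an illustration under stated assumptions, not the setup used in the talk: it uses a five-state random walk clipped at both ends, one-hot features (so the linear model represents the value function exactly), a reward determined by the state that is entered, and a single long trajectory. The function name `lstd` and all variable names are hypothetical.

```python
import numpy as np

def lstd(phi, phi_next, rewards, gamma):
    """Solve [sum_t phi_t (phi_t - gamma phi_{t+1})^T] theta = sum_t phi_t r_{t+1}."""
    A = phi.T @ (phi - gamma * phi_next)
    b = phi.T @ rewards
    return np.linalg.solve(A, b)

# Assumed setup (illustrative): 5-state random walk, clipped at both ends.
n_states, gamma, n_steps = 5, 0.9, 20000
R = np.array([0.0, 0.0, 0.0, 0.5, 1.0])   # reward received on entering each state
rng = np.random.default_rng(0)

s = np.empty(n_steps + 1, dtype=int)
s[0] = 2
moves = rng.choice([-1, 1], size=n_steps)  # random policy: left or right
for t in range(n_steps):
    s[t + 1] = min(max(s[t] + moves[t], 0), n_states - 1)

Phi = np.eye(n_states)                     # one-hot features (assumption)
theta_hat = lstd(Phi[s[:-1]], Phi[s[1:]], R[s[1:]], gamma)

# Exact value function of the same chain, for comparison.
P = np.zeros((n_states, n_states))
for i in range(n_states):
    for step in (-1, 1):
        P[i, min(max(i + step, 0), n_states - 1)] += 0.5
V_true = np.linalg.solve(np.eye(n_states) - gamma * P, P @ R)

print(np.round(theta_hat, 2))
print(np.round(V_true, 2))
```

With one-hot features the parameter vector equals the vector of state values, so `theta_hat` can be compared with `V_true` entry by entry.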
10
Our Approach
• How good is LSTD ?
We analyze the asymptotic estimation variance
of the instrumental variable method.
• Can we improve LSTD?
We optimize the instrumental variable
so as to minimize the asymptotic estimation variance.
We introduce the viewpoint of semiparametric statistical inference.
11
Semiparametric Statistics Approach
• Semiparametric model:
$p(x; \theta, \kappa)$
– $\theta$ is the target parameter
– $\kappa$ is the nuisance parameter (infinite degrees of freedom)
• Linearly approximated Bellman equation
$x_t = \phi_t - \gamma\phi_{t+1}$
$y_t = x_t^T \theta + e_t$
$y_t = r_{t+1}$
We don’t know the noise distribution.
We need to estimate only the target parameter,
regardless of the nuisance parameters.
12
Inference of Semiparametric Model
• Estimating function [Godambe, 1985]
[Conditions] For any nuisance parameter:
$E_\pi[f(x, y; \theta)] = 0,$
$E_\pi\left[ \frac{\partial}{\partial\theta} f(x, y; \theta) \right] \neq 0,$
$E_\pi\left[ \| f(x, y; \theta) \|^2 \right] < \infty$
• Estimating equation
$\sum_{t=0}^{N-1} f(x_t, y_t; \hat{\theta}) = 0$
$\hat{\theta}$ converges to the true parameter $\theta^*$
regardless of the nuisance parameter.
13
Estimating Functions
• Estimating function = LSTD
$f_{LSTD} = \phi_t \{ (\phi_t - \gamma\phi_{t+1})^T \theta - r_{t+1} \}$
• Estimating function = Instrumental variable method
$f_{IV} = z(s_t, \ldots, s_{t-k}) \{ (\phi_t - \gamma\phi_{t+1})^T \theta - r_{t+1} \}$
where $z(\cdot)$ is the instrumental variable.
Are there any other estimating functions?
14
Are There Any Other Estimating Functions?
No!
Proposition 1
Every admissible estimating function must have the form
$f_{IV} = z(s_t, s_{t-1}, \ldots, s_{t-T}) \{ (\phi_t - \gamma\phi_{t+1})^T \theta - r_{t+1} \}.$
An “inadmissible” estimating function is one for which a superior
estimating function exists.
15
Asymptotic Variance of LSTD-Based Estimators
Lemma 2.
The asymptotic estimation variance of an estimating function for
value functions is given by
$AV[\hat{\theta}] = \frac{1}{N} A^{-1} M (A^T)^{-1}$
where $A = E_\pi[z_t (\phi_t - \gamma\phi_{t+1})^T]$, $M = E_\pi[(e_t^*)^2 z_t z_t^T]$,
and $e_t^* = (\phi_t - \gamma\phi_{t+1})^T \theta^* - r_{t+1}$.
Which instrumental variable achieves
the minimum asymptotic variance?
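The expression in Lemma 2 can be estimated by replacing the expectations in $A$ and $M$ with sample averages. The sketch below is a plug-in illustration under assumptions that are not part of the talk: the data are a hypothetical synthetic errors-in-variables regression, the residual is evaluated at the true parameter (which is known only because the data are simulated), and the function name `asymptotic_variance` is illustrative.

```python
import numpy as np

def asymptotic_variance(z, x, resid):
    """Plug-in estimate of (1/N) A^{-1} M (A^T)^{-1}.

    z:     (N, d) instrumental variables z_t
    x:     (N, d) inputs x_t (phi_t - gamma * phi_{t+1} in the LSTD setting)
    resid: (N,)   residuals e_t = x_t^T theta - y_t
    """
    n = len(resid)
    A = z.T @ x / n                              # sample version of E[z_t x_t^T]
    M = (z * resid[:, None] ** 2).T @ z / n      # sample version of E[e_t^2 z_t z_t^T]
    A_inv = np.linalg.inv(A)
    return A_inv @ M @ A_inv.T / n

# Hypothetical synthetic data: x is a noisy version of the instrument z.
rng = np.random.default_rng(0)
n = 5000
theta_star = np.array([1.0, -0.5])
z = rng.normal(size=(n, 2))
u = rng.normal(size=(n, 2))                      # input noise
x = z + u
y = z @ theta_star + 0.5 * rng.normal(size=n)
resid = x @ theta_star - y                       # residual at the true parameter

AV = asymptotic_variance(z, x, resid)
print(AV)
```

Evaluating the same function for several candidate instruments gives a rough way to compare them, which is the question the lemma raises.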
16
The Optimal Estimating Function
Theorem 1.
The optimal instrumental variable with the minimum
asymptotic variance is given by
$z_t^* = E_\pi[(e_t^*)^2 \mid s_t]^{-1} (\phi_t - \gamma E_\pi[\phi_{t+1} \mid s_t])$
where $e_t^* = (\phi_t - \gamma\phi_{t+1})^T \theta^* - r_{t+1}$.
– $\theta^*$: true parameter (unknown)
– $E_\pi[\,\cdot \mid s_t]$: unknown conditional expectations
Approximation is necessary.
17
gLSTD
The optimal instrumental variable
$z_t^* = E_\pi[(e_t^*)^2 \mid s_t]^{-1} (\phi_t - \gamma E_\pi[\phi_{t+1} \mid s_t])$
(Unknown)
• The residual of the true parameter: $e_t^* \leftarrow \hat{e}_t^{LSTD}$
Replace the regression residual of the true parameter
with that of the LSTD estimator.
• Unknown conditional expectations
$E_\pi[(e_t^*)^2 \mid s_t], \quad E_\pi[\phi_{t+1} \mid s_t]$
Approximate these conditional expectations
by using a sample-based function approximation technique.
18
Summary of gLSTD
1) Calculate the initial estimator and replace the true residual
$\hat{\theta}_{LSTD} \leftarrow \left[ \sum_{t=0}^{N-1} \phi_t (\phi_t - \gamma\phi_{t+1})^T \right]^{-1} \left[ \sum_{t=0}^{N-1} \phi_t r_{t+1} \right]$
$e_t^* \leftarrow \hat{e}_t^{LSTD}$
2) Approximate the conditional expectations
$E_\pi[(e_t^*)^2 \mid s_t], \quad E_\pi[\phi_{t+1} \mid s_t]$
3) Construct the instrumental variable
$\hat{z}_t \leftarrow E_\pi[(e_t^*)^2 \mid s_t]^{-1} (\phi_t - \gamma E_\pi[\phi_{t+1} \mid s_t])$
4) Calculate the gLSTD estimator
$\hat{\theta}_{gLSTD} \leftarrow \left[ \sum_{t=0}^{N-1} \hat{z}_t (\phi_t - \gamma\phi_{t+1})^T \right]^{-1} \left[ \sum_{t=0}^{N-1} \hat{z}_t r_{t+1} \right]$
19
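The four gLSTD steps can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the talk: states are discrete, so the conditional expectations in step 2 are approximated by per-state sample means (one possible sample-based technique); the synthetic chain, the two features, the state-dependent reward noise, and all function names (`iv_solve`, `state_means`, `glstd`) are hypothetical. The expected rewards are constructed so that the linear model represents the value function exactly.

```python
import numpy as np

def iv_solve(z, phi, phi_next, r, gamma):
    """Solve sum_t z_t (phi_t - gamma phi_{t+1})^T theta = sum_t z_t r_{t+1}."""
    return np.linalg.solve(z.T @ (phi - gamma * phi_next), z.T @ r)

def state_means(values, states, n_states):
    """Per-state sample means: a simple estimate of E[values | s_t] for discrete states."""
    sums = np.zeros((n_states,) + values.shape[1:])
    np.add.at(sums, states, values)
    counts = np.maximum(np.bincount(states, minlength=n_states), 1)
    return sums / counts.reshape((-1,) + (1,) * (values.ndim - 1))

def glstd(states, phi, phi_next, r, gamma, n_states):
    theta0 = iv_solve(phi, phi, phi_next, r, gamma)       # 1) LSTD (z_t = phi_t)
    resid = (phi - gamma * phi_next) @ theta0 - r         #    LSTD residuals
    var_s = state_means(resid ** 2, states, n_states)     # 2) E[e^2 | s_t]
    var_s = np.maximum(var_s, 1e-12)                      #    guard against division by zero
    next_s = state_means(phi_next, states, n_states)      # 2) E[phi_{t+1} | s_t]
    z = (phi - gamma * next_s[states]) / var_s[states][:, None]  # 3) instrument
    return iv_solve(z, phi, phi_next, r, gamma)           # 4) gLSTD estimator

# Hypothetical test problem: 5-state random walk, clipped at both ends.
rng = np.random.default_rng(0)
n_states, gamma, n = 5, 0.9, 20000
P = np.zeros((n_states, n_states))
for i in range(n_states):
    for step in (-1, 1):
        P[i, min(max(i + step, 0), n_states - 1)] += 0.5
Phi = np.column_stack([np.ones(n_states), np.arange(n_states) / 4.0])
theta_star = np.array([1.0, 2.0])
r_bar = (np.eye(n_states) - gamma * P) @ Phi @ theta_star  # makes V = Phi theta_star exact
noise_sd = np.linspace(0.1, 1.0, n_states)                 # state-dependent reward noise

s = np.empty(n + 1, dtype=int)
s[0] = 0
moves = rng.choice([-1, 1], size=n)
for t in range(n):
    s[t + 1] = min(max(s[t] + moves[t], 0), n_states - 1)
states, nxt = s[:-1], s[1:]
r = r_bar[states] + noise_sd[states] * rng.normal(size=n)

theta_g = glstd(states, Phi[states], Phi[nxt], r, gamma, n_states)
print(np.round(theta_g, 2), theta_star)
```

For continuous state spaces the per-state means would have to be replaced by some regression method; that choice is left open here, as it is on the slide.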
Simulation (Markov Random Walk)
[Diagram: five-state chain, states 1 to 5, with rewards R=0, R=0, R=0, R=0.5, R=1.0 respectively.]
• Conditions of the simulation experiment
– Policy: Random
– The number of steps: 100
– The number of episodes: 100
– Discount factor: 0.9
• Basis functions:
– We generated three basis functions by the diffusion model
[Mahadevan & Maggioni, 2007].
20
Simulation Result.
[Figure: comparison of LSTD and gLSTD, showing the median and the upper and lower quartiles; the difference is annotated as 20%.]
The gLSTD estimator achieved a 20% smaller
MSE than the LSTD estimator.
21
Conclusion
• We discussed LSTD-based policy evaluation in the
framework of semiparametric statistics.
– We evaluated the asymptotic variance of LSTD-based
estimators.
– We derived the optimal estimating function with the
minimum asymptotic variance and proposed a practical
implementation of it: gLSTD.
– On a simple Markov chain problem, we
demonstrated that gLSTD reduces the estimation variance
of LSTD.
22
Future Work
Application to policy improvement
- Least Squares Policy Iteration (LSPI)
- Natural Actor Critic (NAC) etc.
A Semiparametric Approach to
Model-Free Policy Evaluation
A Semiparametric Approach to
Model-Free Reinforcement Learning
23
End
Thank you for your attention!!
24
Cost Function
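$\hat{\theta}_{LS} = \arg\min_\theta \frac{1}{2} \| \hat{V} - V^* \|^2_{D_r^{LS}}$
$\hat{\theta}_{gLS} = \arg\min_\theta \frac{1}{2} \| \hat{V} - V^* \|^2_{D_r^{gLS}}$
$D_r^{LS} = (I - \gamma P)^T D_r \Phi \Phi^T D_r (I - \gamma P)$
$D_r^{gLS} = (I - \gamma P)^T \Sigma^{-1} D_r \Sigma^{-1} (I - \gamma P)$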
25
Simulation Result
[Figure: five-state chain, states 1 to 5.]
$r = [0, 0, 0, 0, 1.0]$
26
27
28
Questions
1. How good is LSTD?
LSTD is a type of estimating function method, so we can
evaluate its asymptotic estimation variance.
2. Can we improve LSTD?
We derive the optimal estimating function with the
minimum asymptotic estimation variance.
31
The Suboptimal Estimating Function
(LSTDc)
• gLSTD requires estimating
functions that depend on the current state.
• To avoid estimating these functions, we
simply replace them with a constant value $c$:
$z_t = \phi_t + c$
Optimize $c$ to minimize the asymptotic variance.
32
The Suboptimal Estimating Function
(LSTDc)
Theorem 2.
The optimal shift is given by
$c^* = - \dfrac{E_\pi[(e_t^*)^2 \phi_t] - (1 - \gamma) E_\pi[(e_t^*)^2 \phi_t \phi_t^T] \, E_\pi[(\phi_t - \gamma\phi_{t+1}) \phi_t^T]^{-1} E_\pi[\phi_t]}{E_\pi[(e_t^*)^2] - (1 - \gamma) E_\pi[(e_t^*)^2 \phi_t^T] \, E_\pi[(\phi_t - \gamma\phi_{t+1}) \phi_t^T]^{-1} E_\pi[\phi_t]}$
where $e_t^* = (\phi_t - \gamma\phi_{t+1})^T \theta^* - r_{t+1}$.
33
Summary of This Talk
• We introduce a semiparametric statistical
viewpoint for estimating value functions with a
linear model.
• Our aim
– Evaluate the estimation variance of value
functions
– Develop more efficient estimation methods
34
Summary of Our Main Results
1. Formulate the estimation problem of linearly represented
value functions as a semiparametric
inference problem
2. Evaluate the asymptotic variance of value function
estimators
3. Derive the optimal estimation method with the
minimum asymptotic variance
36
Estimating Functions
• Question
Which function is appropriate when more
than one estimating function exists?
• Answer
Choose the estimating function with the
minimum asymptotic variance
$AV[\hat{\theta}] := E\left[ (\hat{\theta} - \theta^*)(\hat{\theta} - \theta^*)^T \right]$
37
Instrumental Variable (IV) Method
• Instrumental variable: $z_t$
– Correlated with the input variable $x_t$,
but uncorrelated with the noise.
$\{x_t + e_{x_t}\}^T \theta + e_{y_t} = y_t$
• Instrumental variable method
$\theta = E[z_t x_t^T]^{-1} E[z_t y_t]$
38
Statistics approach
39
What is the Semiparametric Approach ?
• Semiparametric model: $p(x; \theta, \kappa)$
– Parameter: $\theta$
– Nuisance parameter: $\kappa$
We need to estimate the parameter $\theta$
regardless of the nuisance parameter $\kappa$.
• Estimating function [Godambe, 1985]
[Conditions]
$E[f(x; \theta)] = 0$
See [Godambe, 1985] for details.
– Estimating equation: $\sum_{t=0}^{N-1} f(x_t; \hat{\theta}) = 0$
$\hat{\theta}$ converges to the true parameter $\theta^*$
40