Boundary estimation in the presence of measurement error
with unknown variance
Alois Kneip
University of Bonn
Léopold Simar
and
Ingrid Van Keilegom
Université catholique de Louvain
Brussels, May 25, 2012
Boundary estimation in the presence of measurement error with unknown variance – p. 1/36
Outline of the talk
1. Introduction and motivation
2. Estimation method
3. Extension to covariates
4. Asymptotic results
5. Simulations
6. Data analysis
7. Conclusions
Introduction and motivation (1/3)
Goal : To estimate the boundary of the support, i.e. the (production) frontier ϕ
[Figure : scatter of data points with Inputs ∈ R+ on the horizontal axis and Outputs ∈ R+ on the vertical axis, together with the frontier ϕ.]
Some examples :
⋆ Family farms :
Input : Number of cows, hectares of land, labor, ...
Output : Liters of milk
⋆ Productivity of universities :
Input : Human and financial capital
Output : Number of publications, of PhDs, ...
We restrict attention for the moment to the one-dimensional case (i.e. no inputs).
Introduction and motivation (2/3)
Consider the model
Y = X · Z,
where Y is observed, Y ∼ g,
X is the true unobserved variable of interest, X ∼ f,
Z is the noise, assumed to be independent of X.
Equivalently, we can write Y⋆ = X⋆ + Z⋆, where Y⋆ = log Y, X⋆ = log X and Z⋆ = log Z.
Assume
log Z ∼ N(0, σ²),
where σ² > 0 is an unknown variance, i.e. Z is log-normal.
Suppose Support(X) = [0, τ] with f(τ) > 0 and τ unknown.
Aim of the paper : Estimation of τ (and of σ).
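As an illustration (not from the slides), data from this model can be simulated directly; here U ∼ Exp(β) is used, so that X = τ exp(−U) has support (0, τ], as in the simulation section later. A minimal sketch:

```python
import math
import random

def simulate(n, tau=1.0, beta=1.0, sigma=0.25, seed=0):
    """Draw Y = X * Z with X = tau * exp(-U), U ~ Exp(beta),
    and Z = exp(V), V ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u = rng.expovariate(beta)      # inefficiency U > 0
        v = rng.gauss(0.0, sigma)      # noise V, so Z = exp(V) is log-normal
        x = tau * math.exp(-u)         # true variable, in (0, tau]
        xs.append(x)
        ys.append(x * math.exp(v))     # observed Y = X * Z
    return xs, ys

xs, ys = simulate(1000)
```

Because Z can exceed 1, some Yi fall above τ, which is what makes estimating the boundary nontrivial.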
Introduction and motivation (3/3)
Data : Y1 , . . . , Yn ∼ Y i.i.d.
Literature :
⋄ σ known : extensive literature, see e.g.
* Goldenshluger and Tsybakov (2004)
* Delaigle and Gijbels (2006)
* Meister (2006)
* Aarts, Groeneboom and Jongbloed (2007), among many others
⋄ σ unknown : Hall and Simar (2002) :
* density of Z unknown but symmetric
* σ = σn → 0
Related research : Estimation of f and of σ when f is smooth on (0, ∞)
⋄ Butucea and Matias (2005), Butucea, Matias and Pouet (2008)
⋄ Schwarz and Van Bellegem (2009)
Estimation method (1/6)
Note that the density of Z is given by (for z > 0)
(1/(σz)) φ( (log z)/σ ).
A subscript 0 will be added to indicate the true quantities (like f0, g0, τ0, ...).
It can be shown that for all y > 0 :
g0(y) = (1/(σ0 y)) ∫_0^1 h0(t) φ( (1/σ0) log(y/(tτ0)) ) dt,   (1)
where h0(t) = τ0 f0(tτ0) for 0 ≤ t ≤ 1.
Theorem
There exist a unique σ0 > 0, a unique τ0 > 0 and a unique density h0
such that (1) holds true, i.e. such that the model is identifiable.
Estimation method (2/6)
Main idea :
Penalized profile likelihood maximization involving unknown functions.
Define
g_{h,τ,σ}(y) := (1/(σy)) ∫_0^1 h(t) φ( (1/σ) log(y/(tτ)) ) dt.
Obviously, g0 = g_{h0,τ0,σ0}.
Two steps :
Step 1 : Estimation of h for fixed τ and σ
Step 2 : Estimation of τ0 and σ0
Estimation method (3/6)
Step 1 : Estimation of h for fixed τ and σ
Fix τ > 0 and σ > 0 and let M be a natural number. Let
Γ = { γ = (γ1, . . . , γM) : γk > 0 for all k and Σ_{k=1}^M γk = M },
and define
hγ(t) = γ1 I(t = 0) + Σ_{k=1}^M γk I(q_{k−1} < t ≤ qk)
for 0 ≤ t ≤ 1, where qk = k/M (k = 0, 1, . . . , M).
It is clear that hγ is a density for all γ ∈ Γ. Then,
g_{hγ,τ,σ}(y) = Σ_{k=1}^M γk (1/(σy)) ∫_{q_{k−1}}^{q_k} φ( (1/σ) log(y/(tτ)) ) dt.
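As a numerical sketch (not from the slides), g_{hγ,τ,σ} can be evaluated by simple midpoint quadrature on each bin; the quadrature scheme and the function names are choices of this illustration:

```python
import math

def norm_pdf(u):
    """Standard normal density phi."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def g_mixture(y, gamma, tau, sigma, nquad=200):
    """Evaluate g_{h_gamma,tau,sigma}(y): on each bin (q_{k-1}, q_k],
    approximate the integral of phi((1/sigma) log(y/(t*tau))) dt
    by the midpoint rule with nquad sub-intervals."""
    M = len(gamma)
    total = 0.0
    for k in range(M):
        a, b = k / M, (k + 1) / M          # bin (q_{k-1}, q_k]
        w = (b - a) / nquad
        s = 0.0
        for j in range(nquad):
            t = a + (j + 0.5) * w          # midpoint, always > 0
            s += norm_pdf(math.log(y / (t * tau)) / sigma)
        total += gamma[k] * s * w
    return total / (sigma * y)

# gamma_k = 1 for all k makes h_gamma the uniform density on [0, 1]
value = g_mixture(0.5, [1.0] * 5, 1.0, 0.3)
```

With γk ≡ 1, hγ is the uniform density on [0, 1], which gives a convenient test case.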
Estimation method (4/6)
The estimator of h is now defined as
ĥτ,σ(·) = h_{γ̂τ,σ}(·),
where
γ̂τ,σ = argmax_{γ∈Γ} { n^{−1} Σ_{i=1}^n log g_{hγ,τ,σ}(Yi) − λ pen(g_{hγ,τ,σ}) },
where λ ≥ 0 is a fixed value independent of n, and where
pen(g_{hγ,τ,σ}) = max_{3≤j≤M} |γj − 2γ_{j−1} + γ_{j−2}|.
Estimation method (5/6)
Step 2 : Estimation of τ0 and σ0
Let
(τ̂, σ̂) = argmax_{τ>0, σ>0} { n^{−1} Σ_{i=1}^n log g_{ĥτ,σ,τ,σ}(Yi) − λ pen(g_{ĥτ,σ,τ,σ}) }.
Moreover, ĥ := ĥτ̂,σ̂ estimates h0, and ĝ := g_{ĥ,τ̂,σ̂} estimates g0.
Note 1 :
⋄ The estimation procedure can equivalently be carried out in one step,
by maximizing jointly over γ, τ and σ .
⋄ When M = Mn → ∞, we get a sieve maximization procedure.
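Purely as an illustration of the one-step idea, and with a substantial simplification (h is frozen at the uniform density, i.e. γk ≡ 1, so the inner maximization over γ is skipped and only (τ, σ) are searched over a grid), an unpenalized sketch might look like this; for uniform h a standard log-normal computation gives g in closed form:

```python
import math
import random

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def g_uniform(y, tau, sigma):
    """Density of Y = X*Z, X ~ Uniform(0, tau], log Z ~ N(0, sigma^2);
    for uniform h the t-integral has a closed form."""
    return (math.exp(0.5 * sigma * sigma) / tau
            * norm_cdf((math.log(tau / y) - sigma * sigma) / sigma))

def fit(ys, taus, sigmas):
    """Grid search maximizing the log-likelihood sum
    (the n^{-1} factor does not change the argmax)."""
    best, best_ll = None, float("-inf")
    for tau in taus:
        for sigma in sigmas:
            ll = 0.0
            for y in ys:
                p = g_uniform(y, tau, sigma)
                ll += math.log(p) if p > 0.0 else -750.0  # underflow guard
            if ll > best_ll:
                best, best_ll = (tau, sigma), ll
    return best

# simulate: tau = 1, uniform X (i.e. beta = 1), sigma = 0.1
rng = random.Random(1)
ys = [(1.0 - rng.random()) * math.exp(rng.gauss(0.0, 0.1)) for _ in range(2000)]
tau_hat, sigma_hat = fit(ys,
                         [0.70 + 0.05 * i for i in range(13)],   # 0.70..1.30
                         [0.02 + 0.02 * i for i in range(10)])   # 0.02..0.20
```

This recovers (τ, σ) only approximately; the slides' actual procedure also optimizes over γ and adds the penalty.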
Estimation method (6/6)
Note 2 :
⋄ λ can be taken equal to 0
⇒ Both penalized and non-penalized estimators are considered
But : the penalized estimator attains a better rate of convergence.
⋄ λ is chosen independent of n
⋄ Choice of penalty function : motivated by the proof (see later)
Note 3 :
For the asymptotic results we will need to be a bit more precise regarding
the space to which σ , τ and h belong (see below).
Extension to covariates (inputs) (1/4)
[Figure : the motivating picture again : scatter of data points with Inputs ∈ R+ on the horizontal axis and Outputs ∈ R+ on the vertical axis, together with the (production) frontier ϕ.]
Extension to covariates (2/4)
Consider the model
Y = ϕ(W) exp(−U) exp(V),   (2)
where V ∼ N(0, σ²(W)),
U > 0 has a jump at the origin and its distribution (possibly) depends on W,
U and V are independent given W,
and only W and Y are observed.
Equivalently, log Y = log ϕ(W) − U + V.
Note that
⋄ If ϕ ≡ τ is constant, then the model can be written as Y = X · Z ,
where X = τ exp(−U ) and Z = exp(V )
⇒ Model (2) extends our previous model to covariates.
⋄ U represents the inefficiency, V represents the error.
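A quick sketch (illustration only) of simulating from model (2), with the design borrowed from the simulation section later (ϕ(w) = w^{1/2}, W uniform, U exponential); the value σ = 0.2 here is an arbitrary choice:

```python
import math
import random

def simulate_frontier(n, beta=3.0, sigma=0.2, seed=0):
    """Draw (W, Y) from Y = phi(W) exp(-U) exp(V), phi(w) = sqrt(w),
    with W ~ Uniform[0, 1], U ~ Exp(beta) and V ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        w = rng.random()
        u = rng.expovariate(beta)        # inefficiency, > 0
        v = rng.gauss(0.0, sigma)        # noise
        data.append((w, math.sqrt(w) * math.exp(v - u)))
    return data

data = simulate_frontier(500)
```

With σ = 0 all points lie on or below the frontier ϕ(w) = √w; the noise V is what pushes observations above it.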
Extension to covariates (3/4)
References :
⋄ Fully parametric approach (ϕ, fU and fV parametric) :
many papers; see Greene (2008) for a survey
⋄ Semiparametric approach (ϕ nonparametric, fU and fV parametric) :
see e.g. Fan et al (1996), Kumbhakar et al (2007)
Our goal :
ϕ and fU nonparametric, fV normal but with unknown variance.
But :
Dropping parametric assumptions on the distribution of U greatly complicates the problem and forces us to develop completely new methods.
Extension to covariates (4/4)
Let (W1, Y1), . . . , (Wn, Yn) ∼ (W, Y) i.i.d.
Fix w0 in the support of W and define
(τ̂(w0), σ̂(w0), γ̂(w0)) = argmax_{τ>0, σ>0, γ∈Γ} n_b^{−1} Σ_{i : ‖Wi−w0‖₂ ≤ b} { log g_{hγ,τ,σ}(Yi) − λ pen(g_{hγ,τ,σ}) },
where b is a bandwidth, n_b := Σ_{i=1}^n I{‖Wi − w0‖₂ ≤ b}, and
pen(g_{hγ,τ,σ}) = max_{3≤j≤M} |γj − 2γ_{j−1} + γ_{j−2}|.
This ‘local constant’ estimator can be improved to a ‘local linear’ estimator (details omitted).
Asymptotic results (1/5)
Consider the case without covariates, and assume that
(A1) For some 0 < σl < σu < ∞, 0 < τl < τu < ∞, 0 < hl < hu < ∞ and 0 < δ < 1, the estimators (ĝ, τ̂, σ̂) are determined by maximizing over all
(hγ, τ, σ) ∈ Hn × [τl, τu] × [σl, σu],
where Hn ⊂ H_{hl,hu,δ}, and
H_{hl,hu,δ} = { h : h is a square-integrable density with support [0, 1] satisfying sup_t h(t) ≤ hu and inf_{1−δ≤t≤1} h(t) ≥ hl }.
(A2) h0 ∈ H_{hl,hu,δ} and is twice continuously differentiable, τ0 ∈ [τl, τu], and σ0 ∈ [σl, σu].
(A3) For some 0 < β < 1/5, M = Mn ∼ n^β as n tends to ∞.
Asymptotic results (2/5)
(A4) For some A > √2, P( log Y < −A (log n)^{1/2} σ0 ) = o(n^{−1}).
Remark. Note that (A4) is satisfied when e.g. h0 ≡ 0 on [0, ε] for some ε > 0.
For two arbitrary densities g1 and g2, let
H²(g1, g2) = (1/2) ∫ ( √g1(y) − √g2(y) )² dy
be the Hellinger distance between g1 and g2.
Theorem 1. Assume (A1)-(A4). Then, if λ ≥ 0,
H(ĝ, g0) = OP(Mn^{−2}),
and if λ > 0,
pen(ĝ) = OP(Mn^{−2}).
Asymptotic results (3/5)
Theorem 2. Assume (A1)-(A4). Then,
a) If λ = 0 (i.e. without penalization),
σ̂ − σ0 = OP((log n)^{−1}),  τ̂ − τ0 = OP((log n)^{−1/2}).
b) If λ > 0 (i.e. with penalization),
σ̂ − σ0 = OP((log n)^{−2}),  τ̂ − τ0 = OP((log n)^{−3/2}),  ĥ(1) − h0(1) = OP((log n)^{−1}).
Asymptotic results (4/5)
Remark. Instead of using a histogram estimator for h0, one could use suitable spline estimators to approximate h0.
We have shown that if h0 is m-times continuously differentiable for some m > 2, then
σ̂ − σ0 = OP((log n)^{−(1+m/2)}),  τ̂ − τ0 = OP((log n)^{−(m+1)/2}),  ĥ(1) − h0(1) = OP((log n)^{−m/2}),
as long as ĝ = g_{ĥ,τ̂,σ̂} (obtained with splines or another approximation method) satisfies
H(ĝ, g0) = OP(n^{−κ})
for some κ > 0.
Asymptotic results (5/5)
Some remarks and thoughts
⋄ Extension to covariates
⋄ Choice of M and λ
⋄ The proofs ...
⋄ What changes if the error is not normal ?
Simulations (1/11)
Recall that
Y = τ exp(−U) exp(V), where U > 0 and V ∼ N(0, σ²),
or equivalently Y = X · Z, with X = τ exp(−U) and Z = exp(V).
Suppose that U ∼ Exp(β). Then, the density of X can be written as
f(x) = (β/τ^β) x^{β−1} I(0 ≤ x ≤ τ).
Let
⋄ β = 1 and β = 2
⋄ τ =1
⋄ σ = σV = ρσU with ρ = 0, 0.01, 0.05, 0.25, 0.50, 0.75.
⋄ n = 100
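Since F(x) = (x/τ)^β on [0, τ], X can be sampled by inverse transform as X = τ U^{1/β} with U uniform (equivalently X = τ e^{−E}, E ∼ Exp(β)). A small sketch (function name is an illustration):

```python
import random

def sample_x(n, tau=1.0, beta=2.0, seed=0):
    """Inverse-transform sampling from f(x) = (beta / tau**beta) x^(beta-1)
    on [0, tau]: F(x) = (x/tau)**beta, so F^{-1}(u) = tau * u**(1/beta)."""
    rng = random.Random(seed)
    return [tau * rng.random() ** (1.0 / beta) for _ in range(n)]

xs = sample_x(100_000)
mean = sum(xs) / len(xs)   # E[X] = tau * beta / (beta + 1), here 2/3
```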
Simulations (2/11)
Density of X when U ∼ Exp(β)
[Figure : two panels showing the density of X on [0, 1]; left : "Inefficiency U is Exponential with β = 1", right : "Inefficiency U is Exponential with β = 2".]
Simulations (3/11)
Consider
⋄ 500 replications of each experiment
⋄ Choice of λ : minimization of RMSE(τ̂) + RMSE(σ̂) for log10 λ = −4, −3, −2, −1, 0, 1, 2, 3, 4
⋄ Choice of M : let M = max(3, round(c × n^{1/5})) (rule of thumb).
We fix c = 2. Very similar results were obtained with c = 3 (and even with c = 1, but there the number of bins was very small).
For n = 100 we have M = 5.
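In code the rule of thumb is a one-liner; the rounding convention below is chosen so that n = 100 indeed gives M = 5:

```python
def choose_m(n, c=2):
    """Rule-of-thumb number of bins: M = max(3, round(c * n**(1/5)))."""
    return max(3, round(c * n ** 0.2))
```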
Simulations (4/11)
Case 1 : β = 1

ρ            0          0.05      0.25      0.75
log10 λ      -3         -2        -1        1
τ̂   RMSE     0.0138     0.0370    0.0988    0.0872
     BIAS    -0.0098    -0.0067   -0.0251   -0.0460
     STD     0.0098     0.0365    0.0956    0.0742
σ̂   RMSE     0.39e-04   0.0350    0.0840    0.1495
     BIAS    0.13e-04   -0.0121   0.0182    0.1153
     STD     0.37e-04   0.0328    0.0821    0.0952
Simulations (5/11)
Case 2 : β = 2

ρ            0          0.05      0.25      0.75
log10 λ      1          -2        -1        -1
τ̂   RMSE     0.0066     0.0178    0.0352    0.0750
     BIAS    -0.0042    -0.0019   0.0020    0.0250
     STD     0.0050     0.0177    0.0351    0.0708
σ̂   RMSE     0.45e-03   0.0190    0.0332    0.0544
     BIAS    0.42e-03   -0.0054   0.0049    -0.0090
     STD     0.17e-03   0.0182    0.0329    0.0537
Simulations (6/11)
Results of the simulations
⋄ When ρ increases the performance deteriorates.
⋄ Selection of λ : does not seem to be crucial. In practice, we suggest using a bootstrap procedure to estimate the RMSE of the estimators as a tool for selecting λ.
⋄ Performance for β = 2 is better than for β = 1.
⋄ Rule of thumb for selecting M (number of bins) seems to work well.
⋄ Simulations not shown here demonstrate that
  ⋄ performance improves when n increases;
  ⋄ the method also works when the jump size is rather small (truncated normal).
Simulations (7/11)
Density of X when U ∼ N⁺(α, β²)
[Figure : two panels showing the density of X on [0, 1]; left : "Inefficiency U is Truncated Normal with α = 0 and β = 0.8", right : "Inefficiency U is Truncated Normal with α = 0.6 and β = 0.6".]
Simulations (8/11)
Extension to covariates :
Recall the model :
Y = ϕ(W ) exp(−U ) exp(V ),
and consider
⋄ W ∼ Un[0, 1], U ∼ Exp(3), V ∼ N(0, (.667)²)
⋄ ϕ(w) = w^{1/2}
⋄ W, U and V are independent
⋄ w = 0.25, 0.50 and 0.75
⋄ b = 1.06 min(s, r/1.349) n^{−1/5} (normal reference rule)
⋄ λ computed from B = 200 loops
⋄ 500 Monte-Carlo loops
Then, ρ = σV /σU = 0.20.
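The normal reference rule can be coded directly, with s the sample standard deviation and r the interquartile range; the quantile convention below is an implementation choice of this sketch:

```python
import random
import statistics

def normal_reference_bandwidth(ws):
    """b = 1.06 * min(s, r / 1.349) * n**(-1/5), with s the sample standard
    deviation and r the interquartile range of the W_i."""
    n = len(ws)
    s = statistics.stdev(ws)
    q1, _, q3 = statistics.quantiles(ws, n=4)   # quartiles
    return 1.06 * min(s, (q3 - q1) / 1.349) * n ** (-0.2)

rng = random.Random(0)
ws = [rng.gauss(0.0, 1.0) for _ in range(1000)]
b = normal_reference_bandwidth(ws)
```

For standard normal data both s and r/1.349 are close to 1, so b is close to 1.06 · n^{−1/5}.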
Simulations (9/11)
Frontier estimation for one sample
[Figure : two panels (n = 100 and n = 500) showing the data points, the breakpoints and the true frontier on [0, 1].]
Simulations (10/11)
                      n = 100                                      n = 500
           τ(w) new     τ(w) HS     σ(w) new       τ(w) new     τ(w) HS     σ(w) new
           BIAS  MSE    BIAS  MSE   BIAS   MSE     BIAS  MSE    BIAS  MSE   BIAS   MSE
w = 0.25   .0034 .0004  .0048 .0014 -.0071 .0003   .0083 .0002  .0009 .0005 -.0053 .0001
w = 0.50   .0046 .0008  .0017 .0025 -.0031 .0003   .0105 .0004  .0041 .0007 -.0033 .0001
w = 0.75  -.0030 .0014 -.0003 .0028  .0009 .0003   .0064 .0004  .0041 .0011 -.0005 .0001
HS = Hall-Simar (2002)
Simulations (11/11)
Results
⋄ Choice of λ does not seem to be crucial.
⋄ For τ (w) : MSE much better than in HS :
⋄ For n = 100 : MSE is 30% of the MSE in HS (50% when w = 0.75).
⋄ For n = 500 : MSE is less than 50% of the MSE in HS for all cases.
⋄ For σ(w) : MSE good, but no comparison with HS is possible.
Data analysis (1/3)
Data from Christensen and Greene (1976) concerning 123 American
electricity utility companies.
For each company, we observe
⋄ Y = production output
⋄ W = total cost involved in production
Frontier function = curve of most efficient companies
Data analysis (2/3)
[Figure : "Electric Utilities" : scatter of Log(Cost) (horizontal axis, 1 to 6) against Log(Production) (vertical axis, 4 to 12), with data points, breakpoints and the smoothed frontier.]
Data analysis (3/3)
values of w    1.50    1.85    2.20    2.55    2.90    3.25    3.60    3.95    4.30    4.65    5.00
nb(w)          10      17      26      28      37      43      38      32      24      21      15
τ̂(w)           7.15    7.41    7.79    8.17    8.58    8.98    9.33    9.49    9.86    10.19   10.47
σ̂(w)           0.0342  0.0404  0.0357  0.0257  0.0193  0.0246  0.0226  0.0232  0.0227  0.0201  0.0123
Ê(U | W = w)   0.0769  0.0607  0.0463  0.0504  0.0504  0.0464  0.0404  0.0255  0.0278  0.0294  0.0278
Conclusions
⋄ We considered the model Y = X · Z , where Y is observed, X is the
variable of interest with support [0, τ ] and Z is the noise.
⋄ We supposed that f0 (τ0 ) > 0 and that Z is independent of X and is
log-normal with unknown variance σ02 .
⋄ We showed that the model is identifiable.
⋄ We proposed estimators for τ0 and σ0 and proved their consistency
and rate of convergence.
⋄ We extended the method to covariates (inputs).
⋄ We showed that the estimators work well for small n, both with and
without covariates.
⋄ We applied the method to data on efficiency of electricity companies.