3 Ratio and regression estimators 3.1 Motivating examples

STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
3
3.1
Ratio and regression estimators
Motivating examples
Frequently, we are interested in measuring the ratio of a matched pair
of variables. This occurs when the sampling unit comprises a group or
cluster of individuals, and our interest is in the population mean per
individual.
For example, to estimate average income/adult in the population in a
household survey, we record for the ith household (i = 1, · · · , n) the
number of adults who live there, xi, and the household income, yi.
Then the parameter, average income per adult in the population,
N
P
household income
= i=1
R=
N
total no. of adults P
Yi
Xi
i=1
can be estimated by the ratio estimator
n
P
b = r = i=1
R
n
P
yi
xi
ȳ
= .
x̄
i=1
Relationship between estimates
Ratio
Mean
Total
×X
×N
R
−→ Y −→
R
−→
×X
SydU STAT3014 (2015) Second semester
Y
Y
Dr. J. Chan
34
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
3.2
Two characteristics per unit in SRS
Theorem: If Xi and Yi are a pair of numerical characteristics defined
on every unit of the population, and ȳ and x̄ are the corresponding
means from a SRS without replacement of size n , then
"P
#
N
n 1
n Sxy
i=1 (Yi − Ȳ )(Xi − X̄)
Cov (x̄, ȳ) = 1 −
= 1−
N n
N −1
N n
(1)
and
PN
Pn
(y
−
ȳ)(x
−
x̄)
i
i=1 i
i=1 (Yi − Ȳ )(Xi − X̄)
E
.
(2)
=
n−1
N −1
Proof. Consider Ui = Xi + Yi and the corresponding sample values are
ui = xi + yi. Clearly
"P
#
N
2
n 1
U
i=1 (Xi − X̄ + Yi − Ȳ )
Var (ū) = 1 −
= 1−
N n
N n
N −1
"P
#
PN
PN
N
2
2
(X
−
X̄)
+
(Y
−
Ȳ
)
+
2
(X
−
X̄)(Y
−
Ȳ
)
n 1
i
i
i
i=1
i=1 i
i=1
= 1−
N n
N −1
"P
#
N
2
n
i=1 (Xi − X̄)(Yi − Ȳ )
= Var (x̄) + Var (ȳ) +
1−
.
n
N −1
N
n S2
Since Var (ū) = Var (x̄ + ȳ) = Var (x̄) + Var (ȳ) + 2Cov(x̄, ȳ), (1) is
proved. (2) can be proved in a similar way.
Theorem: For large sample,
(a) E(r) − R ≈ 0, approximately unbiased,
"P
#
N
2
1
n 1
1 n Sr2
i=1 (Yi − RXi )
(b) Var(r) ≈ 2 1 −
= 2 1−
.
N n
N −1
N n
X̄
X̄
SydU STAT3014 (2015) Second semester
Dr. J. Chan
35
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
Proof:
(a) Recall E(ȳ) = Ȳ , E(x̄) = X̄ and Var(x̄) = O(n−1) (order of n−1).
Thus for large sample,
ȳ E(ȳ)
E(r) = E
= R.
≈
x̄
X̄
(b) Note that
r−R=
ȳ − Rx̄
ȳ
−R≈
.
x̄
X̄
Thus, for large sample,
¯
1
E(d¯2) Var(d)
2
=
Var(r) = E[(r − R) ] ≈ 2 E[(ȳ − Rx̄) ] =
X̄
X̄ 2
X̄ 2
2
where d¯ = ȳ−Rx̄ is the sample mean of di = yi −Rxi, i = 1, · · · , n,
drawn from the population of Di = Yi − RXi, i = 1, · · · , N with
¯ = E(ȳ − Rx̄) = E(ȳ) − RE(x̄) = Ȳ − RX̄ = Ȳ −
E(d)
For a SRS of di,
n Sr2
¯
Var(d) = 1 −
N n
where
N
Hence
Ȳ
X̄ = 0.
X̄
N
1 X
1 X
2
2
Sr =
(Di − D̄) =
(Yi − RXi)2.
N − 1 i=1
N − 1 i=1
"
#
N
X
1
n 1
1
1 n Sr2
2
Var(r) ≈ 2 1 −
(Yi − RXi) = 2 1 −
,
N
n
N
−
1
N
n
X̄
X̄
i=1
"
#
n
X
1
n 1
1
1 n s2r
2
var(r) ≈ 2 1 −
(yi − rxi) = 2 1 −
N n n − 1 i=1
N n
X̄
X̄
SydU STAT3014 (2015) Second semester
Dr. J. Chan
36
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
2
1. Ordinary: x not related to y Yb̄ = ȳ & var(Yb̄ ) = 1 − n sy
N n
y6
yi s
Solid line: yi − ȳ
ȳ
s
s
s
s2y
s
=
2
i (yi −ȳ)
P
n−1
-
x
2
2. Ratio: x positively related to y Yb̄ r = ȳ Xx̄ & var(Yb̄ r ) = 1 − Nn snr
y6
yi s Solid line: z = y − rx
i
i
i
rxi
s
rX
s
Ȳ 2 P z 2
s
i i
sr = n−1
< s2y
6
s
y = rx
= s2y − 2rρ̂sxsy + r2s2x
(a = 0,b = r)
- x
X
Calculation of s2r :
n
1 X
2
sr =
(yi − rxi)2
n − 1 i=1
n
ȳ
1 X
=
[(yi − ȳ) − r(xi − x̄)]2 since ȳ − rx̄ = ȳ − x̄ = 0
n − 1 i=1
x̄
#
" n
n
n
X
X
X
1
=
(yi − ȳ)2 − 2r
(xi − x̄)(yi − ȳ) + r2
(xi − x̄)2
n − 1 i=1
i=1
i=1
= s2y − 2r sxy + r2 s2x = s2y − 2r ρ̂sxsy + r2 s2x
or s2r =
1
n−1
n
X
i=1
yi2 − 2r
n
X
xi yi + r 2
i=1
SydU STAT3014 (2015) Second semester
n
X
!
x2i .
i=1
Dr. J. Chan
37
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
Remark:
1. If Xi and Yi are positively related, we have s2r s2y . Hence Xi can be
used as an auxiliary variable which provides additional information
and hence improves the precision of the estimate Ȳ .
2. When X is replaced by x if it is unknown, ordinary estimator results.
3. When ratio estimation is used, estimates of variance and sample size
are quite sensitive to data points that do not fit the ideal pattern
called influential observation. It is important to plot the data and
look for these unusual data points before proceeding with an analysis.
b = y is biased and can be almost unbiased
4. The ‘ratio of means’ R
x
if n is large. Another ratio estimator is the ‘mean of ratios’
n
N
P
P
yi
yi
yi
1
1
∗
∗
∗
∗
b
=
R = r = n
where
r
is
unbiased
for
R
=
i
xi
xi
N
xi .
i=1
i=1
b∗ gives equal weight to each cluster which may vary greatly
However R
b∗ , R
b is weighed by the cluster size which is an
in size. Unlike R
b∗ .
advantage over R
SydU STAT3014 (2015) Second semester
Dr. J. Chan
38
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
3.3
Ratio estimate for population mean and total
The ratio estimator of the population total Y is
ȳ
Ybr = X = rX
x̄
Similarly, the ratio estimator of population mean is
ȳ
Yb̄ r = X̄ = rX̄
x̄
These ratio estimates use extra information of xi, i = 1, · · · , n and
the true total and mean X or X̄, thus improving the precision of ratio
estimates over the ordinary estimates Yb = N ȳ and Yb̄ = ȳ respectively.
From the previous result,
(a) E(Yb̄ r ) = X̄E(r) ≈ X̄R = Ȳ .
Similarly E(Ybr ) = XE(r) ≈ XR = Y .
S2
n
n Sr2
r
2
b̄
(b) Since Var(Y r ) ≈ 1 −
and Var(Ybr ) ≈ N
1−
,
N n
N n
s2
n
n s2r
r
2
b̄
and var(Ybr ) = N 1 −
.
var(Y r ) = 1 −
N n
N n
The estimator r for R is generally biased , so Ybr and Yb̄ r are also
biased for Y and Ȳ respectively.
Bias:
ȳ x̄ − E(r)E(x̄)
Cov(r, x̄) = E(rx̄) − E(r)E(x̄) = E
x̄
so
E(r) =
E(ȳ) Cov(r, x̄)
ρr,x̄ σr σx̄
−
=R−
.
E(x̄)
E(x̄)
X̄
SydU STAT3014 (2015) Second semester
Dr. J. Chan
39
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
Therefore for any ratio estimates,
ρr,x̄ σx̄
σx̄
|bias r| |R − E(r)|
=
=
≤
= cv(x̄)
σr
σr
X̄
X̄
(3)
b = r is small
since |ρr,x̄| ≤ 1. Thus if the CV(x̄) is small, the bias of R
b But if n is small, the bias can be large.
relative to SE(r) of R.
Efficiency:
The ratio estimator is more efficient than the ordinary estimator, that is
var(Yb ) > var(Yb r ), if
cv(x)
(4)
ρ̂ >
2cv(y)
where cv(y) is the sample cv for Y defined as
sy
cv(y) = .
y
Then
var(Yb ) − var(Yb r ) > 0 ⇒
⇒
⇒
⇒
⇒
n 1 2
[sy − s2r ] > 0
1−
N n
2
[sy − (s2y − 2rρ̂sxsy + r2s2x)] > 0
rsx(2ρ̂sy − rsx) > 0
2ρ̂sy − rsx > 0 since r > 0 & sx > 0
y sx
cv(x)
y
ρ̂ >
=
since r =
x 2sy 2cv(y)
x
and the equality holds when ρ̂ =
cv(x)
.
2cv(y)
SydU STAT3014 (2015) Second semester
Dr. J. Chan
40
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
Example: (7-11) The manager of 7-11 is interested in estimating the
total sale in thousands for all of its 300 branches. From last year record,
the total sale in thousands for all the 300 branches is 21300. Careful
check of this year records are obtained for a SRS of 15 branches with the
following results:
Branch Last year sale x This year sale y
1
50
56
2
35
48
3
12
22
4
10
14
5
15
18
6
30
26
7
9
11
8
25
30
n
X
xi = 926,
i=1
n
X
x2i
= 117400,
i=1
s2y
n
X
Branch Last year sale x This year sale y
9
100
165
10
250
409
11
50
73
12
50
70
13
150
95
14
100
55
15
40
83
yi = 1175,
i=1
n
X
yi2
i=1
= 231815,
n
X
xi yi = 155753
i=1
= 9983.81
The ordinary estimate of the total sale this year in thousands is
1175
Yb = N y = 300
= 23500
15
with
r
se(Yb ) = N
(1 −
s2y
s
n
) = 300
N n
1−
15
300
9983.81
= 7543.72.
15
The ratio estimate and its se for the total sale this year in thousands are
1175
= 27027.54
Ybr = Xr = 21300
926
SydU STAT3014 (2015) Second semester
Dr. J. Chan
41
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
v
!
u
n
n
n
1 1
X
X
X
u
n
se(Ybr ) = N t 1 −
y 2 − 2r
xi yi + r2
x2i
N n n − 1 i=1 i
i=1
i=1
s
15
1175
1175 2
1
= 300
1−
231815 − 2 ·
· 155753 + (
) · 117400
300 15 × 14
926
926
= 3226.66
which is much smaller than se(Yb ) = 7543.72 thousands.
Read Tutorial 11 Q2a,b, Q3a,b.
3.4
Regression estimator
b = X y , the line y = mx with slope m = y passes
Since Yb r = X R
x
x
b
through the origin (0, 0) and (X, Y r ). However, the linear relationship
between X and Y may not pass through the origin. A more general
estimator, the regression estimator fits a regression line:
y = A + Bx = y − Bx + Bx = y + B(x − x)
(5)
to the sample data where the least square estimate of B is
PN
PN
SSxy
(yi − Y )(xi − X)
Sxy ρSy
i=1 xi yi − N XY
= i=1
= P
=
=
.
B=
PN
2
2
N
2
2
SSxx
S
S
x
x
i=1 (xi − X)
i=1 xi − N X
and A = y − Bx.
Note: Cov(X, Y ) = Sxy = SSxy /(N − 1), Var(X) = Sx2 = SSxx/(N − 1),
cov(X, Y ) = sxy = ssxy /(n − 1) and var(X) = s2x = ssxx/(n − 1).
Then the regression estimator of the population mean Y is to substitute
x = X to (5) to obtain
Yb reg = y + b(X − x)
SydU STAT3014 (2015) Second semester
Dr. J. Chan
42
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
where
b=
ssxy
ssxx
Pn
Pn
sxy
(y − y)(xi − x)
i=1 xi yi − nxy
Pn i
P
=
= i=1
=
.
n
2 − nx2
2
2
(x
−
x)
s
x
i
x
i=1
i=1 i
(6)
Since
Yb reg = y + b(X − x) ' y + B(X − x) = z 0
the sample mean of the variable zi0 = yi + B(X − xi), we have
E(Yb reg ) ' E[y +B(X −x)] = E(y)+B[X −E(x)] = Y Approx. unbiased
and
Var(Yb reg ) ' Var(z̄ 0) = Var[y + B(X − x)] = Var(y − Bx)
= Var(ȳ) + B 2 Var(x̄) − 2B Cov(ȳ, x̄)
2
S
n Sx2
Sy n ρSxSy
n Sy2
y
2
+ρ 2 1−
− 2ρ
1−
= 1−
N n
Sx
N n
Sx
N
n
n Sy2
= 1−
1 − ρ2 .
N n
Hence
n s2reg n s2y (1 − ρ̂2)
b
var(Y reg ) = 1 −
= 1−
N n
N
n
where s2reg is the sample variance of zi0 = yi + b(X − xi).
The regression estimator for the population total Y is
Ybreg = N [y + b(X − x)]
and its variance estimate is
s2
s2 (1 − ρ̂2)
n
n
reg
y
2
var(Ybreg ) = N 1 −
=N 1−
N n
N
n
2
SydU STAT3014 (2015) Second semester
Dr. J. Chan
43
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
Bias:
Bias in Yb̄ reg = E(Yb̄ reg ) − Ȳ = E(ȳ) + E[b(X̄ − x̄)] − Ȳ
= E[b(X̄ − x̄)] = −Cov(b, x̄).
Efficiency:
1. The regression estimator is at least as efficient as the ordinary
estimator, that is var(Yb ) ≥ var(Yb reg ) since
n 1 2
b
b
var(Y ) − var(Y reg ) = 1 −
[sy − s2reg ]
N n
n 1 2 2
= 1−
s ρ̂ ≥ 0
N n y
where the equality holds when ρ̂ = 0, i.e. there is no association
between Y and X.
2. The regression estimator is more efficient than the ratio estimator,
that is var(Yb r ) ≥ var(Yb reg ) unless
y
x
in which case they are equivalent and the regression of y on x is
linear through the origin and the variance of y is proportional to x.
b=r=
n 1 2
1−
[sr − s2reg ]
N n
n 1 2
= 1−
[sy − 2rρ̂sxsy + r2s2x − s2y (1 − ρ̂2)]
N n
n 1 2 2
= 1−
(r sx − 2rρ̂sxsy + s2y ρ̂2)
N n
n 1
= 1−
(rsx − ρ̂sy )2
N n
var(Yb r ) − var(Yb reg ) =
SydU STAT3014 (2015) Second semester
Dr. J. Chan
44
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
n 1
= 1−
(rsx − bsx)2
N n
n s2x
= 1−
(r − b)2 > 0 ⇒ (r − b)2 > 0
N n
sxy
sxy sxy
where ρ̂sy =
sy =
= 2 sx = bsx.
sx sy
sx
sx
3. Since Yb reg = y + b(X − x), the regression estimator adjusts the y
up or down by an amount b(X − x).
(a) When the slope b = 0, the regression estimator Yb reg = y becomes
the ordinary estimator Yb .
(b) When the y-intercept a = y − bx = 0 ⇔ b = xy = r, the slope b
becomes the ratio estimate r and the regression estimator
y
y
y
Yb reg = y + (X − x) = y + X − y = X = Xr = Yb r
x
x
x
becomes the ratio estimator Yb .
r
Example: (7-11) Estimate the total sale using the regression estimator.
Solution: The regression estimate of the total sale this year in thousands is
n
X
926 1175
ssxy =
xiyi − nxy = 155753 − 15 ×
×
= 83216.33,
15
15
i=1
2
n
X
926
ssxx =
x2i − nx2 = 117400 − 15 ×
= 60234.93,
15
i=1
2
n
X
1175
ssyy =
yi2 − ny 2 = 231815 − 15 ×
= 139773.33.
15
i=1
SydU STAT3014 (2015) Second semester
Dr. J. Chan
45
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
We have
b=
and
ρ̂ = √
ssxy 83216.33
=
= 1.3815
ssxx 60234.93
ssxy
83216.33
=√
= 0.9069.
ssxxssyy
60234.93 × 139773.33
It follows that
Ybreg = N [y + b(X − x)]
1175
21300 926
= 300
= 27340.65
+ 1.3815
−
15
300
15
as compared with Yb = 23500 and Ybr = 27027.54. The s.e. estimate is
r
s2 (1 − ρ̂2)
n
y
se(Ybreg ) = N
1−
s N n
15 9983.81(1 − 0.90692)
= 300
= 3178.52
1−
300
15
which is < se(Ybr ) = 3226.66 << se(Yb ) = 7543.72. This shows that the
dropping of zero y-intercept assumption improves the estimate slightly.
Note that the y-intercept estimate is
1175
926
− 1.3815 ×
≈ −6.9531
15
15
which is quite close to zero.
a = y − bx =
Read Tutorial 11 Q2c,d, & 3c,d.
SydU STAT3014 (2015) Second semester
Dr. J. Chan
46
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
3.5
The Hartley - Ross Estimator
Since the ratio estimator r for R is biased, the following leads to an
unbiased estimator of R.
Theorem: Let Z = f (X, Y ) be a fixed function of two variables.
Define Zi = f (Xi, Yi) and zi = f (xi, yi). Then
Pn
PN
ZiXi
N − 1 i=1 zi(xi − x̄)
E z̄ +
= i=1
.
(7)
n−1
N X̄
N X̄
Proof: The LHS is
Pn
N −1
i=1 zi (xi − x̄)
E
E(z̄) +
n−1
N X̄
N
N
N
N
X
X
X
X
Zi(Xi − X̄)
ZiXi − X̄
Zi
Zi X i
N − 1 i=1
i=1
= Z̄ +
= Z̄ + i=1
= i=1
.
N −1
N X̄
N X̄
N X̄
For the problem of estimation of R from sample (xi, yi), i = 1, · · · , n,
we assume Xi > 0, i = 1, · · · , N and define the function
zi = f (xi, yi) = yi/xi = ri∗, i = 1, · · · , n
and Zi = Yi/Xi, i = 1, 2, · · · , N , so from (7)
∗
n(ȳ
−
x̄r̄
N
−
1
)
E r̄∗ +
=
n−1
N X̄
since
n
X
i=1
zi(xi − x̄) =
n
X
yi
i=1
xi
(xi − x̄) =
n
X
i=1
N
X
Yi
Xi
X
i
i=1
N X̄
yi − x̄
=
n
X
yi
i=1
xi
Ȳ
=R
X̄
= n(ȳ − x̄r̄∗).
Thus the Hartley-Ross estimator as an unbiased estimator of R is
SydU STAT3014 (2015) Second semester
Dr. J. Chan
47
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
R̂hr
N − 1 n(ȳ − r̄∗x̄)
= r̄ +
n−1
N X̄
∗
for which we need to know X̄ (or X = N X̄). This estimator contains
a mean of ratio estimate and an adjustment for unbiasness.
The Hartley-Ross estimators for mean and total are
for the population mean:
for the population total:
N − 1 n(ȳ − r̄∗x̄)
∗
b̄
and
Y hr = X̄ r̄ +
N
n−1
∗
−
r̄
x̄)
n(ȳ
∗
Ybhr = X r̄ + (N − 1)
.
n−1
Remarks:
n
ȳ
1 X yi
∗
b
b
1. So far, we have R = biased for R, R =
biased for R &
x̄
n i=1 xi
N
X
yi
1
bhr unbiased for R. Finally, could
unbiased for R∗ =
and R
N i=1 xi
we just use
bo = ȳ/X̄ ?
R
ȳ E(ȳ) Ȳ
=
= R which does
X̄
X̄
X̄
not use the information from the sample {xi} but is unbiased for R.
This is the ordinary estimator E
=
2. For small samples we might expect the Hartley-Ross estimator to be
better. There is no general result on the comparison of the variances
ȳ
ȳ
N − 1 n(ȳ − r̄∗x̄)
∗
of r = , ro =
, and rhr = r̄ +
for all
x̄
n−1
X̄
N X̄
sample sizes.
See Cochran (2nd Ed) Theorem 6.3 §6.15.
SydU STAT3014 (2015) Second semester
Dr. J. Chan
48
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
Summary of estimators and variance estimates based on 1 SRS
Regression
Hartley-Ross
ȳ
x̄
-
N − 1 n(ȳ − r̄∗ x̄)
r̄ +
n−1
N X̄
n s2y
1
(1 − )
N n
X̄ 2
n s2r
1
(1
−
)
N n
X̄ 2
-
-
ȳ
ȳ
X̄
x̄
n s2y
(1 − )
N n
n s2
(1 − ) r
N n
Ratio R
Ord.
Ratio
ȳ
X̄
Mean Ȳ
y+
∗
sxy
(X − x)
s2x
X̄ r̄∗ +
N − 1 n(ȳ − r̄∗ x̄)
N
n−1
n s2y (1 − ρ̂2 )
(1 − )
N
n
-
var(Yb̄ r ) < var(Yb̄ ) var(Yb̄ reg ) < var(Yb̄ r )
if ρ̂ >
Total Y
ȳ
X
x̄
N ȳ
n s2y
N (1 − )
N n
2
ȳsx
2x̄sy
equal if b = r =
ȳ
x̄
sxy
n(ȳ − r̄∗ x̄)
∗
N y + 2 (X − N x) X r̄ + (N − 1)
sx
n−1
n s2r
n s2y (1 − ρ̂2 )
2
N (1 − )
N (1 − )
N n
N
n
2
-
n
X
1
=
(
yi2 − nȳ 2 ),
n − 1 i=1
n
n
n
X
X
X
1
ȳ
2
2
2
sr =
(
yi − 2r
x i yi + r
x2i ) = s2y − 2rρ̂sx sy + r2 s2x , r = ,
n − 1 i=1
x̄
i=1
i=1
s2y
SydU STAT3014 (2015) Second semester
Dr. J. Chan
49
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
Example: (7-11)
Solution: The ratios and their summary are given below:
i
1
2
3
4
5
6
7
8
xi
50
35
12
10
15
30
9
25
yi
56
48
22
14
18
26
11
30
ri0 = yi/xi
1.120
1.371
1.833
1.400
1.200
0.867
1.222
1.200
i
9
10
11
12
13
14
15
Total
xi
100
250
50
50
150
100
40
yi
165
409
73
70
95
55
83
ri0 = yi/xi
1.650
1.636
1.460
1.400
0.633
0.550
2.075
19.618
n
19.618
1X ∗
ri =
= 1.3079, x̄ = 61.7333 and ȳ =
We have r̄ =
n i=1
15
78.3333.
∗
The Hartley-Ross estimate of the total sale this year in thousands is
∗
n(ȳ − r̄ x̄)
Ybhr = X r̄∗ + (N − 1)
n−1
15[78.333 − 1.3079(61.7333)]
= 21300(1.3079) + (300 − 1)
15 − 1
= 27086.9
Read Tutorial 12 Q1(a).
SydU STAT3014 (2015) Second semester
Dr. J. Chan
50
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
Example: In a survey of family size (x1), weekly income (x2) and weekly
expenditure on food (y), we want to estimate the average weekly expenditure on food per family in the most efficient way. A simple random
sample of 27 families yields the following data:
X
X
X
x1i = 109,
x2i = 16277,
yi = 2831, ρ̂x1,y = 0.925, ρ̂x2,y = 0.573
i
i
The sample covariance matrix

547.8234



 26.5057


1796.5541
i
for y, x1 and x2 is
26.5057
1796.5541




1.4986
80.1595  .


80.1595 17967.0541
From the census data X̄1 = 3.91 and X̄2 = 542.
(a) Estimate the standard errors of the ratio estimators for Ȳ using x1
and using x2. Compare the standard errors with the s.e. for the
simple estimate ignoring the covariates. Which estimator has the
smallest estimated s.e.?
(b) Calculate the best available estimate of the average weekly expenditure on food per family and give an approximate 95% confidence
interval for this average.
Solution:
(a) The standard errors of the ratio estimators for Ȳ using x1 and using
x2 are
SydU STAT3014 (2015) Second semester
Dr. J. Chan
51
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
P
2831
yi
r1 = P i =
= 25.97
x
109
1i
i
2
2
sr1 = sy − 2r sx1y + r2 s2x1
= 547.8234 − 2(25.97)(26.5057) + 25.972(1.4986)
= 181.896
P
2831
yi
= 0.1739
r2 = P i =
x
16277
i 2i
2
2
sr2 = sy − 2r sx2y + r2 s2x2
= 547.8234 − 2(0.1739)(1796.5541) + 0.17392(17967.0541)
= r
475.7895 r
s2r1
181.896
b̄
=
= 2.5956
se(Y r1) =
n
27
r
r
2
sr2
475.7895.193
se(Yb̄ r2) =
=
= 4.1978
n
27
r
r
2
s
547.8234
y
=
= 4.5044
se(Yb̄ ) =
n
27
The first ratio estimator Yb̄ has the lowest s.e. due to the higher corr1
relation ρ̂y,x1 = 0.925. The second ratio estimator only has marginal
improvement as the correlation ρ̂y,xx = 0.573 is weak but
√
ȳsx
2831 · 17967.0541
√
= 0.4980.
ρx2,y = 0.573 >
=
2x̄sy 2 · 16277 · 547.8234
Note that fpc is ignored because the population size N is unknown.
(b) The estimate of the average weekly expenditure on food per family
Yb̄ = r X̄ = 25.97(3.91) = 101.5524
r1
1
95% CI for Ȳ = 101.5524 ∓ 1.96(2.5956) = (96.4651, 106.6397)
SydU STAT3014 (2015) Second semester
Dr. J. Chan
52
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
3.6
Ratio estimate for subpopulation in poststratification
For some Cl , we want to estimate:
,
, N
N
X
X
X
X
0
Rl =
Yi
Xi =
Yi
Xi0, = R0
i∈Cl
i=1
i∈Cl
i=1
if we define
(Yi0, Xi0) = (Yi, Xi) if i ∈ Cl
= (0, 0) if i ∈
/ Cl .
Note: X 0 = Xl , i.e. the sum of Xi0 over all population equals to the
sum of Xi over Cl . Hence the natural estimator of ratio and its variance
estimate is
P
n
P
0
yi
yi
i∈Cj
n s0rl 2
1 i=1
0
= P
r =P
1−
= rl and var(rl ) ≈
n
0 )2
N n
x
(
X̄
x0i i∈C i
l
i=1
where
(x0i, yi0 ) = (xi, yi) if i ∈ Cl
= (0, 0) if i ∈
/ Cl ,
n
X0
1X 0 1X
0
0
X̄ =
can be estimated by x̄ =
x =
xi and
N
n i=1 i n
i∈Cl
N
2
Srl0
1 X 0
=
(Yi − R0Xi0)2
N − 1 i−1
can be estimated by
n
2
s0rl
1 X 0
1 X
0 0 2
=
(yi − r xi) =
(yi − rl xi)2.
n − 1 i=1
n−1
i∈Cl
SydU STAT3014 (2015) Second semester
Dr. J. Chan
53
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
The ratio estimator of mean in Cl and its variance estimate are
s0 2
1
n
rl
Yb̄ rl = X̄l rl and var(Yb̄ rl ) ≈ 2 1 −
Wl
N n
since
X̄l2 n s0rl 2
2
b̄
var(Y rl ) = X̄l var(rl ) = 02 1 −
N n
X̄
02
0 2
2 X N
n srl
1 n s0rl 2
=
1−
= 2 1−
.
Nl2 X 02
N n
Wl
N n
Similarly, the ratio estimator of total in Cl and its variance estimate are
Ybrl = Xl rl
and
s0 2
n
rl
var(Ybrl ) ≈ N 1 −
N n
2
since
s0 2
s0 2
2 2 X
n
n
N
0
rl
rl
var(Ybrl ) = Xl2var(rl ) = 0l2 1 −
= X 2 02 1 −
.
N n
N n
X
X̄
Note that these estimators correspond to method 1 in Section 1.5 for
poststratification and nl does not come into any of these calculations.
Read Tutorial 12 Q1b,c.
SydU STAT3014 (2015) Second semester
Dr. J. Chan
54
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
Ratio Estimation for Stratified SRS
3.7
In a stratified SRS, a SRS of a specified sample size nl is taken in each of
the L strata with known size Nl , e.g. the 6 states of Australia. There
are two types of ratio estimates depending on the order of taking ratio
and summing over strata.
bs =
1. Take ratios rl = ȳl /x̄l first and sum over Ybl = Xl rl to obtain R
PL b
l=1 Yl /X.
bl first to obtain Yb and X
b and then take ratio
2. Sum over Ybl and X
bc = Yb /X.
b
R
The ‘Separate’ Ratio Estimate
3.7.1
Suppose the stratum totals Xl , l = 1, · · · , L are known so X =
L
P
Xl
l=1
is known also. Then
L
L
L
X
X
X
b
1
1
1
Y
bs =
Ybl =
=
Xl rl =
R
Wl X̄l rl
X X
X
X̄
l=1
l=1
l=1
since
Xl N Nl Xl
ȳl
1
=
= Wl X̄l and rl = . Then
x̄l
X
X N Nl X̄
bs ) =
E(R
L
L
X
Xl Yl
1 X
Y
E(rl ) ≈
=
Yl =
= R,
X
X Xl X
X
L
X
Xl
l=1
Yl
since E(rl ) ≈ Rl =
and
Xl
l=1
l=1
2
L
X
ssrl
1
n
l
2
bs ) =
var(R
W
1
−
Nl nl
X̄ 2 l=1 l
SydU STAT3014 (2015) Second semester
Dr. J. Chan
55
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
where
"
nl
X
nl
X
1
yil2 − 2rl
xil yil + rl2
nl − 1 i=1
i=1
Similarly the separate ratio estimate for the mean is
s2rl = s2yl −2rl sxl yl +rl2s2xl =
Yb̄ st,s =
L
X
Wl X̄l rl and var(Yb̄ st,s) =
L
X
Wl2
l=1
l=1
nl
1−
Nl
nl
X
#
x2il .
i=1
s2srl
nl
Bias:
For large stratum sample sizes, rl will be approximately unbiased for Rl
and var(rl ) will approximate Var(rl ) reasonably well.
For moderate and small samples, bias is important, and we should consider it here. We know that in a single stratum
|bias rl | σx̄l
≤
= cv(x̄l )
σrl
X̄l
bs :
Consider the bias of R
bs)| = E(R
bs − R)
|bias (R
!
L
L
X
X
Xl
Xl
(rl − Rl ) =
E(rl − Rl )
= E
X
X
l=1
=
L
X
Xl
l=1
X
l=1
|bias rl | ≤ max |bias rl |
l
L
X
Xl
l=1
σx̄l σrl
≤ max |bias rl | ≤ max
l
l
X̄l
X
Hence
max σrl
bs)|
σx̄l
|bias (R
≤ l
max
bs
bs ) l
X̄l
s.e.R
s.e.(R
≤
SydU STAT3014 (2015) Second semester
√

max σrl

 max σx̄l
L l
l
min σrl
X̄l
l
Dr. J. Chan
56
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
since
v
v
u L
u L 2
uX
uX Xl
t
b
var(rl ) ≥ min σrl t
p2l
s.e.(Rs) =
l
X
l=1
l=1
v
r
u L 2
uX 1
1
L
√
≥
≥ min σrl t
≥ min σrl
min σrl
l
l
L
L2
L l
l=1
where Xl = pl X. The sum of squares of unequal proportions is higher
than that from equal proportions in general. This is due to the convexity
property of the function f (p) = p2. For example, when L = 2 with cases
(1 − p, p) and ( 12 , 12 ),
1
1 1
(1 − p)2 + p2 − 2( )2 = 2p2 − 2p + = (2p − 1)2 ≥ 0.
2
2 2
√
Therefore the ratio on the LHS can be L times as large as the σx̄l /X̄l
bound on individual relative biases. Even if the biases are individually
small, the overall bias can be large.
3.7.2
The ‘Combined’ Ratio Estimate
It is defined as
L
P
bc =
R
l=1
L
P
Wl ȳl
ȳst
Yb̄
Yb
=
=
=
b̄
b
x̄st X
X
Wl x̄l
l=1
SydU STAT3014 (2015) Second semester
Dr. J. Chan
57
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
bs, it does not require the knowledge of individual
and in contrast to R
Xl ’s. Note that
L
ȳ X
1
Y
ȳ
st
st
bc ) = E
≈E
≈
Wl E(ȳl ) =
E(R
= R Approx. unbiased
x̄st
X
X̄
X̄
l=1
Theorem:
!
PNl
L
2
X
nl 1
2
i=1 [Yil − Ȳl − R(Xil − X̄l )]
bc ) ≈ 1
W
1
−
.
Var(R
Nl nl
Nl − 1
X̄ 2 l=1 l
Proof: First
L
ȳst
1
1 X
b
Rc − R =
−R=
(ȳst − Rx̄st) =
Wl (ȳl − Rx̄l )
x̄st
x̄st
x̄st
l=1
L
1
1 ¯
1 X
dst ≈ d¯st
Wl d¯l =
=
x̄st
x̄st
X̄
l=1
where dli = yli − Rxli, i = 1, · · · , nl estimates Dli = Yli − RXli and
nl
X
1
d¯l =
dil . Note that typically D̄l 6= 0. Hence
nl i=1
2
L
X
1
1
n
Scrl
l
2
¯st) ≈
bc ) ≈
Var(
d
W
1
−
Var(R
l
Nl nl
X̄ 2
X̄ 2
l=1
where
N
2
Scrl
N
l
l
X
1 X
1
(Dli − D̄l )2 =
[Yli − Ȳl − R(Xli − X̄l )]2
=
Nl − 1 i=1
Nl − 1 i=1
= Sy2l − 2RSxl yl + R2Sx2l ,
and this can be estimated by
nl
X
1
s2crl =
[yli − ȳl − rc(xli − x̄l )]2 = s2yl − 2rcsxl yl + rc2s2xl
nl − 1 i=1
SydU STAT3014 (2015) Second semester
Dr. J. Chan
58
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
as compared to
" n
#
nl
nl
l
X
X
X
1
2
2
2
ssrl =
yli − 2rl
xliyli + rl
x2il = s2yl −2rl sxl yl +rl2s2xl
nl − 1 i=1
i=1
i=1
for separate ratio estimator.
There is less risk of bias in R̂c than in R̂s. We can show that
bc − R)|
σx̄l
|E(R
≤ max
bc
l
X̄l
s.e.R
in contrast to
bs − R)| √ maxl σr σx̄l
|E(R
l
≤ L
max
bs
l
minl σrl
X̄l
s.e.R
bs .
for the separate ratio estimator R
Similarly the combine ratio estimate for the mean and its variance are
2
L
X
scrl
n
l
bc and var(Yb̄ st,c) =
.
Yb̄ st,c = X̄ R
Wl2 1 −
Nl nl
l=1
and for the total are
bc
Ybst,c = X R
and
var(Ybst,c) = N 2
L
X
l=1
2
n
scrl
l
Wl2 1 −
.
Nl nl
Read Tutorial 12 Q2.
SydU STAT3014 (2015) Second semester
Dr. J. Chan
59
STAT3014/3914 Applied Stat.-Sampling C3-Ratio & reg est.
Estimators and variance estimates for stratified SRS (Ch.2)
Parameter Estimator
Variance
nl
P
1
2
Ordinary/naive estimator syl = nl −1 ( yli2 − nl ȳl2 ), Wl = NNl
i=1
2
L
L
X
X
syl
n
1
1
l
2
bst =
bst ) =
W
1
−
Wl ȳl
Ratio R
R
var(R
Nl nl
X̄ l=1
X̄ 2 l=1 l
2
L
L
X
syl
P
n
l
2
Wl 1 −
Mean Ȳ
Yb̄ st =
Wl ȳl
var(Yb̄ st ) =
Nl nl
l=1
l=1
2
L
L
X
syl
P
n
l
Total Y
Ybst = N
Wl ȳl
var(Ybst ) = N 2
Wl2 1 −
Nl nl
l=1
l=1
ȳl
Separate ratio estimator s2sr,l = s2yl − 2 rl ρ̂sxl syl + rl2 s2xl , rl =
x̄l
2
L
L
X
X
ssr,l
1
1
n
l
2
bst,sr =
bst,sr ) =
R
Ratio R
Wl X̄l rl var(R
W
1
−
Nl nl
X̄ l=1
X̄ 2 l=1 l
2
L
L
X
ssr,l
P
n
l
Wl X̄l rl
Mean Ȳ
Yb̄ st,sr =
var(Yb̄ st,sr ) =
Wl2 1 −
Nl nl
l=1
l=1
2
L
L
X
ssr,l
P
n
l
Ybst,sr = N
Wl X̄l rl
Total Y
var(Ȳst,sr ) = N 2
Wl2 1 −
Nl nl
l=1
l=1
Combine ratio estimator
L
P
Ratio R
bst,cr =
R
l=1
L
P
s2cr,l
=
Wl ȳl
= rst,cr
Wl x̄l
s2yl
− 2 rc ρ̂sxl syl + rc2 s2xl ,
rst,cr =
PL
l=1 Wl ȳl
PL
l=1 Wl x̄l
2
L
X
scr,l
1
n
l
2
bst,cr ) =
var(R
W
1
−
l
Nl nl
X̄ 2 l=1
l=1
Mean Ȳ
Yb̄ st,cr = X̄rst,cr
Total Y
Ybst,cr = N X̄rst,cr
L
X
nl s2cr,l
1−
Nl nl
l=1
L
X
nl s2cr,l
2
2
var(Ȳst,cr ) = N
Wl 1 −
Nl nl
var(Yb̄ st,cr ) =
Wl2
l=1
SydU STAT3014 (2015) Second semester
Dr. J. Chan
60