PROBABILISTIC SURVEY ERROR MODELS TO CORRECT FOR NONRESPONSE
by
Camilla Anita Brooks
Department of Biostatistics
University of North Carolina at Chapel Hill
Institute of Statistics Mimeo Series No. 1417
October 1982
This research was partially supported by the Bureau of the Census
through Joint Statistical Agreements JSA 80-19 and 81-28.
PROBABILISTIC SURVEY ERROR MODELS
TO CORRECT FOR NONRESPONSE
by
Camilla Anita Brooks
A Dissertation submitted to the faculty of the University of North
Carolina at Chapel Hill in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in the Department of
Biostatistics.
Chapel Hill
1982
Approved by:
Advisor
ABSTRACT
CAMILLA A. BROOKS. Probabilistic Survey Error Models to Correct for
Nonresponse. (Under the direction of WILLIAM D. KALSBEEK.)
Survey practitioners are becoming increasingly aware of the effect of nonsampling errors, in particular response and nonresponse errors, on the accuracy of surveys and censuses. In this research, the 1946 Hansen and Hurwitz double sampling model used to correct for nonresponse is modified to incorporate the emerging concept that each unit in the population has a probability, generally unknown, of responding in a survey. In addition, the mean square error (MSE) of the sample estimate of the mean based on double sampling is extended to incorporate response error.
Estimators of the MSE components for this model are developed
using a combination of the usual estimation methods of survey repetition and replication.
In order to study the interrelationships of MSE components, the expected MSE of this model is developed using a superpopulation model approach. The E(MSE) of the double sampling model is then compared to that of a Politz-Simmons and a no-adjustment model for fixed cost. Double sampling is more efficient in many, but not all, situations; the cost model, the expected response rate, and the distribution of the response probabilities are instrumental in determining the preferred model in terms of E(MSE).
In addition, the estimation of response probabilities is investigated empirically using data from the Virginia Health Survey, which was conducted by the Research Triangle Institute, and 1970 Census data.
ACKNOWLEDGEMENTS
The author wishes to thank her advisor, Dr. William D. Kalsbeek,
for his suggestions, comments, and support throughout her research.
She would also like to thank the other members of her committee,
Drs. Sherman A. James, Gary G. Koch, Judith T. Lessler, and
Chirayath M. Suchindran for their comments and suggestions and moral
support during her research.
She gives special thanks to Dr. Lessler and Dr. Koch whose previous research served as a guide for her own research.
In addition,
she would like to thank Dr. P.K. Sen for some helpful suggestions
used in Chapter IV of this research; the Institute for Research in
Social Science for help in acquiring data from the 1970 Census tapes;
and the Research Triangle Institute and the Virginia Department of
Health for making available data from the Virginia Health Survey.
She thanks Ms. Ernestine Bland for her careful typing of a
difficult manuscript.
And finally, the author extends her appreciation to her family
and to her friends, both old and new, for their support in the pursuit of her goal, their continuous encouragement, and their unfailing belief in its
successful completion.
TABLE OF CONTENTS

                                                                      Page
LIST OF TABLES ....................................................   viii

Chapter

I.   INTRODUCTION AND REVIEW OF THE LITERATURE ....................      1
     1.1  Introduction ............................................      1
     1.2  Errors in Survey Data ...................................      2
          1.2.1  Response Variance ................................      3
          1.2.2  Nonresponse ......................................      5
     1.3  Mathematical Models .....................................     10
     1.4  Development of Estimators ...............................     24
          1.4.1  Response Variance - Simple and Correlated ........     25
          1.4.2  Estimation of Mean Square Error Components in a
                 Double Sampling Scheme to Eliminate Bias .........     30
          1.4.3  Response Probabilities ...........................     33
     1.5  Summary .................................................     36
     1.6  The Research ............................................     37

II.  THE SURVEY ERROR MODEL AND OPTIMUM SUBSAMPLING SCHEME ........     39
     2.1  Introduction ............................................     39
     2.2  The Estimator and Model Assumptions .....................     40
     2.3  The Mean Square Error ...................................     45
          2.3.1  The Bias of x̄_t ..................................     46
          2.3.2  The Variance of x̄_t ..............................     47
                 2.3.2.1  The Response Variance ...................     48
                 2.3.2.2  Variance Due to Random Assignment of
                          Subsample Units to Interviewers .........     52
                 2.3.2.3  Variance Due to Sampling ................     53
                 2.3.2.4  The Nonresponse Variance ................     58
                 2.3.2.5  Variance Due to the Random Assignment of
                          the Initial Sample Units to Interviewers      59
                 2.3.2.6  Summary of Variance Terms ...............     61
     2.4  Optimization of Initial Sample Size and Subsampling Rate      62

III. THE DEVELOPMENT OF ESTIMATORS OF THE MEAN SQUARE ERROR
     COMPONENTS ...................................................     66
     3.1  Introduction ............................................     66
     3.2  Additional Design Assumptions and Methods of Estimation .     67
     3.3  The Estimation of Components ............................     71
          3.3.1  The Estimation of Bias ...........................     72
          3.3.2  The Estimation of Variance .......................     73
                 3.3.2.1  The Estimation of Components Involving
                          the Initial Sample ......................     73
                 3.3.2.2  The Estimation of Components Involving
                          the Subsample of Nonrespondents .........     84
                 3.3.2.3  The Estimation of Components Involving
                          the Correlation of Initial and Subsample
                          Respondents .............................     92
                 3.3.2.4  Summary of Estimation of Variance Terms .     95

IV.  THE EFFECT OF EXPECTED RESPONSE PROBABILITIES ON THE EXPECTED
     MEAN SQUARE ERROR ............................................     98
     4.1  Introduction ............................................     98
     4.2  The Mean Square Error Models ............................     99
          4.2.1  Model 1 - No Adjustment ..........................    100
          4.2.2  Model 2 - Politz-Simmons .........................    101
          4.2.3  Model 3 - Double Sampling ........................    103
     4.3  The Development of the Expected Mean Square Error .......    106
          4.3.1  Model 1 - No Adjustment ..........................    109
          4.3.2  Model 2 - Politz-Simmons .........................    111
          4.3.3  Model 3 - Double Sampling ........................    112
          4.3.4  The Covariance of Functions of P and Y ...........    116
          4.3.5  Summary of Expected Mean Square Error Components .    119
     4.4  Hypothetical Examples of Expected MSE ...................    120
          4.4.1  The Parameter Values .............................    121
          4.4.2  Discussion of Results ............................    123

V.   AN EMPIRICAL INVESTIGATION OF THE ESTIMATION OF RESPONSE
     PROBABILITIES ................................................    143
     5.1  Introduction ............................................    143
     5.2  Estimation of Response Probabilities ....................    143
     5.3  The Estimation of Bias ..................................    151

VI.  SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH .................    158
     6.1  Summary .................................................    158
     6.2  Suggestions for Further Research ........................    160

BIBLIOGRAPHY ......................................................    163

APPENDIX A ........................................................    165
     Details of Derivation of Optimum n and k

APPENDIX B ........................................................    167
     Details of Derivation of Estimators of MSE Components
LIST OF TABLES

Table                                                                 Page

1.3.1     Comparison of Components of Mean Square Error Models ....     12

4.4.2.1   Effect of e on the Expected Squared Bias in the No
          Adjustment Model, n = 200 ...............................    130

4.4.2.2   E(MSE) Components as Percent of Total E(MSE) for Varying
          Shapes of the Beta Distribution. No Adjustment Model,
          n = 200 .................................................    131

4.4.2.3   E(MSE) Components as Percent of Total E(MSE) for Varying
          Shapes of the Beta Distribution. No Adjustment Model,
          n = 2000 ................................................    132

4.4.2.4   E(MSE) Components as Percent of Total E(MSE) for Varying
          Shapes of the Beta Distribution. Politz-Simmons Model,
          n = 200 .................................................    133

4.4.2.5   E(MSE) Components as Percent of Total E(MSE) for Varying
          Shapes of the Beta Distribution. Politz-Simmons Model,
          n = 2000 ................................................    134

4.4.2.6   E(MSE) Components as Percent of Total E(MSE) for Varying
          Shapes of the Beta Distribution. Double Sampling Model,
          n = 200 .................................................    135

4.4.2.7   E(MSE) Components as Percent of Total E(MSE) for Varying
          Shapes of the Beta Distribution. Double Sampling Model,
          n = 2000 ................................................    136

4.4.2.8   Effect of Correlation Between Y_1 and Y_2 on PVAR and NRV
          in the Double Sampling Model, n = 200 ...................    137

4.4.2.9   Optimum Values of n and k for Selected Cost Functions and
          Selected Values of E(P), C-c0 = 5000, c3 = 50 ...........    138

4.4.2.10  Comparison of E(MSE) of Three Models for Fixed Cost,
          C-c0 = 5000, c3 = 50, c1/c3 = .5, c2/c3 = .1 ............    139

4.4.2.11  Comparison of E(MSE) of Three Models for Fixed Cost,
          C-c0 = 5000, c3 = 50, c1/c3 = .2, c2/c3 = .1 ............    141

5.2.1     Characteristics of Independent Variables Used in the
          Estimation of Response Probabilities ....................    145

5.2.2     Estimated Response Probabilities ........................    150

5.2.3     The Estimated Response Probabilities in the Sample ......    150

5.3.1     The Relative Bias of Estimates of Mean Income (Based on
          500 Samples of Size 200) ................................    156

5.3.2     The Relative Bias of Estimates of Proportion of Households
          Having Incidence of Heart Attack, Stroke, or Cancer
          (Based on 500 Samples of Size 200) ......................    157
CHAPTER I
INTRODUCTION AND REVIEW OF THE LITERATURE

1.1 Introduction
Sampling errors associated with survey data have long been provided by survey practitioners, and sophisticated users of the data have come not only to accept the presentation of these errors with the data, but to expect them. However, sampling errors are but part of the total error associated with survey data; for many survey statistics, particularly those produced from data collected from relatively large and well-designed surveys, this component of the mean square error may be overshadowed by those attributable to nonresponse, interviewers, coding, processing, and the like. Indeed, many statistics derived from complete count censuses are subject to more total error than those derived from large-scale ongoing sample surveys; this is due mainly to the necessary use of less well-trained supervisors, interviewers, coders, and the like than those who in continuing surveys may have had years of experience.
Though data are available on biases and nonsampling variance components of total error, in comparison with those on sampling variances they are very limited. This lack of data is due more to the cost, time, and difficulty in measuring nonsampling components of error than to a lack of appreciation of their importance; it has come to be recognized that the continued study of the components of error arising from all aspects of the study process will not only lead to a more accurate assessment of data quality, but will inevitably lead to improved survey procedures to reduce these errors.
1.2 Errors in Survey Data
Errors in survey data are generally discussed in the context of
deviations in the measurement of the "true value" of a population
characteristic. This true value is in some cases an elusive one,
particularly in the case of attitudes.
Hansen et al. (1953) define "true value" in the context of three criteria:
a)
The true value must be uniquely defined.
b)
The true value must be defined in such a manner that the
purposes of the survey are met.
c)
Where it is possible to do so consistently with the first
two criteria, the true value should be defined in terms of
operations which can actually be carried through (even
though it might be difficult or expensive to perform the
operations).
Deviations from this "true value" are the result of both variable errors and biases.
Of the variable errors, or variances,
sampling errors have received the most attention.
The sampling error
of a sample estimate is defined as the difference between the esti-
mate computed from a particular sample and the result that would be
obtained from a complete census under the same general conditions.
However, variable errors are also the result of nonsampling processes,
including interviewing, coding and processing, and nonresponse.
Biases are systematic errors that affect any sample taken under a specified survey design. In mathematical terminology, a bias is simply the quantity by which the expected value of an estimate differs from its "true", generally unknown, value. As in the case of variable errors, biases may be due to the sampling process or to nonsampling sources. Included in the sampling biases is the use of biased but consistent estimators, as for example, the use of the ratio r = y/x as an estimator of R = Y/X, or the use of Σ_{i=1}^{n} (y_i − ȳ)²/n as an estimator of σ² (Kish, 1965). The nonsampling biases include noncoverage errors, that is, incomplete sampling frames, incomplete within-household coverage, use of proxy respondents, nonresponse, and others too numerous to list. This fourfold classification of survey errors, i.e., variable errors (sampling and nonsampling) and biases (sampling and nonsampling), is discussed at some length in Kish (1965).
Nonsampling biases and variances resulting from response and
nonresponse have received considerable attention in the literature.
Indeed, response variance, in particular that due to the interviewer,
and nonresponse, are probably the largest contributors to the total
mean square error outside of variable sampling errors; they, in
fact, may outstrip even this component.
Because of their importance
in the total mean square error and to this research, they are discussed in more detail below.
1.2.1 Response Variance
Response variance consists of two major components - simple
response variance and correlated response variance.
Simple response variance is due to trial-to-trial variation in individual responses.
That is, a respondent may not always give the same answer to the
same question were he asked at different times under the same general survey conditions.
Some questions are subject more to this
type of variability than others.
For example, questions on educa-
tional attainment may be subject to larger simple response variance
than questions on age.
This was the case in the 1970 Census.
For
the five age categories 15-64 the estimated rel-variances (estimated
variance divided by the square of the estimate) varied from .22 to
.73, whereas those for the eight educational attainment items varied
from .51 to 7.02.
The ratios of simple response to sampling vari-
ance were considerably higher for educational attainment than for
age.
The ranges were .22 - .47 for the former and .04 - .07 for the latter (Bailey, 1975).
The correlated response variance is attributable to correlation
between the random component of the response error for one individual
and that for another individual.
This is the result of the effect of
coders, supervisors, interviewers, and the like on survey units for
which each is responsible.
Frequently, it is the interviewer who
makes the greatest contribution to this component of error.
Kish
(1965) describes the interviewer effect as follows:
Each interviewer has an individual average "interviewer bias" on
the responses in his workload, and we consider the effect of a
random sample of these biases on the variance of sample means.
This effect is expressed as an interviewer variance which decreases in proportion to the number of interviewers. Its contribution to the variance of sample means resembles other variance terms; it is proportional directly to the variance per
interviewer and inversely to the number of interviewers (Kish,
1965, pp. 522-523).
For certain statistics this item can represent a substantial
portion of the total mean square error.
The ratio of this compo-
nent of error to sampling variance is generally much higher than for
the simple response variance.
For example, for the item educational
attainment in the 1970 Census, the ratio of correlated response variance to sampling variance for the characteristic "elementary 8" was
1.37, whereas the ratio of simple response variance to sampling variance was .42 (Bailey, 1975). Whereas for some characteristics the
difference is even more dramatic, it should be pointed out that some
are less so.
Further, it must be emphasized that the 1970 Census 20
percent sample on which these data were based has much smaller sampling variances than most surveys.
In an interviewer variance study
conducted during a 1975 National Crime Survey, a national sample
survey, the ratios of correlated response variance to the sampling variances were less dramatic, though in many instances they still
represented substantial contributions to the total error.
The sig-
nificance of the ratio of these components and others to sampling
variance is that it dramatizes how much the sampling variance, the
usual measure of accuracy that is estimated by survey practitioners,
understates the total error.
Simple response variance is usually included
among sampling variance estimates, whereas the correlated component
of response variance generally is not.
1.2.2 Nonresponse

Nonresponse is a continuing problem in surveys and has received more attention in the literature than response variance. It may be
described as the failure to obtain information for a unit which belongs in the sample.
Thus, for example, in a household survey,
units which are not considered fit for habitation are not eligible
for inclusion in the sample and are not considered as nonresponse
units.
There are two basic types of nonresponse - total nonresponse
in which no information is obtained from the sampling unit, and
item nonresponse in which some information is obtained for a unit,
but no information is obtained for the particular item(s) in question.
Both types of nonresponse and some methods of dealing with it
are briefly discussed below.
Total nonresponse may be classified into numerous different
categories depending on the type of survey.
For example, the Bureau
of the Census classifies the nonrespondent households from its
personal visit household surveys into four categories - the "no one
home", the "temporarily absent", "the refusal", and all "other".
The "no one homes" are those households in which no eligible
respondent is found at home after repeated calls, but in which
members are not away for any extended period of time.
This would
include those whose household members eligible to respond happen
to be out shopping, at work, or visiting a neighbor when the interviewer calls. The "temporarily absent" are those whose household
members are away during the entire survey period, for example, on
a vacation or business trip.
"Refusal" households are those house-
holds which are contacted but whose eligible respondent members
refuse to respond (Brooks and Bailar, 1978).
In panel surveys,
where households remain in sample for several months, households in
which eligible respondents refuse repeatedly for some specified
number of months are designated "hard-core refusals" and may not
even be contacted during the remaining months in which these households are in sample.
Even in one-time surveys the "hard core
refusal" can be an appropriate terminology.
In explanation, an
individual may refuse to respond because he is busy or ill at the
time. Were repeated efforts to be made a response could very well
be elicited from these individuals.
The hard-core nonrespondents,
however, would be those who because of strong personal feelings
would refuse to respond even after repeated calls.
The last category
used by the Bureau of the Census, "Other", could include those which could not be reached by the interviewer because of road conditions, death in the family, and any other reason which cannot be classified into one of the three above categories.
Though there are some
differences in classification schemes used by survey practitioners,
this scheme is fairly representative of the types of nonresponse
categories in use today, at least for personal visit surveys of this
type.
The consideration of different types of nonresponse is important for survey practitioners mainly for two reasons.
One is as a
tool to improve response rates through quality control of interviewers (particularly in the case of personal visit surveys) and
subsequent training in procedure or the establishment of new interviewing schedules to decrease "not-at-homes". Another is that the
classification of nonrespondents may be used in studying biases that
result from nonresponse.
Nonrespondents may differ from respondents
8
in respect to specific survey items and these differences may be
differential by type of nonresponse.
Further, when nonresponse is
considered as a stochastic process, i.e., when each person is considered to respond according. to a particular probability, these
probabilities may differ by type of nonresponse ..
Item nonresponse usually results because the respondent considers particular questions too sensitive, as for example, questions on income, though missing data may also result because of recording errors and oversights.
For a particular questionnaire item,
item nonresponse presents the same problems as total nonresponse.
However, the two are differentiated because not all methods of adjusting for nonresponse are applicable to both types of nonresponse.
There are many ways of handling nonresponse.
These include,
but are not restricted to, a double sampling scheme in which a subsample of nonrespondents is followed up, adjustment based on differential response probabilities discussed in Section 1.4.3 below and
several imputation procedures.
Hansen and Hurwitz (1946) used an adaptation of a double sampling scheme developed by Neyman (1938) in handling nonresponse. In their case they were making use of double sampling in a mail-personal visit situation in which mail questionnaires were used, because of economy, in the original survey; personal interviews were then conducted for a subsample of the nonrespondents. If a 1-in-k subsample of the nonrespondents is interviewed, then an unbiased estimate of the total X is

    X̂ = (N/n)(X₁ + k X₂),

where:

    N  = total population size,
    n  = total size of the original sample,
    k  = the reciprocal of the subsampling fraction of the nonrespondents to the original survey,
    X₁ = estimate of the total for the original respondents,
    X₂ = estimate of the total from the subsampled nonrespondents.

The variance of this estimate is given by

    σ²_X̂ = N² [(N − n)/(Nn)] S² + (N/n)(k − 1) N₂ S₂²,

where:

    S²  = the variance among sampling units in the entire population,
    S₂² = the variance among the nonrespondent portion of the population,
    N₂  = the number of units (establishments, etc.) in the population who would not have responded had a complete census been taken.
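To make the mechanics of the scheme concrete, the following is a minimal simulation sketch of the double sampling estimator in the form given above. The population values, the fixed would-respond indicators, and the sample sizes are all hypothetical, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: N units with a value y_i and a fixed indicator of
# whether the unit would answer the original (mail) survey.
N = 10_000
y = rng.gamma(shape=2.0, scale=500.0, size=N)
would_respond = rng.random(N) < 0.7

n, k = 1_000, 3          # initial sample size; 1-in-k follow-up of nonrespondents

def hh_total_estimate():
    sample = rng.choice(N, size=n, replace=False)
    resp = sample[would_respond[sample]]          # respondents to the original survey
    nonresp = sample[~would_respond[sample]]      # nonrespondents
    m = len(nonresp) // k                         # 1-in-k subsample of nonrespondents
    follow_up = rng.choice(nonresp, size=m, replace=False)
    X1 = y[resp].sum()                            # total for the original respondents
    X2 = y[follow_up].sum()                       # total from the subsampled nonrespondents
    # X_hat = (N/n)(X1 + k*X2); the realized inverse subsampling fraction is used
    # in place of k to allow a nonrespondent count that is not a multiple of k.
    return (N / n) * (X1 + (len(nonresp) / m) * X2)

estimates = [hh_total_estimate() for _ in range(500)]
print("true total        :", round(y.sum()))
print("mean of estimates :", round(float(np.mean(estimates))))  # close to the true total
```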
An imputation procedure frequently used for total nonresponse is a weighting method in which the inverse probabilities of sample selection are inflated by the inverse of the response rate in a particular class or stratum. The classes or strata are formed so as to be homogeneous with respect to certain characteristics, such as race and urban-rural area, which are expected to reduce the bias due to nonresponse. For item nonresponse the weighting method is not generally used. Use of the weighting method for partial nonresponse would result in different weights being applied to different questions according to the response rate by question. This would result in inconsistent data among different questions (Platek and Gray, 1980). As such,
one method frequently used in imputing for item nonresponse is
duplication of records, i.e., the missing item is substituted with
another value contributed by a similar sampling unit.
The source may be external to the sample, for example, data from a previous survey; Platek et al. (1977, 1980) refer to this as imputation based on historical data. It is also frequently referred to as the cold-deck procedure. A more popular procedure is the duplication of values from the same survey for missing data items from records which have similar characteristics; for example, a value may be substituted from another record in the same age group and of the same race and sex. The selection of the value could be based on a random selection from a specific subclass of records, or it could be based on some other procedure, say, the next available record within the subclass could be used. This duplication procedure, in which values are used from the same survey, is frequently referred to as the hot-deck procedure.
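As a concrete illustration of the two adjustment ideas just described, the sketch below applies a weighting-class adjustment for total nonresponse and a simple within-class hot-deck for item nonresponse. The class variable, the data, and the random-donor rule are hypothetical choices made for the example, not prescriptions from the literature cited above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical sample: an adjustment class, a base sampling weight, a unit
# response flag, and an income item with some item nonresponse.
n = 200
df = pd.DataFrame({
    "cls":       rng.choice(["urban", "rural"], size=n),
    "base_wt":   50.0,
    "responded": rng.random(n) < 0.8,
    "income":    rng.gamma(2.0, 400.0, size=n).round(2),
})
df.loc[rng.random(n) < 0.1, "income"] = np.nan        # item nonresponse

# Weighting-class adjustment for total nonresponse: inflate respondents'
# weights by the inverse of the response rate within their class.
resp_rate = df.groupby("cls")["responded"].transform("mean")
df["adj_wt"] = np.where(df["responded"], df["base_wt"] / resp_rate, 0.0)

# Hot-deck imputation for item nonresponse: replace a missing income with a
# randomly chosen donor value from the same class ("duplication of records").
for cls, idx in df.groupby("cls").groups.items():
    vals = df.loc[idx, "income"]
    donors = vals.dropna().to_numpy()
    missing_idx = idx[vals.isna().to_numpy()]
    df.loc[missing_idx, "income"] = rng.choice(donors, size=len(missing_idx))

print(df[["cls", "responded", "adj_wt", "income"]].head())
```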
Other imputation methods are discussed by Kalsbeek (1980) and
Chapman (1976).
Nonresponse and the resulting imputation procedure
used to adjust for it contribute to both the bias and variance components of the mean square error and are an explicit part of some of
the mathematical models discussed below.
1.3 Mathematical Models
The mathematical models used to represent the mean square error
that are discussed here include those of Hansen, Hurwitz, and
Bershad (1961), Platek, Singh, and Tremblay (1977), Lessler (1979), and Platek and Gray (1980).
The Hansen, Hurwitz, and Bershad model
has been widely used, particularly by the Bureau of the Census.
The
three other models are more recent developments and differ from the
first in that they treat nonresponse as a separate component of the
model which contributes to both bias and variance.
Table 1.3.1 gives the model components for each of the four
models discussed here; where deemed necessary, assumptions and/or
clarifying comments are given.
This table may be referred to as an
easy reference point in comparing the models throughout the discussion in this section.
Hansen, Hurwitz, and Bershad, in a 1961 article, perhaps the
most influential in the development of mean square error models,
present a model for estimating the mean square error in the context
of a complete census or simple random sample in which interviewers
are used in the data collection process.
Further, they regard a
particular survey as one trial, i.e., one survey from all possible
repetitions of the survey under the same general conditions G.
These
general conditions mayor may not be under the control of the survey
designer and include such things as the political and economic situation at the time of the survey, training of interviewers, the data
collection vehicle, and survey controls.
The model is developed for proportions and is presented here as such; however, it is applicable to other than 0,1 variables.
TABLE 1.3.1
COMPARISON OF COMPONENTS OF MEAN SQUARE ERROR MODELS

Model Components                         Model¹     Included?   Assumptions

Variance Terms

  Sampling Variance
    Observed values                      HH&B       Yes
                                         PS&T       No          Assumes complete census.
                                         Lessler    Yes
                                         P&G        Yes
    Imputed values                       HH&B       No          Nonresponse not explicitly included in the model.
                                         PS&T       No          Assumes complete census.
                                         Lessler    Yes
                                         P&G        Yes
    Correlation between observed
    and imputed values                   HH&B       No          Nonresponse not explicitly included in the model.
                                         PS&T       No          Assumes complete census.
                                         Lessler    Yes
                                         P&G        Yes

  Response Variance
    Simple                               HH&B       Yes
                                         PS&T       Yes
                                         Lessler    Yes
                                         P&G        Yes
    Correlated                           HH&B       Yes
                                         PS&T       No          Assumed zero.
                                         Lessler    Yes
                                         P&G        Yes

  Imputation Variance
    Simple                               HH&B       No          Nonresponse and imputation not considered in the model.
                                         PS&T       Yes
                                         Lessler    Yes
                                         P&G        Yes
    Correlated                           HH&B       No          Nonresponse and imputation not considered in the model.
                                         PS&T       No
                                         Lessler    Yes
                                         P&G        Yes

  Correlation between response
  and imputation errors                  HH&B       No          Nonresponse and imputation not considered in the model.
                                         PS&T       Yes
                                         Lessler    Yes
                                         P&G        Yes

  Interaction Variance
    Simple and correlated                HH&B       No          This term, though not included in the final model, was
                                                                discussed and shown in the development of the model,
                                                                but then assumed zero.
                                         PS&T       No          Assumes complete census, so not applicable.
                                         Lessler    Yes         Assumes x_is ≠ x_i because of imputation.
                                         P&G        No

  Variance due to variation in the
  events of responding or not
  responding                             HH&B       No          Nonresponse and imputation not considered in the model.
                                         PS&T       Yes
                                         Lessler    Yes
                                         P&G        Yes

  Covariance component due to events
  of responding or not responding
  pertaining to pairs of units           HH&B       No          Nonresponse and imputation not considered in the model.
                                         PS&T       No          Cov(δ_i, δ_j) assumed = 0.
                                         Lessler    No          Cov(δ_i, δ_j) assumed = 0.
                                         P&G        Yes

Bias Terms

  Response Bias                          HH&B       Yes
                                         PS&T       Yes
                                         Lessler    Yes
                                         P&G        Yes         Terms combined differently for response and imputation
                                                                bias, but still included.

  Imputation Bias                        HH&B       No          Nonresponse and imputation not considered in the model.
                                         PS&T       Yes
                                         Lessler    Yes
                                         P&G        Yes         Terms combined differently for response and imputation
                                                                bias, but still included.

¹HH&B = Hansen, Hurwitz, and Bershad; PS&T = Platek, Singh, and Tremblay; P&G = Platek and Gray.
Let a population consist of N elements such that

    X_i = 1 if the i-th member of the population has a particular characteristic,
        = 0 otherwise,

and let X = (1/N) Σ_{i=1}^{N} X_i be the "true value" of the population to be estimated. Further, let Y_itG be an observation on the i-th unit in the survey, where

    Y_itG = 1 if the i-th unit in the survey is said to have the characteristic on the t-th trial,
          = 0 otherwise,

and let p_tG = (1/n) Σ_{i=1}^{n} Y_itG be an estimate of X. Let E(p_tG) = P_G be the average value over all possible trials, including all possible samples and all possible responses, under the general conditions under which the census or survey is taken, and let the bias be B_G = P_G − X. Then the total variance of the estimate p_tG is denoted σ²_{p_tG} = E(p_tG − P_G)², and the mean square error of the estimate is

    MSE(p_tG) = σ²_{p_tG} + B_G².

The subscript G may be dropped for convenience for a particular survey. Then let the conditional expected value over all possible measurements on a particular unit under general conditions G be E(Y_it) = P_i, where P_i = X_i + B_i and B_i is the bias term for the i-th individual, and let the difference between the observed value on a particular survey (the t-th trial) and the expected value for that unit be denoted e_Rit = Y_it − P_i, where e_Rit is referred to as the response deviation. Let p̄ = (1/n) Σ_{i=1}^{n} P_i. It should be noted that the conditional expected value for the i-th unit in the population over all possible repetitions of the survey with a fixed sample s is E_{t|s}(Y_it) = P_is, which is not necessarily equal to P_i, because the responses obtained in a sample may be influenced by other units in the sample. Koch (1973) shows that the interaction term is zero if the expected response for the i-th population element is no different for those samples which contain both the i-th and i'-th elements than for those which contain the i-th but not the i'-th.

The authors show that the total variance may be divided into components such that

    σ²_{p_t} = E(p_t − P)² = E[(p_t − p̄) + (p̄ − P)]²
             = E(p_t − p̄)² + 2E(p_t − p̄)(p̄ − P) + E(p̄ − P)²,

where

    E(p_t − p̄)² = σ²_{R_t} = E[(1/n) Σ_{i=1}^{n} e_Rit]² = E(ē²_Rt)

is the response variance of p_t;

    E(p̄ − P)² = σ²_p = E[(1/n) Σ_{i=1}^{n} (P_i − P)]²

is the sampling variance of p_t; and 2E[(p_t − p̄)(p̄ − P)], twice the covariance of ē_Rt and p̄, is assumed zero in the further derivation of the model.
The response variance term σ²_{R_t} can be decomposed into simple and correlated response components such that

    σ²_{R_t} = E(ē²_Rt) = (σ²_R/n)[1 + ρ(n − 1)],

where σ²_R = E(e²_Rit) is the simple response variance, and

    ρ = E(e_Rit e_Ri't)/σ²_R,  i ≠ i',

is the intraclass correlation among response deviations in a survey or trial. Its impact on the response variance can, in many instances, be substantial even for small values of ρ.
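The factor [1 + ρ(n − 1)] is what makes even a small intraclass correlation costly. The short simulation below is only a sketch with invented parameter values: response deviations share an interviewer-level component, and the empirical variance of the mean deviation is compared with the formula. Because the correlation here is confined to interviewer workloads, the n − 1 of the formula becomes the workload size minus one, which is equivalent to averaging ρ over all pairs in the sample.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design: 20 interviewers, each with a workload of 25 units.
n_int, workload = 20, 25
n = n_int * workload
sigma2_b, sigma2_w = 0.02, 0.18          # shared (interviewer) and individual parts
sigma2_R = sigma2_b + sigma2_w           # simple response variance per unit
rho = sigma2_b / sigma2_R                # intraclass correlation within a workload

def mean_response_deviation():
    b = rng.normal(0.0, np.sqrt(sigma2_b), size=n_int)             # interviewer effects
    e = rng.normal(0.0, np.sqrt(sigma2_w), size=(n_int, workload)) # individual deviations
    return (b[:, None] + e).mean()                                 # mean response deviation

sims = np.array([mean_response_deviation() for _ in range(20_000)])

print("empirical variance of mean deviation:", sims.var())
print("(sigma2_R/n)[1 + rho(n_bar - 1)]    :", sigma2_R / n * (1 + rho * (workload - 1)))
```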
Platek, Singh, and Tremblay (1977) developed a simple response-nonresponse error model and components of the model under various imputation procedures. The model is presented in the context of a complete enumeration and uses the concept of response probabilities. It is based on the assumption that every unit in the population has a given, but generally unknown, probability of responding if selected into the sample and given the survey conditions. In this context nonresponse bias is directly proportional to the differences of response rates between units having a characteristic in the population.

The authors' model is concerned with the estimation of a population total using complete enumeration such that X = Σ_{i=1}^{N} X_i. They identify two steps in obtaining responses. The first step is
identifying and contacting a respondent and soliciting a response. This step is identified with the response-nonresponse variable δ discussed below. The second step is obtaining a current response or, in the case of nonresponse, imputing. This step is associated with all possible trials t of the survey.

For a particular unit let X_i denote the true value, Y_it the observed value for responding units on trial t, and Z_it the imputed value for nonresponding units on trial t. Further, let the random variable δ_i = 1 if the unit is responding and 0 otherwise. Then the estimate of X_i obtained at the t-th trial is

    X̂_i = δ_i Y_it + (1 − δ_i) Z_it.

We let Y_it = X_i + E_Rit, where E_Rit = e_Rit + B_Ri, and Z_it = X_i + E_NRit, where E_NRit = e_NRit + B_NRi. Then

    X̂_i = δ_i (X_i + E_Rit) + (1 − δ_i)(X_i + E_NRit),

where E_Rit and E_NRit are random errors due to response and nonresponse, respectively.

Here E(δ_i) = Pr(δ_i = 1) = p_i, the response probability for the i-th unit, and Var(δ_i) = p_i(1 − p_i). The expectation of E_Rit over trials is B_Ri, the response bias, and its variance is the simple response variance for the i-th unit. The authors assume that both Cov(δ_i, δ_i') and Cov_t(e_Rit, e_Ri't), the correlated response variance, are zero for i ≠ i' for model simplification. Similarly, the expectation of E_NRit, the random nonresponse error component, is the imputation bias for unit i, and its variance is the imputation variance for unit i.
Letting the estimate of the total at trial t be X̂ = Σ_{i=1}^{N} [δ_i Y_it + (1 − δ_i) Z_it], where Z_it differs under different imputation procedures, the authors give general expressions for the bias and variance of the estimate. The bias of the estimate was found to be

    Bias(X̂) = E(X̂) − X = Σ_{i=1}^{N} p_i B_Ri + Σ_{i=1}^{N} (1 − p_i) B_NRi,

where the first term is the response bias and the second term is the nonresponse bias. The variance is

    Var(X̂) = Var_δ E_{t|δ}(X̂) + E_δ Var_{t|δ}(X̂),

where the first term is the nonresponse variance and the second term can be shown to consist of the response variance, the imputation variance, and the covariance between the response and imputation errors.

The authors develop the bias and variance of the estimates under different imputation procedures: a) no adjustment for nonresponse; b) a uniform correction factor; c) weighting class adjustment, usually used with total nonresponse; and d) the use of external data as a substitute for the missing responses.
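The bias expression above can be checked directly by simulation. The sketch below uses arbitrary illustrative values for the response probabilities and for the response and imputation biases (none of them come from the paper); it draws the indicators δ_i, forms the complete-enumeration estimate, and compares its average error with Σ p_i B_Ri + Σ (1 − p_i) B_NRi.

```python
import numpy as np

rng = np.random.default_rng(3)

N = 500
x    = rng.normal(100.0, 15.0, size=N)     # true values X_i (hypothetical)
p    = rng.beta(4.0, 2.0, size=N)          # response probabilities p_i
B_R  = rng.normal(0.5, 0.2, size=N)        # response biases B_Ri
B_NR = rng.normal(-2.0, 0.5, size=N)       # imputation biases B_NRi
sd_R, sd_NR = 3.0, 5.0                     # response / imputation error std. deviations

def one_trial():
    delta = rng.random(N) < p                            # respond or not
    y = x + B_R + rng.normal(0, sd_R, size=N)            # observed values Y_it
    z = x + B_NR + rng.normal(0, sd_NR, size=N)          # imputed values Z_it
    return np.where(delta, y, z).sum()                   # X_hat for a complete enumeration

errors = np.array([one_trial() - x.sum() for _ in range(5_000)])
bias_formula = (p * B_R).sum() + ((1 - p) * B_NR).sum()
print("simulated bias:", errors.mean())
print("formula       :", bias_formula)
```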
Lessler (1979) expanded the survey error model by drawing upon the work of Hansen and Hurwitz (1946), Politz and Simmons (1949), Hansen, Hurwitz, and Bershad (1961), and Platek, Singh, and Tremblay (1977). The model is presented in the context of a panel survey employing simple random sampling where each individual has a particular unknown probability of responding if selected in the sample. For simplicity it is assumed here that the survey is one-time; thus the subscript for survey time or panel is dropped. Let

    Y_it = the measurement or response obtained for the i-th population unit at the t-th trial
         = X_i + B_Ri + e_Rit,

where X_i is the true value, B_Ri is the fixed bias term, and e_Rit is the random error component; and let

    Z_its = the imputed response for a particular individual or unit that does not respond, dependent on a particular sample s.

Letting δ_i again be the random variable associated with responding or not responding, i.e., δ_i = 1 if the i-th individual or unit responds and 0 otherwise, then

    X̂_i = δ_i Y_it + (1 − δ_i) Z_its.

Lessler develops the MSE of x̄_t = (1/n) Σ_{i=1}^{n} X̂_i, which is an estimate of the population mean X̄. She considers the following components:
measurement or response variance, due to simple and correlated measurement or response variance; sampling variance; interaction variance, due to interaction between sampling errors and measurement or response errors; and bias. In addition to the two steps associated with the response-nonresponse random variable δ and the repeated trials t, Lessler considers the step of drawing the sample s.

The expected value of the estimate is

    E(x̄_t) = E_s{E_{δ|s}[E_t(x̄_t|s,δ)]} = X̄ + B_NR + D̄,

so that the bias is B_NR + D̄, where B_NR is the net bias due to imputation of missing data and

    D̄ = (1/N) Σ_{i=1}^{N} p_i (B_Ri − B_NRi),

a function of the response probabilities and the difference between the expected error of measurement and the expected error of imputation for the i-th individual.

The variance of x̄_t is

    Var(x̄_t) = E_s E_{δ|s} Var_t(x̄_t|s,δ) + E_s Var_{δ|s} E_t(x̄_t|s,δ) + Var_s E_{δ|s} E_t(x̄_t|s,δ).

The first of these terms consists of a component of the form {σ*_R + (n−1)ρ_R σ*_R}, the effect due to measurement error in the observed values, or response error; a component of the form {σ*_NR + (n−1)ρ_NR σ*_NR}, due to measurement error in the imputed values; and a component involving [2(n−1)/n] σ_R,NR, due to an interaction (or correlation) between the two. The second term, E_s Var_{δ|s} E_t(x̄_t|s,δ), is due to the variability of the difference between the response and nonresponse bias terms. The third term, Var_s E_{δ|s} E_t(x̄_t|s,δ), consists of sampling variance terms together with interaction variance terms due to interaction between sampling errors and measurement (or response) errors.
Platek and Gray (1980) looked at the concept of total survey error in the context of missing data due to nonresponse. Their paper "considers the bias, sampling and nonsampling variances, and resultant mean square errors of estimates under various imputation procedures." The various estimates are developed in the context of probability proportionate to size sampling, where the estimate can be expressed as a weighted sum of observed or imputed values of individual units; the weights for these Horvitz-Thompson estimates are the inverses of the selection probabilities. As with the earlier Platek et al. and Lessler models, each unit in the population is assumed to have a certain unknown response probability attached to it depending on the conditions of the survey.

Assuming that the adjustment cells used in the imputation procedures are mutually exclusive, then "the expected value of an estimate under any imputation procedure and its bias and variance is derived by taking expected values in the following stages: a) expected value over all possible samples; b) expected value over all possible subsamples of respondents, given the sample; and finally, c) the expected value over all possible responses, given the sample and the subsample of respondents."

The authors developed the MSE of the estimate of a total X̂_G under different imputation procedures, where G indicates imputation in general. Table 1.3.1 indicates which components are included in their model. The estimate is a weighted sum of the quantities δ_i Y_it + (1 − δ_i) Z_it with weights 1/π_i, where X_i + E_Rit = Y_it and X_i + E_NRit = Z_it. These terms, E_Rit, δ_i, and E_NRit, are as defined in the Platek, Singh, and Tremblay paper discussed above. Here π_i is the probability of selection of the i-th unit. The imputation procedures studied are the a) weighting method, b) duplication method, c) historical data substitution method, and d) zero substitution method.
1.4 Development of Estimators
Estimators for model components including those of Hansen,
Hurwitz, Bershad (1961), Hansen, Hurwitz, Pritzker (1964), Fellegi
(1964), and Lessler (1974) are discussed. These involve the estimation of components from models which do not consider nonresponse
as a separate component.
Presented in this section also are methods
for estimating the probabilities of responding developed by Politz
and Simmons (1949) and Thomsen and Siring (1979).
1.4.1 Response Variance - Simple and Correlated

Hansen, Hurwitz, and Bershad (1961) present estimators of the response variance components of their MSE model through two alternative designs of experiment. The first method is that of repetition, in which the survey procedure is repeated on the sample or population, but using different processors, crew leaders, and interviewers. Considering other survey conditions as fixed, an estimator of the total response variance is (p₁ − p₂)²/2, where p_i is the estimate of the proportion of the population with a specific characteristic on the i-th repetition. The expected value of the estimate is approximately the average of the total response variances of the two repetitions less a covariance term, provided P̄_1i and P̄_2i are approximately equal. Then if σ²_1 = σ²_2 (the total response variances from repetitions 1 and 2) and the correlation term is 0, the estimate is unbiased. The most serious limitation of this method of estimating the response variance is the assumption of no correlation between the two trials, since, for example, the respondent may remember his first response and tend to respond the same on the next trial; the other two assumptions, that P̄_1i equals P̄_2i and that σ²_1 and σ²_2 are equal, will affect the estimate to a lesser extent, since the trials are conducted at two different points in time and conditions may change.
The second design presented by the authors was that of interpenetrating samples, first suggested by Mahalanobis for surveys in India. Briefly, workloads are randomly assigned to interviewers, crew leaders, processors, etc. Then let p₁ be the estimate of the proportion for the work of one processor and p₂ the proportion for the work of another, and let

    C = (p₁ − p₂)²/2   and   D = Σ_{i=1}^{M} p_i q_i / [(n̄ − 1)Mn],

where p_i is the proportion estimated to have the characteristic in the i-th assignment, 1 − p_i = q_i, M is the total number of interviewers, n̄ is the number of units in an interviewer's assignment, and n is the number of units in a stratum (one processor's assignment). Then, since E(C − D) = σ²_R(ρ_w − ρ_b), C − D is an approximate estimate of ρ_w σ²_R, the correlated response deviations within processor assignments (including joint effects of processors, crew leaders, and interviewers), where ρ_b σ²_R, the correlated response deviations between processor assignments, is generally considered small in comparison to ρ_w σ²_R. In practice, this design would be conducted in a large sample of strata and the results averaged over strata.
Hansen, Hurwitz, and Pritzker (1964) presented an estimator for the simple response variance of an estimate of a proportion P_G, where G indicates the general conditions under which the survey is taken. The estimator requires the use of a repetition of the survey, t', under conditions G'. Then we have the following cross-classification of the n units over the pair of trials:

                                      Original trial t
    Repetition t'                 With attribute   Without attribute   Total
    With attribute (Y_it'G' = 1)        a                 b            a + b
    Without attribute (Y_it'G' = 0)     c                 d            c + d
    Total                             a + c             b + d            n

where a, b, c, and d are the frequencies observed in the pair of trials and n_t = n = a + b + c + d. Then it can be shown that

    g = (1/n) Σ_i (Y_itG − Y_it'G')² = (b + c)/n,

where Y_itG and Y_it'G' take on the values 0 or 1. Assume the following:

    a) that G and G' represent the same survey conditions, and
    b) that t and t' denote independent repetitions on a single element;

then g/2 is an unbiased estimate of the simple response variance (SRV). However, trials t and t' are not often independent. Still assuming G = G', the expectation of g falls short of twice the SRV by twice the between-trial covariance of the response errors, ρ_{RG,RG'} σ_RG σ_RG'. When ρ_{RG,RG'} is positive, the expected case, g/2, where g is referred to as the gross difference rate, understates the SRV.
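A small numerical sketch of the gross difference rate computation, using an invented 2x2 original-reinterview table of the form shown above.

```python
# Hypothetical reinterview cross-classification of the same n units on a
# 0/1 attribute (a, b, c, d as in the table above).
a, b, c, d = 410, 35, 52, 503
n = a + b + c + d

g = (b + c) / n            # gross difference rate: mean of (Y_itG - Y_it'G')^2
srv_hat = g / 2            # estimate of the simple response variance
print("gross difference rate g:", round(g, 4))
print("estimated SRV (g/2)    :", round(srv_hat, 4))
# If the two trials are positively correlated, g/2 understates the SRV.
```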
Fellegi (1964) looked at the model developed by Hansen, Hurwitz, and Bershad and extended it to cover the repetition of the original survey. With the model extended, the author was able to develop estimators of the model with the use of repetition and interpenetrating samples. Briefly, a simple random sample of np units is selected and this sample is partitioned randomly into p subsamples of n units. These are, in turn, randomly paired so that each subsample appears in two pairs; each is part of the original sample and of the repeat sample. The p enumerators are randomly assigned to the p pairs. Then the first subset in each pair constitutes the interviewers' first assignment, and thus the original sample; the second subset in each pair constitutes the interviewers' second assignment, and thus the repeat survey.

Fellegi defines the following 15 parameters:

    σ²_Rt  = simple response variance for the t-th sample or trial, t = 1,2,
    σ²_St  = sampling variance, t = 1,2,
    π      = correlation coefficient between Y_j1 and Y_j2, where Y_jt is an expected
             measurement on response from the j-th unit in the t-th survey over all
             possible samples, subsamples, and pairs,
    ρ_R2t  = correlation of response deviations obtained by the same enumerator in the
             same survey, t = 1,2,
    ρ_R3t  = correlation of response deviations obtained by different enumerators in the
             same survey, t = 1,2,
    ρ_RSt  = correlation of sampling and response deviations within enumerators, t = 1,2,
    β₁     = correlation of response deviations obtained in the two surveys for the same unit,
    β₂     = correlation of response deviations obtained by the same enumerator in
             different surveys,
    β₃     = correlation of response deviations obtained for different units by different
             enumerators in different surveys, but in the same subsample,
    β₄     = correlation of response deviations obtained for different units by different
             enumerators in different surveys and different subsamples.

Fellegi further defines the following sums of squares: a) for each of the two surveys, between and within enumerators; b) between the two surveys, within sampling units, within subsamples but between enumerators, and within enumerators but between subsamples (which is linearly independent of the previous sums of squares if and only if k > 2).

Fellegi makes use of linear combinations of these sums of squares to estimate the parameters. Since it is not possible to obtain unbiased estimates of all the parameters, Fellegi assumed certain properties of the parameters to provide biased estimates of the more important parameters in terms of less important parameters that are thought small in magnitude. If the survey conditions of the repeat survey are similar to those of the original, then it can be assumed that σ_R1 and σ_R2; ρ_R2t and β₂; ρ_R3t and β₃; and σ²_St and π σ_S1 σ_S2 are close to one another. Also ρ_R3t, β₃, and β₄ may be considered small. Then Fellegi provided biased estimates of the following parameters:

    a) σ²_Rt,
    b) σ²_R1 + σ²_R2,
    c) ρ_R21 σ²_R1 + ρ_R22 σ²_R2,
    d) ρ_RSt σ_St σ_Rt,
    e) ρ_RS1 σ_S1 σ_R1 + ρ_RS2 σ_S2 σ_R2,
    f) π σ_S1 σ_S2,

and, for p > 2, (β₂ − β₄) σ_R1 σ_R2.
1.4.2 Estimation of Mean Square Error Components in a Double Sampling Scheme to Eliminate Bias

Madow (1965) suggested a double sampling scheme to reduce or eliminate bias in which a simple random sample is selected and from which a simple random subsample is then selected. For the subsample, observations are obtained from the respondents and, in addition, "true values" are obtained from records. The observations obtained for the subsample allow the estimation of mean square error variance components, while the "true values" allow the elimination of bias from the sample estimates.
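A brief simulation sketch of the idea: correcting the sample mean by the reported-minus-record difference observed in the subsample removes the response bias. The population, the bias structure, and the particular correction used here are hypothetical illustrations of the scheme, not Madow's or Lessler's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical population with a systematic response bias: reported values
# exceed the true (record) values on average.
N = 20_000
x_true = rng.normal(50.0, 10.0, size=N)
bias = rng.normal(2.0, 1.0, size=N)                 # unit-level reporting biases

n, n1 = 800, 200                                    # sample and record-check subsample

def record_check_estimate():
    sample = rng.choice(N, size=n, replace=False)
    sub = rng.choice(sample, size=n1, replace=False)
    y_sample = x_true[sample] + bias[sample] + rng.normal(0, 3.0, size=n)
    y_sub = x_true[sub] + bias[sub] + rng.normal(0, 3.0, size=n1)
    # Correct the full-sample mean by the reported-minus-record difference
    # observed in the subsample.
    return y_sample.mean() - (y_sub.mean() - x_true[sub].mean())

sims = [record_check_estimate() for _ in range(2_000)]
print("true mean          :", x_true.mean())
print("mean of estimates  :", np.mean(sims))        # response bias essentially removed
```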
Lessler (1974) developed estimators for this double sampling scheme under two different sample designs - a self-examination survey and a survey design using interviewers. Using her development of Madow's model, let:

    X_i   = true value for the i-th individual,
    Y_iat = recorded measurement of individual i on the t-th trial and the a-th phase,
            where phase 1 is the original sample and phase 2 the subsample,
    U_i   = 1 if the i-th element is in the original sample, and 0 otherwise,
    V_i   = 1 if the i-th element is in the subsample, and 0 otherwise.

It is assumed that the expected value over trials of Y_iat does not depend on a, i.e., E_t(Y_iat) = Y_i, and that there is no interaction between the sampling and measurement (or response) errors. Then an estimator w_t for X̄ = (1/N) Σ_{i=1}^{N} X_i is given by

    w_t = ȳ_1t − ȳ_2t + x̄_2,

where

    ȳ_1t = (1/n) Σ_{i=1}^{N} U_i Y_i1t, the average of the elements in the original sample,
    ȳ_2t = (1/n₁) Σ_{i=1}^{N} V_i Y_i2t, the average of the elements in the subsample, and
    x̄_2  = (1/n₁) Σ_{i=1}^{N} V_i X_i, the average of the true values for the elements
           in the subsample.
The estimator w_t was shown to be an unbiased estimator of X̄. Its variance contains a response variance contribution with the factor (n + n₁)/(nn₁), where σ²_R is the variation in Y_iat due to trial-to-trial and phase-to-phase variation, together with terms involving σ², the population variance, and σ²_B, which is due to the variation of the individual bias terms around the net bias.

Lessler developed three sample quantities from the subsample, S²_X, S²_B, and S²_W, whose expectations are, respectively, σ²; σ²_R + σ²_B, under the assumption that the correlated response variance is zero; and a multiple of σ²_R. Combining these quantities then yields estimators σ̂², σ̂²_R, and σ̂²_B of the sampling, simple response, and bias-variation components.
Lessler assumes the general model to hold in the case of the interviewer model. In addition, she assumed that each individual in the original sample is assigned at random to one of k interviewers, so that each interviewer interviews the same number of individuals, n/k; likewise, each member of the subsample is assigned at random to one of the k interviewers, so that each interviewer has n₁/k subsample individuals. With w_t again the estimator of the sample mean, Lessler developed estimators for five variance components of w_t:

    1. Interviewer random effect variance;
    2. Interviewer-individual interaction random effect variance;
    3. Interviewer-individual interaction fixed effect variance;
    4. Sampling variance; and
    5. Systematic error variance.
1.4.3 Response Probabilities

There has been limited research in the development of estimators involving the probability of responding or not responding. Politz and Simmons, as far back as 1949, considered the treatment of nonresponse as a stochastic process. They suggested this concept as an alternative to the use of callbacks to get "not at homes" into the sample. Other types of nonrespondents, such as refusals, which at present make up a large proportion of total nonresponse, the incapacitated, etc., were not considered in the development. In their development of an estimator for some characteristic total, considering the probability of "being at home", they used the following plan (Politz and Simmons, 1949):

    a) Each person in the sample is visited once, at a time determined at random from
       all times during which interviews are to be conducted.
    b) Each person interviewed is questioned as to whether or not he was at home at six
       specific instances, determined at random, including the instance of the interview.
       This permits an estimate of the proportion of time he is at home during
       interviewing hours.
    c) Questionnaires are divided into six groups according to the estimated proportion
       of time persons in each group are at home, i.e., 1/6, 2/6, ..., 6/6 of the time
       for groups one to six, respectively.
    d) The sample estimate, for any variable under study, is produced by weighting the
       results for each group by the reciprocal of the estimated percent of time persons
       in the group are at home, i.e., X̂ = 6 Σ_{j=1}^{n} x_j/i, where x_j is the value of
       the variable under study for the j-th person, n is the number of persons found at
       home, and i is the number of nights any individual is at home out of the six
       nights, including and just preceding the night of the interview. (A numerical
       sketch of this weighting appears below.)
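A minimal numerical sketch of the weighting in step d), with invented values for x_j and the at-home counts; the array names are illustrative only.

```python
import numpy as np

# Hypothetical respondents found at home on a single randomly timed visit.
# x[j] = study variable; nights[j] = nights at home out of the six asked about.
x = np.array([120.0, 80.0, 200.0, 45.0, 150.0, 95.0, 60.0, 175.0])
nights = np.array([6, 2, 5, 1, 4, 3, 2, 6])

# Politz-Simmons estimate: weight each respondent by the reciprocal of the
# estimated proportion of time at home (nights/6), i.e., by 6/nights.
weights = 6.0 / nights
X_hat = np.sum(weights * x)

print("weights applied             :", weights)
print("Politz-Simmons weighted sum :", round(X_hat, 1))
```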
The population to which the sample estimate relates is restricted to those individuals who are at home at least some of the time during the interviewing hours; further, as stated above, refusals, etc., are not considered. Thus, under these restrictive assumptions the estimator produces unbiased estimates. However, since in practice these assumptions will not hold, the estimator will produce biased estimates to
the extent that interviews cannot be obtained because of refusals,
incapacitation, "not at home" at any time during the interview period,
and that these persons with zero response probabilities have different
responses from those with greater than zero response probabilities.
Thomsen and Siring (1979) considered a probabilistic model for nonresponse. In the development of the model three possible outcomes of an interviewer's attempt to interview a household were considered:

    a) the interviewer gets a response;
    b) the interviewer gets no response and decides to call back;
    c) the interviewer gets no response and decides to categorize the household as a
       nonresponse or refusal.

Let p be the probability that (a) occurs on the first call, and f the probability that outcome (c) occurs on the first visit. Assume that f is constant over the successive visits, but that (a) occurs with probability kp in the second and following calls, where k is expected to be larger than 1 because of efforts on the interviewer's part to determine availability. Then, letting C denote the outcome that the interviewer gets a response from a selected household on the c-th visit,

    P(C = 1) = p,
    P(C = 2) = (1 − p − f) kp,
    P(C = 3) = (1 − p − f)(1 − kp − f) kp,

so

    P(C = c) = p,                                  if c = 1,
             = (1 − p − f)(1 − kp − f)^{c−2} kp,   if c ≥ 2.
Then p, k, and f are estimated from the sample, assuming information is available concerning the number of calls for each selected household; p can be generalized to vary between households or persons. The authors mention the possibility of generating p from a beta distribution and the possibility of assuming p is constant within certain subclasses of the population but varies among them. In their work they chose the latter, where in one case the subclasses were age groups and women with i live births, i = 0, 1, 2, ..., 6. The survey with which the authors worked was the Norwegian Fertility Survey.
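To show how the model behaves numerically, the sketch below evaluates P(C = c) and the cumulative response probability for a few callback limits. The values of p, k, and f are invented for illustration and are not estimates from the Norwegian Fertility Survey.

```python
# Illustrative values: first-call response probability, later-call multiplier,
# and per-call refusal probability (all hypothetical).
p, k, f = 0.45, 1.4, 0.05

def prob_response_at_call(c):
    """P(C = c) under the callback model described above."""
    if c == 1:
        return p
    return (1 - p - f) * (1 - k * p - f) ** (c - 2) * k * p

# Probability of obtaining a response within the first m calls.
for m in (1, 2, 3, 5):
    cum = sum(prob_response_at_call(c) for c in range(1, m + 1))
    print(f"response within {m} call(s): {cum:.3f}")
```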
The authors used their model to adjust for nonresponse bias and in studying the relationship between the mean square error and the number of callbacks. Using three callbacks and women with i live births as subclasses, the authors found that bias was reduced by little more than 40 percent, but felt the reduction could be increased. In studying the relation of the MSE to the number of callbacks, they found that
their model tended to overestimate the number of responses in later
calls.
It should be noted that in looking at the MSE, the authors
did not look at separate components of bias or variance, but merely
nonresponse bias and overall variance.
1.5 Summary
Nonsampling errors are a major part of the total error of both censuses and surveys. Whereas they have long taken a back seat to sampling errors, within more recent years they have come to claim their rightful place in the survey literature.

A major paper was written by Hansen, Hurwitz, and Bershad in 1961 which has provided the impetus for the development of a number of mean square error decomposition models and the development of estimators for the components of these models. That particular model contained a sampling variance component, simple and correlated response variance, and bias. Some more complicated models by Platek et al. and Lessler have been developed more recently which treat nonresponse as a stochastic process and allow for imputation of data, whereas some earlier models did not treat nonresponse explicitly.

In the area of development of estimators for nonsampling components of the mean square error, the work on the estimation of correlated response variance appears to be the most dominant, with actual empirical results from U.S. and Canadian censuses, the National Crime Survey, and the like. These estimates sometimes suffer, however, from high variances. Simple response variance is estimated through repetition of the survey, but these estimates suffer from the correlation between supposedly independent trials which almost always exists.

Bias is an even more difficult component to estimate and/or eliminate, as the "true value" is not known. Madow (1965) suggested a double sampling scheme to eliminate (and also estimate) bias with the use of records for a subsample of units, or to decrease bias with the use of intense interviewing. His scheme and the estimators developed by Lessler of the mean square error components necessarily suffer from the all but impossible task of obtaining "true values".

The development of estimators of mean square error components for models in which nonresponse is considered is an area where more research is needed. Some work has been done in estimating the probability of responding where nonresponse is treated stochastically, but neither of the two papers reviewed here was in the context of mean square error components other than total variance or nonresponse bias. Further, though the double sampling scheme has been used to correct for nonresponse, it too has not been developed in the context of a full MSE model.

As such, it is noted that in the area of total survey error, knowledge is growing, but due to the inherent difficulty of estimating nonsampling error components, this area of study is still a fertile one.
1.6 The Research
In the chapters that follow, the mean square error of the Hansen and Hurwitz double sampling estimator is extended to include measurement error in sample surveys. In addition, the model is developed in a probabilistic setting; that is, nonresponse is treated as a stochastic process rather than deterministically, in the manner of the Lessler (1979) and Platek et al. (1977, 1980) models.
Estimators for the mean square error components under this double
sampling scheme are developed using the methods of repetition and
interpenetrating samples as discussed in this chapter.
However, because studying the interrelationship of components would otherwise require an experimental design in the field, the expected mean square error (E(MSE)) is developed assuming the observed values and response probabilities are drawn from a superpopulation. Using presumed parameter values, the E(MSE) of this model is then studied and compared to two other models: a no-adjustment model and a Politz-Simmons model.
"
.
",
"v
.-
Finally, we study the -estimation of response probabilities
empirically.
•
•
CHAPTER II
THE SURVEY ERROR MODEL AND OPTIMUM SUBSAMPLING SCHEME
2.1 Introduction
Most, if not all, surveys of any scope are plagued with the problem of nonresponse, and the survey statistician must decide on the method of handling it. In 1946 Hansen and Hurwitz suggested a double sampling scheme; in 1949 Politz and Simmons suggested a procedure to be used in lieu of callbacks in obtaining an estimate, that is, estimating the probability of response for each sampling unit.
If the response probabilities were known or could be reasonably estimated for each individual, and response bias were the only concern in estimation, then a Politz-Simmons type estimator of the sample mean, in which the responses for a particular unit are weighted up by the inverse of the unit's response probability, would be sufficient.
However, if there is wide variation in these response probabilities, some near 0 and some near 1, then the variable weighting could substantially increase the variance (Platek and Gray, 1980). Thus, even when nonresponse is a stochastic process, other forms of correcting for nonresponse are generally used.
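The variance inflation caused by highly variable inverse-probability weights can be seen in a minimal simulation sketch (added here for illustration, not part of the original text; all parameter values and names are assumed). Because the values are generated independently of the response probabilities, both means are nearly unbiased, which isolates the effect of the weights on the variance.

```python
import numpy as np

rng = np.random.default_rng(1982)
N = 10_000
y = rng.normal(50.0, 10.0, size=N)                 # true values
p = np.clip(rng.beta(0.5, 0.5, size=N), 0.02, 1.0) # response probabilities, many near 0 or 1

def one_trial(n=500):
    s = rng.choice(N, size=n, replace=False)       # simple random sample
    delta = rng.random(n) < p[s]                   # response indicators
    resp = s[delta]
    unweighted = y[resp].mean()                    # no-adjustment respondent mean
    w = 1.0 / p[resp]                              # Politz-Simmons type weights
    weighted = np.sum(w * y[resp]) / np.sum(w)     # weighted respondent mean
    return unweighted, weighted

draws = np.array([one_trial() for _ in range(2000)])
print("variance of unweighted respondent mean:", round(float(draws[:, 0].var()), 3))
print("variance of 1/P-weighted mean:         ", round(float(draws[:, 1].var()), 3))
```

With response probabilities concentrated near 0 and 1, the weighted mean typically shows a variance several times that of the unweighted mean, which is the trade-off noted above.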
This research looks again at the Hansen and Hurwitz treatment of nonresponse in which double sampling is used. It incorporates, however, the probabilistic aspect of the Politz-Simmons model and of the more recent models of Platek et al. (1977, 1980) and Lessler (1979). In addition to this difference, the mean square error of the estimate, which in this case is the sample mean, is developed in terms of "total error", that is, sampling error and measurement error in addition to nonresponse error. However, it is assumed that the correlation of sampling and response deviations, also referred to as the interaction variance, is zero. As in the Hansen and Hurwitz 1946 research, the optimum subsampling rate for the nonrespondents is developed.
2.2 The Estimator and the Model Assumptions
The estimate under consideration in this research is the sample mean. Response from the initially selected sample is assumed to be incomplete; a subsample of nonrespondents is selected and responses are obtained from all of the subsampled units. This assumption is simplistic, as even with double sampling schemes there is usually a hard-core group of nonrespondents who will not respond. The design of surveys in which a high degree of accuracy is important would probably incorporate methods to reduce the bias resulting from even this residual nonresponse. Data collection is through the use of interviewers in both the initial sample and the subsample of nonrespondents. However, some difference in the procedure used in the two phases of the survey is assumed.
For example, the response may be solicited from the initial sample by telephone and by personal visit in the subsample of nonrespondents, since personal visits tend to elicit more complete response. An alternative is personal visit at both phases of the survey, but with more intensive interviewing techniques employed in soliciting response from the subsample of nonrespondents. This combination may be used in surveys in which, because of the length or complexity of the survey, personal visit is warranted to acquire data of high quality. An effort to obtain information from neighbors on the best time to call may solve the problem of not-at-homes, or a telephone call soliciting cooperation and arranging a time of visit may change a refusal unit into a respondent.
In anticipation of the development of estimators of the mean square error components, the error model is developed for a design in which sampling units are randomly assigned to interviewers; that is, the Mahalanobis method of interpenetrating samples is used. The assignment of units to interviewers is unrestricted in the sense that the number assigned to each interviewer is random. This was done to simplify development of the model.
Specifically, the survey involves the following steps:
a) A random sample s of n units is selected from a population of size N without replacement.
b) The n sample units are randomly assigned to J interviewers, where J is assumed fixed.
c) A response is solicited from each household by a particular interviewer, such that the probability of responding or not responding does not depend on the interviewer.
d) A subsample ss is randomly selected from the nonrespondents at the rate of 1 in k, where k is fixed.
e) The subsample of nonrespondents is randomly assigned to the J interviewers independently of the initial sample.
f) The process is assumed repeatable from trial to trial, and the probability of responding remains the same over each trial.
Then the mean square error is developed for the sample mean for a particular trial,

$$\bar{x}_t = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{J} u_{ij}\,\delta_i\,Y_{ij1t} \;+\; \frac{k}{n}\sum_{i=1}^{n}\sum_{j=1}^{J} (1-\delta_i)\,v_i\,w_{ij}\,Y_{ij2t},$$

where
Y_{ij1t} = response obtained from the i-th sample unit by interviewer j in the initial sample at trial t;
Y_{ij2t} = response obtained from the i-th unit in the subsample of nonrespondents by interviewer j at trial t;
delta_i = 1 if the i-th sample unit responds in the initial sample, = 0 otherwise;
u_{ij} = 1 if the i-th sample unit is interviewed by interviewer j, = 0 otherwise;
v_i = 1 if the i-th sample unit is selected in the subsample of nonrespondents, = 0 otherwise;
w_{ij} = 1 if the i-th subsample unit is interviewed by the j-th interviewer, = 0 otherwise;
n = sample size out of N total units;
k = the inverse of the subsampling rate, k = n_2/r_2, where n_2 is the number of nonrespondents to the sample and r_2 is the number of nonrespondents selected in the subsample, given the response indicators.
The sample mean xt is an estimate of the true mean X.
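To make the estimator concrete, the following minimal simulation (added here for illustration, not part of the original text; all parameter values are assumed) carries out steps a) through f) for a single trial with one interviewer region and computes the double sampling mean.

```python
import numpy as np

rng = np.random.default_rng(7)
N, n, k = 5000, 400, 3                       # population size, sample size, inverse subsampling rate
x_true = rng.normal(100.0, 15.0, size=N)     # true values X_i
p_resp = rng.beta(6, 2, size=N)              # response probabilities P_i

# a) simple random sample without replacement
s = rng.choice(N, size=n, replace=False)

# c) response solicitation: delta_i = 1 with probability P_i
delta = rng.random(n) < p_resp[s]
y1 = x_true[s] + rng.normal(0.0, 2.0, size=n)          # initial-phase responses (with error)

# d) subsample 1 in k of the nonrespondents, responses obtained from all of them
nonresp = np.where(~delta)[0]
r2 = max(1, len(nonresp) // k)
sub = rng.choice(nonresp, size=r2, replace=False)
y2 = x_true[s][sub] + rng.normal(0.0, 1.0, size=r2)    # subsample responses (smaller error)

# double sampling estimate: respondents plus k-weighted subsampled nonrespondents
k_hat = len(nonresp) / r2                              # realized n2 / r2
x_bar = (y1[delta].sum() + k_hat * y2.sum()) / n
print("double sampling mean:", round(x_bar, 2), " population mean:", round(x_true.mean(), 2))
```

Averaging the estimate over repeated trials and samples approximates the expectation developed in Section 2.3.1.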
The indicator variable $\delta_i$ is assumed to be a random variable; and, as stated above, the probability of responding or not responding is assumed not to depend on the interviewer, that is,

$$P(\delta_{ij}=1\mid s) = P(\delta_i=1\mid s) = P_i, \qquad P(\delta_i=0\mid s) = 1-P_i, \qquad E(\delta_i\mid s) = P_i.$$

It is further assumed that the event of responding or not responding is independent from individual to individual; that is,

$$P(\delta_i=1,\ \delta_{i'}=1\mid s) = P_iP_{i'}, \qquad E(\delta_i\delta_{i'}) = P_iP_{i'}.$$

These are simplifying assumptions only; in actual practice it would be reasonable to assume that the probability of a unit responding would be related to the interviewer assigned to the unit, that is, $E(\delta_{ij}) \neq E(\delta_i)$. Further, units observed by the same interviewer would probably have correlated response probabilities. Then, under the more realistic model, we would have $E(\delta_{ij}\delta_{i'j}) = \mathrm{cov}(\delta_{ij},\delta_{i'j}) + [E(\delta_{ij})]^2$. Since the covariance term would almost surely be positive, mean square error components of this model involving $E(\delta_i\delta_{i'})$ will be understated.
Conditional on unit i being included in the initial sample s, the indicator variable $v_i$ is random:

$$P(v_i=1\mid s) = \frac{1-P_i}{k}, \qquad P(v_i=0\mid s) = 1 - \frac{1-P_i}{k}.$$

Conditional on both the sample and $\delta_i$, $v_i$ is fixed; that is, $P(v_i=1\mid \delta_i=0, s) = 1/k$ and $P(v_i=1\mid \delta_i=1, s) = 0$.
The interaction variance term, or correlation of sampling and response deviations, is assumed zero, which again may not hold in practice.
Define

$$Y_{ij1t} = X_i + B_{ij1} + \varepsilon_{ij1t}, \qquad Y_{ij2t} = X_i + B_{ij2} + \varepsilon_{ij2t},$$

with $Y_{ij1} \equiv E_t(Y_{ij1t}) = X_i + B_{ij1}$ and $Y_{ij2} \equiv E_t(Y_{ij2t}) = X_i + B_{ij2}$, where $B_{ij1}$ and $B_{ij2}$ are the biases, or constant error terms, associated with a response obtained by interviewer j if unit i responds in the initial sample and if unit i responds in the subsample of nonrespondents, respectively. The two biases are assumed different because of possible inherent differences in response were unit i to respond in the initial sample as opposed to responding in the subsample of nonrespondents; differences in the methods of soliciting response in the initial sample and the subsample; and any interaction of response patterns and collection methods. Then $\varepsilon_{ij1t}$ and $\varepsilon_{ij2t}$ are, respectively, the random errors associated with the i-th sample unit and j-th interviewer in the initial sample and in the subsample of nonrespondents. The following relationships are assumed:

$$\mathrm{cov}(\varepsilon_{ij1t},\varepsilon_{i'j1t}) = E_t\big[(Y_{ij1t}-Y_{ij1})(Y_{i'j1t}-Y_{i'j1})\big] \neq 0\ \text{in general},$$
$$\mathrm{cov}(\varepsilon_{ij1t},\varepsilon_{i'j'1t}) = E_t\big[(Y_{ij1t}-Y_{ij1})(Y_{i'j'1t}-Y_{i'j'1})\big] = 0.$$

Likewise, the same relationships hold for the subsample of nonrespondents.
The first relationship simply indicates that the responses or measurements of any two units within the same interviewer assignment area are correlated, whereas the second relationship indicates that there is no correlation between units within different interviewer assignment areas.
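The error model just described is easy to mimic. The sketch below (an added illustration, not from the original text; all values and names are assumptions) generates responses of the form X_i + B_ij + error in which the errors share an interviewer-level component, so that units in the same assignment area are correlated across trials while units in different areas are not.

```python
import numpy as np

rng = np.random.default_rng(11)
n, J, trials = 120, 6, 200
x = rng.normal(100.0, 15.0, size=n)          # true values X_i
bias_j = rng.normal(0.0, 1.0, size=J)        # constant interviewer biases
assign = rng.integers(0, J, size=n)          # unrestricted random assignment to interviewers

def responses():
    shared = rng.normal(0.0, 1.5, size=J)    # interviewer-level error, common within an assignment area
    indiv = rng.normal(0.0, 2.0, size=n)     # individual response error
    return x + bias_j[assign] + shared[assign] + indiv

errs = np.array([responses() - (x + bias_j[assign]) for _ in range(trials)])

same = [np.cov(errs[:, i], errs[:, j])[0, 1] for i in range(n)
        for j in range(i + 1, n) if assign[i] == assign[j]]
diff = [np.cov(errs[:, i], errs[:, j])[0, 1] for i in range(n)
        for j in range(i + 1, n) if assign[i] != assign[j]]
print("average error covariance, same interviewer:     ", round(float(np.mean(same)), 3))
print("average error covariance, different interviewers:", round(float(np.mean(diff)), 3))
```

The first covariance is close to the variance of the shared interviewer-level error, while the second is close to zero, matching the two relationships above.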
Also, the following relationships are used in the development of the mean square error:

$$E(u_{ij}\mid s) = \tfrac{1}{J}, \qquad E(u_{ij}u_{i'j'}\mid s) = E(u_{ij}\mid s)\,E(u_{i'j'}\mid s) = \tfrac{1}{J^2}\ \ (i \neq i'),$$
$$E(w_{ij}w_{i'j}\mid v_i=1,\delta_i=0,s) = E(w_{ij}\mid v_i=1,\delta_i=0,s)\,E(w_{i'j}\mid v_{i'}=1,\delta_{i'}=0,s) = \tfrac{1}{J^2},$$
$$E(w_{ij}w_{i'j'}\mid v_i=1,\delta_i=0,s) = \tfrac{1}{J^2}, \qquad E(w_{ij}w_{ij'}\mid v_i=1,\delta_i=0,s) = 0.$$

Certain large sample approximations are used in the development of the model. These are indicated as used.
2.3 The Mean Square Error
Both the bias and the variance, the two components of the mean square error, are developed below.
2.3.1 The Bias of $\bar{x}_t$
The bias of $\bar{x}_t$ is defined as the difference between the expected value of the estimate and the true value. First, the expectation of $\bar{x}_t$ is taken over the six survey steps, that is,

$$E(\bar{x}_t) = E_s\,E_{IA_1\mid s}\,E_{\delta\mid IA_1,s}\,E_{ss\mid \delta,IA_1,s}\,E_{IA_2\mid ss,\delta,IA_1,s}\,E_{t\mid IA_2,ss,\delta,IA_1,s}(\bar{x}_t).$$

For simplicity of notation, let the following relationships hold:

$$E_1 = E_s,\quad E_2 = E_{IA_1\mid s},\quad E_3 = E_{\delta\mid IA_1,s},\quad E_4 = E_{ss\mid \delta,IA_1,s},\quad E_5 = E_{IA_2\mid ss,\delta,IA_1,s},\quad E_6 = E_{t\mid IA_2,ss,\delta,IA_1,s}.$$

Then
$$E_6(\bar{x}_t) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{J} u_{ij}\,\delta_i\,Y_{ij1} + \frac{k}{n}\sum_{i=1}^{n}\sum_{j=1}^{J}(1-\delta_i)\,v_i\,w_{ij}\,Y_{ij2},$$

and carrying the expectation through the remaining randomization steps gives

$$E(\bar{x}_t) = \frac{1}{NJ}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i\,Y_{ij1} + \frac{1}{NJ}\sum_{i=1}^{N}\sum_{j=1}^{J}(1-P_i)\,Y_{ij2}.$$

Now if $Y_{ij1} = X_i + B_{ij1}$ and $Y_{ij2} = X_i + B_{ij2}$, then

$$E(\bar{x}_t) = \frac{1}{N}\sum_{i=1}^{N} X_i + \frac{1}{NJ}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i\,B_{ij1} + \frac{1}{NJ}\sum_{i=1}^{N}\sum_{j=1}^{J}(1-P_i)\,B_{ij2}.$$

The bias is then

$$\mathrm{Bias}(\bar{x}_t) = \frac{1}{NJ}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i\,B_{ij1} + \frac{1}{NJ}\sum_{i=1}^{N}\sum_{j=1}^{J}(1-P_i)\,B_{ij2}.$$
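As a quick numerical check of this expression (an illustration added here, not from the original text; the population quantities are assumed), the bias can be computed directly from assumed response probabilities and phase-specific biases.

```python
import numpy as np

rng = np.random.default_rng(3)
N, J = 2000, 4
P = rng.beta(5, 2, size=N)                       # response probabilities P_i
B1 = rng.normal(0.6, 0.2, size=(N, J))           # biases B_ij1 (initial sample)
B2 = rng.normal(0.1, 0.2, size=(N, J))           # biases B_ij2 (subsample of nonrespondents)

# Bias(x_t) = (1/NJ) * sum_ij [ P_i * B_ij1 + (1 - P_i) * B_ij2 ]
bias = np.mean(P[:, None] * B1 + (1.0 - P)[:, None] * B2)
print("bias of the double sampling mean:", round(float(bias), 4))
```

The same quantity can also be approximated by averaging the estimator of Section 2.2 over many simulated trials, which is a useful consistency check on any implementation.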
2.3.2 The Variance of xt
The simplifying notation of Section 2.3.1 above is retained; in addition,

$$Var_1 = Var_s,\quad Var_2 = Var_{IA_1\mid s},\quad Var_3 = Var_{\delta\mid IA_1,s},\quad Var_4 = Var_{ss\mid \delta,IA_1,s},\quad Var_5 = Var_{IA_2\mid ss,\delta,IA_1,s},\quad Var_6 = Var_{t\mid IA_2,ss,\delta,IA_1,s}.$$

Then $Var(\bar{x}_t)$ is defined as

$$Var(\bar{x}_t) = E_1E_2E_3E_4E_5\,Var_6(\bar{x}_t) + E_1E_2E_3E_4\,Var_5\,E_6(\bar{x}_t) + E_1E_2E_3\,Var_4\,E_5E_6(\bar{x}_t) + E_1E_2\,Var_3\,E_4E_5E_6(\bar{x}_t) + E_1\,Var_2\,E_3E_4E_5E_6(\bar{x}_t) + Var_1\,E_2E_3E_4E_5E_6(\bar{x}_t).$$

The six terms of the variance of $\bar{x}_t$ represent, respectively, the response variance; the variance due to the random assignment of interviewers to the subsample of nonrespondents; the variance due to the selection of a subsample of nonrespondents; the nonresponse variance; the variance due to the random assignment of interviewers to the initial sample; and the sampling variance of the population.
2.3.2.1 The Response Variance
The variance over all possible trials is

$$Var_6(\bar{x}_t) = E_6\Big\{\Big[\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{J} u_{ij}\delta_i\,(Y_{ij1t}-Y_{ij1}) + \frac{k}{n}\sum_{i=1}^{n}\sum_{j=1}^{J}(1-\delta_i)v_iw_{ij}\,(Y_{ij2t}-Y_{ij2})\Big]^2\Big\}.$$

Expanding the square produces squared terms and cross-product terms within and between the initial sample and the subsample of nonrespondents. The expectation is then taken over randomization steps 2 through 5, including all possible assignments of the subsample to interviewers; all possible subsamples, where $E(v_iv_{i'})$ is assumed to be approximately equal to $1/k^2$; all possible response solicitations; and all possible interviewer assignments of the initial sample. This gives

$$E_2E_3E_4E_5\,Var_6(\bar{x}_t) = \frac{1}{n^2J}\sum_{i}\sum_{j} P_i\,E_{t\mid s}(Y_{ij1t}-Y_{ij1})^2
+ \frac{1}{n^2J^2}\sum_{i\neq i'}\sum_{j} P_iP_{i'}\,E_{t\mid s}\big[(Y_{ij1t}-Y_{ij1})(Y_{i'j1t}-Y_{i'j1})\big]
+ \frac{1}{n^2J^2}\sum_{i\neq i'}\sum_{j\neq j'} P_iP_{i'}\,E_{t\mid s}\big[(Y_{ij1t}-Y_{ij1})(Y_{i'j'1t}-Y_{i'j'1})\big]$$
$$+ \frac{k}{n^2J}\sum_{i}\sum_{j}(1-P_i)\,E_{t\mid s}(Y_{ij2t}-Y_{ij2})^2
+ \frac{1}{n^2J^2}\sum_{i\neq i'}\sum_{j}(1-P_i)(1-P_{i'})\,E_{t\mid s}\big[(Y_{ij2t}-Y_{ij2})(Y_{i'j2t}-Y_{i'j2})\big]
+ \frac{1}{n^2J^2}\sum_{i\neq i'}\sum_{j\neq j'}(1-P_i)(1-P_{i'})\,E_{t\mid s}\big[(Y_{ij2t}-Y_{ij2})(Y_{i'j'2t}-Y_{i'j'2})\big]$$
$$+ \frac{2}{n^2J^2}\sum_{i\neq i'}\sum_{j} P_i(1-P_{i'})\,E_{t\mid s}\big[(Y_{ij1t}-Y_{ij1})(Y_{i'j2t}-Y_{i'j2})\big]
+ \frac{2}{n^2J^2}\sum_{i\neq i'}\sum_{j\neq j'} P_i(1-P_{i'})\,E_{t\mid s}\big[(Y_{ij1t}-Y_{ij1})(Y_{i'j'2t}-Y_{i'j'2})\big].$$

Finally, the expectation is taken over all possible samples. We then have eight response variance terms, labeled a through h. Under the assumptions of the model, terms c, f, and h are equal to zero.
2.3.2.2 Variance Due to Random Assignment of Subsample Units to Interviewers
The expectation is first taken over all possible trials; following this with the variance over all possible interviewer assignments of the subsampled units gives

$$Var_5\,E_6(\bar{x}_t) = E_5\Big\{\Big[\frac{k}{n}\sum_{i=1}^{n}\sum_{j=1}^{J}(1-\delta_i)v_iw_{ij}\,Y_{ij2} - \frac{k}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}(1-\delta_i)v_i\,Y_{ij2}\Big]^2\Big\},$$

which expands into terms in $E_5[(w_{ij}-\tfrac{1}{J})^2]$ and cross-product terms in $E_5[(w_{ij}-\tfrac{1}{J})(w_{i'j'}-\tfrac{1}{J})]$. With the assumption of unrestricted assignment of interviewers, the cross-product terms for different units are equal to zero, and it follows that

$$Var_5\,E_6(\bar{x}_t) = \frac{k^2(J-1)}{n^2J^2}\sum_{i=1}^{n}\sum_{j=1}^{J}(1-\delta_i)\,v_i\,Y_{ij2}^2.$$

The expectation taken over the remaining randomization steps finally yields

$$E_1E_2E_3E_4\,Var_5\,E_6(\bar{x}_t) = \frac{k(J-1)}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J}(1-P_i)\,Y_{ij2}^2.$$
2.3.2.3 Variance Due to Sampling
The sampling variance consists of two terms: $E_1E_2E_3\,Var_4\,E_5E_6(\bar{x}_t)$, the variance among those that are in the class of nonrespondents, and $Var_1\,E_2E_3E_4E_5E_6(\bar{x}_t)$, the variance in the population, respectively.
Variance Among the Population of Nonrespondents. The expectation taken over all possible trials and then over all possible assignments of subsampled units to interviewers is

$$E_5E_6(\bar{x}_t) = \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J} u_{ij}\,\delta_i\,Y_{ij1} + \frac{k}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}(1-\delta_i)\,v_i\,Y_{ij2}.$$

The variance over all possible selections of subsamples is then

$$Var_4\,E_5E_6(\bar{x}_t) = E_4\big[E_5E_6(\bar{x}_t) - E_4E_5E_6(\bar{x}_t)\big]^2,$$

which involves $E_4(v_i-\tfrac{1}{k})^2 = \tfrac{1}{k}-\tfrac{1}{k^2}$ and the cross-product terms $E_4[(v_i-\tfrac{1}{k})(v_{i'}-\tfrac{1}{k})]$ evaluated from the without-replacement selection of $r_2$ out of the $n_2$ nonrespondents. The expectation over all possible response solicitations involves the expectation of a ratio; this is assumed approximately equal to the ratio of the expectations, the first term of the method of statistical differentials, which makes use of the Taylor approximation, that is, $E(r) \doteq E(\text{numerator})/E(\text{denominator})$. This is a large sample approximation only; for small samples the ratio of the expectations can be far removed from the expectation of the ratio. Taking the expectation over all possible interviewer assignments of the initial sample, which has no effect on this variance term, and over all possible samples, again using this approximation, and letting

$$\frac{1}{N}\sum_{i=1}^{N}(1-P_i) = 1-\bar{P},$$

yields

$$\frac{k-1}{n}\,\sigma_{ss}^2 = \frac{k-1}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J}(1-P_i)\,Y_{ij2}^2
- \frac{(k-1)(n-1)}{n^2N(N-1)J^2}\,\frac{1}{1-\bar{P}}\sum_{i\neq i'}\sum_{j}(1-P_i)(1-P_{i'})\,Y_{ij2}\,Y_{i'j2}$$
$$- \frac{(k-1)(n-1)}{n^2N(N-1)J^2}\,\frac{1}{1-\bar{P}}\sum_{i\neq i'}\sum_{j\neq j'}(1-P_i)(1-P_{i'})\,Y_{ij2}\,Y_{i'j'2}.$$
The Sampling Variance Attributable to the Population. It can be shown that

$$E(\bar{x}_t\mid s) = E_2E_3E_4E_5E_6(\bar{x}_t) = \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J} P_i\,Y_{ij1} + \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}(1-P_i)\,Y_{ij2}.$$

The sampling variance is then

$$Var_s\,E(\bar{x}_t\mid s) = E_s\big[E(\bar{x}_t\mid s) - E_s E(\bar{x}_t\mid s)\big]^2.$$

Letting

$$\overline{PY}_1 = \frac{1}{NJ}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i\,Y_{ij1}, \qquad \overline{(1-P)Y}_2 = \frac{1}{NJ}\sum_{i=1}^{N}\sum_{j=1}^{J}(1-P_i)\,Y_{ij2},$$

and

$$\sigma_{s_1}^2 = \frac{N-n}{N}\,\frac{1}{(N-1)J^2}\sum_{i=1}^{N}\sum_{j=1}^{J}\big(P_iY_{ij1}-\overline{PY}_1\big)^2, \qquad
\sigma_{s_2}^2 = \frac{N-n}{N}\,\frac{1}{(N-1)J^2}\sum_{i=1}^{N}\sum_{j=1}^{J}\big[(1-P_i)Y_{ij2}-\overline{(1-P)Y}_2\big]^2,$$

with $\sigma_{s_{12}}$ the corresponding cross-product term, we have

$$Var_s\,E(\bar{x}_t\mid s) = \frac{1}{n}\big(\sigma_{s_1}^2 + \sigma_{s_2}^2 + 2\,\sigma_{s_{12}}\big) = \frac{1}{n}\,\sigma_s^2.$$
2.3.2.4 The Nonresponse Variance
The expectation of the sample mean is taken over all possible trials, over all possible interviewer assignments of the subsample, and over all possible selections of the subsample, giving

$$E_4E_5E_6(\bar{x}_t) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{J} u_{ij}\,\delta_i\,Y_{ij1} + \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}(1-\delta_i)\,Y_{ij2}.$$

Recalling that $Cov(\delta_i,\delta_{i'})=0$ and noting that $Cov(\delta_i,1-\delta_i) = E[\delta_i(1-\delta_i)] - E(\delta_i)E(1-\delta_i) = -P_i(1-P_i)$, we can easily show that

$$Var_3\,E_4E_5E_6(\bar{x}_t) = \frac{1}{n^2}\sum_{i}\sum_{j} u_{ij}\,P_i(1-P_i)\,Y_{ij1}^2 + \frac{1}{n^2J^2}\sum_{i}\sum_{j} P_i(1-P_i)\,Y_{ij2}^2 - \frac{2}{n^2J}\sum_{i}\sum_{j} u_{ij}\,P_i(1-P_i)\,Y_{ij1}Y_{ij2}.$$

The expectation is then taken over all possible assignments of units to the interviewers and over all possible samples, so that

$$E_1E_2\,Var_3E_4E_5E_6(\bar{x}_t) = \frac{1}{nNJ}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i(1-P_i)Y_{ij1}^2 + \frac{1}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i(1-P_i)Y_{ij2}^2 - \frac{2}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i(1-P_i)Y_{ij1}Y_{ij2}$$
$$= \frac{1}{n}\big(\sigma_{NR_1}^2 + \sigma_{NR_2}^2 - 2\,\sigma_{NR_{12}}\big) = \frac{1}{n}\,\sigma_{NR}^2.$$

This may also be written as

$$\frac{1}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i(1-P_i)\big(B_{ij1}-B_{ij2}\big)^2 + \frac{J-1}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i(1-P_i)\,Y_{ij1}^2. \tag{2.3.2.4.1}$$

The first term is due to the difference in response bias of sampling units responding in the initial sample and in the subsample; the second term contains the term due to the random assignment of sampling units to interviewers.
2.3.2.5 Variance Due to the Random Assignment of the Initial Sample Units to Interviewers
The expectation over all possible trials, all possible assignments of subsample units to interviewers, all possible subsamples, and all possible response patterns is

$$E_3E_4E_5E_6(\bar{x}_t) = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{J} u_{ij}\,P_i\,Y_{ij1} + \frac{1}{nJ}\sum_{i=1}^{n}\sum_{j=1}^{J}(1-P_i)\,Y_{ij2}.$$

Expanding $Var_2\,E_3E_4E_5E_6(\bar{x}_t)$ using $E_2[(u_{ij}-\tfrac{1}{J})^2] = \tfrac{J-1}{J^2}$, and noting that the cross-product terms vanish, the expectation taken over all possible samples gives

$$E_1\,Var_2\,E_3E_4E_5E_6(\bar{x}_t) = \frac{J-1}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i^2\,Y_{ij1}^2 = \frac{1}{n}\,\sigma_{IA_1}^2.$$

The combination of this contribution to the variance and that of Section 2.3.2.4, that is, equation (2.3.2.4.1), yields

$$E_1E_2\,Var_3E_4E_5E_6(\bar{x}_t) + E_1\,Var_2\,E_3E_4E_5E_6(\bar{x}_t)
= \frac{1}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i(1-P_i)\big(B_{ij1}-B_{ij2}\big)^2 + \frac{J-1}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i\,Y_{ij1}^2.$$
2.3.2.6 Summary of Variance Terms
The six contributions to the variance of the sample mean at trial t are summarized below.
Response variance terms (the five nonzero terms of Section 2.3.2.1):

$$\frac{1}{n}\sigma_{R_1}^2 = \frac{1}{nNJ}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i\,E_t(Y_{ij1t}-Y_{ij1})^2, \qquad
\frac{n-1}{n}\rho_1\sigma_{R_1}^2 = \frac{n-1}{nN(N-1)J^2}\sum_{i\neq i'}\sum_{j} P_iP_{i'}\,E_t\big[(Y_{ij1t}-Y_{ij1})(Y_{i'j1t}-Y_{i'j1})\big],$$

with the corresponding simple and correlated terms $\frac{k}{n}\sigma_{R_2}^2$ and $\frac{n-1}{n}\rho_2\sigma_{R_2}^2$ for the subsample of nonrespondents, and a cross term $\frac{2(n-1)}{n}\sigma_{R_{12}}$ between the two phases.
Sampling variance terms:

$$\frac{k-1}{n}\sigma_{ss}^2 = \frac{k-1}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J}(1-P_i)Y_{ij2}^2
- \frac{(k-1)(n-1)}{n^2N(N-1)J^2}\,\frac{1}{1-\bar{P}}\Big[\sum_{i\neq i'}\sum_{j} + \sum_{i\neq i'}\sum_{j\neq j'}\Big](1-P_i)(1-P_{i'})Y_{ij2}Y_{i'j'2},$$

$$\frac{1}{n}\sigma_{s_1}^2 = \frac{N-n}{Nn(N-1)J^2}\sum_{i=1}^{N}\sum_{j=1}^{J}\big(P_iY_{ij1}-\overline{PY}_1\big)^2, \qquad
\frac{1}{n}\sigma_{s_2}^2 = \frac{N-n}{Nn(N-1)J^2}\sum_{i=1}^{N}\sum_{j=1}^{J}\big[(1-P_i)Y_{ij2}-\overline{(1-P)Y}_2\big]^2,$$

together with the cross term $\frac{2}{n}\sigma_{s_{12}}$.
Nonresponse variance terms:

$$\frac{1}{n}\sigma_{NR_1}^2 = \frac{1}{nNJ}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i(1-P_i)Y_{ij1}^2,\qquad
\frac{1}{n}\sigma_{NR_2}^2 = \frac{1}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i(1-P_i)Y_{ij2}^2,\qquad
\frac{2}{n}\sigma_{NR_{12}} = \frac{2}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i(1-P_i)Y_{ij1}Y_{ij2},$$

which together may also be written as

$$\frac{1}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i(1-P_i)\big(B_{ij1}-B_{ij2}\big)^2 + \frac{J-1}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i(1-P_i)Y_{ij1}^2.$$

Variance due to the random assignment of units, initial sample and subsample, to interviewers:

$$\frac{1}{n}\sigma_{IA_1}^2 = \frac{J-1}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i^2\,Y_{ij1}^2,\qquad
\frac{k}{n}\sigma_{IA_2}^2 = \frac{k(J-1)}{nNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J}(1-P_i)\,Y_{ij2}^2.$$
2.4 Optimization of Initial Sample Size and Subsampling Rate
In order to make as efficient use as possible of the resources in this double sampling model, it is necessary to determine the optimum values of n and k for expected cost. In keeping with the model, these optimum values will be determined in the context of total error and response probabilities.
In keeping with the Hansen and Hurwitz cost model, let

$$C = c_0 + c_1 n_1 + c_2 n_2 + c_3 r_2,$$

where
C = total cost associated with the survey,
c_0 = overhead cost,
c_1 = cost associated with soliciting response and processing data from the respondents in the initial survey,
c_2 = cost associated with soliciting response from the nonrespondents,
c_3 = cost associated with soliciting response and processing results from the subsample of nonrespondents,
n_1 = number of respondents in the initial survey,
n_2 = number of nonrespondents in the initial survey,
r_2 = number of units in the subsample of nonrespondents.
Now, conditional on a particular sample, the four costs in the model, c_0, c_1, c_2, and c_3, are constants, whereas the three sample sizes, n_1, n_2, and r_2, are random variables. Thus, the total cost C is a random variable.
The cost function may be rewritten in the following form:

$$C = c_0 + c_1\sum_{i=1}^{n}\delta_i + c_2\sum_{i=1}^{n}(1-\delta_i) + c_3\sum_{i=1}^{n}(1-\delta_i)/k,$$

where n is the total sample size, $\delta_i = 1$ if unit i responds in the initial sample and 0 otherwise, k is the inverse of the subsampling rate, considered fixed, and the costs are as defined above. The indicator variable $\delta_i$ is random, and

$$\sum_{i=1}^{n}\delta_i = n_1, \qquad \sum_{i=1}^{n}(1-\delta_i) = n_2, \qquad \sum_{i=1}^{n}(1-\delta_i)/k = r_2.$$
Interest is in minimizing the mean square error for expected cost. Now

$$E_{\delta\mid s}(C) = c_0 + c_1\sum_{i=1}^{n}P_i + c_2\sum_{i=1}^{n}(1-P_i) + c_3\sum_{i=1}^{n}(1-P_i)/k,$$

and

$$E_sE_{\delta\mid s}(C) = c_0 + c_1\,\frac{n}{N}\sum_{i=1}^{N}P_i + c_2\,\frac{n}{N}\sum_{i=1}^{N}(1-P_i) + c_3\,\frac{n}{N}\sum_{i=1}^{N}(1-P_i)/k.$$

This may be more succinctly written, with $\bar{P} = \frac{1}{N}\sum_i P_i$ and $\bar{Q} = 1-\bar{P}$, as

$$E_sE_{\delta\mid s}(C) = c_0 + n\big[c_1\bar{P} + c_2\bar{Q} + (c_3\bar{Q})/k\big].$$

To obtain the minimum mean square error of $\bar{x}_t$ for expected cost, the method of Lagrangian multipliers is used. In the notation of Hansen, Hurwitz, and Madow (1953), the Lagrangian is $F = F_0 + \lambda F_1$, where $F_0$ is the function to be minimized and $F_1$ is the relationship determined by the constraint, in this case the expected cost.
Now, since the bias does not involve n or k, minimization of the mean square error involves minimization of the variance only. If it is assumed that $n-1 \doteq n$ and that the finite population correction factor is negligible, then the variance of $\bar{x}_t$ from Section 2.3.2 above may be written as

$$Var(\bar{x}_t) = F_0 = \frac{1}{n}\sigma_{R_1}^2 + \rho_1\sigma_{R_1}^2 + \frac{k}{n}\sigma_{R_2}^2 + \rho_2\sigma_{R_2}^2 + \rho_{12}\sigma_{R_{12}} + \frac{k}{n}\sigma_{IA_2}^2 + \frac{k-1}{n}\sigma_{ss}^2 + \frac{1}{n}\sigma_{NR}^2 + \frac{1}{n}\sigma_{IA_1}^2 + \frac{1}{n}\sigma_s^2.$$

Now let $A_1$ denote the combination of components with coefficient $1/n$, let $A_2 = \sigma_{R_2}^2 + \sigma_{ss}^2 + \sigma_{IA_2}^2$ denote the combination with coefficient $k/n$, and let $A_3$ denote the terms free of n and k. Then the Lagrangian F may be written as

$$F = \frac{A_1}{n} + \frac{k}{n}A_2 + A_3 + \lambda\Big\{C - c_0 - n\big[c_1\bar{P} + c_2\bar{Q} + (c_3\bar{Q})/k\big]\Big\}.$$

Using the usual method of optimization of F, it can easily be shown (Appendix A) that the optimum k is given by

$$k_{opt} = \sqrt{\frac{A_1}{A_2}\cdot\frac{c_3\bar{Q}}{c_1\bar{P} + c_2\bar{Q}}}\,.$$

Substituting this value into the cost function, the optimum n is obtained. In Chapter IV, optimum values of n and k are examined by assuming certain superpopulation values for the mean square error parameters.
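A small numerical sketch of this optimization is given below (added here for illustration; the component aggregates, unit costs, and budget are assumed values, not taken from the text). It computes the optimum k from the closed form given above and then backs out n from the expected cost constraint.

```python
import math

# assumed variance component aggregates and unit costs (illustrative values only)
A1, A2 = 250.0, 60.0           # components with coefficients 1/n and k/n, respectively
c0, c1, c2, c3 = 500.0, 4.0, 1.0, 20.0
P_bar = 0.7                    # expected response rate
Q_bar = 1.0 - P_bar
C_total = 10_000.0             # fixed expected budget

# optimum inverse subsampling rate from the Lagrangian solution
k_opt = math.sqrt((A1 / A2) * (c3 * Q_bar) / (c1 * P_bar + c2 * Q_bar))

# optimum n from the expected cost constraint C = c0 + n[c1*P + c2*Q + c3*Q/k]
n_opt = (C_total - c0) / (c1 * P_bar + c2 * Q_bar + c3 * Q_bar / k_opt)

print("optimum k:", round(k_opt, 2), " optimum n:", round(n_opt))
```

Higher follow-up costs c3 or lower expected response rates push k upward (a smaller subsample of nonrespondents), at the price of a larger contribution from the k-weighted terms of the variance.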
CHAPTER III
THE DEVELOPMENT OF ESTIMATORS
OF THE MEAN SQUARE ERROR COMPONENTS
3.1 Introduction
The estimation of mean square error components developed in Chapter II can be a valuable asset not just in assessing the quality of the data for a particular survey, but also in decisions on the design and data collection methods. In any survey the decision on the method of data collection is major, with cost and reliability of the data two of the most important and sometimes conflicting concerns.
Data may be collected by one or some combination of three methods: personal visit, mail, and telephone; none may be regarded as optimum for every survey even if cost were ignored. Whereas personal visit may elicit the most complete response and thus keep to a minimum the contribution of nonresponse bias in a particular survey, in that same survey the use of, say, mail, in which self-enumeration is used, may keep to a minimum the correlated component of response variance, to which interviewers are often the greatest contributors. An enumerator variance study in the 1960 Census revealed the large effect that census enumerators had on the correlated response component; these results were influential in the decision to greatly increase the use of mail questionnaires in the 1970 Census.
In this chapter, estimators of the mean square error components of Chapter II are developed using combinations of sums of squares from survey data.
3.2 Additional Design Assumptions and Methods of Estimation
To develop estimators for the mean square error components, repetition of the survey is incorporated with the design of Chapter II. It is assumed that the original survey, or Trial 1, is conducted independently of the repetition, or Trial 2. Further, it is assumed that sampling units are assigned randomly to interviewers in Trial 2 in the same manner as, but independently of, the assignment of Trial 1. However, it is assumed that the response-nonresponse patterns are the same at Trial 1 and Trial 2; that is, a sampling unit which responds at Trial 1 will also respond at Trial 2, and conversely, a sampling unit which does not respond in Trial 1 in the initial sample will not respond in Trial 2.
Even assuming $\delta_{it} = \delta_i$, where $\delta_i$ is defined as in Chapter II, the use of repetition for the subsample of nonrespondents in the estimation of simple response variance can pose problems. For those respondent units in the subsample that were initially refusal units rather than not-at-homes or temporarily absent, the use of repetition embodies three personal contacts: 1) the original visit in which response is solicited but the unit refuses, 2) the subsample visit in which a response is obtained, and 3) the repetition visit in the subsample. Under the assumption $\delta_{it} = \delta_i$, it is not necessary to visit the unit in the repetition of the initial survey. Not only is there a danger that the respondent will refuse to respond on the third visit, but there is a danger that the quality of the response will be affected by a negative attitude of the respondent, jeopardizing the integrity of the estimates. It may be preferable to assume that the simple response variance for the subsample of nonrespondents is the same as that for the initial sample respondents and to extrapolate accordingly. However, this research assumes that the simple response variance will be estimated directly for both the initial sample and the subsample of nonrespondents.
It is assumed that for samples which cover a large geographic area, for example regional or national surveys, the randomization of interviewer assignments will be carried out within a number of smaller areas because of cost and travel time. For example, if the Northeast region were the sample area, then the sample could be interpenetrated within each state. Thus, the estimators developed here may be assumed to be for an entire survey or for some subset of the sample.
Though the emphasis is on the estimation of variance components, bias is also considered. The estimation of bias requires knowledge of the "true values" for each sampling unit. In practice it is assumed that the values obtained from records or through special interviewing techniques are the "true values", when realistically they may only be regarded as "better" on the average than those obtained through regular interviewing practices. To obtain the bias for this particular study, it is assumed that records can be obtained for each sampling unit, whether respondent or nonrespondent.
In the development of estimators of variance, no assumption is made that "true values" are obtained. Thus, in studies which are not concerned with the estimation of bias, the estimators remain valid.
The estimators of variance involve variations on six basic sums of squares used by Koch (1973) and again by Lessler (1974):
a) The between-trial-within-interviewer (BTWI) sum of squares, which is the sum of squared differences between the response from unit i at Trial 1 and at Trial 2 obtained by the same interviewer.
b) The between-trial-within-interviewer-interaction (BTWII) sum of squares, which is the sum of squared differences of the following two differences: i) the difference in response from unit i at Trials 1 and 2 obtained from interviewer j, and ii) the difference in response from unit i' ≠ i at Trials 1 and 2 obtained from interviewer j.
c) The between-trial-between-interviewer (BTBI) sum of squares, which differs from (a) above in that a different interviewer is assumed at Trial 2.
d) The between-trial-between-interviewer-interaction (BTBII) sum of squares, which differs from (b) in that a different interviewer is assumed to have obtained the response at Trial 2.
e) The within-trial-within-interviewer (WTWI) sum of squares, which is the sum of squared differences, within a trial, between the responses from units i and i' ≠ i obtained by the same interviewer.
f) The within-trial-between-interviewer (WTBI) sum of squares, which is the sum of squared mean differences between responses obtained by interviewer j and interviewer j' ≠ j.
The basic method used to develop the estimators is: a) find the expectation of the basic sums of squares, and b) find some combination of these sums of squares which provides unbiased estimates of the variance components developed in the previous section; a small computational sketch follows below.
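The sketch below (an added illustration, not from the original text; the data are simulated and all names are assumptions) computes three of these sums of squares, BTWI, BTWII, and WTWI, for the respondents of a two-trial interpenetrated design; the remaining sums of squares are built from the same ingredients.

```python
import numpy as np

rng = np.random.default_rng(5)
n, J = 120, 4
x = rng.normal(100.0, 15.0, size=n)          # true values
assign1 = rng.integers(0, J, size=n)         # interviewer assignment, Trial 1
assign2 = rng.integers(0, J, size=n)         # independent assignment, Trial 2
delta = rng.random(n) < 0.75                 # same response pattern at both trials

y1 = x + rng.normal(0.0, 2.0, size=n)        # Trial 1 responses
y2 = x + rng.normal(0.0, 2.0, size=n)        # Trial 2 responses (independent errors)

r = np.where(delta)[0]                        # respondents in the initial sample

# a) BTWI: squared Trial 1 - Trial 2 differences for units kept by the same interviewer
same = r[assign1[r] == assign2[r]]
BTWI = np.sum((y1[same] - y2[same]) ** 2)

# b) BTWII: squared differences of the trial differences, pairs i, i' with the same interviewer
d = y1[same] - y2[same]
a = assign1[same]
BTWII = sum((d[i] - d[j]) ** 2 for i in range(len(d)) for j in range(len(d))
            if i != j and a[i] == a[j])

# e) WTWI: within Trial 1, squared differences between respondents seen by the same interviewer
a1 = assign1[r]
WTWI = sum((y1[r[i]] - y1[r[j]]) ** 2 for i in range(len(r)) for j in range(len(r))
           if i != j and a1[i] == a1[j])

print("BTWI:", round(BTWI, 1), " BTWII:", round(BTWII, 1), " WTWI:", round(WTWI, 1))
```

In practice these raw sums of squares would be scaled and combined, as developed below, to yield unbiased estimates of the individual variance components.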
In keeping with the assumptions and design of Chapter II, some additional relationships are defined. Let

$$Y_{ij1mt} = X_i + B_{ij1} + \varepsilon_{ij1mt}, \qquad Y_{ij2mt} = X_i + B_{ij2} + \varepsilon_{ij2mt}, \qquad m = 1, 2,$$

where m represents the repetition of the survey, and

$$E_t(Y_{ij1mt}) = X_i + B_{ij1}, \qquad E_t(Y_{ij2mt}) = X_i + B_{ij2}, \qquad m = 1, 2;$$

that is, the expectation of the response does not depend on the repetition of the survey. Thus it should also be noted that $E_t(Y_{ij1mt})$ may be written as $E_t(Y_{ij1t})$ when convenient. Also, for $m \neq m'$,

$$\mathrm{cov}(\varepsilon_{ij1mt},\varepsilon_{ij1m't}) = \mathrm{cov}(\varepsilon_{ij1mt},\varepsilon_{i'j1m't}) = \mathrm{cov}(\varepsilon_{ij1mt},\varepsilon_{i'j'1m't}) = 0,$$

following the assumption of independence of the two repetitions or trials. All of the above relationships hold for the subsample of nonrespondents as well.
Let
$u_{ijj} = 1$ if the i-th unit in the initial sample is interviewed by interviewer j in both Trials 1 and 2, and 0 otherwise;
$w_{ijj} = 1$ if the i-th unit in the subsample is interviewed by interviewer j in both Trials 1 and 2, and 0 otherwise;
and let $u_{ijj'}$ and $w_{ijj'}$ have similar interpretations but with the i-th unit interviewed by interviewer $j' \neq j$ at Trial 2. Then, assuming an independent assignment of sampling units to interviewers at each trial,

$$E(u_{ijj}\mid s) = \tfrac{1}{J^2}, \qquad E(u_{ijj'}\mid s) = \tfrac{1}{J^2}\ (j \neq j'), \qquad E(u_{ijj}u_{i'jj}\mid s) = E(u_{ijj'}u_{i'jj'}\mid s) = \tfrac{1}{J^4},$$

with the corresponding relationships for $w_{ijj}$ and $w_{ijj'}$ conditional on $v_i = 1$ and $\delta_i = 0$. For simplicity it is assumed that the area within which the sampling units are interpenetrated is large enough that $J \doteq J-1$.
3.3 The Estimation of Components
3.3.1 The Estimation of Bias
It is assumed that records are obtained for a subsample of the original sample. We let
$\alpha_i = 1$ if a record was obtained for the i-th sample unit, and 0 otherwise,
and
c = the inverse of the record subsampling rate, assumed fixed.
We have previously defined $\bar{x}_t$ and $X_i$ as the sample mean and the true value, respectively. Then

$$\widehat{\mathrm{Bias}}(\bar{x}_t) = \bar{x}_t - \frac{c}{n}\sum_{i=1}^{n}\alpha_i X_i.$$

Now it can easily be shown that

$$E\big[\widehat{\mathrm{Bias}}(\bar{x}_t)\big] = E(\bar{x}_t) - E\Big(\frac{c}{n}\sum_{i=1}^{n}\alpha_i X_i\Big), \qquad
E\Big(\frac{c}{n}\sum_{i=1}^{n}\alpha_i X_i\Big) = \frac{1}{N}\sum_{i=1}^{N} X_i.$$

It has been shown in Section 2.3.1 that

$$E(\bar{x}_t) = \frac{1}{N}\sum_{i=1}^{N} X_i + \frac{1}{NJ}\sum_{i=1}^{N}\sum_{j=1}^{J} P_i B_{ij1} + \frac{1}{NJ}\sum_{i=1}^{N}\sum_{j=1}^{J}(1-P_i)B_{ij2},$$

so that $E[\widehat{\mathrm{Bias}}(\bar{x}_t)]$ is the bias of the estimate $\bar{x}_t$. It should be noted, however, that the estimate of bias is itself biased to the extent that the $X_i$'s are not "true values".
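As a minimal numerical sketch of this bias estimator (added here for illustration, not from the original text; all quantities are simulated and the indicator and rate names are assumptions), the double sampling mean for one trial is paired with record values obtained for a 1-in-c subsample of the original sample.

```python
import numpy as np

rng = np.random.default_rng(21)
n, c, k = 400, 4, 2
x_true = rng.normal(100.0, 15.0, size=n)        # record ("true") values X_i for the sampled units
p = rng.beta(6, 2, size=n)                      # response probabilities
delta = rng.random(n) < p
bias1, bias2 = 0.8, 0.2                         # assumed response biases B_i1, B_i2

# double sampling mean for one trial (single interviewer region)
y1 = x_true + bias1 + rng.normal(0.0, 2.0, size=n)
nonresp = np.where(~delta)[0]
sub = rng.choice(nonresp, size=max(1, len(nonresp) // k), replace=False)
y2 = x_true[sub] + bias2 + rng.normal(0.0, 1.0, size=len(sub))
k_hat = len(nonresp) / len(sub)
x_bar = (y1[delta].sum() + k_hat * y2.sum()) / n

# record subsample at rate 1 in c, and the resulting bias estimate
alpha = rng.random(n) < 1.0 / c
bias_hat = x_bar - (c / n) * np.sum(alpha * x_true)
print("estimated bias:", round(float(bias_hat), 3))
```

Averaged over repeated trials, the estimate approaches the bias expression of Section 2.3.1, subject to the caveat that the record values may themselves only be "better" rather than true.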
3.3.2 The Estimation of Variance
3.3.2.1 The Estimation of Components Involving the Initial Sample
The sums of squares and their expected values are given below for the initial sample. The expectations of the sample estimates follow the survey steps outlined in Sections 2.2 and 2.3.1 of Chapter II, where the $E_i$'s are defined in Section 2.3.1.
a) The between-trial-within-interviewer sum of squares:

$$BTWI_1 = \sum_{i=1}^{n}\sum_{j=1}^{J} u_{ijj}\,\delta_i\,(Y_{ij11t}-Y_{ij12t})^2.$$

Taking the expectation over all possible trials and combining like terms yields

$$E_t(BTWI_1) = 2\sum_{i=1}^{n}\sum_{j=1}^{J} u_{ijj}\,\delta_i\,E_t(Y_{ij1t}^2) - 2\sum_{i=1}^{n}\sum_{j=1}^{J} u_{ijj}\,\delta_i\,E_t(Y_{ij11t}Y_{ij12t}),$$

and continuing over the remaining randomization steps,

$$E(BTWI_1) = \frac{2n}{NJ^2}\Big[\sum_{i=1}^{N}\sum_{j=1}^{J} P_i\,E_t(Y_{ij1t}^2) - \sum_{i=1}^{N}\sum_{j=1}^{J} P_i\,Y_{ij1}^2\Big]
- \frac{2n}{NJ^2}\Big[\sum_{i=1}^{N}\sum_{j=1}^{J} P_i\,E_t(Y_{ij11t}Y_{ij12t}) - \sum_{i=1}^{N}\sum_{j=1}^{J} P_i\,Y_{ij1}^2\Big].$$

The last bracketed term is assumed to equal zero because of the assumption of independence of trials. To the extent that this is not true, the expectation of this term is biased, generally understated.
b) The between-trial-within-interviewer-interaction sum of squares:

$$BTWII_1 = \sum_{i\neq i'}\sum_{j=1}^{J} u_{ijj}\,u_{i'jj}\,\big[\delta_i(Y_{ij11t}-Y_{ij12t}) - \delta_{i'}(Y_{i'j11t}-Y_{i'j12t})\big]^2.$$

Expanding this term, taking the expectation over all possible trials, combining like terms, and then taking the expectation over the remaining randomization steps gives an expression in four parts: squared-response terms, cross-trial terms within a unit, within-trial cross-unit terms, and cross-trial cross-unit terms. The second and fourth parts are assumed to equal zero under the independence of trials; whether the resulting expectation of $BTWII_1$ might be considered an overestimate or an underestimate depends on the relative sizes of these components under more realistic assumptions. The surviving parts are

$$E(BTWII_1) = \frac{4n(n-1)}{NJ^4}\Big[\sum_{i=1}^{N}\sum_{j=1}^{J} P_i\,E_t(Y_{ij1t}^2) - \sum_{i=1}^{N}\sum_{j=1}^{J} P_i\,Y_{ij1}^2\Big]
- \frac{4n(n-1)}{N(N-1)J^4}\Big[\sum_{i\neq i'}\sum_{j} P_iP_{i'}\,E_t(Y_{ij1t}Y_{i'j1t}) - \sum_{i\neq i'}\sum_{j} P_iP_{i'}\,Y_{ij1}Y_{i'j1}\Big].$$
c) The between-trial-between-interviewer sum of squares:

$$BTBI_1 = \sum_{i=1}^{n}\sum_{j\neq j'} u_{ijj'}\,\delta_i\,(Y_{ij11t}-Y_{ij'12t})^2.$$

d) The between-trial-between-interviewer-interaction sum of squares:

$$BTBII_1 = \sum_{i\neq i'}\sum_{j\neq j'} u_{ijj}\,u_{i'jj'}\,\big[\delta_i(Y_{ij11t}-Y_{ij'12t}) - \delta_{i'}(Y_{i'j11t}-Y_{i'j'12t})\big]^2.$$

e) The within-trial-within-interviewer sum of squares:

$$WTWI_1 = \sum_{i\neq i'}\sum_{j=1}^{J} u_{ij}\,u_{i'j}\,\big(\delta_i\,Y_{ij1t} - \delta_{i'}\,Y_{i'j1t}\big)^2.$$

f) The within-trial-between-interviewer sum of squares:

$$WTBI_1 = \sum_{j\neq j'}\big(\bar{y}_{j1t}-\bar{y}_{j'1t}\big)^2
= \sum_{j\neq j'}\Big[\frac{1}{n}\sum_{i=1}^{n} u_{ij}\,\delta_i\,Y_{ij1t} - \frac{1}{n}\sum_{i=1}^{n} u_{ij'}\,\delta_i\,Y_{ij'1t}\Big]^2.$$

For each of these, the expectation is taken first over all possible trials, with the cross-trial product terms assumed to vanish under the independence of the two trials, and then over the remaining randomization steps, exactly as for $BTWI_1$ and $BTWII_1$ above. The resulting expectations involve squared terms in $P_iY_{ij1}^2$, within-interviewer cross-unit terms in $P_iP_{i'}Y_{ij1}Y_{i'j1}$, and between-interviewer cross terms in $P_iY_{ij1}Y_{ij'1}$ and $P_iP_{i'}Y_{ij1}Y_{i'j'1}$.
The expectations of the six sums of squares may now be rewritten in the notation of Section 2.3.2.6 of Chapter II. In particular,

$$E(BTWI_1) = \frac{2n^2}{J}\Big(\frac{1}{n}\sigma_{R_1}^2\Big),$$

and the expectations of $BTWII_1$, $BTBI_1$, $BTBII_1$, $WTWI_1$, and $WTBI_1$ are, similarly, linear combinations of $\frac{1}{n}\sigma_{R_1}^2$, $\frac{n-1}{n}\rho_1\sigma_{R_1}^2$, $\frac{1}{n}\sigma_{NR_1}^2$, $\frac{1}{n}\sigma_{IA_1}^2$, $\frac{1}{n}\sigma_{s_1}^2$, and cross-product terms of the form $\sum_i\sum_{j\neq j'}P_iY_{ij1}Y_{ij'1}$ and $\sum_{i\neq i'}\sum_{j,j'}P_iP_{i'}Y_{ij1}Y_{i'j'1}$, with coefficients that are multiples of $n^2$ and $n^2(n-1)$ and powers of J.
Based on the expectations of the six sums of squares, estimates of the variance components are derived; greater detail on the combinations of sums of squares used to derive the estimators is given in Appendix B. The definitions of the variance components given in Section 2.3.2.6 of Chapter II are retained here. Throughout, it is assumed that the finite population correction factor $1-f \doteq 1$, that $n \doteq n-1$, $NJ-1 \doteq NJ$, and that $J-1 \doteq J$.
a) For the simple response variance of the initial sample,

$$\frac{1}{n}\hat{\sigma}_{R_1}^2 = \frac{J}{2n^2}\,BTWI_1.$$

b) For the correlated response variance of the initial sample,

$$\frac{n-1}{n}\hat{\rho}_1\hat{\sigma}_{R_1}^2 = \frac{1}{2n}\Big[\frac{n-1}{n}\,BTWI_1 - \frac{J^2}{2n}\,BTWII_1\Big].$$

c) and d) Estimators of the interviewer-assignment variance $\frac{1}{n}\sigma_{IA_1}^2$ and of the combined sampling and nonresponse variance of the initial sample are obtained from combinations of $WTBI_1$, $WTWI_1$, $BTBI_1$, and $BTBII_1$ together with the estimators in a) and b). Thus, an estimator of the sampling variance for the initial sampling units contains a portion of the nonresponse variance of the initial sampling units.
3.3.2.2 The Estimation of Components Involving the Subsample of Nonrespondents
The sums of squares and their expected values for the variance components concerned with the subsample follow the same survey steps as in Section 3.3.2.1 above. Throughout, the assumption $E(v_iv_{i'}) \doteq 1/k^2$ is used.
a) The between-trial-within-interviewer sum of squares:

$$BTWI_2 = \sum_{i=1}^{n}\sum_{j=1}^{J}(1-\delta_i)\,v_i\,w_{ijj}\,(Y_{ij21t}-Y_{ij22t})^2,$$

whose expectation, following the procedures of Section 3.3.2.1, is

$$E(BTWI_2) = \frac{2n}{kNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J}(1-P_i)\,E_t(Y_{ij2t}^2) - \frac{2n}{kNJ^2}\sum_{i=1}^{N}\sum_{j=1}^{J}(1-P_i)\,E_t(Y_{ij21t}Y_{ij22t}),$$

where the cross-trial term reduces to $\sum_i\sum_j(1-P_i)Y_{ij2}^2$ under the assumptions of the model.
b) through f) The between-trial-within-interviewer-interaction, between-trial-between-interviewer, between-trial-between-interviewer-interaction, within-trial-within-interviewer, and within-trial-between-interviewer sums of squares for the subsample, $BTWII_2$, $BTBI_2$, $BTBII_2$, $WTWI_2$, and $WTBI_2$, are defined exactly as their initial-sample counterparts, with $(1-\delta_i)v_iw_{ijj}$ (or $w_{ijj'}$) replacing $u_{ijj}\delta_i$ (or $u_{ijj'}\delta_i$) and $Y_{ij2mt}$ replacing $Y_{ij1mt}$. Their expectations, taken over the trials under the assumption of independent repetitions and then over the remaining randomization steps, are linear combinations of $\frac{k}{n}\sigma_{R_2}^2$, $\frac{n-1}{n}\rho_2\sigma_{R_2}^2$, $\frac{1}{n}\sigma_{NR_2}^2$, $\frac{k}{n}\sigma_{IA_2}^2$, $\frac{1}{n}\sigma_{s_2}^2$, and cross-product terms in $(1-P_i)(1-P_{i'})Y_{ij2}Y_{i'j'2}$.
Estimates of the variance components based on the subsample of nonrespondents follow; details are presented in Appendix B. It is assumed here that $NJ-1 \doteq NJ$, $N \doteq N-1$, $1-f \doteq 1$, and $J-1 \doteq J$.
a) For the simple response variance of the subsample,

$$\frac{k}{n}\hat{\sigma}_{R_2}^2 = \frac{k^2J}{2n^2}\,BTWI_2.$$

b) For the correlated response variance of the subsample,

$$\frac{n-1}{n}\hat{\rho}_2\hat{\sigma}_{R_2}^2 = \frac{k}{2n}\Big[\frac{n-1}{n}\,BTWI_2 - \frac{kJ^2}{2n}\,BTWII_2\Big].$$

c) through e) Estimators of the remaining subsample components, including $\frac{k}{n}\sigma_{IA_2}^2$ and the combination of sampling and nonresponse variance, are again obtained from combinations of $WTBI_2$, $WTWI_2$, $BTBI_2$, and $BTBII_2$ with the estimators in a) and b). The estimator of $\frac{k-1}{n}\sigma_{ss}^2$ contains the quantity $\bar{P}$, the average response probability, which may be estimated from the nonresponse rate $\sum_{i=1}^{n}(1-\delta_i)/n$.
3.3.2.3 The Estimation of Components Involving the Correlation of Initial and Subsample Respondents
The sums of squares and expected values necessary to obtain estimators for variance components involving the correlation between the initial sample and subsample responses are given below.
a) The between-trial-within-interviewer sum of squares:

$$BTWI_3 = \sum_{i\neq i'}\sum_{j=1}^{J} u_{ijj}\,v_{i'}\,w_{i'jj}\,\big[\delta_i\,(Y_{ij11t}-Y_{ij12t})\big]\,\big[(1-\delta_{i'})\,(Y_{i'j21t}-Y_{i'j22t})\big].$$

b) The within-trial-within-interviewer sum of squares:

$$WTWI_3 = \sum_{i\neq i'}\sum_{j=1}^{J} u_{ij}\,v_{i'}\,w_{i'j}\,\big[\delta_i Y_{ij1t}-\delta_{i'}Y_{i'j1t}\big]\,\big[(1-\delta_i)Y_{ij2t}-(1-\delta_{i'})Y_{i'j2t}\big].$$

c) The within-trial-between-interviewer sum of squares, $WTBI_3$, is formed from the cross products of the interviewer means of the initial sample and subsample responses.
The expectations of these three sums of squares, taken over the trials and the remaining randomization steps, involve the cross-product terms $\sum_{i\neq i'}\sum_j P_i(1-P_{i'})E_t(Y_{ij1t}Y_{i'j2t})$, $\sum_{i\neq i'}\sum_j P_i(1-P_{i'})Y_{ij1}Y_{i'j2}$, and $\sum_i\sum_j P_i(1-P_i)Y_{ij1}Y_{ij2}$. Based on these expectations, and assuming $NJ \doteq NJ-1$, estimates of the cross-phase components, including the cross response variance term, $\frac{2}{n}\sigma_{NR_{12}}$, and $\frac{2}{n}\sigma_{s_{12}}$, are derived.
3.3.2.4 Summary of Estimators of Variance Terms
The estimators of the variance terms are summarized below. First, combinations of the basic sums of squares are defined which correspond to those of Sections 3.3.2.1, 3.3.2.2, and 3.3.2.3; these combinations, denoted $a_1, b_1$ for the initial sample and $a_2, b_2, c_2, d_2, e_2, h_2, i_2$ for the subsample, mix $WTBI$, $WTWI$, $BTBI$, $BTBII$, $BTWI$, and $BTWII$ with the response variance estimators already obtained. In terms of these, the response variance estimators are

$$\frac{1}{n}\hat{\sigma}_{R_1}^2 = \frac{J}{2n^2}\,BTWI_1,\qquad
\frac{n-1}{n}\hat{\rho}_1\hat{\sigma}_{R_1}^2 = \frac{1}{2n}\Big[\frac{n-1}{n}BTWI_1 - \frac{J^2}{2n}BTWII_1\Big],$$
$$\frac{k}{n}\hat{\sigma}_{R_2}^2 = \frac{k^2J}{2n^2}\,BTWI_2,\qquad
\frac{n-1}{n}\hat{\rho}_2\hat{\sigma}_{R_2}^2 = \frac{k}{2n}\Big[\frac{n-1}{n}BTWI_2 - \frac{kJ^2}{2n}BTWII_2\Big],$$

with the cross response variance term estimated from $BTWI_3$. The sampling and nonresponse variance terms, the variance due to the random assignment of subsample units to interviewers, and the variance due to the random assignment of the initial sample to interviewers plus the nonresponse variance are then estimated from the combinations $a_1, b_1$ and $a_2, b_2, c_2, d_2, e_2, h_2, i_2$; the detailed combinations are given in Appendix B.
These estimators of the variance components of the double sampling model, like those of other models, suffer from complexity and from biasedness due to the incorrect but necessary assumption of independence of trials. However, they can be used to provide, if not entirely accurate estimates of the mean square error components, at least estimates which, to an acceptable degree of approximation, can be helpful in assessing data quality for this design; also, results from this design can be compared to those of other designs to ascertain which design is most efficient under particular conditions.
It should be noted that, using the conventional sums of squares, separating the estimate of the sampling variance terms from the nonresponse variance proved impossible. Were the model deterministic rather than probabilistic, that is, were the response probabilities either 0 or 1, then there would be no nonresponse variance. Thus, the estimator of sampling variance plus nonresponse variance, if used as an estimator of sampling variance only, will provide overestimates.
CHAPTER IV
THE EFFECT OF EXPECTED RESPONSE
PROBABILITIES ON THE EXPECTED MEAN SQUARE ERROR
4.1 Introduction
In Chapter III, estimators of the mean square error components under the double sampling scheme were developed. However, a study of the interrelationships of these components, and of the effect of different distributions of response probabilities on the total MSE and the individual components using these estimators, would necessitate the execution of an experimental design in the field. In lieu of this, in this chapter we look at the expected mean square error of the model of Chapter II and others under the assumption that the response probabilities are distributed according to a beta distribution and that the responses from the population are distributed according to a normal distribution. Under these assumptions, then, the responses and response probabilities of the MSE are considered random variables rather than fixed. Thus, the mean square error of the sample mean, which is generally considered a constant, is in this case considered a random variable.
Three mean square error models are considered, based on three estimators of the mean:
a) Model 1, in which no adjustment is made for nonresponse.
b) Model 2, in which a Politz-Simmons type estimator is used.
c) Model 3, in which double sampling is used to correct for nonresponse, i.e., the model of Chapter II.
The expected mean square errors are developed analytically using the above assumptions. Then, based on certain assumed parameter values, numerical examples of the effect of the beta distribution on the mean square error and on the optimum values of n and k for Model 3 are studied. More specifically, we look at the following (a small simulation sketch follows this list):
a) The effect of different expected response rates on the total MSE and on the MSE components as a percent of the total MSE.
b) The effect of the shape of the response probability curve on the mean square error and its components.
c) A comparison of the MSE of the three models for fixed cost.
d) Optimum values of the sample size n and the inverse of the subsampling rate k for various cost models and expected response rates for Model 3.
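The superpopulation setup just described is straightforward to mimic by simulation. The sketch below (added for illustration; the beta and normal parameters, sample sizes, and subsampling rate are assumptions, not values from the text) draws response probabilities from a beta distribution and responses from a normal distribution, and compares the Monte Carlo MSE of the three estimators of the mean.

```python
import numpy as np

rng = np.random.default_rng(42)
N, n, k, reps = 20_000, 600, 3, 1500
a, b = 4.0, 1.5                                # beta parameters; expected response rate ~ 0.73
P = rng.beta(a, b, size=N)                     # response probabilities
Y = rng.normal(50.0, 10.0, size=N)             # responses (response error ignored for brevity)
true_mean = Y.mean()

def one_rep():
    s = rng.choice(N, size=n, replace=False)
    d = rng.random(n) < P[s]
    resp = s[d]
    m1 = Y[resp].mean()                                        # Model 1: no adjustment
    w = 1.0 / P[resp]                                          # Model 2: Politz-Simmons, P_i known
    m2 = np.sum(w * Y[resp]) / np.sum(w)
    nr = s[~d]                                                 # Model 3: double sampling
    sub = rng.choice(nr, size=max(1, len(nr) // k), replace=False)
    m3 = (Y[resp].sum() + (len(nr) / len(sub)) * Y[sub].sum()) / n
    return m1, m2, m3

est = np.array([one_rep() for _ in range(reps)])
mse = ((est - true_mean) ** 2).mean(axis=0)
print("MSE  no adjustment:  ", round(float(mse[0]), 4))
print("MSE  Politz-Simmons: ", round(float(mse[1]), 4))
print("MSE  double sampling:", round(float(mse[2]), 4))
```

Varying the beta parameters changes both the expected response rate and the shape of the response probability distribution, which is exactly the comparison pursued analytically in the remainder of this chapter.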
4.2 The Mean Square Error Models
The three estimators of the mean and their mean square errors are presented below. For simplicity in the numerical examples that follow, it is assumed that the number of interviewers J = 1. Another way of looking at this is that the model is developed for one interviewer assignment area. This is consistent with the usual presentation of estimates of mean square error components which involve the correlated response variance component.
4.2.1 Model 1 - No Adjustment
We let
-
xt - NA
•
= _1 ~
l a'Y' lt
n1 i=l
1 1
-
where xt - NA is the sample mean under the no adjustment model and the
other variables are consistent with previous model definitions. Then
the bias is equal to E(x t _NA ) -
X,
where
Xis
the true mean and
,n
Now nl is a random variable since I a. = n1 and a. is random, rei= 1 1
1
sulting in a ratio estimate. For simplicity in the development of
the model and the related examples of Section 4.4, it is assumed that
nl is, fixed. This assumption can result in a significant bias for
small sample sizes and for moderate sample sizes with low expected response
rates.
However, for large sample sizes and for moderate sample sizes with moderate expected response rates, the bias should be, if not negligible, at least "acceptable" for purposes of this chapter, i.e., the study of the interrelationship of mean square error components. So under these assumptions,

\[ \mathrm{Bias}(\bar{x}_{t\text{-}NA}) = \left[ \frac{n}{n_1}\,\frac{1}{N}\sum_{i=1}^{N} P_i\, Y_{i1} - \bar{Y}_1 \right] + \frac{1}{N}\sum_{i=1}^{N}\left( Y_{i1} - X_i \right) , \]

where the first term is due to nonresponse bias and bias of the estimator and the second term is response bias.
For large N the squared bias may be written, after simplification, as

\[ \mathrm{Bias}^2(\bar{x}_{t\text{-}NA}) = \frac{n^2}{n_1^2}\,\frac{1}{N^2}\sum_{i\neq i'}^{N} P_i P_{i'}\, Y_{i1} Y_{i'1} + \frac{1}{N^2}\sum_{i\neq i'}^{N} X_i X_{i'} - \frac{2n}{n_1}\,\frac{1}{N^2}\sum_{i\neq i'}^{N} P_i\, Y_{i1}\, X_{i'} . \]
The variance consists of the sum of three terms, representing the response variance, the nonresponse variance, and the sampling variance, respectively. It may be easily shown that the response variance is

\[ E_s E_{\delta|s}\mathrm{Var}_{t|\delta,s}(\bar{x}_{t\text{-}NA}) = \frac{n}{n_1^2}\,\frac{1}{N}\sum_{i=1}^{N} P_i\, E_t\!\left(Y_{i1t} - Y_{i1}\right)^2 + \frac{n(n-1)}{n_1^2}\,\frac{1}{N(N-1)}\sum_{i\neq i'}^{N} P_i P_{i'}\, E_t\!\left[(Y_{i1t} - Y_{i1})(Y_{i'1t} - Y_{i'1})\right] , \]

the nonresponse variance is

\[ E_s\mathrm{Var}_{\delta|s} E_{t|\delta,s}(\bar{x}_{t\text{-}NA}) = \frac{n}{n_1^2}\,\frac{1}{N}\sum_{i=1}^{N} P_i(1-P_i)\, Y_{i1}^2 , \]

and the sampling variance is

\[ \mathrm{Var}_s E_{\delta|s} E_{t|\delta,s}(\bar{x}_{t\text{-}NA}) = \frac{(1-f)\,n}{n_1^2}\,\frac{1}{N-1}\sum_{i=1}^{N}\left( P_i Y_{i1} - \overline{PY}_1 \right)^2 . \]
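As a concrete illustration of these pieces, the following short Python sketch (not part of the original development; the population values, the beta parameters, and the linear relation between P and Y are hypothetical stand-ins loosely in the spirit of the parameter values of Section 4.4.1) simulates the no adjustment estimator under stochastic response and shows the bias the fixed-n_1 approximation describes.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 20_000                                   # hypothetical finite population
    P = rng.beta(2.0, 8.0, size=N)               # response probabilities, E(P) = 0.2
    X = rng.normal(100.0, 50.0, size=N)          # true values (mu_X = 100, hypothetical spread)
    # observed first-attempt values: response bias C1 = 10 plus a positive
    # (hypothetical) linear relation with P, slope theta_1 = 10
    Y1 = X + 10.0 + 10.0 * (P - P.mean()) + rng.normal(0.0, 5.0, size=N)

    def x_bar_NA(sample):
        """No adjustment estimator: mean of the respondents only (n1 is random)."""
        a = rng.random(sample.size) < P[sample]  # response indicators a_i
        return Y1[sample][a].mean() if a.any() else np.nan

    n = 200
    est = np.array([x_bar_NA(rng.choice(N, n, replace=False)) for _ in range(2000)])
    print("E(x_NA) approx:", np.nanmean(est))
    print("true mean X-bar:", X.mean())
    print("bias approx:   ", np.nanmean(est) - X.mean())

The simulated bias combines the response bias of 10 with a nonresponse bias coming from the correlation between P and Y_1, mirroring the two-term decomposition above.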
4.2.2 Model 2 - Politz-Simmons Estimator

Assuming the P_i's are known, we let

\[ \bar{x}_{t\text{-}PS} = \frac{1}{n}\sum_{i=1}^{n} \frac{\delta_i\, y_{i1t}}{P_i} , \]

where x̄_t-PS is the Politz-Simmons type estimator of the sample mean and again the variables are as previously defined. Then Bias(x̄_t-PS) = E(x̄_t-PS) - X̄, where

\[ E(\bar{x}_{t\text{-}PS}) = E_s E_{\delta|s} E_{t|\delta,s}(\bar{x}_{t\text{-}PS}) = \frac{1}{N}\sum_{i=1}^{N} Y_{i1} , \]

so that

\[ \mathrm{Bias}(\bar{x}_{t\text{-}PS}) = \frac{1}{N}\sum_{i=1}^{N} Y_{i1} - \frac{1}{N}\sum_{i=1}^{N} X_i = \frac{1}{N}\sum_{i=1}^{N} B_{i1} . \]

For large N, then,

\[ \mathrm{Bias}^2(\bar{x}_{t\text{-}PS}) \doteq \frac{1}{N^2}\sum_{i\neq i'}^{N} B_{i1}\, B_{i'1} . \]

Thus the response probabilities have no effect on the bias of the sample mean when the P_i's, or unbiased estimators of the P_i's, are known.
The variance of x̄_t-PS is again the sum of the three terms - response variance, nonresponse variance, and sampling variance. The first term is the response variance (RV). When the expectation is taken over all possible response solicitations and all possible samples, the response variance is

\[ E_s E_{\delta|s}\mathrm{Var}_{t|\delta,s}(\bar{x}_{t\text{-}PS}) = \frac{1}{nN}\sum_{i=1}^{N}\frac{E_t\!\left(Y_{i1t}-Y_{i1}\right)^2}{P_i} + \frac{n-1}{nN(N-1)}\sum_{i\neq i'}^{N} E_t\!\left[(Y_{i1t}-Y_{i1})(Y_{i'1t}-Y_{i'1})\right] . \]

Considering next the nonresponse variance (NRV),

\[ \mathrm{Var}_{\delta|s} E_{t|\delta,s}(\bar{x}_{t\text{-}PS}) = \frac{1}{n^2}\sum_{i=1}^{n}\frac{\mathrm{Var}(\delta_i)}{P_i^2}\,Y_{i1}^2 = \frac{1}{n^2}\sum_{i=1}^{n}\frac{(1-P_i)}{P_i}\,Y_{i1}^2 . \]

Then the expectation over all possible samples yields

\[ E_s\mathrm{Var}_{\delta|s} E_{t|\delta,s}(\bar{x}_{t\text{-}PS}) = \frac{1}{nN}\sum_{i=1}^{N}\frac{(1-P_i)}{P_i}\,Y_{i1}^2 . \]

The final term is the sampling variance (SV), where

\[ \mathrm{Var}_s E_{\delta|s} E_{t|\delta,s}(\bar{x}_{t\text{-}PS}) = \frac{1-f}{n}\sum_{i=1}^{N}\frac{\left(Y_{i1}-\bar{Y}_1\right)^2}{N-1} . \]

The response probabilities, then, have an effect on the sample response variance and the nonresponse variance only.
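A minimal Python sketch of the Politz-Simmons type estimator with known response probabilities is given below (an illustration only; the population arrays are invented). Weighting each respondent by 1/P_i removes the nonresponse bias but inflates the variance through the (1-P_i)/P_i term; the beta(2, 2) choice keeps 1/P_i well behaved, echoing the alpha > 1 requirement noted later in Section 4.4.2.

    import numpy as np

    rng = np.random.default_rng(2)
    N = 20_000
    P = rng.beta(2.0, 2.0, size=N)                     # hypothetical response probabilities
    Y1 = 110.0 + 10.0 * (P - P.mean()) + rng.normal(0.0, 20.0, size=N)

    def x_bar_PS(sample):
        """Politz-Simmons type estimator: (1/n) * sum of delta_i * y_i1 / P_i."""
        delta = rng.random(sample.size) < P[sample]    # response indicators
        return np.sum(delta * Y1[sample] / P[sample]) / sample.size

    n = 200
    est = np.array([x_bar_PS(rng.choice(N, n, replace=False)) for _ in range(2000)])
    print("E(x_PS) approx:", est.mean(), "  population mean of Y1:", Y1.mean())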
4.2.3 Model 3 - Double Sampling Model

The Mean Square Error: The estimator of the mean under the double sampling model and its mean square error are developed in Chapter II and are summarized in Section 2.3.2.6. The components are shown here for the mean

\[ \bar{x}_{t\text{-}DS} = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{J} u_{ij}\,\delta_i\, y_{ij1t} + \frac{k}{n}\sum_{i=1}^{n}\sum_{j=1}^{J} (1-\delta_i)\, v_i\, w_{ij}\, y_{ij2t} , \]

where J = 1, such that in this case

\[ \bar{x}_{t\text{-}DS} = \frac{1}{n}\sum_{i=1}^{n} \delta_i\, y_{i1t} + \frac{k}{n}\sum_{i=1}^{n} (1-\delta_i)\, v_i\, y_{i2t} . \]

Then the bias of x̄_t-DS may be written as

\[ \mathrm{Bias}(\bar{x}_{t\text{-}DS}) = \frac{1}{N}\sum_{i=1}^{N} P_i\, B_{i1} + \frac{1}{N}\sum_{i=1}^{N} (1-P_i)\, B_{i2} . \]
The squared bias, for large N, is

\[ \mathrm{Bias}^2(\bar{x}_{t\text{-}DS}) = \frac{1}{N^2}\sum_{i\neq i'}^{N} P_i P_{i'}\, B_{i1} B_{i'1} + \frac{1}{N^2}\sum_{i\neq i'}^{N} (1-P_i)(1-P_{i'})\, B_{i2} B_{i'2} + \frac{2}{N^2}\sum_{i\neq i'}^{N} P_i (1-P_{i'})\, B_{i1} B_{i'2} . \]

The variance of x̄_t-DS is the sum of the response variance, nonresponse variance, and sampling variance shown below, such that RV = SRV_1 + CRV_1 + SRV_2 + CRV_2 + 2CRV_12.
Sampling Variance Terms:

\[ SSV = \frac{k-1}{nN}\sum_{i=1}^{N}(1-P_i)\,Y_{i2}^2 - \frac{k-1}{nN(N-1)}\,\frac{1}{1-\bar{P}}\sum_{i\neq i'}^{N}(1-P_i)(1-P_{i'})\,Y_{i2}\,Y_{i'2} , \]

\[ SV_1 = \frac{1}{nN}\sum_{i=1}^{N}\left( P_i Y_{i1} - \overline{PY}_1 \right)^2 , \qquad SV_2 = \frac{1}{nN}\sum_{i=1}^{N}\left[ (1-P_i) Y_{i2} - \overline{(1-P)Y}_2 \right]^2 , \]

\[ 2SV_{12} = \frac{2}{nN}\sum_{i=1}^{N}\left[ P_i Y_{i1} - \overline{PY}_1 \right]\left[ (1-P_i) Y_{i2} - \overline{(1-P)Y}_2 \right] , \]

where SV = SSV + SV_1 + SV_2 + 2SV_12.
Nonresponse Variance:

\[ NRV_1 = \frac{1}{nN}\sum_{i=1}^{N} P_i(1-P_i)\,Y_{i1}^2 , \qquad NRV_2 = \frac{1}{nN}\sum_{i=1}^{N} P_i(1-P_i)\,Y_{i2}^2 , \qquad 2NRV_{12} = \frac{2}{nN}\sum_{i=1}^{N} P_i(1-P_i)\,Y_{i1}Y_{i2} , \]

such that NRV = NRV_1 + NRV_2 - 2NRV_12.

The variance terms due to the random assignment of interviewers listed in Section 2.3.2.6 of Chapter II vanish when considering one interviewer assignment area.

The Optimum Subsampling Rule: The optimum subsampling rule is developed under the assumptions of this chapter; thus A_1 and A_2 of Chapter II are simplified in accordance with the final variance terms above and assuming n ≐ n-1,

\[ A_1 = n\left[ SRV_1 + NRV - \sigma_{ss}^2 + SV \right] . \]
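The double sampling estimator itself can be sketched in the same style. The sketch below is a hypothetical illustration with J = 1: the subsample of nonrespondents is taken by simple random sampling at rate 1/k and, as in the model, every subsampled nonrespondent is assumed to yield a response Y_i2 with the smaller response bias C_2 = 6, so the simulated bias should be near E(P)C_1 + E(1-P)C_2.

    import numpy as np

    rng = np.random.default_rng(3)
    N = 20_000
    P = rng.beta(2.0, 8.0, size=N)                 # E(P) = 0.2
    X = rng.normal(100.0, 20.0, size=N)
    Y1 = X + 10.0 + rng.normal(0.0, 5.0, size=N)   # first-phase response, bias C1 = 10
    Y2 = X + 6.0 + rng.normal(0.0, 5.0, size=N)    # subsample response, bias C2 = 6

    def x_bar_DS(sample, k):
        """Double sampling estimator (J = 1): respondents plus re-weighted subsample."""
        delta = rng.random(sample.size) < P[sample]        # initial response indicators
        resp, nonresp = sample[delta], sample[~delta]
        if nonresp.size == 0:
            return Y1[resp].mean()
        m = max(1, round(nonresp.size / k))                # subsample about 1/k of the nonrespondents
        sub = rng.choice(nonresp, size=m, replace=False)
        # weight the subsampled nonrespondents back up by the realized inverse rate
        return (Y1[resp].sum() + (nonresp.size / m) * Y2[sub].sum()) / sample.size

    n, k = 200, 1.3
    est = np.array([x_bar_DS(rng.choice(N, n, replace=False), k) for _ in range(2000)])
    print("E(x_DS) approx:", est.mean(), "  true mean X-bar:", X.mean())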
4.3 The Development of the Expected MSE's

The expected mean square errors of the three models presented in Section 4.2 above are studied using the superpopulation model approach. It is assumed that the vectors of population values Y = (Y_1, Y_2, ..., Y_N) and response probabilities P = (P_1, P_2, ..., P_N) are the realized outcome of a set of random variables. The superpopulation model refers to a specified set of conditions that define a class of distributions to which these joint distributions belong; that is, Y and P are treated as outcomes of random vectors with distributions about which certain features are assumed known (Cassel, Sarndal, and Wretman, 1977).

Cassel et al. list different interpretations of the superpopulation concept, the most basic being that the finite population is drawn at random from a larger universe or infinite population. In this case it is assumed that the infinite population to which the Y_i's belong is the normal distribution and that to which the response probabilities belong is the beta distribution.
The beta distribution is a natural candidate for the infinite population from which it is assumed that the response probabilities are drawn, for like the response probabilities, the beta variable is continuous and varies between 0 and 1. The beta density is defined below:

\[ f(P_i) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, P_i^{\alpha-1}\,(1-P_i)^{\beta-1}, \qquad \alpha,\beta > 0,\; 0 \le P_i \le 1, \]

where E(P_i) = α/(α+β). Though it is unlikely that every unit in the population has response probabilities distributed according to the same beta distribution, for simplicity it is assumed so in the development of the expected mean square errors of these models. Thus E(P_i) can be considered as the overall expected response rate.

The parameters α and β are the shape parameters, and their values indicate how the response probabilities are distributed in the population. For example, if α < β, then the response probabilities are skewed to the right, with more of the population having low response probabilities; if α > β, the opposite holds; and if α = β, then the distribution of response probabilities is symmetric about 0.5. Further, if both α and β > 1, the density function has a single mode at (α-1)/(α+β-2). If α < 1 and β < 1, there is a minimum value or antimode at this value and we have a U-shaped distribution. If (α-1)(β-1) ≤ 0, then for 0 < P < 1 the distribution is J-shaped (Johnson and Kotz, 1970).
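The shape rules quoted above are easy to check numerically. The following small Python helper (an illustration only, not part of the original analysis) classifies a beta(α, β) distribution and reports its mean and mode or antimode:

    from math import nan

    def beta_shape(a, b):
        """Classify the beta(a, b) density following the rules of Johnson and Kotz (1970)."""
        mean = a / (a + b)
        turning_point = (a - 1) / (a + b - 2) if (a + b) != 2 else nan
        if a > 1 and b > 1:
            shape = "unimodal, mode at %.3f" % turning_point
        elif a < 1 and b < 1:
            shape = "U-shaped, antimode at %.3f" % turning_point
        elif (a - 1) * (b - 1) <= 0:
            shape = "J-shaped (or reverse J-shaped)"
        else:
            shape = "boundary case"
        return mean, shape

    for a, b in [(0.5, 2.0), (2.0, 8.0), (8.0, 8.0), (0.125, 0.5)]:
        print((a, b), beta_shape(a, b))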
The normal distribution is, of course, one of the most frequently assumed distributions of continuous random variables. The normal density is defined as

\[ f(Y_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(Y_i-\mu_Y)^2}{2\sigma^2}}, \qquad \sigma^2 > 0, \]

where the location parameter μ_Y is the mean of the distribution and the scale parameter σ² is the variance. It is assumed that in the infinite population the Y_i's are independently and identically distributed. It is further assumed that the X_i's, the true values, are also drawn from a normal distribution with the same variance σ² and mean μ_X. Then

\[ E(Y_{i1} - X_i) = E(B_{i1}) = \mu_{Y_1} - \mu_X = C_1 . \]

The observed value Y_i and the true value X_i are jointly normal with correlation ρ_XY for all i. For i ≠ i' the correlation is 0.
It is reasonable to assume that the response probabilities and the observed values are correlated. Indeed, it is through this correlation that nonresponse bias is realized. For example, persons in poor health may tend to have low response probabilities, which would correlate with responses in a health survey. Thus, throughout the development of the E(MSE)'s of these models, it is assumed that the P_i's and Y_i's are correlated for all i, and Cov(P_i, Y_{i'}) = 0 for all i ≠ i'. Though no assumption about the joint distribution of P and Y is made, it is assumed that the conditional distribution of Y given P is normal with constant variance. The use of this assumption in estimating the covariance between functions of P and functions of Y is discussed in Section 4.3.4 below. Additional assumptions specific to Model 3 are discussed in Section 4.3.3 below.
4.3.1 Model 1 - No Adjustment

The expectation of the MSE is developed below under the assumptions discussed above. The covariances between functions of P and Y are discussed collectively in Section 4.3.4 rather than with each model. It is assumed throughout that N-1 ≐ N.

The expected squared bias is developed below:

\[ E[\mathrm{Bias}^2(\bar{x}_{t\text{-}NA})] = E\!\left[ \frac{n^2}{n_1^2}\,\frac{1}{N^2}\sum_{i\neq i'}^{N} P_i P_{i'}\, Y_{i1} Y_{i'1} + \frac{1}{N^2}\sum_{i\neq i'}^{N} X_i X_{i'} - \frac{2n}{n_1}\,\frac{1}{N^2}\sum_{i\neq i'}^{N} P_i Y_{i1} X_{i'} \right] = \frac{n^2}{n_1^2}\,[E(PY_1)]^2 + [E(\bar{X})]^2 - \frac{2n}{n_1}\,E(\bar{X})\,E(PY_1) . \]

This may also be written as

\[ \frac{n^2}{n_1^2}\left[ \mathrm{Cov}(P, Y_1) + E(P)\,E(Y_1) \right]^2 + \mu_X^2 - \frac{2n}{n_1}\,\mu_X\, E(PY_1) . \]
The expectations of the three variance terms are given below.

The Response Variance: The expected response variance, which is the sum of the simple and correlated response variance, is

\[ E(RV) = E(SRV) + E(CRV) = \frac{n}{n_1^2}\, E\!\left[\frac{1}{N}\sum_{i=1}^{N} P_i\, E_t\!\left(Y_{i1t} - Y_{i1}\right)^2\right] + \frac{n(n-1)}{n_1^2}\, E\!\left\{\frac{1}{N(N-1)}\sum_{i\neq i'}^{N} P_i P_{i'}\, E_t\!\left[(Y_{i1t} - Y_{i1})(Y_{i'1t} - Y_{i'1})\right]\right\} . \]

Now it is assumed that the response error is not correlated with the response probabilities, for simplicity in dealing with already difficult terms; however, such correlation might well exist. Had the model been developed such that the response probabilities were dependent on the interviewer, this assumption, at least for the correlated response variance, would be more difficult to justify. Then, letting n ≐ n-1 and defining

\[ \sigma_R^{2*} = \frac{1}{N}\sum_{i=1}^{N} E_t\!\left(Y_{i1t} - Y_{i1}\right)^2 \qquad\text{and}\qquad \rho_1\,\sigma_R^{2*} = \frac{1}{N(N-1)}\sum_{i\neq i'}^{N} E_t\!\left[(Y_{i1t} - Y_{i1})(Y_{i'1t} - Y_{i'1})\right] , \]

the expected response variance is obtained by replacing P_i with E(P) = α/(α+β) and P_i P_{i'} with [α/(α+β)]² in the expression above.
The Nonresponse Variance:

\[ E(NRV) = \frac{n}{n_1^2}\left[ E(P\,Y_1^2) - E(P^2 Y_1^2) \right] , \]

which may be written as

\[ \frac{n}{n_1^2}\left\{ \left[\mathrm{Cov}(P, Y_1^2) - \mathrm{Cov}(P^2, Y_1^2)\right] + E(Y_1^2)\, E\!\left[P(1-P)\right] \right\} . \]

The Sampling Variance:

\[ E(SV) = E\!\left[ \frac{(1-f)\,n}{n_1^2}\,\sum_{i=1}^{N}\frac{\left(P_i Y_{i1} - \overline{PY}_1\right)^2}{N-1} \right] . \]

If we let 1-f ≐ 1 and write the leading term as (1/N) Σ P_i² Y_{i1}², this is also

\[ E(SV) = \frac{n}{n_1^2}\left\{ \left[\mathrm{Cov}(P^2, Y_1^2) + E(P^2)E(Y_1^2)\right] - \left[\mathrm{Cov}(P, Y_1) + E(P)E(Y_1)\right]^2 \right\} . \]
4.3.2 Model 2 - Politz-Simmons

The expected bias and variance for this estimator are given below, using the same assumptions as those of Model 1. The expectation of the squared bias is

\[ E[\mathrm{Bias}^2(\bar{x}_{t\text{-}PS})] = E\!\left(\frac{1}{N^2}\sum_{i\neq i'}^{N} B_{i1} B_{i'1}\right) = [E(B_{i1})]^2 = C_1^2 . \]

The expectation of the variance is developed below for the three terms.

The Response Variance: Again assuming the response probabilities and the response variance are uncorrelated, letting n ≐ n-1, and letting σ_R^{2*} and ρ_1 σ_R^{2*} be defined as in Section 4.3.1 above,

\[ E(RV) = \frac{1}{n}\,\frac{\alpha+\beta-1}{\alpha-1}\, E(\sigma_R^{2*}) + E(\rho_1\,\sigma_R^{2*}) . \]

The Nonresponse Variance:

\[ E(NRV) = E\!\left[\frac{1}{nN}\sum_{i=1}^{N}\frac{(1-P_i)}{P_i}\,Y_{i1}^2\right] = \frac{1}{n}\left[ E\!\left(\frac{Y_1^2}{P}\right) - E(Y_1^2) \right] . \]

The Sampling Variance: The sampling variance may be written as

\[ E(SV) = \frac{\sigma^2}{n} , \qquad\text{where } 1-f \doteq 1 . \]
4.3.3 Model 3 - Double Sampling

Some additional assumptions about the superpopulation model from which it is assumed the finite population values are drawn are given below. These assumptions are used in the development of the expected mean square error and consequently in finding the optimum n and k for fixed cost. The expectation and variance of Y_{i1} are defined as in Sections 4.3.1 and 4.3.2 above. Both Cov(Y_{i1}, Y_{i'1}) and Cov(Y_{i2}, Y_{i'2}) = 0 for i ≠ i'. However, Y_{i1} and Y_{i2} are assumed to be jointly normal. Further, f(Y_{i1}, Y_{i2} | P_i) is assumed to be bivariate normal, such that f(Y_{i1}, Y_{i2} | P_i) ~ N(μ_P, V), where

\[ \boldsymbol{\mu}_P = \begin{pmatrix} \mu_{Y_1|P} \\ \mu_{Y_2|P} \end{pmatrix} . \]

Other assumptions discussed above are assumed to hold for this model.
The Expected Mean Square Error: The expected values of both the squared bias term and the variance are developed below. Considering first the expected squared bias,

\[ E[\mathrm{Bias}^2(\bar{x}_{t\text{-}DS})] = E\!\left[\frac{1}{N^2}\sum_{i\neq i'}^{N} P_i P_{i'} B_{i1} B_{i'1} + \frac{1}{N^2}\sum_{i\neq i'}^{N} (1-P_i)(1-P_{i'}) B_{i2} B_{i'2} + \frac{2}{N^2}\sum_{i\neq i'}^{N} P_i (1-P_{i'}) B_{i1} B_{i'2}\right] = [E(P B_1)]^2 + \left\{E[(1-P) B_2]\right\}^2 + 2\,E(P B_1)\,E[(1-P) B_2] . \]

Now, if it is assumed that the response probabilities are only negligibly correlated with the response bias, then

\[ E[\mathrm{Bias}^2(\bar{x}_{t\text{-}DS})] \doteq \left(\frac{\alpha}{\alpha+\beta}\right)^{2} C_1^2 + \left(\frac{\beta}{\alpha+\beta}\right)^{2} C_2^2 + \frac{2\alpha\beta}{(\alpha+\beta)^2}\, C_1 C_2 . \]
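As a quick check of orders of magnitude, using the response biases later adopted in Section 4.4.1 (C_1 = 10, C_2 = 6) and E(P) = 0.2, this gives E[Bias²] ≐ (0.2)²(10)² + (0.8)²(6)² + 2(0.2)(0.8)(10)(6) = 4 + 23.04 + 19.2 = 46.24, which is consistent with the expected squared bias being about 7.8 percent of the E(MSE) of 596 reported for E(P) = 0.2 in Table 4.4.2.6.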
The expected variance is developed next.

The Response Variance: The expected response variance terms are given below, assuming no correlation between the response probabilities and the response variance or random response error:

\[ E(RV_1) = \frac{1}{n}\,\frac{\alpha}{\alpha+\beta}\,\sigma_{R_1}^{2*} , \qquad E(RV_2) = \frac{1}{n}\,\frac{\beta}{\alpha+\beta}\,\sigma_{R_2}^{2*} , \qquad 2E(CRV_{12}) = \frac{2\alpha\beta}{(\alpha+\beta)^2}\,\sigma_{R_{12}}^{2*} , \]

where αβ/(α+β)² = E(P)E(1-P).
Nonresponse Variance:

\[ E(NRV_1) = \frac{1}{n}\left[ E(PY_1^2) - E(P^2 Y_1^2) \right] = \frac{1}{n}\left\{ \left[\mathrm{Cov}(P,Y_1^2) - \mathrm{Cov}(P^2,Y_1^2)\right] + E(Y_1^2)\,E[P(1-P)] \right\} . \]

Likewise,

\[ E(NRV_2) = \frac{1}{n}\left[ E(PY_2^2) - E(P^2 Y_2^2) \right] = \frac{1}{n}\left\{ \left[\mathrm{Cov}(P,Y_2^2) - \mathrm{Cov}(P^2,Y_2^2)\right] + E(Y_2^2)\,E[P(1-P)] \right\} , \]

and

\[ 2E(NRV_{12}) = E\!\left[\frac{2}{nN}\sum_{i=1}^{N} P_i(1-P_i)\,Y_{i1}Y_{i2}\right] = \frac{2}{n}\left[ E(PY_1Y_2) - E(P^2 Y_1 Y_2) \right] = \frac{2}{n}\left\{ E\!\left[P\,E(Y_1Y_2|P)\right] - E\!\left[P^2\,E(Y_1Y_2|P)\right] \right\} . \]
Sampling Variance:

\[ E(SSV) = E\!\left[ \frac{k-1}{nN}\sum_{i=1}^{N}(1-P_i)\,Y_{i2}^2 - \frac{k-1}{nN(N-1)}\,\frac{1}{1-\bar{P}}\sum_{i\neq i'}^{N}(1-P_i)(1-P_{i'})\,Y_{i2}Y_{i'2} \right] , \]

which, using the large sample approximation, is then

\[ E(SSV) = \frac{k-1}{n}\,E\!\left[(1-P)\,Y_2^2\right] - \frac{k-1}{n}\,\frac{\left\{E\!\left[(1-P)\,Y_2\right]\right\}^2}{E(1-P)} . \]

With 1-f ≐ 1 and n ≐ n-1, then

\[ E(SV_1) = \frac{1}{n}\left\{ E(P^2 Y_1^2) - \left[E(PY_1)\right]^2 \right\} \qquad\text{and}\qquad E(SV_2) = \frac{1}{n}\left\{ E\!\left[(1-P)^2 Y_2^2\right] - \left\{E\!\left[(1-P)Y_2\right]\right\}^2 \right\} . \]

Now

\[ 2SV_{12} = \frac{2}{nN}\sum_{i=1}^{N}\left[P_i Y_{i1} - \overline{PY}_1\right]\left[(1-P_i)Y_{i2} - \overline{(1-P)Y}_2\right] \]

may be written as

\[ \frac{2}{nN}\sum_{i=1}^{N} P_i(1-P_i)\,Y_{i1}Y_{i2} - \frac{2}{nN(N-1)}\sum_{i\neq i'}^{N} P_i(1-P_{i'})\,Y_{i1}Y_{i'2} , \]

so that

\[ 2E(SV_{12}) = 2E(NRV_{12}) - \frac{2}{n}\,E(PY_1)\,E\!\left[(1-P)Y_2\right] . \]

Then, if the expected population variance E(PVAR) is defined as E(PVAR) = E(SV_1) + E(SV_2) + 2E(SV_12),

\[ E(SV) = E(SSV) + E(PVAR) . \]
Optimum Subsampling Rule: A_1 and A_2 are shown in Section 4.2.3 above, and the expectation of these terms is

\[ E(A_1) = n\left[ E(SRV_1) + E(NRV) - E(\sigma_{ss}^2) + E(PVAR) \right] , \qquad E(A_2) = n\left[ E(SRV_2) + E(\sigma_{ss}^2) \right] , \]

where ((k-1)/n) E(σ_ss²) = E(SSV). These variance terms are defined in Section 4.3.3 above and in Section 4.3.5 below.

Now, for fixed cost, the expected cost equation is

\[ C - C_0 = c_1\, n\,\frac{\alpha}{\alpha+\beta} + c_2\, n\,\frac{\beta}{\alpha+\beta} + c_3\,\frac{n}{k}\,\frac{\beta}{\alpha+\beta} . \]

So, minimizing the expected mean square error subject to this constraint yields the optimum value of 1/k² (and hence of k), with n then determined from the expected cost equation.
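The optimum n and k can be sketched numerically. The exact expression minimized in the text is not fully legible here, so the Python sketch below simply assumes a generic double-sampling form, E(MSE) ≈ constant + [E(A_1) + k E(A_2)]/n, together with the expected cost equation above, and scans k on a grid; E(A_1) and E(A_2) are supplied as hypothetical inputs.

    def optimum_n_k(EA1, EA2, C_minus_C0, c1, c2, c3, EP, k_grid=None):
        """Grid-search the subsampling parameter k and solve n from the expected cost
        equation  C - C0 = n*(c1*E(P) + c2*(1-E(P))) + (n/k)*c3*(1-E(P))."""
        if k_grid is None:
            k_grid = [1.0 + 0.01 * i for i in range(401)]    # k in [1, 5]
        best = None
        for k in k_grid:
            n = C_minus_C0 / (c1 * EP + c2 * (1 - EP) + c3 * (1 - EP) / k)
            emse = (EA1 + k * EA2) / n                        # variable part of the assumed E(MSE)
            if best is None or emse < best[0]:
                best = (emse, n, k)
        return best

    # hypothetical E(A1), E(A2); cost values echo Table 4.4.2.9 (C - C0 = 5000, c3 = 50)
    emse, n_opt, k_opt = optimum_n_k(EA1=60000.0, EA2=15000.0,
                                     C_minus_C0=5000.0, c1=25.0, c2=5.0, c3=50.0, EP=0.3)
    print("optimum n ~ %.0f, k ~ %.2f" % (n_opt, k_opt))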
4.3.4 The Covariance of Functions of P and Y
The joint distribution of P and Y_m can be defined as

\[ f(P, Y_m) = f(Y_m | P)\, f(P) , \qquad m = 1, 2, \]

where for purposes of this study it is assumed that f(Y_m | P) is normal with mean μ_{Y_m|P} and variance σ². Now

\[ E(PY_m) = E\!\left[ P\, E(Y_m | P) \right] = E\!\left( P\,\mu_{Y_m|P} \right) . \tag{4.3.4.1} \]

Now since P is defined over a finite range, μ_{Y_m|P} may be approximated as a polynomial function of P; that is, for m = 1,

\[ \mu_{Y_1|P} \doteq \theta_0 + \theta_1 P + \theta_2 P^2 + \cdots \]

It is assumed here for simplicity that the conditional mean may be approximated with a linear equation, so that

\[ \mu_{Y_1|P} \doteq \theta_0 + \theta_1 P . \]

Then E(PY_1) can be defined as

\[ E(PY_1) \doteq E\!\left[ P(\theta_0 + \theta_1 P) \right] = \theta_0\, E(P) + \theta_1\, E(P^2) = E(Y_1)E(P) + \mathrm{Cov}(P, Y_1) . \tag{4.3.4.2} \]

Likewise,

\[ E(PY_1^2) = E\!\left[ P\, E(Y_1^2 | P) \right] = E\!\left\{ P\left[ (\theta_0 + \theta_1 P)^2 + \sigma^2 \right] \right\} = (\theta_0^2 + \sigma^2)\, E(P) + 2\theta_0\theta_1\, E(P^2) + \theta_1^2\, E(P^3) = E(Y_1^2)\, E(P) + \mathrm{Cov}(P, Y_1^2) , \]

\[ E(P^2 Y_1^2) = E\!\left[ P^2\, E(Y_1^2 | P) \right] = (\theta_0^2 + \sigma^2)\, E(P^2) + 2\theta_0\theta_1\, E(P^3) + \theta_1^2\, E(P^4) = E(Y_1^2)\, E(P^2) + \mathrm{Cov}(P^2, Y_1^2) , \]

and

\[ E\!\left(\frac{Y_1^2}{P}\right) = E\!\left[ \frac{1}{P}\, E(Y_1^2 | P) \right] = (\theta_0^2 + \sigma^2)\, E\!\left(\frac{1}{P}\right) + 2\theta_0\theta_1 + \theta_1^2\, E(P) . \]

Further,

\[ E(PY_1Y_2) = E\!\left[ P\, E(Y_1Y_2 | P) \right] = E\!\left\{ P\left[ \mathrm{Cov}(Y_1, Y_2 | P) + E(Y_1|P)\,E(Y_2|P) \right] \right\} . \]

Now it is assumed that E(Y_1|P) and E(Y_2|P) differ only in the constant term, where θ_0' + θ_1 P ≐ μ_{Y_2|P}. Since Y_1 and Y_2 differ only in the degree of response bias, this follows from the assumption that response bias is not correlated with P.

Now μ'_r, the r-th moment of P about zero, is

\[ \mu_r' = E(P^r) = \frac{\alpha^{[r]}}{(\alpha+\beta)^{[r]}} , \qquad r \text{ an integer,} \]

where x^{[r]} = x(x+1)\cdots(x+r-1), the ascending factorial, so that

\[ E(P) = \frac{\alpha}{\alpha+\beta} , \qquad E(P^2) = \frac{\alpha(\alpha+1)}{(\alpha+\beta)(\alpha+\beta+1)} . \]

Further,

\[ E\!\left(\frac{1}{P}\right) = \frac{\alpha+\beta-1}{\alpha-1} . \]
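The moment formulas above translate directly into code. The following Python helper (illustrative only) computes E(P^r) via the ascending factorial and E(1/P), which is the quantity that drives the negative Politz-Simmons components when α is at or below 1:

    def ascending_factorial(x, r):
        """x^[r] = x (x+1) ... (x+r-1)."""
        out = 1.0
        for i in range(r):
            out *= x + i
        return out

    def beta_moment(a, b, r):
        """E(P^r) for P ~ beta(a, b)."""
        return ascending_factorial(a, r) / ascending_factorial(a + b, r)

    def beta_inverse_moment(a, b):
        """E(1/P) = (a + b - 1)/(a - 1); finite only for a > 1."""
        return (a + b - 1.0) / (a - 1.0)

    a, b = 2.0, 8.0                                # E(P) = 0.2
    print("E(P)   =", beta_moment(a, b, 1))        # 0.2
    print("E(P^2) =", beta_moment(a, b, 2))        # 2*3/(10*11)
    print("E(1/P) =", beta_inverse_moment(a, b))   # 9.0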
4.3.5 Summary of Expected Mean Square Error Components

The E(MSE) components for the three models are listed below, where the E(MSE) is the sum of the listed components.

No Adjustment Model:

\[ E[\mathrm{Bias}^2(\bar{x}_{t\text{-}NA})] = \frac{n^2}{n_1^2}\,[E(PY_1)]^2 + \mu_X^2 - \frac{2n}{n_1}\,\mu_X\, E(PY_1) \]
\[ E(RV) = \frac{n}{n_1^2}\,\frac{\alpha}{\alpha+\beta}\,E(\sigma_R^{2*}) + \frac{n^2}{n_1^2}\left(\frac{\alpha}{\alpha+\beta}\right)^{2} E(\rho_1\,\sigma_R^{2*}) \]
\[ E(NRV) = \frac{n}{n_1^2}\left[ E(PY_1^2) - E(P^2 Y_1^2) \right] \]
\[ E(SV) = \frac{n}{n_1^2}\left\{ E(P^2 Y_1^2) - [E(PY_1)]^2 \right\} \]

Politz-Simmons Model:

\[ E[\mathrm{Bias}^2(\bar{x}_{t\text{-}PS})] = C_1^2 \]
\[ E(RV) = \frac{1}{n}\,\frac{\alpha+\beta-1}{\alpha-1}\,E(\sigma_R^{2*}) + E(\rho_1\,\sigma_R^{2*}) \]
\[ E(NRV) = \frac{1}{n}\left[ E\!\left(\frac{Y_1^2}{P}\right) - E(Y_1^2) \right] \]
\[ E(SV) = \frac{\sigma^2}{n} \]

Double Sampling Model:

\[ E(RV) = E(RV_1) + E(RV_2) + 2E(CRV_{12}) \]
\[ E(NRV_1) = \frac{1}{n}\left[ E(PY_1^2) - E(P^2Y_1^2) \right] , \quad E(NRV_2) = \frac{1}{n}\left[ E(PY_2^2) - E(P^2Y_2^2) \right] , \quad 2E(NRV_{12}) = \frac{2}{n}\left[ E(PY_1Y_2) - E(P^2Y_1Y_2) \right] \]
\[ E(NRV) = E(NRV_1) + E(NRV_2) - 2E(NRV_{12}) \]
\[ E(SSV) = \frac{k-1}{n}\left\{ E\!\left[(1-P)Y_2^2\right] - \frac{\left\{E\!\left[(1-P)Y_2\right]\right\}^2}{E(1-P)} \right\} \]
\[ E(SV_1) = \frac{1}{n}\left\{ E(P^2Y_1^2) - [E(PY_1)]^2 \right\} , \qquad E(SV_2) = \frac{1}{n}\left\{ E\!\left[(1-P)^2Y_2^2\right] - \left\{E\!\left[(1-P)Y_2\right]\right\}^2 \right\} \]
\[ 2E(SV_{12}) = 2E(NRV_{12}) - \frac{2}{n}\,E(PY_1)\,E\!\left[(1-P)Y_2\right] \]
\[ E(SV) = E(SSV) + E(SV_1) + E(SV_2) + 2E(SV_{12}) \]

4.4 Hypothetical Examples of E(MSE)
In this section the expected mean square error of the three models is studied using certain assumed values of the parameters. It is hoped that these hypothetical examples will give some insight into the effect of differing response rates on the E(MSE) and its components; how the distribution of P in the population relates to the E(MSE); and when the double sampling scheme is preferred over the no adjustment and Politz-Simmons models.

4.4.1 The Parameter Values
The choice of parameter values in looking at the E(MSE) can affect the conclusions that are drawn. Thus, though the final choices are arbitrary, they at least should be kept within reasonable bounds; such was the attempt here. The following values were chosen:

means: μ_X = 100; μ_Y1 = 110; μ_Y2 = 106
n: 200, 2000
n_1: E(P) × n
σ²: 50,000
SRV*: chosen such that σ_R²/σ² = .15
CRV*: chosen such that ρ_R σ_R²/(σ²/n) = 25 percent for n = 200
ρ_Y1Y2: 0.5
θ_1: 10
Cost: C - C_0 = 5000; c_3 = 50

Now with μ_X = 100, μ_Y1 and μ_Y2 were chosen such that the relative biases would be 10 and 6 percent, respectively; the assumption then is that the improved procedure used in obtaining interviews from the subsample of nonrespondents would produce lower response biases. The values, though arbitrary, are felt to be moderate.
The sample size n = 200 was chosen as a compromise between a value that could logically be assumed to be the size of one interviewer assignment area and one for which, for small values of E(P), n_1 would not be exceedingly small. The sample size n = 2000 is, admittedly, too large for one interviewer assignment area, but was chosen to show the changing percent the MSE components are of E(MSE) as n increases.

The variance σ² was chosen to give reasonable coefficients of variation; for n = 200 and 2000 the coefficients of variation are 14-15 percent and 4-5 percent, respectively, for the three means. The simple response variance was chosen such that σ_R²/σ² = .15; this was approximately the median ratio of simple response variance to sampling variance presented in Table 1 of Bailey (1975) for several 1970 Census items. Two sources were checked for the correlated response variance - Table 2 in Bailey (1975) and Table 3 in Bailey (1978). Using these tables as a guide, a moderate, if not somewhat conservative, value was chosen such that for n = 200, ρ_R σ_R²/(σ²/n) = 25 percent.

A moderate value of ρ = .5 was chosen as the correlation between Y_1 and Y_2. As the expected responses differ only in response bias, this could possibly be low. However, its value affects the expected nonresponse and sampling variances of Model 3 only, not the E(MSE); thus, a table showing the effect of differing values of ρ on the expected nonresponse and population variance of the double sampling model is presented.
The parameter θ_1 is associated with the correlation between P and Y_m through the estimate of E(Y_m|P) = μ_{Y_m|P}. It may be recalled that E(PY_m) = E(P μ_{Y_m|P}), and Equations 4.3.4.1 and 4.3.4.2 of Section 4.3.4 emphasize the relationship between θ_1 and Cov(P, Y_m). θ_1 was set at 10, which gives a low correlation of approximately 1 to 4 percent, depending on E(P) and thus on the values of α and β. Since θ_1 has the most effect on the expected squared bias of the unadjusted model, and thus on its E(MSE), Table 4.4.2.1 shows the effect of differing values of θ_1 on the expected squared bias of this model.
4.4.2 Discussion of Results

Tables 4.4.2.2 and 4.4.2.3 show the effect of the shape of the beta distribution on the E(MSE) and its components for three different values of E(P) - 0.2, 0.5, and 0.8 - for the no adjustment model, in both Table 4.4.2.2, where n = 200, and Table 4.4.2.3, where n = 2000. E(MSE) decreases as α and β increase, or, in terms of shape, as the beta distribution changes from U-shaped to unimodal. The decrease, however, is more gradual as α and β increase. Each component of the E(MSE) is also affected by the changing size of the α's and β's. The percent each variance term is of the E(MSE) increases as α and β increase, with the exception of E(SV) in both tables.

As E(P) increases, the E(SV) and the overall expected variance decrease in importance as the expected squared bias increases in importance. Though not shown in these tables, the E(MSE) remains the same over all α, β within an E(P) grouping when the response probabilities are uncorrelated with the responses. In fact, in that case, as indicated in Table 4.4.2.1, the expected squared bias is due to the response bias only.

Tables 4.4.2.2 and 4.4.2.3 differ in the importance various components are of the E(MSE). As n increases, the E(SRV), E(NRV), and E(SV) decrease in importance, while E(CRV) and E(SQBIAS) increase in importance. This follows since the latter two components are not affected by increasing n.
Tables 4.4.2.4 and 4.4.2.5 show results from the simplified Politz-Simmons model. The entries of α and β were changed for this model, since for values of α < 1 some of the E(MSE) components were negative; we recall that the E(SRV) and E(NRV) involve E(1/P) = (α+β-1)/(α-1). The E(MSE) decreases both with increasing α and β and with increasing E(P), and neither decrease is due to the correlation of functions of P and Y. For E(P) = 0.2 and 0.5 there is a dramatic drop in E(MSE) between α = 1.05 and α = 2, with the decrease tapering off both for larger values of α and β and for larger E(P). The distributions of P shown here are all unimodal. As α and β increase, the variance of P decreases, with the majority of units having response probabilities concentrated around E(P).

For E(P) = 0.2 and 0.5 and n = 200, the major contributor to the E(MSE) is the E(NRV), while for this same n and E(P) = 0.8, the largest contributor to the E(MSE) is the expected sampling variance. For n = 2000 and E(P) = 0.2 the largest contributor to the E(MSE) is again the E(NRV); however, for E(P) = 0.5 and 0.8 the largest contributor is the E(SQBIAS), with the E(CRV) the second largest contributor.
Tables 4.4.2.6 and 4.4.2.7 show the contributions of the expected mean square error components for the double sampling model for n = 200 and n = 2000, respectively. Only two terms were affected more than negligibly by changing α and β - the E(NRV) and the expected population variance, E(PVAR). As the E(NRV) and the percent it is of the E(MSE) increase with increasing α and β, the E(PVAR) and the percent it is of the E(MSE) decrease by an equal amount; thus, within each E(P) grouping, the sum of the percents E(PVAR) and E(NRV) are of the E(MSE) remains constant, at approximately 41.9 percent for E(P) = 0.2, 46.4 percent for E(P) = 0.5, and 51.8 percent for E(P) = 0.8 when n = 200. For n = 2000 these same percentages are 15.9, 14.9, and 13.9, respectively. For n = 200, as E(P) increases, the percent the expected sampling variance is of the E(MSE) decreases, while the percents the expected response variance and the expected squared bias are of the E(MSE) increase. For n = 2000 the percents the E(SV) and E(RV) are of the E(MSE) decrease with increasing E(P), while E(SQBIAS) becomes a larger proportion of the E(MSE). In this model, and in some cases in the no adjustment model, the E(MSE) does not decrease as E(P) increases for n = 2000. This is because the decrease in expected variance is outweighed by the increase in expected squared bias, which is not the case for n = 200.
Table 4.4.2.8 shows the effect of the correlation between responses in the initial sample and in the subsample of nonrespondents on the E(NRV) and E(SV). For ρ = 1 the values are not exactly 0, but very close to it.

Table 4.4.2.9 shows the optimum values of n and k for various cost functions. Where k = 1.00 is shown, the actual estimated values for k were < 1.00. The largest k shown is only 2.16. As the cost ratios increase, both n and k decrease, as expected; however, as E(P) increases, the values of n and k increase. For smaller E(P) there are more nonrespondents, and a smaller percentage of the original nonrespondents needs to be subsampled to achieve the same accuracy for fixed cost. For E(P) = 0.3, CR_1 = 0.5 (or c_1 = $25), and CR_2 = 0.10 (or c_2 = $5), the optimum n is the same as for E(P) = 0.9, CR_1 = 0.8, and CR_2 = 0.16.
Tables 4.4.2.10 and 4.4.2.11 compare the E(MSE)'s for the three models for fixed cost. For the double sampling model the n and k used were calculated using the optimization methods of Section 4.3.3 above, as were the n and k of Table 4.4.2.9. For the no adjustment and Politz-Simmons models, n is based on the fixed cost equation C - C_0 = c_1 n E(P) + c_2 n E(1-P), where E(P) = α/(α+β) and E(1-P) = β/(α+β). We may note again that for the Politz-Simmons model the U- and J-shaped distributions had negative expected mean square error components because of the effect of E(1/P).

Two other factors stated previously are also worth noting again:

a. The no adjustment model does not contain the variation due to the random component n_1; thus, its E(MSE) is probably understated.

b. The Politz-Simmons model considers the response probabilities known and thus does not contain the source of variation due to the estimation of the P_i's.
In Table 4.4.2.10 the E(MSE) for the double sampling model for fixed cost is preferred in most cases. However, as α and β increase, the E(MSE)'s of the other two models decrease. For E(P) = 0.2, α and β would have to be very large for the E(MSE) of the no adjustment model or the Politz-Simmons model to be smaller than that of the double sampling model. However, for E(P) = 0.5 and 0.8, the E(MSE) of the Politz-Simmons model is approximately the same as that of the double sampling model when α ≥ 8.
Table 4.4.2.11, with different cost ratios, shows different relationships among the models. When the cost of obtaining and processing sampling units in the initial sample is only 20 percent as high as for the subsample of nonrespondents, then the other two models compare more favorably to the double sampling model. For E(P) = 0.2 the E(MSE) of the no adjustment model is approximately the same as that of the double sampling model for the unimodal distributions; for E(P) = 0.5 there is more of a difference between the two models; and for E(P) = 0.8 the double sampling model is, even for this cost ratio, substantially better in terms of E(MSE). In comparing the expected mean square error of the Politz-Simmons type estimator with that of the double sampling estimator, for all values of E(P) the Politz-Simmons estimators generally have E(MSE)'s comparable to those of the double sampling model. Some exceptions are α = 2 when E(P) = 0.2 - here the E(MSE) of the Politz-Simmons estimator is considerably higher - and α ≥ 8 when E(P) = 0.5 - here the Politz-Simmons estimator is considerably lower.
In summary, then, the E(MSE)'s of the three models are all affected by E(P); for n = 200 the E(MSE)'s for each model tend to get smaller as E(P) increases, due to an overall decrease in expected variance which offsets the increase in expected squared bias. For n = 2000, though the E(MSE) itself was considerably smaller, it did not always decrease with increasing E(P). This is because for the double sampling model, as well as in some cases the no adjustment model, the E(SQBIAS) increases more with increasing E(P) than the variance decreases. The survey designer should be cautious of the fact that not all components of error will decrease as n increases and should allocate his or her resources accordingly. There may be a point above which it will be better to allocate resources for the training of interviewers, for example, than to increase sample size.
Now the shape of the beta distribution, or in other words the size of α and β, affects the Politz-Simmons model and the no adjustment model when P is correlated with Y, but affects the double sampling model of this research only negligibly. The E(MSE) of the no adjustment and Politz-Simmons models decreases as α and β increase.

The double sampling model, possibly the most time-consuming of the three models, compares in most instances favorably with the other models for c_1/c_3 = .5, but for c_1/c_3 = .2 the comparison is not quite as good. The results indicate that it is important for the survey designer to be aware not only of the expected overall response rate, but also to have some feel for how the response probabilities are distributed in the population. For if response probabilities could be reasonably estimated, then for unimodal distributions with relatively large α's and β's, or in other words when the response probabilities are concentrated around E(P), a Politz-Simmons type estimator might be more appropriate than double sampling. Also, when the cost of the initial sample is considerably lower than the cost of taking a subsample, then for small E(P) and large α and β the no adjustment model may be preferred over double sampling. This, of course, would depend on the amount and kind of correlation between P and Y.

Further research is needed to look at the effect of the correlation of functions of P and Y on the E(MSE). For other than a positive linear relationship between P and Y, the results may have been different. Also, a look at the no adjustment model and a Politz-Simmons type model without the simplifying assumptions is in order.
TABLE 4.4.2.1
EFFECT OF θ1 ON THE EXPECTED SQUARED BIAS IN THE NO ADJUSTMENT MODEL
(n = 200)

                        θ1 = 0                θ1 = 10               θ1 = 20
E(P)   α     β     SQBIAS  % of E(MSE)   SQBIAS  % of E(MSE)   SQBIAS  % of E(MSE)
.2     1/8   1/2    100       5.4         286      13.9         569      23.9
       1/2   2      100       5.4         204      10.4         345      16.2
       2     8      100       5.4         162       8.5         239      11.9
       8     32     100       5.4         149       7.8         207      10.5
.5     1/8   1/8    100      12.5         361      33.8         784      52.1
       1/2   1/2    100      12.5         306      30.2         625      46.6
       2     2      100      12.5         256      26.6         484      40.5
       8     8      100      12.5         234      24.9         424      37.4
.8     1/2   1/8    100      18.6         370      45.7         810      64.7
       2     1/2    100      18.6         345      44.0         737      62.5
       8     2      100      18.6         331      42.9         695      61.1
       32    8      100      18.6         326      42.6         681      60.7
TABLE 4.4.2.2
E(MSE) COMPONENTS AS PERCENT OF TOTAL E(MSE)
FOR VARYING SHAPES OF THE BETA DISTRIBUTION
NO ADJUSTMENT MODEL
(n = 200)

                         Components as Percent of Total E(MSE)                  Total
E(P)   α     β     E(SRV)  E(CRV)  E(RV)  E(NRV)  E(SV)  E(VAR)  E(SQBIAS)     E(MSE)
.2     1/8   1/2     9.1     3.0    12.1    23.6   50.4    86.1     13.9        2060
       1/2   2       9.5     3.2    12.7    45.7   31.2    89.6     10.4        1965
       2     8       9.8     3.3    13.0    59.5   19.0    91.5      8.5        1916
       8     32      9.9     3.3    13.2    64.2   14.8    92.2      7.8        1900
.5     1/8   1/8     7.0     5.8    12.9     5.9   47.5    66.2     33.8        1069
       1/2   1/2     7.4     6.2    13.6    15.6   40.6    69.8     30.2        1012
       2     2       7.8     6.5    14.3    26.3   32.7    73.3     26.6         960
       8     8       8.0     6.7    14.7    31.7   28.7    75.1     24.9         938
.8     1/2   1/8     5.8     7.7    13.5     3.8   37.0    54.3     45.7         809
       2     1/2     6.0     8.0    13.9     7.2   34.8    56.0     44.0         784
       8     2       6.1     8.1    14.2     9.4   33.4    57.1     42.9         770
       32    8       6.1     8.2    14.3    10.2   32.9    57.4     42.6         765
TABLE 4.4.2.3
E(MSE) COMPONENTS AS PERCENT OF TOTAL E(MSE)
FOR VARYING SHAPES OF THE BETA DISTRIBUTION
NO ADJUSTMENT MODEL
(n = 2000)

                      Components as Percent of Total E(MSE)           Total
E(P)   α     β     SRV    CRV    RV    NRV    SV    VAR   SQBIAS     E(MSE)
.2     1/8   1/2    3.6   12.0   15.6   9.3   20.0   44.9   55.1       520
       1/2   2      4.3   14.3   18.6  20.6   14.1   53.2   46.8       436
       2     8      4.8   15.9   20.6  28.9    9.3   58.9   41.1       394
       8     32     4.9   16.4   21.4  32.1    7.4   60.9   39.1       380
.5     1/8   1/8    1.5   12.8   14.3   1.3   10.4   26.0   74.0       488
       1/2   1/2    1.7   14.4   16.2   3.6    9.5   29.3   70.7       433
       2     2      2.0   16.3   18.3   6.6    8.2   33.1   66.9       383
       8     8      2.1   17.3   19.4   8.3    7.5   35.1   64.9       361
.8     1/2   1/8    1.0   13.3   14.3   0.6    6.4   21.3   78.7       470
       2     1/2    1.1   14.0   15.1   1.3    6.1   22.5   77.5       445
       8     2      1.1   14.5   15.6   1.7    6.0   23.3   76.7       431
       32    8      1.1   14.7   15.8   1.8    5.9   23.5   76.5       426
TABLE 4.4.2.4
E(MSE) COMPONENTS AS PERCENT OF TOTAL E(MSE)
FOR VARYING SHAPES ·OF THE BETA DISTRIBUTION
POLITZ-SIMMONS MODEL
(n = 200)
E(P)
.2
.5
0.8
Components as Percent of Total E(MSE)
a
B
E(SRV)
E(CRV)
E(RV)
E( NRV)
E(SV)
E(VAR)
E(SQBIAS)
Total
E(MSE)
1.05
4.2
10.7
0.2
10.9
87.9
O.B
99.7
0.3
29693
2
8
10.4
1.9
12.3
76.9
7.7
96.9
3.1
3245
4
16
10.2
2.7
12.9
71.9
10.8
95.7
4.3
2317
8
32
10.2
3.0
13.2
69.7
12.2
95.1
4.9
2052
1.05
0.2
0.8
11.4
84.1
3.2
98.7
1.3
7769
2
2
1.9
5.4
15.1
54.6
21.6
91.4
8.6
1157
4
4
2.7
6.8
16.2
46.0
27.0
89.2
10.8
925
8
8
3.0
7.3 .
16.6
42.6
29.1
88.4
11.6
859
1.05
8.7
10.9
19.5
19.7
43.4
82.6
17.4
576
8
2
8.6
11. 1
19.7
17.9
44.6
82.2
17.8
561
16
4
8.6
11. 3
19.8
17.0
45.1
82.0
18.0
554
1.05
4.2
32
8
8.6
11. 3
19.9
16.6
45.4
81.9
18.1
551
~
w
w
TABLE 4.4.2.5
E(MSE) COMPONENTS AS PERCENT OF TOTAL E(MSE)
FOR VARYING SHAPES OF THE BETA DISTRIBUTION
POLITZ-SIMMONS MODEL
(n = 2000)
Components as Percent of Total E(MSE)
E(P)
a
B
.2
1.05
.5
.8
SQBIAS
Total
E(MSE)
3.2'
3116
SRV
CRV
. RV
NRV
SV
VAR
4.2
10.2
2.0
12.2
83.8
0.8
96.8
2
8
7.2
13.3
20.4
53.0
5.3
78.8
21.2
471
4
16
6.3
16.5
22.8
44.1
6.6
73.5
26.5
378
8
32
5.9
17.8
23.7
40.7
7.1
71.5-
28.5
351
1.05
8.9
6.8
15.7
70.8
2.7
89.2
10.8
923
1.05
2
2
4.3
23.9
28.2
24.1
9.5
61.8
38':2
262
4
4
3.7
26.2
29.8
17.8
10.5
58.1
41.9
239
8
8
3.5
26.9
30.4
15.8
10.8
56.9
43.1
. 232
1.05
2.4
30.7
33.1
5.6
12.3
50.9
49.1
.204
2
·2'.4
30.9
.33;3'
4.9
'12.4
50.6
"49:4
202 .
4
2.4
31.0
33.3
4.7
12.4
50.4
49.6
202
4.2
8·'
16
~
w
.j:>
32
•
8
2.3 .
31.0
33.4
4.5
•
12.4
50.3
49.7
,
201
•
•
•
•
TABLE 4.4.2.6
E(MSE) COMPONENTS AS PERCENT OF TOTAL E(MSE)
FOR VARYING SHAPES OF THE BETA DISTRIBUTI ON .
DOUBLE SAMPLING MODEL
(n = 200)
E(P)
.2
Components as Percent of Total E(MSE)
a
1
E(CRV)
E(RV)
E(NRV)
E(SSV)
E( PVAR)
E(SV)
E(VAR) E(SQBIAS) ,
"2
1
6.3
10.5
16.8
2.6
33.5
39.4
92.2
92 .2
7.8
596
"2
2
6.3
10.5
16.8
4.8
33.5
37.1
92.2
92.2
7.8
596
2
8
6.3
10.5
16.8
6.1
33.5
35.8
92.2
92.2
7.8
596
8
32
6.3
10.5
16.8
6.5
33.5
35.4
92.2
92.2
7.8
596
8"
8"
1
7.0
11.6
18.6
2.3
23.1
44.1
88.1
88.1
11. 9
539
"2
"2
1
7.0
11.6
18.6
5.8
23.1
40.6
88.1
88.1
11.9
539
2
2
7.0
11.6
18.6
9.3
23.2
37.1
88.1
88.1
11.9
539
8
8
7.0
11.6
18.6
10.9
23.2
35.5
88.1
88.1
11.9
539
7.8
12.9
20.7
3.2
10.0
48.6
58.6
82.5
17.5
.483
7.8
12.9
20.7
5.9
10.0
45.9
55.9
82.5
17 .5
483
7.8
12.9
20.7
7.5
10.1
44.2
54.3
·82.5
17.5
483
1
1
.8
E(SRV)
Total
E(MSE)
8"
1
.5
B
1
2"
2
8
32
1
8"
1
2"
2
8
7.8
12.9
20.7
8.1
10.1
43.7
53.7
82.5
17.5
483
~
w
<11
•
•
•
•
•
•
TABLE 4.4.2.8
EFFECT OF CORRELATION BETWEEN Y AND Y
ON PVAR AND NRV IN THE DOUBLE SAMPtING M06El
(n = 200)
E( P)
.2
(l
1
NRV
=
p =
1
PVAR
1
"2
5.2
36.8
2.6
39.4
0.0
41.9
"2
2
. 9.6
32.4
4.8
37.1
0.0
41. 9
2
8
12.2
29.7
6.1
35.8
0.0
41.9
8
32
13.1
28.8
6.5
35.4
0.0
41.9
8"
1
4.6
41.8
2.3
44.1
0.0
46.4
1
8"
1
.8
P
8"
1
.5
B
Components as Percent of Total E(MSE)
0
P = .5
pVAR
PVAR
NRV
NRV
"2
"2
1
11.6
34.8
5.8
40.6
0.0
46.4
2
2
18.6
27.9
9.3
37.1
0.0
46.4
8
8
21.8
24.6
10.9
35.5
0.0
46.4
1
"2
8"
1
6.4
45.4
3.2
48.6
0.0
51.8
2
"2
1
11.8
39.9
5.9
45.9
0.0
51.8
8
2
15.0
36.7
7.5
44.2
0.0
51.8
32
8
16.1
·35.6
8.1
43.7
0.0
51. 7
~
w
"
TABLE 4.4.2.9
OPTIMUM VALUES OF nand k FOR SELECTED
COST FUNCTIONS AND SELECTED VALUES OF E(P)
C-c = 5000
c 3 = 50
O
E(P) 1
~----
0.1
CR
1
0.2
0.5
0.8
1
ct
CR
0.3
--
----
0.5
0.7
0.9
n
k
n
k
n
k
n
k
n
k
.04
137
1. 34
214
1.85
274
2.04
343
2.16
443
2.40
.10
99
1.00
169
1.52
236
1.83
315
2.04
430
.16
93
1.00
142
1. 32
208
1.67
292
1. 94
418
2.32
0.10
96
1.00
122
1. 17
145
1.29
166
1. 37
190
1. 52
.25
85
1.00
97
1.00
123
1.16
152
1.29
184
1.49
.40
76
1.00
88
1.00
108
1.06
140
1.23
179
1.47
0.16
88
1.00
95
1.00
103
1.02
112
1.08
122
1.20
.40
74
1.00 .
81
1.00
90
1.00
102
1.02
118
1.18
.64
64
1.00
72
1.00
81
1.00
95
1.00
114
1. 16
2
~
2.36
and fl need .not be specified since all combinations leading to a value of E(P) yield similar results.
~
w
00
•
•
•
•
•
•
TABLE 4.4.2.10
COMPARISON OF E(MSE) OF THREE MODELS FOR FIXED COST
C - Co = 5000. c 3 = 50. cl /c3 = .5. C/C3 = .1
E{P)
0.2
0.5
CL
B
1
n
DOUBLE SAMPLING
k
E(MSE)
,
NO ADJUSTMENT
n
E(MSE)
POLI TZ -S IMMONS
n
E(MSE)
1
"8
1
"2
106
1.06
672
555
966
555
*
"2
2
106
1.05
672
555
879
555
*
2
8
106
1.05
672
555
834
555
1273
8
32
106
1.05
672
555
820
555
843
32
128
106
1.05
672
555
816
555
788
1
"8
1
"2"
1
"8
1
145
1. 30
574
333
811
333
*
"2
145
1. 29
574
333
756
333
*
2
2
145
1.29
574
333
704
333
760
8
8
145
1.29
574
333
682
333
581
32
32
145
i.29
574
333
675
333
558
*Contains negative E(MSE) components.
~
w
'"
TABLE 4.4.2.10
(Conti nued)
E(P)
O.B
Cl
1
f3
n
NO ADJUSTMENT
n
E(MSE)
DOUBLE SAMPLING
k
E(MSE)
POLITZ-SIMMONS
n
E(MSE)
"2
8"
1
177
1. 41
495
238
749
238
*
2
"2
1
177
1.40
494
238
724
238
560
8
2
177
1.40
494
238
710
238
497
32
8
177
1.40
494
238
705
238
489
128
32
177
1.40
494
238
704
23B
4B7
*Contains negative E(MSE) components.
..,.
~
a
•
•
•
••
•
•
TABLE 4.2.2.11
COMPARISON OF E(MSE) OF THREE MOOELS FOR FIXED COST
C - Co ; 5000,
C 3 ; 50,
c1/C3; .2, C 1Ic 3 ; .1
E(P)
0.2
0.
DOUBLE SAMPLING
fl
NO ADJUSTMENT
E( MSE)
POL IlZ-S I MMONS
n
k
2"
135
1.29
621
833
760
833
*
2"
2
135
1.29
621
833
674
833
*
2
8
135
1.29
621
833
631
833
903
8
32
135
1.29
621
833
617
833
616
32
128
135
1.29
621
833
.613
833
579
1
8
1
1
E(MSE)
n
n
E(MSE)
TABLE 4.4.2.11
\
(Continued)
E(P)
0..8
Cl
1
a
n
DOUBLE SAMPLING
E(MSE)
k
NO. ADJUSTMENT
n
E(MSE)
Po.L ITZ-SIMMo.NS
n
E(MSE)
2"
2"
1
366
2.15
335
555
568
555
*
2
1
2"
366
2.14
335
555
543
555
333
8
2
366
2.14
335
555
529
555
30.6
32
8
366
2.14
335
555
524
555
30.3
128
32
366
2.14
335
555
523
555
30.2
*Contains negative E(MSE) components.
CHAPTER V

AN EMPIRICAL INVESTIGATION OF THE ESTIMATION OF RESPONSE PROBABILITIES

5.1 Introduction
In this chapter response probabilities are estimated from actual survey data, using auxiliary information. The mean is then estimated using a Politz-Simmons type estimator and using the average of the responses only. The Politz-Simmons type estimator was chosen because it makes direct use of the response probabilities to correct for nonresponse bias. The associated biases of both estimators are estimated and compared. The results suggest, as might be expected, that if the variables used in the estimation of response probabilities are only minimally related to response, then it may be better to use the latter estimate of the mean.
5.2 Estimation of Response Probabilities

Response probabilities were estimated from the Virginia Health Survey (VHS) using linear and logistic regression. The Virginia Health Survey is a complex multi-stage survey conducted by the Research Triangle Institute in 1976-1977 which involved 6268 households. Of these 6268 households, 5605 were eligible to respond, and of those 5605 eligible households, 5069 actually responded; thus the response rate was 90.4 percent.
The method used to estimate the response probabilities requires information on both the respondents and nonrespondents. Some surveys have background information on every eligible unit; for example, suppose the survey is drawn from records, say drivers' licenses or Medicare, in which a limited amount of information is available, say age, race, sex, and often other variables which might be related to response. Some large ongoing national surveys like the Current Population Survey may make some effort to get the age, race, and sex of the household head from nonrespondent households. Further, information on the number of callbacks from such surveys is often kept and has been used in the estimation of response probabilities, for example by Thomsen and Siring in 1979. However, for many surveys, information by unit for every sampling unit, respondent and nonrespondent alike, is unavailable. For such surveys it may be necessary to associate each unit with characteristics from an aggregate of units. In this case, each sampling unit is linked to a minor civil division (MCD); there are approximately 500 such MCD's in Virginia. This is the smallest geographic area to which each sampling unit in the VHS could be linked. The variables for which an association with response was sought were socioeconomic characteristics of the MCD obtained from the 1970 Census fourth count tapes, including education, income, race, degree of urbanization, and occupation. Specifically, the data were manipulated such that the following ten variables were candidates for independent variables in the regression equation:
a) The mean education of persons 25 years old or over (MEANED);

b) The percent of persons 25 years old or older who had eight years or less of education (PED1), and the percent of such persons with four or more years of education (PED5);

c) The mean total family income (MNTFINC);

d) The per capita income, i.e., the ratio of the aggregate income of individuals 14 years old and over to the total persons 14 years of age and over (PCINC);

e) The percent of families with incomes below the poverty level (PPOV1);

f) The percent of the population which was black (PBLK);

g) The percent of the population living in rural areas (PRURAL);

h) The percent of persons with white collar jobs (POCC1);

i) The average total persons per household (AVGTPPH).

The following table gives the minimum, maximum, and unweighted mean of the ten variables used in the estimation procedure over the 121 MCD's included in the sample.
TABLE 5.2.1
CHARACTERISTICS OF INDEPENDENT VARIABLES
USED IN THE ESTIMATION OF RESPONSE PROBABILITIES

Variable Name    Minimum    Maximum      Mean
MEANED              6.5        13.6      10.5
PED1                5.1        70.1      30.8
PED5                1.0        37.7      13.2
MNTFINC          5256        20390     10865
PCINC            1437         5452      3087
PPOV1               2.3        35.8      11.6
PBLK                0.0        58.1      16.4
PRURAL              0.0       100.0      38.7
POCC1              14.9        43.9      30.7
AVGTPPH             2.4         4.2       3.2
Though it seems reasonable that some combination of these variables would be related to response, response probabilities were also estimated at two other levels of aggregation from the survey design - the segment and the primary sampling unit (PSU). There were 210 PSU's in the survey, of which 166 had at least one nonrespondent.
Both linear and logistic regression were used in modelling. Linear regression had the advantage of simplicity and lower computer costs; however, the possibility of obtaining estimates of response probabilities greater than 1.0 existed, and was realized. The logistic regression was more expensive; however, all estimates of response probabilities would be between 0 and 1. The scatter plots discussed earlier suggested that a model based on either approach would be less than ideal.
Now for the linear regression the estimate of P_i is

\[ \hat{P}_i = \hat{a} + \sum_{j=1}^{J} \hat{\beta}_j\, Y_{ij} , \]

where

P̂_i = the estimate of the response probability for all sampling units in the i-th MCD, PSU, or segment, depending on the model,
â = the intercept term,
β̂_j = the estimated coefficient corresponding to the j-th variable in the model, including any interaction terms,
Y_ij = the observed value for the i-th MCD, PSU, or segment corresponding to the j-th variable.

For logistic regression the estimate of P_i is

\[ \hat{P}_i = \frac{ e^{\hat{a} + \sum_{j=1}^{J}\hat{\beta}_j Y_{ij}} }{ 1 + e^{\hat{a} + \sum_{j=1}^{J}\hat{\beta}_j Y_{ij}} } , \]

where the variables are as defined above. It can easily be seen why P̂_i is always between 0 and 1.
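A present-day Python sketch of the two estimation approaches is given below. It is illustrative only: the original analysis used stepwise selection on the 1970 Census MCD, PSU, and segment covariates described above, whereas this sketch uses scikit-learn on a hypothetical data frame with a response-rate column and a few of the named covariates; the column names, the simulated data, and the library choice are assumptions, not the author's implementation.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression, LogisticRegression

    # hypothetical aggregate-level data: one row per MCD/PSU/segment
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "MNTFINC": rng.normal(10865, 2500, 300),
        "PPOV1":   rng.uniform(2, 36, 300),
        "AVGTPPH": rng.uniform(2.4, 4.2, 300),
    })
    df["resp_rate"] = np.clip(0.95 - 0.004 * df["PPOV1"] + rng.normal(0, 0.02, 300), 0.5, 1.0)
    df["n_eligible"] = rng.integers(10, 60, 300)

    X = df[["MNTFINC", "PPOV1", "AVGTPPH"]]

    # linear regression: estimates can fall outside [0, 1] and must be truncated
    lin = LinearRegression().fit(X, df["resp_rate"])
    p_lin = np.clip(lin.predict(X), 0.0, 1.0)

    # logistic regression on respond/not-respond counts keeps estimates in (0, 1)
    n_resp = np.rint(df["resp_rate"] * df["n_eligible"]).astype(int)
    unit_y = np.repeat([1, 0], [n_resp.sum(), (df["n_eligible"] - n_resp).sum()])
    unit_X = pd.concat([X.loc[X.index.repeat(n_resp)],
                        X.loc[X.index.repeat(df["n_eligible"] - n_resp)]])
    log = LogisticRegression(max_iter=1000).fit(unit_X, unit_y)
    p_log = log.predict_proba(X)[:, 1]

    print("linear range:  ", p_lin.min(), p_lin.max())
    print("logistic range:", p_log.min(), p_log.max())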
All of the variables were not used in each model; rather, a stepwise procedure, coupled with some subjectivity about what interactions to consider and what variables to allow in the model at the same time, was used. Though the purpose of the study was not to compare models but to find a good model, some of the "better attempts" are shown here.
MODEL 1: 619 Segments - Linear Regression (LINSEG)

Parameter           Estimate    Estimate/Standard Error
Intercept            2.057         5.92
PREF                -0.139       -12.50
MNTFINC             -0.000+       -2.76
POCC1               -2.688        -2.71
AVGTPPH             -0.306        -2.71
AVGTPPH*MNTFINC     -0.000+       -2.39
AVGTPPH*POCC1        0.789         2.47

MODEL 2: 619 Segments - Logistic Regression (LOGSEG)

Parameter           Estimate    Estimate/Standard Error
Intercept            2.095         2.263
PREF                 1.879         2.004
MNTFINC             -0.000+       -3.960
PPOV1               -3.897        -2.932
PRURAL               0.927         4.070
POCC1               -6.127        -4.032
AVGTPPH              1.248         5.815
PREF*PRURAL         -0.749        -2.570
PREF*AVGTPPH        -0.933        -2.997

MODEL 3: 121 MCD's - Linear Regression (LINMCD)

Parameter           Estimate    Estimate/Standard Error
Intercept            2.045         5.53
PCINC               -0.000+       -3.70
PPOV1               -2.935        -2.45
PBLK                 0.069         1.86
PRURAL               0.040         2.25
AVGTPPH             -0.324        -2.83
PCINC*AVGTPPH        0.000+        3.35
PPOV1*AVGTPPH        0.832         2.31

MODEL 4: 121 MCD's - Logistic Regression (LOGMCD)

Parameter           Estimate    Estimate/Standard Error
Intercept           -1.774        -0.962
MNTFINC              0.000+        1.982
PPOV1               -2.785        -2.167
PRURAL               0.775         4.816
POCC1                9.718         1.603
AVGTPPH              0.674         4.577
MNTFINC*POCC1       -0.001        -2.569

MODEL 5: 210 PSU's - Logistic Regression (LOGPSU)

Parameter           Estimate    Estimate/Standard Error
Intercept            3.217         3.847
PREF                -0.116        -0.757
PCINC               -0.000+       -3.505
PPOV1               -3.065        -2.463
PRURAL               1.229         5.279
POCC1               -5.182        -3.460
AVGTPPH              0.545         3.627
PREF*PRURAL         -0.990        -3.305
Table 5.2.2 gives some descriptive information on the estimated response probabilities. For the linear regression model LINSEG, 8 of the 619 estimated response probabilities were above 1.0000; these probabilities were set to 1.0000. The rather limited range of the estimated probabilities is a definite drawback to estimation at these levels.

Table 5.2.3 gives some descriptive information on the estimated probabilities of response after assignment to the 5605 reporting units in the survey and to the remaining 4570 units after exclusion of the nonrespondents and those units which did not respond to the income question.
TABLE 5.2.3
THE ESTIMATED RESPONSE PROBABILITIES IN THE SAMPLE

                     Median                    Mean
Method        N=5605      N=4570        N=5605      N=4570
LINSEG         .9005       .9058         .9354       .9418
LOGSEG         .9044       .9091         .9233       .9249
LINMCD         .9088       .9112         .9145       .9147
LOGMCD         .9044       .9067         .9135       .9138
LOGPSU         .9044       .9066         .9049       .9084
5.3 The Estimation of Bias

The estimated response probabilities were linked to each sampling unit in the survey to be used to obtain a Politz-Simmons type estimate of the mean and its relative bias. The VHS questions were mainly dichotomous in nature; for example, "Has anyone in the family ever had a heart attack, a stroke, or been treated for cancer?" One of the few questions which had enough categories that it could be considered a continuous variable was the question on income, which had 14 categories ranging from income under $1,000 to income of $50,000 or over. It was assumed that income was distributed uniformly within each category, and each sampling unit was assigned the midpoint of the category in which it fell. The 14 categories and the assigned values are given below.
Category                     Assigned Value
a) under $1,000                 $   500
b) $1,000-1,999                   1,500
c) $2,000-2,999                   2,500
d) $3,000-3,999                   3,500
e) $4,000-4,999                   4,500
f) $5,000-5,999                   5,500
g) $6,000-6,999                   6,500
h) $7,000-9,999                   8,500
i) $10,000-14,999                12,500
j) $15,000-19,999                17,500
k) $20,000-24,999                22,500
l) $25,000-34,999                30,000
m) $35,000-49,999                42,500
n) $50,000 and over              60,000
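The midpoint assignment is trivially expressed in code (a small illustrative Python mapping; the category letters follow the list above):

    # assigned midpoint (in dollars) for each VHS income category
    INCOME_MIDPOINT = {
        "a": 500,    "b": 1_500,  "c": 2_500,  "d": 3_500,  "e": 4_500,
        "f": 5_500,  "g": 6_500,  "h": 8_500,  "i": 12_500, "j": 17_500,
        "k": 22_500, "l": 30_000, "m": 42_500, "n": 60_000,
    }

    def assigned_income(category_code):
        """Return the midpoint value assigned to an income category code ('a'..'n')."""
        return INCOME_MIDPOINT[category_code]

    print(assigned_income("h"))   # 8500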
The mean income and the proportion of families with an incidence of heart attack, stroke, or cancer were estimated using a Politz-Simmons type estimator. The responding sample units were considered the population; since item nonresponse was not the major concern, those units that responded to the survey but not to the particular question of concern were not considered part of the population. Thus, in estimating income, 499 households for which no response to the income question was obtained were excluded, leaving 4570 sampling units. In the case of the question on heart attacks, stroke, and cancer, only four households were excluded, leaving 5065 households as the population. Five hundred samples of size 200 were selected randomly with replacement from the population. The design of the original VHS was ignored; thus every unit selected in the sample of 200 had equal probability of being selected. The 200 units within each sample were then designated as respondent or nonrespondent randomly, according to their assigned response probabilities, using uniform random numbers.
For each selected sample the mean (or proportion) was estimated according to the following formula:

\[ \hat{\bar{Y}}_s = \frac{1}{n}\sum_{i=1}^{n} \frac{c_i\, Y_{i1}}{\hat{P}_i} , \]

where

Ŷ̄_s = the estimated mean income or proportion of persons having HSC in the family,
n = the sample size (200),
c_i = 1 if the unit responded in the sample, and 0 otherwise,
Y_i1 = the response for the i-th sampling unit,
P̂_i = the estimated response probability of the i-th sampling unit.
It was assumed that the population mean or proportion Ȳ was equal to that of the 4570 sampling units in the case of income and the 5065 sampling units in the case of HSC, which were regarded as the population. Now the estimate of the population mean was

\[ \hat{\bar{Y}} = \frac{1}{S}\sum_{s=1}^{S} \hat{\bar{Y}}_s . \]

Interest was basically in estimates of relative bias, where

\[ \text{Relative Bias} = \frac{\mathrm{Bias}(\hat{\bar{Y}}_s)}{\bar{Y}} = \frac{E(\hat{\bar{Y}}_s) - \bar{Y}}{\bar{Y}} . \]

The estimated relative bias (RBias) is given below:

\[ \widehat{\mathrm{RBias}} = \frac{\sum_{s=1}^{S}(\hat{\bar{Y}}_s - \bar{Y})}{S\,\bar{Y}} = \frac{\hat{\bar{Y}} - \bar{Y}}{\bar{Y}} . \]

The standard error of the relative bias was also estimated. The variance of the relative bias is

\[ \sigma^2_{\mathrm{RBias}} = \frac{\mathrm{Var}(\hat{\bar{Y}})}{\bar{Y}^2} , \]

which may be estimated by

\[ \hat{\sigma}^2_{\mathrm{RBias}} = \frac{1}{\bar{Y}^2}\,\frac{\sum_{s=1}^{S}(\hat{\bar{Y}}_s - \hat{\bar{Y}})^2}{S(S-1)} ; \]

the estimated standard error of the relative bias is σ̂_RBias. The estimated variance of Ŷ̄ is

\[ \widehat{\mathrm{Var}}(\hat{\bar{Y}}) = \frac{\sum_{s=1}^{S}(\hat{\bar{Y}}_s - \hat{\bar{Y}})^2}{S(S-1)} . \]
In addition to finding the relative bias and its standard error for estimates of the mean based on the response probabilities, the relative bias of an estimate of the mean with no adjustment for nonresponse was found; that is, the mean of each sample was taken as the mean of the respondents:

\[ \hat{\bar{Y}}_s = \frac{1}{n_1}\sum_{i=1}^{n} c_i\, Y_{i1} , \]

where n, c_i, and Y_i1 are as defined above and n_1 = Σ c_i = the number of respondents. Each unit was assumed to respond with probability .9044, which was the overall response rate of the sample.
The steps discussed above and leading to the results of Tables 5.3.1 and 5.3.2 are summarized below (a code sketch of these steps follows the list):

1) The estimated response probabilities were merged, by segment, PSU, or MCD, depending on the level of estimation, to the file containing all the sample cases.

2) Those units for which there was a response were kept as the population. Thus, when the item in question was income, the 536 nonrespondents and the additional 499 item nonrespondents were deleted.

3) From the "population" a sample of size 200 was randomly selected with replacement.

4) Each unit was designated respondent or nonrespondent according to its response probability, using uniform random numbers.

5) The mean (or proportion) and the relative bias were estimated for the sample.

6) Steps 3-5 were repeated 500 times.

7) The overall estimate of relative bias based on the 500 estimates, its standard error, and the variance of Ŷ̄ were estimated.
As can be seen from the two tables, there is little difference in the estimates of bias using any of the estimated response probabilities. The estimated standard error of the relative bias is of the same magnitude as the estimated relative bias for some of the estimates, so small apparent differences cannot be detected. It may be noted that the bias is negative for income and positive for HSC.

Comparisons should be made between the bias resulting from the "no adjustment" type estimator and the Politz-Simmons estimators. Since the purpose of the estimation of response probabilities and their use in the Politz-Simmons type estimators is to reduce bias, it would be desirable for these response probabilities to produce smaller biases than the "no adjustment" estimator. The point estimates of relative bias of the estimate of the mean based on the MCD data were lower than that of the "no adjustment" estimator, but this was likely a chance occurrence.

The linkage of units to larger geographical areas and the use of socioeconomic characteristics in estimating response probabilities may help to reduce the bias of nonresponse; however, the linkage should be to an area small enough to show a relationship between percent response and the characteristics of the area. Possibly even one or two variables with at least a limited relationship to nonresponse on a per unit or small cluster basis, if available, in conjunction with socioeconomic characteristics for larger areas, might be productive; otherwise, for surveys like the VHS, it might be better not to make any adjustment or to stick to the more conventional methods. In such cases the estimated response probabilities could simply be used to provide insight into survey design issues.
TABLE 5.3.1
THE RELATIVE BIAS OF ESTIMATES OF MEAN INCOME
(Based on 500 Samples of Size 200)

Method of                                                Relative Bias
Estimation      Y-bar   SE(Y-bar)  CV(Y-bar)   Minimum  Maximum  Median   Mean   Std. Error
LINSEG          16,212      45       .0028      -.2052   .2028   -.0019  -.0039    .0028
LOGSEG          16,219      45       .0028      -.2005   .2058   -.0014  -.0035    .0027
LINMCD          16,235      45       .0028      -.2049   .1865    .0011  -.0025    .0028
LOGMCD          16,236      45       .0028      -.1829   .1818    .0002  -.0025    .0028
LOGPSU          16,230      45       .0028      -.1965   .1809   -.0008  -.0029    .0028
No Adjustment   16,231      41       .0025      -.1815   .1688   -.0034  -.0028    .0025
TABLE 5.3.2
THE RELATIVE BIAS OF ESTIMATES OF PROPORTION OF HOUSEHOLDS HAVING
INCIDENCE OF HEART ATTACK, STROKE, OR CANCER
(Based on 500 Samples of Size 200)

Method of                                                Relative Bias
Estimation      Y-bar   SE(Y-bar)  CV(Y-bar)   Minimum  Maximum  Median   Mean   Std. Error
LINSEG          .1201     .0011      .0093      -.4822   .7050    .0266   .0189    .0094
LOGSEG          .1199     .0011      .0092      -.4825   .7040    .0219   .0170    .0094
LINMCD          .1200     .0011      .0099      -.4940   .7027    .0157   .0178    .0094
LOGMCD          .1199     .0011      .0094      -.5305   .7193    .0173   .0169    .0096
LOGPSU          .1198     .0011      .0093      -.4851   .7194    .0213   .0168    .0094
No Adjustment   .1199     .0011      .0094      -.4955   .6771    .0162   .0173    .0095
CHAPTER VI

SUMMARY AND SUGGESTIONS FOR FURTHER RESEARCH

6.1 Summary

In this research the Hansen and Hurwitz (1946) formulation of a double sampling scheme to correct for nonresponse is modified such that nonresponse is treated stochastically and a more complete error model is considered; that is, measurement error is considered along with sampling error and nonresponse. In this double sampling model it is assumed, admittedly somewhat unrealistically for a sample of any scope, that a response is obtained from everyone in the subsample of nonrespondents.
In Chapter II the mean square error of the sample mean based on double sampling to correct for nonresponse is developed for a design which uses interviewers at both phases of the sample. Here it is assumed that the sampling units are assigned randomly to J interviewers.

In Chapter III estimators are found for the mean square error components of Chapter II by assuming that the sample is repeated independently of the first sample. This is a necessary but invalid assumption, resulting in biased estimates of the simple response variance. It was also found that the component corresponding to each of the six randomization steps could not be estimated separately. Specifically, the sampling and nonresponse variance can be estimated together, and the nonresponse variance and the variance due to the random assignment of units, initial or subsample, can be estimated together. Thus, in the probabilistic model, estimates of sampling variance also contain the nonresponse variance or some fraction of it and are therefore overestimates of the pure sampling variance. The nonresponse variance is due to the differences in response bias of a response from the initial sample and from the subsample of nonrespondents, plus an additional term which vanishes when J = 1.
Chapter IV looks at the E(MSE) of Chapter II under the assumption that the response probabilities are drawn from a beta distribution and the population values are drawn from a normal distribution. The E(MSE) of the estimator of the sample mean based on double sampling is compared to that of a simplified Politz-Simmons and no adjustment model under certain assumed parameter values and for fixed cost. It is found that, though the double sampling model in many instances has a lower E(MSE) than the other models, for certain values of α and β and E(P) the other models are preferable. When the cost of obtaining response and processing data from the initial sample is half that of obtaining response and processing data from the subsample, the Politz-Simmons model is comparable to double sampling for E(P) ≥ .5 and α ≥ 8. When the cost of obtaining and processing the initial sample units is only 20 percent of that of the subsample of nonrespondents, the E(MSE) of the no adjustment and Politz-Simmons models compares more favorably than when the cost of the initial sample is higher.
Thus
survey costs, the overall expected response probability or response
,
160
rate, and the distribution of response probabilities in the population all play a key role in the decision as to the best method of
•
handling nonresponse.
Also a study of the E(MSE) ,revealed that for the double sampling
model, the expected population variance and the nonresponse variance
•
vary together as a and S increase, that is, the population variance
decreases by the same amount that the nonresponse variance increases.
We recall that in estimation, the two terms proved inseparable.using
the usual methods of estimation.
In Chapter V response probabilities are estimated from the
Virginia Health Survey using 1970 Census data.
The minor civil
division level of estimation proves too large to provide estimates
of response probabilities that could be used in adjusting for nonresponse.
e·
However, they could possibly be used in providing insight
into the distribution of response probabilities that might prove
useful in survey planning.
Of course, since the outcome of a survey
is used in developing these estimates, the estimated response
probabilities would be helpful in survey planning only in continuing
surveys or surveys with target populations and survey designs similar
to that from which the probabilities were estimated.
6.2 Suggestions for Further Research
The above work readily lends itself to further research; for
example, many of the simplifying assumptions could be relaxed.
It is assumed in the development of the model that the response probabilities are a function of each unit only, that is, that $E(\delta_i) = P_i$ and that $E(\delta_i \delta_{i'}) = P_i P_{i'}$. However, it is logical to assume that in a survey which uses interviewers, the event of responding or not responding will be linked to the interviewers and correlated within each interviewer's assignment area, so that $E(\delta_{ij}) = P_{ij}$ and
$$E(\delta_{ij}\,\delta_{i'j}) = P_{ij} P_{i'j} + \operatorname{Cov}(P_{ij}, P_{i'j}).$$
It is suggested that the model be developed under this assumption.
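One simple way to produce response probabilities with this kind of within-interviewer correlation (offered purely as an illustration; it is not the model proposed above) is to give each interviewer a random effect shared by every unit in that interviewer's assignment, for example
$$P_{ij} = \Phi(\mu + a_j + e_i), \qquad a_j \sim N(0, \sigma_a^2), \quad e_i \sim N(0, \sigma_e^2),$$
where $\Phi$ is the standard normal distribution function; two units $i \ne i'$ interviewed by the same interviewer $j$ then share $a_j$, so that $\operatorname{Cov}(P_{ij}, P_{i'j}) > 0$.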
The assumption that all of the original nonrespondents respond
in the subsample should be relaxed.
Imputation could be used to
correct for the residual nonresponse.
It might be interesting and useful to see for what cost functions, in conjunction with the nonresponse bias and the percent of residual nonresponse, the double sampling model would be preferred over imputation for all of the nonresponse.
In Chapter IV, a method of estimating the response probabilities used in the Politz-Simmons model could be incorporated into that model, and in the no-adjustment model $n_1$, the number of respondents, should be treated as the random variable it is.
A study of the effect of differing relationships between P and
Y on the E(MSE's) of Chapter IV would also be helpful.
This study
could be accomplished in conjunction with an empirical study of nonresponse patterns from panel surveys.
That is, response probabilities
for sampling units could be estimated from the event of responding or
not responding over the panels in the study.
Then the observed value
could be averaged over panels for all sampling units with estimated response probability greater than zero, that is, for which a response is obtained in at least one panel.
Then the correlation between P and Y could be estimated using $\hat{P}_i$ and $\bar{Y}_i$ for all $i$ with $\hat{P}_i > 0$.
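The sketch below illustrates this suggestion on synthetic data; the array layout, the function name, and the parameter values are all hypothetical and are not drawn from any survey discussed in this work.

```python
import numpy as np

# Sketch of the panel-based suggestion above.  responses[i, t] is 1 if unit i
# responded in panel t and y[i, t] is the value that would be observed then;
# both arrays, and the parameter values below, are hypothetical.
def corr_p_y(responses: np.ndarray, y: np.ndarray) -> float:
    p_hat = responses.mean(axis=1)                 # estimated response probability
    keep = p_hat > 0                               # units responding in >= 1 panel
    observed = np.where(responses[keep] == 1, y[keep], np.nan)
    y_bar = np.nanmean(observed, axis=1)           # observed value averaged over panels
    return float(np.corrcoef(p_hat[keep], y_bar)[0, 1])

rng = np.random.default_rng(1)
p_true = rng.beta(3.0, 2.0, size=500)
responses = (rng.random((500, 6)) < p_true[:, None]).astype(int)
y = 10.0 + 4.0 * p_true[:, None] + rng.normal(0.0, 1.0, size=(500, 6))
print(corr_p_y(responses, y))                      # positive when P and Y are positively related
```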
Last, the response probabilities could be estimated from a
survey with background information on all units, respondent and
nonrespondent alike.
BIBLIOGRAPHY
Bailey, L., "Toward a More Complete Analysis of the Total Mean Square
Error of Census and Sample Survey Statistics," Proceedings of
the Social Statistics Section of the American Statistical
Association, 1975, 1-10.
Bailey, L., Moore, T.F., and Bailar, B.A., "An Interviewer Variance
Study for the Eight Impact Cities of the National Crime Survey
Cities Sample," Journal of the American Statistical Association,
73 (March 1978), 16-23.
Brooks, C.A. and Bailar, B.A., Statistical Policy Working Paper 3 - An Error Profile: Employment as Measured by the Current Population Survey, Office of Federal Statistical Policy and Standards,
U.S. Department of Commerce, September 1978.
Cassel, C.M., Sarndal, C.E., and Wretman, J.H., Foundations of Inference in Survey Sampling, New York: John Wiley and Sons, 1977.
Chapman, D.W., "A Survey of Nonresponse Imputation Procedures,"
Proceedings of the Social Statistics Section of the American
Statistical Association, 1976, 245-251.
Fellegi, I.P., "Response Variance and Its Estimation," Journal of the
American Statistical Association, 59 (December 1964), 1016-1041.
Hansen, M.H. and Hurwitz, W.N., "The Problem of Nonresponse in Sample
Surveys," Journal of the American Statistical Association, 41
(1946), 517-529.
Hansen, M.H., Hurwitz, W.N., and Bershad, M.A., "Measurement Errors
in Censuses and Surveys," Bulletin of the International
Statistical Institute, 38 (1961), 359-374.
Hansen, M.H., Hurwitz, W.N., and Madow, W.G., Sample Survey Methods
and Theory, Vol. II, New York: John Wiley and Sons, 1953.
Hansen, M.H., Hurwitz, W.N., and Pritzker, L., "The Estimation and
Interpretation of Gross Differences and Simple Response Variance,"
Contributions to Statistics, Oxford, England: Pergamon Press,
1964.
Johnson, N.L. and Kotz, S., Continuous Univariate Distributions - 2,
Boston: Houghton Mifflin Company, 1970.
Kalsbeek, W.D., "A Conceptual Review of Survey Error Due to Nonresponse," Unpublished manuscript, 1980.
Kish, L., Survey Sampling, New York:
John Wiley and Sons, 1965.
Koch, G.G., "Some Survey Designs for Estimating Response Error ~lodel
Components," Unpublished Technical Report #5, Research Triangle
Institute, Project 216-730, January 1973.
Koch, G.G., "An Alternative Approach to Multivariate ResponseError
Models for Sample Survey Data with Applications to Estimators
Involving Subclass Means," Journal of the American Statistical
Association, 68 (December 1973), 906-913.
Lessler, J.T., "A Double Sampling Scheme Model for Eliminating Measurement Process Bias and Estimating Measurement Errors in Surveys,"
Ph.D. Dissertation, Institute of Statistics Mimeo Series No. 949,
University of North Carolina, Chapel Hill, NC, 1974.
Lessler, J.T., "An Expanded Survey Error Model," Symposium on Incomplete Data: Preliminary Proceedings, U.S. Department of Health,
Education, and Welfare, December 1979, 371-387.
Lessler, J.T., "Measurement Error in Surveys with Special Reference
to Subjective Phenomena," Unpublished manuscript, 1981.
Madow, W.G., "On Some Aspects of Response Error Measurement," Proceedings of the Social Statistics Section of the American Statistical
Association, 1965, 182-192.
Neyman, J., "Contributions to the Theory of Sampling Human Populations,"
Journal of the American Statistical Association, 35 (1938),
101-116.
Platek, R. and Gray, G.B., "Imputation Methodology: Total Survey Error," Unpublished manuscript, 1980.
Platek, R., Singh, M.P., and Tremblay, V., "Adjustment for Nonresponse
in Surveys," Survey Methodology, Vol. 3, No. 1, 1-24.
Politz, A.N. and Simmons, W.R., "An Attempt to Get the 'Not-at-Homes'
into the Sample Without Callbacks," Journal of the American
Statistical Association, 44 (1949), 9-31.
Thomsen, I. and Siring, E., "On the Causes and Effects of Nonresponse: Norwegian Experiences," Symposium on Incomplete Data: Preliminary Proceedings, U.S. Department of Health, Education, and Welfare, December 1979, 21-64.
APPENDIX A
Details of Derivation of Optimum n and k
The steps used in obtaining the optimum n and k in Chapter II, Section 2.4 are outlined below. The function to be minimized is written as
$$F = F_0 + \lambda F_1 ,$$
where $F_0$ is the quantity being minimized, expressed as a function of $n$ and $k$, $F_1$ is the cost constraint (of the form $C - C_0 - \cdots$), and $\lambda$ is a Lagrange multiplier. Differentiating $F$ with respect to $n$ and $k$ and then setting both of the resulting derivatives equal to zero yields equations (A.1) and (A.2). Solving equation (A.2) for $\lambda$ and substituting the result into equation (A.1) leaves a single equation in $k$; the quadratic solution of this equation yields $k$, which may also be written in a simplified closed form. To obtain $n$, this value of $k$ is substituted into $F_1$, and the resulting equation is solved for $n$.
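For orientation only, a stripped-down analogue shows why the optimum $k$ emerges from a quadratic-type equation; the variance and cost functions below are illustrative stand-ins, not those of Chapter II. Minimizing $V(n,k) = (A_1 + k A_2)/n$ subject to the cost constraint $c_0 n + c_1 n/k = C$ through the Lagrangian $F = V + \lambda\,(c_0 n + c_1 n/k - C)$ gives the stationarity conditions
$$-\frac{A_1 + k A_2}{n^2} + \lambda\Big(c_0 + \frac{c_1}{k}\Big) = 0, \qquad \frac{A_2}{n} - \lambda\,\frac{c_1 n}{k^2} = 0 .$$
Solving the second condition for $\lambda$ and substituting it into the first eliminates both $\lambda$ and $n$, leaving $A_2 c_0 k^2/c_1 = A_1$, so that
$$k^{*} = \sqrt{\frac{A_1 c_1}{A_2 c_0}}, \qquad n^{*} = \frac{C}{c_0 + c_1/k^{*}} ,$$
the second expression coming from the cost constraint itself.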
APPENDIX B
Details of Derivation of Estimators of MSE Components
The following combinations of sums of squares were used in forming the estimators of the variance components of Section 3.3.2 in Chapter III.
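As a point of reference only, the between- and within-interviewer sums of squares from which such combinations are built can be computed directly from interview data; the array layout and names in the sketch below are assumptions made for illustration, not the quantities defined in Chapter III.

```python
import numpy as np

# Illustrative only: between- and within-interviewer sums of squares for one
# survey trial with J interviewers and equal workloads.
def interviewer_sums_of_squares(y: np.ndarray) -> tuple[float, float]:
    """y[j, m] is the m-th observation obtained by interviewer j."""
    j_means = y.mean(axis=1)                              # interviewer means
    grand = y.mean()                                      # overall mean
    between = y.shape[1] * float(((j_means - grand) ** 2).sum())
    within = float(((y - j_means[:, None]) ** 2).sum())
    return between, within

rng = np.random.default_rng(2)
effects = rng.normal(0.0, 0.5, size=5)                    # hypothetical interviewer effects
data = 10.0 + effects[:, None] + rng.normal(0.0, 1.0, size=(5, 20))
print(interviewer_sums_of_squares(data))
```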
Initial Sample
Expectations were evaluated for the following combinations of the between- and within-interviewer sums of squares computed from the initial sample:

a. $\mathrm{BTWI}_1$;
b. $\mathrm{BTBI}_1 - J(\mathrm{BTWI}_1)$;
c. $[\mathrm{BTBII}_1 - J(\mathrm{BTWII}_1)]$ less a multiple of $[\mathrm{BTBI}_1 - J(\mathrm{BTWI}_1)]$;
d. $\mathrm{WTBI}_1$ less a multiple of $\mathrm{BTWI}_1$;
e. the combination in d. plus a multiple of the combination in c.;
f. $\mathrm{WTWI}_1 - 2(\mathrm{BTBII}_1)$.

Each of these expectations reduces to population sums of the forms $\sum_{i=1}^{N}\sum_{j=1}^{J} P_i^2 Y_{ij1}^2$, $\sum_{i=1}^{N}\sum_{j=1}^{J} P_i Y_{ij1}^2$, and $\sum_{i \ne i'}\sum_{j,j'} P_i P_{i'} Y_{ij1} Y_{i'j'1}$, with coefficients involving $n$, $N$, and $J$. Assuming $NJ - 1 \approx NJ$, $N \approx N - 1$, $1 - f \approx 1$, and, where needed, $J - 1 \approx J$, these expectations reduce to simple multiples, such as $1/n$ and $1/(nJ)$, of the variance components of Section 3.3.2, among them $\sigma_{NR_1}^2$ and $\sigma_{IA_1}^2$; the component $\sigma_{NR_1}^2$ involves the population sum of squares $\sum_{i=1}^{N}\sum_{j=1}^{J}\big(P_i Y_{ij1} - \bar{P}\bar{Y}_1\big)^2$.
Subsample of Nonrespondents

Analogous combinations of sums of squares were formed from the subsample of nonrespondents, beginning with $E(\mathrm{BTWI}_2)$ and including combinations of the $\mathrm{BTBI}_2$, $\mathrm{BTBII}_2$, $\mathrm{BTWII}_2$, and $\mathrm{WTWI}_2$ type. Their expectations reduce to population sums involving the factors $(1 - P_i)$ and the subsample values $Y_{ij2}$, for example $\sum_{i=1}^{N}\sum_{j=1}^{J} (1 - P_i) Y_{ij2}^2$ and $\sum_{i \ne i'}\sum_{j,j'} (1 - P_i)(1 - P_{i'}) Y_{ij2} Y_{i'j'2}$, with coefficients that now also involve the subsampling factor $k$. Again assuming $NJ - 1 \approx NJ$, $N \approx N - 1$, and $1 - f \approx 1$, these expectations reduce to multiples of the subsample variance components of Section 3.3.2, among them $\sigma_{R_2}^2$ and $\sigma_{IA_2}^2$, with coefficients such as $k/n$.
Correlation Between Initial and Subsample Respondents

Finally, expectations were evaluated for the cross-product sums of squares linking the initial sample with the subsample of nonrespondents. These expectations involve population sums of the form $\sum_{i=1}^{N}\sum_{j=1}^{J} P_i (1 - P_i) Y_{ij1} Y_{ij2}$ and $\sum_{i \ne i'}\sum_{j,j'} P_i (1 - P_{i'}) Y_{ij1} Y_{i'j'2}$, and, with $NJ \approx NJ - 1$, they reduce to combinations, with coefficients in $n$, of the covariance-type components of Section 3.3.2 such as $\sigma_{S_{12}}^2$ and $\sigma_{NR_{12}}^2$.
I
© Copyright 2026 Paperzz