The Use of Sample Overlap Methods in the Consumer Price Index Area Redesign
William Johnson, Steve Paben, John Schilp
Bureau of Labor Statistics, 2 Massachusetts Ave., NE, Room 3655, Washington, DC 20212
Abstract
Approximately every ten years the Consumer Price Index (CPI) selects a new area sample.
Due to costs associated with the opening and closing of field offices, it is considered highly
desirable to maximize the overlap between the old and new areas sampled in the redesign.
This paper will outline the overlap maximization methods investigated in the selection of non-certainty Core Based Statistical Areas for the next CPI area redesign. Two of the methods of
overlap maximization investigated were the Perkins method and an optimal linear
programming method proposed by Ernst (1986). The results from these methods were
compared to a sample obtained independently.
Key Words: Sample Redesign, Overlap Maximization, Linear Programming
1. Introduction
The Consumer Price Index (CPI) is a measure of the average change over time in the prices of
consumer goods and services. The first stage of sampling for the CPI consists of selecting
geographic areas that are meant to cover the urban US population. Every 10 years the CPI
reselects the geographic areas or primary sampling units (PSUs) to make sure the survey
accurately reflects shifts in the American population observed in the latest decennial census.
The CPI is currently in the midst of updating its samples based on the 2010 Census. In
previous updates, a sample overlap procedure was used in selecting the geographic areas. By
using a sample overlap procedure the likelihood of retaining current geographic areas is
increased and thus we reduce the number of new areas that will be introduced with each area
redesign. This decreases collection costs which include the closing of offices and the hiring
and training of new staff for offices in a different geographic area. In other words, overlap
sampling allows the CPI to retain PSUs from the prior design while maintaining the
requirements of probability sampling.
We considered two different sample overlap procedures: Perkins (1970) and Ernst (1986).
Historically, the Perkins method was used for sample overlap purposes. It is a heuristic
procedure, whereas the Ernst procedure uses linear programming. The two procedures rest on
different assumptions, but because it solves an optimization problem, the Ernst procedure
should yield a higher expected overlap. Additionally, the population and other demographic
characteristics of the continually overlapped PSUs have changed over time. Therefore, we also
investigated selecting the new area sample independently. An independent sample would also
serve as a baseline for a cost-benefit analysis comparing the expected number of overlaps from
the Perkins and Ernst methods.
Following the results of each decennial Census, the Office of Management and Budget (OMB)
releases a new set of area definitions. The changes can range from the addition or deletion of
a single county from an area definition to dividing an old area definition into multiple new
areas. There were substantial changes made to the area definitions that were first released in
2003 with data based on the 2000 Census. Additionally, a whole new set of terminology and
concepts was introduced. This is the current Core Based Statistical Area (CBSA) concept,
which included the introduction of micropolitan areas; for the CPI, the micropolitan areas are
akin to the current smaller metropolitan areas. The CPI never implemented the area
selection based on the 2000 Census due to budgetary issues which have since been resolved.
Therefore, the CPI faces two decades of area definitional changes. In later sections, PSU
definition changes will be discussed within the context of the Perkins and Ernst methods.
There are currently three types of PSUs in the CPI: certainty metropolitan, non-certainty
metropolitan, and micropolitan areas. The largest metropolitan areas are selected with
certainty. Since certainty areas will definitely be included in the new sample there is no
reason to attempt to overlap them. We also do not attempt to overlap the micropolitan areas.
For the micropolitan areas, the issue was not having enough unique renters for the CPI
Housing survey. Each micropolitan area must have enough renters for two samples of six
panels each over the course of the decade between area redesigns. Thus, we concluded that
applying an overlap maximization procedure is not appropriate for the micropolitan areas.
Therefore, only non-certainty metropolitan areas will be eligible for the overlap procedures.
In Sections 2 and 3, we respectively describe the Perkins and Ernst methods. In Section 4, we
compare the results of the two methods to the expected overlap of an area sample drawn
independently. In Section 5, we present some concluding thoughts and issues for further
study.
2. Perkins Method
In the past we have used a modification of the Perkins method for overlap maximization to
increase the amount of sample overlap. The modification is due to PSU definition changes in
addition to the strata changing between area sample designs.
Let I j , j  1,..., J , denote the J old strata that intersect S. Perkins’ method requires that it first
be determined from which I j the new PSU is to be selected. To do this, we first calculate a
probability y j for each I j , by summing  i over all PSUs in I j . Then the selection among the
I j is made with probability proportional to y j .
Once a subgroup I j has been selected, the new sample PSU for S is selected from the set of
PSUs in I j , in the following manner:
If a PSU k in I j were selected in the old sample, the new sample PSU for S will be
selected from among the PSUs in I j with the following probabilities conditional on j and k:
  k
 y j pk
 k | jk  min

  k
 
,1 
 y j pk  
 i| jk  1  min



,1

(2.1)


max  i  y j pi ,0
,i k
 max   y j p ,0
(2.2)
I j
If none of the PSUs in I j were selected in the old sample, the new sample PSU for S will
be selected from among the units i in I j with a probability proportional to


max  i  y j pi ,0
(2.3)
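As a minimal sketch of how equations (2.1) to (2.3) could be evaluated, the Python below computes the conditional probabilities for one old stratum and then checks that averaging them over the old-sample outcomes returns the new selection probabilities. The function name `perkins_conditional`, the PSU labels, and all numerical values are hypothetical and chosen only for illustration; the sketch also assumes one sample PSU per old stratum.

```python
def perkins_conditional(pi, p, selected=None):
    """Conditional new-selection probabilities within one old stratum I_j.

    pi       -- dict: PSU -> new selection probability (PSUs of I_j in S)
    p        -- dict: PSU -> old selection probability within I_j
    selected -- the PSU of I_j that was in the old sample, or None
    """
    y = sum(pi.values())                                   # y_j
    excess = {i: max(pi[i] - y * p[i], 0.0) for i in pi}   # max{pi_i - y_j p_i, 0}
    total = sum(excess.values())
    if selected is None:                                   # equation (2.3)
        return {i: (excess[i] / total if total > 0 else 0.0) for i in pi}
    keep = min(pi[selected] / (y * p[selected]), 1.0)      # equation (2.1)
    cond = {}
    for i in pi:
        if i == selected:
            cond[i] = keep
        else:                                              # equation (2.2)
            cond[i] = (1.0 - keep) * excess[i] / total if total > 0 else 0.0
    return cond


# Hypothetical old stratum: three of its PSUs fall in the new stratum S.
pi = {"a": 0.20, "b": 0.25, "c": 0.05}   # new probabilities (illustrative)
p = {"a": 0.5, "b": 0.2, "c": 0.1}       # old probabilities (illustrative)
y = sum(pi.values())
p_none = 1.0 - sum(p.values())           # old sample PSU lies outside S

# Average the conditional probabilities over every old-sample outcome.
unconditional = {i: 0.0 for i in pi}
for k in pi:
    cond = perkins_conditional(pi, p, selected=k)
    for i in pi:
        unconditional[i] += y * p[k] * cond[i]
cond = perkins_conditional(pi, p, selected=None)
for i in pi:
    unconditional[i] += y * p_none * cond[i]

print(unconditional)  # should agree with pi up to floating-point error
```

In this illustration PSU a was over-represented in the old design relative to its new probability, so it is retained with probability 0.8 rather than 1 when it was previously selected.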
The Perkins method of overlap maximization works entirely with the set of PSU definitions
used for the old sample. This is done because it is only in the set of old PSU definitions that
the concept of an overlap PSU is unambiguously defined. This is not necessarily the same
concept of overlap used for counting the expected number of overlap PSUs based on new PSU
definitions.
The modification to the Perkins method occurs after the formulas have been applied to produce
conditional probabilities for PSUs based on old PSU definitions. The probability of an old
PSU is apportioned to the counties making up the old PSU based on the proportion of the
PSU’s population contained in the county. Then the counties are added together to form PSUs
based on new PSU definitions and the probability of the new PSU is the sum of the probability
of the individual counties within it.
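The apportionment step just described can be sketched as follows. This is an illustration under assumed inputs, not the production procedure: the function name `reapportion`, the county and PSU labels, the populations, and the probabilities are all hypothetical.

```python
def reapportion(old_psu_prob, county_pop, old_psu_of, new_psu_of):
    """Move probabilities from old PSU definitions to new PSU definitions.

    Each old PSU's probability is split among its counties in proportion
    to county population; county shares are then summed by new PSU.
    """
    # Total population of each old PSU.
    old_pop = {}
    for county, pop in county_pop.items():
        old = old_psu_of[county]
        old_pop[old] = old_pop.get(old, 0) + pop

    new_prob = {}
    for county, pop in county_pop.items():
        old = old_psu_of[county]
        county_share = old_psu_prob[old] * pop / old_pop[old]
        new = new_psu_of[county]
        new_prob[new] = new_prob.get(new, 0.0) + county_share
    return new_prob


# Hypothetical data: old PSU X is split, and one of its counties joins old PSU Y.
county_pop = {"c1": 600, "c2": 400, "c3": 1000}
old_psu_of = {"c1": "X", "c2": "X", "c3": "Y"}
new_psu_of = {"c1": "N1", "c2": "N2", "c3": "N2"}
old_psu_prob = {"X": 0.5, "Y": 0.3}

new_prob = reapportion(old_psu_prob, county_pop, old_psu_of, new_psu_of)
print(new_prob)
```

Because every county's share is assigned to exactly one new PSU, the total probability across new PSUs equals the total across old PSUs.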
Perkins’ method is relatively simple to implement, but will not yield an optimal overlap.
3. Ernst Method
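Overlap maximization starts with the equation
$$\Pr(p \text{ selected at } t=2) = \sum_{S} \Pr(p \text{ selected at } t=2 \mid S \text{ selected at } t=1)\, \Pr(S \text{ selected at } t=1) \tag{3.1}$$
where $p$ is a PSU and $S$ is a possible sample. The unconditional probability
$\Pr(p \text{ selected at } t=2)$ can be calculated as the population of $p$ at $t=2$ divided by
the stratum population at $t=2$. This allows freedom in assigning values to the conditional
probabilities $\Pr(p \text{ selected at } t=2 \mid S \text{ selected at } t=1)$ as long as the
above equation holds for every PSU $p$. These values can be assigned such that the conditional
probability of selecting a currently sampled area is increased.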
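The Ernst 1986 method uses linear programming to determine the set of conditional
probabilities that maximizes the expected unconditional number of current PSUs that will be
reselected. Like all sample overlap procedures, the Ernst 1986 procedure does not alter the
unconditional (or independent) probabilities of selecting PSUs in the new sample.
Each stratum in the new design represents a separate overlap problem. The linear program is:
maximize
$$\sum_{i=1}^{r} \sum_{j=1}^{u_i} \sum_{k=1}^{n} c_{ijk}\, x_{ijk} \tag{3.2}$$
subject to
$$\sum_{i=1}^{r} \sum_{j=1}^{u_i} x_{ijk} = \pi_k, \quad k = 1, \ldots, n \tag{3.3}$$
$$\sum_{k=1}^{n} x_{ijk} = p_{ij}\, y_i, \quad i = 1, \ldots, r, \; j = 1, \ldots, u_i \tag{3.4}$$
$$\sum_{i=1}^{r} y_i = 1 \tag{3.5}$$
with all $x_{ijk} \ge 0$ and $y_i \ge 0$.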
where:
$i = 1, \ldots, r$, where $r$ = the number of old strata that have at least one PSU in the new stratum,
$j = 1, \ldots, u_i$, where $u_i$ = the number of possible outcomes for the set of PSUs in the $i$th old
stratum,
$k = 1, \ldots, n$, where $n$ = the number of PSUs in the new stratum,
$\pi_k$ = the probability of selection for a PSU in the new stratum,
$p_{ij}$ = the probability of selection for a PSU in the old stratum (the probability of outcome $j$
in the $i$th old stratum),
$y_i$ = the probability that the selection comes from the $i$th old stratum,
$x_{ijk}$ = the joint probability that the $i$th old stratum is chosen, that the $j$th possible outcome within
the $i$th old stratum that intersects with the new stratum is chosen, and that the $k$th PSU is chosen
in the new stratum (these joint probabilities are the decision variables of the linear program),
$c_{ijk}$ = a constant that equals 1, 0, or the current probability of selection for each new PSU.
This matrix is essentially a list of all the possible outcomes given the intersection between the
PSUs in the old strata and the new stratum.
Here $\sum_{i} \sum_{j} \sum_{k} c_{ijk}\, x_{ijk}$ is the unconditional expected number of PSUs from the
current sample that are also selected for the new sample. To see this, note that the $x_{ijk}$’s give
the probability of every possible combination of old outcome and new PSU being selected, and
$c_{ijk}$ is the number of expected overlap PSUs in the event that the $i$th old stratum is chosen,
that the $j$th possible outcome within the $i$th old stratum that intersects with the new stratum is
chosen, and that the $k$th PSU is chosen in the new stratum. Thus $x_{ijk} c_{ijk}$ is the contribution
of a particular event to the expected number of overlaps.
The constraint $\sum_{i} \sum_{j} x_{ijk} = \pi_k$ guarantees that the unconditional probability of
selection, $\pi_k$, of a new PSU $k$ is unchanged. This can be described as saying that the sum, over
all old PSUs $ij$ that could be selected, of the joint probabilities of PSU $k$ being selected and old
PSU $ij$ having been previously selected must equal the unconditional probability of new PSU $k$
being selected.
The constraint $\sum_{k} x_{ijk} = p_{ij}\, y_i$ says that, summing over all possible new PSUs that
could be selected, the previous probability of selection of the PSU $ij$ is preserved. The factor
of $y_i$ is due to the fact that multiple old strata may intersect a single new stratum.
The constraint $\sum_{i} y_i = 1$ says that the selected PSU must come from one of the old strata
which intersect the new stratum.
The procedure is a three-stage process. First, determine all possible sets of outcomes for the
old strata and old PSUs that intersect with the new stratum. Determine the corresponding
selection probabilities. This step includes creating the cijk matrix described above. Second, an
optimal set of xijk’s (also described above) is obtained by solving the linear programming
problem. Finally, a set of new PSUs in the new stratum is selected conditioned on the entire
set of old sample PSUs that are in the new stratum. This is done as follows:
The probability that new PSU $k$ is selected, old stratum $i$ is selected, and PSU $j$ within old
stratum $i$ was selected ($x_{ijk}$) is equal to the conditional probability of new PSU $k$ being
selected given old stratum $i$ and old PSU $j$ within $i$, multiplied by the probability of selecting
old stratum $i$ and old PSU $j$ within stratum $i$. In other words,
$$x_{ijk} = \Pr(k \mid i, j)\; p_{ij}\, y_i$$
Thus, the conditional probability of selecting new PSU $k$ for a specific old stratum and PSU
within the old stratum is
$$\Pr(k \mid i, j) = \frac{x_{ijk}}{p_{ij}\, y_i}$$
Finally we sum over all of the old strata which intersect the new stratum, weighting each by
$y_i$ (here $j_i$ represents the PSU within old stratum $i$ which was previously selected):
$$\Pr(k \mid j_1, \ldots, j_r) = \sum_{i=1}^{r} y_i \Pr(k \mid i, j_i) = \sum_{i=1}^{r} \frac{x_{i j_i k}}{p_{i j_i}} \tag{3.6}$$
For the Ernst method, area definitional changes are assumed to be a one-to-one
correspondence between the PSUs in the old and new design. That is, each new area
corresponds to one and only one old area and each old area to exactly one new area. For
example, in the 1990 Census-based area design Greenville-Spartanburg-Anderson, SC was
considered one metropolitan area. This metropolitan area is now considered to be four
separate CBSAs, three metropolitan areas: Greenville, SC; Spartanburg, SC; and Anderson,
SC; and one micropolitan area, Gaffney, SC. Only the CBSA that makes up the majority of
the population from the old design would be matched using this approach. The other three
CBSAs would have “dummy areas” created with an old selection probability of zero. It
should be noted that it is possible to consider partial sample overlaps with the Ernst method,
in which case all four areas in the example mentioned above would be considered as potential
partial sample overlaps. However, this greatly raises the number of possible outcomes for the
Ernst method and therefore increases the size and complexity of the problem. Therefore, we
decided not to pursue the partial sample overlap approach with the Ernst method.
3.1 An Example
In this example, which illustrates the method above, we examine one stratum S in the new
design. This represents a separate overlap problem and uses current data. The example
stratum S contains 3 CBSAs that come from 3 initial strata. Old and new probabilities are
based on population relative importance.
CBSA                       Initial Stratum   Initial probability of selection   New Probability = πk
                                             within initial stratum
Orlando-Kissimmee, FL      B338              .7334                              .8394
Naples-Marco Island, FL    B360              .1173                              .1290
Key West, FL               C328              .0359                              .0306
Here, k=1, 2, 3 correspond to Orlando, Naples and Key West respectively. While each initial
stratum contains numerous other previous PSUs, we are only concerned with the PSUs in the
intersection of the initial strata and the current stratum S, and with their initial
probabilities. The new probabilities denoted πk necessarily sum to 1 for k=1 to n.
The cijk matrix is below, where i indexes the initial strata and j runs from 1 to ui, where ui
is the number of possible outcomes for the set of PSUs in the ith old stratum. In this example
ui is coincidentally 2 for all initial strata. When the initial stratum is B338 (i=1), j=1 when
Orlando, FL was previously selected and j=2 when another PSU in B338 was previously selected.
(i,j)    k=1      k=2      k=3
(1,1)    1        .1173    .0359
(1,2)    0        .1173    .0359
(2,1)    .7334    1        .0359
(2,2)    .7334    0        .0359
(3,1)    .7334    .1173    1
(3,2)    .7334    .1173    0
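The construction of the cijk matrix can be sketched in Python as below. The function `build_c` is a hypothetical helper, and it assumes exactly one sample PSU per old stratum, so that the outcomes for old stratum i are "one of its PSUs in the new stratum was selected" or "none of them was selected". Under that assumption it reproduces the six rows of the example.

```python
def build_c(old_prob, old_stratum):
    """Return {(i, j, k): c_ijk} using 1-based indices as in the text.

    old_prob[k-1]    -- old selection probability of new-stratum PSU k
    old_stratum[k-1] -- index i of the old stratum that contained PSU k
    """
    n = len(old_prob)
    c = {}
    for i in sorted(set(old_stratum)):
        members = [k for k in range(1, n + 1) if old_stratum[k - 1] == i]
        # Outcomes: each member was the old sample PSU, or none of them was.
        outcomes = members + [None]
        for j, chosen in enumerate(outcomes, start=1):
            for k in range(1, n + 1):
                if old_stratum[k - 1] == i:
                    # Old status of k is known exactly under outcome j.
                    c[i, j, k] = 1.0 if k == chosen else 0.0
                else:
                    # k belongs to another old stratum: use its old probability.
                    c[i, j, k] = old_prob[k - 1]
    return c


# Example stratum: Orlando (k=1), Naples (k=2), Key West (k=3).
old_prob = [0.7334, 0.1173, 0.0359]
old_stratum = [1, 2, 3]          # B338, B360, C328
c = build_c(old_prob, old_stratum)

for i, j in sorted({(i, j) for i, j, _ in c}):
    print((i, j), [c[i, j, k] for k in (1, 2, 3)])
```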
The solution to the optimization problem is the set of xijk below, which maximizes the
objective function.
(i,j)    k=1       k=2       k=3
(1,1)    0.2225    0         0
(1,2)    0         0.0495    0.0314
(2,1)    0         0.0817    0
(2,2)    0.6149    0         0
(3,1)    0         0         0
(3,2)    0         0         0
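As a rough consistency check, the sketch below plugs the published solution into constraints (3.3) to (3.5) and evaluates the objective (3.2). The yi are not reported in the text, so they are recovered here by summing xijk over j and k, which follows from (3.4) when the pij sum to 1 within each old stratum. The comparison with independent selection assumes its expected overlap is the sum over k of πk times the old probability of PSU k. Because the published figures are rounded, the constraints hold only approximately.

```python
# Published example values (indices are 1-based as in the text).
pi = {1: 0.8394, 2: 0.1290, 3: 0.0306}          # new probabilities
p_old = {1: 0.7334, 2: 0.1173, 3: 0.0359}       # old probability of PSU k
p = {(1, 1): 0.7335, (1, 2): 0.2665,
     (2, 1): 0.1173, (2, 2): 0.8827,
     (3, 1): 0.0359, (3, 2): 0.9641}            # p_ij
x = {(1, 1, 1): 0.2225, (2, 2, 1): 0.6149,
     (1, 2, 2): 0.0495, (2, 1, 2): 0.0817,
     (1, 2, 3): 0.0314}                         # nonzero x_ijk only

I, J, K = (1, 2, 3), (1, 2), (1, 2, 3)


def xv(i, j, k):
    """x_ijk, with unlisted entries equal to zero."""
    return x.get((i, j, k), 0.0)


def c(i, j, k):
    """c_ijk for this example: PSU k is the only PSU of old stratum k in S."""
    if i == k:
        return 1.0 if j == 1 else 0.0
    return p_old[k]


# Recover y_i (assumption: sum of p_ij over j is 1 within each old stratum).
y = {i: sum(xv(i, j, k) for j in J for k in K) for i in I}

gap_33 = max(abs(sum(xv(i, j, k) for i in I for j in J) - pi[k]) for k in K)
gap_34 = max(abs(sum(xv(i, j, k) for k in K) - p[i, j] * y[i])
             for i in I for j in J)
gap_35 = abs(sum(y.values()) - 1.0)

objective = sum(c(i, j, k) * v for (i, j, k), v in x.items())
independent = sum(pi[k] * p_old[k] for k in K)

print("largest constraint gaps:", gap_33, gap_34, gap_35)
print("expected overlap, LP solution:", round(objective, 4))
print("expected overlap, independent:", round(independent, 4))
```

Obtaining the xijk in the first place requires a linear programming solver; this sketch only verifies a given solution.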
The pij, the probabilities of outcome j in initial stratum i, are:
         B338     B360     C328
j=1      .7335    .1173    .0359
j=2      .2665    .8827    .9641
Finally, an optimal set of conditional probabilities is given below. Each entry is the sum over
the initial strata i of xijk/pij, evaluated at the outcome ji observed in each initial stratum.
j1   j2   j3    k=1      k=2      k=3
1    1    1     .3034    .6966    0
1    1    2     .3034    .6966    0
1    2    1     1        0        0
1    2    2     1        0        0
2    1    1     0        .8824    .1176
2    1    2     0        .8824    .1176
2    2    1     .6966    .1858    .1176
2    2    2     .6966    .1858    .1176
In our example, no PSU in the intersection was previously selected, so ji = 2 for every
initial stratum i = 1 to 3, and we take the bottom row of the table above for our
conditional probabilities.
Suppose instead that Orlando had been previously selected to represent initial stratum 1. In
that case j1 = 1, while j2 and j3 remain 2 because no PSU in the intersection was previously
selected from those initial strata. We would take the 4th row for the conditional probabilities.
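The rows of the conditional probability table can be reproduced, up to rounding of the published inputs, with a short Python sketch of equation (3.6). The function name `conditional_probabilities` is hypothetical, and the final draw with `random.choices` is only one illustrative way to select the new PSU from those probabilities.

```python
import random

p = {(1, 1): 0.7335, (1, 2): 0.2665,
     (2, 1): 0.1173, (2, 2): 0.8827,
     (3, 1): 0.0359, (3, 2): 0.9641}            # p_ij from the example
x = {(1, 1, 1): 0.2225, (2, 2, 1): 0.6149,
     (1, 2, 2): 0.0495, (2, 1, 2): 0.0817,
     (1, 2, 3): 0.0314}                         # nonzero x_ijk from the example


def conditional_probabilities(outcome, n=3):
    """Equation (3.6): sum over old strata i of x[i, j_i, k] / p[i, j_i].

    outcome -- dict mapping old stratum i to its observed outcome j_i
    """
    return [sum(x.get((i, j, k), 0.0) / p[i, j] for i, j in outcome.items())
            for k in range(1, n + 1)]


# No PSU in the intersection was previously selected (bottom row of the table).
none_selected = conditional_probabilities({1: 2, 2: 2, 3: 2})
# Orlando was previously selected for initial stratum 1 (4th row of the table).
orlando_selected = conditional_probabilities({1: 1, 2: 2, 3: 2})

print([round(v, 4) for v in none_selected])
print([round(v, 4) for v in orlando_selected])

# Illustrative draw of the new sample PSU given the observed old outcome.
names = ("Orlando", "Naples", "Key West")
new_psu = random.choices(names, weights=none_selected)[0]
print(new_psu)
```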
4. Comparing sample overlap results to independent selection
Independent Selection is the sampling procedure that does not use overlap maximization in
any way. After stratifying into groups, the total population of each PSU is divided by its
stratum population to arrive at each PSU’s selection probability. This is known as probability
proportional to size (PPS) sampling where the size statistic is total population as provided by
the decennial Census. PSUs with relatively large populations will have a higher probability of
selection than relatively small PSUs under this PPS procedure. There is no conditioning on the
previous sample in Independent Selection.
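A minimal sketch of the PPS calculation for a single stratum follows. The function name `pps_probabilities` and the populations are hypothetical; they are chosen so that the resulting probabilities are easy to read.

```python
def pps_probabilities(populations):
    """Selection probability of each PSU: its population over the stratum total."""
    stratum_total = sum(populations.values())
    return {psu: pop / stratum_total for psu, pop in populations.items()}


# Hypothetical stratum with three PSUs.
populations = {"PSU a": 750_000, "PSU b": 150_000, "PSU c": 100_000}
probs = pps_probabilities(populations)
print(probs)
```

The probabilities depend only on the current populations, which is what distinguishes Independent Selection from the overlap procedures.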
In order to determine which of these methods works the best, the expected sample overlap
value is calculated as follows.
First, an overlap PSU is defined by the 1 to 1 matching method as described in Section 3. In 1
to 1 matching each newly defined PSU corresponds to 1 and only 1 PSU in the
previous frame. Each of these previous PSUs is clearly either an overlap PSU, in the
previous sample, or a non-overlap PSU, not in the previous sample. In some cases a new PSU
may match to a “dummy area”, in which case it is necessarily not an overlap PSU. We then use
this matching method for all three methods under consideration to determine which new PSUs are
considered overlap PSUs and which are not. Second, this determination is used to find each
PSU’s overlap probability; these probabilities sum to the expected value of overlap.
             Overlap PSU    PPS Probability    Overlap Probability
Stratum 1
  PSU a      *              .75                .75
  PSU b                     .15                0
  PSU c                     .10                0
Stratum 2
  PSU d      *              .5                 .5
  PSU e                     .5                 0
The Expected Value of Overlap is the Sum of Overlap Probability = 1.25
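The expected overlap in this illustration can be computed with a few lines of Python. The data structure is hypothetical, and the sketch takes PSU a and PSU d as the overlap PSUs, which is what the nonzero overlap probabilities in the table imply.

```python
# (PSU, selection probability, was it in the previous sample?)
strata = {
    "Stratum 1": [("PSU a", 0.75, True), ("PSU b", 0.15, False), ("PSU c", 0.10, False)],
    "Stratum 2": [("PSU d", 0.5, True), ("PSU e", 0.5, False)],
}

# A PSU contributes its selection probability only if it is an overlap PSU.
expected_overlap = sum(prob
                       for psus in strata.values()
                       for _, prob, is_overlap in psus
                       if is_overlap)
print(expected_overlap)
```

Under an overlap maximization method the same sum is taken, but over the selection probabilities that method produces.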
In order to determine the expected overlap, 1 to 1 matching was used after the maximized
probabilities were calculated for each of these three methods. In other words, Perkins method
used partial matching in its overlap maximization procedure, the Ernst86 method used 1 to 1
matching for its procedure and independent selection did not use any overlap definitions in its
procedure; 1 to 1 matching was used in all methods to find the expected overlap.
The expected sample overlap broken down by Census Division for Independent Selection and
the 2 overlap maximization methods for the 58 non-self-representing PSUs is as follows:
              PSUs in Design    Independent Selection    Perkins    Ernst86
Division 1    2                 0.4529                   0.8635     1.0172
Division 2    4                 0.6929                   1.2570     1.6678
Division 3    8                 2.8817                   4.0453     4.8377
Division 4    4                 0.7950                   1.2120     2.0597
Division 5    14                3.1574                   3.7433     6.9912
Division 6    6                 0.7833                   1.0577     2.1836
Division 7    8                 2.0929                   2.6480     4.3588
Division 8    6                 1.4000                   2.4980     3.2277
Division 9    6                 0.9154                   1.0247     2.2353
Total         58                13.17                    18.349     28.59
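For readers who want to work with these figures, the sketch below re-adds the division rows and expresses each method's expected overlap as a share of the 58 PSUs and as a gain over Independent Selection. The variable names are hypothetical; the column sums agree with the published totals to within about 0.01.

```python
# Division: (PSUs in design, Independent Selection, Perkins, Ernst86)
rows = {
    "Division 1": (2, 0.4529, 0.8635, 1.0172),
    "Division 2": (4, 0.6929, 1.2570, 1.6678),
    "Division 3": (8, 2.8817, 4.0453, 4.8377),
    "Division 4": (4, 0.7950, 1.2120, 2.0597),
    "Division 5": (14, 3.1574, 3.7433, 6.9912),
    "Division 6": (6, 0.7833, 1.0577, 2.1836),
    "Division 7": (8, 2.0929, 2.6480, 4.3588),
    "Division 8": (6, 1.4000, 2.4980, 3.2277),
    "Division 9": (6, 0.9154, 1.0247, 2.2353),
}
methods = ("Independent Selection", "Perkins", "Ernst86")

n_psus = sum(r[0] for r in rows.values())
totals = {m: sum(r[idx + 1] for r in rows.values())
          for idx, m in enumerate(methods)}
share = {m: totals[m] / n_psus for m in methods}
gain = {m: totals[m] - totals["Independent Selection"] for m in methods}

for m in methods:
    print(f"{m}: total {totals[m]:.2f}, share {share[m]:.1%}, gain {gain[m]:.2f}")
```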
5. Conclusion
Due to the strong desire to reduce the start-up and shut-down costs of implementing the sample,
the CPI has chosen the Ernst 86 method of overlap maximization. Of the three approaches
compared, the Ernst 86 method yields the highest expected overlap: about 28.6 of the 58
non-certainty PSUs, versus about 18.3 for the Perkins method and about 13.2 for independent
selection.
Disclaimer: Any opinions expressed in this paper are those of the authors and do not
constitute policy of the Bureau of Labor Statistics.
References
Ernst, L.R. (1986). Maximizing the overlap between surveys when information is incomplete.
European Journal of Operational Research, Vol. 27, 192-200.
Ernst, L.R., Izsak, Y., and Paben, S.P. (2004). Use of Overlap Maximization in the Redesign
of the National Compensation Survey. In JSM Proceedings, Survey Research Methods
Section. Alexandria, VA: American Statistical Association.
Perkins, W. (1970). 1970 CPS Redesign: Proposed Method for Deriving Sample PSU
Selection Probabilities Within 1970 NSR Strata. Memo to Joseph Waksberg, U.S. Bureau
of the Census.