Impact on the quality of the SARs Perturbation and Imputation

Comparison of Perturbation Approaches for Spatial Outliers in Microdata
Natalie Shlomo* and Jordi Marés**
* Social Statistics, University of Manchester
** IIIA and CSIC, Barcelona
The project was funded by the Census Statistical Disclosure Control project at
Westat, Inc. through the sponsorship of the U.S. Bureau of the Census
Topics Covered
• Introduction
• Description of Data
• Outlier Detection
• Coherence Function
• Perturbation Methods
  • Record Swapping Method
  • Hot Deck Method
• Results
• Conclusions
Introduction
• Geographical spatial outliers arise from multivariate
  relationships between spatial and non-spatial characteristics
  and have a high probability of identification
• Treat through targeted SDC perturbation in the microdata
• Focus on US American Community Survey (ACS)
  transportation outputs: trajectories defined as vectors of
  coordinates, from place of residence (origin) to workplace
  (destination)
• Example of an outlier: an overly long commute to work by a
  non-typical means of transportation (MOT), such as cycling
• Objective: to inform and guide decisions about best
  practices that the US Census Bureau could use in future
  dissemination strategies for these and similar types of
  datasets
Description of Data
• Simulation study based on an artificial population
  produced from the 2006-2008 combined PUMS of the ACS
• Those living in California, employed and working within
  the US (N=438,850)
• Latitude and longitude of residence and workplace
  generated by adding random distances around a radius
  of the centroid of the relevant PUMA (public-use
  microdata area with population greater than 100K)
• Survey weights were not taken into account (they need to
  be recalibrated following perturbation); however, other
  calibration variables are used as controls to minimize
  distortions to the original weights
Outlier Detection
• Outlier detection methods include univariate and
  multivariate methods and can take parametric or
  nonparametric forms
• For this study we use multivariate outlier detection
  based on the Mahalanobis distance, where large values
  indicate outliers
• Replace the mean vector by the median vector and the
  covariance matrix by the minimum covariance determinant (MCD)
  estimate (Rousseeuw, 1985)
• Let h be the minimum number of points which are not
  outlying:
\[ M_{MCD} = \frac{1}{h} \sum_{j=1}^{h} x_{i_j}, \qquad S_{MCD} = \frac{c_{ccf}\, c_{sscf}}{h-1} \sum_{j=1}^{h} (x_{i_j} - M_{MCD})(x_{i_j} - M_{MCD})^{T} \]
• The cut-off for squared Mahalanobis distances based on p
  variables is generally a quantile of the chi-square distribution:
\[ D_0 = \chi^2_p(0.975) \]
• Under robust Mahalanobis distances RD_i, use the
  adjusted cut-off:
\[ D_0 = \chi^2_p(0.975) \, \frac{\mathrm{median}\{RD_1, \ldots, RD_n\}}{\chi^2_p(0.5)} \]
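The adjusted cut-off can be illustrated with a small Python sketch. It assumes p = 2 (distance travelled and minutes to work), for which the chi-square quantile has the closed form -2 ln(1 - q), and it assumes the inputs are squared robust distances (the slide writes RD_i without stating this). The function names are hypothetical, and estimating the MCD location and scatter is not shown.

```python
import math
from statistics import median

def chi2_quantile_2df(q):
    """Quantile of the chi-square distribution with 2 degrees of freedom.

    For p = 2 the CDF is 1 - exp(-x / 2), so the quantile is -2 * ln(1 - q).
    """
    return -2.0 * math.log(1.0 - q)

def mahalanobis_sq_2d(x, center, cov):
    """Squared Mahalanobis distance of a 2-vector from center under a 2x2 cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - center[0], x[1] - center[1]
    # Explicit inverse of the 2x2 matrix applied to (dx, dy).
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

def adjusted_cutoff(robust_sq_distances):
    """chi2_2(0.975) * median(RD) / chi2_2(0.5); assumes squared distances."""
    return (chi2_quantile_2df(0.975) * median(robust_sq_distances)
            / chi2_quantile_2df(0.5))

def flag_outliers(robust_sq_distances):
    """True for every record whose robust distance exceeds the adjusted cut-off."""
    cutoff = adjusted_cutoff(robust_sq_distances)
    return [rd > cutoff for rd in robust_sq_distances]
```

In practice the center and covariance passed to `mahalanobis_sq_2d` would come from a robust estimator, computed separately within each class.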
Outlier Detection
• Robust Mahalanobis distances are calculated on distance
  travelled and minutes to work:
DistanceToWork=geodist(latitude,longitude,POW_latitude,POW_longitude,'DM');
• Determine explanatory variables predictive of distance
  travelled to produce classes: mode of transport, sex,
  earnings and occupation
• SAS macro: ‘Robcov’ Version 1.3-2 (written by Michael
  Friendly)
• Collapse classes so that each contains at least 20
  individuals, calculate the robust Mahalanobis distance, and
  flag records that exceed the critical value
• Dataset reduced to 283,423 records without missing values
  and with a high degree of consistency: 60,007 outliers (21.2%),
  reduced to 59,080 (20.8%) outliers after deleting the
  ‘other’ mode of transport
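For readers without SAS, the distance calculation can be approximated in Python. The sketch below uses the haversine formula on a spherical Earth as a stand-in for `geodist` (not necessarily the formula SAS applies), takes coordinates in degrees and returns miles to match the 'DM' options, and assumes a mean Earth radius of 3958.8 miles. The function name is hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius

def distance_to_work(lat, lon, pow_lat, pow_lon):
    """Great-circle distance in miles from residence to place of work.

    All four coordinates are in decimal degrees.
    """
    phi1, phi2 = math.radians(lat), math.radians(pow_lat)
    dphi = phi2 - phi1
    dlmb = math.radians(pow_lon - lon)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))
```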
Coherence Function
• The coherence function uses the maximum and minimum velocity
  for each mode of transport, based on the set of non-outliers:
\[ \mathrm{maxVelocity}_m = \max \left\{ \frac{dist_r}{time_r} \times 60 \;\middle|\; \forall r \in data_{nonOutliers},\ r_{modeTransport} = m \right\} \]
\[ \mathrm{minVelocity}_m = \min \left\{ \frac{dist_r}{time_r} \times 60 \;\middle|\; \forall r \in data_{nonOutliers},\ r_{modeTransport} = m \right\} \]
\[ \mathrm{meanVelocity}_m = \frac{\mathrm{maxVelocity}_m + \mathrm{minVelocity}_m}{2} \]
• Assign high coherence to individuals whose travelled distance
  is close to the mean, and low coherence to individuals whose
  travelled distance is far from the mean
• Use as the objective function to guide perturbation,
  where we aim to obtain a higher coherence for outliers
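The velocity bounds translate directly into code. The slides do not give the functional form of the coherence score, so the sketch below assumes a triangular score: 1 at the mean velocity, falling linearly to 0 at the minimum and maximum velocity, and 0 outside that range. The record layout (dicts with `mode`, `dist`, `time`, `outlier`) and the function names are also assumptions made for illustration.

```python
def velocity_bounds(records):
    """Per-mode (min, max, mean) velocity computed from non-outlier records.

    Each record is a dict with 'mode', 'dist' (miles), 'time' (minutes)
    and 'outlier' (bool); velocities are in miles per hour.
    """
    bounds = {}
    for r in records:
        if r["outlier"] or r["time"] <= 0:
            continue
        v = r["dist"] / r["time"] * 60
        lo, hi = bounds.get(r["mode"], (v, v))
        bounds[r["mode"]] = (min(lo, v), max(hi, v))
    return {m: (lo, hi, (lo + hi) / 2) for m, (lo, hi) in bounds.items()}

def coherence(dist, time, mode, bounds):
    """Assumed score in [0, 1]: 1 at the mean velocity, 0 at or beyond the bounds."""
    lo, hi, mean = bounds[mode]
    half_range = (hi - lo) / 2
    if half_range == 0 or time <= 0:
        return 0.0
    v = dist / time * 60
    return max(0.0, 1.0 - abs(v - mean) / half_range)
```

Any score that decreases with distance from the mean velocity would serve the same purpose as an objective function; the triangular shape is only the simplest choice.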
Record Swapping
• Pair outliers with different workplaces by swapping their
  places of residence, so that the coherence function increases
  for at least one of the outliers (without decreasing for the other)
• Carry out within classes: mode of transport, sex and age
  group
• Split outliers according to workplace; calculate the
  coherence function when swapping the residence of an outlier
  with each other outlier in a different workplace
• If one of the outliers has higher coherence, then swap
• Continue iteratively
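The swapping steps above can be sketched as a greedy loop in Python. This is one possible reading of the procedure, not the authors' implementation: it accepts the first improving swap it finds rather than searching for the best partner. The record layout, the `coherence(record, residence)` callable and the `max_passes` guard are assumptions.

```python
from itertools import combinations

def swap_pass(outliers, coherence):
    """One greedy pass of pairwise residence swaps within a single class.

    outliers: list of dicts with 'residence' and 'workplace' keys.
    coherence: callable(record, residence) -> float, scoring the record
    as if it lived at the given residence. Returns the number of swaps.
    """
    swaps = 0
    for a, b in combinations(outliers, 2):
        if a["workplace"] == b["workplace"]:
            continue  # only pair outliers with different workplaces
        before_a = coherence(a, a["residence"])
        before_b = coherence(b, b["residence"])
        after_a = coherence(a, b["residence"])
        after_b = coherence(b, a["residence"])
        no_loss = after_a >= before_a and after_b >= before_b
        some_gain = after_a > before_a or after_b > before_b
        if no_loss and some_gain:
            a["residence"], b["residence"] = b["residence"], a["residence"]
            swaps += 1
    return swaps

def record_swap(outliers, coherence, max_passes=100):
    """Repeat passes until no swap improves coherence or max_passes is hit."""
    total = 0
    for _ in range(max_passes):
        made = swap_pass(outliers, coherence)
        if made == 0:
            break
        total += made
    return total
```

Because every accepted swap raises the summed coherence and never lowers an individual score, the loop cannot cycle; the pass limit is only a safeguard.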
Hot Deck
• Impute the residence of an outlier with the residence of a
  non-outlier within the same class and having the same workplace
• 2 approaches for selecting the donor (note: need more than
  one individual in the workplace):
1. Candidate donors are those having a distance to work within
   the coherence range of distances; the donor selected
   maximizes the coherence function (i.e. the candidate donor
   whose distance to work is closest to the mean velocity)
2. Instead of the coherence function, choose the donor from the
   non-outliers in the same workplace having similar travelled
   minutes (nearest neighbor)
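Both donor-selection approaches can be sketched in Python. The sketch assumes that the "coherence range of distances" is the interval implied by the minimum and maximum velocity for the outlier's own travel time, and that the best donor is the one whose distance is closest to the midpoint of that interval. Record keys (`cls`, `workplace`, `residence`, `dist`, `time`, `outlier`) and function names are hypothetical.

```python
def donor_by_coherence(outlier, donors, min_v, max_v):
    """Approach 1: donor whose distance to work is closest to the distance
    implied by the mean velocity and the outlier's travel time.

    Velocities in miles per hour, 'time' in minutes, 'dist' in miles.
    """
    hours = outlier["time"] / 60
    lo, hi = min_v * hours, max_v * hours
    target = (lo + hi) / 2  # distance at the mean velocity
    candidates = [d for d in donors if lo <= d["dist"] <= hi]
    if not candidates:
        return None
    return min(candidates, key=lambda d: abs(d["dist"] - target))

def donor_by_minutes(outlier, donors):
    """Approach 2: nearest neighbor on travelled minutes."""
    if not donors:
        return None
    return min(donors, key=lambda d: abs(d["time"] - outlier["time"]))

def hot_deck(outlier, records, select):
    """Impute the outlier's residence from a non-outlier donor in the same
    class and workplace. Returns True if a donor was found."""
    donors = [r for r in records
              if not r["outlier"]
              and r["cls"] == outlier["cls"]
              and r["workplace"] == outlier["workplace"]]
    donor = select(outlier, donors)
    if donor is None:
        return False
    outlier["residence"] = donor["residence"]
    outlier["dist"] = donor["dist"]  # same workplace, so the distance carries over
    return True
```

Passing the selector as an argument keeps the imputation step identical for both approaches, for example `hot_deck(o, records, donor_by_minutes)`.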
Results: Outliers
Outlier status before (rows) and after (columns) perturbation; column percentages in parentheses; HD = hot deck

                                   After swapping                    After HD (coherence measure)      After HD (minutes)
Original outlier  Total            Yes              No               Yes              No               Yes              No
Yes                59,080 (20.8%)   42,788 (92.0%)   16,292 (6.9%)    27,099 (76.2%)   31,981 (12.9%)   28,123 (79.3%)   30,957 (12.5%)
No                224,343 (79.2%)    3,731 (8.0%)   220,612 (93.1%)    8,456 (23.8%)  215,887 (87.1%)    7,321 (20.7%)  217,022 (87.5%)
Total             283,423 (100%)    46,519 (100%)   236,904 (100%)    35,555 (100%)   247,868 (100%)    35,444 (100%)   247,979 (100%)
• Swapping corrected fewer outliers than the hot deck methods
(about 16K vs 31K to 32K), but swapping was carried out only among outliers
• Some non-outliers became outliers because perturbation changed
the distribution structure (about 4K under swapping vs 7K to 8K
under hot deck)
• The number of non-outliers that became outliers following
perturbation was much smaller than the number of outliers that
were corrected to non-outliers
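The gross counts quoted above and the net reductions quoted in the Discussion (21.3% and about 40%) come from the same table. The short Python calculation below makes the relationship explicit; the counts are copied from the table, and the dictionary and function names are only illustrative.

```python
original_outliers = 59_080

results = {
    # method: (outliers corrected to non-outliers, non-outliers turned into outliers)
    "swapping": (16_292, 3_731),
    "hot deck (coherence)": (31_981, 8_456),
    "hot deck (minutes)": (30_957, 7_321),
}

def summarize(corrected, created, original=original_outliers):
    """Outliers remaining after perturbation, plus gross and net percentages."""
    remaining = original - corrected + created
    return {
        "outliers_after": remaining,
        "gross_corrected_pct": round(100 * corrected / original, 1),
        "net_reduction_pct": round(100 * (original - remaining) / original, 1),
    }

for method, (corrected, created) in results.items():
    print(method, summarize(corrected, created))
```

The net figure is lower than the gross figure because newly created outliers offset part of the correction.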
Results
Normalized Absolute Difference:
\[ dist(attr_1, attr_2) = \frac{\sum_{i \in Dom(attr_1)} \sum_{j \in Dom(attr_2)} \left| freq_{ij} - freq'_{ij} \right|}{2N} \]
Normalized Hellinger's Distance:
\[ dist(attr_1, attr_2) = \sqrt{ \frac{0.5 \sum_{i \in Dom(attr_1)} \sum_{j \in Dom(attr_2)} \left( \sqrt{freq_{ij}} - \sqrt{freq'_{ij}} \right)^2}{N} } \]
where freq_{ij} and freq'_{ij} are the original and perturbed cell counts

Bivariate variable   Normalized Absolute Difference        Normalized Hellinger's Distance
crossed with PUMA    Swapping   HD Minutes   HD Coherence   Swapping   HD Minutes   HD Coherence
AGE9                 0          0.109        0.095          0          0.154        0.134
AGEP                 0.048      0.119        0.107          0.059      0.166        0.147
SEX                  0          0.113        0.095          0          0.161        0.140
OCCUPATION           0.094      0.134        0.125          0.164      0.215        0.203
EARNINGS             0.024      0.104        0.089          0.029      0.148        0.129
MODE                 0          0.104        0.087          0          0.154        0.131
• Individuals who had their PUMA changed due to the
perturbation: Swapping Method: 56,562 ; Hot Deck Method
(Minutes): 53,945 ; Hot Deck Method (Coherence): 53,181
• Hotdeck methods perturb bivariate counts more than
swapping since swapping does not change marginal
frequencies
• Hotdeck using the coherence function approach resulted in
less information loss than nearest neighbor approach
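The two information-loss measures can be computed from a pair of cross-tabulations. The Python sketch below assumes each table is a dict mapping a (category of attr1, category of attr2) cell to its count, with missing cells treated as zero, and takes N as the total of the original table. The Hellinger form uses square roots of the counts, following the standard definition. Function names are hypothetical.

```python
import math

def normalized_absolute_difference(orig, pert):
    """Sum over cells of |freq_orig - freq_pert|, divided by 2N."""
    n = sum(orig.values())
    cells = set(orig) | set(pert)
    return sum(abs(orig.get(c, 0) - pert.get(c, 0)) for c in cells) / (2 * n)

def normalized_hellinger(orig, pert):
    """sqrt(0.5 * sum over cells of (sqrt(freq_orig) - sqrt(freq_pert))^2 / N)."""
    n = sum(orig.values())
    cells = set(orig) | set(pert)
    total = sum((math.sqrt(orig.get(c, 0)) - math.sqrt(pert.get(c, 0))) ** 2
                for c in cells)
    return math.sqrt(0.5 * total / n)
```

Both measures are 0 when the tables are identical and 1 when the perturbed counts fall entirely in cells that were empty in the original, which is consistent with swapping scoring 0 wherever it leaves the bivariate counts unchanged.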
Discussion
• Record swapping had the lowest information loss (especially for
bivariate counts of the swapping variable with other control variables)
but reduced the number of outliers by only 21.3%, while the hot-deck
methods reduced it by about 40.0%
• The hot-deck methods transformed more non-outliers into outliers
compared to record swapping
• Recommendation: carry out both methods, starting with record
swapping and then proceeding to the hot-deck method on the
remaining outliers
• Survey weights need to be recalibrated to the new place of residence,
but including calibration variables as controls minimizes distortion to
the survey weights, especially under record swapping
Thank you for your attention