Comparison of Perturbation Approaches for Spatial Outliers in Microdata

Natalie Shlomo* and Jordi Marés**
* Social Statistics, University of Manchester, [email protected]
** IIIA and CSIC, Barcelona, [email protected]

The project was funded by the Census Statistical Disclosure Control project at Westat, Inc. through the sponsorship of the U.S. Bureau of the Census.

Topics Covered
• Introduction
• Description of Data
• Outlier Detection
• Coherence Function
• Perturbation Methods
    • Record Swapping Method
    • Hot Deck Method
• Results
• Conclusions

Introduction
• Geographical spatial outliers arise from multivariate relationships between spatial and non-spatial characteristics and have a high probability of identification
• Treat them through targeted SDC perturbation in the microdata
• Focus on the US American Community Survey (ACS) transportation outputs, with trajectories defined as vectors of coordinates: place of residence (origin) and workplace (destination)
• Example of an outlier: an overly long commute to work on a non-typical means of transportation (MOT), such as cycling
• Objective: to inform and guide decisions about best practices that the US Census Bureau could use in future dissemination strategies for these and similar types of datasets

Description of Data
• Simulation study based on an artificial population produced from the 2006-2008 combined PUMS of the ACS
• Population: those living in California, employed, and working within the US (N = 438,850)
• Latitude and longitude of residence and workplace were generated by adding random distances within a radius around the centroid of the relevant PUMA (public-use microdata area with a population greater than 100K)
• Survey weights were not taken into account (they need to be recalibrated following perturbation); however, other calibration variables are used as controls to minimize distortions to the original weights

Outlier Detection
• Outlier detection methods include univariate and multivariate methods and can take parametric or nonparametric forms
• For this study we use multivariate outlier detection based on the Mahalanobis distance, where large values indicate outliers
• Replace the mean vector by the median vector and the covariance matrix by the minimum covariance determinant (MCD) (Rousseeuw, 1985)
• Let h be the minimum number of points which are not outlying:
    $M_{MCD} = \frac{1}{h} \sum_{j=1}^{h} x_{i_j}$
    $S_{MCD} = ccf \cdot sscf \cdot \frac{1}{h-1} \sum_{j=1}^{h} (x_{i_j} - M_{MCD})(x_{i_j} - M_{MCD})^T$
• The cut-off for squared Mahalanobis distances based on p variables is generally a quantile of the chi-square distribution:
    $D_0 = \chi^2_p(0.975)$
• Under robust Mahalanobis distances $RD_i$, use the adjusted cut-off:
    $D_0 = \chi^2_p(0.975) \cdot \frac{median\{RD_1, \ldots, RD_n\}}{\chi^2_p(0.5)}$

Outlier Detection (continued)
• Robust Mahalanobis distances are calculated on distance travelled and minutes to work:
    DistanceToWork = geodist(latitude, longitude, POW_latitude, POW_longitude, 'DM');
• Determine explanatory variables predictive of distance travelled to produce classes: mode of transport, sex, earnings and occupation
• SAS macro: 'Robcov' Version 1.3-2 (written by Michael Friendly)
• Collapse classes to at least 20 individuals and calculate the robust Mahalanobis distance, flagging a record if it exceeds the critical value
• Reduced the dataset to 283,423 records with no missing values and a high degree of consistency: 60,007 outliers (21.2%), reduced to 59,080 outliers (20.8%) after deleting the 'other' mode of transport

Coherence Function
• The coherence function uses the maximum and minimum velocity for each mode of transport m, based on the set of non-outliers (time in minutes, so the factor 60 gives distance per hour):
    $maxVelocity_m = \max \{ \frac{dist_r}{time_r} \times 60 \mid \forall r \in data_{nonOutliers},\ r_{modeTransport} = m \}$
    $minVelocity_m = \min \{ \frac{dist_r}{time_r} \times 60 \mid \forall r \in data_{nonOutliers},\ r_{modeTransport} = m \}$
    $meanVelocity_m = \frac{maxVelocity_m + minVelocity_m}{2}$
• Assign high coherence to individuals whose travelled distance is close to the mean, and low coherence to individuals whose travelled distance is far from the mean
• Use it as the objective function to guide perturbation, where we aim to obtain a higher coherence for outliers

Record Swapping
• Pair outliers with different workplaces by swapping their places of residence, so that the coherence function increases for at least one of the outliers and decreases for neither
• Carry out within classes: mode of transport, sex and age group
• Split outliers according to workplace, and calculate the coherence function obtained by swapping the residence of an outlier with every other outlier in a different workplace
• If one of the outliers has higher coherence, then swap
• Continue iteratively

Hot Deck
• Impute the residence of an outlier with the residence of a non-outlier within the same class and having the same workplace
• Two approaches for selecting the donor (note: this needs more than one individual in the workplace):
    1. Candidate donors are those whose distance to work lies within the coherence range of distances; select the donor that maximizes the coherence function, i.e. the candidate donor whose distance to work is closest to the mean velocity
    2. Instead of the coherence function, choose the donor from the non-outliers in the same workplace having similar travelled minutes (nearest neighbor)

Results
Outlier status before and after perturbation (column percentages in parentheses):

Original   Total              After Swapping                      After HD (Coherence Measure)        After HD (Minutes)
outlier                       Yes              No                 Yes              No                 Yes              No
Yes        59,080 (20.9%)     42,788 (92.0%)   16,292 (6.9%)      27,099 (76.2%)   31,981 (12.9%)     28,123 (79.3%)   30,957 (12.5%)
No         224,343 (79.2%)    3,731 (8.0%)     220,612 (93.1%)    8,456 (23.8%)    215,887 (87.1%)    7,321 (20.7%)    217,022 (87.5%)
Total      283,423 (100%)     46,519 (100%)    236,904 (100%)     35,555 (100%)    247,868 (100%)     35,444 (100%)    247,979 (100%)

• Swapping corrected fewer outliers than the hot deck methods (16K vs 31K), but swapping was carried out only on outliers
• Some non-outliers became outliers because the perturbation changed the distribution structure (4K under swapping vs 8K under hot deck)
• The number of non-outliers defined as outliers following perturbation was much smaller than the number of outliers corrected to non-outliers

Results (continued)
Information loss for bivariate counts, where $freq_{ij}$ and $freq'_{ij}$ denote the cell counts before and after perturbation:

Normalized Absolute Difference:
    $dist(attr_1, attr_2) = \sum_{i \in Dom(attr_1)} \sum_{j \in Dom(attr_2)} |freq_{ij} - freq'_{ij}| \, / \, 2N$
Normalized Hellinger's Distance:
    $dist(attr_1, attr_2) = \sqrt{0.5 \sum_{i \in Dom(attr_1)} \sum_{j \in Dom(attr_2)} (\sqrt{freq_{ij}} - \sqrt{freq'_{ij}})^2 \, / \, N}$

Variable crossed   Normalized Absolute Difference         Normalized Hellinger's Distance
with PUMA          Swapping   HD Minutes   HD Coherence   Swapping   HD Minutes   HD Coherence
AGE9               0          0.109        0.095          0          0.154        0.134
AGEP               0.048      0.119        0.107          0.059      0.166        0.147
SEX                0          0.113        0.095          0          0.161        0.140
OCCUPATION         0.094      0.134        0.125          0.164      0.215        0.203
EARNINGS           0.024      0.104        0.089          0.029      0.148        0.129
MODE               0          0.104        0.087          0          0.154        0.131

• Individuals who had their PUMA changed due to the perturbation: Swapping Method: 56,562; Hot Deck Method (Minutes): 53,945; Hot Deck Method (Coherence): 53,181
• Hot deck methods perturb bivariate counts more than swapping, since swapping does not change marginal frequencies
• Hot deck using the coherence function approach resulted in less information loss than the nearest neighbor approach

Discussion
• Record swapping had the lowest information loss (especially for bivariate counts of the swapping variable with other control variables) but reduced the number of outliers by only 21.3% net (59,080 to 46,519), while the hot deck methods reduced it by about 40.0% (to 35,555 and 35,444)
• The hot deck methods transformed more non-outliers into outliers than record swapping did
• The recommendation would be to carry out both methods, starting with record swapping and then applying the hot deck method to the remaining outliers
• Survey weights need to be recalibrated to the new place of residence, but including calibration variables as controls minimizes distortion to the survey weights, especially under record swapping

Thank you for your attention
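
Backup: information loss measures (illustrative sketch)
The two distance measures reported in the Results can be sketched in Python as follows. This is a minimal illustration, not the code used in the study. It assumes that N is the total number of records and that each record is a dictionary of attribute values; the function names and the toy data are hypothetical.

```python
from collections import Counter
from math import sqrt


def cell_counts(records, attr1, attr2):
    """Bivariate frequency table freq_ij as a Counter keyed by (value1, value2)."""
    return Counter((r[attr1], r[attr2]) for r in records)


def normalized_absolute_difference(orig, pert, n):
    """Sum of |freq_ij - freq'_ij| over all cells, divided by 2N."""
    cells = set(orig) | set(pert)  # a Counter returns 0 for absent cells
    return sum(abs(orig[c] - pert[c]) for c in cells) / (2 * n)


def normalized_hellinger(orig, pert, n):
    """sqrt(0.5 * sum((sqrt(freq_ij) - sqrt(freq'_ij))^2) / N)."""
    cells = set(orig) | set(pert)
    total = sum((sqrt(orig[c]) - sqrt(pert[c])) ** 2 for c in cells)
    return sqrt(0.5 * total / n)


# Toy data (hypothetical): four records with a PUMA and a sex attribute.
original = [
    {"puma": "A", "sex": "F"},
    {"puma": "A", "sex": "M"},
    {"puma": "B", "sex": "F"},
    {"puma": "B", "sex": "M"},
]
# Swapping residences between the two "F" records leaves every cell count unchanged.
swapped = [
    {"puma": "B", "sex": "F"},
    {"puma": "A", "sex": "M"},
    {"puma": "A", "sex": "F"},
    {"puma": "B", "sex": "M"},
]
# A hot-deck style imputation moves the first record from PUMA A to PUMA B.
imputed = [
    {"puma": "B", "sex": "F"},
    {"puma": "A", "sex": "M"},
    {"puma": "B", "sex": "F"},
    {"puma": "B", "sex": "M"},
]

n = len(original)
f_orig = cell_counts(original, "puma", "sex")
f_swap = cell_counts(swapped, "puma", "sex")
f_imp = cell_counts(imputed, "puma", "sex")

nad_swap = normalized_absolute_difference(f_orig, f_swap, n)
nad_imp = normalized_absolute_difference(f_orig, f_imp, n)
hd_swap = normalized_hellinger(f_orig, f_swap, n)
hd_imp = normalized_hellinger(f_orig, f_imp, n)
```

In the toy data, swapping within a class defined by sex leaves the PUMA by sex table untouched, so both measures are zero, which mirrors the zeros reported for the swapping control variables. The imputation changes two cells and gives non-zero values.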
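
Backup: velocity bounds and swap rule (illustrative sketch)
The velocity bounds from the Coherence Function slide and the acceptance rule from the Record Swapping slide can be sketched in Python as below. The slides do not give a closed form for the coherence score, so the `coherence` function here is an assumption made purely for illustration: it scores 1 at the mean velocity and falls linearly to 0 at the minimum or maximum velocity. The record layout and all function names are also hypothetical.

```python
def velocity_bounds(records):
    """Return {mode: (min_v, max_v, mean_v)} computed from non-outliers only.

    Velocity is dist / minutes * 60, i.e. distance per hour.
    """
    speeds = {}
    for r in records:
        if r["outlier"]:
            continue
        v = r["dist"] / r["minutes"] * 60
        speeds.setdefault(r["mode"], []).append(v)
    return {
        m: (min(v), max(v), (min(v) + max(v)) / 2) for m, v in speeds.items()
    }


def coherence(dist, minutes, bounds):
    """ASSUMED score in [0, 1]: 1 at the mean velocity, 0 at or beyond the bounds."""
    lo, hi, mid = bounds
    v = dist / minutes * 60
    half_range = (hi - lo) / 2
    if half_range == 0:
        return 1.0 if v == mid else 0.0
    return max(0.0, 1 - abs(v - mid) / half_range)


def should_swap(before, after):
    """Accept a swap if no coherence decreases and at least one increases.

    `before` and `after` are the coherence values of the two paired outliers.
    """
    no_decrease = all(a >= b for a, b in zip(after, before))
    some_increase = any(a > b for a, b in zip(after, before))
    return no_decrease and some_increase


# Toy data (hypothetical): two non-outlying cyclists and one outlying cyclist.
data = [
    {"mode": "bike", "dist": 2.0, "minutes": 12.0, "outlier": False},
    {"mode": "bike", "dist": 5.0, "minutes": 20.0, "outlier": False},
    {"mode": "bike", "dist": 30.0, "minutes": 60.0, "outlier": True},
]
bounds = velocity_bounds(data)["bike"]

# Coherence of the outlier with its own residence, and with a swapped
# residence that shortens the commute to 12 distance units.
coh_before = coherence(30.0, 60.0, bounds)
coh_after = coherence(12.0, 60.0, bounds)

accept = should_swap((coh_before, 0.3), (coh_after, 0.3))
reject = should_swap((coh_before, 0.3), (coh_after, 0.2))
```

Only the distance changes when a residence is swapped, because minutes to work stays attached to the individual. The second call is rejected because the partner's coherence would fall, which is the "without decreasing coherence" condition.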