Outlier detection and accommodation for business surveys utilizing multiple linear regression models in edit and imputation
Robert Philips
ICES-III, June 21st, 2007

Presentation Outline
- E & I for the Monthly Wholesale Retail Trade Survey (MWRTS)
- Outlier Model and Theory
- Illustrative Example
- Outlier Procedure for “large” imputation cells
- Simulation results
- Conclusion

E & I for the MWRTS
Statistical edits are run prior to imputation and in part identify which of the respondent data will be used to impute for non-respondents. Statistical editing is done at the industrial grouping by geography level; if there are not enough units, the groupings are collapsed over geography. The Hidiroglou-Berthelot method (1986) is used in conjunction with monthly, yearly and administrative data trend edits.

E & I for the MWRTS (cont.)
For most E & I classes in the MWRTS the model is of the following form:
$$y = X\beta + \sigma W^{1/2}\varepsilon;$$
(I) $y_{i,t} = \beta_1 y_{i,t-1} + \beta_2 y_{i,t-12} + \sigma\sqrt{w_i}\,\varepsilon_i$
(II) $y_{i,t} = \beta_3 y_{i,t-m} + \sigma\sqrt{w_i}\,\varepsilon_i$
where the weight $w_i$ is tied to the lagged value of the unit ($y_{i,t-1}$ in (I), $y_{i,t-m}$ in (II)).

E & I for the MWRTS (cont.)
The imputation classes are at a finer level of detail than the statistical edit groupings. The principal method for imputation is the bivariate model (60%), and respondents who have passed the univariate statistical edits might still be outliers with respect to the imputation model. There is clearly a need for an outlier detection routine in the imputation module.

Outlier Model and Theory
Let $M_{(i)}$, with $(i) = (i_1, \ldots, i_k)$, represent the model
$$y_{i_m} = x_{i_m}^{t}\beta + \delta_{i_m} + \sigma w_{i_m i_m}^{1/2}\varepsilon_{i_m}, \quad m = 1, \ldots, k,$$
in which $\delta_{i_m}$ shifts the mean of the $m$-th outlier, while the majority of observations satisfy
$$y_{(i)} = X_{(i)}\beta + \sigma W_{(i)}^{1/2}\varepsilon,$$
where $y_{(i)}$ is the $(n-k) \times 1$ vector of $y_i$'s omitting $y_{i_1}, \ldots, y_{i_k}$, and similarly for $X_{(i)}$ and $W_{(i)}$.

Outlier Model and Theory cont…
The priors for the parameters are:
$$P(M_{(i)} \mid k) = \binom{n}{k}^{-1}, \quad (i) \in S_k^n \quad \text{(each model is equally likely)},$$
$$\delta_{i_m} \mid \sigma, W \sim U\!\left(-\tfrac{1}{2}\sigma w_{i_m i_m}^{1/2}\tau_{i_m},\; \tfrac{1}{2}\sigma w_{i_m i_m}^{1/2}\tau_{i_m}\right), \quad m = 1, \ldots, n,$$
$$k \sim \text{Poisson}(\lambda), \qquad p(\beta, \sigma^2) \propto \frac{1}{\sigma^2}, \quad \sigma^2 \in R^{+}, \; \beta \in R^{p},$$
and the width factors $\tau_{i_1}, \ldots, \tau_{i_n}$ are iid from some distribution.
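A short sketch of why these priors give a closed form may help here. It assumes a diagonal $W$ and uniform intervals wide enough to cover essentially all of the likelihood for each shift. The labels $\delta_{i_m}$ (shift of the $m$-th outlier) and $\tau_{i_m}$ (width factor of its uniform prior, so that the interval has length $\sigma w_{i_m i_m}^{1/2}\tau_{i_m}$) are notation assumed for this sketch, and $S_{(i)}$ is the SSE of the regression without the $k$ suspected outliers.

```latex
% Sketch only: diagonal W and wide uniform intervals are assumed.
\begin{align*}
% Step 1: integrate the shift of one suspected outlier against its uniform prior.
\int N\!\left(y_{i_m};\; x_{i_m}^{t}\beta + \delta_{i_m},\; \sigma^{2} w_{i_m i_m}\right)
  \frac{d\delta_{i_m}}{\sigma\, w_{i_m i_m}^{1/2}\,\tau_{i_m}}
  &\approx \frac{1}{\sigma\, w_{i_m i_m}^{1/2}\,\tau_{i_m}}, \\
% Step 2: integrate beta and sigma^2 over the remaining n - k observations.
p\!\left(y \mid M_{(i)}\right)
  &\propto \left| X_{(i)}^{t} W_{(i)}^{-1} X_{(i)} \right|^{-1/2}
     S_{(i)}^{-(n-p)/2}\,
     \prod_{m=1}^{k} \frac{\sqrt{2\pi}}{\tau_{i_m}} .
\end{align*}
```

Each shift contributes a factor $\sigma^{-1}$, so after $\beta$ and $\sigma^2$ are integrated out the exponent of $S_{(i)}$ is $(n-p)/2$ rather than $(n-k-p)/2$, which is consistent with the $t_{(n-p)}$ posterior for $\beta$ given later in the deck. Under these assumptions, dividing by the same expression for the no-outlier model leaves a ratio of determinants and sums of squares, times factors that involve only the prior widths.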
Outlier Model and Theory cont…
Let
$$q_{(i)} = \frac{|X^{t} W^{-1} X|^{1/2}}{|X_{(i)}^{t} W_{(i)}^{-1} X_{(i)}|^{1/2}} \left(\frac{S}{S_{(i)}}\right)^{(n-p)/2}, \qquad Q_k = \sum_{(i) \in S_k^n} q_{(i)},$$
where $S$ is the SSE of the full regression and $S_{(i)}$ is the SSE after omitting $(y_{i_1}, \ldots, y_{i_k})$ from the regression. For $k = 0, 1, \ldots, k_{\max}$,
$$p^{*}_k = \text{Prob}(k \text{ outliers} \mid X, W, y) = C\,\frac{(0.1)^{k}}{\binom{n}{k}\,k!}\,Q_k.$$

Outlier Model and Theory cont…
The posterior estimate of the number of outliers is
$$k^{*} = E(k \mid X, W, y) = \sum_{k=0}^{k_{\max}} k\,p^{*}_k,$$
and
$$\text{Prob}(y_{i_1}, \ldots, y_{i_k} \text{ are the outliers} \mid k, X, W, y) = \frac{q_{(i)}}{Q_k} = p_{(i)}.$$
The set of indices $(i_1, \ldots, i_{k^{*}})$ where $\max p_{(i)}$ is attained determines the most likely outliers $y_{i_1}, \ldots, y_{i_{k^{*}}}$.

Outlier Model and Theory cont…
$\beta \mid k, M_{(i)}, X_{(i)}, W_{(i)}, y_{(i)}$ follows a $p$-variate $t_{(n-p)}$ distribution with mean
$$\hat{\beta}_{(i)} = (X_{(i)}^{t} W_{(i)}^{-1} X_{(i)})^{-1} X_{(i)}^{t} W_{(i)}^{-1} y_{(i)}$$
and variance
$$\frac{S_{(i)}}{n-p-2}\,(X_{(i)}^{t} W_{(i)}^{-1} X_{(i)})^{-1}.$$
The posterior estimate of $\beta$ is
$$\beta^{*} = \sum_{k} p^{*}_k \sum_{(i) \in S_k^n} p_{(i)}\,\hat{\beta}_{(i)}.$$

Outlier Model and Theory cont…
The posterior variance of $\beta$ is
$$V(\beta \mid X, W, y) = \sum_{k} p^{*}_k D_k - \beta^{*}\beta^{*t},$$
with
$$D_k = \sum_{(i) \in S_k^n} p_{(i)} \left[ \frac{S_{(i)}}{n-p-2}\,(X_{(i)}^{t} W_{(i)}^{-1} X_{(i)})^{-1} + \hat{\beta}_{(i)}\hat{\beta}_{(i)}^{t} \right].$$
Similarly, the posterior estimate of $\sigma^2$ is
$$\sigma^{*2} = \sum_{k} p^{*}_k \sum_{(i) \in S_k^n} p_{(i)}\,\frac{S_{(i)}}{n-p-2}.$$

An illustrative example
$x_i \sim \text{Exp}(200)$, $y_i = 30 + 0.95\,x_i + 25\,\varepsilon_i$, $i = 1, \ldots, 25$, and the errors were drawn from a contaminated normal, $\varepsilon_i \sim 0.95\,N(0, 1) + 0.05\,N(0, 10^2)$.

[Figure: plot of simulated data, Y (0 to 1200) against X (0 to 600), with observations 10, 19 and 25 labelled.]

Example cont…
k       0        1        2        3        4
p*_k    0.0000   0.3168   0.3652   0.2143   0.1037

The posterior estimate of the number of outliers is 2.105, with estimates of 34.755 and 0.967 for the intercept and slope.

Example cont…
The MM-estimator (Yohai 1987), a high-breakdown M-estimator, indicates that observation 25 is an outlier with high leverage and that observation 10 has high leverage only. Its parameter estimates are intercept = 35.465 and slope = 0.9629.
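The enumeration behind these posterior quantities can be sketched in Python. This is a minimal illustration under stated assumptions, not the production routine: $W$ is taken to be diagonal (the identity in the demo), Exp(200) is read as an exponential with mean 200, the names `outlier_posterior` and `_fit` are hypothetical, and the constant $C$ is fixed by normalizing over $k = 0, \ldots, k_{\max}$. The seed is arbitrary, so the demo does not reproduce the figures on the slides.

```python
from itertools import combinations
from math import comb, factorial, log

import numpy as np


def _fit(Xs, ys):
    """Least squares on pre-weighted data; returns (beta_hat, SSE, log|X'X|)."""
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    resid = ys - Xs @ beta
    _, logdet = np.linalg.slogdet(Xs.T @ Xs)
    return beta, float(resid @ resid), float(logdet)


def outlier_posterior(X, y, w=None, k_max=2, lam=0.1):
    """Enumerate every subset of at most k_max candidate outliers.

    w is the diagonal of W (assumption: W is diagonal); lam plays the role
    of the 0.1 in the factor (0.1)^k / (C(n, k) k!).
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n, p = X.shape
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    # Dividing rows by sqrt(w) turns the weighted fit into an ordinary one.
    Xs, ys = X / np.sqrt(w)[:, None], y / np.sqrt(w)
    _, S_full, logdet_full = _fit(Xs, ys)

    subsets, log_wts, betas = [], [], []
    for k in range(k_max + 1):
        log_prior = k * log(lam) - log(comb(n, k)) - log(factorial(k))
        for idx in combinations(range(n), k):
            keep = np.array([j not in idx for j in range(n)])
            beta, S_i, logdet_i = _fit(Xs[keep], ys[keep])
            # log q_(i): determinant ratio times (S / S_(i))^((n - p) / 2)
            log_q = 0.5 * (logdet_full - logdet_i) \
                + 0.5 * (n - p) * (log(S_full) - log(S_i))
            subsets.append(idx)
            log_wts.append(log_prior + log_q)
            betas.append(beta)

    log_wts = np.array(log_wts)
    post = np.exp(log_wts - log_wts.max())
    post /= post.sum()  # joint posterior p*_k * p_(i) over all subsets
    sizes = np.array([len(s) for s in subsets])
    p_k = np.array([post[sizes == k].sum() for k in range(k_max + 1)])
    return {
        "p_k": p_k,                                   # p*_k, k = 0..k_max
        "k_star": float(np.arange(k_max + 1) @ p_k),  # E(k | X, W, y)
        "beta_star": post @ np.array(betas),          # model-averaged beta
        "subsets": subsets,
        "post": post,
    }


if __name__ == "__main__":
    rng = np.random.default_rng(2007)  # arbitrary seed
    n = 25
    x = rng.exponential(scale=200.0, size=n)
    contaminated = rng.random(n) < 0.05
    eps = np.where(contaminated, rng.normal(0, 10, n), rng.normal(0, 1, n))
    y = 30 + 0.95 * x + 25 * eps
    res = outlier_posterior(np.column_stack([np.ones(n), x]), y, k_max=2)
    print("p*_k:", res["p_k"].round(4), "k*:", round(res["k_star"], 3))
    print("beta*:", res["beta_star"].round(3))
```

Working with log weights avoids overflow when $(S/S_{(i)})^{(n-p)/2}$ is very large. The double loop also makes the cost visible: the number of fits grows as $\sum_k \binom{n}{k}$, so exhaustive enumeration is only practical for small cells.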
Example cont…
If there are only two possible outliers, then (19, 25) is about 60 times more likely than (10, 25)!

i_1   i_2   p_(i)   Intercept   Slope
19    25    0.713   33.492      0.965
16    25    0.036   37.393      0.973
1     25    0.029   40.039      0.958
15    25    0.021   37.090      0.955
6     25    0.019   31.667      0.983
22    25    0.015   37.280      0.970
4     25    0.014   37.524      0.968
5     25    0.013   38.032      0.965
13    25    0.013   32.614      0.980
10    25    0.012   33.747      0.987

Outlier Model and Theory cont…
Strengths: the method works well in detecting outliers and estimates the relevant parameters robustly. All of the data are used.
Drawback: the method becomes impractical as the imputation class size n increases, since the number of possible subsets of size k becomes astronomically large; n = 50 and k = 4 already give 230,300 subsets.

Outlier Procedure for “large” imputation cells
For large n, on each pass test whether $k = 0$ or $k = 1$, using $p^{*}_0$, $p_{(i_m)}$ and the leverage $h_{i_m i_m}$ to identify the outliers.

ident   h_ii     p_i      P_0      b0        b1       outlier
25      0.1849   1.0000   0.0000   35.6930   0.9708   Y
19      0.0443   0.6768   0.6676   35.2458   0.9695   Y
16      0.0443   0.1338   0.8207   33.5584   0.9651   N

The standard errors for $\beta_0$ and $\beta_1$ are 8.5023 and 0.0343 respectively. The estimate $\sigma^{*}$ is 25.0655.

Simulation results
Data from the MRTS were selected for 3 imputation classes in which the number of respondents for the bivariate imputation model exceeded 50. For a given simulation, (100 - p)% of the units in each cell were selected and used to impute for the remaining units. The method presented here was compared to the MM-estimator using the relative difference of the average predictions.

Simulation results for 200 runs

p%   Method   Cell 1   Cell 2   Cell 3
5    MRTS     -0.002   -0.020   0.013
5    MM       -0.008   -0.028   0.001
10   MRTS     -0.000   -0.007   0.015
10   MM       -0.008   -0.015   0.003
15   MRTS     -0.000   -0.011   0.014
15   MM       -0.008   -0.016   0.002

Conclusions
The procedure for outlier detection works well and produces fairly robust estimates. It would also allow for more covariates to be included in the E&I process.
Even though the assumption of normality led to the closed-form solution of the estimator, the procedure is still applicable where modest departures from normality arise.

Thank you!
For more information please contact Robert Philips - e-mail: [email protected] - telephone: (613) 951-1493
www.statcan.ca