Improvements to the Methodology Used in the Production of Tax

Outlier detection and accommodation for business
surveys utilizing multiple linear regression models in
edit and imputation
Robert Philips
ICES-III
June 21st, 2007
Presentation Outline
E & I for the Monthly Wholesale Retail
Trade Survey (MWRTS)
Outlier Model and Theory
Illustrative Example
Outlier Procedure for “large” imputation
cells
Simulation results
Conclusion
E & I for the MWRTS
Statistical edits are run prior to imputation and in
part identify which of the respondent data will be
used to impute non-respondents.
Statistical editing is done at the industrial
grouping by geography level; if not enough units
then collapse over geography.
The Hidiroglou-Berthelot method (1986) is used in
conjunction with monthly, yearly and
administrative data trend edits.
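As a rough illustration of the ratio edit named above, the sketch below follows the usual textbook form of the Hidiroglou-Berthelot method. The function name `hb_outliers` and the parameter values `U`, `A` and `C` are assumptions chosen for illustration, not the settings used in the MWRTS.

```python
import numpy as np

def hb_outliers(prev, curr, U=0.5, A=0.05, C=4.0):
    """Flag units with unusual period-to-period ratios (HB-style edit).

    U, A and C are illustrative tuning constants (assumed values).
    """
    prev = np.asarray(prev, dtype=float)
    curr = np.asarray(curr, dtype=float)
    r = curr / prev                      # ratios; values must be strictly positive
    r_med = np.median(r)
    # Symmetrising transformation of the ratios around their median.
    s = np.where(r < r_med, 1.0 - r_med / r, r / r_med - 1.0)
    # Size effect: larger units carry more weight through max(prev, curr) ** U.
    e = s * np.maximum(prev, curr) ** U
    q1, med, q3 = np.percentile(e, [25, 50, 75])
    # The A * med term guards against quartiles that sit too close to the median.
    d_q1 = max(med - q1, abs(A * med))
    d_q3 = max(q3 - med, abs(A * med))
    return (e < med - C * d_q1) | (e > med + C * d_q3)

prev = np.full(10, 100.0)
curr = np.array([101, 99, 102, 98, 100, 103, 97, 101, 99, 500], dtype=float)
flags = hb_outliers(prev, curr)
```

Here only the last unit, whose value jumps from 100 to 500, falls outside the acceptance interval.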
E & I for the MWRTS (cont.)
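In general, for most E & I classes in the MWRTS
the model is of the following form:
\[ y = X\beta + \sigma W^{1/2}\varepsilon; \]
\[ (I) \quad y_{i,t} = \beta_1 y_{i,t-1} + \beta_2 y_{i,t-12} + \sigma\sqrt{y_{i,t-1}/w_i}\;\varepsilon_i \]
\[ (II) \quad y_{i,t} = \beta_3 y_{i,t-m} + \sigma\sqrt{y_{i,t-m}/w_i}\;\varepsilon_i \]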
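The general form above is a weighted regression, so a minimal weighted least squares sketch may help fix ideas. The function name `wls_fit` and the toy data are hypothetical, and the choice of variance proportional to the previous value is an assumption made only for this example.

```python
import numpy as np

def wls_fit(X, y, w):
    """Weighted least squares for y = X beta + sigma * W^{1/2} eps, W = diag(w).

    Returns the estimate (X' W^-1 X)^-1 X' W^-1 y and the weighted SSE.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    Xw = X / w[:, None]                       # rows of X divided by w_i, i.e. W^-1 X
    beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
    resid = y - X @ beta
    sse = float(resid @ (resid / w))          # residual sum of squares in the W^-1 metric
    return beta, sse

# Toy version of model (II): one regressor, the value from an earlier period.
y_prev = np.array([10.0, 20.0, 30.0, 40.0])
y_curr = np.array([11.0, 19.0, 33.0, 41.0])
# Assumed variance structure: error variance proportional to the previous value.
beta, sse = wls_fit(y_prev[:, None], y_curr, w=y_prev)
```

With this variance structure the slope reduces to the ratio of totals, sum(y_curr) / sum(y_prev), which is a convenient sanity check.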
E & I for the MWRTS (cont.)
The imputation classes are at a finer level
of detail than the statistical edit groupings.
The principal method for imputation is the
bivariate model (60%). Respondents
who have passed the univariate statistical
edits might still be considered
outliers during the imputation process.
There is therefore a clear need for an outlier
detection routine in the imputation
module.
Outlier Model and Theory
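Let $M_{(i)}$ with $(i) = (i_1, \ldots, i_k)$ represent the model:
\[ y_{i_m} = x_{i_m}^t \beta + \delta_{i_m} + \sigma\sqrt{w_{i_m i_m}}\;\varepsilon_{i_m}, \quad \text{for } m = 1, \ldots, k, \]
while the majority of observations satisfy
\[ y_{(i)} = X_{(i)}\beta + \sigma W_{(i)}^{1/2}\varepsilon, \]
where $y_{(i)}$ is the $(n-k) \times 1$ vector of $y_i$'s omitting $y_{i_1}, \ldots, y_{i_k}$, and similarly for $X_{(i)}$ and $W_{(i)}$.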
Outlier Model and Theory cont…
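The priors for the parameters are:
\[ P(M_{(i)} \mid k) = \binom{n}{k}^{-1}, \quad (i) \in S_k^n \quad \text{(each model is equally likely)}, \]
$\delta_{i_m} \mid \sigma, W$ is uniform on an interval symmetric about zero whose width scales with $\sigma\sqrt{w_{i_m i_m}}$, $m = 1, \ldots, n$,
\[ k \sim \text{Poisson}(\lambda), \quad p(\beta, \sigma^2) \propto \frac{1}{\sigma^2}, \quad \sigma^2 \in R^+, \; \beta \in R^p, \]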
and  i1 ,,  in iid some .
Outlier Model and Theory cont…
Let
\[ q_{(i)} = \frac{|X^t W^{-1} X|^{1/2}}{|X_{(i)}^t W_{(i)}^{-1} X_{(i)}|^{1/2}} \left( \frac{S}{S_{(i)}} \right)^{(n-p)/2}, \qquad Q_k = \sum_{(i) \in S_k^n} q_{(i)}, \]
where $S_{(i)}$ is the SSE after omitting $(y_{i_1}, \ldots, y_{i_k})$ from the regression.

For $k = 0, 1, \ldots, k_{max}$,
\[ p_k^* = \text{Prob}(k \text{ outliers} \mid X, W, y) = C \, \frac{(0.1)^k}{k!} \, \frac{Q_k}{\binom{n}{k}}. \]
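To make the quantities q_(i), Q_k and p*_k concrete, the following sketch enumerates every subset of size k for a small data set. The names `outlier_count_posterior` and `_fit` are hypothetical, `lam=0.1` is taken from the (0.1)^k factor, and C is treated as the constant that makes the p*_k sum to one. It is only a brute-force illustration under those assumptions.

```python
from itertools import combinations
from math import comb, factorial
import numpy as np

def _fit(X, y, w):
    """Return the weighted SSE and log|X' W^-1 X| for one regression."""
    Xw = X / w[:, None]
    XtWX = X.T @ Xw
    beta = np.linalg.solve(XtWX, Xw.T @ y)
    r = y - X @ beta
    return float(r @ (r / w)), np.linalg.slogdet(XtWX)[1]

def outlier_count_posterior(X, y, w, k_max=2, lam=0.1):
    X, y, w = (np.asarray(a, dtype=float) for a in (X, y, w))
    n, p = X.shape
    sse0, logdet0 = _fit(X, y, w)              # full-data regression
    q, weights = {}, []
    for k in range(k_max + 1):
        Q_k = 0.0
        for idx in combinations(range(n), k):
            keep = np.ones(n, dtype=bool)
            keep[list(idx)] = False            # omit the candidate outliers
            sse, logdet = _fit(X[keep], y[keep], w[keep])
            # log q = 0.5 * (log-determinant ratio) + (n - p)/2 * log(S / S_(i))
            log_q = 0.5 * (logdet0 - logdet) + 0.5 * (n - p) * np.log(sse0 / sse)
            q[idx] = float(np.exp(log_q))
            Q_k += q[idx]
        weights.append(lam**k / factorial(k) * Q_k / comb(n, k))
    p_star = np.array(weights) / sum(weights)  # normalise so the p*_k sum to 1
    return p_star, q
```

The empty subset has q equal to 1 by construction, so the k = 0 weight is always 1 before normalisation.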
Outlier Model and Theory cont…
The posterior estimate of the number of outliers is
\[ k^* = E(k \mid X, W, y) = \sum_{k=0}^{k_{max}} k \, p_k^*, \]
and
\[ \text{Prob}(y_{i_1}, \ldots, y_{i_k} \text{ are the outliers} \mid k, X, W, y) = \frac{q_{(i)}}{Q_k} = p_{(i)}. \]
The set of indices $(i_1, \ldots, i_{k^*})$ where $\max p_{(i)}$ is attained
determines the most likely outliers $y_{i_1}, \ldots, y_{i_{k^*}}$.
Outlier Model and Theory cont…
\[ \beta \mid k, M_{(i)}, X_{(i)}, W_{(i)}, y_{(i)} \sim p\text{-variate } t_{(n-p)} \]
with mean
\[ \hat\beta_{(i)} = (X_{(i)}^t W_{(i)}^{-1} X_{(i)})^{-1} X_{(i)}^t W_{(i)}^{-1} y_{(i)} \]
and variance
\[ \frac{S_{(i)}}{n-p-2} \, (X_{(i)}^t W_{(i)}^{-1} X_{(i)})^{-1}, \]
$\Rightarrow$ the posterior estimate of $\beta$ is
\[ \beta^* = \sum_k p_k^* \left( \sum_{(i) \in S_k^n} p_{(i)} \hat\beta_{(i)} \right). \]
Outlier Model and Theory cont…
The posterior variance of $\beta$ is
\[ V(\beta \mid X, W, y) = \sum_k p_k^* D_k - \beta^* \beta^{*t}, \]
where
\[ D_k = \sum_{(i) \in S_k^n} p_{(i)} \left[ \frac{S_{(i)}}{n-p-2} \, (X_{(i)}^t W_{(i)}^{-1} X_{(i)})^{-1} + \hat\beta_{(i)} \hat\beta_{(i)}^t \right]. \]
Similarly, the posterior estimate of $\sigma^2$ is
\[ \sigma^{*2} = \sum_k p_k^* \left( \sum_{(i) \in S_k^n} p_{(i)} \, \frac{S_{(i)}}{n-p-2} \right). \]
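The posterior mean and variance of β on this slide are the moments of a finite mixture over candidate models. The sketch below shows that calculation in general form; the function name `mixture_moments` is hypothetical, and each weight is assumed to be the product p*_k p_(i) for one candidate model.

```python
import numpy as np

def mixture_moments(probs, means, covs):
    """Mean and covariance of a finite mixture.

    probs : (M,) model weights summing to 1 (assumed to be p*_k * p_(i))
    means : (M, p) per-model posterior means
    covs  : (M, p, p) per-model posterior covariance matrices
    """
    probs = np.asarray(probs, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    mean = probs @ means
    # Weighted average of (covariance + outer product of the mean), per model.
    outer = np.einsum("mi,mj->mij", means, means)
    second = np.einsum("m,mij->ij", probs, covs + outer)
    return mean, second - np.outer(mean, mean)

# Two equally likely models with unit variance and means 0 and 2.
mean, cov = mixture_moments([0.5, 0.5], [[0.0], [2.0]], [[[1.0]], [[1.0]]])
```

In the toy call the mixture variance is 2: one unit from within-model variance and one from the disagreement between the two model means.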
An illustrative example
\[ x_i \sim \text{Exp}(200), \qquad y_i = 30 + 0.95\, x_i + 25\, \varepsilon_i, \quad i = 1, \ldots, 25, \]
and the errors were from a contaminated normal
\[ \varepsilon_i \sim 0.95\, N(0, 1) + 0.05\, N(0, 10^2). \]
Plot of simulated data
[Figure: plot of Y against X; observations 10, 19 and 25 are labelled.]
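Data of the kind used in the illustrative example could be generated along the following lines. This is a sketch under stated assumptions: Exp(200) is read as an exponential distribution with mean 200, N(0, 10^2) as a normal with standard deviation 10, and the function name `simulate_example` and the seed are hypothetical, so the draws will not reproduce the plotted data set.

```python
import numpy as np

def simulate_example(n=25, seed=0):
    """Simulate y = 30 + 0.95 x + 25 eps with contaminated-normal errors."""
    rng = np.random.default_rng(seed)
    x = rng.exponential(scale=200.0, size=n)      # assumed: mean 200
    contaminated = rng.random(n) < 0.05           # 5% of errors come from the wide component
    eps = np.where(contaminated,
                   rng.normal(0.0, 10.0, size=n), # N(0, 10^2)
                   rng.normal(0.0, 1.0, size=n))  # N(0, 1)
    y = 30.0 + 0.95 * x + 25.0 * eps
    return x, y, contaminated

x, y, contaminated = simulate_example()
```

The `contaminated` mask records which errors came from the wide component, which is useful when checking whether a detection method finds them.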
Example cont…
  k     p*_k
  0     0.0000
  1     0.3168
  2     0.3652
  3     0.2143
  4     0.1037
The posterior estimate of the number of
outliers is 2.105, with estimates of 34.755
and 0.967 for the intercept and slope.
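The posterior estimate of the number of outliers can be checked directly from the tabulated probabilities, since it is the expectation of k under p*_k. The variable names below are illustrative only.

```python
# Posterior probabilities p*_k for k = 0..4, copied from the example.
p_star = {0: 0.0000, 1: 0.3168, 2: 0.3652, 3: 0.2143, 4: 0.1037}

total = sum(p_star.values())                      # should be 1 up to rounding
k_star = sum(k * p for k, p in p_star.items())    # E(k | X, W, y)
print(round(k_star, 3))                           # prints 2.105
```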
Example cont…
The MM (Yohai 1987) M-estimator (high
breakdown) indicates that observation 25 is
an outlier with high leverage and
observation 10 is just of high leverage.
The estimates for the parameters are
intercept=35.465 and slope= 0.9629.
Example cont…
If there are only two possible outliers then
(19, 25) are 60 times more likely than (10, 25)!

  i_1   i_2   p_i     Intercept   Slope
  19    25    0.713   33.492      0.965
  16    25    0.036   37.393      0.973
   1    25    0.029   40.039      0.958
  15    25    0.021   37.090      0.955
   6    25    0.019   31.667      0.983
  22    25    0.015   37.280      0.970
   4    25    0.014   37.524      0.968
   5    25    0.013   38.032      0.965
  13    25    0.013   32.614      0.980
  10    25    0.012   33.747      0.987
Outlier Model and Theory cont…
Strengths: method works well in detecting
outliers and estimating the relevant
parameters robustly. All of the data is used.
Drawback: method becomes impractical as
the imputation class size n increases, since
the number of possible subsets of size k
will become astronomically large.
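The drawback can be quantified by counting the candidate models that full enumeration would have to fit. The helper name `n_candidate_models` is hypothetical, and k_max = 4 simply mirrors the range of k shown in the example.

```python
from math import comb

def n_candidate_models(n, k_max):
    """Number of subsets of size 0..k_max drawn from n observations."""
    return sum(comb(n, k) for k in range(k_max + 1))

# Count the regressions needed for a few imputation class sizes.
counts = {n: n_candidate_models(n, 4) for n in (25, 50, 100, 500)}
for n, count in counts.items():
    print(n, count)
```

For n = 25 this is 15,276 regressions, which is manageable; the count grows roughly like n^4 / 24 for k_max = 4, so classes with hundreds of units quickly become expensive.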
Outlier Procedure for “large”
imputation cells
$n$ large $\Rightarrow$ on each pass test if $k = 0$ or $1$ using
$p_0^*$, $p_{(i_m)}$ and the leverage $h_{i_m i_m}$ to identify the outliers.

  ident   h_ii     p_i      p*_0     b0        b1       outlier
  25      0.1849   1.0000   0.0000   35.6930   0.9708   Y
  19      0.0443   0.6768   0.6676   35.2458   0.9695   Y
  16      0.0443   0.1338   0.8207   33.5584   0.9651   N

The standard errors for $\beta_0$ and $\beta_1$ are 8.5023
and 0.0343 respectively. The estimate $\sigma^*$ is 25.0655.
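A single pass of such a procedure might be sketched as follows. This is an assumption-laden illustration, not the production routine: the function name `one_pass` is hypothetical, the posterior is normalised over k in {0, 1} only, `lam=0.1` is taken from the (0.1)^k factor, and the decision rule that combines p*_0, p_(i) and the leverage is left to the caller because the slides do not state it.

```python
import numpy as np

def one_pass(X, y, w, lam=0.1):
    """Compute p*_0, single-deletion p_(i) and leverages for one pass."""
    X, y, w = (np.asarray(a, dtype=float) for a in (X, y, w))
    n, p = X.shape

    def fit(keep):
        Xk, yk, wk = X[keep], y[keep], w[keep]
        A = Xk.T @ (Xk / wk[:, None])             # X' W^-1 X
        b = np.linalg.solve(A, Xk.T @ (yk / wk))
        r = yk - Xk @ b
        return float(r @ (r / wk)), np.linalg.slogdet(A)[1], A

    all_units = np.ones(n, dtype=bool)
    sse0, logdet0, A0 = fit(all_units)
    # Leverage h_ii: diagonal of W^-1/2 X (X' W^-1 X)^-1 X' W^-1/2.
    Z = X / np.sqrt(w)[:, None]
    leverage = np.einsum("ij,ij->i", Z @ np.linalg.inv(A0), Z)
    q = np.empty(n)
    for i in range(n):                            # n refits instead of all subsets
        keep = all_units.copy()
        keep[i] = False
        sse, logdet, _ = fit(keep)
        q[i] = np.exp(0.5 * (logdet0 - logdet) + 0.5 * (n - p) * np.log(sse0 / sse))
    Q1 = q.sum()
    w0, w1 = 1.0, lam * Q1 / n                    # unnormalised weights for k = 0, 1
    return {"p0": w0 / (w0 + w1), "p_i": q / Q1, "leverage": leverage}
```

A caller would flag the unit with the largest p_(i) when the evidence warrants it, drop that unit, and run the pass again until no further unit is flagged, so the cost per pass is n regressions rather than one per subset.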
Simulation results
Data from MRTS were selected for 3 imputation
classes in which the number of respondents for
the bivariate imputation model exceeded 50.
For a given simulation, (100 - p)% of the units in each
cell were selected and used to impute for the remaining p%.
The method presented here was compared to the
MM M-estimator using the relative difference of
the average predictions.
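One plausible reading of the comparison measure is the difference between the mean imputed value and the mean true value of the held-out units, divided by the mean true value. That definition and the function name `relative_difference` are assumptions; the slides do not give the exact formula.

```python
import numpy as np

def relative_difference(predicted, observed):
    """(mean prediction - mean observed) / mean observed, for held-out units."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return (predicted.mean() - observed.mean()) / observed.mean()

# Toy held-out units: imputed values average 11 against a true average of 10.
rd = relative_difference([11.0, 9.0, 13.0], [10.0, 10.0, 10.0])
```

Under this reading a value near zero means the imputed values reproduce the average of the held-out units, and the sign shows the direction of any bias.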
Simulation results for 200 runs
  p%   Method   Cell 1   Cell 2   Cell 3
   5   MRTS     -0.002   -0.020    0.013
   5   MM       -0.008   -0.028    0.001
  10   MRTS     -0.000   -0.007    0.015
  10   MM       -0.008   -0.015    0.003
  15   MRTS     -0.000   -0.011    0.014
  15   MM       -0.008   -0.016    0.002
Conclusions
The procedure for outlier detection works
well and produces fairly robust estimates. It
would also allow for more covariates to be
included in the E&I process.
Although the closed form solution of the
estimator relies on the assumption of
normality, the method is still applicable to
situations where modest departures from
normality arise.
Thank you!
For more
information,
please contact
Robert Philips
- e-mail: [email protected]
- telephone: (613) 951-1493
www.statcan.ca