General to Detail Allocation

Methodology of Allocating
Generic Field to its Details
Jessica Andrews
Nathalie Hamel
François Brisebois
ICESIII - June 19, 2007
Outline
Background Information on Tax Data
Objective
Current Methodology
Other Methodologies Considered
Comparison of the Methodologies
Future Work and Conclusions
Tax Data
Statistics Canada receives annual data
from Canada Revenue Agency (CRA) on
incorporated (T2) businesses
Tax data:
Balance Sheet
Income Statement
88 different Schedules
Tax Data
About 700 different fields to report
Most companies provide only 30-40 fields
Only 8 fields are actually required by CRA (section
totals)
Non-farm revenue
Non-farm expenses
Farm revenue
Farm expenses
Assets
Liabilities
Shareholder Equity
Net Income/Loss
Objective
To impute the missing detail variables
Why ?
Tax data users need detailed data (tax
replacement project (TRP))
Different concepts and definitions between tax
and survey data
A subset of details linked to the same generic can
be mapped to different survey variables (Chart of
Accounts)
Challenges to meet
Methodology must
Work well for a large number of details
Be capable of dealing with details which are
rarely reported and those which are frequently
reported
Give good micro results for tax replacement,
but also give good macro results when
examined at the NAICS or full database level
First attempt to complete Tax Data
Edit rules
Outlier detection within a record
Deterministic edits (to ensure the record balances within
section)
Review and manual corrections
Overlap between fiscal periods
Negative values
Consistency edits between tax variables
Outlier detection between records (Hidiroglou-Berthelot)
CORTAX balancing edits
Deterministic imputation of key variables
Inventories
Depreciation
Salaries and wages
GDA Concepts
Corporations can use either generic or detail fields to report their
results
                                                     Case 1   Case 2   Case 3
Generic
  8810 Office expenses amount                           100        -       30
Details
  8811 Office stationery and supply expense amount        -       20        -
  8812 Office utilities expense amount                    -       30       10
  8813 Data processing expense amount                     -       50       60
Total                                                   100      100      100
GDA Concepts
Block is defined by a generic and its details
Generic field is not a total
Goal is to impute the most significant detail
variables when a generic amount has been
reported
GDA: Generic to detail allocation
Current method
Uses imputation classes based on industry codes
and size of company
First 2 digits of NAICS (about 25 industries)
Three sizes of revenue (boundaries of 5 and 25
million)
Calculates ratios within imputation classes for
each block
Uses all non-zero and non-missing details
Uses only details reported at least 10% of the time (5%
for block General Farm Expense)
Assigns ratios to businesses with a generic
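The ratio calculation of the current method can be sketched as follows. This is a minimal illustration, not the production system: the record layout (`naics`, `revenue`, `details`), the function names, the treatment of a revenue exactly on a boundary, the use of a ratio of class totals, and the choice of "businesses in the class that reported at least one detail" as the denominator of the 10% rule are all assumptions.

```python
from collections import defaultdict

def size_class(revenue, bounds=(5_000_000, 25_000_000)):
    """Return 0, 1 or 2 for the three revenue sizes (boundary handling assumed)."""
    return sum(revenue >= b for b in bounds)

def class_ratios(records, min_rate=0.10):
    """Detail distribution ratios per (2-digit NAICS, size) imputation class.

    records: list of dicts with 'naics', 'revenue' and 'details' (field -> amount);
    this layout is hypothetical.
    """
    totals = defaultdict(lambda: defaultdict(float))  # class -> field -> summed amount
    counts = defaultdict(lambda: defaultdict(int))    # class -> field -> number of reporters
    n_in_class = defaultdict(int)                     # class -> businesses reporting any detail
    for r in records:
        reported = {f: v for f, v in r["details"].items() if v}  # drop zero and missing
        if not reported:
            continue
        key = (r["naics"][:2], size_class(r["revenue"]))
        n_in_class[key] += 1
        for f, v in reported.items():
            totals[key][f] += v
            counts[key][f] += 1
    ratios = {}
    for key, fields in totals.items():
        # Keep only details reported often enough within the class.
        kept = {f: t for f, t in fields.items()
                if counts[key][f] / n_in_class[key] >= min_rate}
        s = sum(kept.values())
        ratios[key] = {f: t / s for f, t in kept.items()} if s else {}
    return ratios
```

A business with only a generic amount would then receive `generic * ratio` for each detail of its class; passing `min_rate=0.05` mimics the lower threshold quoted for the General Farm Expense block.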
Current method
Originally proposed as a solution with good
macro (aggregate) results
Now need good micro (business) level
results for TRP
Problems
Imputation classes are frequently not
homogeneous in terms of distribution
A large number of small imputation classes
Other methods considered
Historic imputation method
Scores method
Cluster method
Historic imputation method
Assumes distributions of details are the same
from one year to the next
Problems
A change in business strategies/properties is not
taken into account
Most businesses which report details in the previous
year will report them also in the current year, leaving
few businesses which could be imputed with this
method (~5% on all blocks tested)
Requires use of another method for remaining
businesses
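As a rough sketch of the historic idea, a generic amount in the current year can be spread using the business's own detail distribution from the previous year. The function name and the dictionary layout are assumptions made for illustration.

```python
def historic_allocate(generic, prev_details):
    """Spread a current-year generic amount using last year's detail shares.

    prev_details: field -> amount reported by the same business in the previous year.
    Returns None when no usable history exists, so that another method can take over.
    """
    usable = {f: v for f, v in prev_details.items() if v}  # drop zero and missing
    total = sum(usable.values())
    if not total:
        return None
    return {f: generic * v / total for f, v in usable.items()}
```

Returning `None` mirrors the limitation noted above: most businesses lack usable history for this purpose, so a fallback method is still required.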
Scores method
Uses response/non response models for
each detail
Groups businesses into imputation classes
on the basis of percentiles of response
probability
Calculates ratios within imputation classes
Assigns ratios to businesses with a generic
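The grouping step of the scores method might look like the sketch below. It assumes the response probabilities have already been estimated by a response/non-response model that is not shown; the function name, the number of classes, and the equal-sized rank-based grouping are illustrative choices, not taken from the presentation.

```python
def percentile_classes(probabilities, n_classes=4):
    """Group businesses into classes by rank of estimated response probability.

    probabilities: business id -> estimated probability of reporting a given detail
    (assumed to come from a previously fitted model).
    Returns business id -> class index, 0 being the lowest probabilities.
    """
    ranked = sorted(probabilities, key=probabilities.get)
    n = len(ranked)
    return {b: rank * n_classes // n for rank, b in enumerate(ranked)}
```

Ratios would then be calculated within each class, exactly as for the other methods. With one model per detail, a block with many details yields many competing groupings, which is the difficulty raised for this method.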
Scores method
Problems
Need to create a model for each detail
Difficult to resolve what to do in the case of
blocks with many details (5 or more) which are
frequently reported
This method was excluded due to its difficulty
in coping with blocks with a moderate to large
number of details
Cluster method
Divides businesses into imputation classes
on the basis of response patterns to details
Uses clustering or dominant detail method
Uses discriminatory models (parametric or
not) to assign businesses with generic to
imputation classes
Calculates ratios within imputation classes
Assigns ratios to businesses with a generic
Cluster method
Problems
For certain blocks it can be difficult to find good
variables on which to discriminate
Issue of how often clustering method and
models should be reviewed
Comparing the methods
Estimate distributions of known data for
year n from ratios calculated for year n-1
Create a benchmark file
Reported details in years n-1 and n
Put all details into generic fields in year n
Calculate ratios from businesses in year n-1 for
all methods
Assign ratios to businesses in year n
Compare the results to the reported fields
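The benchmark steps above can be sketched as a small evaluation loop. The data layout (one dictionary per year, keyed by business id) and all function names are hypothetical, and `pooled_ratios` is only a placeholder for whichever method is being tested; a real method would return ratios per imputation class or cluster.

```python
def build_benchmark(prev_year, curr_year):
    """Ids of businesses that reported details in both year n-1 and year n."""
    return sorted(b for b in curr_year if prev_year.get(b) and curr_year[b])

def pooled_ratios(details_by_business):
    """Placeholder method: one set of ratios from all year n-1 businesses."""
    totals = {}
    for details in details_by_business.values():
        for f, v in details.items():
            totals[f] = totals.get(f, 0.0) + v
    s = sum(totals.values())
    return {f: t / s for f, t in totals.items()}

def evaluate(prev_year, curr_year, ratios_from):
    """Return (reported, estimated) detail dictionaries for each benchmark business."""
    ids = build_benchmark(prev_year, curr_year)
    ratios = ratios_from({b: prev_year[b] for b in ids})
    pairs = []
    for b in ids:
        truth = curr_year[b]
        generic = sum(truth.values())  # put all year n details into the generic
        estimate = {f: generic * r for f, r in ratios.items()}
        pairs.append((truth, estimate))
    return pairs
```

The resulting pairs are the input for the micro and macro comparisons.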
Comparing the methods
Compare the results at the micro
(businesses) and the macro (aggregate)
levels
Compare true and estimated distributions
Comparing the methods
Macro statistics
\[ \mathrm{SSE} = \sum_j (t_j - \hat{t}_j)^2 \]
\[ \mathrm{SSEP} = \sum_j \left( \frac{\hat{t}_j}{t_j} - 1 \right)^2 \]
for the jth detail in the block
Comparing the methods
Micro Statistics
Median Pseudo CV
\[ \frac{\sqrt{\sum_j (x_{ij} - \hat{x}_{ij})^2}}{\sum_j x_{ij}} \]
for the jth detail and ith business in the block
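A sketch of the median pseudo CV follows. It assumes the pseudo CV of business i is the square root of the summed squared differences between reported and estimated details, divided by the business's reported total; that reading of the formula, including the square root, is an assumption, as are the function names and input layout.

```python
from math import sqrt
from statistics import median

def pseudo_cv(reported, estimated):
    """Assumed form: sqrt(sum_j (x_ij - xhat_ij)^2) / sum_j x_ij for one business."""
    fields = set(reported) | set(estimated)
    num = sqrt(sum((reported.get(j, 0.0) - estimated.get(j, 0.0)) ** 2 for j in fields))
    return num / sum(reported.values())

def median_pseudo_cv(pairs):
    """Median over businesses; pairs is an iterable of (reported, estimated) dicts."""
    return median(pseudo_cv(x, x_hat) for x, x_hat in pairs)
```

Taking the median rather than the mean keeps a few badly allocated businesses from dominating the summary.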
Comparing the methods
Micro Statistics
Median Pearson Contingency Coefficient
\[ d^2 = \sum_i \sum_j \frac{\left( n_{ij} - \frac{n_{i.} n_{.j}}{n} \right)^2}{\frac{n_{i.} n_{.j}}{n}} = n \sum_i \sum_j \frac{(f_{ij} - f_{i.} f_{.j})^2}{f_{i.} f_{.j}} \]
\[ P = \left( \frac{d^2}{d^2 + n} \right)^{1/2} \]
for the jth detail and ith business in the block
f values represent the marginal distributions
d^2 represents the degree of dependency (depends on n, r
and c)
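The Pearson contingency coefficient can be computed from any r x c table as sketched below, using the usual chi-square-type statistic for d^2 and P = (d^2 / (d^2 + n))^(1/2). How the table is built for each business is not spelled out here, so the function takes a generic table as input; its name and the list-of-rows layout are assumptions.

```python
from math import sqrt

def pearson_contingency(table):
    """Pearson contingency coefficient of an r x c table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    d2 = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # n_i. * n_.j / n
            if expected > 0:
                d2 += (n_ij - expected) ** 2 / expected
    return sqrt(d2 / (d2 + n))
```

P is 0 when rows and columns are independent and grows toward an upper bound below 1 that depends on r and c, so values are best compared between tables of the same shape.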
Comparing the methods
We show results for Block 8230: Other
Revenue
This block has 20 details covering revenue
distribution
Important for clients as used in many surveys
The scores method is not shown as it is
difficult to implement with this many details
Comparing the methods
OTHER REVENUE FLDS 8230 TO 8250
8230 Other revenue
8231 Foreign exchange gains/losses
8232 Income/loss of subsidiaries/affiliates
8233 Income/loss of other divisions
8234 Income/loss of joint ventures
…
8248 Insurance recoveries
8249 Expense recoveries
8250 Bad debt recoveries
Results
Block 8230
                      Micro Statistics                            Macro Statistics
                      Pseudo CV         Pearson Cont. Coeff.
                      Median   IQR      Median   IQR              SSE       SSEP
Current Method        1.08     0.43     0.66     0.14             2.2e20    120
Cluster Method        0.34     1.39     0.36     0.63             2.8e20    12
Historic + Cluster    0.51     0.99     0.10     0.7              9.9e19    4.5
Cluster methodology
Most blocks use dominant detail (attractor) x
clusters to define the imputation classes
A business i belongs to cluster j of attractor x,
where x > 50, if
\[ \frac{Y_{ij}}{\sum_j Y_{ij}} \geq \frac{x}{100} \]
where Y_{ij} is the total value reported by
business i in detail j. If this statement is not
true for any detail then the business is
assigned to cluster j+1.
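The dominant detail rule can be sketched as below. The sketch assumes non-negative amounts and an inclusive comparison (share of at least x%); the function name, the 1-based cluster numbering and the default x = 60 are illustrative choices.

```python
def assign_cluster(details, fields, x=60):
    """Return the 1-based index of the dominant detail, or len(fields) + 1 if none.

    details: field -> amount reported by one business (amounts assumed non-negative).
    fields:  ordered list of the detail fields in the block.
    x:       attractor in percent; must exceed 50 so that at most one detail qualifies.
    """
    if not 50 < x <= 100:
        raise ValueError("attractor x must be in (50, 100]")
    total = sum(details.get(f, 0.0) for f in fields)
    if total > 0:
        for idx, f in enumerate(fields, start=1):
            if details.get(f, 0.0) / total >= x / 100:
                return idx
    return len(fields) + 1  # no dominant detail: the extra cluster
```

Requiring x > 50 guarantees the clusters are mutually exclusive, since two details cannot each hold more than half of the block total.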
Cluster methodology
Distribution ratios to details are calculated for
each cluster
Discriminatory models are then created
(nonparametric for most blocks) to assign
businesses with a generic
Use variables on industry (NAICS), location
(province), size (revenue, log revenue), details
and totals of details in other blocks
Cluster methodology
Generic amounts are assigned to details in
the following 3 ways
If generic amount and no details reported then
ratios are assigned as calculated
If generic amount and all details with ratio
greater than 0% are reported then ratios are
assigned as calculated
If generic amount and some details but not all
are reported, then ratios are pro-rated and
generic is assigned only to details which were
not reported
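The three cases above can be sketched in one function. It assumes that in the second case the allocated share of the generic is added on top of the amount already reported, and that "reported" means a non-zero, non-missing amount; both points, along with the function name and input layout, are assumptions.

```python
def allocate_generic(generic, reported, ratios):
    """Allocate a generic amount to details following the three cases.

    reported: field -> amount already reported by the business (may be empty).
    ratios:   field -> distribution ratio of the business's cluster.
    """
    positive = {f: r for f, r in ratios.items() if r > 0}
    missing = {f: r for f, r in positive.items() if not reported.get(f)}
    # Cases 1 and 2: nothing reported, or every positive-ratio detail reported,
    # so the ratios are used as calculated. Case 3: pro-rate over missing details.
    target = positive if len(missing) in (0, len(positive)) else missing
    scale = sum(target.values())
    result = dict(reported)
    for f, r in target.items():
        result[f] = (result.get(f) or 0.0) + generic * r / scale
    return result
```

Dividing by `scale` is the pro-rating step: in the third case the ratios of the unreported details are rescaled to sum to one, so the whole generic amount is still allocated.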
Cluster methodology
Gives better micro results
Improved data for tax replacement
Macro results remain similar to current
methodology
Micro results are consistent year to year
Future work and conclusions
The cluster methodology will be
implemented for reference year 2006 for
the Income Statement
Model fitting and implementation for
Balance Sheet will follow
Review of models and clustering methods
as deemed appropriate
Contact Information
For more information,
please contact
[email protected]
[email protected]
[email protected]
Visit our web site at
www.statcan.ca