On the theory framework for register-based statistics

A theoretical framework
for register-based statistics
--- Can we carry on without it?
Li-Chun Zhang
Statistics Norway
Statistical data by combination of sources:
Coverage, content & relevance
Quality: Statistical vs. administrative register
• Wallgren & Wallgren (2007, Wiley):
– “An administrative register is maintained to store records on all objects to
be administered.” (Ideally)
– “A statistical register is based on data from administrative registers that
have been processed to suit statistical purposes.”
• A defining distinction in perspectives
– Administrative register: Individual data of all importance
– Statistical register: Properties at various aggregated levels
=> Quality of register-based statistics
=> Micro-data quality of a statistical register
• Notable lag of theoretical framework (Platek and Särndal, Holt,
Nanopoulos, 2001)
– A framework for quality assessment
– Theoretical frameworks for different quality aspects
Process accuracy vs. statistical accuracy:
Any unbiased, efficient estimators based on statistical registers?
• Process accuracy
– Matching/mismatching rate
– Extent of duplicates
– Amount of missing values
– …
• Statistical accuracy
– Coverage
– Relevance
– Inherent stochastic variation
=> An example of the UK claimant register (Holt, 2007, TAS)
– people claiming unemployment related benefits
– entire population of claimants (say 1.5 million)
– no sampling error and arguably a perfect measure
– derived once each month on the same working day
– daily variation about 10,000 in this count
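To put the last two figures side by side, here is a minimal sketch that uses only the round numbers quoted above (about 1.5 million claimants, daily variation of about 10,000). The variable names and the helper `exceeds_daily_variation` are illustrative, not part of the source.

```python
# Round figures quoted in the claimant-register example (Holt, 2007).
claimant_count = 1_500_000   # "say 1.5 million"
daily_variation = 10_000     # "daily variation about 10,000"

# Size of the day-to-day variation relative to the count itself.
# There is no sampling error here, yet this variation is not zero.
relative_variation = daily_variation / claimant_count
print(f"{relative_variation:.2%}")  # prints 0.67%


def exceeds_daily_variation(change: int) -> bool:
    """Hypothetical helper: is a month-to-month change larger than the
    variation that the choice of counting day alone can produce?"""
    return abs(change) > daily_variation
```

Under this reading, a month-to-month change of, say, 5,000 (a hypothetical figure) would be smaller than the variation attributable to the counting day alone, even though the register covers the entire population of claimants.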
A historic parallel:
Survey sampling before Neyman (1934)
• The representative method (Kiær, 1895) with a three-stage design
using 1890 census as frame:
– 1st: 128 counties and 23 towns throughout the country
– 2nd: cohorts of males of age 17, 22, 27, 32, etc.
– 3rd: persons with surname initial A, B, C, L, M, N
• ISI-committee 1924 report: “I think I may venture to say that
nowadays there is hardly one statistician, who in principle will
contest the legitimacy of the representative method”. (Jensen)
• Representative sampling (Neyman, 1934):
“Thus, if we are interested in a collective character $X$ of a population $\pi$ and use methods of sampling and of estimation,
allowing us to ascribe to every possible sample, $\Sigma$, a confidence interval $X_1(\Sigma), X_2(\Sigma)$ such that the frequency of errors
in the statements $X_1(\Sigma) \le X \le X_2(\Sigma)$ does not exceed the limit $1-\varepsilon$ prescribed in advance, whatever the unknown
properties of the population, I should call the method of sampling representative and the method of estimation consistent.”
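The criterion in Neyman's definition is a frequency of errors over all possible samples, which can be checked by simulation. The sketch below is one illustration under stated assumptions: simple random sampling without replacement, a normal-approximation interval for the population mean, and an artificial population generated for the purpose. The function names, the population and the constant `z = 1.96` are choices made here, not taken from the source. The interval is also only approximate, so it does not guarantee the bound "whatever the unknown properties of the population".

```python
import math
import random


def srs_interval(population, n, z, rng):
    """Normal-approximation interval for the population mean, based on a
    simple random sample of size n drawn without replacement."""
    N = len(population)
    sample = rng.sample(population, n)
    mean = sum(sample) / n
    s2 = sum((y - mean) ** 2 for y in sample) / (n - 1)
    se = math.sqrt((1 - n / N) * s2 / n)  # with finite population correction
    return mean - z * se, mean + z * se


def error_frequency(population, n, z=1.96, reps=2000, seed=1):
    """Share of repeated samples whose interval misses the true mean."""
    rng = random.Random(seed)
    true_mean = sum(population) / len(population)
    misses = 0
    for _ in range(reps):
        lower, upper = srs_interval(population, n, z, rng)
        if not (lower <= true_mean <= upper):
            misses += 1
    return misses / reps


# Artificial population (an assumption made only for this illustration).
pop_rng = random.Random(0)
population = [pop_rng.gauss(50, 10) for _ in range(5000)]
print(error_frequency(population, n=200))
```

With `z = 1.96` the printed frequency should be in the neighbourhood of 0.05. The point is that the statement concerns the sampling and estimation method, not any single sample.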
Comparisons to non-sampling errors
in sample survey and census

Sample Survey          Census                 Register-based survey
Coverage errors        Coverage errors        Relevance errors
Non-response errors    Non-response errors    Integration errors
Measurement errors     Measurement errors
Sampling errors

– Coverage errors
– Matching/mismatching errors
– Missing-link errors
– Aggregation errors
– (Partial classification)

• Unidentified units in register & non-response in survey
– Related to under-coverage
– Yes, imputation. But a quite different theory!
– Example: register households
=> ‘Imputation’ of household identity
=> Which imputation methods do you use? Hot-deck?
• Definitional error in register source & measurement error
– Related to relevance
– Yes, a kind of measurement error. But bias dominates! And often clearly
different in different sub-populations.
– Example: register unemployment (REG_unemp)
REG_unemp = ILO_unemp + Bias + Random_error
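One way to read "bias dominates" in the decomposition above is through the mean squared error: the bias term does not shrink as the population grows, whereas the random error does. The sketch below works with unemployment rates rather than counts and assumes a binomial-type variance for the random error. Both are assumptions made here for illustration, as are the function name and the numbers used.

```python
def bias_share_of_mse(bias: float, rate: float, n: int) -> float:
    """Share of the mean squared error that is due to squared bias.

    Assumes (hypothetically) that the random error of a register-based
    rate has variance rate * (1 - rate) / n for a population of size n.
    """
    variance = rate * (1 - rate) / n
    mse = bias ** 2 + variance
    return bias ** 2 / mse


# Hypothetical values: a one-percentage-point bias and a 4 % rate.
for n in (100, 10_000, 1_000_000):
    print(n, round(bias_share_of_mse(bias=0.01, rate=0.04, n=n), 4))
```

For small sub-populations the random error still matters. For large ones the share approaches 1, which is why a bias that differs between sub-populations cannot be averaged away.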
A theory for detailed statistics: Signal or noise?
$$\text{Parameter}_{d,t+1} = \text{Parameter}_{d,t} + g(x_{d,t}, x_{d,t+1}) + u_{d,t+1} + e_{d,t+1}(N_{d,t+1})$$
$d$ = domain of interest,
$t$, $t+1$ = reference time points
$x_{d,t}$, $x_{d,t+1}$ = explanatory variables/covariates/auxiliary information
$g(x)$ = description/model of underlying structural change
$u_{d,t+1}$ = random domain effect beyond structural explanation
$N_{d,t+1}$ = domain population size
$e_{d,t+1}$ = random errors governed by the law of large numbers
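The roles of the two random terms can be illustrated numerically. The sketch below assumes that u and e are independent and normally distributed, and that the standard deviation of e falls as one over the square root of the domain size. The source says only that e is governed by the law of large numbers, so these are assumptions made here. The function names and all numerical values are hypothetical.

```python
import math
import random


def unexplained_sd(sigma_u: float, sigma_e: float, n_domain: int) -> float:
    """Standard deviation of u + e when sd(e) = sigma_e / sqrt(n_domain)."""
    return math.sqrt(sigma_u ** 2 + sigma_e ** 2 / n_domain)


def simulate_change(g_value, sigma_u, sigma_e, n_domain, rng):
    """One draw of Parameter(d, t+1) - Parameter(d, t) = g + u + e."""
    u = rng.gauss(0.0, sigma_u)                         # random domain effect
    e = rng.gauss(0.0, sigma_e / math.sqrt(n_domain))   # shrinks as N grows
    return g_value + u + e


rng = random.Random(42)
for n_domain in (50, 5_000, 500_000):
    change = simulate_change(0.5, 0.1, 5.0, n_domain, rng)
    print(n_domain, round(unexplained_sd(0.1, 5.0, n_domain), 3), round(change, 3))
```

In small domains the term e can swamp the structural change g. In large domains what remains is the domain effect u, which does not vanish with size.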
A theory for micro-data quality
• Reality at “Storgata 9”:
– H0101: Astrid (72) - widow
– H0102: Tommy (32) & Jenny (29) & Ronny (2) - cohabitation
– H0201: Olav (29) & Lena (29) - cohabitation since Census 2001
– H0202: Knut (27) - single
• Register:
– H0101: Astrid (72) - widow
– H0101: Tommy (32) & Jenny (29) & Ronny (2) - cohabitation
– H0101: Olav (29) - single
– ?: Lena (29) - single
– ?: Knut (27) - single
(Imputed cohabitation in household register)
• Only Astrid is correctly registered. But when/how does it matter?
Administrative register => Individual data of all importance
=> Unit-specific error
Statistical register => A theory of types
- How real is a record: how are variables related to each other
- How representative is a record: distribution of the types
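The two perspectives can be contrasted on the "Storgata 9" example. The sketch below encodes the reality and register listings as given and compares them in two ways: person by person (unit-specific error) and by the distribution of household sizes (one simple notion of "type"). Using household size as the type is a choice made here for illustration. The data layout and function names are likewise hypothetical, and the imputed cohabitation is not applied.

```python
from collections import Counter

# (dwelling id, members) per household; None stands for the "?" entries.
reality = [
    ("H0101", ["Astrid"]),
    ("H0102", ["Tommy", "Jenny", "Ronny"]),
    ("H0201", ["Olav", "Lena"]),
    ("H0202", ["Knut"]),
]
register = [
    ("H0101", ["Astrid"]),
    ("H0101", ["Tommy", "Jenny", "Ronny"]),
    ("H0101", ["Olav"]),
    (None, ["Lena"]),
    (None, ["Knut"]),
]


def dwelling_by_person(households):
    """Map each person to the dwelling id recorded for them."""
    return {p: dwelling for dwelling, members in households for p in members}


def size_distribution(households):
    """Count households by number of members (the 'type' used here)."""
    return Counter(len(members) for _, members in households)


true_dwelling = dwelling_by_person(reality)
reg_dwelling = dwelling_by_person(register)

# Unit-specific view: whose dwelling id matches reality?
correct = sorted(p for p in true_dwelling if reg_dwelling.get(p) == true_dwelling[p])
print(correct)  # ['Astrid']

# Type view: how do the household-size distributions compare?
print(size_distribution(reality))
print(size_distribution(register))
```

At the unit level six of the seven persons are in error. At the type level the discrepancy is narrower: the register has no two-person household and two extra one-person households, while the three-person household is counted correctly despite its wrong dwelling id.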