Presentation

Q2010
Helsinki
Integrating databases over time: what about
representativeness in longitudinal integrated and
panel data?
Silvia Biffignandi, Bergamo University
Alessandro Zeli, Istat
zeli@ista.it
silvia.biffignandi@unibg.it
Biffignandi Silvia- Zeli Alessandro
Outline
• The problem and the research objectives
• Description of two longitudinal databases
• Quality analyses of these databases
• Conclusions
The problem and the research objectives
• NSIs usually carry out business surveys at different points in time, using different samples for different surveys as well as different samples over time
• Users need more and more statistical information
• New strategies are required
The problem and the research objectives
User needs for longitudinal data
a) understanding aggregate changes in a variable, such as the employment rate, over time
b) studying the time-varying economic characteristics (such as employment) of an individual
... and the research objectives
Our task:
• construction of two longitudinal databases, based on various sources and on different criteria
• verification of the consistency between estimates based on the databases and population data
Description of two longitudinal databases
• IDB (technically integrated database)
• panel
Description of two longitudinal databases
• microdata
• target population: enterprises with 20 employees or more (40% in terms of employment and 60% in terms of value added)
• variables: balance sheet data; SBS regulation data
• period: 1998-2004
Description of two longitudinal databases
1) IDB (technically integrated database)
[Figure: IDB data by source (source percentage), years 1998-2004; x-axis: years (tick labels run from 1989 to 2004); y-axis: 0% to 100%]
Codes description
• ril=9: only BIL
• ril=5: PMI non-respondents, data integrated from the BIL source
• ril=3: SCI non-respondents, data integrated from the BIL source
• ril=2: SCI non-respondents, donor imputation
• ril=1: SCI respondents
• ril=0: PMI respondents
Description of two longitudinal databases
2) Panel
• a catch-up panel database
• it takes business transformations into account
• integrity criterion, i.e. all variables in the panel have to be present for all enterprises over the whole panel period
Description of two longitudinal databases
2) Panel
Step 1:
enterprises (with at least 20 persons employed) that responded to the SCI-PMI surveys in the starting year (1998)
+
all enterprises with at least 100 employees (even if non-respondents), provided the BIL source is available (integration);
Step 2:
the continuity criterion is applied to these enterprises;
Step 3:
persistence criterion (i.e. enterprises that responded in 1998 or have data in the BIL for at least 4 years are included in the panel)
Description of two longitudinal databases
2) Panel
Panel 1998-2004: size in terms of number of firms and persons employed

                                          Number                        % w.r.t. population
                                          Firms     Persons employed    Firms    Persons employed
Starting database (Sci-Pmi respondents)   17,097    3,902,001           24.2     67.0
Persistence                               13,573    3,243,549           19.2     55.7
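As a quick reading aid, the short Python sketch below works out how much of the starting database survives the persistence step, using only the four counts and four percentages reported on this slide. The variable and function names are illustrative, and the "implied population" is simply each count divided by its reported percentage, so it is subject to rounding.

```python
# Figures copied from the slide; names are illustrative only.
starting = {"firms": 17_097, "persons": 3_902_001,
            "firms_pct": 24.2, "persons_pct": 67.0}
persistence = {"firms": 13_573, "persons": 3_243_549,
               "firms_pct": 19.2, "persons_pct": 55.7}


def retention(after, before, key):
    """Percentage of the starting database kept after the persistence step."""
    return 100.0 * after[key] / before[key]


def implied_population(row, key):
    """Population total implied by a count and its percentage of the population."""
    return row[key] / (row[key + "_pct"] / 100.0)


firms_kept = retention(persistence, starting, "firms")
persons_kept = retention(persistence, starting, "persons")
print(f"firms retained:   {firms_kept:.1f}%")
print(f"persons retained: {persons_kept:.1f}%")

# Consistency check: both rows should imply roughly the same population size.
pop_from_start = implied_population(starting, "firms")
pop_from_persist = implied_population(persistence, "firms")
print(f"implied number of firms in the population: "
      f"{pop_from_start:,.0f} vs {pop_from_persist:,.0f}")
```

The persistence step removes a larger share of firms than of persons employed, which is in line with larger enterprises being more likely to stay in the panel.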
Quality analyses
Verify that the population structure (especially the sectoral composition) is the same in the different databases.
We apply two different approaches:
1. statistical analysis of the differences between the distributions of some important variables in the IDB/panel and in the universe;
2. an index of representativeness related to the main categorical variables.
Quality analyses
Difference between distributions 1.a)
1.a) Spearman's rank correlation
for the distributions of value added, persons employed and turnover across economic divisions, years 1998-2004:
• in IDB and universe (minimum 95.2, maximum 99.8)
• in panel and universe (minimum 90.9, maximum 97)
In both cases the correlation is very high in every year.
Caveat: this captures only the ordinal ranking of the economic divisions, not the size of the differences among them.
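To make the comparison concrete, here is a minimal Python sketch of Spearman's rank correlation between the values of one variable by economic division in the universe and in the panel. The numbers are invented for illustration and are not taken from the study; ties receive average ranks, and the coefficient is computed as the Pearson correlation of the ranks.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical value added by economic division (illustrative numbers only).
universe = [120.0, 45.0, 300.0, 80.0, 15.0]
panel = [100.0, 50.0, 260.0, 40.0, 10.0]
rho = spearman(universe, panel)
print(f"Spearman rho = {rho:.3f}")
```

Because only ranks enter the calculation, two databases can reach a very high coefficient even when the division totals differ considerably in level.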
Quality analyses
Difference between distributions 1.b)
1.b) Fligner-Policello test of stochastic equality of the distributions of shares of turnover, value added and employment by division of economic activity, years 1992-2004
• IDB vs universe
• Panel vs universe
The test is not significant for any variable in any year in the panel.
Quality analyses
Representativeness indexes 2.)
R-indexes (representativeness), RISQ project
(see, for instance, Schouten and Cobben, 2007; Shlomo et al., 2009).
They support quality comparisons across surveys or registers by comparing the response:
• to different surveys that share the same target population
• to a survey during data collection
• to a survey longitudinally
Quality analyses
Representativeness indexes 2.)
a) Weak representativity R2 indicator
i.e. the average response propensity over the categories is constant
Response probabilities have to be estimated, usually with a logistic regression model
In our study the auxiliary variables are:
• industrial division as classified in NACE Rev. 1.1 (2-digit sectors)
• 3 size classes: 20 to 49, 50 to 249, and 250 and over persons employed
Quality analyses
Representativeness indexes b)
b) If X is an auxiliary variable with H classes, the proposed marginal indicator (MR2) is based on the centralised regression parameter for each category h
Quality analyses
Representativeness indexes b)
[Figure: R2 index and lower bound, years 1998-2004; y-axis from 0.945 to 0.985]
Quality analyses
Representativeness indexes b)
[Figure: Marginal R-index by enterprise size class (20-49, 50-249, 250+), panel years 1998, 2001 and 2004; y-axis from -1.5 to 1.0]
Quality analyses
Representativeness indexes b)
[Figure: Marginal R-index by section of economic activity (CA, CB, DA, DB, DC, DD, DE, DF, DG, DH, DI, DJ, DK, DL, DM, DN, E, F, G, H, I, K, M, N, O), panel years 1998, 2001 and 2004; y-axis from -1.5 to 2]
Quality analyses
Representativeness indexes 2.)
• R2 indexes are very high in each year included in the panel
• the marginal R-index remains essentially the same, with a slight decrease over the period
• medium-large enterprises (50 persons employed and over) are overrepresented
• small enterprises (between 20 and 49 persons employed) are underrepresented
• the service sector is underrepresented
Summing up:
we are quite confident that the level of representativeness is appropriate in the global context.
Concluding remarks
- the use of administrative data integration is promising for longitudinal database construction
- IDB and panel estimates are satisfactory
- the panel allows a gain of information at a reasonable cost and effort: for instance, enterprises can be grouped according to classifications selected by the user (gazelles, or best performers) other than the ordinary classifications used in the sample design (economic activity, size, geographical area)
Further research
- more representativity indicator analyses
- panel update criteria
Thank you for your attention!
R-indexes (representativeness) - RISQ project
Schouten and Cobben (2007)

$$\hat{R}_2(\tilde{\rho}) = 1 - \frac{4}{N-1}\sum_{i=1}^{N}\frac{s_i}{\pi_i}\left(\hat{\rho}_i - \hat{\bar{\rho}}\right)^2$$

where $s_i$ is a selection indicator that takes value 1 if the unit is selected in the sample and 0 otherwise, and $\pi_i$ is the first-order inclusion probability.

Weak representativity R2 indicator:
a response subset is weakly representative for a categorical variable X with H categories if the average response propensity over the categories is constant:

$$\bar{\rho}_h = \frac{1}{N_h}\sum_{k=1}^{N_h}\rho_{h,k} = \rho, \qquad h = 1, \dots, H$$

where $N_h$ is the population size of category h, $\rho_{h,k}$ is the response propensity of unit k in class h, and summation is over all units in this category.
R-indexes (representativeness)
Schouten and Cobben (2007)
RISQ project
The response probability

$$\rho_i = P(r_i = 1 \mid s_i = 1) = \frac{\exp(\beta' X_i)}{1 + \exp(\beta' X_i)}$$
Auxiliary variables in our study are:
• industrial division as classified in NACE Rev. 1.1 (2-digit sectors)
• 3 size classes: 20 to 49, 50 to 249, and 250 and over persons employed
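Once response propensities have been estimated (for example with the logistic model), the R2 indicator is a short computation. The Python sketch below assumes the indicator has the form R2 = 1 - 4 S^2, where S^2 is the design-weighted variance of the estimated propensities with divisor N - 1. The function name, its default arguments and the example propensities are illustrative assumptions; in particular, taking N as the sum of the design weights when it is not supplied is a choice made here for convenience.

```python
def r2_indicator(propensities, inclusion_probs=None, population_size=None):
    """R2 = 1 - 4/(N - 1) * sum_i (1/pi_i) * (rho_i - rho_bar)^2.

    propensities    : estimated response propensities of the sampled units
    inclusion_probs : first-order inclusion probabilities (default: all 1,
                      i.e. a complete enumeration)
    population_size : N (default: sum of the design weights, an assumption)
    """
    n = len(propensities)
    if inclusion_probs is None:
        inclusion_probs = [1.0] * n
    weights = [1.0 / pi for pi in inclusion_probs]
    if population_size is None:
        population_size = sum(weights)
    # Design-weighted mean propensity.
    rho_bar = sum(w * r for w, r in zip(weights, propensities)) / population_size
    # Design-weighted variance of the propensities.
    variance = sum(w * (r - rho_bar) ** 2
                   for w, r in zip(weights, propensities)) / (population_size - 1)
    return 1.0 - 4.0 * variance


# Hypothetical estimated propensities for five units (illustrative only).
example = [0.15, 0.22, 0.30, 0.18, 0.25]
print(f"R2 = {r2_indicator(example):.4f}")
```

Identical propensities give R2 = 1, and the indicator falls as the propensities spread out.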
R-indexes (representativeness)
Schouten and Cobben (2007)
Bias

$$B(\hat{\rho}) = \frac{\sqrt{1 - \hat{R}_2}}{4\bar{\rho}} \qquad\Longleftrightarrow\qquad \hat{R}_2 = 1 - 16\,\bar{\rho}^{2}\,B(\hat{\rho})^{2}$$

where

$$B(\hat{\rho}) = \frac{S(\tilde{\rho})}{2\bar{\rho}}$$
Fligner-Policello test
Task
Verify that, in each year, the distribution of the relative shares of the economic divisions (each division's share of the total, for the three variables considered) is the same in the IDB or panel as in the universe.
The test makes no assumptions:
• of normality,
• of equal variances,
• that the two distributions have a similar shape.
It is a test of stochastic equality between two distributions; rejecting the null hypothesis means that the two distributions differ in probability.
If the null hypothesis is rejected, the sign of the F-P statistic indicates which of the two distributions is dominant: a positive sign means that the panel shares are more likely to take greater values than the population shares.
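For readers who want to see the mechanics, the following Python sketch computes the Fligner-Policello statistic from placements, using the textbook form U = (sum Q - sum P) / (2 sqrt(V1 + V2 + Pbar Qbar)) and a normal approximation for the two-sided p-value. This is an illustration rather than the code used in the study: the function names and the share values are invented, and small samples would normally call for exact critical values instead of the normal approximation.

```python
import math


def placements(sample, other):
    """For each value in `sample`, count the values in `other` below it.

    Ties contribute 1/2.
    """
    return [sum(1.0 if o < v else 0.5 if o == v else 0.0 for o in other)
            for v in sample]


def fligner_policello(x, y):
    """Return (statistic, two-sided p-value from the normal approximation).

    A positive statistic means values in `y` tend to exceed values in `x`.
    """
    p = placements(x, y)  # for each x_i: number of y values below it
    q = placements(y, x)  # for each y_j: number of x values below it
    p_bar, q_bar = sum(p) / len(p), sum(q) / len(q)
    v1 = sum((pi - p_bar) ** 2 for pi in p)
    v2 = sum((qj - q_bar) ** 2 for qj in q)
    denom = 2.0 * math.sqrt(v1 + v2 + p_bar * q_bar)
    if denom == 0.0:
        raise ValueError("statistic undefined (e.g. completely separated samples)")
    u = (sum(q) - sum(p)) / denom
    p_value = math.erfc(abs(u) / math.sqrt(2.0))
    return u, p_value


# Hypothetical shares by economic division (illustrative numbers only).
universe_shares = [0.10, 0.25, 0.05, 0.40, 0.20]
panel_shares = [0.12, 0.22, 0.06, 0.38, 0.22]
stat, pval = fligner_policello(universe_shares, panel_shares)
print(f"F-P statistic = {stat:.3f}, p-value = {pval:.3f}")
```

Passing the universe shares first and the panel shares second matches the sign convention described above: a positive statistic points to panel shares that tend to be larger than the population shares.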