Q2010 Helsinki

Integrating databases over time: what about representativeness in longitudinal integrated and panel data?

Silvia Biffignandi, Bergamo University (silvia.biffignandi@unibg.it)
Alessandro Zeli, Istat (zeli@ista.it)

Outline
• The problem and the research objectives
• Description of two longitudinal databases
• Quality analyses of these databases
• Conclusions

The problem and the research objectives
• NSIs usually carry out business surveys at different points in time, using different samples for different surveys as well as different samples over time
• Users need more and more statistical information
• New strategies are required

The problem and the research objectives: user needs for longitudinal data
a) understanding aggregate changes over time in a variable, such as the employment rate
b) studying the time-varying economic characteristics (such as employment) of an individual

... and the research objectives
Our task:
• construction of two longitudinal databases, based on various sources and on different criteria
• verification of the consistency between estimates based on the databases and population data

Description of two longitudinal databases
1) IDB (technically integrated database)
2) Panel

Description of two longitudinal databases: common features
• microdata
• target population: enterprises with 20 employees or more (40% in terms of employment and 60% in terms of value added)
• variables: balance sheet data; SBS regulation data
• period: 1998-2004

Description of two longitudinal databases: 1) IDB (technically integrated database)
[Figure: IDB data by sources (source percentage), years 1998-2004. Stacked percentage chart from 0% to 100% by year; the extracted x-axis labels run from 1989 to 2004.]
Codes description:
• ril=9: only BIL
• ril=5: PMI non-respondents, but data integrated by BIL source
• ril=3: SCI non-respondents, but data integrated by BIL source
• ril=2: SCI non-respondents, but donor imputation
• ril=1: SCI respondents
• ril=0: PMI respondents

Description of two longitudinal databases: 2) Panel
• a catch-up panel database
• it takes business transformations into account
• integrity criterion, i.e. all variables in the panel have to be present for all enterprises over the whole panel period

Description of two longitudinal databases: 2) Panel (construction)
• Step 1: enterprises (with at least 20 persons employed) responding to the SCI-PMI surveys in the starting year (1998), plus all enterprises with at least 100 employees (even if non-respondents) if the BIL source is available (integration)
• Step 2: continuity criterion applied to the previous enterprises
• Step 3: persistence criterion (i.e. enterprises that responded in 1998 or that have data in the BIL for at least 4 years are included in the panel)

Description of two longitudinal databases: 2) Panel (size)
Panel 1998-2004: dimension in terms of number of firms and persons employed

                                          Number                         % w.r.t. population
                                          Firms     Persons employed     Firms    Persons employed
Starting database (SCI-PMI respondents)   17,097    3,902,001            24.2     67.0
Persistence                               13,573    3,243,549            19.2     55.7

Quality analyses
Verify that the population structure (especially the sectoral composition) is the same in the different databases. We apply two different approaches:
1. statistical analysis of the differences between the distributions of some important variables in the IDB/panel and in the universe;
2. an index of representativeness related to the main categorical variables.
Quality analyses: difference between distributions
1.a) Spearman's rank correlation for the distributions of value added, persons employed and turnover across economic divisions, years 1998-2004:
• IDB and universe: minimum 95.2, maximum 99.8
• panel and universe: minimum 90.9, maximum 97
In both cases the correlation is very high in each year. Note that this gives only an ordinal ranking of the relative changes among the economic divisions.

Quality analyses: difference between distributions
2.a) Fligner-Policello test of stochastic equality, applied to the distributions of the shares of turnover, value added and employment by division of economic activity, years 1992-2004:
• IDB vs universe
• panel vs universe
The test is not significant for any variable in any year in the panel.

Quality analyses: representativeness indexes
2.) R-indexes (representativeness), RISQ project (see, for instance, Schouten and Cobben, 2007; Shlomo et al., 2009).
They support the quality comparison of different surveys or registers, by comparing the response:
• to different surveys that share the same target population
• to a survey during data collection
• to a survey longitudinally

Quality analyses: representativeness indexes
2.) a) Weak representativity and the R2 indicator
• weak representativity: the average response propensity is constant over the categories
• response probability estimation is required, usually through a logistic regression model
In our study the auxiliary variables are:
• industrial division as classified in NACE Rev. 1.1 (2-digit sectors)
• 3 size classes: 20 to 49, 50 to 249, and 250 and over persons employed
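As an illustration of the rank-correlation comparison in 1.a), the sketch below computes Spearman's correlation between the shares of economic divisions in the universe and in the panel. It is a minimal pure-Python sketch, not the authors' procedure: the function names and the share values are hypothetical, chosen only to show the mechanics.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)

# Hypothetical value-added shares (%) for five economic divisions.
universe_shares = [30.0, 25.0, 20.0, 15.0, 10.0]
panel_shares = [33.0, 22.0, 24.0, 12.0, 9.0]
rho = spearman(universe_shares, panel_shares)
print(round(rho, 3))
```

In this made-up example only two divisions swap ranks, so the correlation stays high; as the slide notes, the measure reflects the ordering of the divisions only, not the size of the differences in shares.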
Quality analyses: representativeness indexes
b) If X is an auxiliary variable with H classes, a marginal indicator (MR2) is proposed, defined in terms of the centralised regression parameter for category h.
[Formula for MR2 not recoverable from the extracted text.]

Quality analyses: representativeness indexes
[Figure: R2 index and lower bound, years 1998-2004; y-axis from 0.945 to 0.985.]

Quality analyses: representativeness indexes
[Figure: Marginal R-index by enterprise size (panel years 1998, 2001, 2004); size classes 20-49, 50-249 and 250+; y-axis from -1.5 to 1.0.]

Quality analyses: representativeness indexes
[Figure: Marginal R-index by section of economic activity (panel years 1998, 2001, 2004); sections CA, CB, DA, DB, DC, DD, DE, DF, DG, DH, DI, DJ, DK, DL, DM, DN, E, F, G, H, I, K, M, N, O; y-axis from -1.5 to 2.]

Quality analyses: representativeness indexes
2.) Results
• R2 indexes are very high in each year included in the panel
• the marginal R-index remains essentially the same, with a slight decrease over the period
• medium-large enterprises (50 persons employed and over) are overrepresented
• small enterprises (between 20 and 49 persons employed) are underrepresented
• the service sector is underrepresented
Summing up: we are quite confident that the level of representativeness is appropriate in the global context.
Concluding remarks
- the use of administrative data integration is promising for longitudinal database construction
- IDB and panel estimates are satisfactory
- the panel allows a gain of information at reasonable cost and effort: for instance, the grouping of enterprises according to classifications selected by the user (gazelles, or best performers) other than the ordinary classifications used in the sample design (economic activity, size, geographical area)

Further research
- more representativity indicator analyses
- panel update criteria

Thank you for your attention!

R-indexes (representativeness), RISQ project
Schouten and Cobben (2007)

$$\hat{R}_2(\hat{\rho}) = 1 - \frac{4}{N-1} \sum_{i=1}^{N} \frac{s_i}{\pi_i} \left( \hat{\rho}_i - \hat{\bar{\rho}} \right)^2$$

where
• $s_i$ is a selection indicator that takes value 1 if unit $i$ is selected in the sample and 0 otherwise
• $\pi_i$ is the first-order inclusion probability
• $\hat{\rho}_i$ is the estimated response propensity of unit $i$ and $\hat{\bar{\rho}}$ is the corresponding weighted mean of the estimated propensities

Weak representativity and the R2 indicator: a response subset is representative for a categorical variable X with H categories if the average response propensity over the categories is constant:

$$\bar{\rho}_h = \frac{1}{N_h} \sum_{k=1}^{N_h} \rho_{hk} = \rho, \qquad h = 1, \dots, H$$

where $N_h$ is the population size of category $h$, $\rho_{hk}$ is the response propensity of unit $k$ in class $h$, and the summation is over all units in this category.

R-indexes (representativeness), RISQ project
Schouten and Cobben (2007)
The response probability:

$$\rho_i = P(r_i = 1 \mid s_i = 1) = \frac{\exp(\beta' X_i)}{1 + \exp(\beta' X_i)}$$

Auxiliary variables in our study are:
• industrial division as classified in NACE Rev. 1.1 (2-digit sectors)
• 3 size classes: 20 to 49, 50 to 249, and 250 and over persons employed
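The following sketch shows how an R2-type indicator could be computed once response propensities are available. It assumes the variance-based form R2 = 1 - 4 S^2, where S^2 is the design-weighted variance of the estimated propensities. As a simplification that is not the authors' model, propensities are estimated as response rates within the classes of a single categorical variable, which coincides with a saturated logistic regression on that variable. All function names and data are hypothetical.

```python
from collections import defaultdict

def stratum_propensities(strata, responded):
    """Estimate each unit's response propensity as the response rate of its stratum.

    Assumption: one categorical auxiliary variable; this equals the fitted
    values of a saturated logistic regression on that variable.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for h, r in zip(strata, responded):
        totals[h] += 1
        hits[h] += r
    return [hits[h] / totals[h] for h in strata]

def r2_indicator(propensities, weights=None, population_size=None):
    """Return 1 - 4 * S^2, with S^2 the weighted variance of the propensities.

    weights are design weights 1/pi_i for the sampled units (default 1.0);
    population_size defaults to the sum of the weights.
    """
    n = len(propensities)
    w = weights if weights is not None else [1.0] * n
    big_n = population_size if population_size is not None else sum(w)
    mean = sum(wi * p for wi, p in zip(w, propensities)) / big_n
    s2 = sum(wi * (p - mean) ** 2 for wi, p in zip(w, propensities)) / (big_n - 1)
    return 1.0 - 4.0 * s2

# Hypothetical example: six enterprises in two size classes.
size_class = ["20-49", "20-49", "20-49", "20-49", "250+", "250+"]
in_panel = [1, 0, 0, 0, 1, 1]  # 1 = enterprise is retained in the panel
rho_hat = stratum_propensities(size_class, in_panel)
r2 = r2_indicator(rho_hat)
print(rho_hat, round(r2, 3))
```

With identical propensities in every class the indicator equals 1; the more the class-level rates diverge, as in this deliberately extreme toy example, the lower it falls.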
R-indexes (representativeness)
Schouten and Cobben (2007)
[Formulas not recoverable from the extracted text: an expression for the bias term B(ρ̂) and a related expression involving R̂2 and the standard deviation S of the response propensities.]

Fligner-Policello test
Task: for each year, verify the equality between the IDB (or panel) and the universe of the distributions of the relative shares of the economic divisions, with respect to the totals, for the three variables considered.

The Fligner-Policello test makes no assumptions:
• of normality
• of equal variances
• that the two distributions have a similar shape

It is a test of stochastic equality between two distributions: rejection of the null hypothesis means that the two distributions are different in probability. If the null hypothesis is rejected, the sign of the F-P statistic indicates which of the two distributions is dominant: a positive sign means that the panel shares have a higher probability of taking greater values than the population shares.
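The test described above can be sketched in a few lines. This is a minimal pure-Python version of the textbook Fligner-Policello statistic based on placements, with a normal approximation for the two-sided p-value and ties counted as one half; it is an illustrative sketch, not the authors' implementation, and the function name and data are hypothetical. The sign convention follows the slide: the statistic is positive when the second sample (here the panel) tends to take larger values.

```python
from math import sqrt, erfc, inf, copysign

def placements(a, b):
    """For each value in a, count the values in b below it (ties count 0.5)."""
    return [sum(1.0 if v < u else 0.5 if v == u else 0.0 for v in b) for u in a]

def fligner_policello(x, y):
    """Return (statistic, two-sided p-value) using the normal approximation.

    A positive statistic indicates that values in y tend to exceed values in x.
    """
    p = placements(x, y)  # number of y values below each x value
    q = placements(y, x)  # number of x values below each y value
    p_bar, q_bar = sum(p) / len(p), sum(q) / len(q)
    v1 = sum((pi - p_bar) ** 2 for pi in p)
    v2 = sum((qj - q_bar) ** 2 for qj in q)
    num = sum(q) - sum(p)
    den = 2.0 * sqrt(v1 + v2 + p_bar * q_bar)
    if den == 0.0:
        # Degenerate case: the two samples do not overlap at all.
        stat = 0.0 if num == 0 else copysign(inf, num)
    else:
        stat = num / den
    p_value = erfc(abs(stat) / sqrt(2.0))
    return stat, p_value

# Hypothetical shares of four economic divisions in universe and panel.
universe = [1.0, 3.0, 5.0, 7.0]
panel = [2.0, 4.0, 6.0, 8.0]
stat, p_value = fligner_policello(universe, panel)
print(round(stat, 4), round(p_value, 4))
```

With so few observations the normal approximation is rough; for small samples, exact or tabulated critical values would normally be preferred.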