Deriving Educational Attainment by combining data from Administrative Sources and Sample Surveys
Recent developments towards the 2011 Census

Frank Linder (co-author Dominique van Roon)
Statistics Netherlands
Conference Statistics Investment to the future 2
Prague, 14-15 September 2009

Contents
- Importance of data on education
- Data sources education level
  - traditionally
  - new alternative
- Innovation in micro-integration: new way of combining administrative sources and sample surveys
- Social Statistical Database (SSD), Virtual Census
- Educational attainment, new method explained
  - micro-integration steps
  - weighting strategy
- Accuracy
- Conclusions

Importance of data on education
- Education is a key social indicator for government policy and socio-economic research
- EU Lisbon Strategy 2000: “Education and training policies are central to the creation and transmission of knowledge and are determining factor in each society’s potential for innovation… Positive impact of education on employment, health, social inclusion and active citizenship has already been extensively shown”
- Educational attainment is a standard variable in the Census Programme
- Great demand for data on education level from researchers, as a background variable for their analyses
- Educational attainment has a central position in the Social Statistical Database (SSD)

Data sources education level, traditionally
Labour Force Survey (LFS), exclusive domain
- complete education career until date of interview
- educational attainment
Reliability for small subpopulations is problematic
In practice, LFS solution:
- unified sample over a period of consecutive years => more observations, lower standard errors
- assumption: stability of the variable over the period
LFS still used as a source for education

Data sources education level, new alternative
New administrative education registers in the last decade
- new opportunities for determining educational attainment
- full coverage of the register's target population => more observations => more reliable estimates (in particular for small populations!)
However, the alternative is still dependent on the LFS!
- registers do not cover, e.g., people educated before the administrations started (mostly older citizens), private education, studies abroad

Innovation in micro-integration
Combining statistical information from administrative sources and sample surveys for the sake of one variable
EXAMPLE
- Integration conventional way (different variables): Jobs register (employee population) + LFS (education level) => coherent information on education level of employees
- Integration new way (one variable): Education register 1 (education level) + ... + Education register k (education level) + LFS (education level) => education level of target population

Social Statistical Database (SSD), Virtual Census
Integration framework for social statistics
- Micro-linkage and micro-integration of data on demographic and socio-economic issues
- Data sources:
  ∙ administrative registers (primarily)
  ∙ sample surveys (if no information in registers)
- Coherence, consistency, comprehensiveness, completeness, level of detail, 1 figure - 1 phenomenon
- Important base for the production of social statistics
Kind of information
- Demography, labour, social security, income, health care, security, housing, etc.
- Education level (previously LFS, now new method)
Key source for the Virtual Census 2001 and 2011

Educational Attainment, new method, step 1
Construction of the Education Archive
Collecting sources of education data, storage in the Education Archive
Registers
Primary education
- ENR2010..
Secondary education
- ENR’02..
- ERR’99..
- CREHE prelim
Higher education
- CREHE’83..
Other registers
- CWI’90..
- SFR’95..
- RSF’01..
Sample surveys
- Labour Force Survey ’96..
Education Archive (cumulative storage by annual addition)

Education Archive
[Figure: extract from the Education Archive]

Educational Attainment, new method, step 2
Construction of the Educational Attainment File
Construct the Education Attainment File (EAF), containing the highest attained education level of individuals at the reference date
Selection from Education Archive (micro-integration)
- Records representing education careers until the reference date (micro-integration: derive and impute missing start and end dates)
Adjust to target population (micro-integration)
- E.g. eliminate foreign students in the Education Archive; supplement primary education by imputation from the Population Register (PR 0-14)
Quality assessment of sources (micro-integration)
- Which sources are to be used, which neglected (e.g. CWI)?
Assessing validity of education levels at reference date
- Decision rules: deterministic and stochastic (based on probabilities)
Determine highest valid level during education career
Example
- registers: SCED 43 43 53 60
- LFS sample: SCED 20 – 33 – 43 – 53 – 60 60

Validity education level at reference date (optional sheet)
Deterministic decision rules
Examples:
- Last record of a person in the Education Archive: secondary education with certificate in 2003. Is the education level still valid at reference date 2006? Yes, because the person is not found in the register of Higher Education.
- Someone obtained a doctorate in 2005. This gives the highest attained education level for the period after, because we distinguish no higher level (trivial!)
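The deterministic rules above can be illustrated with a minimal Python sketch. It is not the production procedure of Statistics Netherlands: the record layout (`CareerRecord`), the function name and the example career are hypothetical, and the sketch assumes that level codes are ordered so that a numerically higher code means a higher education level. Under that assumption a level stays valid until a higher one is observed, so the attained level at the reference date is the maximum over the career up to that date.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class CareerRecord:
    """One education record of a person (hypothetical layout)."""
    level: int         # education level code; assumed: higher code = higher level
    attained_on: date  # date on which the level was attained


def highest_valid_level(records: Iterable[CareerRecord],
                        reference_date: date) -> Optional[int]:
    """Return the highest level attained on or before the reference date.

    Deterministic rule: a level remains valid as long as no higher level
    has been observed, so the maximum over all records up to the
    reference date is the attained level. Returns None when no record
    falls on or before the reference date.
    """
    levels = [r.level for r in records if r.attained_on <= reference_date]
    return max(levels) if levels else None


# Made-up career, for illustration only.
career = [
    CareerRecord(level=43, attained_on=date(2003, 6, 30)),
    CareerRecord(level=53, attained_on=date(2005, 6, 30)),
    CareerRecord(level=60, attained_on=date(2008, 6, 30)),
]
print(highest_valid_level(career, date(2006, 9, 30)))  # prints 53
```

The record attained in 2008 is ignored for a 2006 reference date, which mirrors the idea of following education careers only until the reference date.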
Stochastic decision rules
- Probability that an education level is still valid at the reference date (application of Survival Analysis, Life Tables method)
Example: last observed date D. Is the level valid x years after D?
Upper bound U: within the period [D; D+U] the level is still valid (95%)
[Figure: stochastic decision rule. Timeline from D, the date of attainment of the education level, to the reference date; the level is valid up to D+U and invalid after D+U]

Weighting Strategy (1), structure EAF
[Figure: Education Attainment File, September 2005, on a scale in mln persons. Register part (ERR, ENR, CREHE prelim, CREHE, PR age 0-14), LFS part, and the remaining population not covered by the EAF (9.8 mln), bridged by inflating the LFS]
EAF September 2005: mixture of register and sample (LFS) records
- Coverage: 6.5 mln records (5.8 mln register and 0.675 mln sample)
- NL population: 16.3 mln people
- Sample inflation to bridge the gap of 9.8 mln people

Weighting Strategy (2), representativeness
[Figure: population coverage by source and age in the Education Attainment File, September 2005. Coverage percentage (0-100) against age (0-90) for PR 0-14, Registers, LFS ≥15, and total coverage (PR 0-14 + Registers + LFS ≥15)]
- Younger people are better represented in the EAF (registers!)
- Older people are underrepresented in the EAF
- Make the EAF representative of the NL population: calibration of LFS weights

Weighting Strategy (3), weighting model
Weighting model, variables
- demographic (e.g. sex, age, country of origin, …)
- socio-economic (e.g. socio-economic category, income, …)
- education: CWI register (proxy)
Weighting model too abundant?
- pro: consistency with as many population margins as possible
- con: fluctuation of final weights, disruptive effect on accuracy, problematic in cells with few observations
- revision of the weighting strategy is being considered

Accuracy and dissemination
Are the education levels accurate enough for dissemination?
Measuring instrument needed to determine accuracy: variance estimator
Standard literature on sampling theory mainly covers sample-only data, less so combined register and sample data
- approximation formulas: only to be used for larger n. Problematic for smaller subpopulations, in which we are particularly interested
- solution also applicable to smaller subpopulations: bootstrap resampling method for accuracy measurement, developed by methodologists of Statistics Netherlands

Accuracy, small subpopulations
Young highly educated persons of Turkish origin, 2005

Coefficient of variation (CV), LFS 2005, approximation
Subpopulation                                  n     Ň       CV
Turkish males; age 18-24; highly educated      0     0       -
Turkish females; age 18-24; highly educated    2     382     >20%
Turkish males; age 25-30; highly educated      7     1,175   >20%
Turkish females; age 25-30; highly educated    6     1,071   >20%
n is the number of observations (unweighted number)
Ň is the estimate of the total population (weighted number)

Coefficient of variation (CV), EAF 2005, bootstrap estimation
Subpopulation                                  N1    n2    Ň2    Ň     CV
Turkish males; age 18-24; highly educated      39    1     129   168   32%
Turkish females; age 18-24; highly educated    58    3     157   215   42%
Turkish males; age 25-30; highly educated      286   9     415   701   20%
Turkish females; age 25-30; highly educated    321   9     367   688   19%
N1 is the number of register observations
n2 is the number of sample observations
Ň2 is the weighted number of sample observations
Ň = N1 + Ň2 is the estimate of the total population

Conclusions
Results of the new method for deriving educational attainment are promising
- Similar to the results of traditional LFS estimation at a high aggregation level
- Outperforms the LFS for small populations, so more dissemination possibilities for small populations
Serious opportunity for innovation in the Census of 2011: produce educational attainment according to the new method

Remarks, questions, discussion
For more details read our paper!

SSD-system, organizational units (optional sheets, spare time)
SSD-system: SSD-core (pivot) and SSD-satellites
[Figure: SSD Core surrounded by the satellites Labour Market, Education, Ethnic Minorities, Income, Health Care and Social Cohesion. Not a complete presentation of the SSD-satellites!]
- Manageability of the SSD-system: split into smaller units
- Core: demographic and socio-economic information relevant in almost any field
- Educational attainment has a core position: crucial in many social processes
- Satellite: specific topics
- Core and satellites consistent: 1 figure - 1 phenomenon

Population Census in the Netherlands (optional sheets, spare time)
Until 1971 conventional: field enumeration
2001: Virtual Census
- Key source: Social Statistical Database (SSD)
- greater part from registers
- educational attainment from LFS (2000/2001)
Virtual Census results convincing => build on these experiences for Census 2011
2011: Virtual Census
- Task force in charge of working out the details
- Key source: much more comprehensive SSD
- Educational attainment: new method recommended (detailed table in Census table programme)

Survival Analysis, Life Tables Method (optional sheets, spare time)
LIFE TABLES: cumulative proportion surviving at end of interval, S(t), by interval start time t

  t    S(t)        t    S(t)        t    S(t)
  0    1.0000
  1    0.9964     11    0.8441     21    0.7442
  2    0.9833     12    0.8409     22    0.7442
  3    0.9652     13    0.8272     23    0.7442
  4    0.9314     14    0.8147     24    0.7255
  5    0.9098     15    0.8147     25    0.7255
  6    0.8906     16    0.7905     26    0.7255
  7    0.8773     17    0.7905     27    0.7255
  8    0.8657     18    0.7850     28    0.7255
  9    0.8570     19    0.7787     29    0.7255
 10    0.8484     20    0.7482     30    0.7255

The survival function S(t) = P[T ≥ t] gives the probability that the educational attainment has not changed within t years. The distribution was determined empirically on the basis of the LFS for a number of years.

CV-formulae, sample (optional sheets, spare time)
- N: population size
- n: sample size
- N(g): total population in cell g
- n(g): number of sample observations in cell g
- p = N(g)/N; q = 1-p; f = n/N
With n large enough, the variance of Ň(g) can be approximated as:
$$\operatorname{var}(\check{N}(g)) = N^2 \, p \, q \, (1-f) / n$$
Assuming that the average sample fraction f is very small (e.g. the LFS sample fraction is about 1 percent) and that p is very small (i.e. a relatively small subpopulation), the coefficient of variation (CV) can be approximated as:
$$\left[ \frac{q \, (1-f)}{n \, p} \right]^{1/2} \approx \left[ \frac{1}{n(g)} \right]^{1/2}$$
So, a CV ≤ 20% implies n(g) ≥ 25 (threshold)

CV-formulae, mixture sample/register (optional sheets, spare time)
- N1(g): number of register observations in cell g
- N2(g): weighted number of sample observations in cell g
- N(g) = N1(g) + N2(g)
- n2(g): (unweighted) number of sample observations in cell g
For a coefficient of variation of not more than A it is required that
$$n_2(g) \ge \left( \frac{1}{A} \right)^2 \left[ \frac{N_2(g)}{N(g)} \right]^2$$
With a higher N1(g) the threshold decreases

Silva & Skinner (1997) (optional sheets, spare time)
• Simulation study
• Adding auxiliary variables in a regression model causes the variance of the regression estimator to drop initially, but by adding still more variables the variance will tend to increase from a certain point on.
[Figure: variance (0-3500) against number of auxiliary variables included (1-9)]
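The two CV thresholds from the CV-formulae sheets can be checked numerically. The sketch below is only a direct rearrangement of the approximation formulas on those sheets, not the bootstrap method used for the published EAF figures, so its output need not match the bootstrap CVs. The function names are hypothetical; the example inputs reuse N1 = 286 and Ň2 = 415 from the EAF 2005 table purely as an illustration.

```python
import math


def min_sample_obs_sample_only(max_cv: float) -> int:
    """Smallest n(g) with CV <= max_cv, using CV ~ sqrt(1 / n(g))."""
    if max_cv <= 0:
        raise ValueError("max_cv must be positive")
    return math.ceil((1.0 / max_cv) ** 2)


def min_sample_obs_mixture(max_cv: float, n1_register: float,
                           n2_weighted: float) -> int:
    """Smallest n2(g) with CV <= max_cv for a register/sample mixture.

    Implements n2(g) >= (1/A)^2 * [N2(g) / N(g)]^2 with
    N(g) = N1(g) + N2(g); the register part is treated as free of
    sampling error, as the formula on the sheet implies.
    """
    if max_cv <= 0:
        raise ValueError("max_cv must be positive")
    sample_share = n2_weighted / (n1_register + n2_weighted)
    return math.ceil((sample_share / max_cv) ** 2)


def cv_mixture(n2_sample: int, n1_register: float,
               n2_weighted: float) -> float:
    """Approximate CV of N(g): sample share divided by sqrt(n2(g))."""
    if n2_sample <= 0:
        raise ValueError("n2(g) must be positive")
    sample_share = n2_weighted / (n1_register + n2_weighted)
    return sample_share / math.sqrt(n2_sample)


print(min_sample_obs_sample_only(0.20))         # prints 25
print(min_sample_obs_mixture(0.20, 286, 415))   # prints 9
print(round(cv_mixture(9, 286, 415), 3))        # prints 0.197
```

With the same 20% target, the mixture cell needs 9 sample observations instead of 25, which illustrates the remark that the threshold decreases as N1(g) grows.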