Presentation

Deriving Educational Attainment
by combining data from
Administrative Sources and Sample Surveys
Recent developments towards the 2011 Census
Frank Linder
(co-author Dominique van Roon)
Statistics Netherlands
Conference Statistics Investment to the future 2
Prague, 14-15 September 2009
Contents
 Importance of data on education
 Data Sources education level
- traditionally
- new alternative
 Innovation in micro-integration:
new way of combining administrative sources and sample surveys
 Social Statistical Database (SSD), Virtual Census
 Educational attainment, new method explained
- micro-integration steps
- weighting strategy
 Accuracy
 Conclusions
Conference Statistics Investment to the future 2, Prague 14-15 september 2009
Importance of data on education
 Education key social indicator for government policy
and socio-economic research
EU Lisbon Strategy 2000: “Education and training policies are central
to the creation and transmission of knowledge and are determining
factor in each society’s potential for innovation…
Positive impact of education on employment, health, social inclusion
and active citizenship has already been extensively shown”
 Educational Attainment standard variable in
Census Programme
 Great demand for data on education level by
researchers, background variable for their analyses
 Educational Attainment central position in
Social Statistical Database (SSD)
Data Sources education level,
traditionally
 Labour Force Survey (LFS), exclusive domain
- complete education career until date of interview
- educational attainment
 Reliability small subpopulations problematic
 In practice LFS-solution:
- unified sample over a period of consecutive years =>
more observations, lower standard errors
- assumption: stability of variable over the period
 LFS still used as source for education
Data Sources education level,
new alternative
 New administrative education registers in last decade
- new opportunities determination educational attainment
- full coverage target population of register =>
more observations =>
more reliable estimates (in particular small populations!)
 However: the alternative still depends on the LFS!
- Registers do not cover e.g. people educated before the registers started (mostly
older citizens), private education and studies abroad
Innovation in micro-integration
 Combining statistical information from administrative
sources and sample surveys for the sake of one variable
EXAMPLE
Integration conventional way (different variables):
- Jobs register (employee population) + LFS (education level)
- Result: coherent information on education level of employees
Integration new way (one variable):
- Education register 1, ..., Education register k (education level) + LFS (education level)
- Result: education level of target population
Social Statistical Database (SSD),
Virtual Census
 Integration Framework Social Statistics
- Micro-linkage and micro-integration of data
on demographic and socio-economic issues
- Data sources:
∙ administrative registers (primarily)
∙ sample surveys (if no information in registers)
- Coherence, consistency, comprehensiveness,
completeness, detailedness, 1 figure - 1 phenomenon
- Important basis for the production of social statistics
 Kind of information
- Demography, labour, social security, income,
health care, security, housing, etc.
- Education level (previously LFS, now new method)
 Key source for Virtual Census 2001 and 2011
Educational Attainment, new method, step 1
Construction Education Archive
 Collecting sources education data
 Storage in Education Archive
Registers
- Primary education: ENR 2010..
- Secondary education: ENR '02.., ERR '99.., CREHE prelim
- Higher education: CREHE '83..
- Other registers: CWI '90.., SFR '95.., RSF '01..
Sample Surveys
- Labour Force Survey '96..
Education Archive (cumulative storage by annual addition)
Education Archive
Extract from Education Archive
Educational Attainment, new method, step 2
Construction Educational Attainment File
 Construct Education Attainment File (EAF) containing
highest attained education level of individuals at reference date
 Selection from Education Archive (micro-integration)
- Records representing education careers until reference date (micro-integration:
derive and impute missing start and end dates)
 Adjust to target population (micro-integration)
- E.g. eliminate foreign students from the Education Archive; supplement
primary education by imputation from the Population Register (PR 0-14)
 Quality assessment sources (micro-integration)
- Which sources to be used, which neglected (e.g. CWI)?
 Assessing validity education levels at reference date
- Decision rules: deterministic and stochastic (based on probabilities)
 Determine highest valid level during education career
Example
registers: SCED 43 43 53 60
LFS-sample: SCED 20 – 33 – 43 – 53 – 60
highest valid level: SCED 60
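The last step, picking the highest valid level from an education career, can be sketched as follows. This is a minimal illustration and not the production rule set: the record layout, the field names and the assumption that a higher SCED code always means a higher level are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EducationRecord:
    person_id: int
    level: int    # SCED code; assumed here to increase with education level
    source: str   # e.g. "register" or "LFS"
    valid: bool   # outcome of the validity assessment at the reference date


def highest_valid_level(records):
    """Return {person_id: highest valid level}; persons without valid records are left out."""
    best = {}
    for rec in records:
        if rec.valid and rec.level > best.get(rec.person_id, -1):
            best[rec.person_id] = rec.level
    return best


# The two careers from the example, with every record assumed valid.
career = [EducationRecord(1, lvl, "register", True) for lvl in (43, 43, 53, 60)]
career += [EducationRecord(2, lvl, "LFS", True) for lvl in (20, 33, 43, 53, 60)]
result = highest_valid_level(career)
print(result)  # {1: 60, 2: 60}
```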
Validity education level at reference date
 Deterministic decision rules
Examples:
- Last record of a person in Education Archive: secondary education
with certificate in 2003.
Education level still valid at reference date 2006? Yes, because
not found in register of Higher Education.
- Someone obtained a doctorate in 2005. This remains the highest attained education
level for the whole period afterwards, because no higher level is distinguished (trivial!)
 Stochastic decision rules
- Probability education level is still valid at reference date
(application Survival Analysis, Life Tables method)
Example: last observed date D. Is level valid x years after D?
Upper bound U: within period [D;D+U] still valid (95%)
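A rough sketch of how the two kinds of decision rules could be combined for a single record. The function name, its parameters, the order in which the rules are applied and the value U = 3 used in the calls are assumptions made for illustration; the slides do not specify them.

```python
def level_still_valid(attained_year, reference_year, upper_bound_years,
                      highest_level=False, absent_from_higher_register=False):
    """Decide whether the last observed education level still holds at the reference date."""
    if reference_year < attained_year:
        return False  # the level had not been attained yet at the reference date
    # Deterministic rules: no higher level is distinguished, or a register of
    # the next level (assumed to have full coverage) shows no trace of the person.
    if highest_level or absent_from_higher_register:
        return True
    # Stochastic rule: the level counts as valid only inside the window [D, D + U].
    return reference_year <= attained_year + upper_bound_years


# Secondary certificate in 2003, not found in the higher education register.
print(level_still_valid(2003, 2006, 3, absent_from_higher_register=True))  # True
# No register information: fall back on the window [D, D + U] with U = 3.
print(level_still_valid(2003, 2006, 3))  # True
print(level_still_valid(2003, 2007, 3))  # False
```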
Figure: Stochastic decision rule. Time axis starting at D, the date of attainment of the education level: the level is valid while the reference date lies in [D; D+U] and invalid once the reference date lies beyond D+U.
Weighting Strategy (1),
structure EAF
Figure: Education Attainment File, September 2005. Bar on an axis in mln persons showing the register records (ERR, ENR, CREHE prelim, CREHE, PR age 0-14), the LFS sample records, and the remaining population not covered by EAF (9.8 mln), which is reached by inflating the LFS records.
EAF September 2005: mixture of register and sample (LFS) records
- Coverage: 6.5 mln records (5.8 mln register and 0.675 mln sample)
- NL population: 16.3 mln people.
- Sample inflation to bridge gap of 9.8 mln people
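The figures above imply a rough average inflation factor for the LFS records. The calculation below is only a back-of-the-envelope sketch: it assumes that the LFS records together represent everyone not covered by a register record, it uses the rounded figures from the slide, and the variable names are illustrative.

```python
# Rounded figures from the slide, in millions of persons.
population = 16.3        # NL population
register_records = 5.8   # register records in the EAF
sample_records = 0.675   # LFS records in the EAF

# Assumption: the LFS records are inflated so that they represent
# everyone who is not covered by a register record.
to_be_represented = population - register_records
average_weight = to_be_represented / sample_records

print(f"LFS records represent about {to_be_represented:.1f} mln persons")
print(f"average inflation weight of about {average_weight:.1f}")
```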
Weighting Strategy (2),
representativeness
Figure: Population coverage by source and age in Education Attainment File, September 2005. Vertical axis: coverage percentage (0-100); horizontal axis: age (0-90). Series: PR 0-14, Registers, LFS ≥15, and total coverage (PR 0-14 + Registers + LFS ≥15).
- Younger people are better represented in EAF (registers!)
- Older people are underrepresented in EAF
- To make EAF representative of the NL population: calibration of LFS weights
Weighting Strategy (3),
weighting model
 weighting model, variables
- demographic (e.g. sex, age, country of origin,…)
- socio-economic (e.g. socio-econ. category, income,…)
- education CWI-register (proxy)
 Weighting model too extensive?
- pro: consistency with as many population margins as possible
- con: fluctuating final weights, which disrupt
accuracy; problematic in cells with few observations
- revision of the weighting strategy is being considered
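To make the idea of calibrating weights to population margins concrete, here is a small sketch using raking (iterative proportional fitting). Raking is chosen here only as a familiar calibration technique; the slides do not say which calibration method is used, and the records, margins and starting weights below are hypothetical toy data.

```python
def rake(weights, records, margins, iterations=50):
    """Iteratively scale weights until the weighted totals match each margin.

    weights: starting weights, one per sample record
    records: one dict per record, e.g. {"sex": "M", "age": "15-44"}
    margins: known population totals, e.g. {"sex": {"M": 50, "F": 50}}
    """
    weights = list(weights)
    for _ in range(iterations):
        for variable, targets in margins.items():
            # Current weighted total per category of this variable.
            totals = {}
            for weight, record in zip(weights, records):
                category = record[variable]
                totals[category] = totals.get(category, 0.0) + weight
            # Scale every weight so that the category totals hit the targets.
            weights = [
                weight * targets[record[variable]] / totals[record[variable]]
                for weight, record in zip(weights, records)
            ]
    return weights


# Hypothetical toy data: four sample records and two population margins.
records = [
    {"sex": "M", "age": "15-44"},
    {"sex": "M", "age": "45+"},
    {"sex": "F", "age": "15-44"},
    {"sex": "F", "age": "45+"},
]
margins = {"sex": {"M": 50, "F": 50}, "age": {"15-44": 40, "45+": 60}}
calibrated = rake([10, 10, 10, 10], records, margins)
print([round(w, 2) for w in calibrated])
```

Every extra margin adds another scaling step per iteration, which is one way to see why a model with many margins can produce strongly fluctuating final weights in cells with few records.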
Accuracy and dissemination
 Education levels accurate enough for dissemination?
 Measuring instrument needed for determination accuracy:
variance estimator
 Standard sampling-theory literature mainly covers sample-only data,
much less combined register and sample data
- approximation formulas: only to be used for larger n.
Problematic for the smaller subpopulations in which we are
particularly interested
- solution also applicable for smaller subpopulations:
bootstrap resampling method for accuracy measurement
developed by methodologists of Statistics Netherlands
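The bootstrap method itself is not described on the slides. The sketch below is therefore a generic, naive bootstrap for an estimator of the form "register count plus weighted sample count", meant only to show the principle: sample records are resampled with replacement, the estimate is recomputed each time, and the CV is the standard deviation of the replicates divided by their mean. It does not recalibrate the weights in each replicate, and all names and numbers are hypothetical.

```python
import random
import statistics


def bootstrap_cv(register_count, sample_weights, sample_in_cell,
                 replicates=1000, seed=1):
    """Naive bootstrap CV of: register count + weighted sample count in a cell."""
    rng = random.Random(seed)
    n = len(sample_weights)
    estimates = []
    for _ in range(replicates):
        # Resample the sample records with replacement; register records are fixed.
        drawn = [rng.randrange(n) for _ in range(n)]
        weighted = sum(sample_weights[i] for i in drawn if sample_in_cell[i])
        estimates.append(register_count + weighted)
    return statistics.stdev(estimates) / statistics.fmean(estimates)


# Hypothetical cell: 500 sample records with weight 20, of which 9 fall in the cell.
weights = [20.0] * 500
in_cell = [i < 9 for i in range(500)]
cv_mixed = bootstrap_cv(286, weights, in_cell)       # with 286 register records
cv_sample_only = bootstrap_cv(0, weights, in_cell)   # without register records
print(round(cv_mixed, 2), round(cv_sample_only, 2))
```

In this toy setting the register records add to the estimate without adding sampling variance, so the CV with register records comes out lower than the CV of the sample-only estimate.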
Accuracy, small subpopulations
Young highly educated persons of Turkish origin, 2005
Coefficient of variation (CV), LFS 2005, approximation

Subpopulation (highly educated)       Ň      n     CV
Turkish males; age 18-24              0      0     -
Turkish females; age 18-24          382      2     >20%
Turkish males; age 25-30           1175      7     >20%
Turkish females; age 25-30         1071      6     >20%

n is number of observations (unweighted number)
Ň is estimate of total population (weighted number)

Coefficient of variation (CV), EAF 2005, bootstrap estimation

Subpopulation (highly educated)      N1     n2     Ň2      Ň     CV
Turkish males; age 18-24             39      1    129    168    32%
Turkish females; age 18-24           58      3    157    215    42%
Turkish males; age 25-30            286      9    415    701    20%
Turkish females; age 25-30          321      9    367    688    19%

N1 is number of register observations
n2 is number of sample observations
Ň2 is weighted number of sample observations
Ň = N1 + Ň2 is estimate of total population
Conclusions
 Results new method deriving educational attainment are
promising
- Similarity with results traditional LFS-estimation at high
aggregation level
- Outperforms LFS for small populations. So more
dissemination possibilities for small populations
 Serious opportunity to innovate the Census of 2011
by producing educational attainment according to
the new method
Remarks
Questions
Discussion
For more details read our paper!
SSD-system, organizational units
SSD-system: SSD-core (pivot) and SSD-satellites
Figure: SSD Core linked to its satellites: Labour Market, Education, Ethnic Minorities, Income, Health Care, Social Cohesion (not a complete list of SSD-satellites).
 Manageability SSD-system: split in smaller units
 Core: demographic and socio-economic information relevant
in almost any field
Educational attainment core-position: crucial in many social processes
 Satellite: specific topics
 Core and satellites consistent: 1-figure 1-phenomenon
Population Census in the Netherlands
 Until 1971 conventional: field enumeration
 2001: Virtual Census
- Key source: Social Statistical Database (SSD)
- greater part from registers.
- educational attainment from LFS (2000/2001)
Virtual Census results convincing =>
build on these experiences for Census 2011
 2011: Virtual Census
- Task force in charge of elaboration
- Key source: much more comprehensive SSD
- Educational attainment new method recommended (detailed
table Census table programme)
Survival Analysis, Life Tables Method
LIFE TABLES: cumulative proportion surviving, S(t), by interval start time t (years)

  t    S(t)        t    S(t)        t    S(t)
  0    1.0000
  1    0.9964     11    0.8441     21    0.7442
  2    0.9833     12    0.8409     22    0.7442
  3    0.9652     13    0.8272     23    0.7442
  4    0.9314     14    0.8147     24    0.7255
  5    0.9098     15    0.8147     25    0.7255
  6    0.8906     16    0.7905     26    0.7255
  7    0.8773     17    0.7905     27    0.7255
  8    0.8657     18    0.7850     28    0.7255
  9    0.8570     19    0.7787     29    0.7255
 10    0.8484     20    0.7482     30    0.7255

The survival function S(t) = P[T ≥ t] gives the probability that the educational
attainment has not changed within t years. The distribution was determined
empirically on the basis of the LFS for a number of years.
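As an illustration of how an upper bound U for the stochastic decision rule might be read off such a life table, the sketch below takes the first ten years of the table and returns the largest t for which S(t) is still at least 0.95. Treating the 95% criterion as "S(U) ≥ 0.95" is an interpretation, and the function name is hypothetical.

```python
# S(t) for t = 0..10 years, copied from the life table on this slide.
SURVIVAL = {
    0: 1.0000, 1: 0.9964, 2: 0.9833, 3: 0.9652, 4: 0.9314, 5: 0.9098,
    6: 0.8906, 7: 0.8773, 8: 0.8657, 9: 0.8570, 10: 0.8484,
}


def upper_bound(survival, level=0.95):
    """Largest t such that S(s) >= level for every s up to and including t."""
    bound = 0
    for t in sorted(survival):
        if survival[t] < level:
            break
        bound = t
    return bound


print(upper_bound(SURVIVAL))        # 3
print(upper_bound(SURVIVAL, 0.85))  # 9
```

Under this reading, and with these particular table values, an education level would be treated as still valid up to three years after the last observed date.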
CV-formulae, sample
- $N$: population size
- $n$: sample size
- $N(g)$: total population in cell $g$
- $n(g)$: number of sample observations in cell $g$
- $p = N(g)/N$, $q = 1 - p$ and $f = n/N$
With $n$ large enough, the variance of $\check{N}(g)$ can be
approximated as: $\mathrm{var}(\check{N}(g)) = N^2 p q (1-f)/n$
With the assumption that the average sample fraction $f$ is
very small (e.g. LFS sample fraction is about 1 percent),
and $p$ is very small (i.e. a relatively small subpopulation), the
coefficient of variation (CV) can be approximated as:
$[q(1-f)/(np)]^{1/2} \approx [1/n(g)]^{1/2}$
So a CV ≤ 20% implies $n(g) \geq 25$ (threshold)
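A small numerical check of the approximation, as a sketch only. The function names are hypothetical; the population of 16.3 mln and the sample fraction of about 1 percent come from the slides, while the cell with 25 expected sample observations is made up for the illustration.

```python
import math


def cv_sample(population, sample_size, cell_share):
    """CV of the estimated cell total, [q(1-f)/(np)]^(1/2)."""
    q = 1.0 - cell_share
    f = sample_size / population
    return math.sqrt(q * (1.0 - f) / (sample_size * cell_share))


def min_cell_observations(max_cv):
    """Smallest n(g) for which [1/n(g)]^(1/2) <= max_cv."""
    # round() guards against floating-point noise just above an integer.
    return math.ceil(round((1.0 / max_cv) ** 2, 9))


N = 16_300_000   # population size
n = 163_000      # sample size, a 1 percent sample fraction
p = 25 / n       # hypothetical cell with 25 expected sample observations

print(round(cv_sample(N, n, p), 3))   # close to 0.2
print(min_cell_observations(0.20))    # 25
```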
CV-formulae, mixture sample/register
- $N_1(g)$: number of register observations in cell $g$
- $N_2(g)$: weighted number of sample observations in cell $g$
- $N(g) = N_1(g) + N_2(g)$
- $n_2(g)$: (unweighted) number of sample
observations in cell $g$
For a coefficient of variation of not more than $A$ it is
required that
$n_2(g) \geq (1/A)^2 \, [N_2(g)/N(g)]^2$
With a higher $N_1(g)$ the threshold decreases
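One way to read this threshold: the register count carries no sampling variance, so only the weighted sample part contributes, and with the sample-only approximation for that part the CV of the total is roughly [N2(g)/N(g)] · [1/n2(g)]½. The sketch below applies both expressions to the figures for Turkish males aged 25-30 from the slide "Accuracy, small subpopulations" (N1 = 286, weighted sample 415, 9 sample observations). The function names are hypothetical and the CV expression is this derived approximation, not the bootstrap estimate.

```python
import math


def min_sample_observations(max_cv, weighted_sample, total):
    """Smallest n2(g) satisfying n2(g) >= (1/A)^2 * [N2(g)/N(g)]^2."""
    ratio = weighted_sample / total
    # round() guards against floating-point noise just above an integer.
    return math.ceil(round((1.0 / max_cv) ** 2 * ratio ** 2, 9))


def approx_cv_mixed(sample_obs, weighted_sample, total):
    """Approximate CV of the total: [N2(g)/N(g)] * [1/n2(g)]^(1/2)."""
    return (weighted_sample / total) / math.sqrt(sample_obs)


n1, n2_weighted, n2_obs = 286, 415, 9
total = n1 + n2_weighted

print(min_sample_observations(0.20, n2_weighted, total))      # 9
print(round(approx_cv_mixed(n2_obs, n2_weighted, total), 3))  # 0.197
# Without register observations the threshold falls back to the sample-only value.
print(min_sample_observations(0.20, 100, 100))                # 25
```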
Silva & Skinner (1997)
• Simulation study
• Adding auxiliary variables in a regression model causes
variance of regression estimator to drop initially, but by
adding still more variables the variance will tend to increase
from a certain point on.
Figure: variance of the regression estimator (vertical axis, 0 to 3500) against the number of auxiliary variables included (horizontal axis, 1 to 9).