Beyond 2011: Producing Population Estimates Using Administrative Data: In Theory

July 2013
Background
The Office for National Statistics is currently taking a fresh look at options for the production of
population and small area socio-demographic statistics for England and Wales. The Beyond 2011
Programme has been established to carry out research on the options and to recommend the best
way forward to meet future user needs.
Improvements in technology and administrative data sources offer opportunities to either
modernise the existing census process, or to develop an alternative by re-using existing data
already held within government. Since methods for taking the traditional census are already
relatively well understood, most of the research is focusing on how surveys can be supplemented
by better re-use of ‘administrative’ data already collected from the public.
The final recommendation, which will be made in 2014, will balance user needs, cost, benefit,
statistical quality, and the public acceptability of all of the options. The results will have implications
for all population-based statistics in England and Wales and, potentially, for the statistical system
as a whole.
About this paper
This paper outlines how population estimates could be produced using a combination of
anonymously linked record-level administrative data and a Population Coverage Survey. It
presents an initial design for a Population Coverage Survey and some options for the estimation
framework, using a simulation study to explore the likely quality of estimates.
It is being published alongside the companion papers: Beyond 2011: Producing Population
Estimates Using Administrative Data: In Practice (Paper M7), and Beyond 2011: Matching
Anonymous Data (Paper M9).
This document is one of a series of papers to be published over coming months. These will report
our progress on researching and assessing the options, discuss our policies and methods and
summarise what we find out about individual data sources.
For more information
Paper references (Paper M1, O1, R2, etc.) used throughout refer to other papers published by
Beyond 2011, all of which are available on the Beyond 2011 pages of the ONS website.
Search ‘Beyond 2011’ at www.ons.gov.uk or contact: [email protected]
© Crown Copyright 2013
Table of Contents
1 Executive Summary
2 Introduction
   2.1 Background
   2.2 This paper
3 Background to the paper
   3.1 Proposed framework
   3.2 Definitions
   3.3 Assumptions
   3.4 Research principles
   3.5 Survey requirements and scope
4 Survey context
5 Sample design options
   5.1 Level of clustering – choice of Primary Sampling Unit
   5.2 Stratification
   5.3 Sample size
   5.4 Sample allocation and constraints
   5.5 More complex design features
   5.6 Summary of initial sample design
6 Estimation Framework
   6.1 Weighting approach
   6.2 Measurement of quality approach
   6.3 Extending the estimation approach
   6.4 Developing an estimation strategy
7 Simulation studies
   7.1 The simulation engine
   7.2 Estimators to be explored
   7.3 Areas and scenarios
   7.4 PCS sample design and sample size
   7.5 Summary of simulation runs
8 Results
   8.1 Simulated SPDs
   8.2 Analysis of estimator performance
      8.2.1 Basic simulation probabilities
      8.2.2 Different SPD scenarios
      8.2.3 Summary
      8.2.4 Results by age and sex
      8.2.5 Summary
   8.3 Analysis of different PCS sample sizes
   8.4 Analysis of impact of lower PCS response rates
   8.5 Assessment of estimates against quantitative design targets
9 Conclusions
Appendix A: Statistical quality targets
Appendix B: SPD probabilities of inclusion and incorrect inclusion
   How the SPD probabilities were derived
   Person characteristics and area type
   Allowance for erroneous enumerations in the ‘unaccounted for’ PR
   Duplicates
Appendix C: How the PCS probabilities were derived
   Modelling the 2001 Census coverage patterns
   Applying the models to the 2001 Census database
Appendix D: Evaluation criteria for simulation studies
   Relative bias
   RRMSE
   RSE
   Examples for all 3 measures
Appendix E: Additional simulation results
Glossary
References
1 Executive Summary
This paper outlines how population estimates could be produced using a combination of
anonymously linked record-level administrative data and a Population Coverage Survey (PCS).
The methodology for such an approach is much like the framework developed for the 2001 and
2011 Censuses, albeit with the census replaced by linked administrative data. This paper re-uses
the statistical technology from that framework to explore the development of a methodology that is
aligned to using administrative data.
The approach has been to build on the 2011 Census methodology, as the general structure of the
problem is similar. Based on a number of assumptions, the paper presents an initial design for a
PCS and some options for the estimation framework. The survey design is limited in complexity
due to the current lack of information about the quality of administrative data with which to optimise
the sampling. However, the proposed design can be refreshed as data are obtained.
The paper uses a simulation study to explore the likely quality of estimates. The results show a
great deal of promise in terms of precision of the population estimates. The Beyond 2011
Programme quality standards for population estimates are likely to be met, assuming further work
on stratification and on the methods for obtaining local authority populations can detect
localised differences.
However, the evaluation of bias in the simulation results has shown that the methods as they stand
perform better when over-coverage in the linked administrative data is minimised, even if the price
paid is an increase in under-coverage.
Critical to all of the results in these simulations were the assumptions underpinning them. The key
ones were high accuracy of matching and minimal erroneous inclusions (records appearing in the
administrative data which are not in the population). The impact of violating both these
assumptions is a one-to-one increase in bias. Further work on the likelihood of these errors, their
mitigation, and potential adjustment methods will be critical for confirming whether an
administrative data-based method combined with a PCS can produce population estimates of
acceptable accuracy.
2 Introduction
2.1 Background
Population and socio-demographic statistics are currently based on a 10-yearly census of the
population. There is a clear, ongoing need for high quality statistics and, whilst the 2011 Census
was successful, the traditional census is becoming increasingly costly and difficult to conduct.
There is also a demand for more frequent statistics from some users.
Improvements in technology and administrative data sources offer opportunities either to
modernise the census process, or to develop an alternative by re-using existing data already held
within government. Whilst this would require investment in the short term, it should deliver real-term savings in the longer term compared to the 2011 Census, which cost £480m over 10 years.
In May 2010 the UK Statistics Authority asked the Office for National Statistics (ONS) to investigate
the possible alternatives for England and Wales. Whilst the UK Statistics Authority is an
independent, non-Ministerial department, the final decision will be taken by Parliament because of
funding and legislative requirements. A recommendation will be made in 2014.
During the first phase, running from 2011/12 to 2014/15, the programme will assess user
requirements for small area population and socio-demographic statistics and the best
way to meet these needs. The outcome of the first phase will be a full business case underpinning
the recommendation, setting out the costs and benefits of the options considered.
A full public consultation allowing stakeholders to contribute views on the key issues and options is
planned for September to November this year.
2.2 This paper
This paper develops the statistical methodology underpinning the administrative data-based
options being considered by the Beyond 2011 Programme as an alternative approach to the
production of population estimates. A full overview and updated evaluation of all the
options can be found in Beyond 2011: Options Report (Paper O2).
The paper is focused on producing population estimates using a combination of anonymously
linked record-level administrative data and a coverage survey. It is anticipated that linked
administrative data sources cannot at present be used in isolation for the provision of high quality
population estimates since their existing quality is not of a sufficient level to provide a benchmark,
and there is no population register in England and Wales. However, a PCS could be designed to
measure the coverage properties of a Statistical Population Dataset (SPD) created by linking
individual records from administrative data sources to produce population estimates. The
methodology for such an approach is much like the framework developed for the 2001 and 2011
Censuses, albeit with the census replaced by the linked administrative data. This paper re-uses the
statistical technology from that framework to explore the development of a methodology that is
aligned to using linked administrative data.
A key objective for the Beyond 2011 Programme is to estimate the local authority (LA) usually
resident population 1 by five-year age-sex group with an acceptable degree of accuracy; ‘accuracy’ is
defined as both the width of confidence intervals on the adjusted population estimates and the
level of bias. Statistical quality standards for national and local authority level population estimates
have been set and are summarised in Appendix A. Quality standards have not yet been set
explicitly for population estimates by age and sex.

1 The place at which the person has lived continuously for at least the last 12 months, not including temporary absences for holidays or work assignments, or intends to live for at least 12 months (United Nations, 2008).
As the research phase of the Beyond 2011 Programme is ongoing, some starting assumptions are
necessary. Over time, these assumptions will be refined in response to the research findings.
Section 3 outlines the proposed framework and these high level assumptions. To ensure it is clear
what is being measured, a number of terms relating to under-coverage and over-coverage that are
used throughout the paper are defined. The paper then sets out the research principles, survey
requirements and scope.
Sections 4 and 5 describe the survey context and then develop options for the survey design and
discuss associated practical issues. Section 6 develops some options for producing population
estimates from a population coverage survey and a Statistical Population Dataset. It builds on the
high level work in the papers Beyond 2011: Exploring the Challenges of Using Administrative Data
(M2, 2012) and Beyond 2011: Administrative Data Models: Research Using Aggregate Data
(R3, 2013), bringing the models presented in those papers into a common estimation framework and
then considering some alternatives to that framework.
Section 7 of the paper then outlines a simulation framework which can be used to evaluate
different survey design and estimation methods. This will demonstrate the likely quality of
population estimates from such an approach under the assumptions underpinning the simulations.
A simulation study to evaluate the proposed sample design and estimation methods under a
number of scenarios is outlined. Section 8 analyses the results from the application of the
simulations to show the performance of the estimators, impact of sample size and impact of
different levels of survey non-response.
This is one of three detailed papers being published simultaneously which propose a framework
and methods for producing population estimates from administrative data which meet the quality
criteria. The other two papers are:
Beyond 2011: Producing Population Estimates Using Administrative Data: In Practice (M7) and
Beyond 2011: Producing Population Estimates Using Administrative Data (M6)
The latter provides a summary of the framework and findings from all three papers, bringing
together the evidence on whether such an approach is feasible.
3 Background to the paper
3.1 Proposed framework
The proposed framework for producing population estimates from administrative data, developed
in Beyond 2011: Producing Population Estimates Using Administrative Data: In Practice (M7), is
shown in Figure 1. As noted in the introduction, this is similar to the framework for the 2011
Census coverage assessment methodology, where the linked record-level administrative data
replace the census. ONS (2013d) describes the processes to create the SPD. The stages are set
out in Figure 1 and are listed below:
Stage 1: Evaluate the quality and integrity of the administrative data sources individually.
Stage 2: Anonymously match data sources at record level.
Stage 3: Construct an SPD through a set of ‘rules’ to determine, based on the matching information, which records and/or matches are included in the SPD.
Stage 4: Carry out a PCS which is designed to measure both the under- and over-coverage of the SPD.
Stage 5: Produce estimates of the usually resident population at LA level by age and sex through an estimation process.
Figure 1 - Framework for producing population estimates under a linked record-level administrative data design
[Diagram: admin datasets (Stage 1) are anonymously matched (Stage 2); rules construct Statistical Population Datasets (SPDs) (Stage 3); a Coverage Survey is carried out (Stage 4); estimation (Stage 5) produces the estimates.]
3.2 Definitions
These definitions are based on a target population of usual residents in England and Wales
(defined on a 12 month plus basis) assuming that an individual’s correct location is defined at
output area (OA) level, as initially this is the lowest level of outputs. However, this will be explored
further, as it has strong links with the estimation methodology developed in section 6.
• Under-coverage is where a record does not exist, or is not present at its correct location (but it might be present elsewhere, where it becomes over-coverage).
• As well as people in the wrong place, over-coverage includes records which are included erroneously (erroneous meaning that the person should not be in the population at all). In the case of administrative sources, over-coverage is usually a result of people who have emigrated not having been removed, people who have died not having been removed, duplicates, or people who fall outside of the definition of a usual resident (ONS, 2012a).
For the SPD or the PCS, a person is defined as being in the correct location if they are included in
the OA where they are usually resident. The OAs are potentially the lowest level of output from an
administrative data-based statistical system. It is assumed that every person has only a single
correct location which is defined as their usual residence. The definition of usual residence must
ensure that everyone has only one usual residence.
The SPD could:
• include people in the correct location – these are defined as correct inclusions;
• include people but in the wrong location – these are defined as incorrect inclusions;
• exclude people that should be there – these are defined as exclusions;
• include people more than once – these are made up of a mixture of a correct inclusion and/or possible multiple incorrect inclusions (duplicates); or
• include people who are out of the scope of the usually resident population – these are defined as erroneous inclusions.
Therefore, exclusions are under-coverage, but incorrect inclusions and erroneous inclusions are all
over-coverage at the local level. At the England (or alternatively Wales) level people in the wrong
place are resolved, so over-coverage is a result only of erroneous inclusions and duplicates. Figure
2 shows these outcomes.
Figure 2 - Visual representation of correct, incorrect and erroneous inclusions and exclusions
© Crown copyright 2013
Contains Ordnance Survey data © Crown copyright and database right 2013
Some examples are provided to make these definitions clear.
a) A person is in the SPD at their previous home address, and also in the care home where they
moved to six months ago. There is one incorrect inclusion (previous home address) and one
correct inclusion (care home). This is one example of duplication (over-coverage).
b) A person is in the SPD at their parents’ address, but missed from the halls of residence where
they attend university. There is one incorrect inclusion (parents’ address) and one exclusion (halls
of residence). Whilst nationally these records net out, locally this is an example of over-coverage
in one location and under-coverage in another.
c) A person is in the SPD at their parents’ address, and also in the flat they rented with a partner for
six months while they worked locally, but they have since moved abroad. There are two
erroneous inclusions, which are both duplicates (over-coverage).
A very strict rule-based SPD may reduce or remove some of the over-coverage problems, but may
increase under-coverage – for example, a rule that, where the sources differ, only the
most recently updated address is used to define location.
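The coverage categories above can be made concrete with a small sketch. The record layout and the figures below are invented for illustration only and are not part of the proposed system; the helper simply compares hypothetical SPD records against a notional ‘true’ population of usual residents by person identifier and output area (OA) code.

```python
from collections import Counter

# Hypothetical "truth": person id -> OA of usual residence.
truth = {"p1": "OA1", "p2": "OA2", "p3": "OA3"}

# Hypothetical SPD: (person id, OA) records. p1 is duplicated at an old
# address, p3 is missed entirely, and p9 is not a usual resident at all.
spd = [("p1", "OA1"), ("p1", "OA9"), ("p2", "OA2"), ("p9", "OA5")]

def classify(spd, truth):
    """Tally correct, incorrect and erroneous inclusions, and exclusions."""
    tallies = Counter()
    seen = set()
    for pid, oa in spd:
        if pid not in truth:
            tallies["erroneous inclusion"] += 1   # out of scope entirely
        elif truth[pid] == oa and pid not in seen:
            tallies["correct inclusion"] += 1     # right person, right OA
        else:
            tallies["incorrect inclusion"] += 1   # wrong OA, or a duplicate
        seen.add(pid)
    # Exclusions: usual residents with no SPD record anywhere.
    tallies["exclusion"] += sum(1 for pid in truth if pid not in seen)
    return tallies

print(classify(spd, truth))
```

Note how the duplicate for p1 is split into one correct and one incorrect inclusion, matching the definition above, and how the erroneous record for p9 could never be found by a survey of the true population.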
3.3 Assumptions
Some of the key assumptions made in this paper are listed below.
a) Administrative sources are combined at individual and address level to produce an SPD
That is, various sources are matched together, and some rules are used to determine whether or
not matched/unmatched households/persons/addresses are included in the SPD. The properties of
this dataset are still being developed whilst work continues on evaluating the accuracy of the
matching and the inclusion rules. However, the rules in particular offer some control over the likely
extent of over-coverage or under-coverage in relation to the usually resident population that is the
target for the estimates. The impact of matching error and the rules on the measurement of
coverage error will be very important for the design of the PCS and the estimation processes. A
single source SPD has not necessarily been ruled out should the matching errors be prohibitive.
b) The SPD is compiled and linked longitudinally every year
This is important for estimation, as it may provide some additional auxiliary data (for example
additional matching information) from previous years’ SPDs. This will help with information about
population movement and potentially change in the administrative data. Equally this longitudinal
element may be used in the PCS design as proxy information about the quality of the SPD.
c) The estimates will require a reference date
The population estimates will relate to the 30th June as per the ONS mid-year population estimates
(MYEs) 2. This implies that the PCS requires a reference date to ensure that, regardless of the
collection base, usual residence is defined at a single point in time. In addition, as the SPD will
potentially be a construct from different sources with different reference dates, the survey may
need to ask about alternative and historical addresses to facilitate matching. The reference date
itself will require further consideration to ensure that it is suitable for field operations.
d) The current usual residence definition, as used in the mid-year population estimates, will
be the key population base, used to define where an individual should be included.
e) The PCS cannot be a dependent survey
At this stage, it is assumed that a dependent survey (where the sample is drawn directly from
either a single administrative data source or a combined set) will not be possible. There are
complex legal and ethical issues associated with such an approach, and alternative options are
discussed in Beyond 2011: Exploring the Challenges of Using Administrative Data (M2).
Essentially, the administrative sources can currently be used only for statistical analysis, not
operations. So, for example, drawing a sample from the NHS Patient Register (PR) and then
attempting to find those patients is not permitted under the terms of data provision.
acceptability point of view it may not be acceptable to turn up on the doorstep asking whether a
person is living at that address and where they might be otherwise registered as a patient.
However, such an approach would enable a PCS to determine whether a sample of records are
accurate or not by following up in the field, thereby measuring (for instance) over-coverage
efficiently. If such a survey was possible, the methods proposed in this paper would have been
very different. The 2011 Census used an independent survey because the problem was one of
under-coverage. If the main issue in the 2011 Census had been over-coverage then a dependent
survey might have been a better solution.

2 ONS produces mid-year population estimates using a cohort-component method to estimate the population at LA level by age and sex between censuses.
f) The PCS will define the correct geographical location of individuals at their usual
residence
This is essential, in terms of measuring whether populations are in the correct location. The survey
will use the rules around the definition of usual residence to establish the correct location.
Correctly locating people is expected to be one of the challenges of using the SPD, which is likely
to include administrative records in the incorrect location for certain population groups due to lags
in updating. The survey must establish the correct usual residence, and hence location, of an individual. This will be a key
objective of the PCS design and field implementation.
g) The PCS sample size is likely to be around 350,000 households
The 2011 Census Coverage Survey (CCS) was a sample of 350,000 households, designed
primarily to measure under-coverage of the order of 6 %, but also to measure over-coverage of the
order of 1 %. It is expected that a survey of this size would be affordable and, although there is no
constraint on the size, a larger survey must provide clear gains in quality. For the sample size
suggested, it is still likely that direct 3 LA level estimates will not be possible and we will need to
develop a small area method for doing so (see section 6).
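For a rough sense of the precision a sample of this order could buy, the back-of-envelope sketch below treats a coverage rate as a simple binomial proportion. This is an illustration only, not a design computation: it ignores clustering, stratification, weighting and household structure, all of which would inflate the variance in a real design.

```python
import math

def approx_se(p, n):
    """Binomial standard error of an estimated proportion p from n units."""
    return math.sqrt(p * (1 - p) / n)

# Illustrative: under-coverage of ~6% measured on n responding households.
# A clustered design would inflate these figures by a design effect (> 1).
for n in (10_000, 100_000, 350_000):
    se = approx_se(0.06, n)
    print(f"n={n:>7}: se ~ {100 * se:.3f} percentage points")
```

The point of the illustration is simply that sampling error falls with the square root of the sample size, so quadrupling the sample only halves the standard error; this is why a larger survey must demonstrate clear gains in quality to justify its cost.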
h) The PCS is a compulsory survey
In this paper it is assumed initially that the proposed PCS is a compulsory survey, achieving a
response rate of around 90 %, which is approximately that achieved in the 2001 and 2011 CCS.
The simulation study described in section 7 explores the impact on the accuracy of the estimates if
the survey were to be voluntary and therefore achieve much lower response rates. This will help to
make the case for legislation to support a compulsory PCS.
i) The PCS will be a mixed mode survey
It is assumed that the PCS will offer a self completion internet response mode initially, followed by
telephone (if possible) and interviewer-assisted doorstep e-completion. This optimises cost but not
necessarily quality, as mixing modes can introduce mode effects; further research will explore this.
Internet completion does potentially reduce processing errors due to online coding, but any form of
self completion is at risk of respondent error particularly when it comes to applying complex
concepts such as definitions of usual residence (Tourangeau et al, 2000).
j) There is not yet a definitive address register with near perfect coverage to use as a frame
for a PCS
There is a national address register product, AddressBase 4, which has some similarities with the
address register created for the 2011 Census. However, this is a new product and, whilst efforts are
being made to evaluate it and improve its quality, there is as yet no evidence that it is of sufficient
quality to be used as a frame for a PCS to draw a sample of addresses. The risk is that it does not
identify all residential addresses accurately, and therefore some addresses would have no chance
of being in the sample.
Therefore we will explore area based approaches until we are confident that we have an address
register of sufficient quality for an un-clustered sample design. However, the intention is to use
AddressBase in the survey fieldwork as a starting point for listing all of the addresses in the
sampled areas, although there will need to be a process for interviewers to check and correct it to
identify any addresses not included in the initial list.
3 Direct in this context means an estimate that uses just the sample from that local authority. For most LAs the sample will not be large enough to support direct estimates, as was the case in the 2011 Census. Small area techniques which borrow strength from the sample across many local authorities will be required.

4 See http://www.ordnancesurvey.co.uk/oswebsite/products/addressbase-premium/index.html
k) Separate attributes survey
We assume that a separate survey will be used to obtain estimates of attributes (in the Beyond
2011 context this includes households and their characteristics, and person characteristics beyond
age and sex) beyond the limited ones included in the PCS. However, by its nature such a survey
may provide information on coverage that could be integrated to improve the estimates; this
would be an extension of the basic model. How the two surveys would be integrated to provide a
set of coherent population statistics is a subject for future work.
3.4 Research principles
We will design the PCS using standard statistical design principles, bearing in mind that we are likely to draw heavily on models in the estimation to gain efficiency (using some form of ratio/regression estimator). That is, we should use good design-based techniques for evaluating the design. Furthermore, the survey will need to be designed with the flexibility to be adapted and improved as more is learned about the characteristics and patterns of coverage error in the SPD. The initial design will have to be robust – that is, able to measure with a reasonable degree of accuracy regardless of the underlying unobservable patterns. The design will be optimised as more data are collected.
3.5 Survey requirements and scope
Survey response rate and accuracy
The PCS will have to have a high response rate, and produce good quality data, as it is to be used
to estimate accurately under- and over-coverage in the SPD. It must capture correctly those people
who are:
• a correct inclusion on the SPD;
• an incorrect inclusion on the SPD;
• excluded from the SPD; or
• included more than once in the SPD.
Note that erroneous inclusions are not included in the above list, since the survey will never find
them as they are not really there.
The fieldwork and method of collection must be built around maximising response and accuracy in
the survey. Whilst an interview based collection is arguably that which provides highest coverage
and accuracy, an initial self completion option (say, via internet) is attractive due to its much lower
cost. The likely lower response rate and lower within-household coverage 5 may be outweighed by
cost savings which would allow a larger sample size for the same fixed resource. Options, such as
the extent to which an initial self completion option is offered, will be explored. Using multiple
modes of data collection does carry the risk of over-count in the survey, because people may
complete the survey more than once (or incorrectly include themselves in the case of selfcompletion), which would be very difficult to resolve or remove at the estimation phase. If a choice
is to be made between higher response rates or low over-coverage, from an estimation perspective
the low over-coverage option is more likely to lead to estimates that are unbiased. Therefore it is
preferable to have controls in place in the data collection phase to reduce the risks of survey overcoverage.
5 Within-household coverage is the proportion of people who respond within a household, given that the household does respond. In census terms these are people missing from returned questionnaires.
One important requirement is high matching accuracy between the SPD and the PCS, assuming
that anonymised individual level matching is required for use in the estimation process, such as
that required for a Dual System Estimator (DSE) (see section 6). However, alternative approaches such as a weighting class adjustment (see section 6) only require matching of households rather than individuals, which may reduce this requirement. Alternatively, a process to measure matching error and make an adjustment on that basis may be possible, although not ideal. Regardless, the data capture process has to be accurate (to reduce matching effort and error due to poorly captured data) and the collection process should be designed with this in mind.
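The role of individual-level matching in a DSE can be illustrated with a minimal sketch (our own illustration with invented counts, not an ONS implementation): two independent counts of the same population are combined through the matched count, and false negative matching errors, by shrinking the matched count, inflate the estimate.

```python
# Lincoln-Petersen form of the dual-system estimator (DSE).
# All counts below are invented for illustration.

def dual_system_estimate(spd_count: int, pcs_count: int, matched: int) -> float:
    """N-hat = (count on list 1 x count on list 2) / matched count."""
    if matched == 0:
        raise ValueError("DSE is undefined with no matched records")
    return spd_count * pcs_count / matched

# 900 on the SPD, 850 found by the PCS, 800 matched between the two.
n_hat = dual_system_estimate(900, 850, 800)  # 956.25

# False negative matching errors shrink the matched count and so inflate
# the estimate -- hence the emphasis on matching accuracy above.
n_hat_with_missed_matches = dual_system_estimate(900, 850, 760)
assert n_hat_with_missed_matches > n_hat
```

This is why accurate data capture matters: every missed match moves the estimate away from the truth.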
To support the high response requirement, the questionnaire has to be short and simple. Data
should only be collected on variables required for matching or analysis. The 2011 Census CCS
questions provide a good starting point, but the addition of questions to help match in multiple
locations, for example, questions on address history, will be important as we expect that the SPD
will include individuals in incorrect locations. Nevertheless the questionnaire should be as short as
possible to minimise burden and help with general accuracy.
Scope of PCS measurement
There are inevitably going to be records in the SPD that are erroneously included, such as the deceased and persons who are not usual residents (for example emigrants who are no longer usually resident in England and Wales, or short-term migrants who are present on administrative sources). For some of these records an independent survey will never gather any information to measure them (for instance the deceased or emigrants). The alternatives are a dependent sample of unmatched records, which is not possible (see assumption b), or using another source to estimate erroneous inclusions. This could be through exploring sources which provide 'signs of life' type indicators, analysis of previously identified unmatched records over time to estimate erroneous inclusions (making some assumptions), or direct matching to a source that identifies deaths or emigrants. These options are being actively explored, for example, through analysis of longitudinally linked administrative data and use of indicators on such sources which show when an individual last interacted with the system.
In addition, the filter rules applied to compile the SPD may be able to make these sorts of records
negligible. Separate work is exploring how erroneous inclusions will be tackled, as the estimation
of any residual erroneous element is challenging, and will be important in the production of
population estimates to meet the quality standards required.
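One of the options above, a 'signs of life' screen, can be sketched as follows; the record structure, field names, and three-year cut-off are illustrative assumptions, not SPD rules.

```python
# Flag records whose most recent interaction with the administrative system
# is older than a cut-off window -- candidates for erroneous inclusion.
from datetime import date

def flag_potential_erroneous(records, reference_date, max_years_inactive=3):
    """Return records with no 'sign of life' within the cut-off window."""
    cutoff = date(reference_date.year - max_years_inactive,
                  reference_date.month, reference_date.day)
    return [r for r in records if r["last_interaction"] < cutoff]

records = [
    {"id": 1, "last_interaction": date(2012, 6, 1)},   # recent activity
    {"id": 2, "last_interaction": date(2007, 3, 15)},  # long inactive
]
flagged = flag_potential_erroneous(records, reference_date=date(2013, 3, 27))
assert [r["id"] for r in flagged] == [2]
```

In practice such a rule would only feed further analysis of the flagged records, not delete them outright.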
4 Survey context
Motivation
The aim of the PCS is to facilitate the estimation of the population at LA level by age and sex.
This section sets out the design decisions that will be explored in this paper. As with any survey
this involves decisions regarding the stratification of the sample and the extent to which the sample
is geographically clustered. Clustering occurs as we often select our sample by first selecting areas
referred to as primary sampling units (PSUs) and then selecting lower level units within the areas
to create the final sample (see chapters 9-11 of Cochran, 1977). The design decisions can then be
evaluated using criteria based upon cost, practicality, and statistical efficiency, along with how well
the design will support the estimation methods. This paper does not carry out a full evaluation of
the decisions, but proposes a starting point for the survey design.
Design constraints
The design objectives have to be met within the survey context. This section outlines the main
features and constraints which need to be accounted for in the design, beyond the assumptions
outlined in section 3.
a) Design information
Ideally when designing a survey to measure coverage one should have information about likely coverage patterns and in particular about the variability of those patterns within and between strata and PSUs. If such data exist, the allocation of the sample across strata can be optimised to give good overall precision as well as good precision for smaller areas. If not, there is no prior information to say where the sample needs to be larger and where it can be smaller. Some proxies can be used, but they would rely on assuming (possibly incorrectly) correlation with the true underlying pattern.
For the Beyond 2011 context, currently very little quantitative data about coverage patterns in
administrative data sources exists, aside from aggregate comparisons and analysis. There is some
empirical and anecdotal evidence at national and LA level, but very little below this. Work is being
developed to learn about the quality of linked administrative records, but at present, very little is
known. While we are developing this knowledge, the sample design will have to be robust so that it
can measure coverage regardless of the underlying patterns. It may not provide an optimal design
in the first year of operation but the information from the first survey can be used to refine the
subsequent survey and so on. Similarly, the SPD development described in section 3, and outlined
in ONS (2013d) may result in some information that can be used. For instance, the 2011 Census is
currently being used to evaluate the under and over-coverage of an SPD constructed using 2011
administrative sources. The evaluation data could then be used as proxy information for designing
a PCS for an SPD in 2021 (assuming the SPD is constructed in broadly the same way using the
same rules). Alternatively, information from the SPD matching process (for example match rates)
could be used as a proxy for likely under or over coverage. Exploring these possibilities is beyond
the scope of this paper but will be an important topic for future work.
b) Collection base
The output base requirement is for usual residence. However, this does not restrict the survey to use this as its collection base. One issue with using a usual residence collection base is that it gives opportunities for people to opt out by claiming they are not a usual resident, increasing non-response bias. A 'persons present' collection base, where information is collected for everyone who was present on the reference date regardless of their usual residence, does not. However, this would require much more processing effort (to move people back to their usual residence) and would likely result in just as much bias in the estimates due to errors in this additional processing.
A ‘persons present’ base will also miss those abroad for the field period, which would then have to
be estimated from another source.
The best option is a combination of the two, which is similar to that used in the 2011 Census. A
listing of usual residents as well as those present is collected, with an indication of usual residence
(and whether present) for all persons. Usual residents then complete the key questions and those
visiting complete a reduced set.
This makes estimation straightforward as no adjustment is needed for changes between the
reference date for outputs and the reference date of the PCS (assuming a single PCS near the
reference date).
c) Coverage of the population and sampling units
In terms of its sample frame, the survey must cover all households containing usual residents
within England and Wales. As noted in the assumptions, there will not be an address source in
place in the short term that will have sufficient coverage for drawing a sample for the purposes of
estimating the population. The proposed alternative is to use a sample frame of small geographic
areas. Postcodes are the obvious choice. This is because postcodes have complete coverage of
the England and Wales household population, even if that is not reflected in the postal address file
(the Royal Mail’s administrative system). Postcode boundaries can be identified on maps and the
householder can either self complete or assist the interviewer as they will most likely know their
own postcode. Using postcodes as the basic sampling unit makes sense in the absence of a
complete list of households.
However, it is likely that AddressBase will improve sufficiently over time to the point where any under-coverage is negligible, which would enable it to be used as a sampling frame for the survey. Until that can be proven, it is suggested that the starting point for the design is postcodes, with some consideration that in the future this could change to addresses. Work to explore the point at which the level of under-coverage of addresses would facilitate switching, and the impact on the design, should be undertaken. The impact on sample size and clustering will be explored. Given the objectives of the PCS, another consideration is that an under-coverage rate on AddressBase sufficiently low for general purpose use may not be low enough for a PCS.
Note that the choice of postcodes does not mean that AddressBase could not be used as a starting
point for the households to be enumerated in the survey – the main reason for the independent
listing in the CCS was to ensure independence from the census. As the administrative based SPD
does not use an ‘address list’ then it is acceptable to use AddressBase in the survey with a robust
address check exercise to check its coverage.
d) Special populations
There are several special population groups that will require consideration from a PCS perspective.
A relatively small but important component of the usually resident population resides in communal
establishments (CEs) of varying sizes. These are defined as ‘managed’ accommodation and
include hotels, prisons, military bases and nursing homes.
CEs, especially large ones, are a particular issue, as this type of population will behave differently
from the household population in terms of coverage on an SPD since they may be covered by
specific administrative sources themselves (for example, prisons). In addition, there is a specific
risk of over-coverage (both duplication and individuals included at incorrect locations) between the
household population and some CE types. Examples include persons in transition between living
at home and living at a care home, or persons in prison for short terms.
Large CEs (those with over 100 bed spaces) will be excluded from the PCS as there are relatively
few of these, and sample survey may not be the best approach for estimating the population in
such establishments. The intention is to explore the use of administrative data where it is available
to potentially provide population estimates for the large CEs, although the quality of some of these
sources can be variable and will require further work to evaluate. If these prove not to be suitable,
it may be that the best approach is to carry out a complete enumeration of these large
establishments.
There are certain types of CE for which the behaviour of residents is very different, for example
armed forces bases, student halls of residence and prisons. For such types, it is suggested that
they are not included in the PCS regardless of size, and either the direct use of administrative data
or a complete enumeration is undertaken dependent on the data availability and evaluation as
noted above. Robust guidance for the PCS fieldstaff will be required to ensure they are excluded
as some of these types might look like ordinary households (e.g. university maintained
accommodation, armed forces accommodation). For the identified types, especially those requiring
a complete enumeration (and to remove them from the SPD), an address list is likely to be
required. Those that require a complete enumeration would probably not require a full enumeration
every year - once every five years, or a rolling basis should suffice if the population in those types
does not change much (identified through the list maintenance and liaison with umbrella
organisations). Further work is being done to understand what ‘special populations’ mean for an
SPD, and work will be required to determine these types and the practical aspects.
Beyond the very large and special types, the remaining CEs (for example, small hotels and guesthouses) should be included and interviewed in the PCS with the household population, as their
behaviour is likely to be more similar. There is no suitable frame for CEs at present and so a
separate survey of them is not yet possible. Including them within the normal PCS sample will
ensure that nothing within the sampled areas is missed, and therefore they are treated identically
(although the field procedures may differ slightly). This data can then be used separately at the
estimation stage as necessary. As the focus is on the household population, the presence of the
CE population will not be considered at this stage in the design work, and the PCS designed to
obtain the best estimates of the household population.
5 Sample design options
5.1 Level of clustering – choice of Primary Sampling Unit
As the PCS will require a certain amount of intensive fieldwork to ensure response rates are high,
some further clustering is desirable for the interviewers. Using the postcode as the final sampling
unit already creates small clusters of households but a single postcode is too small to be the full
workload of an interviewer. Therefore some clustering of postcodes is desirable in terms of efficient
fieldwork management. In the 2011 CCS, OAs were used as Primary Sampling Units (PSUs) as
there was some robust data at that level and it provided some clustering for efficient fieldwork.
Furthermore, this assumes that there is no sub-sampling within postcodes (meaning that data is collected for all households in a postcode). However, it could be possible to sub-sample within postcodes if AddressBase fits with population estimation needs as noted previously, and the final sample unit could then be addresses, with AddressBase as the sampling frame. It may be possible to sub-sample addresses with an incomplete address list. This would involve sub-sampling from the existing list and asking interviewers to check the address list, with a process for sub-sampling from the checked list. This possibility will be explored in future work.
For the PCS, the decision about whether the design should include clustering above the postcode
level is dependent on there being some intelligence or data about the likely coverage patterns in
the SPD. For instance, if it is likely that coverage patterns within a small geographical unit are
homogenous, then a highly geographically clustered sample design will be statistically inefficient,
albeit more cost efficient. A design that has some limited clustering of postcodes would offer a better balance between statistical efficiency and the increased cost of interviewers' travel time.
However, if coverage is more to do with individual characteristics so that there is heterogeneity
within or across postcodes then some clustering may be advantageous. This choice would depend
on the level of homogeneity by geography and the survey cost. Some additional clustering beyond
postcode is sensible to help control costs and would be robust from a statistical perspective given
the current lack of information about coverage patterns.
The most efficient design from a field perspective would be one that provides a cluster that is sized
such that an interviewer does not have to travel outside of it, or at least minimizes travel. Therefore
the cluster must contain enough work for a single interviewer. If we assume that the survey is compulsory, is primarily internet self completion followed up by an interviewer visit, and that we obtain a roughly 50 % initial response via internet, then for every 100 households the interviewer will have to follow up 50. Thus if an interviewer can, over the survey period, make a sensible number of visits to obtain responses from 100 households, then an efficient workload size might be 200 households. A cluster of this size would then be most efficient from a travel perspective. If the
overall sample is relatively large and thus travel less of an issue, then two clusters of size 100
households each may be an alternative, or even four clusters of size 50. Development of a cost
model will help explore the cost versus statistical efficiency of such choices.
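The workload arithmetic above can be written down directly; the 50 % internet response rate and the follow-up capacity of 100 households are the figures assumed in the text.

```python
# Households per interviewer cluster such that the expected follow-up
# workload matches what one interviewer can cover in the survey period.

def cluster_size(followup_capacity: int, internet_response_rate: float) -> int:
    """The (1 - response rate) share of the cluster needs a visit."""
    return int(followup_capacity / (1.0 - internet_response_rate))

assert cluster_size(100, 0.5) == 200  # the text's single-cluster workload
assert cluster_size(50, 0.5) == 100   # two smaller clusters per interviewer
```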
Given the field considerations and the lack of information at present, the initial design will use Lower layer Super Output Areas (LSOAs) as the PSU. These are chosen as there is non-census data available
(such as that obtained from aggregate level administrative sources) at this level for stratification,
and they are also relatively consistent in their population sizes (around 750 households).
Geographically, most are small enough to enable an interviewer to easily travel within. Selecting a
quarter of the postcodes from each LSOA would result in clusters of about 190 households which
may provide sensible workload sizes. As postcodes vary so much in size, a selection process that
chooses a quarter of the postcodes rather than a fixed number should help to even out cluster
sizes. It also means from an estimation point of view that the sample weights are constant within a
PSU (as the sample fraction will always be 0.25).
5.2 Stratification
As LA population estimates are one of the key outcomes, it would be sensible to consider them as
potential strata. There is also motivation, from a user acceptability perspective, for ensuring there is sample in every local authority; by including LAs explicitly as strata, the sample can be constrained to ensure this happens.
The most important stratification beyond LAs is an index similar to the hard-to-count (HtC) index
used in the 2001 and 2011 CCS designs. The motivation behind the index is to reduce the
variance of the estimates by ensuring that the sample is spread according to prior knowledge of
the likely level and variability of the outcome. In the 2011 Census context this meant utilizing prior
knowledge of the likely level and variability of under-coverage6 within LAs. The outcome was the
HtC index which used proxy data associated with factors correlated with census non-response
such as private rented accommodation and young persons. In the context of using an SPD, it is
realistic to expect that the coverage characteristics of the SPD will vary within an LA, although the
characteristics of this variation are less well known as such a dataset has not been used in this
way. In addition, the variability will be potentially due to two competing factors – under-coverage
and over-coverage. In the absence of any reliable information below LA level for either of these,
some assumptions will be required based on our empirical studies – for instance it could be
assumed that coverage is correlated with levels of internal and international migration, or use the
information from the SPD development as described in section 4.
To create the stratification, any identified correlated variables can be used as predictions of the
likely coverage patterns across the PSUs and combined in some way to provide a ‘score’. For
simplicity, the structure could be similar to that used in the HtC index, which grouped LSOAs into
five categories containing 40 %, 40 %, 10 %, 8 % and 2 % of the LSOAs. However, to be robust, a 40 %, 40 %, 20 % grouping will be proposed initially. Further work will be undertaken to develop this
stratification.
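A minimal sketch of how such a stratification score might be constructed and cut into the proposed 40 %, 40 %, 20 % groups is given below; the proxy variables, equal weighting, and quantile cut-points are assumptions for illustration, not the ONS method.

```python
# Combine proxy variables into a per-LSOA score, then cut at the 40th and
# 80th percentiles to form three coverage-difficulty groups.
import random

random.seed(42)
lsoas = [{"migration_rate": random.random(),
          "private_rented": random.random()} for _ in range(1000)]

for area in lsoas:  # equal-weight composite score (an assumption)
    area["score"] = 0.5 * area["migration_rate"] + 0.5 * area["private_rented"]

ordered = sorted(a["score"] for a in lsoas)
p40, p80 = ordered[int(0.4 * len(ordered))], ordered[int(0.8 * len(ordered))]

for area in lsoas:
    area["group"] = 1 if area["score"] < p40 else (2 if area["score"] < p80 else 3)

counts = {g: sum(a["group"] == g for a in lsoas) for g in (1, 2, 3)}
assert counts == {1: 400, 2: 400, 3: 200}
```

Cutting at fixed quantiles guarantees the 40/40/20 split by construction, mirroring how the HtC index grouped LSOAs by score.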
5.3 Sample size
The sample size is assumed to be around 350,000 households, which is about 17,000 postcodes.
From a practical perspective, the previous censuses demonstrated that a survey of this size could
be undertaken, and provided estimates of a suitable quality (see Abbott, 2009 and Abbott, 1998).
However, the likely quality in this context will depend on the underlying quality of the SPD and the
efficiency of our sample design and estimation methods. While the knowledge about the SPDs is
being developed, a robust design is sensible which inevitably leads to a larger sample size for a
fixed precision. A smaller sample size for fixed precision may be possible assuming the SPD
quality remains relatively constant over time (this is regardless of the time between surveys,
although the narrower the gap the more gains can be made). Further work will explore whether, if
the survey is repeated at yearly intervals, the sample size can be smaller by borrowing strength
from previous samples in the estimation process.
However, further research will not be restricted to the assumed sample size as there may be
important gains from larger samples, such as improved small area estimates. Such gains will be
balanced against the increased costs and risks of a larger survey. However, there might be value
in boosting the sample periodically (for example every five or ten years) to provide better quality
small area population estimates which allow improved benchmarking for the attribute survey.
6 Over-coverage was not a significant factor for the 2011 Census and thus the coverage survey ignored it within its sample design.
5.4 Sample allocation and constraints
Before considering allocation of the overall sample to the stratified PSUs, we have to define how
many PSUs will be selected. This is a function of the overall sample size (17,000 postcodes for a
sample size of 350,000 households) and the second stage sample size (this is the number of
postcodes selected in the PSUs). For this second stage, section 5.1 suggested a selection of a
quarter of postcodes. This approach then means that 1,750 LSOAs can be selected at the first
stage (so that 200 households are selected in each) to obtain 350,000 households overall. This is
a sampling fraction of 5.1 % of PSUs.
In social surveys it is common to start with an approximately proportional allocation based on the
number of units in each stratum. This will spread the sample across all strata, putting more sample
in the strata that cover more of the population, and is optimal when the within stratum variances
are similar. It is also a good default in a multi-purpose survey as, while it is unlikely to be optimal
for any specific variable, it will never be as poor as no stratification. In the absence of design
information it is proposed to use a robust allocation like this. However, there may be other options
that become available as more information is gathered about the SPD or in subsequent years
beyond the first year of operation, enabling a more optimal allocation of the sample across strata.
Beyond the overall sample size, there will have to be some form of minimum constraint within each
stratum to prevent under allocation of the sample. Again, this will be chosen with a robust design in
mind. Initially, a minimum sample size of one PSU within each stratum is a good starting point.
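Proportional allocation with a minimum constraint can be sketched as below; the stratum sizes and the 100-PSU total are invented for illustration (the text's total is 1,750), and the simple rounding used here would need a final adjustment to hit the target exactly.

```python
# Allocate PSUs to strata proportionally to stratum size, enforcing a
# minimum of one PSU per stratum. Rounding means the allocation may need
# a final trim to sum exactly to the target total.

def allocate(stratum_sizes, total_psus, minimum=1):
    total = sum(stratum_sizes.values())
    return {s: max(minimum, round(total_psus * n / total))
            for s, n in stratum_sizes.items()}

strata = {"LA1/g1": 400, "LA1/g2": 400, "LA1/g3": 200, "LA2/g1": 5}
alloc = allocate(strata, total_psus=100)
assert alloc == {"LA1/g1": 40, "LA1/g2": 40, "LA1/g3": 20, "LA2/g1": 1}
```

The tiny fourth stratum receives a PSU only because of the minimum constraint, which is exactly its purpose.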
5.5 More complex design features
There is a question as to whether the survey could have a longitudinal element, perhaps by having
a proportion of the sample as a panel. This would not help with the estimation of levels of
coverage, but would allow a form of Longitudinal Study (LS) – ensuring a stable sample was
maintained over time. However, this could also be achieved within an attribute survey. Therefore, this should not be included, to ensure the PCS design is not over-complicated, given that it adds no value to the key estimates.
Use of overlapping samples can give gains in precision for estimates of change, such as migration.
The idea is to control the overlap between successive samples. Whilst having a complete
separation might be attractive to control burden, there are some gains in precision to be had by
having some overlap. However, the gain depends on the level of correlation between the survey
outcomes between samples. If the correlation is high, say 0.9, then having as much as a 30 %
overlap would provide good gains in precision of estimates of change. However, this would have to
be considered further alongside the relative priority of these measures of change.
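Under simple assumptions (two equal-sized samples, unit-level correlation ρ between waves, overlap fraction p), the variance of a difference of sample means is (2σ²/n)(1 − pρ); this textbook-style approximation is our own illustration of the gain the text describes, not a formula from the paper.

```python
# Relative variance of an estimate of change, compared with fully
# independent samples: factor = 1 - overlap_fraction * correlation.

def change_variance_factor(overlap: float, rho: float) -> float:
    return 1.0 - overlap * rho

# The text's figures: correlation 0.9 with a 30 % sample overlap.
factor = change_variance_factor(0.3, 0.9)
assert abs(factor - 0.73) < 1e-9  # about a 27 % variance reduction
```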
5.6 Summary of initial sample design
In summary, the proposed initial PCS design that will be adopted for the simulation studies
described in section 7 is a stratified two stage sample design, stratifying LSOAs by LA and a three
level index based on a robust 40 %, 40 %, 20 % distribution. A simple random sample of LSOAs will be drawn from each stratum, with the overall PSU sample size being 1,750 (a sampling fraction of 5.1 %), and this will be proportionally allocated to the strata according to the number of LSOAs, with a minimum sample size of one.
A simple random sample of postcodes in each selected LSOA will then be drawn. A 25 % sampling
fraction will be used to define the number of postcodes drawn in each PSU. All households in each
selected postcode will be interviewed.
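The summarised design can be sketched end-to-end on a toy frame; the frame below (three strata, 40 LSOAs each, 12 postcodes per LSOA) is invented purely to make the two stages concrete.

```python
# Stage 1: simple random sample of LSOAs (PSUs) within each stratum.
# Stage 2: a 25 % simple random sample of postcodes within each selected
# LSOA; every household in a selected postcode is then interviewed.
import random

random.seed(1)
frame = {f"stratum{s}": [[f"s{s}-lsoa{l}-pc{p}" for p in range(12)]
                         for l in range(40)] for s in range(3)}

def two_stage_sample(frame, psus_per_stratum=2, postcode_fraction=0.25):
    selected = []
    for stratum, lsoas in frame.items():
        for lsoa in random.sample(lsoas, psus_per_stratum):
            k = max(1, round(postcode_fraction * len(lsoa)))
            selected.extend(random.sample(lsoa, k))
    return selected

postcodes = two_stage_sample(frame)
# 3 strata x 2 PSUs x 3 postcodes (25 % of 12) = 18 sampled postcodes
assert len(postcodes) == 18
```

Because the within-PSU sampling fraction is constant (0.25), the sample weights are constant within each PSU, as noted in section 5.1.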
6 Estimation Framework
The estimation problem can be framed in a number of different ways. This section outlines three
alternative frameworks to estimating the population using the PCS and SPD. For consistent
notation, define:
• Y_ig.h is the true population of individuals for small area i, in estimation stratum g for over-coverage and estimation stratum h for under-coverage 7.
• X_ig.h is the number of individuals on the SPD for the same small area on the same date. For now let us assume that X_ig.h can be cleaned of completely erroneous inclusions, but it will still include some individuals that belong to other small areas (out-movers who may or may not be duplicated in their correct area) while missing some genuine residents (in-movers and additional under-coverage of the administrative system). This is where the link between the SPD rules and the estimation can be seen to be important, as the rules will determine whether these initial assumptions can be met. However, it is recognised that the assumption of no erroneous inclusions is not realistic, and work is underway as described in section 3.5 to address this important challenge.
• Z_ig.h is the count from a survey of the small area i, which has no erroneous inclusions but does have under-coverage due to survey non-response.
• M_ig.h is the count of matched survey responders to the SPD in the sample areas. Note that false negative matching errors will deflate this whilst false positives will inflate it.
Note that X is available for all small areas i in the estimation strata cross-classification g.h, while Z
is only available for a sample.
6.1 Weighting approach
The first approach is based upon the motivation used to develop the estimator for the 2001 and
2011 Census. The idea is that there are two attempts to count the true population in a sample of
areas, and these two attempts can be combined (for example using DSE) to estimate the true
population in those sampled areas. The presence of the census across the whole population is
then utilized to ‘ratio up’ to obtain a population estimate for large geographical areas, using the
very strong correlation between the census count and the estimate of the true population in the
sample. The key issue to deal with is undercounting in both sources, hence the motivation to use a
DSE. However, the census does have a small amount of over-count and so a small adjustment is
made to account for that (Large et al, 2011). The census estimator (without over-count correction)
was essentially:
Ŷ_h = ( C_h / Σ_{i∈S_h} w_ih C_ih ) × Σ_{i∈S_h} w_ih (C_ih Z_ih / M_ih)    (1)

where w_ih are the sampling weights, M_ih are the numbers of matched true residents between the
census and CCS and C is the census count. The estimator in (1) can also be expressed as a
weighting approach where the idea is to weight up the obtained coverage survey counts to the
desired population level to provide estimates of the total. Thus for the Beyond 2011 context, if the
census is replaced with the SPD, a ratio estimator can still be used to provide gains in precision
over a standard Horvitz-Thompson 8 approach. Furthermore, as the PCS will have non-response
there are a number of options for estimating weights to correct the bias this would imply. This
7 Note that the g.h subscript denotes the cross classification of over-coverage stratum g and under-coverage stratum h.
8 The Horvitz-Thompson estimator is one which just uses the sample weights and no auxiliary information (Cochran, 1977).
includes simple approaches such as weighting classes and, assuming individual level matching is
possible, dual-system estimation or even triple-system estimation. The basic form of such an
estimator applies all the weights to the survey counts:
Ŷ_h = Σ_{i∈s_h} w_ih ŵ′_h ŵ′′_ih Z_ih    (2)

where the estimated survey non-response weights are the ŵ′′_ih and the ratio estimator weights are the ŵ′_h. In the context of (1), ŵ′′_ih would be defined by X_ih / M_ih and ŵ′_h by X_h / Σ_{i∈S_h} w_ih X_ih. The main
issue with (2) is that in order to obtain good survey non-response weights through an approach
such as DSE with the SPD as the auxiliary data, over-coverage in the SPD becomes a serious
problem. Both erroneous inclusions and incorrect inclusions inflate the DSE and cause large biases, as they appear in the numerator of \hat{w}''_{ih} but not in the denominator. As for the census
approach, a correction for this can be made (see Large et al 2011) which estimates an adjustment
factor based on observed survey individuals being found elsewhere in the auxiliary data. This
essentially adds another weighting term to (2) to make such a correction. It is this over-coverage in
the SPD that implies this approach might not be the best way to approach the problem.
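The inflating effect just described can be seen in a toy calculation. The numbers below are invented purely for illustration; the formula is the standard dual-system (capture-recapture) estimate, N approximately X × Z / M.

```python
# Toy illustration (invented numbers): how over-coverage in a list
# inflates a dual-system estimate (DSE), because extra records enter
# the list count X but can never be matched to the survey count M.

def dse(list_count: float, survey_count: float, matched: float) -> float:
    """Standard dual-system estimate: N is approximately X * Z / M."""
    return list_count * survey_count / matched

true_pop = 1000.0
survey_coverage = 0.90            # PCS counts 90% of true residents
list_true_records = 850.0         # records on the list that are genuine

survey_count = true_pop * survey_coverage        # Z = 900
matched = list_true_records * survey_coverage    # M = 765, assuming independence

clean = dse(list_true_records, survey_count, matched)           # recovers 1000.0
inflated = dse(list_true_records + 120, survey_count, matched)  # over-states the population
print(clean, inflated)
```

With a clean list the DSE recovers the true population exactly under independence; adding 120 unmatched incorrect records pushes the estimate above 1,140, which is exactly the bias an over-coverage correction must remove.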
The approach is motivated by the fact that the SPD, unlike the census, is not an attempt to count
the same population as the survey – it is based upon a series of administrative systems that tend
to be characterized by lags and over-coverage, with associated under-coverage at a local level,
rather than a high under-coverage at the national level. However, even at a low level of
aggregation it would be expected to be highly correlated with the unknown true population counts.
Therefore, it can be used to improve estimation with weights like \hat{w}'_h, but there will be a need to be careful with adjusting for survey non-response. An alternative for \hat{w}''_{ih} would be

\hat{w}''_{ih} = \frac{\sum_{a \in i} X_{aih}}{\sum_{a \in r_i} X_{aih}}
where ‘a’ represents the addresses in sampled small area i and ri are those addresses in i where a
survey response is obtained. This re-weights the individuals in the responding addresses in the
sampled areas based on the pattern of non-response at the address level as measured by the
SPD members attached to the sampled addresses. This is a form of a method of non-response
adjustment referred to as a weighting-class adjustment (Lohr, 1999 p266). There is no need to
match individuals between the survey and the SPD but it must be possible to match addresses
(this will be straightforward in the situation where a national address register is a basis for both the
SPD and the survey, or at least the starting point). The issue of erroneous inclusions on the SPD is
not an issue for this non-response adjustment method, as they appear in both the numerator and denominator of the ratio for \hat{w}''_{ih}; but this non-response adjustment will only make sense provided
there is a correlation, in terms of say age-sex, between those incorrectly included on the SPD and
those excluded by the SPD 9. The other advantage is that matching error is much more likely to be
lower for matching of addresses compared to individuals. However, the price of not matching at the
individual level implies that no adjustment can be made for within household non-response in the
survey. For example, suppose a young man is missed within a household that responds in the
survey. If they exist on the SPD they will appear in both the numerator and denominator of \hat{w}''_{ih}, and therefore a non-response adjustment will not be made.
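The address-level re-weighting described above can be sketched as follows; the address identifiers and SPD counts are hypothetical.

```python
# Sketch (hypothetical data) of the weighting-class adjustment: responding
# addresses are weighted up by the ratio of SPD persons attached to all
# addresses in the sampled area to SPD persons attached to responding
# addresses. No person-level matching between survey and SPD is required.

spd_persons_by_address = {"addr1": 2, "addr2": 4, "addr3": 1, "addr4": 3}
responding = {"addr1", "addr2", "addr4"}   # addresses with a survey response

total_spd = sum(spd_persons_by_address.values())
responding_spd = sum(n for a, n in spd_persons_by_address.items()
                     if a in responding)

w_nonresponse = total_spd / responding_spd   # 10 / 9, a modest inflation
print(w_nonresponse)
```

Erroneous records attached to responding and non-responding addresses appear in both the numerator and the denominator, which is why this ratio is relatively insensitive to them.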
9 This is a reasonable assumption to make as, for example, when students leave a local area they tend to be replaced by more students.
6.2 Measurement of quality approach
The simple form of (2) can also be written as

\hat{Y}_h = \sum_{i \in S_h} w_{ih} \, \hat{w}'_h \, \hat{w}''_{ih} \, X_{ih}

where \hat{w}''_{ih} is then defined by Z_{ih}/M_{ih}, a correction factor for those missing from the SPD as measured by the independent
survey. This motivates formulating the estimation as a measurement error problem, where the SPD
is thought of as an approximation of a population spine or register. There are some people not
included anywhere on the SPD but the much bigger challenge is that there are many in the wrong
location (incorrect inclusions). In the main, these errors are due to lags in the registration, deregistration and updating processes for the sources used to construct the SPD. Thus someone
who is not included at a point in time is almost always likely to appear at a future point in time on
one of the sources, and someone who is in the wrong location will be updated at some point in the
future. This motivates an approach that focuses on using a survey to measure these errors. Thus
the estimation problem becomes one of estimating inflation and deflation factors to be applied to
the SPD population totals.
To start with, consider that for a small area the SPD will have some records incorrectly included,
and some missed that are ‘elsewhere’. So, at the level of small area i, there are Eig.h records
incorrectly included and Oig.h missed. Therefore, by definition
Y_{ig.h} = X_{ig.h} - E_{ig.h} + O_{ig.h}    (3)

and as this is aggregated up to the stratum level, Y_g = X_g - E_g + O_g is obtained by summing across the small areas within the stratum g. A similar result is obtained by summing across the
small areas within stratum h. Let us now define parameters
\delta_g = \frac{X_g - E_g}{X_g}    (4)
as a deflation factor to correct the SPD for the incorrectly included in stratum g and
\gamma_h = \frac{X_h - E_h + O_h}{X_h - E_h}    (5)
as an inflation factor to correct the SPD for those missed in stratum h. Therefore, (3) can be
estimated as
\hat{Y}_{ig.h} = X_{ig.h} \times \delta_g \times \gamma_h,    (6)
assuming that the errors on the list are stable within the estimation strata g and h. Essentially we
are specifying the model
E[Y_{ig.h} | X_{ig.h}] = X_{ig.h} \times \delta_g \times \gamma_h    (7)
where δ and γ are stratum specific parameters relating to the coverage of the SPD within stratum g
and h. It is worth noting that while area i will belong to a unique cross-classified cell of the g by h
stratification, the model has only main effect terms for the two strata. This definition of inflation in (5) links to the alternative form of (2) given at the start of this section, with \hat{w}''_{ih} defined by Z_{ih}/M_{ih} (essentially defining a DSE), but we are assuming a common error across all small areas in stratum
h. This is not unreasonable as the SPD is based on linked administrative data with some sensible
rule based criteria for inclusion and would therefore not be expected to have as much localized
variation in its errors as in the census, which is still a localized data collection exercise that is
susceptible to variation in the quality of the field staff.
From the survey we observe Zig.h for a sample (i є s) of areas. Matching to the SPD gives a count
of matched true residents Mig.h, which is less than Yig.h due to survey non-response and under-coverage on the list. (If the survey had 100 % response and no matching error, then Mig.h would be
equal to Xig.h – Eig.h, the true records on the SPD, and Zig.h would equal Yig.h.) However, assuming
survey non-response is independent of under-coverage on the SPD 10, we can still get an
approximately unbiased estimate of the inflation ratio (5) for the SPD using
\hat{\gamma}_h = \frac{\sum_{i \in s_h} w_{ih} Z_{ih}}{\sum_{i \in s_h} w_{ih} M_{ih}},    (8)

where w_{ih} is the sample weight for area i selected into the survey and s_h is the sample of small areas from under-coverage estimation stratum h.
We can also match the survey responses in small area i to the SPD in another small area j to identify N^{(j)}_{ig.h}, those individuals identified by the survey as belonging to area i but incorrectly included on the SPD in area j. By definition N^{(i)}_{ig.h} = 0, as the sample for area i cannot directly find individuals incorrectly included within area i, but can only point to incorrect individuals in other areas. Therefore, an estimate of E_{jg}, all the incorrectly included records for small area j, is given by

\hat{E}_{jg} = \sum_{i \in s} w_{ig} N^{(j)}_{ig}    (9)
where we are summing across all the sample areas (i є s), given matching from the sample
elsewhere has identified incorrectly included members in small area j. This is not restricted to
within over-coverage estimation stratum g as a sample area from any stratum might find an
incorrectly included member in small area j of stratum g. We would expect Ê jg to under-estimate
the incorrectly included for small area j, due to survey non-response, but M will also under-estimate
X-E for a sampled area due to the same survey non-response (assuming that survey non-response
in sampled area i is independent of being incorrectly included in another area j). Therefore, despite
the non-response, we can get an approximately unbiased estimate of the deflation ratio (4) for the
SPD using
\hat{\delta}_g = \frac{\sum_{i \in s_g} w_{ig} M_{ig}}{\sum_{i \in s_g} w_{ig} M_{ig} + \sum_{i \in g} \hat{E}_{ig}}    (10)
where wig is the sample weight for area i selected into the survey.
10 This assumption may not be realistic in practice, as was the case for the 2011 Census. However, it is expected that administrative sources and a survey would be closer to independence.
Putting it all together we get

\hat{Y}_{ig.h} = X_{ig.h} \times \hat{\delta}_g \times \hat{\gamma}_h = X_{ig.h} \times \frac{\sum_{i \in s_g} w_{ig} M_{ig}}{\sum_{i \in s_g} w_{ig} M_{ig} + \sum_{i \in g} \hat{E}_{ig}} \times \frac{\sum_{i \in s_h} w_{ih} Z_{ih}}{\sum_{i \in s_h} w_{ih} M_{ih}}    (11)
and the estimated Y’s can then be aggregated to produce estimates at the desired level. In the
2011 CCS we had distinct strata for g and h, but given that in an administrative system many of the incorrectly included records will be excluded elsewhere, it makes sense that the stratifications g and h might be common and defined by g. Therefore, we can re-write (11) as
\hat{Y}_{ig.h} = X_{ig} \times \frac{\sum_{i \in s_g} w_{ig} Z_{ig}}{\sum_{i \in s_g} w_{ig} M_{ig} + \sum_{i \in s} w_{ig} N_i^{(g)}}    (12)
where the ratio in (12) trades off the under-coverage against the incorrect inclusions within the common stratum g to give a net adjustment to the SPD at the local level, and N_i^{(g)} are all the
incorrectly included individuals in estimation stratum g as measured by the survey responses in
small area i, which can be from any region. This estimator is very similar to the census type
approach outlined above, which is reassuring given the slightly different motivation. However, both
have the disadvantage that in their current form they do not deal with records on the SPD that
should not be in the population at all – erroneous inclusions. They are also susceptible to matching
errors. The next approach explores one way of measuring all types of over-coverage.
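As a rough numerical illustration of the estimator in (11), the sketch below (all area-level counts invented) computes the inflation factor (8) and the deflation factor (10) and applies them to an SPD count. For simplicity the \hat{E} term is summed over the sampled areas only.

```python
# Illustrative sketch (invented numbers) of the inflation/deflation
# estimator: gamma_hat corrects SPD under-coverage using the survey,
# delta_hat corrects for incorrectly included records found elsewhere.

# Per sampled small area: sampling weight w, survey count Z, matched
# true residents M, and estimated incorrect inclusions E_hat.
areas = [
    {"w": 10.0, "Z": 120.0, "M": 100.0, "E_hat": 8.0},
    {"w": 10.0, "Z":  95.0, "M":  80.0, "E_hat": 5.0},
    {"w": 10.0, "Z": 110.0, "M":  90.0, "E_hat": 7.0},
]

wZ = sum(a["w"] * a["Z"] for a in areas)
wM = sum(a["w"] * a["M"] for a in areas)
E_total = sum(a["E_hat"] for a in areas)

gamma_hat = wZ / wM                 # inflation factor, as in (8)
delta_hat = wM / (wM + E_total)     # deflation factor, as in (10)

X_area = 500.0                              # SPD count for one small area
Y_hat = X_area * delta_hat * gamma_hat      # estimate as in (11)
print(round(Y_hat, 1))
```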
6.3 Extending the estimation approach
An estimation approach utilizing multiple administrative sources that can also cope with the
erroneously included individuals is being developed. Suppose rather than combining two sources
to create the SPD, the two sources are available as distinct administrative counts but still linked at
the individual record-level within local areas. This linkage defines those included on both sources
and those included on only one of the sources in each local area. It does not include those
excluded from both at the local level but it does include all the erroneous inclusions; some on a
single source and some appearing on both sources. Suppose an independent PCS is now linked to
the two sources. This survey only covers real members of the local area so for the survey
responders we can fully define the relationship between the two administrative sources; those
correctly included by both, those correctly included on one source, and those excluded from both
sources. However, those included on the administrative sources that do not respond to the survey
will be a mixture of genuine survey non-response and all the incorrect and erroneous inclusions of
the administrative sources. If the survey is independent of the administrative sources, we can force
the relationship between the administrative sources for survey responders (those who match to the
administrative sources) onto the data for non-responders (those who don’t match), thereby
identifying the incorrect and erroneous inclusions and adjusting for those excluded by both sources
and the survey. This is similar to a latent class model where a certain structure forces the
identification of two classes. Here, the structure for survey responders forces the identification of
two classes, genuine survey non-response and incorrect/erroneous inclusions of the administrative
sources, with additional restrictions on the ‘joint’ behaviour of the incorrect and erroneous
inclusions on the two sources.
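A deliberately simplified numerical sketch of the idea is given below. All counts are invented, and the identifying restriction used here (that erroneous records never appear on both sources) is only one possible version of the restriction on joint behaviour mentioned above, not the method actually under development.

```python
# Toy two-list identification (invented counts): force the source
# cross-classification observed for survey responders onto the
# unmatched records, assuming erroneous records are never on BOTH
# sources, so records on both sources are taken to be genuine.

# Survey responders matched to administrative sources A and B
resp_both, resp_A_only, resp_B_only = 600, 150, 100
resp_on_either = resp_both + resp_A_only + resp_B_only

p_both = resp_both / resp_on_either
p_A_only = resp_A_only / resp_on_either
p_B_only = resp_B_only / resp_on_either

# Administrative records NOT matched to the survey: genuine
# non-responders mixed with incorrect/erroneous inclusions
u_both, u_A_only, u_B_only = 120, 80, 60

# Under the restriction, the 'both sources' cell pins down the number
# of genuine non-responders, and the remainders are erroneous:
genuine = u_both / p_both
err_A = u_A_only - genuine * p_A_only
err_B = u_B_only - genuine * p_B_only
print(round(genuine), round(err_A), round(err_B))
```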
The basic idea above requires at least two distinct administrative sources, both with reasonable
coverage of the whole population. However, if these have already been combined into the single
SPD, the same idea can be used but by creating two sources using the SPD at two time points.
Here, the assumption would be that at the second time point some erroneously included records
from time point one have been removed from the SPD but then replaced by new records, some
real and some erroneous. For the survey responders, taken at the first time point, we would see
those real members that remain on the SPD at the second time point, those that leave because
they have moved from the local area, and those that are added to the SPD by the second time
point due to lags in registration. Similar to the two sources, we could then use this relationship to
identify the erroneous records in the movements among the survey non-responders and therefore
estimate the true population of the SPD at the first time point.
Both of these approaches need considerable further development, but there is the possibility to test the two-list approach using data from 2011, where the CCS and Census matched data were also matched to the NHS Patient Register. Using the SPD across time can also be investigated, as it is being created for 2011 and so can be linked to the CCS, provided it can also be created for a future time point.
6.4 Developing an estimation strategy
Of the options described above, the first framework is the simplest and closest to the methods
used for the 2011 Census. As a result, this is the option that will be explored in the simulation
studies described in section 7, which should establish the feasibility of producing population
estimates from anonymously linked record-level administrative data and a PCS.
The intention in the next phase of the research is to explore whether the alternative approaches
are likely to provide improvements over the simple weighting approach. We expect the error
measurement approach to be very similar to the weighting approach, as they are essentially the
same idea but the motivation for error measurement is perhaps more appropriate when thinking of
an SPD. In addition, the estimator can be easily extended to utilize multiple surveys over time,
essentially opening up the opportunity to smooth the parameters over time. However, we are
particularly keen to explore the feasibility of the last approach due to the measurement of all
erroneous inclusions being built into the strategy. For the other options, such measurement is not
built in and may prove to be problematic if the SPD construction process is not able to remove
almost all of them (since the survey does not help because the erroneous inclusions are never
observed). Some initial ideas for dealing with this are being explored as described in section 4. The
weighting class ideas also need investigating as, while they would not be expected to perform as
well as the other approaches if there were minimal erroneously included records, they have the
potential to offer reasonably efficient estimation even with erroneous records present on the SPD
and no need to link between the survey and the SPD at the individual level. The other attraction is
that they would be less susceptible to the problem of matching error as the only matching required
is at address (or household) level, whereas all other approaches would require good quality
matching at person level or a robust assessment of matching error.
One issue not discussed above is that of obtaining local authority level estimates. In general, the
sample sizes discussed in section 5.3 would not allow direct estimates at LA level. Some grouping
of LAs is likely to be required to enable sufficient sample sizes for direct estimates, after which some type of small area estimator could be applied to obtain LA level estimates. For this paper, we
will assume the use of a simple synthetic small area estimator as used in the 2011 Census. Further
work will explore the use of more sophisticated estimators.
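For orientation, a simple synthetic estimator of the kind referred to above can be sketched as follows; the LA names, SPD counts and group estimate are hypothetical.

```python
# Sketch of a simple synthetic small area estimator: a coverage ratio
# estimated for a group of local authorities (LAs) is applied uniformly
# to each LA's SPD count to give LA-level estimates.

group_estimate = 1_050_000.0   # direct population estimate for the LA group
group_spd = 1_000_000.0        # SPD count for the same group

ratio = group_estimate / group_spd   # common net coverage adjustment, 1.05

spd_by_la = {"LA1": 300_000, "LA2": 450_000, "LA3": 250_000}
synthetic = {la: count * ratio for la, count in spd_by_la.items()}
print(synthetic)
```

The synthetic assumption is that every LA in the group shares the group's coverage pattern; more sophisticated small area estimators relax this.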
7 Simulation studies
7.1 The simulation engine
In the development of the sample design and estimation methods for the 2011 Census, simulation
studies were the key tool used to provide the evidence necessary to make decisions on the
methods to use. They provided assurance that the methods were robust and had good statistical properties. Whilst some judgements had to be made, the simulations provided the
basis for these judgements. A similar approach can be used for the development of methods for
deriving population estimates from an anonymously linked record-level administrative data based
approach.
The objective of such simulation exercises is to explore the performance of the statistical
methodology, under some simplifying assumptions. For instance, the simulations can show the
quality of the estimates if the survey has zero non-response, and matching is perfect. Even though
these assumptions are clearly unrealistic, if the resulting estimates are poor then it is clear that the
methodology used to derive those estimates is not worth pursuing. As simulation studies are
complex systems, the general approach is to start simple and build in additional complexity over
time and as the methodology develops. For instance, the initial simulation studies may make the
assumption of independence between the SPD and the survey. Later simulations may be able to
explore the impact on the estimates of dependence between the two. Previous research, and the
known properties of particular statistical techniques will provide guidance on the impact of violating
assumptions and therefore prioritise which assumptions should be explored in subsequent
iterations of the simulation studies.
The simulation framework developed for this paper is as follows.
• Extracts of fully edited and imputed 2001 Census household and individual data were used as the basis, and are treated as the ‘truth’. 2001 data was used to enable timely research progress. Note that even though the 2001 Census data were not fully adjusted following the post-census studies, this does not impact on the simulations as they are using the base data as a tool for exploring how well the estimator recovers the truth. Later simulation studies may use 2011 Census data, although the results should not be any different.
• A set of probabilities of being included (correctly) on an SPD was estimated relating to key characteristics for persons. An assumption was made that all households are included unless no persons were subsequently included. See Appendix B for the details.
• A set of probabilities of being incorrectly included on an SPD (conditional on whether the individual was included correctly) was estimated relating to key characteristics for persons. See Appendix B for the details.
• A set of probabilities of being counted in a PCS was obtained from modelling the 2001 Census data. See Appendix C for the details.
• The probabilities were merged onto the data.
• A random number generator was used to decide whether each household was counted in the PCS (generating an outcome with 4 values – counted in both, SPD only, PCS only, missed in both). If the household was missed, then by default all the persons in that household were missed. Initially, all households were assumed to be included in the SPD.
• A random number was used to decide whether each person within each household was included correctly in the SPD and/or PCS, and also whether they were included incorrectly or not in the SPD (generating an outcome with 8 values as shown in Table 1).
Table 1 - Simulation outcome flags

Outcome flag | SPD only count | PCS only count | ‘Both’ (in SPD and PCS) count | Incorrect inclusion on SPD | Scenario
1 | 0 | 0 | 1 | No | Included correctly in both SPD and PCS
2 | 1 | 0 | 0 | No | Included correctly in SPD, missed by PCS
3 | 0 | 1 | 0 | No | Included correctly in PCS but not on SPD
4 | 0 | 0 | 0 | No | Not on SPD, missed in PCS
5 | 1 | 0 | 1 | Yes | Included correctly in both SPD and PCS but also included incorrectly on SPD. This is a duplicate in the SPD.
6 | 2 | 0 | 0 | Yes | Included correctly and incorrectly in SPD but missed by survey. This is a duplicate in the SPD.
7 | 1 | 1 | 0 | Yes | Included incorrectly on SPD but counted correctly by PCS.
8 | 1 | 0 | 0 | Yes | Included incorrectly on SPD and missed by survey.

• This was repeated to provide a bank of 400 simulations in a single dataset.
• The simulation replicates were then fed into a sampling and estimation program, which drew a PCS sample according to a specified design and sample size, then calculated weighted population estimates using a specified methodology (e.g. DSE, ratio estimator, or in some cases weighting class adjustment and ratio estimator).
• The estimates were stored, and the sampling and estimation process repeated, giving a bank of 400 results.
• To evaluate the performance of each option, measures of relative bias, relative root mean square error (RRMSE) and relative standard error (RSE) were calculated. These are defined in Appendix D.
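The replicate-and-evaluate loop above can be caricatured in a few lines. The inclusion probabilities, person counts and stand-in replicate estimates below are all invented; the metric definitions are the standard forms (the paper's exact definitions are in Appendix D).

```python
# Minimal sketch of the simulation-and-evaluation loop: flag each
# person's SPD/PCS outcome at random, then summarise replicate
# estimates by relative bias, RSE and RRMSE.
import random
import statistics

def simulate_person(p_spd=0.90, p_pcs=0.85, p_incorrect=0.05):
    """Randomly flag one person: correctly on SPD, counted by PCS,
    and/or incorrectly included on SPD (cf. Table 1)."""
    return (random.random() < p_spd,
            random.random() < p_pcs,
            random.random() < p_incorrect)

def rel_bias(estimates, truth):
    return (statistics.mean(estimates) - truth) / truth

def rse(estimates, truth):
    return statistics.pstdev(estimates) / truth

def rrmse(estimates, truth):
    return (rel_bias(estimates, truth) ** 2 + rse(estimates, truth) ** 2) ** 0.5

random.seed(1)
truth = 1000.0
people = [simulate_person() for _ in range(int(truth))]
# SPD records = correct inclusions plus incorrect inclusions (duplicates)
spd_records = sum(s for s, _, _ in people) + sum(e for _, _, e in people)

# Stand-in for a bank of 400 replicate estimates from sampling + estimation:
estimates = [truth * (1 + random.gauss(0.0, 0.02)) for _ in range(400)]
print(rel_bias(estimates, truth), rse(estimates, truth), rrmse(estimates, truth))
```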
Note that some simplifying assumptions are necessary for these simulations.
a) Firstly, when an individual is incorrectly included in the SPD, we assume they are included
incorrectly within the same location. Whilst this is not realistic (as many people included once, but
in the wrong location, can be included in a different local authority), it makes the simulations
simpler. In essence the incorrectly included record represents another incorrectly included record
in that location (even though it has the same characteristics). A further assumption is that the area
simulated is a closed population. Therefore, a PCS within the area can identify both under-coverage and over-coverage for the same individual.
The crucial point here is that there is no attempt to place the incorrectly included record elsewhere,
so it appears as incorrectly included in the same place. Therefore, if the PCS samples the real
record it will identify the incorrectly included record to contribute to estimating the over-coverage
regardless of whether the real record is in the SPD and contributing to the DSE (if used). This
simplifies the simulation compared to the reality of searching outside the PCS sample to find
incorrectly included records, although it means the simulations will provide an optimistic
assessment of how well an over-coverage adjustment will work, as it assumes perfect matching.
b) All households in the SPD are included with probability 1, although if all persons within are
excluded then the household is treated as missed. Thus the household inclusion is dependent on
the inclusion of the persons (either correct or incorrect).
c) There is independence between the SPD probabilities and the PCS probabilities. This is a
reasonable starting assumption; later research will explore the impact of a violation.
d) There are no erroneous inclusions. This is a simplifying assumption as otherwise the simulations
would have to generate a list of records which were not part of the population. The impact of these
records on the estimates would be to inflate the estimates by whatever proportion there are. So if
there are 1 % erroneous inclusions, this creates a 1 % bias on the population estimate (as there is
no mechanism yet for correcting for them). This is another topic for future research.
e) Perfect matching. That is, there are no matching errors when matching the SPD and PCS and
no errors when searching for incorrect inclusions or duplicates.
7.2 Estimators to be explored
Table 2 shows which estimation options, discussed in section 6, were implemented within the
simulation studies. This range of options will demonstrate performance across the chosen
scenarios (see section 8.3), allowing an assessment of robustness. The expectation is that some
of these options will not be adopted, but they were included for completeness and to establish a
baseline to show the change in bias and variance for each component (e.g. it will show the
improvement in relative standard error (RSE) when using a ratio estimator). This is important, as
there will be biases operating in different directions. However, not all of the 6 options shown in
table 2 were implemented for every area or scenario to keep the simulation workload to a
reasonable level.
Table 2 - Estimation options to be tested in the simulation studies

Option to be tested | Survey non-response option | Calibration weight | Over-coverage adjustment | Notes
1 | Use true values | None | None | Basic Horvitz-Thompson; should be unbiased and have large variance.
2 | Use true values | Ratio | None | Shows effect of ratio estimator. Should have small positive bias and small variance.
3 | None | Ratio | None | Shows effect of non-response. Should have negative bias and small variance.
4 | Simple weighting class | Ratio | None | Shows how simple weighting class can correct for some non-response but not all. Expect bias to be smaller.
5 | DSE | Ratio | None | Shows effect of basic DSE on bias; should have positive bias due to inflation in SPD and increased variability.
6 | DSE | Ratio | Yes | All options turned on. Should provide the best option in terms of bias and variance, but won’t be perfect.

7.3 Areas and scenarios
The simulations were run using the following four estimation areas (EA), chosen specifically to
represent a few different area types:
a) Inner London - Lambeth and Southwark;
b) Large city – Leeds;
c) Rural – Breckland, Babergh, Forest Heath, Mid Suffolk and St. Edmundsbury; and
d) Rural with large town – Conwy, Denbighshire, Flintshire and Wrexham.
For each area, the following six scenarios were implemented to be able to show the sensitivity and
robustness of the survey design and estimation options under different conditions.
a) Base set of probabilities, which varied by characteristics including broad area type (at local
authority level), broad age-sex, student status and migrant status (internal, international).
These base probabilities are defined in Appendix B.
b) As per the base set, but with increased probabilities of records in the wrong place.
c) As per the base set, but with lower probabilities of records in the wrong place.
d) As per the base set, but with zero probability of duplicated records.
e) As per the base set, but with lower probabilities of records in the correct location (and therefore there was a higher proportion completely missed and a higher proportion in the wrong location).
f) As per the base set but with much lower probabilities of records in the wrong place (and
therefore there was very low over-coverage but the level of exclusions was high).
7.4 PCS sample design and sample size
The sample design used in the simulations was fixed as that described in section 5.6. However,
three different overall sample sizes were used to show how the quality of the estimates changes
with the sample size. A sample size of 350,000 households was the base option and alternatives were an increase to 450,000 households and a decrease to 250,000 households. These translate into sampling fractions for the PSU selection of 5.1 %, 6.6 % and 3.7 % respectively. For all areas proportional allocation was used to allocate this sample to strata.
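Proportional allocation of the base sample can be sketched as below; the stratum labels and household counts are hypothetical.

```python
# Sketch of proportional allocation: the overall household sample is
# split across strata in proportion to the number of households in each.

total_sample = 350_000
households_by_stratum = {"stratum 1": 4_000_000,
                         "stratum 2": 2_500_000,
                         "stratum 3": 500_000}

total_households = sum(households_by_stratum.values())
allocation = {s: round(total_sample * n / total_households)
              for s, n in households_by_stratum.items()}
print(allocation)
```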
For the majority of the simulations, the PCS probabilities were fixed as those modelled and
described in Appendix C. To explore the impact of different levels of survey non-response, the
probabilities for the PCS were decreased by raising them to a power of 1.5. This only applied to the
analysis in section 8.4.
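The sensitivity tweak is a one-line transformation: because the response probabilities lie between 0 and 1, raising them to the power 1.5 lowers each one.

```python
# Reduce illustrative PCS response probabilities by raising to power 1.5
probs = [0.95, 0.85, 0.70]
lowered = [p ** 1.5 for p in probs]
print(lowered)   # each value is smaller than the corresponding original
```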
7.5 Summary of simulation runs
The simulations required for each sample size option are shown in table 3 below, together with the estimation options that were run for each area. Three separate runs of the simulations shown in the table were therefore required – one for each sample size option. The base scenario and the large city area included all options to provide the key evidence for evaluating the different estimation options. For the others, only the estimation method which was expected to perform best was tested.
Table 3 - Simulation study scenarios, areas and extent of estimation options tested

Simulation scenario | Inner London | Large city | Rural | Rural with a large town
A: Base | All estimation options | All estimation options | All estimation options | All estimation options
B: Increased probability of being in wrong place | DSE, over-coverage, ratio only | All estimation options | DSE, over-coverage, ratio only | DSE, over-coverage, ratio only
C: Lower probability of records in wrong place | DSE, over-coverage, ratio only | All estimation options | DSE, over-coverage, ratio only | DSE, over-coverage, ratio only
D: Zero probability of duplicates | DSE, over-coverage, ratio only | All estimation options | DSE, over-coverage, ratio only | DSE, over-coverage, ratio only
E: Lower probability of records in correct location | DSE, over-coverage, ratio only | All estimation options | DSE, over-coverage, ratio only | DSE, over-coverage, ratio only
F: Base with lots of under-coverage and low over-coverage | DSE, over-coverage, ratio only | All estimation options | DSE, over-coverage, ratio only | DSE, over-coverage, ratio only
8 Results
8.1 Simulated SPDs
Table 4 shows the percentage difference between the simulated data and the true population (this
is the bias) across the scenarios in the SPD for the four simulation areas. The bias shows how
close the SPD gets to the true population before any estimation process. For each area the bias is
small when all observations are included, but excluding the over-coverage shows the true under-coverage; the simulated SPD looks of reasonable quality at the aggregate level, but in reality is a
mixture of under and over-coverage. In addition, the SPD is assumed to include no erroneous
inclusions (e.g. deceased persons or emigrants no longer usually resident in England and Wales).
So if, for example, erroneous inclusions added an additional 4 % of records the SPD totals would
appear closer to the truth, but would actually be even more inaccurate. Summaries by age and sex
are given in Appendix D for the base and low over-coverage scenarios.
Table 4 - Percentage difference between simulated data and ‘the truth’ by EA, with and without over-count

Simulation scenario | EA | SPD: under-coverage only (%) | SPD total (including over-count) (%)
A: Base scenario | Inner London | -25.8 | -3.1
A: Base scenario | Large city | -15.6 | -2.9
A: Base scenario | Rural | -11.7 | -2.5
A: Base scenario | Rural with a large town | -10.7 | -2.5
B: Increased probability of being in wrong place | Large city | -15.6 | -1.8
C: Lower probability of records in wrong place | Large city | -15.6 | -3.9
D: Zero probability of duplicates | Large city | -15.6 | -2.9
E: Lower probability of records in correct location | Large city | -22.0 | -2.7
F: Base with lots of under-coverage and low over-coverage | Inner London | -25.8 | -21.2
F: Base with lots of under-coverage and low over-coverage | Large city | -15.6 | -12.9
F: Base with lots of under-coverage and low over-coverage | Rural | -11.7 | -9.7
F: Base with lots of under-coverage and low over-coverage | Rural with a large town | -10.7 | -8.9
8.2 Analysis of estimator performance
8.2.1 Basic simulation probabilities
Table 5 summarises the relative bias, relative standard error (RSE) and relative root mean square
error (RRMSE) for each estimation method for the Large city EA, using the base simulation data
and mid-sized survey sample (350,000 households). The first two methods (non-response =
TRUE) assume a perfect survey, meaning there is no survey non-response. The relative bias is
largely a measure of how well survey non-response is estimated, the RSE measures the effect of
variability due to sampling and non-response, and the RRMSE a combination of the two.
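A minimal sketch of how these three measures are computed from repeated simulation runs of an estimator (the values below are illustrative, not the paper's results):

```python
import numpy as np

# Sketch: relative bias, RSE and RRMSE from simulation replicates of an
# estimator. Truth and error magnitudes are illustrative only.
rng = np.random.default_rng(42)
truth = 500_000.0
# 1,000 replicate estimates with a small systematic error plus noise
estimates = rng.normal(loc=truth * 1.001, scale=truth * 0.002, size=1_000)

rel_bias = (estimates.mean() - truth) / truth   # systematic error
rse = estimates.std(ddof=1) / truth             # sampling variability
rrmse = np.sqrt(rel_bias**2 + rse**2)           # combined accuracy

print(f"relative bias {rel_bias:.2%}, RSE {rse:.2%}, RRMSE {rrmse:.2%}")
```

Because the RRMSE combines the squared bias and squared RSE, it is always at least as large as each component, which is why a low RRMSE requires both to be small.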
Table 5 - Relative bias, RSE and RRMSE by estimation method for Large city EA, to 2 significant figures

| Non-response | Calibration | Over-coverage | Relative bias (%) | RSE (%) | RRMSE (%) |
|---|---|---|---|---|---|
| TRUE | - | - | 0.11 | 4.9 | 4.9 |
| TRUE | RATIO | - | 0.018 | 0.21 | 0.21 |
| - | RATIO | - | -12 | 0.92 | 12 |
| WEIGHTING CLASS | RATIO | - | -1.9 | 0.74 | 2.0 |
| DSE | RATIO | - | 1.7 | 0.64 | 1.7 |
| DSE | RATIO | ON | 0.59 | 0.27 | 0.65 |
The first estimation option (non-response = 'true') is the basic Horvitz-Thompson (H-T) estimator, which uses only sampling weights to weight the true survey counts to the total population level. The relative bias is small (less than 0.2 %) as there is no survey non-response and the H-T estimator is unbiased. However, as expected, the RSE is high (more than 4.5 %) due to sampling error, because the estimation relies solely on the survey weights and uses no other information. As a consequence, the RRMSE is high, dominated by the variance.
In order to improve precision over the basic H-T estimator, the SPD can be used as a powerful auxiliary, as it is highly correlated with the survey counts. The second estimation option adds the ratio estimator (see section 6). This reduces the relative bias (a random effect: we would usually expect a slight increase in bias, as the ratio estimator is biased with order 1/n), and substantially reduces the RSE. The RRMSE is also low. This option gives the best simulation results for this sample size because there is no survey non-response, so it is used as the benchmark against which to assess the remaining results.
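The contrast between the two options can be sketched with toy data in which an SPD-like auxiliary is highly correlated with the truth (all numbers and the sampling design are invented):

```python
import numpy as np

# Toy contrast between the H-T estimator and the ratio estimator with an
# SPD-like auxiliary. Data and design are invented for illustration.
rng = np.random.default_rng(1)

n_areas = 5_000                            # small areas in the frame
spd = rng.poisson(60, n_areas)             # SPD count per area (auxiliary)
truth = np.round(spd * 1.03).astype(int)   # true count, correlated with SPD

sample = rng.choice(n_areas, size=250, replace=False)
weight = n_areas / 250                     # equal-probability design weight

ht = weight * truth[sample].sum()          # Horvitz-Thompson: weights only
# Ratio: scale the known SPD total by the sampled truth-to-SPD ratio
ratio = spd.sum() * truth[sample].sum() / spd[sample].sum()

print(f"truth {truth.sum():,}, H-T {ht:,.0f}, ratio {ratio:,.0f}")
```

Because the truth-to-SPD ratio is nearly constant across areas, the ratio estimator's error is driven only by how well the sample estimates that ratio, which is why its variance is far smaller than the H-T estimator's.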
To demonstrate the effect of survey non-response, the third method uses the simulated survey
counts within the ratio estimator. This shows a substantial negative bias, caused by having survey
non-response and not including an estimate for it. The bias is essentially the same as the survey
non-response rate for this area. The RSE for this method is higher than method 2, but not as high
as method 1. This is because the correlation between the survey counts and SPD is lower than the
correlation between the true count and the SPD, and therefore the ratio estimator does not provide
the same reduction in variance. The RRMSE is high due to the relative bias dominating.
To improve the estimates, the bias due to non-response in the survey must be dealt with. The next option is the weighting class method with the ratio estimator: the SPD is now used through the weighting classes to estimate survey non-response. The weighting class estimator corrects for some, but not all, of the non-response: the relative bias is still negative, but smaller than with the ratio estimator alone. The weighting class method has corrected for around 85 % of the survey non-response. This is expected, as it cannot estimate within-household non-response or households missed from both the survey and the SPD. The RSE is also reduced, as is the RRMSE.
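The weighting class idea can be sketched as follows: within each class, respondent design weights are inflated by the inverse of the observed class response rate, so respondents also stand in for their class's non-respondents. The classes and response rates below are invented; in practice the classes would be built from characteristics available on the SPD.

```python
import numpy as np

# Sketch of a weighting-class non-response adjustment. Classes and
# response rates are invented for illustration.
rng = np.random.default_rng(7)

classes = ["0-19", "20-39", "40-64", "65+"]
sampled = rng.choice(classes, size=2_000, p=[0.25, 0.30, 0.30, 0.15])
resp_rate = {"0-19": 0.90, "20-39": 0.75, "40-64": 0.88, "65+": 0.93}
responded = rng.random(2_000) < np.array([resp_rate[c] for c in sampled])

design_weight = 50.0                       # equal-probability design
estimate = 0.0
for c in classes:
    n_c = (sampled == c).sum()             # sampled persons in class c
    r_c = ((sampled == c) & responded).sum()   # respondents in class c
    # inflate each respondent's weight by the inverse class response rate
    estimate += design_weight * r_c * (n_c / r_c)

print(f"weighting-class estimate of the sampled total: {estimate:,.0f}")
```

For a simple person total the adjustment algebraically restores the full sampled count (design weight times sample size); the bias reduction comes from the fact that classes with poor response are weighted up by more, which matters when coverage differs across classes.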
The main alternative to the weighting class method is the DSE. The fifth option shows the effect of the DSE and ratio methods without correcting for over-coverage. This has a positive bias because the over-coverage in the SPD causes the DSEs to overestimate survey non-response (over-coverage records look like survey non-responders, as they cannot be found in the survey). The RSE is slightly lower than for the weighting class method, but the RRMSE is high because of the bias.
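The DSE itself is the classic capture-recapture (Lincoln-Petersen) calculation applied within an estimation cell; a minimal sketch with invented counts:

```python
# Dual-system estimator (capture-recapture) sketch. Under the usual
# independence assumptions, a cell's population is estimated from the
# survey count, the SPD count and the count matched between the two.
def dual_system_estimate(survey: int, spd: int, matched: int) -> float:
    """Lincoln-Petersen form: N-hat = survey * spd / matched."""
    if matched == 0:
        raise ValueError("no matched records: DSE undefined")
    return survey * spd / matched

# Invented cell: 880 counted by the survey, 900 on the SPD, 810 matched.
print(f"{dual_system_estimate(880, 900, 810):.1f}")   # both sources miss some

# If the SPD cell also carried 45 over-count records that can never be
# matched to the survey, the estimate is inflated:
print(f"{dual_system_estimate(880, 945, 810):.1f}")
```

The second call illustrates the positive bias discussed above: unmatched over-count records inflate the SPD count without inflating the matched count, pushing the DSE upwards.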
Finally, adding an over-coverage adjustment method gets the closest to our benchmark measures.
The relative bias is reduced to below 1 %, while the RSE is also reduced and close to the
benchmark. The RRMSE is the best of the alternatives. However, there is some residual positive
bias in the estimates when compared to method 2. This is likely to be due to the over-coverage
method being slightly conservative (in other words it doesn’t make enough downwards
adjustment). Measures for the baseline, weighting class and DSE/ratio/over-coverage methods are
provided for the other simulation EAs in Appendix E.
From these results, the basic Horvitz-Thompson method can be ruled out due to its large variance, and the ratio method without survey non-response adjustment can be ruled out as it does nothing to deal with survey non-response. The weighting class method will not be considered further in this paper, although further work will explore whether gains can be made with a modified method. Lastly, it is unrealistic to expect the SPD not to feature some over-coverage, so the DSE/ratio method without over-count adjustment is also ruled out. The rest of the analysis therefore focuses on the DSE/ratio/over-coverage method.
8.2.2 Different SPD scenarios
The different SPD scenarios generated can show how the methods perform under differing levels
of under-coverage and over-coverage. Table 6 shows the relative bias, RSE and RRMSE for the
benchmark and DSE/ratio/over-coverage methods, for the Large City EA across the scenarios. The
results for the other areas are given in Appendix E.
Table 6 - Over- and under-coverage, relative bias, RSE and RRMSE by estimation method and SPD scenario for Large city EA, to 2 significant figures

| | Base (A) | Increased probability of being in the wrong place (B) | Lower probability of records in wrong place (C) | Lower probability of records in correct location (E) | High under-coverage and low over-coverage (F) |
|---|---|---|---|---|---|
| SPD over-coverage (%) | 13 | 14 | 12 | 19 | 3 |
| SPD under-coverage (%) | 16 | 16 | 16 | 22 | 16 |
| Relative bias (%): benchmark | 0.018 | 0.010 | 0.023 | 0.0013 | 0.076 |
| Relative bias (%): DSE/ratio/over-coverage | 0.59 | 0.62 | 0.54 | 0.84 | 0.18 |
| RSE (%): benchmark | 0.21 | 0.16 | 0.23 | 0.17 | 0.39 |
| RSE (%): DSE/ratio/over-coverage | 0.27 | 0.24 | 0.28 | 0.30 | 0.44 |
| RRMSE (%): benchmark | 0.21 | 0.16 | 0.23 | 0.17 | 0.40 |
| RRMSE (%): DSE/ratio/over-coverage | 0.65 | 0.66 | 0.61 | 0.89 | 0.47 |
Table 6 shows that the DSE/ratio/over-coverage method is fairly robust. As the over-coverage varies, the bias changes in line with it, showing that the bias is largely driven by the level of over-coverage. Even with quite high levels of over-coverage (around 20 %) the bias in the estimator is still below 1 %. In addition, the RSE is largely unaffected. The final scenario simulates low over-coverage by having much smaller probabilities of being counted in the wrong place, and the result is a large reduction in bias. However, the RSE is higher because the SPD has many exclusions, so the correlation between it and the survey outcome is poorer. Overall this gives an RRMSE closest to the benchmark.
8.2.3 Summary
In summary, the DSE/ratio/over-coverage method is robust to different levels of over- and under-coverage. However, it has better overall accuracy when over-coverage is low. This is likely to be because the over-coverage adjustment is much smaller, so the DSE assumptions are much better approximated and bias is reduced. The price paid is higher variability as the SPD quality is lower, but this may be driven by simulation assumptions. Further simulations with more realistic over-coverage patterns will provide better information about the likely variance of the estimates.
8.2.4 Results by age and sex
Figure 3 shows the relative bias by broad age and sex groups for the DSE/ratio/over-coverage method for the different SPD scenarios. The age and sex groups used here are those used in the estimation process, grouped to include similar SPD coverage rates. The scenario with higher over-coverage (Scenario E - lower probability of records in correct location, resulting in greater numbers of records in the wrong location) results in high bias across all age/sex groups. The scenario with lots of under-coverage and low over-coverage (Scenario F) gives a very low bias, reducing it by half compared to the base scenario, and it is almost comparable with the 'benchmark' (no survey non-response, ratio estimator) method. This confirms the findings from the previous section, but also gives an indication of how the bias varies by age and sex, particularly for the groups where over-coverage is higher.
Figure 3 - Relative bias by age and sex groups, DSE/ratio/OC method versus benchmark, by SPD scenarios, Large city EA

[Chart: relative bias (-0.5 % to 3.5 %) by age-sex group (MF0 to MF80-120) for the DSE/ratio/OC method under scenarios A, B, C, E and F, and for the benchmark under scenario F.]
Figure 4 shows the RSE across the scenarios. Note that the variability is high for scenario F (high
under-coverage and low over-coverage) because of the reduced correlation between the SPD and
the truth, although the variation is similar to the benchmark method for the same data set. This
indicates that the extra variation due to the DSE and over-coverage weighting adjustment is small
for this scenario. The high over-coverage scenario E (lower probability of records in correct
location) also has high variance.
Figure 4 - RSE by age and sex groups, DSE/ratio/OC method versus benchmark, by SPD scenarios, Large city EA

[Chart: RSE (0.0 % to 4.5 %) by age-sex group (MF0 to MF80-120) for the DSE/ratio/OC method under scenarios A, B, C, E and F, and for the benchmark under scenario F.]
Figure 5 shows the RRMSE across the scenarios. While the low over-coverage scenario F had low
bias, the high RSE across each age group results in high RRMSE, broadly similar to that from the
high over-coverage scenario E. Essentially this shows that across the scenarios, overall
performance is similar, but the estimates are likely to have more bias if over-coverage is high.
Figure 5 - RRMSE by age and sex groups, DSE/ratio/OC method versus benchmark, by SPD scenarios, Large city EA

[Chart: RRMSE (0.0 % to 5.0 %) by age-sex group (MF0 to MF80-120) for the DSE/ratio/OC method under scenarios A, B, C, E and F, and for the benchmark under scenario F.]
Equivalent charts comparing the results in the other EAs are given in Appendix E.
8.2.5 Summary
The DSE/ratio/over-coverage method looks to be working well to estimate EA level totals, appearing robust across the different scenarios, although the estimates have a lower bias when over-coverage is traded for increased exclusions. At age-sex detail the patterns show that the bias can be quite significant for groups with high over-coverage on the SPD - this is likely to be due to the over-coverage adjustment being conservative; in contrast, the low over-coverage scenarios have much lower bias levels. Variance is high for both of these scenarios, but it can be improved through better stratification, for example by collapsing estimation categories and improving sample sizes by estimating at regional level. Thus, if over-coverage could be minimised in the SPD, even at the expense of more exclusions overall, these results show that good quality estimates could be generated using a DSE/ratio/over-coverage method.
Note that because the simulations are simplified by some of the assumptions made, the resulting variability in the estimates will be lower than in reality, so the true RSEs would be larger. This will be addressed in future simulation studies.
8.3 Analysis of different PCS sample sizes
For this analysis a perfect survey is assumed, in order to see how the variability (as measured by the RSE) of the estimates changes as the sample size differs. The bias of an estimator of non-response should, in general, be unaffected by the sample size, so it is not examined here.
Table 7 summarises the RSE for estimates derived using the ratio estimator, assuming a perfect survey, across the four EAs. The estimation method is run with three different sizes of PCS: 250,000 households, 350,000 households (equivalent to the size of the 2011 CCS sample) and 450,000 households. Table 7 includes results for the base simulation scenario and the scenario with low over-coverage and high under-coverage. (Note that the sample size for the rural EA was the same for the 250,000 sample as for the 350,000 sample, due to the constraint to sample one PSU from each LA by quality index stratum; therefore no result is presented.)
Table 7 - RSE (%) by EA, simulation scenario and PCS sample size

| EA | Base scenario (A): 250,000 sample | A: 350,000 sample | A: 450,000 sample | High under-coverage and low over-coverage scenario (F): 250,000 sample | F: 350,000 sample | F: 450,000 sample |
|---|---|---|---|---|---|---|
| Inner London | 0.29 | 0.24 | 0.21 | 0.82 | 0.63 | 0.58 |
| Large city | 0.23 | 0.21 | 0.17 | 0.47 | 0.39 | 0.37 |
| Rural | - | 0.25 | 0.21 | - | 0.47 | 0.41 |
| Rural with a large town | 0.33 | 0.24 | 0.21 | 0.66 | 0.48 | 0.42 |
Table 7 demonstrates that as the sample size increases, the RSE decreases as expected, irrespective of the simulation scenario. However, the improvement is not linear, as shown in Figure 6. This is expected, as sampling theory shows that the standard error of an estimate decreases in proportion to the square root of the sample size.
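Applying that square-root scaling to the 350,000-household figure for the Large city EA under scenario F (0.39 % in Table 7) gives a rough check on the simulated pattern:

```python
import math

# Pure 1/sqrt(n) scaling of the RSE from the 350,000-household value for
# the Large city EA, scenario F (Table 7). A rough check only, not the
# simulation itself.
base_n, base_rse = 350_000, 0.39
for n in (250_000, 350_000, 450_000):
    scaled = base_rse * math.sqrt(base_n / n)
    print(f"{n:,} households: RSE ~ {scaled:.2f} %")
```

The scaled values (roughly 0.46, 0.39 and 0.34 %) sit close to the simulated 0.47, 0.39 and 0.37 %; the residual differences reflect design features that pure scaling ignores.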
Figure 6 - RSE (%) by sample size for low over-coverage and high under-coverage

[Chart: RSE (0 to 0.9 %) against PCS sample size (250,000, 350,000 and 450,000 households) for the Inner London, Large city and Rural EAs.]
Table 8 summarises relative 95 % confidence interval (CI) widths for the low over-coverage
scenario across the different sample sizes. The table shows how the relative CI width changes as
the sample size changes from the assumed 350,000 household sample size. It shows that, as
expected, the smaller sample size results in wider confidence intervals. Again, it is worth noting
that the change in CI width is not directly proportional to the change in sample fraction (so it is not
a linear relationship).
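The CI widths follow mechanically from the RSEs under a normal approximation, which is why the two sets of results move together; a one-line sketch, assuming the tables quote the full interval width:

```python
# Normal-approximation link between an RSE and a relative 95 % confidence
# interval width: full width = 2 * 1.96 * RSE (both as percentages).
# Whether the paper's tables quote full or half widths is an assumption here.
def relative_ci_width(rse_pct: float) -> float:
    return 2 * 1.96 * rse_pct

# e.g. an RSE of 0.33 % corresponds to a relative CI width of about 1.29 %
print(f"{relative_ci_width(0.33):.2f} %")
```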
The results imply that the estimates are more robust for a PCS with a large sample size, as a smaller one can result in some very variable CI widths. This is mostly an effect of the type of population being estimated, with a CI width increase of 19 % for the Large city (which already has a large sample size), and larger increases for Inner London and the rural/urban mix areas. Utilising the larger sample produces more stable decreases in CI width (between 6 % and 13 %). This is consistent with the work on the 2001 and 2011 Censuses, which examined the CCS sample size (see Abbott, 2009 and Abbott, 1998) and concluded that a sample size of around 300,000 households provided robust estimates for the measurement of census coverage.
Table 8 – Relative 95 % CI width, and change in width, for the simulation scenario with low over-coverage and lots of under-coverage, by EA, LA and PCS sample size

| EA | LA | Relative 95 % CI width (%): 250,000 sample | 350,000 sample | 450,000 sample | Change in CI width from 350,000 sample (%): to 250,000 | to 450,000 |
|---|---|---|---|---|---|---|
| Inner London | Lambeth | 1.65 | 1.29 | 1.17 | 28 | -9 |
| | Southwark | 1.57 | 1.20 | 1.13 | 31 | -6 |
| Large city | Leeds | 0.92 | 0.77 | 0.73 | 19 | -6 |
| Rural | Breckland | - | 0.91 | 0.79 | - | -13 |
| | Babergh | - | 0.91 | 0.82 | - | -10 |
| | Forest Heath | - | 1.06 | 0.94 | - | -12 |
| | Mid Suffolk | - | 0.91 | 0.81 | - | -10 |
| | St Edmundsbury | - | 0.97 | 0.85 | - | -12 |
| Rural with a large town | Conwy | 1.25 | 0.89 | 0.80 | 39 | -10 |
| | Denbighshire | 1.29 | 0.93 | 0.83 | 39 | -11 |
| | Flintshire | 1.36 | 1.00 | 0.86 | 37 | -13 |
| | Wrexham | 1.33 | 0.97 | 0.86 | 36 | -12 |
Appendix D contains a similar table for the base scenario. Figure 7 shows how the RSE varies by
age and sex for the low over-coverage scenario.
Figure 7 – RSE by age and sex groups, truth/ratio/- method for different PCS sample sizes for scenario F (low over-coverage and high under-coverage), Large city EA

[Chart: RSE (0.0 % to 5.0 %) by age-sex group (MF0 to MF80-120) for PCS sample sizes of 250,000, 350,000 and 450,000 households.]
Figure 7 demonstrates that, across all age-sex groups, the larger the sample, the smaller the RSE
and the less variability is introduced into the estimate. The change in RSE is greatest between the
ages of 20 and 39, irrespective of sex. Equivalent charts comparing the results in the other EAs are
given in Appendix E.
These results indicate that a sample size of at least 350,000 households would provide a set of robust estimates at EA level. Whilst this is not fully conclusive, further work to explore gains in precision, especially looking at borrowing strength across time, may indicate that a smaller sample would be sufficient. However, this does not include an assessment of the impact on estimates for small areas, which may require a larger sample size.
8.4 Analysis of impact of lower PCS response rates
Table 9 summarises the mean person response rates across the simulations for the PCS for both
the base scenario and the scenario with poor PCS response for each of the four test areas. It
shows that the simulations cover a wide variety of survey response rates to enable an assessment
of their impact on the accuracy of the estimates.
Table 9 - Mean person response rate

| EA | Base simulation (%) | Poor PCS response (%) |
|---|---|---|
| Inner London | 75.9 | 29.8 |
| Large city | 88.0 | 55.2 |
| Rural | 89.5 | 59.4 |
| Rural with a large town | 90.8 | 63.5 |
Table 10 summarises the relative bias and RSE for the base simulation scenario and the scenario
with poor PCS response across the four areas. This uses the estimation strategy which includes
the DSE, ratio estimator and an over-coverage adjustment. A PCS sample size of 350,000 was
used. These options were used in order to see the impact of a PCS with lower response rates on
performance for a fixed SPD, estimation process, and sample size.
Table 10 - Relative bias and relative standard error by EA and simulation scenario

| EA | Relative bias (%): Base | Poor PCS response | Difference | RSE (%): Base | Poor PCS response | Difference |
|---|---|---|---|---|---|---|
| Inner London | 0.77 | -1.64 | -2.41 | 0.43 | 1.29 | +0.86 |
| Large city | 0.59 | 0.26 | -0.33 | 0.27 | 0.47 | +0.20 |
| Rural | 0.38 | 0.16 | -0.22 | 0.42 | 0.57 | +0.15 |
| Rural with a large town | 0.56 | 0.40 | -0.16 | 0.36 | 0.48 | +0.12 |
Table 10 indicates that a lower PCS response rate had a negative impact on the relative bias for all areas; that is, a poor PCS induces a negative bias. Whilst the relative bias looks closer to zero for most of the areas when the PCS response is poorer, the key point is that the bias has decreased when compared to the base. If the base were unbiased, the negative impact would be seen more clearly. This is probably caused by correlation bias in the DSE. In the simulations, the probabilities of being missed by the SPD have a degree of correlation with the probabilities of being missed by the survey, which causes correlation bias in a DSE. Whilst the correlation is not as strong as in a census context, it is still likely to exist (as young people are more likely than other age groups to be missed by both the PCS and the SPD). When response in one source is lower, the DSE is more susceptible to correlation bias, which shows in the bias in Table 10.
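The mechanism can be sketched with a toy simulation in which a 'hard-to-count' group is more likely to be missed by both sources (all probabilities are invented):

```python
import numpy as np

# Toy illustration of correlation bias in the DSE: when the same people
# tend to be missed by both the survey and the SPD, the matched count is
# larger than independence implies and the DSE falls short of the truth.
rng = np.random.default_rng(3)
N = 200_000

hard = rng.random(N) < 1 / 3               # a hard-to-count third
p_survey = np.where(hard, 0.55, 0.92)      # survey capture probabilities
p_spd = np.where(hard, 0.70, 0.95)         # SPD capture probabilities

in_survey = rng.random(N) < p_survey
in_spd = rng.random(N) < p_spd

dse = in_survey.sum() * in_spd.sum() / (in_survey & in_spd).sum()
print(f"DSE {dse:,.0f} vs truth {N:,}")    # the DSE under-estimates
```

Making the survey capture probabilities lower (as in the poor PCS response scenario) widens the gap between the DSE and the truth, which is the susceptibility described above.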
The relative bias for Inner London changes the most, as would be expected. Table 10 shows that
the bias with a poor PCS is more variable, especially between London and the rest of the country.
The RSE values increase for every area. This can be seen both in Table 10 and in Figure 8. This is
due to the lower correlation between the survey outcome and the SPD, meaning the ratio estimator
is less effective. This leads to more variability in the estimates, hence larger RSEs.
Figure 8 - RSE for base scenario and poor PCS response scenario across the four EAs

[Chart: RSE (0.0 % to 1.4 %) for the base and poor PCS response scenarios in each of the Inner London, Large city, Rural and Rural with a large town EAs.]
For Inner London, the RSE almost tripled between the base simulation and the poor PCS response simulation; for the Large city it increased by 74 %, for the Rural area by 38 % and for the Rural area with a large town by 34 %. This indicates that, irrespective of the type of area, poor response in the PCS results in more variability in the estimates; however, the effect is more pronounced in urban areas, where the PCS response is likely to be poorest.
Figure 9 shows the RSE by age and sex groups for the Large city EA. As can be seen from the
graph, the RSE increases for every age and sex group in this particular area when the PCS has
poor response, but the increase varies substantially across age and sex groups.
Figure 9 - RSE by age and sex groups, DSE/ratio/over-coverage method for base scenario and poor PCS response scenario, Large city EA

[Chart: RSE (0 % to 6 %) by age-sex group (MF0 to MF80-120) for the base and poor PCS response scenarios.]
Equivalent charts comparing the results in the other EAs are given in Appendix E.
The results clearly show that a PCS with low response rates results in estimates with both greater bias and greater variability. In addition, the methods become less robust, as this will have a greater impact in low response areas or subgroups. Therefore, to minimise the risk of poor quality estimates, a PCS that achieves a high response rate is essential. This implies it will only be feasible to produce estimates of acceptable quality using administrative data if the PCS achieves a high response rate. Further work is required to explore what the target response rate should be.
8.5 Assessment of estimates against quantitative design targets
Table 11 shows the relative 95 % confidence interval widths and relative bias for each LA within the EAs, for the base and low over-coverage SPD scenarios using the DSE/ratio/over-coverage method. While this is only for a limited set of areas, and with perhaps artificially low RSEs, all CIs are well within the quality standard of within plus or minus 3 % of the truth. There is also a bias quality standard: at local level a bias of 0.4 % or greater would not be acceptable. Levels of bias generally meet this standard with the low over-coverage SPD in most areas. These results, as with the other simulation outcomes, carry the caveat of requiring high matching accuracy and minimal erroneous inclusions.
Table 11 - Relative 95% confidence interval width and relative bias by LA for DSE/ratio/over-coverage method for each EA, base and low over-coverage SPD scenarios, base sample size

| EA | LA | Relative 95 % CI width (%): Base | High under-coverage and low over-coverage | Relative bias (%): Base | High under-coverage and low over-coverage |
|---|---|---|---|---|---|
| Inner London | Lambeth | 0.88 | 1.57 | 0.79 | 0.29 |
| | Southwark | 0.82 | 1.48 | 0.74 | 0.26 |
| Large city | Leeds | 0.53 | 0.85 | 0.59 | 0.18 |
| Rural | Breckland | 0.80 | 1.05 | 0.36 | 0.10 |
| | Babergh | 0.77 | 1.04 | 0.34 | 0.08 |
| | Forest Heath | 0.96 | 1.24 | 0.49 | 0.17 |
| | Mid Suffolk | 0.78 | 1.02 | 0.35 | 0.07 |
| | St Edmundsbury | 0.85 | 1.11 | 0.39 | 0.11 |
| Rural with a large town | Conwy | 0.66 | 0.97 | 0.43 | 0.97 |
| | Denbighshire | 0.69 | 1.01 | 0.47 | 0.98 |
| | Flintshire | 0.74 | 1.08 | 0.52 | 1.03 |
| | Wrexham | 0.76 | 1.05 | 0.80 | -2.22 |
In the rural with a large town area, local authority effects are apparent in the relative bias: Wrexham has a higher bias in the base scenario than the other LAs in the EA. In the high under-coverage/low over-coverage scenario this bias is even more extreme, with Wrexham substantially underestimated (by 2.22 %), while the other areas are overestimated (by between 0.97 % and 1.03 %). A synthetic estimator was used to estimate the LA population, using the patterns observed at EA level. While an LA effect of greater under-coverage and over-coverage was built into the data for the large town (using different p and w probabilities as defined in Appendix B), the way the synthetic estimation method worked could not take account of this. This highlights the need to develop an equivalent of the census HtC (hard-to-count) index for stratification, and further exploration of the methods to estimate at local authority level. These will be subjects for further research and testing.
9 Conclusions
This paper has outlined an initial methodology for deriving population estimates using
administrative data and a PCS, utilising the framework described in Beyond 2011: Producing
Population Estimates Using Administrative Data: In Practice (M7) and summarised in section 3.1.
The general approach has been to build on the 2011 Census methodology, as the general
structure of the problem is similar. Based on a number of assumptions and the survey context
outlined in section 4, section 5 of the paper presents an initial survey design. The design is limited
in complexity due to the limited information about the SPDs at present with which to optimise the
sampling. However, the proposed design can be refreshed as data are obtained.
The paper also presents some options for the estimation framework in section 6, again using some
of the methods from the 2011 Census, but also considering how these could be extended. One of
the critical issues for future research will be the estimation of erroneous inclusions, as these cannot
be observed in the survey.
Lastly, sections 7 and 8 outline the methods and results from a simulation study, which re-uses the
technology from the 2011 Census development. The simulations used here are relatively simple,
reflecting the lack of knowledge about the likely coverage patterns in an SPD. However, the results
show a great deal of promise in terms of precision of the population estimates using a PCS that is
similar in size and nature to the previous CCS. The results show that the quality standards in terms
of the confidence intervals are likely to be met, assuming further work on stratification and the
methods for obtaining LA populations are able to detect localised differences.
The evaluation of bias in the simulation results has shown that the methods as they stand perform better when over-coverage is minimised, even if the price paid is an increase in under-coverage. This has implications for how the SPD is formed, and further work on both the estimation methods and the SPD formation will focus on this.
Sections 8.3 and 8.4 explored the impact of different sample sizes and PCS response levels on the
estimates. They show that samples smaller than 350,000 households may increase the risk of not
meeting the CI quality standards, but that a larger sample may provide important gains particularly
at LA and lower geographic levels. A PCS which achieves response levels of 50 to 70 % increases
both the bias and variability of the estimates, and makes the methodology less robust. The
conclusion is that to minimise the risk of not meeting quality standards for population estimates a
high response PCS is essential. As a result, the Programme is currently exploring whether new
legislation could be created to make provision for this survey to be compulsory.
Critical to all of the results in these simulations were the assumptions underpinning them, outlined in section 7. The key ones were high accuracy of matching and minimal erroneous inclusions. The impact of violating either of these assumptions is a one-to-one increase in bias. Further work on the likelihood of these errors, their mitigation and potential adjustment methods will be critical for confirming whether an anonymously linked record-level administrative data method combined with a PCS will produce population estimates of acceptable accuracy.
Appendix A: Statistical quality targets
In order to evaluate each of the options that the Beyond 2011 Programme is considering, it has been necessary to produce a range of evaluation criteria (see Beyond 2011: Options Report (O1) for details). One of the criteria is the evaluation of the statistical quality of the estimates. The standards proposed can be found in Tables A1 and A2, and evaluate the estimates against the maximum quality, average quality and the quality achieved in the current population estimates system. Beyond these, there is also an evaluation criterion to produce unbiased estimates, with biases not larger than around 0.5 % nationally and 0.4 % locally.
Table A1: LA quality standards for population estimates

| | 97 % of LA population estimates have a 95 % confidence interval of ... | All LA population estimates have a 95 % confidence interval of ... |
|---|---|---|
| P1 | +/- 3.0 % or better | +/- 3.8 % or better |
| P2 | +/- 3.0 % or better in the peak year; +/- 6.0 % or better in year nine | +/- 3.8 % or better in the peak year; +/- 13.0 % or better in year nine |
| P3 | +/- 5.2 % or better | +/- 8.5 % or better |
Table A2: National quality standards for population estimates

| | England and Wales population estimates have a 95 % confidence interval of ... | England population estimates have a 95 % confidence interval of ... | Wales population estimates have a 95 % confidence interval of ... (note 11) |
|---|---|---|---|
| P1 | +/- 0.15 % | +/- 0.15 % | +/- 0.64 % |
| P2 | +/- 0.15 % in the peak year; +/- 0.27 % in year nine | +/- 0.15 % in the peak year; +/- 0.28 % in year nine | +/- 0.64 % in the peak year; +/- 1.06 % in year nine |
| P3 | +/- 0.22 % | +/- 0.23 % | +/- 0.81 % |
Quality standards have not yet been set explicitly for population estimates by age and sex, for
small areas or for change over time.
The quality standards set out above all relate to the variability of the population estimates. It is also
important that the approach does not systematically over or under estimate the population.
Therefore there is an additional target that the estimates should be unbiased. For further
information about how these quality standards have been set, please refer to ‘Beyond 2011:
Options Report 2’ (Paper O2).
11. The accuracy achieved in the 2011 Census for the population estimate for Wales is less than that for England, due to the smaller population size being estimated.
Appendix B: SPD probabilities of inclusion and incorrect inclusion
Three probabilities are used within the simulations to drive the outcome for each individual. The
probabilities are provided as a set of input assumptions.
The first probability is that of being included correctly, to reflect whether the individual is included in
the SPD in the correct location:
In addition, two conditional probabilities are used within the simulations. They are conditional on
whether the individual was included correctly. This allows the simulations to reflect the different
types of incorrect inclusion on an SPD – records being in the wrong place (not included at the
correct location, but included somewhere else) and duplicates (included at the correct location
AND included somewhere else).
So if the probability of being included correctly is

p = Probability(Correct = 1)

then the conditional probability

d = Probability(Duplicate = 1 | Correct = 1)

defines the level of duplication, and the conditional probability

w = Probability(Incorrect = 1 | Correct = 0)

defines the records in the wrong location.
Note that therefore:
• The probability of correct inclusion is p*(1-d).
• The probability of a duplicate is p*d.
• The probability of an individual being included once but in the wrong location is (1-p)*w.
• The probability of an individual being missed entirely is (1-p)*(1-w).
These are summarised in Figure B1.
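The branching between the four outcomes can be sketched in code. This is a minimal illustration of the logic above, not the programme's actual simulation code; the function name and outcome labels are our own.

```python
import random

def simulate_spd_outcome(p, d, w, rng):
    """Draw one individual's SPD outcome from the three input probabilities.

    p -- probability of being included at the correct location
    d -- P(also included elsewhere | included correctly), i.e. a duplicate
    w -- P(included in the wrong location | not included correctly)
    """
    if rng.random() < p:
        # Included at the correct location; may additionally be duplicated.
        return "duplicate" if rng.random() < d else "correct"
    # Not included at the correct location; may appear elsewhere instead.
    return "wrong location" if rng.random() < w else "missed"

rng = random.Random(2011)
outcomes = [simulate_spd_outcome(0.9, 0.002, 0.5, rng) for _ in range(10_000)]
```

Over many draws the outcome frequencies approach p*(1-d), p*d, (1-p)*w and (1-p)*(1-w) respectively, matching the bullet points above.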
Figure B1 – SPD probabilities of inclusion and incorrect inclusion for each outcome
The three probabilities are defined as input assumptions, and can vary by characteristics that are
included on the 2001 Census data such as sex, age, marital status, location one year ago, whether
a student or not, activity last week. However, the simulations do not have any specific geographical
variation beyond that defined by individual characteristics.
The base probabilities used are summarised in Figures B2 and B3.
Figure B2 – Base probability of being included correctly, by quinary age, sex and area type
[Chart: P = Probability(Correct) on the y-axis (0.0 to 1.0), by age-sex group on the x-axis (MF0, F1-4 to F85+, M1-4 to M85+), with separate lines for Inner London and Urban area types.]
Figure B3 – Base probability of only being counted in the incorrect location, by quinary age, sex and
area type
[Chart: W = Probability(Incorrect = 1 | Correct = 0) on the y-axis (0.0 to 1.0), by age-sex group on the x-axis (MF0, F1-4 to F85+, M1-4 to M85+), with separate lines for Inner London and Urban area types.]
How the SPD probabilities were derived
The SPD probabilities of inclusion and incorrect inclusion were derived using data used in the
quality assurance (QA) of the 2011 Census. As part of the QA process, NHS Patient Register (PR)
data were matched against the combined 2011 Census and CCS data (referred to collectively as
the census data), within CCS areas, in order to determine which registrations were at the correct
location at the time of the census. Unmatched PR records were then matched against other
administrative sources to determine whether the patient was in fact present, but missed by the
census, or whether the registration should be elsewhere in the country. This exercise has so far
been carried out for over 50 LA districts, ranging from Inner London to rural areas (see Beyond 2011: Options Report (O1)).
The PR/census matching work led to an outcome for each PR and census record within the CCS
clusters:
• PR and census match;
• census unmatched;
• PR unmatched to census, but matched to alternative source and assessed present;
• PR unmatched to census, but matched to alternative source and assessed elsewhere;
• PR matched to census elsewhere; or
• PR unmatched and unaccounted for.
Assuming the 2011 Census estimate – the original census counts plus the additional estimate due
to coverage assessment and adjustment – is the true population, the above outcomes can be
represented in relation to the true population. See Figure B4. The idea is to use the matching
outcomes from this study to define the probabilities needed for the simulations. Some assumptions
will be required for the derivation, and for the initial simulations these will be kept simple.
Figure B4 – Summary of matching outcomes against the ‘true population’
Probabilities
P = P(correct=1) = P(correct inclusion) = (A+D)/(A+B+C+D)
This takes those matched (in the same place) and those which we assess are present (which we
think the census missed) as ‘correct’.
W = P(incorrect=1|correct=0) = P(in wrong location, given missed at right location) = (E+F)/(B+C)
This takes those on the PR that were not found in the census, and that we consider are not really present, as ‘incorrect’, out of those not included in the correct location.
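These two definitions translate directly into code. The counts a–f below stand in for the cell counts A–F of Figure B4 (the numeric values are illustrative placeholders, and the function is our own sketch, not ONS code):

```python
def derive_p_and_w(a, b, c, d, e, f):
    """SPD inclusion probabilities from the PR/census matching outcome counts."""
    # p: matched plus assessed present, out of the true population (A+B+C+D)
    p = (a + d) / (a + b + c + d)
    # w: in the wrong location, among those missed at the right location
    w = (e + f) / (b + c)
    return p, w

p, w = derive_p_and_w(a=800, b=100, c=50, d=50, e=30, f=30)
```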
Person characteristics and area type
The above outcomes were available for 54 LA districts, by quinary age and sex. The person
characteristics available are limited to those on the PR data. While the census data would provide
marital status, location one year ago, whether a student or not and activity last week, it would not
be appropriate to assume the unmatched PR records had similar distributions of these
characteristics.
The LAs included in the above analysis were categorised simply as Inner London, Major Urban or
Other, according to ONS area classifications 12.
12 See www.ons.gov.uk/ons/guide-method/geography/products/area-classifications/index.html
Allowance for erroneous enumerations in the ‘unaccounted for’ PR
As described in section 3, erroneous enumerations will be present in the PR. An allowance for this
has been made at a national level, by comparing the total 2011 Census estimate against the total
number on the PR, by age and sex, creating a down-rating factor which was applied to the
‘unaccounted for’ counts (E).
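The down-rating step can be illustrated as follows, assuming the factor is simply the ratio of the census estimate to the PR total for an age-sex group; the paper does not give the exact formula, so this is a sketch under that assumption:

```python
def downrate_unaccounted(census_estimate, pr_total, unaccounted):
    """Scale down 'unaccounted for' PR counts (E) to allow for erroneous
    registrations, using a national-level comparison by age and sex.

    Assumes the down-rating factor is census_estimate / pr_total, which is
    below 1 wherever the PR exceeds the census estimate.
    """
    factor = census_estimate / pr_total
    return unaccounted * factor

adjusted_e = downrate_unaccounted(census_estimate=9500.0, pr_total=10000.0,
                                  unaccounted=40.0)
```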
Duplicates
A national duplication level has been assumed, using information from the Audit Commission’s
National Duplicate Registration Initiative (Audit Commission, 2012). This found in excess of
125,000 potential duplicates for investigation in 2009/10 (0.2 % of the population). Only 23 % of
these were subsequently removed from the data. This initiative followed a similar initiative in
2003/04, which removed around 63,000 duplicates, but assessed that many more remained due to
National Health Applications and Infrastructure Services sites not appropriately reviewing the
information provided to them (Audit Commission, 2006). As an initial estimate, the default
duplication level was set at 0.2, with increased probabilities for outside-UK migrants of 0.5, and
within-UK migrants of 0.4, as these two groups were more likely to have duplicate registrations.
Appendix C: How the PCS probabilities were derived
This appendix outlines the approach used to generate, for each individual and household, the probabilities of being counted by a PCS.
Modelling the 2001 Census coverage patterns
Using the matched data from the 2001 Census and CCS it is possible to identify three groups of
individuals:
a) those who responded to both the Census and the CCS;
b) those who responded to the Census only; and
c) those who responded to the CCS only.
In addition, we have household and individual level characteristics for each case as well as the
local area defined by the postcode and OA. If we assume independence between the census and
CCS at the individual level, after controlling for characteristics and area, we can model the census
coverage probability as

group a / (group a + group c)

and the CCS coverage probability as

group a / (group a + group b).
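Under the independence assumption, these coverage probabilities are simple fractions of the matched group counts. A minimal sketch (the function and variable names are our own):

```python
def coverage_probabilities(n_both, n_census_only, n_ccs_only):
    """Census and CCS coverage probabilities from the matched 2001 data.

    n_both        -- group a: responded to both the census and the CCS
    n_census_only -- group b: responded to the census only
    n_ccs_only    -- group c: responded to the CCS only
    """
    p_census = n_both / (n_both + n_ccs_only)   # group a / (group a + group c)
    p_ccs = n_both / (n_both + n_census_only)   # group a / (group a + group b)
    return p_census, p_ccs

p_census, p_ccs = coverage_probabilities(n_both=900, n_census_only=100,
                                         n_ccs_only=50)
```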
One approach to modelling probabilities is through logistic regression. If we consider the model for census coverage, our outcome is defined as:

Yi,CEN = 1 if individual i is in group a
Yi,CEN = 0 if individual i is in group c.

The logistic model is then defined as:

logit(P(Yi,CEN = 1)) = β0 + β1x1i + β2x2i + … + βpxpi    (C1)
where x1, …, xp are a set of observed characteristics that in general can refer to the individual, their
household, and the type of area they come from. To allow for the local area effects, we extend (C1)
to be a standard multilevel logistic model with random effects at the postcode, OA, LA, and EA
level. An equivalent model can be defined for CCS coverage. These are in effect extensions of the
models fitted by Rahman and Goldring (2006).
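Once the coefficients of (C1) have been fitted, a coverage probability for an individual is recovered by inverting the logit. A minimal sketch with illustrative coefficients (not the fitted ONS model):

```python
import math

def predict_coverage(beta0, betas, xs):
    """Predicted coverage probability for characteristics xs under model (C1)."""
    # Linear predictor: beta0 + beta1*x1 + ... + betap*xp
    eta = beta0 + sum(b * x for b, x in zip(betas, xs))
    # Inverse logit maps the linear predictor back to a probability
    return 1.0 / (1.0 + math.exp(-eta))
```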
The two multilevel models (a census model and a CCS model) have been fitted to the 2001 CCS
sample data for the whole of England and Wales. The variables included in each model (as fixed
effects) were:
• five-year age-sex groups (37 categories);
• economic activity (8 categories);
• marital status (6 categories);
• tenure (7 categories);
• household ethnicity (5 categories);
• household structure (12 categories);
• household size (continuous variable, linear and quadratic effects);
• region (11 categories); and
• HtC score (continuous variable, linear and quadratic effects).
In a similar way, two multi-level models were fitted to the household level matched census and
CCS data, one for the census household coverage patterns, and one for the CCS household
coverage patterns. The variables included in the two household models were the same as those
listed for the individual level models, omitting individual level variables (age-sex group, economic
activity and marital status).
In all, four models were fitted:

• CCS coverage at the individual level (generates the probabilities P+1(i));
• census coverage at the individual level (generates the probabilities P1+(i));
• CCS coverage at the household level (generates the probabilities P+1(h)); and
• census coverage at the household level (generates the probabilities P1+(h)).
The HtC score is a continuous variable derived from the proportion of single and rented
households, and the proportion of non-white, young adults and unemployed individuals, within
each OA. This HtC score is based on 2001 data, and replaces the HtC index used in previous
models which was based on 1991 data. In the work presented here, the new HtC score is used to
generate the response probabilities, while the HtC index based on the 1991 data is used when
selecting the simulated CCS samples and in any part of the estimation process where the HtC
index is required. This emulates the situation that will occur in 2011 when the estimation system
will need to be robust to a potentially out-of-date index.
Applying the models to the 2001 Census database
Using the fitted parameters from the models it is possible to predict a coverage probability for each
individual and each household in the census database for both the census and the CCS. For the
structural effects coming from the characteristics, which were included as fixed effects, this is
straightforward, but for the area effects that are fitted as random effects it is not so simple. The EA
and LA effects are the estimated residuals associated with each unit as all areas are included in
the model. For the lower level effects (OA and postcode) only a sample of areas is included in the model, so there are no estimated effects for each specific area. Therefore, effects are selected at random (with replacement) from the fitted effects within the same region. In all cases, the census and CCS
effects are sampled as a pair to preserve the observed correlation between the areal effects.
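The paired sampling of the lower-level effects can be sketched as below; drawing each (census, CCS) pair as a single unit is what preserves the observed correlation. The function and variable names are our own illustration:

```python
import random

def sample_area_effects(fitted_pairs, n_areas, seed=0):
    """Assign (census effect, CCS effect) pairs to areas outside the CCS sample
    by drawing with replacement from the pairs fitted within the same region.

    Keeping each pair together preserves the correlation between the census
    and CCS area effects.
    """
    rng = random.Random(seed)
    return [rng.choice(fitted_pairs) for _ in range(n_areas)]

# Illustrative fitted (census, CCS) effect pairs for one region
effects = sample_area_effects([(0.12, 0.08), (-0.05, -0.11), (0.02, 0.04)],
                              n_areas=500)
```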
Appendix D: Evaluation criteria for simulation studies
The three main measures for assessing accuracy/sensitivity of census coverage methods for the
simulations studies prior to the 2011 Census were the following:
Relative bias
The basic formula for the relative bias is:

Relative bias = ( (1/n) Σi (observed_i − truth) ) / truth    (D1)
where n is the number of observations (equivalent to the number of simulation runs), the truth is defined as the 2001 Census data, and the observed value is the estimate obtained after applying estimation (or adjustment) to the simulated data.
This measures the bias of the estimates, in other words how close the estimates are to the truth. A
small relative bias means that the estimates are, over repeated simulations, close to the true value
of the population.
RRMSE
The basic formula for the RRMSE is:

RRMSE = sqrt( (1/n) Σi (observed_i − truth)² ) / truth    (D2)
where n, the truth and the observed values are defined as for (D1).
The RRMSE is an indicator of the mean accuracy of the estimates: a low RRMSE means that on average the estimates are more precise. The RRMSE includes both bias and variance, so if the relative bias is high then by definition the RRMSE will also be high. The RRMSE is therefore always at least as large as the absolute relative bias.
RSE
The RSE generally takes the form:

RSE = sqrt( (1/n) Σi (observed_i − mean(observed))² ) / truth    (D3)
It summarises the mean variability of the estimates around their own average. Unlike the RRMSE, the RSE does not take the bias of the estimates into account, so comparing the two indicates how much of the error is due to variability rather than bias.
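The three measures can be computed together; with the definitions above, RRMSE² = (relative bias)² + RSE² holds exactly. A sketch using our own function name, returning the measures as fractions of the truth:

```python
import math

def simulation_measures(estimates, truth):
    """Relative bias, RRMSE and RSE over n simulation runs."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    # Bias of the mean estimate relative to the truth (D1)
    rel_bias = (mean_est - truth) / truth
    # Root mean square error relative to the truth (D2): bias plus variance
    rrmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n) / truth
    # Variability around the estimates' own mean (D3): ignores the bias
    rse = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / n) / truth
    return rel_bias, rrmse, rse
```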
Examples for all 3 measures
In Brown (2000) all three criteria were used to evaluate methods for obtaining population estimates, including a robust ratio estimator and the simple ratio estimator, with DSEs obtained at the cluster, hard-to-count and postcode level. After the simulation process the simple ratio estimation method produced estimates with small relative bias and good
overall accuracy, whereas the robust method gave estimates with poorer relative biases but slightly
improved RRMSE and RSE.
Table D1 below shows the three criteria for the two methods.

Table D1: Simulation criteria (%) for different methods

| Method | Overall relative bias (%) | Overall RRMSE (%) | Overall RSE (%) |
| 2001 robust DSE | -0.31 | 0.51 | 0.45 |
| Simple ratio estimation at cluster level | -0.05 | 0.50 | 0.49 |
As can be seen, the relative bias for the simple ratio estimation at cluster level (-0.05 %) is much smaller than that of the 2001 robust DSE (-0.31 %). The closer the bias is to zero, the closer the estimates are to the true value; it does not matter whether it is positive or negative, only how far it is from zero. A positive bias means the estimates overestimate the population and a negative bias means they underestimate it.
The RSE of the 2001 robust DSE method (0.45 %) is slightly better than that of the simple ratio estimation method (0.49 %). The smaller the RRMSE and the RSE, the better the quality of the estimates in terms of accuracy and variability. It can sometimes be better to choose a slightly biased estimator if its precision is high; in other words, it is acceptable to pay a small price in bias for a large gain in variance. However, in this case the gain from the variance reduction is not significant, so the simple ratio estimation at cluster level is preferable.
Appendix E: Additional simulation results
Figure E1 – Percentage difference between simulated data and ‘the truth’, by age and sex and EA, with and without over-coverage

[Charts: % difference from truth (y-axis, -50 to 50) by age-sex group (x-axis, MF0 to MF80-120), with series ‘without overcount’, ‘with overcount – base SPD’ and ‘with overcount – very low overcount SPD’. Panels: a) Inner London; b) Large City; c) Rural; d) Rural with large town.]
Table E1 – Relative bias, RSE and RRMSE for base, DSE/ratio and DSE/ratio/over-coverage estimation methods by EA, base scenario and base sample design, to 2 significant figures

Relative bias (%)

| Method | Inner London | Large city | Rural | Rural with large town |
| Horvitz-Thompson | -0.68 | 0.11 | -2.32 | -0.14 |
| benchmark | 0.0062 | 0.018 | -0.00050 | 0.0038 |
| Ratio | -24 | -12 | -10 | -9.2 |
| Weighting class/ratio | -2.2 | -1.9 | -2.2 | -1.8 |
| DSE/ratio | 35 | 17 | 10 | 12 |
| DSE/ratio/over-coverage | 0.77 | 0.59 | 0.38 | 0.56 |

RSE (%)

| Method | Inner London | Large city | Rural | Rural with large town |
| Horvitz-Thompson | 5.3 | 4.9 | 8.2 | 7.7 |
| benchmark | 0.24 | 0.21 | 0.25 | 0.24 |
| Ratio | 1.4 | 0.92 | 1.0 | 0.95 |
| Weighting class/ratio | 0.85 | 0.74 | 0.74 | 0.55 |
| DSE/ratio | 1.4 | 0.64 | 0.74 | 0.74 |
| DSE/ratio/over-coverage | 0.43 | 0.27 | 0.42 | 0.36 |

RRMSE (%)

| Method | Inner London | Large city | Rural | Rural with large town |
| Horvitz-Thompson | 5.3 | 4.9 | 8.5 | 7.7 |
| benchmark | 0.24 | 0.21 | 0.25 | 0.24 |
| Ratio | 24 | 12 | 11 | 9.2 |
| Weighting class/ratio | 2.4 | 2.0 | 2.3 | 1.9 |
| DSE/ratio | 35 | 17 | 10 | 12 |
| DSE/ratio/over-coverage | 0.88 | 0.65 | 0.56 | 0.67 |
Table E2 – Relative bias, RSE and RRMSE by benchmark, DSE/ratio and DSE/ratio/OC estimation methods and base and low overcoverage SPD scenarios, by EA, to 2 significant figures

Inner London

| | Base scenario | Base with lots of under-coverage and low over-coverage |
| SPD over-coverage (%) | 23 | 5 |
| SPD under-coverage (%) | 26 | 26 |
| Relative bias (%), benchmark | 0.0062 | 0.12 |
| Relative bias (%), DSE/ratio/over-coverage | 0.77 | 0.28 |
| RSE (%), benchmark | 0.24 | 0.63 |
| RSE (%), DSE/ratio/over-coverage | 0.43 | 0.77 |
| RRMSE (%), benchmark | 0.24 | 0.64 |
| RRMSE (%), DSE/ratio/over-coverage | 0.88 | 0.82 |

Rural

| | Base scenario | Base with lots of under-coverage and low over-coverage |
| SPD over-coverage (%) | 8 | 2 |
| SPD under-coverage (%) | 11 | 11 |
| Relative bias (%), benchmark | -0.00050 | 0.024 |
| Relative bias (%), DSE/ratio/over-coverage | 0.38 | 0.10 |
| RSE (%), benchmark | 0.25 | 0.47 |
| RSE (%), DSE/ratio/over-coverage | 0.42 | 0.54 |
| RRMSE (%), benchmark | 0.25 | 0.47 |
| RRMSE (%), DSE/ratio/over-coverage | 0.56 | 0.55 |

Rural with large town

| | Base scenario | Base with lots of under-coverage and low over-coverage |
| SPD over-coverage (%) | 9 | 2 |
| SPD under-coverage (%) | 12 | 12 |
| Relative bias (%), benchmark | 0.0038 | 0.015 |
| Relative bias (%), DSE/ratio/over-coverage | 0.56 | 0.13 |
| RSE (%), benchmark | 0.24 | 0.48 |
| RSE (%), DSE/ratio/over-coverage | 0.36 | 0.52 |
| RRMSE (%), benchmark | 0.24 | 0.48 |
| RRMSE (%), DSE/ratio/over-coverage | 0.67 | 0.54 |
Figure E2 – Relative bias by age and sex groups, DSE/ratio/OC method versus benchmark, by SPD scenarios

[Charts: relative bias (y-axis, approximately -1.0 % to 3.5 %) by age-sex group (MF0 to MF80-120), with series ‘Scenario A benchmark’, ‘Scenario A DSE/Ratio/OC’, ‘Scenario F benchmark’ and ‘Scenario F DSE/Ratio/OC’. Panels: a) Inner London; b) Rural; c) Rural with large town.]
Figure E3 – RSE by age and sex groups, DSE/ratio/OC method versus benchmark, by SPD scenarios

[Charts: RSE (y-axis, 0.0 % to 9.0 %) by age-sex group (MF0 to MF80-120), with series ‘Scenario A benchmark’, ‘Scenario A DSE/Ratio/OC’, ‘Scenario F benchmark’ and ‘Scenario F DSE/Ratio/OC’. Panels: a) Inner London; b) Rural; c) Rural with large town.]
Figure E4 – RRMSE by age and sex groups, DSE/ratio/OC method versus benchmark, by SPD scenarios

[Charts: RRMSE (y-axis, 0.0 % to 9.0 %) by age-sex group (MF0 to MF80-120), with series ‘Scenario A benchmark’, ‘Scenario A DSE/Ratio/OC’, ‘Scenario F benchmark’ and ‘Scenario F DSE/Ratio/OC’. Panels: a) Inner London; b) Rural; c) Rural with large town.]
Table E3: Relative 95% CI width, and change in width for base simulation scenario by EA, LA, simulation scenario and PCS sample size

| EA | LA | Relative 95% CI width (%): 250,000 sample | 350,000 sample | 450,000 sample | Change in CI width from 350,000 sample (%): to 250,000 | to 450,000 |
| Inner London | Lambeth | 0.60 | 0.49 | 0.43 | 21 | -12 |
| Inner London | Southwark | 0.56 | 0.47 | 0.42 | 20 | -11 |
| Large city | Leeds | 0.46 | 0.41 | 0.33 | 12 | -19 |
| Rural | Breckland | - | 0.49 | 0.40 | - | -18 |
| Rural | Babergh | - | 0.48 | 0.42 | - | -13 |
| Rural | Forest Heath | - | 0.57 | 0.48 | - | -16 |
| Rural | Mid Suffolk | - | 0.48 | 0.41 | - | -16 |
| Rural | St Edmundsbury | - | 0.52 | 0.44 | - | -16 |
| Rural with a large town | Conwy | 0.62 | 0.46 | 0.40 | 36 | -14 |
| Rural with a large town | Denbighshire | 0.63 | 0.48 | 0.41 | 32 | -14 |
| Rural with a large town | Flintshire | 0.66 | 0.50 | 0.42 | 33 | -16 |
| Rural with a large town | Wrexham | 0.68 | 0.51 | 0.44 | 33 | -15 |
Figure E5 – RSE by age and sex groups, truth/ratio/- method for different PCS sample sizes for base scenario

[Charts: RSE (y-axis, 0.0 % to approximately 5.0 %) by age-sex group (MF0 to MF80-120), with series for 250,000, 350,000 and 450,000 PCS samples (the Rural panel shows the 350,000 and 450,000 samples only). Panels: a) Inner London; b) Large city; c) Rural; d) Rural with large town.]
Figure E6 – RSE by age and sex groups, truth/ratio/- method for different PCS sample sizes for base scenario with low over-coverage and lots of under-coverage, Large city EA

[Charts: RSE (y-axis, 0 % to approximately 9 %) by age-sex group (MF0 to MF80-120), with series for 250,000, 350,000 and 450,000 samples under scenario F (the Rural panel shows the 350,000 and 450,000 samples only). Panels: a) Inner London; b) Rural; c) Rural with large town.]
Figure E7 – RSE by age and sex groups, DSE/ratio/over-coverage method for base scenario and poor PCS coverage scenario

[Charts: RSE (y-axis, 0 % to 10 %) by age-sex group (MF0 to MF80-120), with series ‘Base’ and ‘Poor CCS’. Panels: a) Inner London; b) Rural; c) Rural with large town.]
Glossary
| Abbreviation | Meaning |
| CCS | Census Coverage Survey |
| CE | communal establishment |
| CI | confidence interval |
| DSE | Dual System Estimator |
| EA | estimation area |
| HtC | hard-to-count |
| H-T | Horvitz-Thompson |
| LA | local authority |
| LSOA | lower level super output area |
| MYE | mid-year population estimate |
| ONS | Office for National Statistics |
| OA | Output Area |
| PR | Patient Register |
| PSU | primary sampling unit |
| PCS | Population Coverage Survey |
| QA | quality assurance |
| RRMSE | relative root mean square error |
| RSE | relative standard error |
| SPD | Statistical Population Dataset |
References
Abbott, O. (1998) Census Coverage Survey Design – One Number Census Steering Committee paper. Available at: http://www.ons.gov.uk/ons/guide-method/census/census-2001/design-and-conduct/the-one-number-census/methodology/steering-committee/key-papers/census-coverage-survey-design.pdf

Abbott, O. (2009) Precision of census estimates for different levels and patterns of census response. Unpublished paper, available on request.

Audit Commission (2012) Audit Commission 2009/10 NDRI report, February 2012. Available at: http://www.audit-commission.gov.uk/sitecollectiondocuments/downloads/ndrireport2012.pdf

Audit Commission (2006) Audit Commission 2003/04 NDRI report, August 2006. Available at: http://archive.audit-commission.gov.uk/auditcommission/sitecollectiondocuments/AuditCommissionReports/NationalStudies/ndri2004report.pdf

Beyond 2011: Exploring the Challenges of Using Administrative Data (M2), ONS (2012). Available at: http://www.ons.gov.uk/ons/about-ons/what-we-do/programmes---projects/beyond-2011/news/reports-and-publications/methods-and-policies/index.html

Beyond 2011 Options Report (O1), ONS (2013). Available at: http://www.ons.gov.uk/ons/about-ons/what-we-do/programmes---projects/beyond-2011/what-are-the-options-/index.html

Beyond 2011 Options Report (O2), ONS (2013). Published alongside this report. Available at: http://www.ons.gov.uk/ons/about-ons/what-we-do/programmes---projects/beyond-2011/news/reports-and-publications/index.html

Beyond 2011: Administrative Data Models – Research Using Aggregate Data (R3), ONS (2013). Available at: http://www.ons.gov.uk/ons/about-ons/what-we-do/programmes---projects/beyond-2011/news/reports-and-publications/research/index.html

Beyond 2011: Producing Population Estimates Using Administrative Data: In Practice (M7), ONS (2013). Published alongside this report. Available at: http://www.ons.gov.uk/ons/about-ons/what-we-do/programmes---projects/beyond-2011/news/reports-and-publications/index.html

Beyond 2011: Producing Population Estimates Using Administrative Data (M6), ONS (2013). Published alongside this report. Available at: http://www.ons.gov.uk/ons/about-ons/what-we-do/programmes---projects/beyond-2011/news/reports-and-publications/index.html
Brown, J. J. (2000) Design of a census coverage survey and its use in the estimation and
adjustment of census underenumeration. University of Southampton, unpublished PhD thesis.
Cochran, W. G. (1977) Sampling techniques, 3rd edition. New York: Wiley & Sons.
Large, A., Brown, J., Abbott, O. and Taylor, A. (2011) Estimating and Correcting for Over-count in the 2011 Census. ONS Survey Methodology Bulletin 69. Available at: http://www.ons.gov.uk/ons/guide-method/method-quality/survey-methodology-bulletin/index.html
Lohr, S. L. (1999) Sampling: Design and Analysis. Pacific Grove: Duxbury Press.
Rahman, N. and Goldring, S. (2006) Factors associated with household non-response in the 2001 Census. Survey Methodology Bulletin, 59, pp. 11-24. Available at: http://www.ons.gov.uk/ons/guide-method/method-quality/survey-methodology-bulletin/smb-59/survey-methodology-bulletin-59---sept-2006.pdf

Tourangeau, R., Rips, L. J. and Rasinski, K. (2000) The Psychology of Survey Response. Cambridge, UK: Cambridge University Press.
United Nations (2008). Principles and Recommendations for Population and Housing Censuses.
Statistical Papers, series M, No.67/Rev.2. New York. Available at
http://unstats.un.org/unsd/publication/SeriesM/Seriesm_67rev2e.pdf