Modelling Census Under-Enumeration: A logistical regression

Modelling Census Under-Enumeration
A logistic regression perspective
Bernard Baffour
The views expressed in this paper are those of the author and do not necessarily represent those
of the General Register Office for Scotland.
Abstract
Inevitably, not everyone in Scotland was enumerated in the 2001 Census. In Scotland, as
in the rest of the UK, the raw Census figures were adjusted to compensate for the
differential levels of under-enumeration that existed between different age-sex groups
and areas. This was achieved by the One Number Census project which produced an
individual level Census database, with synthetic individuals used to fully adjust for the
differential levels of under-enumeration. Census under-enumeration stems from a wide
variety of factors – for example demographic, socio-economic and household
composition.
An accurate identification and measurement of under-enumerated people allows the
implementation of initiatives to maximise coverage. For example new designs of the
census form can be trialled in households most susceptible to under-enumeration with
the aim of making it easier to complete.
Therefore this paper seeks to model under-enumeration, by ascertaining the underlying
characteristics of under-enumerated areas, determined using the non-response rates.
These characteristics are found via logistic regression modelling. Logistic regression
analysis controls for all variables simultaneously during modelling and can therefore
reveal which factors are significant predictors of the outcome variable in question.
The results of the series of logistic regression models show that there are a variety of
factors that affect under-enumeration. However, from a strategic viewpoint, an area will
be susceptible to under-enumeration if it has a disproportionate number of people who
are transient and/or deprived. The deprivation characteristics are income and health
deprivation, non-car ownership, overcrowding and multi-occupancy. The transient
characteristics are single (never married), young people (aged 25-29), living in privately
rented accommodation and working in intermediate occupations. Some people – some
single mothers, for example – could be construed as being both transient and deprived.
1
1. Introduction
A Census provides a ‘snapshot’ of the entire national population and has the distinct
advantage of providing an opportunity to gather a wealth of information about
demographic, economic, social and housing characteristics of the population.
Furthermore, unlike surveys, Censuses allow the analysis of these population
characteristics right down to very small geographical areas. Population figures are also
used by international bodies like the UN and EU as a basis for formulating international
policy and world population estimates. Therefore, in most countries the requirement to
hold a Census, or a least have a methodology in place to obtain up-to-date population
figures, is enshrined in the constitutional legislation.
In the UK, a decennial Census aims to provide an accurate up-to-date count of all the
residents. The last Census, held in 2001, estimated that a response rate of 96.1% was
achieved in Scotland; which compares favourably with other countries (Boyle and Dorling
2004). However what was particularly noticeable from the Census was the differential
response rates that existed between different age-sex groups and areas (see Figure
1.1.1). This is known as “biased under-enumeration”, and the 2001 Census was the first
Census that sought to measure this non-uniformity in response.
The 2001 Census sought to accurately account for the differential levels of underenumeration that existed between different age-sex groups and areas. This was achieved
by the One Number Census project which produced an individual level Census database,
with synthetic individuals used to fully adjust for the differential levels of underenumeration.
An important aspect of the project was an independent follow-up survey (the Census
Coverage Survey, CCS) carried out after the official Census had taken place. The basic
statistical model - called dual system estimation - assumes that a first enumeration, the
Census, is followed by a second enumeration (the Census Coverage Survey) which
subsequently takes place a few weeks later. Then matching techniques can be used to
match records from the Census Coverage Survey to those from the Census.
From the results of the matching (and dual system estimation), an estimate of the people
missed by both the Census Coverage Survey and the Census can be made. A key
assumption here is that the first and second processes are statistically independent. That
is, the probability of enumeration in the second process given enumeration in the first is
identical to the probability of enumeration in the second process given the person was
missed in the first process. This identity assumption provides a basis for estimating the
number that were not enumerated in either the first or second processes – referred to as
the undercount or under-enumeration. Consequently, this population undercount is
combined with the Census population to give an estimate of the true population.
Therefore, the One Number Census methodology can be thought of as comprising three
processes that work together. Firstly, the Census provides a population count. Secondly,
the Census Coverage Survey provides an estimate of those missed by the Census.
Finally dual system estimation provides an estimate of the people missed by both the
Census and Census Coverage Survey. This produces a new estimate of the population
that has been adjusted to account for the missing, by imputation.
2
Figure 1.1.1: 2001 Scottish Census response rates by age-sex group
From Figure 1.1.1, it can be seen that females had better response rates overall than
males – with males aged 20-24 having the lowest coverage. Furthermore, it was found
that, accounting for geographical region, this demographic spread of non-response is
much wider, and there was an identification of a broader cross-section of the missed
population. As an example, some parts of West Dunbartonshire with a high percentage of
social housing stock had a large proportion of young males identified as missing.
Census under-enumeration is a function of a wide range of factors including
i.
demographic
ii.
social and economic
iii.
housing structure and composition
iv.
public attitudes and perceptions
v.
Census enumeration strategy
vi.
effectiveness of the Census under-enumeration measuring instrument
Assuming that (v) and (vi) are correctly accounted for by the Census methodology, the
other factors can be investigated by looking at the associations and relationships that
exist between these factors and under-enumeration. Under-enumeration is likely to be
higher in certain areas with particular social, economic and demographic attributes. This
paper aims to use logistic regression to model the Census under-enumeration by
identifying the attributes that are significant predictors of under-count.
The object of the analyses in this paper is two-fold:
1. To assess the association between under-enumeration and key deprivation and
Census variables,
2. To develop one or more models that could be used to predict areas of underenumeration based on the area attributes and their relationships with underenumeration.
It is important to note that the analysis does not establish any causal relationships with
under-enumeration. (Causality can only be determined by carefully controlled
experimental study.) Nor does it look at the effect of public perceptions and attitudes to
the Census.
3
The analysis was carried out at a ward level because this level of geography is low
enough to avoid any issues of heterogeneity in the population size of council areas, but
large enough so that the proportion imputed was greater than zero. Scotland is divided
into 1,222 wards. The ward boundaries are of similar size across Scotland with between
650 and 9,200 people in each ward. Figure 1.1.2 shows the distribution of residents in as
a total and broken down by sex. Although the spread is large the distribution is roughly
normal – most wards have a population size of between 3,000 and 5,000 people.
Figure 1.1.2.: Distribution of resident population in wards
300
250
N um be r of w a rds
150
100
50
250
200
Females
100
50
0
00
00
45
00
40
35
00
00
30
00
25
00
20
15
0
50
00
0
00
00
00
90
80
00
70
00
60
00
50
00
00
40
30
20
00
0
10
0
Males
150
10
N u m b e r o f w a rd s
200
Number of residents
Number of residents
1.1 The Scottish Index of Multiple Deprivation
Deprivation data was obtained from the Scottish Index of Multiple Deprivation (SIMD)
developed by the Social Disadvantage Research Centre, at Oxford University, and
commissioned by the Scottish Executive (for the full report see Noble et al, 2003). The
SIMD measures the different aspects of deprivation with the view of combining them into
a local index. It presents individual or household levels of deprivation in the form of an
aggregate score for each ward in Scotland. The fundamental assumption is that
individuals and families experiencing deprivation would cluster together in the same
geographical location.
The Scottish Index of Multiple Deprivation seeks to answer two questions:
a. How far do deprived individuals geographically cluster together?
b. How are other non-deprived individuals affected by the deprivation?
The advantage of the SIMD is that it recognises the different dimensions of deprivation
and provides a measure of the concentration of deprivation (independent of population
size).
1.1.1 Social exclusion and deprivation
Deprivation is defined as the lack of money or material possessions (Townsend, 1979,
p.31). Social exclusion on the other hand is difficult to define: the Prime Minister
described social exclusion as “a shorthand label for what happens when individuals or
areas suffer from a combination of linked problems such as unemployment, poor skills,
low incomes, poor housing, high crime environments, bad health and family breakdown”
(Scottish Office, 1999).
4
What is apparent is that social exclusion and deprivation are inextricably linked and
manifest themselves in a variety of ways. However, the vagueness of their definitions
means that statisticians and social scientists concentrate on material deprivation. It has
been relatively easy to measure material deprivation in relation to diet, housing,
environment, household facilities and employment. Because of the fact that deprivation
and social exclusion do not seem to separately exist - i.e. a person who is deprived can
be said to be (socially) excluded and vice versa – the two are used interchangeably in
literature. Hence, intuitively one would expect social exclusion (henceforth, deprivation) to
be connected with under-enumeration.
The SIMD is therefore based on a concept that articulates multiple deprivation as an
accumulation of single deprivations. Multiple deprivation is not a separate form of
deprivation but a combination of more specific forms of deprivation which, in essence, are
deemed more measurable.
1.1.2 The Domains of the Scottish Index of Multiple Deprivation
The SIMD uses a process to produce a score for each of five “domains”, which are
subsequently ranked across Scotland to give a relative picture of each dimension of
deprivation. The SIMD domains are Income Deprivation, Employment Deprivation, Health
Deprivation and Disability, Education, Skills and Training Deprivation and Geographical
Access to Services.
The domains consist of indicators that are determined to be directly relevant to the
purpose of each domain. Care was taken to prevent duplication by exploring intercorrelations of variables within (and outwith) the domains, eliminating those which largely
reproduce the more effective measures of the relevant deprivation. Therefore, the
indicators combine to give a direct measure of the particular aspect of deprivation defined
by the domain.
There are other domains that could have been measured. The previous Scottish Area
Deprivation Index, commissioned by the Scottish Office in 1998, included Housing
Deprivation and Crime and Social Order domains. There is great need to measure
housing deprivation (especially in the present economic climate of rising house prices) in
order to help inform policy and target particular groups of people. However the varying
approaches in the measurement of this domain – accessibility to housing, poor housing
quality, overcrowding etc. – makes it difficult to quantify directly. Secondly there is
currently no up-to-date data to address these issues at ward level.
Crime and Social Order are important elements in measuring deprivation at the small
area level, and can be used for local initiatives. Unfortunately small area data on crime or
social order for the whole of Scotland is currently not available (although it is expected to
be available in future as a result of the new Crime and Victimisation Survey). Secondly
there are some aspects of crime (e.g. city centre vandalism) that may be difficult to
attribute to a specific geographical area. In most cases these crimes can legitimately be
attributed to areas of those who live elsewhere.
It is important that attention is drawn to the domain of Access deprivation. It is a well
known fact that poor geographical access to services is deemed to be an important
aspect of multiple deprivation (Noble et al, 2000). However, this domain did not consider
non-geographical issues - in particular language and cultural barriers. Secondly, car
ownership (or more accurately, the lack of) was also not considered. A recent ONS report
found that the proportion of people in households without a car who claimed to
experience some difficulty in accessing services was nearly twice as great as those with
5
a car (Ruston, 2002). Nevertheless, car ownership was omitted from the index because
of its ambiguity – is the non-ownership of a vehicle (especially in cities) indicative of
deprivation or a reflection of personal choice, a more environmentally friendly approach
or good public transport?
1.1.3 Standardisation and weighting of the SIMD
Combining the individual domain deprivation indices into a single multiple deprivation
index needed a standardisation procedure. The procedure had to ensure that the most
deprived wards were easily identified. Also it was required that each domain had a
common distribution and did not conflate size with level of deprivation. The procedure
that best meets the criteria is the exponential transformation.
Weighting is particularly important when creating an index of multiple deprivation as a
sum of individual indices of deprivation. The obvious way is to sum each of the individual
indices. However research shows that some aspects of deprivation are more important
than others (Gordon et al, 2000 and Noble et al, 2000).
Thus the overall Scottish multiple index of deprivation is weighted as:
SIMD = (0.3 x Income) + (0.3 x Employment) +
(0.15 x Education) + (0.15 x Health) + (0.1 x Access)
(†)
1.2 The Scottish 2001 Census
The analysis of the factors that affect under-enumeration is given in Section 2 onwards.
The remaining part of this section gives a description of the Census variables used
presenting results from the 2001 Census.
As previously stated, under-enumeration is a resultant of a wide-ranging variety of
factors. The SIMD data provided some deprivation, social and economic attributes which
could be analysed to ascertain how associated they were with under-enumeration. The
rest were obtained from the Census database, available over the internet through the
Scottish Census Results OnLine (SCROL) website (www.scrol.gov.uk).
The Census results provided the demographic and household variables used in analyses.
It is imperative to realise that the variables chosen from the Census are by no means
exhaustive. It is very feasible that some variables not considered will be found to be
predictive of whether or not an area is under-enumerated.
6
A. Age Structure
Figure 1.2.1
Bar chart of the age distribution
100%
80%
75 & over
60 to 74
45 to 59
30 to 44
25 to 29
20 to 24
15 to 19
10 to 14
5 to 9
0 to 4
60%
40%
20%
0%
West Lothian
West Dunbartonshire
Stirling
South Lanarkshire
South Ayrshire
Shetland Islands
Scottish Borders
Renfrewshire
Perth & Kinross
Orkney Islands
North Lanarkshire
North Ayrshire
Moray
Midlothian
Inverclyde
Highland
Glasgow City
Fife
Falkirk
Eilean Siar
Edinburgh, City of
East Renfrewshire
East Lothian
East Dunbartonshire
East Ayrshire
Dundee City
Dumfries & Galloway
Clackmannanshire
Argyll & Bute
Angus
Aberdeenshire
Aberdeen City
Scotland
Population: All people
Source: Census Table KS02
The distribution of the age structure in each local authority area shows that in general
there is an ageing population. The major cities – Glasgow, Edinburgh, Dundee and
Aberdeen – have a high percentage of 20-24 year olds; this is perhaps because of people
in university or on graduate schemes. Eilean Siar, Angus and Highland - areas with few
or no higher educational institutions - have a lower proportion of 20-24 year olds. In
Scotland as a whole there are migration peaks for males and females in their late teens
and early twenties. Furthermore, the 2003 Registrar General’s Annual Report showed a
marked net migration gain at ages 19 and 20, and net migration loss at ages 23 and 24.
These patterns are consistent with an influx of students from the rest of the UK (and
overseas) starting higher education, followed by a further move after graduation.
7
0%
West Lothian
20%
West Dunbartonshire
Stirling
South Lanarkshire
South Ayrshire
40%
Shetland Islands
Widowed
Divorced
Separated
Not living together - Married
Single
Cohabiting
Living together - Married
60%
Scottish Borders
Renfrewshire
Perth & Kinross
Orkney Islands
North Lanarkshire
North Ayrshire
Moray
Source: Census Table KS17
Population: All households
Midlothian
Inverclyde
West Lothian
West Dunbartonshire
Stirling
South Lanarkshire
South Ayrshire
Shetland Islands
Scottish Borders
Renfrewshire
Perth & Kinross
Orkney Islands
North Lanarkshire
8
North Ayrshire
Moray
Midlothian
Inverclyde
Highland
Glasgow City
Fife
Falkirk
Eilean Siar
Edinburgh, City of
East Renfrewshire
East Lothian
East Dunbartonshire
East Ayrshire
Dundee City
Dumfries & Galloway
Clackmannanshire
Argyll & Bute
Angus
Aberdeenshire
Aberdeen City
Scotland
Source: Census Table KS03
Population: All adults aged 16 and above
Highland
Glasgow City
Fife
Falkirk
Eilean Siar
Edinburgh, City of
East Renfrewshire
East Lothian
East Dunbartonshire
East Ayrshire
Dundee City
Dumfries & Galloway
Clackmannanshire
Argyll & Bute
Angus
Aberdeenshire
Aberdeen City
Scotland
Marital status and living arrangements
C.
50
40
30
20
10
Percentage of households with no cars
B. Car Ownership
Figure 1.2.2
Car Ownership
60
0
Glasgow and Dundee have a high proportion of households with no cars. This could be
due to the cost of parking in inner city areas, a much more environmentally friendly
approach or good public transport - although, as remarked earlier, non ownership of a car
is considered to be an indicator for access deprivation.
Figure 1.2.3
Living Arrangements
100%
80%
Figure 1.2.4
Marital Status
100%
90%
80%
70%
Widowed
Divorced
Separated
Re-married
Married
Single
60%
50%
40%
30%
20%
10%
0%
West Lothian
West Dunbartonshire
Stirling
South Lanarkshire
South Ayrshire
Shetland Islands
Scottish Borders
Renfrewshire
Perth & Kinross
Orkney Islands
North Lanarkshire
North Ayrshire
Moray
Midlothian
Inverclyde
Highland
Glasgow City
Fife
Falkirk
Eilean Siar
Edinburgh, City of
East Renfrewshire
East Lothian
East Dunbartonshire
East Ayrshire
Dundee City
Dumfries & Galloway
Clackmannanshire
Argyll & Bute
Angus
Aberdeenshire
Aberdeen City
Scotland
Population: All adults aged 16 and above
Source: Census Table KS03
In general Scottish household structures have been changing. These changes are in line
with the trend observed in most Western countries. One of the most significant changes
in Scottish household composition over the last 20 years has been the rapid expansion of
one-person households. In Scotland, the rate of one person households has increased
from 21.9% in 1981 to 32.9% in 2001 (Miller, 2003). Furthermore, there has been an
increase in the number of couples in cohabiting unions. This can be shown by comparing
the proportion of single people in the two figures above - Figure 1.2.3 looks at the
breakdown of the adult population by living arrangement, while Figure 1.2.4 shows the
respective breakdown by marital status. Figure 1.2.4 shows how the inclusion of
cohabitating couples reduces the number of single people at each ward. By using the
marital status definition of ‘single’ we fail to account for co-habitation. This was one of the
strengths of the One Number Census, in that part of its remit was to seek to accurately
identify the more complex (less traditional, per se) household types.
D.
Single parenthood
The number of single parent households in a ward has previously been used as an
indicator of economic deprivation. In Scotland, 91.6% of single parent households (with
dependent children) have a female head of household. This is similar to the UK standard.
However if this is broken down by age, there are lot more young female single parents:
94.6% of lone parents aged 24 or under with a child under four years old are female. This
could be attributed to the fact that Scotland has one of the highest rates of teenage
motherhood in Europe – approximately four times the Western European average despite
a decline in fertility rates across Europe (Miller, 2003).
However the changing household composition in Scotland, and the UK as a whole,
shows that this group can be used to identify the more complex households, mentioned
earlier. As an example, because a single parent is eligible for higher social security
benefits than a cohabiting couple (with dependents), the number of single parent
households might have been over-stated by the Census. Such households differ from the
conventional definition of cohabitation, although they have similar household
9
compositions. It is best illustrated by an example. Suppose a single mother is in receipt of
housing benefits. During the Census she has a boyfriend who lives with her, but might
use his parents’ residence as a mailing address. When his parents were filling in the
Census form, he was omitted from the usual household residents, since they consider
him to be living with his girlfriend. At the same time he is omitted from the form at his
girlfriend’s address. Evidence from previous censuses and the pilot studies prior to the
actual Census taking place in 2001 showed that a much more rigorous approach to
under-enumeration was needed. Therefore, the Census Coverage Survey was carried
out on a much larger scale, with enumerators adept at clarifying any ambiguities during
the interview process * . They were able to identify these young males and the results were
fed into the imputation process. Placing these males into the appropriate households
alters them, and in most cases changed single parent households to a new household
structure with a female and a live-in boyfriend.
E. Household Tenure
Figure 1.2.5
Tenure
100%
percentage
80%
Other
Private landlord/letting agency
Housing Association
Council (Local Authority) Homes
Shared ownership
Owns with a mortgage or loan
Owns outright
60%
40%
20%
0%
West Lothian
West Dunbartonshire
Stirling
South Lanarkshire
South Ayrshire
Shetland Islands
Scottish Borders
Renfrewshire
Perth & Kinross
Orkney Islands
North Lanarkshire
North Ayrshire
Moray
Midlothian
Inverclyde
Highland
Glasgow City
Fife
Falkirk
Eilean Siar
Edinburgh, City of
East Renfrewshire
East Lothian
East Dunbartonshire
East Ayrshire
Dundee City
Dumfries & Galloway
Clackmannanshire
Argyll & Bute
Angus
Aberdeenshire
Aberdeen City
Scotland
Population: All households
Source: Census Table KS18
Public sector (or social) housing has also been previously used as a measure of
deprivation. However the SIMD did not have a housing deprivation domain, solely for the
reason that there is a lot of ambiguity surrounding the accurate measurement of such a
domain. Nonetheless, residents in social tenures are much more likely to suffer from one
or more forms of deprivation.
But although there could be a confounding effect † , it is still possible that tenure is an
independent predictor of under-enumeration, in its own right.
*
An intriguing discovery of the CCS process was that it identified a large number of babies that were missed
during the Census.
†
A confounding effect makes it difficult to interpret the direction of an association. It results when a
predictor variable is associated with both the outcome and other predictors.
10
Figure 1.2.6
Tenants in each Local Authority
35
30
percentage
25
20
Council (Local Authority) Homes
Housing Association
Private landlord/letting agency
Other
15
10
5
0
West Lothian
West Dunbartonshire
Stirling
South Lanarkshire
South Ayrshire
Shetland Islands
Scottish Borders
Renfrewshire
Perth & Kinross
Orkney Islands
North Lanarkshire
North Ayrshire
Moray
Midlothian
Inverclyde
Highland
Glasgow City
Fife
Falkirk
Eilean Siar
Edinburgh, City of
East Renfrewshire
East Lothian
East Dunbartonshire
East Ayrshire
Dundee City
Dumfries & Galloway
Clackmannanshire
Argyll & Bute
Angus
Aberdeenshire
Aberdeen City
Scotland
Population: All people who are tenants - i.e. resident in Council homes or other rented accommodation.
Source: Census Table KS18
The household tenure variable is also important in determining the mobility of residents in
wards. People who rent from private landlords or letting agencies are much more likely to
move house. This section of the population would include not only students but also
young professional adults, particularly at a time when the housing market, especially in
cities, is highly competitive, making it difficult for first-time buyers to gain a foothold on the
property ladder. So a lot of young professionals, for example nurses, are resident in
council housing (and other social tenures) because of their affordability. However since
residents in council accommodation are much more likely to be deprived, the public
sector housing variable can be used to model under-enumeration.
F.
Population density
The population density (which measures the number of residents per hectare) is a further
important variable. As the enumeration districts are grouped based on contiguous
postcodes, it is important to know how many residents there are, in order to evenly
allocate the workloads of each Census enumerator.
During the preliminary analysis, carried out prior to embarking on modelling the underenumeration, it was found that a proxy measure of population density was the lowest floor
level of the household or dwelling. Figures 1.2.7 and 1.2.8 show how areas that are
densely populated have a larger proportion of tenements.
The floor level was also found to be easier to interpret during the analysis since it is a
categorical variable – with three distinct categories: first floor, second floor and third floor
or higher. On the other hand, the population density is a continuous variable and cannot
be easily split into distinct non-overlapping groups. The local authority with the lowest
density is Highland with a population density of 0.08, while Glasgow City has the highest
with a population density of 32.93. This difference between the least and most densely
populated areas is much more pronounced when comparing at a ward level – parts of
11
West Lothian
Source: Census Table UV02
Population: All people (area measured in hectares)
West Dunbartonshire
Stirling
South Lanarkshire
South Ayrshire
Shetland Islands
Scottish Borders
West Lothian
West Dunbartonshire
Stirling
South Lanarkshire
South Ayrshire
Shetland Islands
Scottish Borders
Renfrewshire
Perth & Kinross
Orkney Islands
North Lanarkshire
North Ayrshire
Moray
12
Renfrewshire
Perth & Kinross
Midlothian
Inverclyde
Highland
Glasgow City
Fife
Falkirk
Eilean Siar
Edinburgh, City of
East Renfrewshire
East Lothian
East Dunbartonshire
East Ayrshire
Dundee City
Dumfries & Galloway
Clackmannanshire
Argyll & Bute
Angus
Aberdeenshire
Aberdeen City
Scotland
Source: Census Table UV61
Population: All households
Orkney Islands
North Lanarkshire
North Ayrshire
Moray
Midlothian
Inverclyde
Highland
Glasgow City
Fife
Falkirk
Eilean Siar
Edinburgh, City of
East Renfrewshire
East Lothian
East Dunbartonshire
East Ayrshire
Dundee City
Dumfries & Galloway
Clackmannanshire
Argyll & Bute
Angus
Aberdeenshire
Aberdeen City
Scotland
0.15
First
Second
Third floor or higher
0.1
proportion
inner Glasgow have densities in excess of 100 – and leads to some interpretational
difficulties.
Figure 1.2.7
Population Density
35
30
25
20
15
Density (Number of persons per hectare)
10
5
0
Figure 1.2.8
Lowest floor level
0.25
0.2
0.05
0
Figure 1.2.8(a)
Relationship between floor level and population density (at a ward level)
180
Population Density (number of persons per hectare)
160
140
120
100
80
60
40
20
0
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
Proportion resident on third floor or higher
Population: All people (area measured in hectares)
Source: Census Tables UV02 and UV61
Figure 1.2.8(a) depicts the linear relationship between floor level and population density
much more clearer. This relationship is much more pronounced at a local authority level,
where the correlation between the number of third floor or higher households and the
population density is 0.863.
G. Educational Qualifications
The qualification variable details the highest educational attainment of each individual.
This is then aggregated at both the ward level and local authority level. The higher levels
of educational qualification are Groups 3 and 4 and cover residents who have attained an
HNC or higher. The highest educational attainment is very similar to the Education
domain of the Scottish Index of Multiple Deprivation.
The purpose of the Education domain is to measure how the key educational
characteristics of a local area contribute to the overall level of deprivation and
disadvantage. Previous measures have not been able to differentiate between social and
educational deprivation (Noble et al, 2000 and Carstairs and Morris, 1992). Furthermore,
measurement of education deprivation does not focus only on the adult population but
also on school age children. Noble et al (2000) showed that the pupil performance of an
area has a contributory effect on the educational and skills deprivation.
Since the educational attainment variable in the Census only considers adults aged 16
and over, it is a better proxy of educational deprivation when it comes to ascertaining the
relationship between wards that were education deprived and under-enumerated. This is
intuitive, as adults (in particular between ages 20-29) were more likely to be missed
during the Census, than children.
13
Figure 1.2.9
Highest educational qualification attainment
100%
90%
80%
Percentage
70%
Group 4
Group 3
Group 2
Group 1
Group 0
60%
50%
40%
30%
20%
10%
0%
West Lothian
West Dunbartonshire
Stirling
South Lanarkshire
South Ayrshire
Shetland Islands
Scottish Borders
Renfrewshire
Perth & Kinross
Orkney Islands
North Lanarkshire
North Ayrshire
Moray
Midlothian
Inverclyde
Highland
Glasgow City
Fife
Falkirk
Eilean Siar
Edinburgh, City of
East Renfrewshire
East Lothian
East Dunbartonshire
East Ayrshire
Dundee City
Dumfries & Galloway
Clackmannanshire
Argyll & Bute
Angus
Aberdeenshire
Aberdeen City
Scotland
Population: All people aged 16-74
Source: Census Table UV25
Figure 1.2.9 shows the breakdown of the highest educational attainment by local
authority. Glasgow has the highest proportion of individuals with no qualifications, while
there are more people in Edinburgh with a degree.
H. National Statistics Socio-economic Classification
The National Statistics socio-economic classification [NS-SeC] has been introduced by
the government to replace social class based on occupation and socio-economic groups.
It is an occupational based classification but has rules to provide coverage of the whole
adult population. This classification is internationally accepted since it is based on the
International Labour Organisation (ILO) employment definitions. It is also conceptually
clear and has been validated both in criterion terms and in construction terms as a good
predictor of health and educational outcomes (Office of National Statistics, 2002). The
NS-SeC categories distinguish different positions (not persons) as defined by social
relationships in the work place. People who are temporarily out of work can still be
classified, so ‘unemployed’ in this case refers to people who have never worked or are
long-term unemployed.
14
50%
40%
30%
20%
10%
0%
West Lothian
West Dunbartonshire
Stirling
South Lanarkshire
South Ayrshire
Shetland Islands
Scottish Borders
Source: Census Table KS14a
Population: All Adults aged 16-74
Renfrewshire
Perth & Kinross
Orkney Islands
North Lanarkshire
North Ayrshire
West Lothian
West Dunbartonshire
Stirling
South Lanarkshire
South Ayrshire
Shetland Islands
Scottish Borders
Renfrewshire
Perth & Kinross
Orkney Islands
North Lanarkshire
North Ayrshire
Moray
15
Moray
Midlothian
Inverclyde
Highland
Glasgow City
Fife
Falkirk
Eilean Siar
Edinburgh, City of
East Renfrewshire
East Lothian
East Dunbartonshire
East Ayrshire
Dundee City
Dumfries & Galloway
Clackmannanshire
Argyll & Bute
Angus
Aberdeenshire
Aberdeen City
Scotland
Source: Census Table UV59
Population: All households
Midlothian
Inverclyde
Highland
Glasgow City
Fife
Falkirk
Eilean Siar
Edinburgh, City of
East Renfrewshire
East Lothian
East Dunbartonshire
East Ayrshire
Dundee City
Dumfries & Galloway
Clackmannanshire
Argyll & Bute
Angus
Aberdeenshire
Aberdeen City
Scotland
rating -2 or less
rating -1
rating 0
rating +1
rating +2 or more
60%
percentage
Unclassifiable
Students
Unemployed
Routine
Intermediate
Professional
60%
40%
percentage
Figure 1.2.10
Occupation - based on the National Statistics socio-economic classification
100%
80%
20%
0%
It is important to realise that the NS-SeC cannot solely explain what socio-economic
differences mean. Like the Index of Deprivation, not everything can be explained by what
a classification directly measures. However it can clearly show any relationships that
exist. Fundamentally, it is of interest to determine if the different socio-economic
categories have different relationships with the level of imputation.
I. Occupancy Rating
Figure 1.2.11
Occupancy rating
100%
90%
80%
70%
Occupancy rating is a measure of under-occupancy and overcrowding. A positive
measure refers to the number of rooms in addition to the minimum requirements. A
negative measure refers to the number of rooms short of the minimum, and gives some
indication of over-crowding. It is expected that in wards with a large number of
overcrowded households, there is a greater likelihood of people being missed.
J. Multi-occupancy
Figure 1.2.12
Proportion of shared dwellings in each council area
0.140%
0.120%
percentage
0.100%
0.080%
0.060%
0.040%
0.020%
0.000%
West Lothian
West Dunbartonshire
Stirling
South Lanarkshire
South Ayrshire
Shetland Islands
Scottish Borders
Renfrewshire
Perth & Kinross
Orkney Islands
North Lanarkshire
North Ayrshire
Moray
Midlothian
Inverclyde
Highland
Glasgow City
Fife
Falkirk
Eilean Siar
Edinburgh, City of
East Renfrewshire
East Lothian
East Dunbartonshire
East Ayrshire
Dundee City
Dumfries & Galloway
Clackmannanshire
Argyll & Bute
Angus
Aberdeenshire
Aberdeen City
Scotland
Population: All dwellings
Source: Census Table UV55
Residents who live in multi-occupied households are often hard to enumerate. The
Census defines this group as being resident in a shared dwelling. More precisely, a
household’s accommodation [or space] is defined as being a shared dwelling if:
1. it has accommodation type “part of a converted or shared house”,
2. not all rooms (including toilet/bathroom) are behind a door that only that
household can use, and
3. there is at least one other household space at the same address with which it
can be combined to form the shared dwelling.
If any of these conditions is met, the household space forms a shared dwelling. Therefore
a dwelling can consist of one household space (an unshared dwelling) or two or more
household spaces (a shared dwelling). A shared dwelling is distinct from university halls
of residence, sheltered accommodation, hotels etc., which are known as communal
establishments. Most shared dwellings are intended for temporary occupation (e.g. a
group of bedsits) and residents are likely to be transient.
16
2. Using logistic regression modelling to investigate
Census under-enumeration
The main objective of the analysis in this paper is to find a multivariable logistic
regression model that can investigate the independent effect of specific deprivation and
Census variables on the proportion of residents imputed in a ward (i.e. synthetic
individuals). The advantage of logistic regression over other forms of statistical regression
is that it can be used to predict group membership of a binary outcome.
The analysis was carried out in the statistical analysis package SAS, and the output
produced first gives the response profile (see SAS Output). The response profile refers to
the outcome variable, which is defined to be whether a person is imputed (an event) or
not (a nonevent) ‡ .
SAS Output
Response Profile
Ordered
Value
1
2
Binary
Outcome
Event
Nonevent
Total
Frequency
198987
4863024
5062011
Logistic regression allows a discrete outcome, such as group membership, to be
predicted from a set of variables that may be continuous, discrete, binary or a mixture of
any of these. It results in a model that selects only the variables that are significant
predictors of the outcome variable. The outcome variable in logistic regression is binary
and takes the value 1 with a probability of success (an event) p or the value 0 with
probability of failure (nonevent) 1-p.
Another advantage of logistic regression analysis is that it makes no assumption about
the distribution of the independent variables. They do not have to be normally distributed,
linearly related or of equal variance. The difference between logistic and linear regression
is reflected both in the choice of a parametric model and the underlying assumptions.
Once this difference is accounted for, the methods employed in logistic regression
analysis follow the same general principles as linear regression.
After fitting a model, the emphasis shifts from the computation and assessment of
significance of the estimated coefficients to the interpretation of their values. So the
power of a logistic model is assessed by examination of the classificatory tables. The
higher the percentage correct of positive cases (in this paper, accurately predicting the
number imputed and not imputed) the better the predictive power of the model.
The strength of this type of analysis lies in the fact that an entire set of variables can
simultaneously be taken into account. The results then show which variables are
independent predictors of the binary variable. For example, it might be shown in the
preliminary analyses that the floor level is highly correlated to under-enumeration, and
wards with a high proportion of residents on the 3rd floor or above have a high number of
imputed persons. However, when account is taken of the fact that residents of the 3rd
‡
The population of Scotland after the 2001Census was 5,062,011 – this was made up of 4,863,024 people
counted by the Census and an additional 198,987 people imputed after the Census Coverage Survey.
17
floor or above are much more likely to be resident in flats, and much more likely to be
renting, the impact of floor level on the likelihood of a person being imputed is removed.
Any impact that floor level has on under-enumeration is better accounted for by other
variables and it can be concluded that it is not floor level that is relevant but tenure and
dwelling type. It might be true, however, that floor level is a good predictor of tenure and
dwelling and through its effect on these variables it (indirectly) affects under-enumeration.
Odds ratios
The relationship between the predictor and outcome variables is not a linear function in
logistic regression. Instead, the logistic regression function is used, which is:
⎡ p ⎤
Logit ( p ) = ln ⎢
⎥ = β 0 + β 1 x1 + β 2 x 2 + ... + β k x k
⎣1 − p ⎦
and
(‡)
p=
( β0 + β1x1 + β 2 x2 +...+ β k xk )
e
1 + e ( β0 + β1x1 + β2 x2 +...+ βk xk )
Thus the interpretation of any fitted model relies on an understanding of odds ratios.
These odds ratios are derived from the estimated coefficients in the model. Odds can be
understood as the ratio of the probability of an event occurring over the probability of it
not occurring.
An odds ratio of 1 indicates equal odds in the two groups, therefore when the 95%
confidence interval of the odds ratio estimates contains 1, no inference can be made as
to the direction of the change in odds. Thus such a variable is not considered a useful
predictor in the logistic model.
Overdispersion
Over-dispersion occurs when binary data exhibit variances larger than those permitted by
the binomial model. It is usually caused by clustering and lack of independence. It is
therefore sometimes referred to as ‘extra variation’. It seems to be the norm and not the
exception (McCullagh and Nelder, 1989). For a correctly specified model the Pearson chisquared statistic (and the deviance) divided by the degrees of freedom should be
approximately equal to one. When their values are much larger, the assumption of
binomial variability may not be valid and the data exhibits over-dispersion. In effect, when
data is over-dispersed, terms in the model appear more significant than they actually are.
The preliminary analysis carried out prior to the logistic modelling suggested that overdispersion was a problem that existed in this particular data set. Similar individuals are
likely to reside in the same area. However this similarity between individuals in a ward
implies that the clustering is not properly explained by the model. There are other factors
that the modelling process cannot take into account which affect the variability – e.g.
geography, terrain, and enumerator effect. Therefore, in the logistic regression analyses,
in order to correct for over-dispersion, the covariance matrix was multiplied by a
heterogeneity factor.
18
3. Preliminary Analysis
3.1 The outcome variable: proportion imputed in each ward
Figure 3.1.1
Proportion imputed in each local authority
0.09
0.08
Proportion imputed
0.07
0.06
0.05
0.04
0.03
0.02
0.01
0
West Lothian
West Dunbartonshire
Stirling
South Lanarkshire
South Ayrshire
Shetland Isles
Scottish Borders
Renfrewshire
Perth & Kinross
Orkney Isles
North Lanarkshire
North Ayrshire
Moray
Midlothian
Inverclyde
Highland
Glasgow City
Fife
Falkirk
Eilean Siar
Edinburgh City
East Renfrewshire
East Lothian
East Dunbartonshire
East Ayrshire
Dundee City
Dumfries & Galloway
Clackmannanshire
Argyle & Bute
Angus
Aberdeenshire
Aberdeen City
Figure 3.1.2
Distribution of im pute d pers ons in e ac h w ard
180
160
140
Number of wards
120
100
80
60
40
20
0
0
0.005 0.01 0.015 0.02 0.025 0.03 0.035 0.04 0 .04 5 0.05 0.055 0.06 0.065 0.07
0.08
0.09
0.1
0.12
0.14
0.16
0.18
Proportion of persons Im puted
As shown above, the outcome variable is skewed, and the number of people imputed
varied from area to area. Therefore any analysis that assumes normality could produce
flawed results. Logistic regression analysis makes no distributional assumptions and is
thus a suitable modelling tool in giving an insight into what attributes are more or less
likely to predict event outcome – in this case whether or not a person is imputed.
19
0.2
3.2 The independent variables
The analysis in this paper examined data collected from two sources – the Scottish
Indices of Multiple Deprivation and the 2001 Census.
The Census edit and imputation process - which identified the “synthetic individuals” was based on a Hard to Count (HtC) Index derived to account for the disproportionate
distribution of under-enumerated households. Therefore, the variables that were used to
construct the HtC Index could unduly influence the logistic modelling procedure.
To allow for that influence, the Census variables were divided into:
● Hard to Count variables
● Other Census variables
The Hard to Count variables
shared – shared dwellings
rent – private rented accommodation
over_crowd – over crowded households
students – students as defined by the National Socio Economic classification
_0to4_ - residents aged 0-4
_20to24_ - residents aged 20-24
_25to29_ - residents aged 25-29
over_85 – residents aged over 85
Other Census variables
Group_0 – ‘no’ educational qualifications
council – Council accommodation
house_assoc – other social rented accommodation
sin_par_f – female single parents
NS_SeC_3 – intermediate occupations
no_car – non owner-ship of a car/van
floor – lowest floor level
single – single (never married)
density – population density
A series of logistic regression analyses were carried out on the three sets of variables Deprivation, Hard to Count and Other Census - to yield three logistic models. Initially
univariate models - in which the effects of the individual variables are considered
separately with the view of determining how significant they were as predictors of the
outcome variable – are fitted. Subsequently multivariate models are fitted. Lastly, a final
model is developed that looks at which variables – pre-determined from the above - are
significant and independent predictors of the binary outcome variable: whether or not a
person is imputed.
20
4. Results
The results of the logistic analyses are presented below. For each model I have
presented the Maximum Likelihood parameter estimates and the odds ratios of these
parameter estimates. Like other regression techniques, a p-value less than 0.05
represents a significant result. Additionally, since the relationship between the predictor
and outcome variables is not a linear, I have presented each model as a logistic equation
as follows.
⎡ p ⎤
Logit ( p ) = ln ⎢
⎥ = β 0 + β 1 x1 + β 2 x 2 + ... + β k x k
⎣1 − p ⎦
Forgetting about the logistic transformation on the left-hand side, the right-hand side has
an intercept term β0 and slope terms β1 , β2 , … , βk – the same as in linear regression.
For reasons of parsimony, early on in the model selection process it was decided not to
consider interaction effects between the independent variables. Interaction effects are
difficult to interpret, and can introduce additional complexity to the analysis. In some
cases although a model with an interaction effect might improve the fit the data, in
comparison with a model without, the benefit of the improved model fit is not significant
when other factors are taken into account.
The objective of the logistic analyses in this paper was to find the simplest model that
uncovered which independent factors were significant predictors of number of people
imputed in a ward.
4.1 Deprivation Models
Table 4.1.1: Multiple Deprivation Index
Analysis of Maximum Likelihood Estimates
Parameter
DF
Estimate
Intercept
SIMD
1
1
-3.5951
0.0154
Standard
Error
0.00388
0.000109
Chi-Square
Pr > ChiSq
859150.595
19971.7635
<.0001
<.0001
Odds Ratio Estimates
Effect
SIMD
Point
Estimate
1.016
95% Wald
Confidence Limits
1.015
1.016
Model 1
⎡ p ⎤
ln ⎢
⎥ = −3.595 + 0.015SIMD
⎣1 − p ⎦
The slope of 0.015 represents the change in log odds for one unit increase in the SIMD
score. Similarly, its odds ratio of 1.016 is the ratio for one unit change in the SIMD score.
It would be more significant to look at increments other than one unit. For example, an
increase in SIMD by 10 units increases the imputation probability by 17% § .
§
Assuming that the logit is linear in the continuous covariate, then
21
(1.016) 10 = 1.17
Looking at the component domains of the SIMD score, the model is:
Model 2
⎡ p ⎤
ln ⎢
⎥ = −3.453 + 0.034income − 0.027employ − 0.094educ + 0.236health − 0.294access
⎣1 − p ⎦
All the deprivation variables entered into the model yield significant probabilities, implying
that there is a significant relationship. As it is assumed that the relationship between log
odds of an event and the deprivation score is linear, this model tells us that a 10-unit
increment in the income deprivation score (holding everything else fixed) increases the
probability of imputation by 40%.
Table 4.1.2: Singular Deprivation Indices
Analysis of Maximum Likelihood Estimates
Parameter
DF
Intercept
income
employ
educ
health
access
1
1
1
1
Estimate
Standard
Error
Chi-Square
Pr > ChiSq
-3.4529
0.0873
1564.8121
0.0336
0.00773
18.8580
-0.0265
0.00882
9.0213
0.0027
-0.0941
0.0403
5.4588
0.0195
1
0.2355
0.0609
14.9544
1
-0.2939
0.0285
106.6630
<.0001
<.0001
0.0001
<.0001
Odds Ratio Estimates
Effect
Point
Estimate
income
employ
educ
health
access
1.034
0.974
0.910
1.265
0.745
95% Wald
Confidence Limits
1.019
0.957
0.841
1.123
0.705
1.050
0.991
0.985
1.426
0.788
It is peculiar that the education and employment deprivation domain maximum likelihood
estimates are negative, although they are positively correlated with the income domain. A
likely explanation could be that the assumption of linearity might be flawed and this is
investigated later on.
4.2 Hard to Count Index Model
The Hard to Count Index was constructed from Census variables known to be associated
with under-enumeration. Under-enumeration was also found to be unevenly spread
across certain groups. Since this was the basis of the Census Coverage Survey design,
these variables are likely to be significant predictors of the percentage imputed.
The following variables were used in the logistic regression analysis:
• Shared dwelling
• Private rented accommodation
• Percentage in the following age groups: under 4’s, 20-24, 25-29, over 85’s
• Over-crowded households
• Students
Here ‘shared dwellings’ is the variable that looks at whether people in dwellings
occupied by more than one household will have a high probability of being underenumerated.
22
Of the variables entered into the initial model, only private rented accommodation, multioccupancy (shared dwellings), over-crowded households and people aged 25-29 –
indicate a significant relationship based on the maximum likelihood estimates.
Table 4.2.1: Hard to Count Index variables
Parameter
Intercept
shared
rent
over_crowd
_25to29_
DF
Estimate
Standard
Error
Chi-Square
Pr > ChiSq
1
1
1
1
1
-2.9510
0.0145
0.0157
-1.8450
0.0331
0.0821
0.00553
0.00393
0.1408
0.00900
1291.5114
6.8765
15.9237
171.7077
13.5404
<.0001
0.0087
<.0001
<.0001
0.0002
Odds Ratio Estimates
Effect
shared
rent
over_crowd
_25to29_
Point
Estimate
1.015
1.016
0.158
1.034
95% Wald
Confidence Limits
1.004
1.008
0.120
1.016
1.026
1.024
0.208
1.052
Model 3
⎡ p ⎤
ln ⎢
⎥ = −2.951 + 0.015shared + 0.016rent − 1.845over _ crowd + 0.033 _ 25to29 _
⎣1 − p ⎦
Note how over_crowd has a large negative coefficient – this would give cause for further
investigation of the linearity assumption.
Fitting univariate models rarely provides an adequate analysis of the data as the
independent variables are usually associated with each other and may have different
distributions within the levels of the outcome variable. A multivariable analysis gives more
comprehensive modelling of the data. In multivariate logistic modelling, each estimated
coefficient provides an estimate of the log odds adjusting for all other variables included
in the model. It is therefore of interest to investigate the changes in the odds ratio
estimates.
The results shows that while the number of students and persons aged 20-24 are
significant predictors of under-enumeration when using the single variable model, at the
multivariable level they are not. Students are known to be transient and hence more
difficult to enumerate. However, most students are likely to live in (private) rented
accommodation. Thus the inclusion of the variable ‘rent’ renders ‘students’ redundant.
The same applies to the variable ‘_20to24_’. So knowing the number of residents in
rented accommodation is a significant predictor of students and persons aged 20-24.
However ‘_25to29_’ is still predictive in the multivariable model perhaps because people
in this age group are much more likely to be on the first rung of the property ladder, and
live alone.
23
4.3 The remaining Census variables
Table 4.3.1: Other Census variables
Analysis of Maximum Likelihood Estimates
Parameter
DF
Estimate
Intercept
Group_0
house_assoc
sin_par_f
NS_SeC_3
no_car
single
1
1
1
1
1
1
1
-3.0960
-0.7843
0.00707
-1.7026
0.0472
0.0217
0.0169
Standard
Error
Chi-Square
0.3475
0.2664
0.00226
0.3924
0.00719
0.00239
0.00356
79.3967
8.6646
9.8328
18.8264
43.1444
82.8764
22.5240
Pr > ChiSq
<.0001
0.0032
0.0017
<.0001
<.0001
<.0001
<.0001
Odds Ratio Estimates
Effect
Group_0
house_assoc
sin_par_f
NS_SeC_3
no_car
single
Point
Estimate
0.456
1.007
0.182
1.048
1.022
1.017
95% Wald
Confidence Limits
0.271
1.003
0.084
1.034
1.017
1.010
0.769
1.012
0.393
1.063
1.027
1.024
Model 4
⎡ p ⎤
ln ⎢
⎥ = −3.10 − 0.08Group _ 0 + 0.01house _ assoc − 1.70 sin_ par _ f + 0.05NS _ SeC _ 3 + 0.02no _ car + 0.02single
⎣1 − p ⎦
In the above model, the coefficients of sin_par_f and Group_0 go in the opposite
direction to the other variables. The correlation plots give no clue why this should be the
case, so an investigation into the assumption of linearity might give an explanation.
The inclusion of NS_SeC_3 as a predictor of under-enumeration is surprising at first.
However on close inspection there are a number of possible explanations. Throughout
this analysis, it has become obvious that transience has a strong bearing on underenumeration. NS_SeC_3 is the socio-economic classification that caters for the working
population in intermediate occupations. This group includes clerical, sales and
administrative staff – in particular, call-centre workers. With the increasing
competitiveness of graduate schemes, many students on graduation take temporary work
until they find suitable professional jobs. On closer inspection, the majority of these
intermediate workers are in the university cities. There may however be two effects in
existence because university cities also serve as financial centres and would therefore
attract sales and administrative jobs (also classified as “intermediate”). A second
explanation of this could be the Holiday-Maker Visa Scheme. This scheme allows foreign
nationals to work in the UK for up to two years, although most stay for an average of six
months. Most people on these schemes find jobs in the “intermediate” sector - this
reinforces the link between migration and under-enumeration.
The NS_SeC_3 “intermediate” variable in effect picks up areas where there is a large
mobile young population. Mobility increases the probability of non-contact by Census
enumerators – though sending out the Census form by post may reduce this possibility.
NS_SeC_3, therefore, is an important variable in its own right because, even with ‘single’
in the model, it remains significant.
24
4.4 Further investigation into the assumption of linearity
The assumption of linearity simplifies the regression process and allows the calculation of
the change in log-odds for any increase other than a unit increment. This provides a
useful interpretation for independent variables measured on a continuous scale - a
feature of this particular data set. By fitting all the independent variables, particularly the
deprivation variables, as continuous it is being assumed that this relationship is linear.
Based on the previous model results, there is reason to question the assumption of
linearity for Group_0, the educational attainment variable, and income and access, the
sub-indices of deprivation. The results from the preliminary analysis suggest that the
relationship between these variables and the imputation probability is non-linear. If the
variables are fitted in their present format as continuous, the multivariable model results
would not yield useful conclusions. Therefore, based on a quartile design variable
analysis, Group_0, income and access are fitted as categorical variables.
Quartile design analysis creates categorical variables with 4 levels using three cutpoints
based on the distributional quartiles. Thus the continuous variables are replaced by these
(new) 4-level categorical variables. The lowest quartile serves as the reference group;
this category is used as a comparison group for all the other groups, and hence results
can be interpreted relative to this reference group. For the deprivation model, there is
justification in using the sub-indices of multiple deprivation because it shows a more
careful pattern of differences that have a better ability to predict non-response. Therefore
the Access and Income singular deprivation indices are fitted as categorical variables (the
remaining domains are left as continuous). In doing so, it is possible to ascertain which
deprivation sub-indices have a comparatively better predictive ability.
Table 4.4.1(a)
Type III Analysis of Effects
Wald
Effect
DF
Chi-Square
Pr > ChiSq
income
access
health
employ
educ
3
3
1
1
1
22.4409
109.0892
47.9210
1.0548
0.1146
<.0001
<.0001
<.0001
0.3044
0.7350
Table 4.4.1(b)
Type III Analysis of Effects
Effect
DF
Wald
Chi-Square
Pr > ChiSq
income
access
health
3
3
1
22.4496
109.1964
82.2696
<.0001
<.0001
<.0001
The Type III Analysis of Effects (shown above in Table 4.4.1(a) and (b)) tests all factors in
the model as if each one were the last one included in the model (i.e. this controls for all
other effects). This indicates that, with the other indices in the model, the employment
and education indices are rendered redundant.
The tables show that the employment and education sub-indices have the least
significance, after controlling for other effects. While it is widely acknowledged that
unemployment is a good predictor of under-enumeration, a person with a low income is
25
much more likely to be unemployed (although the reverse is not necessarily true). In this
respect, employment deprivation is said to be confounded as it is associated with both
the outcome variable (imputation probability) and a primary independent variable (low
income).
The following table shows the how the Other Census model changes when Group_0 is
fitted as a categorical variable.
Table 4.4.2: Type III Analysis with Group_0 fitted as a class covariate
Type III Analysis of Effects
Effect
Group_0
sin_par_f
NS_SeC_3
no_car
single
DF
Wald
Chi-Square
Pr > ChiSq
3
1
1
1
1
7.4160
20.7668
55.2018
83.2786
49.8101
0.0598
<.0001
<.0001
<.0001
<.0001
When fitted as a class variable, Group_0 becomes non-significant. Also notice how
house_assoc – the variable looking at social tenures not administered by the council – is
absent from the model. The reason for this is that the odds ratio of the covariate in this
model ends up being very close to 1 which is indicative of no effect ** . Hence the final
model is fitted with sin_par_f, NS_SeC_3, no_car and single.
4.5 Final Modelling Stage
Having determined which amalgamated Deprivation, Hard to Count and Other Census
variables are associated with under-enumeration, the final modelling stage seeks to
identify which independent variables are significant predictors, controlling for all other
variables.
A series of logistic regression analyses were carried out and the results are shown below.
Table 4.5.1(a) Census and Multiple Deprivation Variables: Type III Analysis
Type III Analysis of Effects
Effect
income
access
health
sin_par_f
NS_SeC_3
no_car
single
shared
rent
over_crowd
_25to29_
DF
Wald
Chi-Square
Pr > ChiSq
3
3
1
1
1
1
1
1
1
1
1
17.5617
5.0613
37.7709
13.4854
111.6820
34.6778
12.7539
8.8587
21.7330
31.2054
6.2902
0.0005
0.1674
<.0001
0.0002
<.0001
<.0001
0.0004
0.0029
<.0001
<.0001
0.0121
Access has a non-significant p-value; hence another model is fitted with the sub-index
removed. The analysis yields the model below – the best model. Since income and
access have been fitted as categorical variables they have
**
The odds ratio estimate of 1.035, in the univariate model, drops down to 1.005 in the multivariate model i.e. with the addition of other variables in the model, the differences between areas with a high number of
housing association tenants and those without becomes non-significant.
26
Table 4.5.1(b) Census and Multiple Deprivation Variables: Type III Analysis
Type III Analysis of Effects
Effect
income
health
sin_par_f
NS_SeC_3
no_car
single
shared
rent
over_crowd
_25to29_
DF
Wald
Chi-Square
Pr > ChiSq
3
1
1
1
1
1
1
1
1
1
16.7498
40.3774
15.1413
107.5792
44.0331
10.1418
7.7653
31.9308
34.4162
6.7151
0.0008
<.0001
<.0001
<.0001
<.0001
0.0014
0.0053
<.0001
<.0001
0.0096
Table 4.5.2 Final Model ††
Analysis of Maximum Likelihood Estimates
Parameter
Intercept
income
income
income
health
sin_par_f
NS_SeC_3
no_car
single
shared
rent
over_crowd
_25to29_
1
2
3
DF
Estimate
1
1
1
1
1
1
1
1
1
1
1
1
1
-4.4200
-0.0786
-0.0436
0.1020
0.2427
-1.3972
0.0738
0.0177
0.0114
0.0121
0.0229
1.4946
0.0224
Standard
Error
Chi-Square
0.3828
0.0537
0.0628
0.0805
0.0382
0.3591
0.00711
0.00267
0.00359
0.00436
0.00405
0.2548
0.00863
133.2996
2.1453
0.4817
1.6074
40.3774
15.1413
107.5792
44.0331
10.1418
7.7653
31.9308
34.4162
6.7151
Pr > ChiSq
<.0001
0.1430
0.4876
0.2049
<.0001
<.0001
<.0001
<.0001
0.0014
0.0053
<.0001
<.0001
0.0096
The final model reveals which Deprivation and Census variables are significant and
independent predictors of under-enumeration. When all the variables found to be
significant in Models 1- 4 are placed in the regression model simultaneously, any
association that Group_0 and access have with under-enumeration is better accounted
for by the remaining variables.
Final Model
⎡ p ⎤
ln ⎢
⎥ = −4.42 − 0.08income 1 − 0.04income 2 + 0.10 income 3 + 0.24 health + 0.01single + 0.01shared +
⎣1 − p ⎦
0.07 NS _ SeC _ 3 + 0.02 no _ car + 0.02 _ 25to 29 _ + 0.02 rent + 1.50 over _ crowd − 1.40 sin_ par _ f
Notice how the over_crowd coefficient changes from a large negative (-1.84) from
Model 3 to a large positive (+1.50) in this final model. The large positive is more plausible
– it would be expected that the more overcrowded a household is, the greater the
likelihood of being missed. However, there is evidence to suggest that this is true only to
a certain extent. Changing household structures, working patterns etc. have meant that it
is increasingly difficult to contact one person and adult couple households, which have
significantly increased in the past 30 years (Miller, 2003).
††
Since income has been fitted as a categorical variable to account for the non-linearity of its relationship
with the number imputed, there are three ‘income’ variables shown in Table 4.5.2.
These variables can be thought of as representing the quartiles of income deprivation.
27
4.6 Odds Ratios
The table below gives the odds ratio estimates derived from the coefficients in the model.
Odds ratios are more easily interpretable in comparison with the logistic regression
coefficients. For example, it can be concluded that residents in income deprivation group
3 (upper quartile: least deprived) are 1.1 times more likely to be under-enumerated
compared to the reference category (group 0: most deprived). However, for residents in
group 1 the comparative value is 0.9 i.e. a reduction. Thus the greatest amount of
imputation occurred in the lowest and highest income groups.
Table 4.6.1: Odds ratio estimates of the variables in final model
Odds Ratio Estimates
Point
Estimate
Effect
income
income
income
health
sin_par_f
NS_SeC_3
no_car
single
shared
rent
over_crowd
_25to29_
1 vs 0
2 vs 0
3 vs 0
0.924
0.957
1.107
1.275
0.247
1.077
1.018
1.011
1.012
1.023
4.457
1.023
95% Wald
Confidence Limits
0.832
0.846
0.946
1.183
0.122
1.062
1.013
1.004
1.004
1.015
2.705
1.005
1.027
1.083
1.297
1.374
0.500
1.092
1.023
1.019
1.021
1.031
7.344
1.040
Overcrowding and non-car ownership still remain significant even with the addition of the
Deprivation Indices. As stated earlier, the Scottish Index of Multiple Deprivation was not
exhaustive in the number of domains considered. It did not consider housing deprivation
– of which overcrowding is an indicator. Car ownership was also not used in any of the
indices because of the ambiguity over whether the non-ownership of a vehicle is
indicative of deprivation or a reflection of personal choice. Therefore, this could be a
possible explanation why these two variables are shown to be still significant in the final
model.
Secondly, single parentage has a comparatively large negative coefficient - meaning that
the fewer single parents there are in an area, the higher the number of “synthetic
individuals” imputed. This seems peculiar, until we look at the One Number Census
methodology. The Census Coverage Survey identified certain household structures into
which imputed individuals were placed – for instance a large number of female single
parent households with live-in boyfriends (particularly in the Glasgow region). The
imputation process placed young males into these households, hereby changing their
structure.
28
4.7 Assessment of how well the models fit the data
The assessment of fit, and the predictive power, of the model is a multi-faceted
investigation involving summary tests and measures as well as diagnostic statistics.
The most straightforward way to summarise results of a fitted logistic regression
model is by a classification table. This table is the result of cross-classifying the
outcome variable with a dichotomous variable whose values are derived from the
estimated logistic probabilities [Hosmer & Lemeshow (2000) p.156]. Classification
tables use different estimated probabilities to predict group membership. Therefore a
model is said to be of reasonable fit if, based on some criterion, it predicts group
membership correctly.
The accuracy of the classification is measured by the sensitivity and specificity.
Sensitivity is defined as the ability to correctly predict an event, while specificity is the
ability to predict a non-event correctly. A more complete description of the
classification accuracy is given by the area under the Receiver Operating
Characteristic (ROC) curve. ROC curves plot the probability of detecting a true signal
(sensitivity) against a false signal (1 – specificity) for an entire range of possible cutpoints.
Figure 4.7.1 Assessment of fit of the different models
ROC curves
100
90
80
sensitivity
70
60
Model A - Final Model
Model B - Other Census Variables Model
Model C - Hard-to-Count Variables Model
50
Model D - Singular Deprivation Model
Model E - Multiple Deprivation Model
40
30
20
10
0
0
10
20
30
40
50
60
70
80
90
100
1- specificity
As a general rule we require c, the area under the ROC curve, to be 0.5 < c ≤ 1, in
order to ascertain how good the model is at discriminating between events (synthetic
individuals) and non-events (actual individuals).
A good prediction curve is one that is well above the diagonal line. The graph
(Figure 4.7.1) shows the ROC curves for all the models. The diagonal line has been
superimposed in order to determine which curve achieves the best ‘lift’ and is
therefore the best model.
29
Furthermore, Figure 4.7.1 shows that even the ‘best’ model does not achieve optimal
lift (the area under the ‘best’ model, is 0.631, which is not greatly above the minimum
value of 0.5). Classification is sensitive to the relative sizes of the two component
groups and always favours classification into the larger group. This is particularly true
for this dataset, where the model is trying to split the population of Scotland
(5,062,011) into 198,987 (events i.e. imputed persons) and 4,863,024 (non-events).
Although the final model lift is not optimal, it does have some ability to discriminate
between the events and non-events. More importantly, the analysis results show
which variables are significantly associated with under-enumeration.
4.8 Summary
The results of the modelling exercise showed that there were a number of
Deprivation and Census independent variables that had some power at predicting
under-enumeration. The best model was achieved by firstly, formulating a basic plan
for selecting the model variables and secondly, using variety of methods to assess
the adequacy of the model both in terms of its individual variables and its overall fit.
Therefore, the final model with the best fit included the following variables:
• Income Deprivation
• Health Deprivation
• Shared dwellings
• Private rented accommodation
• Over-crowding
• Residents aged 25-29
• Single residents
• Female Single Parents
• Residents in intermediate occupations
• Non-car ownership
Therefore, for brevity, under-enumeration can be said to be a function of deprivation
factors – income and health deprivation, non-car ownership, over-crowding and multioccupancy; and transience factors – privately rented accommodation, intermediate
occupational class, single (never married) and young people (aged 25-29). The
number of young single mothers in an area could be construed to fit into both
transience and deprivation categories
30
5. Conclusions and Recommendations
5.1 Conclusions
This paper indicates that deprivation and certain household, demographic, economic
and social characteristics appear to have some association with the level of underenumeration in an area. For some characteristics, however, the associations are very
modest and it is clear that, while there may be a variety of factors that affect whether
or not an area suffers from under-enumeration, much of the variability remains
unaccounted for by the deprivation and socio-economic factors alone.
However these factors can parsimoniously be split into transience or deprivation
characteristics. By definition an area is said to have a transient population if it has a
one (or more) of the following characteristics:
o high proportion of single people
o high proportion of people aged 20-29
o high proportion of people in intermediate occupations
o high proportion of people in private rented accommodation
These findings appear to be in line with previous research establishing that underenumeration is higher in areas characterised by certain social, economic and
demographic characteristics. However, this paper quantifies how the different
characteristics affected under-enumeration, by the successful use of logistic
regression models.
The paper did not however look at all possible variables. There are a number of other
factors (e.g. enumeration strategy) known to have an effect on under-enumeration.
Additionally the paper’s emphasis is on associations; only an experimental study
can provide evidence for causality. It is however, possible that other confounding
variables, not considered, could account for the associations.
While a selection of demographic, social and household variables was found to
predict under-enumeration, these variables were obtained from the post-imputation
Census database and thus any associations could be due the underlying One
Number Census methodology. More definitive results could be achieved by using the
pre-imputation database.
31
5.2 Future Analysis
The following future analysis may be justified:
ƒ
The One Number Census methodology was principally concerned with
identifying and adjusting for under-enumeration and assumes that overenumeration (people being counted twice in the Census e.g. students
counted both at home and at place of study) is negligible. An investigation
into this assumption would be worthwhile.
ƒ
Further investigation into whether using the pre-imputation database yields
similar results would help verify the validity of the models.
ƒ
The UK is the only country (currently) with a fully adjusted Census individuallevel database. This makes comparison of the accuracy of the One Number
Census project with other countries difficult. As other countries adopt similar
approaches, the logistic models developed could be applied to their underenumeration.
ƒ
The Census variables chosen in the modelling process were by no means
exhaustive. Further investigation could examine other attributes not
considered.
ƒ
The Scottish Executive has recently developed a new level of geography –
the datazone. Datazones are fixed small areas, created from 2001 Census
Output Areas, which avoid the homogeneity and confidentiality issues posed
by other small areas such as postcodes or settlements. They have been
designed to be compact in shape and contain households with similar social
characteristics. Future work could ascertain if the logistic regression results
are replicated at the datazone level.
ƒ
Public perceptions of a Census would have been a good indicator to examine
in the modelling process. Has a decline in individual commitment to civic duty
– or an increase in apathy – been a factor? The last general election
registered the lowest voter turnout post-war. Future work might investigate if
voter turnout rates are associated with the level of under-enumeration.
32
References
Boyle, P. and Dorling, D. (2004) The 2001 UK Census: remarkable resource or bygone
legacy of the ‘pencil and paper era’? Area, Vol.36 Issue 2, 101-111.
Brown, J., Diamond, I., Chambers, R., Buckner L.J. and Teague, A.D. (1999).
A methodological strategy for a one number census in the UK. J. R. Statist. Soc. A, 162, 247267
Carstairs, V. and Morris, R. (1992) Deprivation and Health in Scotland. Aberdeen University
Press.
Gordon, D., Adelman, L., Ashworth, K., Bradshaw, J., Levitas, R., Middleton, S., Pantazis, C.,
Patsios, D., Payne, S., Townsend P. and Williams, J. (2000)
Poverty and Social Exclusion in Britain. York: Joseph Rowntree Foundation
Hosmer, D.W. and Lemeshow, S. (2000) Applied Logistic Regression, 2nd Edition.
Wiley Series in Probability and Statistics.
Macniven, D. (2004) Scotland’s Population 2003 – the Registrar General’s Annual Review of
Demographic Trends.
McCullagh, P.and Nelder, J.A. (1989) Generalised Linear Models, 2nd Edition. Chapman and
Hall / CRC
Miller, G. (2003) The Relationship Matrix - measuring the changing face of Scotland’s
households. MSc. Dissertation. Napier University. Edinburgh.
Noble, M., Penhale, G., Smith, G. and Wright, G. (2000) Measuring Multiple Deprivation at the
Local Level, Index of Deprivation 1999 Review.
Noble, M., Wright, G., Lloyd, M., Smith, G., Ratcliffe, A., McLennan, D., Sigala, M. and Antilla,
C. (2003) Scottish Indices of Deprivation (2003) Report.
Office of National Statistics (2002) User Manual for the National Statistics Socio-economic
classification [NS-SeC].
Ruston, D. (2002) Difficulty in accessing key services, Office of National Statistics publication.
Scottish Office (1999) Social Inclusion – opening the door to a better Scotland
Townsend, P. (1979) Poverty in the United Kingdom: a survey of household resources and
standards of living. Harmondsworth: Penguin
33