Mapping Population Data from Zone Centroid Locations
Author(s): David Martin
Source: Transactions of the Institute of British Geographers, New Series, Vol. 14, No. 1 (1989),
pp. 90-97
Published by: Blackwell Publishing on behalf of The Royal Geographical Society (with the Institute of
British Geographers)
Stable URL: http://www.jstor.org/stable/622344
Accessed: 26/07/2012 11:28
DAVID MARTIN
ESRC Research Student, Department of Town Planning, University of Wales, Cardiff, PO Box 906, Cardiff CF1 3YN
Revised MS received 8 November, 1988
ABSTRACT
Geographers are familiar with the difficulties encountered in the use of choropleth maps to represent census-type data, especially those associated with the modifiable areal unit problem. Here, an alternative, raster representation is suggested which seeks to recover some of the underlying distribution from population-weighted centroids. These issues are considered, and found to be of great importance to any attempt to build a geographic information system for population-based data.
KEYWORDS: Choropleth map, Census data, Population-weighted centroid, Surface, Geographic information system
INTRODUCTION
The choropleth map, despite its extensive and continued use, has long been acknowledged as a poor medium for the presentation of population-based data. Uncertainty is introduced both by the inappropriate location of zone boundaries for the variable being mapped, and by some potentially misleading features of the choropleth mapping process itself. However, human geographers and planners necessarily make heavy use of such data in the computation of indices of deprivation and need, and for basic population distribution information in many application areas including health, migration, and retailing studies. There is a real danger that these problems may be carried forward unresolved into the rapidly growing field of geographic information systems (GIS), where errors may be compounded by users unfamiliar with the nature of the data. However, the use of GIS techniques with such data also offers the potential for alternative representations. Here, a review of the problems associated with population mapping is followed by a discussion of a raster method for handling census data, based on population-weighted centroids, and its implications for GIS.
PROBLEMS OF MAPPING POPULATION DATA [Note 1]
Analysis of census-derived data frequently involves the mapping of these data in choropleth (vector) form, and the logic of non-graphic processes is frequently vector in nature, such as the use of point-in-polygon or nearest-neighbour techniques for the association of unit postcoded data and census enumeration districts (e.d.s). Frequent use has been made of combinations of census retrieval and mapping software such as SASPAC and GIMMS in the production of hardcopy maps of this type (for example, Carruthers, 1985).
A number of problems are acknowledged to exist with this type of mapping, which relate to the essentially unknown relationship between the data and the highly irregular areal units with which they are associated. This uncertainty is due to the aggregation of individual data to 'imposed' areal units (Unwin, 1981), whose boundaries are not data-derived, but designed for ease of enumeration. As a consequence the data values for each zone may be as much a function of the zone boundary locations as of the underlying distribution. This has become known as the modifiable areal unit problem (Openshaw and Taylor, 1981; Openshaw, 1984; Forbes, 1984). By contrast, many area data relating to the physical environment, such as land parcels, soil types etc., may be mapped using their 'natural' boundaries, and there is a direct correspondence between the zones represented in the data and those existing in the 'real' world.
Trans. Inst. Br. Geogr. N.S. 14: 90-97 (1989) ISSN: 0020-2754. Printed in Great Britain
[Figure 1 map title: Cardiff Bay, % unemployed of econ. active at 1981 census]
FIGURE 1. A choropleth map showing percentage unemployment in the Cardiff Bay area. (The zones are census enumeration districts, and the visual impression is highly misleading)
An additional difficulty with the application of choropleth mapping to small area census data is that the areal units are frequently least appropriate for areas displaying the most extreme socio-economic characteristics, giving them disproportionate and misleading visual impact in the map image. A classic example would be undeveloped docklands areas and other areas of industrial decline, which are frequently adjacent to areas of poor quality, overcrowded housing, containing residents of particular concern in policy terms. E.d.s in such areas typically encompass large areas of industrial or derelict land and areas of open water. The census data for many of these areas fall into the extremes of the recorded values, and their presentation by inappropriate areal units makes the resulting statistical maps highly misleading, as illustrated by Figure 1, which shows percentage unemployment for the Cardiff Bay area. Figure 2 shows the residential areas on which the data in Figure 1 are based. It is apparent that in a number of enumeration districts high percentage values are based on residential areas which occupy only a small proportion of the e.d. area and are frequently based on small population counts (Rhind, 1983). Such zones frequently dominate the visual impact of the map. Similar problems will be encountered in rural districts where settlements occupy a very low percentage of the land area, and the population of distinct village communities may be divided between a number of zones comprising mostly agricultural land. The existence of linear settlement, for example, is largely impossible to determine from such data, a notable example being the valley communities of South Wales. Dense valley-floor settlements are separated by relatively unpopulated watersheds, but these regions of highly divergent character are grouped together within enumeration districts which must encompass the entire land surface. This is one of the fundamental weaknesses of the conventional approach: that it fails to identify the (extensive) regions in which the mapped phenomena are not present. The presence of a set of zone boundaries imposes a single geography on all variables derived. This is almost certainly invalid in most contexts: more appropriate would be the geography of the settlement pattern.
[Figure 2 map title: Cardiff Bay, enumeration districts at 1981 census, residential areas]
FIGURE 2. The location of residential areas in the Cardiff Bay area. (Many of the e.d.s contain large areas of open water and industrial land, and very small populations)
Besides these theoretical issues, there are also pragmatic incentives to find alternative representations of these data: the automated production of vector maps requires digital boundary data, and digitizing is an extremely time-consuming and error-prone process. Also, it is to a certain extent application specific, as such boundaries are not of the type to be included in national topographic databases or used in non-census applications. Software for the manipulation of vector datasets is necessarily complex, as are the more advanced vector data structures. However, despite the difficulties outlined, vector systems do have the endearing feature of looking very much like the conventional map products with which geographers and planners are familiar, separating the locational and statistical information into 'zone boundaries' and 'data'.
TOWARDS A SOLUTION
It will never be possible to fully reconstruct the detail of the spatial structure from the aggregate census data, but some spatial disaggregation should be possible either by the use of data collected for grid cells or by interpolation from the population-weighted centroid locations available for census e.d.s. The raster (grid cell) representation produced in this way is implicitly a density surface of the variable concerned.
One approach has been the aggregation of individual data to regular grid cells, which have the advantages of being constant over time, and allowing for the existence of cells with zero population to represent areas where no population is present. The UK 1971 census small area statistics were available for 1 km grid squares, and these formed the basis for the census atlas 'People in Britain' (CRU/OPCS/GRO(S), 1980), but the need to preserve confidentiality meant that data had to be suppressed for a large number of rural grid squares. An additional problem proved to be that 1 km squares are rather too coarse for the analysis of urban areas. In the UK 1981 census, however, no grid cell data were made available and the e.d. centroids are the only locational references available without additional digitizing of boundaries. The locations of these points were determined by eye at the Office of Population Censuses and Surveys (OPCS) at the time of the census to represent the 'population-weighted' centroids of each e.d. (Denham and Rhind, 1983). Precise techniques exist for the production of raster maps from geometric zone centroids (Tobler, 1979), but these involve no more information about the underlying distribution than the boundaries themselves, and do not offer any solution to the issue of identifying unpopulated areas.
The only major project to address the creation of raster maps from population-weighted centroid data has been the creation of the population database for the BBC Domesday Project. Two approaches were adopted for the assignment of population values to 1 km grid cells (Rhind and Mounsey, 1986; Flowerdew and Openshaw, 1987; Rhind and Openshaw, 1987). The first of these was simply to assign to a cell the total data values of any centroids falling within that cell. The second approach involved the construction of a Dirichlet tessellation based on the centroid locations, and calculation of cell populations on the basis of the proportional contributions of the Thiessen polygons in which they fell. Some areas which were known from other sources to be unpopulated were then restored. The problem with these techniques is that the first makes the assumption that all of an enumeration district's population is concentrated at one point, and the second assumes that the population is uniformly distributed throughout the Thiessen polygon approximation of the e.d.
A CENTROID DATA DISTRIBUTION ALGORITHM
What is required is a technique which draws out all the latent distributional information available from the centroid locations. The centroid grid references from the SASPAC matrix file (LAMSAC, 1983) may be used as 'summary points' of the local geographic distribution in addition to the use of their population counts as a summary of local population totals. A centroid data distribution algorithm has been developed which makes possible the construction of detailed, higher-resolution population surfaces. It should be noted that this is by no means the only approach to the problem, but merely serves to illustrate the possibility of constructing raster representations from the existing data. Moreover, as these are estimations based on aggregate data they do not conflict with the confidentiality restrictions placed on data collected for small areas. The following basic assumptions of the model should be noted:
(1) A centroid defines a location with above average population density for the local area, of which it is a summary point;
(2) A centroid's population is distributed in the surrounding area according to some distance decay function, which has finite extent;
(3) Regions may exist in the population plane in which no population is present. In some cases these areas may be determined a priori (i.e., it is possible to assign zero population values to water areas).
The model proceeds with the placement of a window over each centroid in turn. The size of this window determines the maximum possible extent of a centroid's influence in (2) above, and is of fundamental importance to the model's operation. Each cell falling within the window is then assigned a weighting representing the probability of its receiving a share of the current centroid's population. These weightings are primarily influenced by the distance decay model, but are also scaled according to the local density of the centroids, so that in more densely settled areas, where centroids are closer together, the kernel size used for population distribution may be considerably smaller than the window size. The logic employed is essentially that of a variable-kernel density estimator (Bowman, 1985). At this point, a pre-prepared mask file may be consulted to see if any cells in the current window have been masked from receiving any population, as in (3) above. Once the weightings have been crudely determined, they are re-scaled to sum to 1.0, so as to preserve the total population assigned within the kernel. As the estimation procedure proceeds, some cells will receive a share of the population from a number of different centroids, while others will receive none at all. A more detailed discussion of the estimation procedure may be found in the appendix.
The same procedure may be followed for any variable based on the same centroids. These variable populations may then be expressed as percentages of the total populations in the corresponding cells. The resulting percentage scores are then effectively distance-weighted averages of the percentage scores at surrounding centroids.
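The procedure described in this section can be sketched in code. The following Python fragment is only a minimal illustration under stated assumptions, not the implementation used for the work reported here: the linear distance decay, the use of the mean distance to neighbouring centroids as the kernel radius, and the names `distribute_centroids` and `percentage_surface` are all hypothetical choices made for the example. Centroids are assumed to fall inside the grid, with coordinates measured from the grid origin.

```python
import numpy as np


def distribute_centroids(xs, ys, pops, shape, cell, window, mask=None):
    """Spread centroid populations over a raster (illustrative sketch only)."""
    nrows, ncols = shape
    rows, cols = np.mgrid[0:nrows, 0:ncols]
    cell_x = (cols + 0.5) * cell          # x coordinate of each cell centre
    cell_y = (rows + 0.5) * cell          # y coordinate of each cell centre
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    surface = np.zeros(shape, dtype=float)
    for x, y, pop in zip(xs, ys, pops):
        # Kernel radius: mean distance to other centroids inside the window,
        # never larger than the window itself (an assumed scaling rule).
        between = np.hypot(xs - x, ys - y)
        neighbours = between[(between > 0) & (between <= window)]
        radius = min(neighbours.mean(), window) if neighbours.size else window
        # Assumed linear distance decay with finite extent (assumption 2).
        dist = np.hypot(cell_x - x, cell_y - y)
        weights = np.clip(1.0 - dist / radius, 0.0, None)
        if mask is not None:
            weights[mask] = 0.0           # masked cells receive nothing (assumption 3)
        if weights.sum() == 0.0:
            # Very small kernel: give everything to the centroid's own cell
            # (this fallback ignores the mask).
            weights[int(y // cell), int(x // cell)] = 1.0
        # Re-scale the weights to sum to 1.0 so the centroid total is preserved.
        surface += pop * weights / weights.sum()
    return surface


def percentage_surface(variable, total):
    """Express a variable surface as a percentage of the total surface."""
    out = np.zeros_like(total, dtype=float)
    np.divide(100.0 * variable, total, out=out, where=total > 0)
    return out
```

Calling `distribute_centroids` once with total population counts and once with, say, unemployment counts for the same centroids, and passing both results to `percentage_surface`, gives a percentage surface of the kind described above. Cells that receive no population remain at zero in both surfaces.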
APPLICATION EXAMPLE
To illustrate the utility of such a technique, an application to a 40 x 40 km area of South Wales, including Cardiff and a number of significant 'Valleys' towns, is described. As already demonstrated, the settlement patterns of the region are poorly represented by traditional census geography. A mask was prepared by digitizing the coastline and rasterizing the small area of the Bristol Channel falling within the area of the study. A vector overlay of the primary roads indicates that the population surface, with clearly identifiable settlements, gives a very 'good' picture of the actual population distribution, as illustrated in Figure 3. However, if the model is to be evaluated and the optimum window size determined, some objective measure of the goodness-of-fit of the estimated surface is required.
One approach to the evaluation of real world data is to obtain a raster image of the residential areas from an alternative source. This map may then be used as an empirical dataset against which the model estimate may be evaluated, treating each cell as simply 'populated' or 'unpopulated'. Perhaps the most appropriate source of such an image is satellite imagery, but suitable images have not been available during the course of this work, and a map of residential areas has been digitized and rasterized to provide a mask of the areas which should be populated. This image can only be viewed as an approximation to reality, and may contain errors which are not present in the estimated surface, but the majority of populated cells will be correctly represented. Using this technique for the image illustrated, 70 per cent of populated cells are correctly identified, and 65 per cent of those estimated to be populated are in fact so. If very small centres (represented by an isolated centroid) could be digitized, and the constituent source maps (1:50 000 sheets) were all exactly correct for the situation in 1981, it is estimated that these figures could be improved by 5-10 per cent. At present, 92 per cent of all 40 000 cells in the map are assigned to the correct category. The difficulty with such an approach to evaluation is that it merely tests the spatial distribution of populated cells, as illustrated by Figure 3, but offers no measure of the accuracy of the population values assigned to each cell. In an attempt to understand the effects of the distance decay function more fully, some consideration has been given to simulation techniques (Martin, 1988).
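The three agreement figures quoted above correspond to what are commonly called recall, precision and overall accuracy for a two-class comparison. As a hedged illustration of how such figures might be computed from two rasters, the sketch below compares an estimated surface with a reference mask; the function name `evaluate_populated` and the dictionary keys are hypothetical, and the code assumes both rasters contain at least one populated cell.

```python
import numpy as np


def evaluate_populated(estimated, reference):
    """Compare populated/unpopulated cells in two rasters of equal shape."""
    est = np.asarray(estimated) > 0           # cells the model populates
    ref = np.asarray(reference).astype(bool)  # cells the reference map populates
    both = np.logical_and(est, ref).sum()
    return {
        # share of reference-populated cells that the model also populates
        "populated_identified": both / ref.sum(),
        # share of model-populated cells that are populated in the reference
        "estimated_correct": both / est.sum(),
        # share of all cells placed in the same category by both rasters
        "overall_agreement": (est == ref).mean(),
    }
```

As the text notes, measures of this kind test only the spatial pattern of populated cells and say nothing about the accuracy of the values within them.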
More sophisticated evaluation techniques would
allow a more careful analysis of the effects of cell and
window size. It should be remembered that centroid
references are only to the nearest 100 m. The most
appropriate window size in any area will be some
function of the areal extent of the zones represented
by the centroids. The crude evaluation outlined
above suggests that a window size of 200 m is most
appropriate. This is consistent with applications to
other areas of South Wales and the County of Avon.
IMPLICATIONS FOR GIS
Bracken and Martin (1989) discuss this technique with reference to census mapping, but the issues raised are also of considerable relevance to recent work on the integration of socio-economic data into GIS. If the manner in which spatial data are transformed by the static map is of fundamental importance to geographical analysis, then it is of even greater significance to GIS, which offer the potential for complex analyses and modelling of these data. In this context, the distinction between systems using vector and raster representations of spatial data is important, and each should be considered separately. The logic of vector systems is essentially that of the choropleth map, while raster systems assume data such as those derived from the technique suggested above.
The most significant application of vector techniques to census data has been the US Bureau of the Census DIME system (Teng, 1983), although more generally the most common uses for such systems have been in cadastral and utility information systems. In these instances the data represent spatial objects whose location may be precisely defined: according to the criteria outlined above, data are mapped according to 'natural' rather than 'imposed' areal units. Tentative applications of GIS to socio-economic data in the UK have generally involved vector systems and have frequently been microcomputer based (Wiggins, 1986; Bracken et al., 1987). Unfortunately, this automated duplication of traditional mapping is prey to all the well-acknowledged difficulties with these techniques, and may indeed offer opportunities for the compounding of errors and proliferation of basically invalid analyses (Slaby et al., 1979).
FIGURE 3. A raster map of populated areas generated from census centroids for a 40 x 40 km area of Cardiff and the South Wales Valleys. The cell size is 200 m. (The spatial form of the settlement pattern is recovered in considerable detail. A population estimate is provided for each shaded cell)
Raster maps, in contrast to the 'fixed geography' concept of the vector systems outlined above, are essentially models of the variable depicted, where the geography (locational information) is integral to the data layer. Each variable may therefore have a unique data-derived geography, with no fixed boundaries, within the limits of the raster resolution. It should be stressed that with the exception of remotely sensed imagery very few datasets are actually captured in raster form. A more common approach is to gather point, line or area (i.e., vector) data and to use one of a variety of interpolation algorithms to derive a surface model (Lam, 1983). Once the data are in raster form, a wide range of cartographic modelling operations may be performed, using combinations of mathematical, boolean and neighbourhood functions on the map sheets, producing new map coverages at each stage of the analysis (Berry, 1987). These operations are generally easy to program, basically involving simple matrix operations, and hence a number of packages are available for the PC with considerable functionality (e.g., Eastman, 1987; Sandhu and Amundson, 1987). Other advantages of raster representations include the compatibility of data over time and from different sources, providing they may be transformed, enhanced or degraded into a common raster frame. These types of operation are notoriously complex using vector-based systems.
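To indicate how simple such raster operations can be, the sketch below applies one mathematical, one boolean and one neighbourhood function to small population grids. It is an illustration only: the grid values, the thresholds and the helper name `focal_sum` are invented for the example and do not come from any of the packages cited.

```python
import numpy as np


def focal_sum(grid):
    """Sum of each cell and its eight neighbours, treating the edge as zero."""
    padded = np.pad(grid, 1)                  # one-cell border of zeros
    nrows, ncols = grid.shape
    out = np.zeros_like(grid, dtype=float)
    for dr in range(3):
        for dc in range(3):
            out += padded[dr:dr + nrows, dc:dc + ncols]
    return out


# Hypothetical 3 x 3 map sheets of total and unemployed population.
population = np.array([[0, 10, 0], [0, 20, 5], [0, 0, 0]], dtype=float)
unemployed = np.array([[0, 1, 0], [0, 4, 1], [0, 0, 0]], dtype=float)

# Mathematical function: percentage rate, left at zero where nobody lives.
rate = np.divide(100.0 * unemployed, population,
                 out=np.zeros_like(population), where=population > 0)

# Boolean function: cells with a high rate and a non-trivial population.
hotspot = (rate > 15) & (population >= 10)

# Neighbourhood function: population within each cell's 3 x 3 surroundings.
local_population = focal_sum(population)
```

Each result is itself a new map coverage on the same raster frame, so further operations can be chained without any boundary processing.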
CONCLUSIONS
A number of enhancements may be made to the procedure outlined here and to the methods adopted for its evaluation; for example the use of a more sophisticated distance model. When the information gain over the conventional choropleth map and its vector GIS equivalents is considered, the type of product proposed in this paper would appear to be of considerable utility in many of the conventional applications of census data. Reasonably accurate surfaces may be derived for any census-based variable from the data themselves with no digitizing requirement. It may be possible to use a similar approach to construct a model of residential location from small user postcode references, but at the moment, both methods are prone to inaccuracies in the data. In the future, it may be possible to overcome some of these problems by more careful geo-referencing, especially of the results of the UK 1991 census (Wrigley, 1987), which should make such techniques considerably more accurate, and facilitate the full implementation of a raster-based GIS for population-based data.
If human geographers are to grasp the full opportunities of GIS, it is suggested that appropriate techniques for the creation of raster databases be given serious consideration, before the widespread adoption of vector GIS for such data makes possible the reproduction of acknowledged and fundamental geographic errors on a scale never encountered before.
APPENDIX: THE CENTROID DATA DISTRIBUTION ALGORITHM
The basic form of the model is as follows:
$$P_i = \sum_{j=1}^{C} P_j W_{ij} \qquad (1)$$
where $P_i$ is the population estimated to fall within cell $i$ of the matrix; $P_j$ is the empirical population associated with centroid $j$; $C$ is the total number of centroids within the area to be mapped and $W_{ij}$ is the unique weighting based on the distance from $i$ to $j$ and the clustering of other local centroids (this will often be zero). The weighting $W_{ij}$ is determined as
$$W_{ij} = f_j\left(\sum_{l=1}^{k} d_{jl} / k\right) \quad \text{for } j \neq l \qquad (2)$$
where $f_j$ is an appropriately specified distance decay function, incorporating some additional scaling according to the dispersion of the centroids within the local window; $d_{jl}$ is the distance from centroid $j$ (the current centroid in (1)) to centroid $l$, and $l = 1, 2, \ldots, k$ are all centroids falling within the window centred on centroid $j$. The maximum distance of influence is determined by the smaller of $d_j$ or the window size $w$, so that the population of an isolated centroid will be spread as far as the window size allows, while a centroid in a dense cluster may contribute population only to the cell in which it falls. Hence the spatial extent of the populated cells is determined by the density of the local centroids, and within these cells, the estimated values are determined by the distance decay model. To preserve the total population distributed, the $W_{ij}$ associated with each centroid are constrained to sum to 1.0.
At the edges of the estimation area, centroids less than distance $w$ from the limits may lose some of their population to the area beyond the edge of the map, and centroids just outside the area may contribute population to the edge cells. For this reason, the list of centroids used in the creation of a map should include all centroids falling outside the mapped area, but within a margin of width $w$ around the map. If zone boundaries and populations are available for some larger areal units (e.g., wards), a mask may be used to constrain the distribution process so that the population totals of these known zones are preserved.
NOTE
1. In this discussion, use is made of UK census data in the discussion of population-based data. This illustrates a number of problems commonly associated with such data. It should be noted that the technique suggested has been devised to utilize specific features of the UK data.
REFERENCES
BERRY, J. K. (1987) 'Fundamental operations in computer-assisted map analysis', Int. J. Geogr. Information Systems 1, 2: 119-36
BOWMAN, A. W. (1985) 'A comparative study of some kernel-based nonparametric density estimators', J. Stat. Comp. Simul. 21: 313-27
BRACKEN, I., HOLDSTOCK, S. and MARTIN, D. (1987) 'Map manager: intelligent software for the display of spatial information', (Technical Reports in Geo-Information Systems, Computing and Cartography 3, Wales and South West Regional Research Laboratory, Cardiff)
BRACKEN, I. and MARTIN, D. (1989) 'The generation of spatial population distributions from census centroid data', Environ. and Plann. A. forthcoming
CARRUTHERS, A. W. (1985) 'Mapping the population census of Scotland', Cartogr. J. 22, 2: 83-87
CRU/OPCS/GRO(S) (1980) People in Britain: a census atlas (HMSO, London)
DENHAM, C. and RHIND, D. (1983) 'The 1981 census and its results', in RHIND, D. (ed.) A census user's handbook (Methuen, London) pp. 17-88
EASTMAN, J. R. (1987) IDRISI: A grid-based geographic analysis system (Clark University Graduate School of Geography, Worcester, Mass.)
FLOWERDEW, R. and OPENSHAW, S. (1987) 'A review of the problems of transferring data from one set of areal units to another incompatible set', (Research Report 4, Northern Regional Research Laboratory, Newcastle upon Tyne)
FORBES, J. (1984) 'Problems of cartographic representation of patterns of population change', Cartogr. J. 21, 2: 93-102
LAM, N. S. (1983) 'Spatial interpolation methods: a review', Am. Cartogr. 10, 2: 129-49
LAMSAC (1983) 'SASPAC user manual' Version 3.0
MARTIN, D. (1988) 'An approach to surface generation from centroid-type data', (Technical Reports in Geo-Information Systems, Computing and Cartography 5, Wales and South West Regional Research Laboratory, Cardiff)
OPENSHAW, S. (1984) The modifiable areal unit problem, Concepts and Techniques in Modern Geography, 38 (Geo Books, Norwich)
OPENSHAW, S. and TAYLOR, P. J. (1981) 'The modifiable areal unit problem', in WRIGLEY, N. and BENNETT, R. J. (eds) Quantitative geography: a British view (Routledge and Kegan Paul, London)
RHIND, D. (1983) 'Mapping census data', in RHIND, D. (ed.) op. cit. pp. 171-98
RHIND, D. and MOUNSEY, H. (1986) 'The land and people of Britain: a Domesday record, 1986', Trans. Inst. Br. Geogr. N.S. 11: 315-25
RHIND, D. and OPENSHAW, S. (1987) 'The BBC Domesday system: a nationwide GIS for $4448', Proc. Auto Carto 8: 595-603
SANDHU, J. S. and AMUNDSON, S. (1987) The map analysis package for the PC (State University of New York at Buffalo)
SLABY, D. R., CASADY, R. J., MALIN, J. and COAKLEY, J. F. (1979) 'Obstacles to accurate and valid assessment of vital event data', Proc. Auto Carto 4, 1: 188-97
TENG, A. T. (1983) 'Cartographic and attribute database creation for planning analysis through GBF/DIME and census data processing', Proc. Auto Carto 6, 2: 348-50
TOBLER, W. R. (1979) 'Smooth pycnophylactic interpolation for geographical regions', J. Am. Stat. Ass. 74, 367: 519-30
UNWIN, D. (1981) Introductory spatial analysis (Methuen, London)
WIGGINS, L. L. (1986) 'Three low-cost mapping packages for microcomputers', J. Am. Plann. Ass. Autumn: 480-88
WRIGLEY, N. (1987) 'Quantitative methods: gearing up for 1991', Prog. Hum. Geogr. 11: 565-79