Mapping Population Data from Zone Centroid Locations
Author(s): David Martin
Source: Transactions of the Institute of British Geographers, New Series, Vol. 14, No. 1 (1989), pp. 90-97
Published by: Blackwell Publishing on behalf of The Royal Geographical Society (with the Institute of British Geographers)
Stable URL: http://www.jstor.org/stable/622344

DAVID MARTIN
ESRC Research Student, Department of Town Planning, University of Wales, Cardiff, PO Box 906, Cardiff CF1 3YN
Revised MS received 8 November, 1988

ABSTRACT
Geographers are familiar with the difficulties encountered in the use of choropleth maps to represent census-type data, especially those associated with the modifiable areal unit problem. Here, an alternative, raster representation is suggested which seeks to recover some of the underlying distribution from population-weighted centroids. These issues are considered, and found to be of great importance to any attempt to build a geographic information system for population-based data.
KEYWORDS: Choropleth map, Census data, Population-weighted centroid, Surface, Geographic information system

Trans. Inst. Br. Geogr. N.S. 14: 90-97 (1989) ISSN: 0020-2754

INTRODUCTION
The choropleth map, despite its extensive and continued use, has long been acknowledged as a poor medium for the presentation of population-based data. Uncertainty is introduced both by the inappropriate location of zone boundaries for the variable being mapped, and by some potentially misleading features of the choropleth mapping process itself. However, human geographers and planners necessarily make heavy use of such data in the computation of indices of deprivation and need, and for basic population distribution information in many application areas including health, migration, and retailing studies. There is a real danger that these problems may be carried forward unresolved into the rapidly growing field of geographic information systems (GIS), where errors may be compounded by users unfamiliar with the nature of the data. However, the use of GIS techniques with such data also offers the potential for alternative representations. Here, a review of the problems associated with population mapping is followed by a discussion of a raster method for handling census data, based on population-weighted centroids, and its implications for GIS.

PROBLEMS OF MAPPING POPULATION DATA (Note 1)
Analysis of census-derived data frequently involves the mapping of these data in choropleth (vector) form, and the logic of non-graphic processes is frequently vector in nature, such as the use of point-in-polygon or nearest-neighbour techniques for the association of unit postcoded data and census enumeration districts (e.d.s). Frequent use has been made of combinations of census retrieval and mapping software such as SASPAC and GIMMS in the production of hardcopy maps of this type (for example, Carruthers, 1985).
A number of problems are acknowledged to exist with this type of mapping, which relate to the essentially unknown relationship between the data and the highly irregular areal units with which they are associated. This uncertainty is due to the aggregation of individual data to 'imposed' areal units (Unwin, 1981), whose boundaries are not data-derived, but designed for ease of enumeration. As a consequence the data values for each zone may be as much a function of the zone boundary locations as of the underlying distribution. This has become known as the modifiable areal unit problem (Openshaw and Taylor, 1981; Openshaw, 1984; Forbes, 1984). By contrast, many area data relating to the physical environment, such as land parcels, soil types etc., may be mapped using their 'natural' boundaries, and there is a direct correspondence between the zones represented in the data and those existing in the 'real' world.

Printed in Great Britain

FIGURE 1. A choropleth map showing percentage unemployment in the Cardiff Bay area. (The zones are census enumeration districts, and the visual impression is highly misleading) [Map title: Cardiff Bay, % unemployed of econ. active at 1981 census; scale bar 0-3 km]

An additional difficulty with the application of choropleth mapping to small area census data is that the areal units are frequently least appropriate for areas displaying the most extreme socio-economic characteristics, giving them disproportionate and misleading visual impact in the map image. A classic example would be undeveloped docklands areas and other areas of industrial decline, which are frequently adjacent to areas of poor quality, overcrowded housing, containing residents of particular concern in policy terms. E.d.s in such areas typically encompass large areas of industrial or derelict land and areas of open water.
The census data for many of these areas fall into the extremes of the recorded values, and their presentation by inappropriate areal units makes the resulting statistical maps highly misleading, as illustrated by Figure 1, which shows percentage unemployment for the Cardiff Bay area. Figure 2 shows the residential areas on which the data in Figure 1 are based. It is apparent that in a number of enumeration districts high percentage values are based on residential areas which occupy only a small proportion of the e.d. area and are frequently based on small population counts (Rhind, 1983). Such zones frequently dominate the visual impact of the map.

Similar problems will be encountered in rural districts where settlements occupy a very low percentage of the land area, and the population of distinct village communities may be divided between a number of zones comprising mostly agricultural land. The existence of linear settlement, for example, is largely impossible to determine from such data, a notable example being the valley communities of South Wales. Dense valley-floor settlements are separated by relatively unpopulated watersheds, but these regions of highly divergent character are grouped together within enumeration districts which must encompass the entire land surface. This is one of the fundamental weaknesses of the conventional approach: that it fails to identify the (extensive) regions in which the mapped phenomena are not present.

FIGURE 2. The location of residential areas in the Cardiff Bay area. (Many of the e.d.s contain large areas of open water and industrial land, and very small populations) [Map title: Cardiff Bay, enumeration districts at 1981 census, residential areas; scale bar 0-3 km]

The presence of a set of zone boundaries imposes a single geography on all variables derived. This is almost certainly invalid in most contexts: more appropriate would be the geography of the settlement pattern.
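The visual-dominance argument above can be made concrete with a small calculation that compares each zone's share of the shaded map area with its share of the population. This is only an illustrative sketch: the three enumeration districts and every figure attached to them are invented, and none of them is taken from the Cardiff Bay data.

```python
# Hypothetical enumeration districts; all figures are invented for illustration.
zones = [
    {"name": "docklands e.d.", "area_km2": 4.0, "econ_active": 60, "unemployed": 21},
    {"name": "terraced e.d.", "area_km2": 0.1, "econ_active": 400, "unemployed": 40},
    {"name": "suburban e.d.", "area_km2": 0.3, "econ_active": 500, "unemployed": 25},
]

total_area = sum(z["area_km2"] for z in zones)
total_active = sum(z["econ_active"] for z in zones)

rows = []
for z in zones:
    # Percentage unemployed of the economically active, as mapped in a choropleth.
    rate = 100.0 * z["unemployed"] / z["econ_active"]
    # Share of the shaded map area versus share of the population represented.
    area_share = 100.0 * z["area_km2"] / total_area
    pop_share = 100.0 * z["econ_active"] / total_active
    rows.append((z["name"], rate, area_share, pop_share))
    print(f"{z['name']:15s} rate={rate:5.1f}%  "
          f"map area={area_share:5.1f}%  population={pop_share:5.1f}%")
```

With these assumed figures the docklands zone carries the highest rate (35 per cent) and fills about 91 per cent of the shaded area, while holding only about 6 per cent of the economically active population.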
Besides these theoretical issues, there are also pragmatic incentives to find alternative representations of these data: the automated production of vector maps requires digital boundary data, and digitizing is an extremely time consuming and error-prone process. Also, it is to a certain extent application specific, as such boundaries are not of the type to be included in national topographic databases or used in non-census applications. Software for the manipulation of vector datasets is necessarily complex, as are the more advanced vector data structures. However, despite the difficulties outlined, vector systems do have the endearing feature of looking very much like the conventional map products with which geographers and planners are familiar, separating the locational and statistical information into 'zone boundaries' and 'data'.

TOWARDS A SOLUTION
It will never be possible to fully reconstruct the detail of the spatial structure from the aggregate census data, but some spatial disaggregation should be possible either by the use of data collected for grid cells or by interpolation from the population-weighted centroid locations available for census e.d.s. The raster (grid cell) representation produced in this way is implicitly a density surface of the variable concerned.

One approach has been the aggregation of individual data to regular grid cells, which have the advantages of being constant over time, and allowing for the existence of cells with zero population to represent areas where no population is present. The UK 1971 census small area statistics were available for 1 km grid squares, and these formed the basis for the census atlas 'People in Britain' (CRU/OPCS/GRO(S), 1980), but the need to preserve confidentiality meant that data had to be suppressed for a large number of rural grid squares. An additional problem proved to be that 1 km squares are rather too coarse for the analysis of urban areas. In the UK 1981 census however, no grid cell data were made available and the e.d. centroids are the only locational references available without additional digitizing of boundaries. The locations of these points were determined by eye at the Office of Population Censuses and Surveys (OPCS) at the time of the census to represent the 'population-weighted' centroids of each e.d. (Denham and Rhind, 1983). Precise techniques exist for the production of raster maps from geometric zone centroids (Tobler, 1979), but these involve no more information about the underlying distribution than the boundaries themselves, and do not offer any solution to the issue of identifying unpopulated areas.

The only major project to address the creation of raster maps from population-weighted centroid data has been the creation of the population database for the BBC Domesday Project. Two approaches were adopted for the assignment of population values to 1 km grid cells (Rhind and Mounsey, 1986; Flowerdew and Openshaw, 1987; Rhind and Openshaw, 1987). The first of these was simply to assign to a cell the total data values of any centroids falling within that cell. The second approach involved the construction of a Dirichlet tessellation based on the centroid locations, and calculation of cell populations on the basis of the proportional contributions of the Thiessen polygons in which they fell. Some areas which were known from other sources to be unpopulated were then restored. The problem with these techniques is that the first makes the assumption that all an enumeration district's population are concentrated at one point, and the second assumes that the population are uniformly distributed throughout the Thiessen polygon approximation of the e.d.

A CENTROID DATA DISTRIBUTION ALGORITHM
What is required is a technique which draws out all the latent distributional information available from the centroid locations. The centroid grid references from the SASPAC matrix file (LAMSAC, 1983) may be used as 'summary points' of the local geographic distribution in addition to the use of their population counts as a summary of local population totals. A centroid data distribution algorithm has been developed which makes possible the construction of detailed, higher-resolution population surfaces. It should be noted that this is by no means the only approach to the problem, but merely serves to illustrate the possibility of constructing raster representations from the existing data. Moreover, as these are estimations based on aggregate data they do not conflict with the confidentiality restrictions placed on data collected for small areas. The following basic assumptions of the model should be noted:

(1) A centroid defines a location with above average population density for the local area, of which it is a summary point;
(2) A centroid's population is distributed in the surrounding area according to some distance decay function, which has finite extent;
(3) Regions may exist in the population plane in which no population is present. In some cases these areas may be determined a priori (i.e., it is possible to assign zero population values to water areas).

The model proceeds with the placement of a window over each centroid in turn. The size of this window determines the maximum possible extent of a centroid's influence in (2) above, and is of fundamental importance to the model's operation. Each cell falling within the window is then assigned a weighting representing the probability of its receiving a share of the current centroid's population. These weightings are primarily influenced by the distance decay model, but are also scaled according to the local density of the centroids, so that in more densely settled areas, where centroids are closer together, the kernel size used for population distribution may be considerably smaller than the window size. The logic employed is essentially that of a variable-kernel density estimator (Bowman, 1985). At this point, a pre-prepared mask file may be consulted to see if any cells in the current window have been masked from receiving any population, as in (3) above. Once the weightings have been crudely determined, they are re-scaled to sum to 1.0, so as to preserve the total population assigned within the kernel. As the estimation procedure proceeds, some cells will receive a share of the population from a number of different centroids, while others will receive none at all. A more detailed discussion of the estimation procedure may be found in the appendix.

The same procedure may be followed for any variable based on the same centroids. These variable populations may then be expressed as percentages of the total populations in the corresponding cells. The resulting percentage scores are then effectively distance weighted averages of the percentage scores at surrounding centroids.

APPLICATION EXAMPLE
To illustrate the utility of such a technique, an application to a 40 x 40 km area of South Wales, including Cardiff and a number of significant 'Valleys' towns is described. As already demonstrated, the settlement patterns of the region are poorly represented by traditional census geography. A mask was prepared by digitizing the coastline and rasterizing the small area of the Bristol Channel falling within the area of the study. A vector overlay of the primary roads indicates that the population surface, with clearly identifiable settlements, gives a very 'good' picture of the actual population distribution, as illustrated in Figure 3.

However, if the model is to be evaluated and the optimum window size determined, some objective measure of the goodness-of-fit of the estimated surface is required. One approach to the evaluation of real world data is to obtain a raster image of the residential areas from an alternative source.
This map may then be used as an empirical dataset against which the model estimate may be evaluated, treating each cell as simply 'populated' or 'unpopulated'. Perhaps the most appropriate source of such an image is satellite imagery, but suitable images have not been available during the course of this work, and a map of residential areas has been digitized and rasterized to provide a mask of the areas which should be populated. This image can only be viewed as an approximation to reality, and may contain errors which are not present in the estimated surface, but the majority of populated cells will be correctly represented. Using this technique for the image illustrated, 70 per cent of populated cells are correctly identified, and 65 per cent of those estimated to be populated are in fact so. If very small centres (represented by an isolated centroid) could be digitized, and the constituent source maps (1:50 000 sheets) were all exactly correct for the situation in 1981, it is estimated that these figures could be improved by 5-10 per cent. At present, 92 per cent of all 40 000 cells in the map are assigned to the correct category.

The difficulty with such an approach to evaluation is that it merely tests the spatial distribution of populated cells, as illustrated by Figure 3, but offers no measure of the accuracy of the population values assigned to each cell. In an attempt to understand the effects of the distance decay function more fully, some consideration has been given to simulation techniques (Martin, 1988). More sophisticated evaluation techniques would allow a more careful analysis of the effects of cell and window size. It should be remembered that centroid references are only to the nearest 100 m. The most appropriate window size in any area will be some function of the areal extent of the zones represented by the centroids. The crude evaluation outlined above suggests that a window size of 200 m is most appropriate.
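The cell-by-cell comparison described above, in which each cell is treated as simply 'populated' or 'unpopulated', can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: the function name `evaluate_populated_cells`, the key names and the 3 x 3 example grids are all hypothetical. The three scores correspond in kind to the three figures quoted in the text (populated cells correctly identified, estimated cells that are in fact populated, and cells assigned to the correct category).

```python
def evaluate_populated_cells(estimated, reference):
    """Compare an estimated population grid with a populated/unpopulated mask.

    estimated: rows of population estimates (a cell counts as populated if > 0).
    reference: rows of 1/0 flags from an independent map of residential areas.
    """
    tp = fp = fn = tn = 0
    for est_row, ref_row in zip(estimated, reference):
        for est_value, ref_flag in zip(est_row, ref_row):
            est_flag = est_value > 0
            if est_flag and ref_flag:
                tp += 1          # populated in both
            elif est_flag:
                fp += 1          # estimated populated, reference empty
            elif ref_flag:
                fn += 1          # reference populated, estimate empty
            else:
                tn += 1          # empty in both
    total = tp + fp + fn + tn
    return {
        "populated_cells_identified": tp / (tp + fn) if tp + fn else None,
        "estimated_cells_correct": tp / (tp + fp) if tp + fp else None,
        "overall_agreement": (tp + tn) / total if total else None,
    }


# Hypothetical 3 x 3 grids, invented for illustration only.
estimated = [[5.0, 0.0, 0.0],
             [2.0, 3.0, 0.0],
             [0.0, 0.0, 0.0]]
reference = [[1, 1, 0],
             [1, 0, 0],
             [0, 0, 0]]
scores = evaluate_populated_cells(estimated, reference)
print(scores)
```

As the text notes, scores of this kind test only the spatial pattern of populated cells; they say nothing about the accuracy of the population values assigned to each cell.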
This finding is consistent with applications to other areas of South Wales and the County of Avon.

IMPLICATIONS FOR GIS
Bracken and Martin (1989) discuss this technique with reference to census mapping, but the issues related are also of considerable relevance to recent work on the integration of socio-economic data into GIS. If the manner in which spatial data are transformed by the static map is of fundamental importance to geographical analysis, then it is of even greater significance to GIS which offer the potential for complex analyses and modelling of these data. In this context, the distinction between systems using vector and raster representations of spatial data is important, and each should be considered separately. The logic of vector systems is essentially that of the choropleth map, while raster systems assume data such as that derived from the technique suggested above.

FIGURE 3. A raster map of populated areas generated from census centroids for a 40 x 40 km area of Cardiff and the South Wales Valleys. The cell size is 200 m. (The spatial form of the settlement pattern is recovered in considerable detail. A population estimate is provided for each shaded cell)

The most significant application of vector techniques to census data has been the US Bureau of the Census DIME system (Teng, 1983), although more generally the most common uses for such systems have been in cadastral and utility information systems. In these instances the data represent spatial objects whose location may be precisely defined: according to the criteria outlined above, data are mapped according to 'natural' rather than 'imposed' areal units. Tentative applications of GIS to socio-economic data in the UK have generally involved vector systems and have frequently been microcomputer based (Wiggins, 1986; Bracken et al., 1987).
Unfortunately, this automated duplication of traditional mapping is prey to all the well-acknowledged difficulties with these techniques, and may indeed offer opportunities for the compounding of errors and proliferation of basically invalid analyses (Slaby et al., 1979).

Raster maps, in contrast to the 'fixed geography' concept of the vector systems outlined above, are essentially models of the variable depicted, where the geography (locational information) is integral to the data layer. Each variable may therefore have a unique data-derived geography, with no fixed boundaries, within the limits of the raster resolution. It should be stressed that with the exception of remotely sensed imagery very few datasets are actually captured in raster form. A more common approach is to gather point, line or area (i.e., vector) data and to use one of a variety of interpolation algorithms to derive a surface model (Lam, 1983). Once the data are in raster form, a wide range of cartographic modelling operations may be performed, using combinations of mathematical, boolean and neighbourhood functions on the map sheets, producing new map coverages at each stage of the analysis (Berry, 1987). These operations are generally easy to program, basically involving simple matrix operations, and hence a number of packages are available for the PC with considerable functionality (e.g., Eastman, 1987; Sandhu and Amundson, 1987). Other advantages of raster representations include the compatibility of data over time and from different sources, providing they may be transformed, enhanced or degraded into a common raster frame. These types of operation are notoriously complex using vector-based systems.

CONCLUSIONS
When the information gain over the conventional choropleth map and its vector GIS equivalents is considered, the type of product proposed in this paper would appear to be of considerable utility in many of the conventional applications of census data. Reasonably accurate surfaces may be derived for any census-based variable from the data themselves with no digitizing requirement. It may be possible to use a similar approach to construct a model of residential location from small user postcode references, but at the moment, both methods are prone to inaccuracies in the data. In the future, it may be possible to overcome some of these problems by more careful geo-referencing, especially of the results of the UK 1991 census (Wrigley, 1987), which should make such techniques considerably more accurate, and facilitate the full implementation of a raster-based GIS for population-based data.

If human geographers are to grasp the full opportunities of GIS, it is suggested that appropriate techniques for the creation of raster databases be given serious consideration, before the widespread adoption of vector GIS for such data makes possible the reproduction of acknowledged and fundamental geographic errors on a scale never encountered before.

APPENDIX: THE CENTROID DATA DISTRIBUTION ALGORITHM
The basic form of the model is as follows:

$$P_i = \sum_{j=1}^{C} P_j W_{ij} \qquad (1)$$

where $P_i$ is the population estimated to fall within cell $i$ of the matrix; $P_j$ is the empirical population associated with centroid $j$; $C$ is the total number of centroids within the area to be mapped and $W_{ij}$ is the unique weighting based on the distance from $j$ to $i$ and the clustering of other local centroids (this will often be zero). The weighting $W_{ij}$ is determined as

$$W_{ij} = f\left( d_{ij} \Big/ \sum_{l=1}^{k} d_{jl} \right) \quad \text{for } j \neq l \qquad (2)$$

where $f$ is an appropriately specified distance decay function, incorporating some additional scaling according to the dispersion of the centroids within the local window; $d_{jl}$ is the distance from centroid $j$ (the current centroid in (1)) to centroid $l$, and $l = 1, 2, \ldots, k$ are all centroids falling within the window centred on centroid $j$. The maximum distance of influence is determined by the smaller of $d_j$ or the window size $w$, so that the population of an isolated centroid will be spread as far as the window size allows, while a centroid in a dense cluster may contribute population only to the cell in which it falls. Hence the spatial extent of the populated cells is determined by the density of the local centroids, and within these cells, the estimated values are determined by the distance decay model. To preserve the total population distributed, the weightings $W_{ij}$ associated with each centroid are constrained to sum to 1.0.

At the edges of the estimation area, centroids less than distance $w$ from the limits may lose some of their population to the area beyond the edge of the map, and centroids just outside the area may contribute population to the edge cells. For this reason, the list of centroids used in the creation of a map should include all centroids falling outside the mapped area, but within a margin of width $w$ around the map. If zone boundaries and populations are available for some larger areal units (e.g., wards), a mask may be used to constrain the distribution process so that the population totals of these known zones are preserved. A number of enhancements may be made to the procedure outlined here and to the methods adopted for its evaluation; for example the use of a more sophisticated distance model.

NOTE
1. In this discussion, use is made of UK census data in the discussion of population-based data. This illustrates a number of problems commonly associated with such data. It should be noted that the technique suggested has been devised to utilize specific features of the UK data.

REFERENCES
BERRY, J. K. (1987) 'Fundamental operations in computer-assisted map analysis', Int. J. Geogr. Information Systems 1, 2: 119-36
BOWMAN, A. W. (1985) 'A comparative study of some kernel-based nonparametric density estimators', J. Stat. Comp. Simul. 21: 313-27
BRACKEN, I., HOLDSTOCK, S. and MARTIN, D. (1987) 'Map manager: intelligent software for the display of spatial information', (Technical Reports in Geo-Information Systems, Computing and Cartography 3, Wales and South West Regional Research Laboratory, Cardiff)
BRACKEN, I. and MARTIN, D. (1989) 'The generation of spatial population distributions from census centroid data', Environ. and Plann. A. forthcoming
CARRUTHERS, A. W. (1985) 'Mapping the population census of Scotland', Cartogr. J. 22, 2: 83-87
CRU/OPCS/GRO(S) (1980) People in Britain: a census atlas (HMSO, London)
DENHAM, C. and RHIND, D. (1983) 'The 1981 census and its results', in RHIND, D. (ed.) A census user's handbook (Methuen, London) pp. 17-88
EASTMAN, J. R. (1987) IDRISI: A grid-based geographic analysis system (Clark University Graduate School of Geography, Worcester, Mass.)
FLOWERDEW, R. and OPENSHAW, S. (1987) 'A review of the problems of transferring data from one set of areal units to another incompatible set', (Research Report 4, Northern Regional Research Laboratory, Newcastle upon Tyne)
FORBES, J. (1984) 'Problems of cartographic representation of patterns of population change', Cartogr. J. 21, 2: 93-102
LAM, N. S. (1983) 'Spatial interpolation methods: a review', Am. Cartogr. 10, 2: 129-49
LAMSAC (1983) 'SASPAC user manual' Version 3.0
MARTIN, D. (1988) 'An approach to surface generation from centroid-type data', (Technical Reports in Geo-Information Systems, Computing and Cartography 5, Wales and South West Regional Research Laboratory, Cardiff)
OPENSHAW, S. (1984) The modifiable areal unit problem, Concepts and Techniques in Modern Geography, 38 (Geo Books, Norwich)
OPENSHAW, S. and TAYLOR, P. J. (1981) 'The modifiable areal unit problem', in WRIGLEY, N. and BENNETT, R. J. (eds) Quantitative geography: a British view (Routledge and Kegan Paul, London)
RHIND, D. (1983) 'Mapping census data', in RHIND, D. (ed.) op. cit. pp. 171-98
RHIND, D. and MOUNSEY, H. (1986) 'The land and people of Britain: a Domesday record, 1986', Trans. Inst. Br. Geogr. N.S. 11: 315-25
RHIND, D. and OPENSHAW, S. (1987) 'The BBC Domesday system: a nationwide GIS for $4448', Proc. Auto Carto 8: 595-603
SANDHU, J. S. and AMUNDSON, S. (1987) The map analysis package for the PC (State University of New York at Buffalo)
SLABY, D. R., CASADY, R. J., MALIN, J. and COAKLEY, J. F. (1979) 'Obstacles to accurate and valid assessment of vital event data', Proc. Auto Carto 4, 1: 188-97
TENG, A. T. (1983) 'Cartographic and attribute database creation for planning analysis through GBF/DIME and census data processing', Proc. Auto Carto 6, 2: 348-50
TOBLER, W. R. (1979) 'Smooth pycnophylactic interpolation for geographical regions', J. Am. Stat. Ass. 74, 367: 519-30
UNWIN, D. (1981) Introductory spatial analysis (Methuen, London)
WIGGINS, L. L. (1986) 'Three low-cost mapping packages for microcomputers', J. Am. Plann. Ass. Autumn: 480-88
WRIGLEY, N. (1987) 'Quantitative methods: gearing up for 1991', Prog. Hum. Geogr. 11: 565-79
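As a supplement to the appendix model, in which each cell total is the sum of centroid populations multiplied by weightings that sum to 1.0 for every centroid, the following is a minimal sketch of how such a distribution procedure might be programmed. It is not the author's implementation. The function name `distribute_centroids`, the linear distance decay, the use of the mean distance to neighbouring centroids (capped at the window size) as the kernel radius, and the representation of the mask as a set of cell indices are all assumptions made for illustration.

```python
import math


def distribute_centroids(centroids, ncols, nrows, cell_size, window, mask=None):
    """Spread centroid populations over a grid so that each centroid's
    weightings sum to 1.0 (a sketch of P_i = sum_j P_j * W_ij).

    centroids: list of (x, y, population); x, y in metres from the grid's
               lower-left corner (may lie in a margin outside the grid).
    mask:      optional set of (col, row) cells that must stay unpopulated.
    Returns a list of rows, each a list of cell population estimates.
    """
    mask = mask or set()
    surface = [[0.0] * ncols for _ in range(nrows)]
    for j, (xj, yj, pop) in enumerate(centroids):
        # Assumed scaling rule: kernel radius is the mean distance to other
        # centroids inside the window, never larger than the window itself.
        near = []
        for l, (xl, yl, _) in enumerate(centroids):
            d_jl = math.hypot(xj - xl, yj - yl)
            if l != j and d_jl <= window:
                near.append(d_jl)
        radius = min(sum(near) / len(near), window) if near else window

        reach = int(math.ceil(radius / cell_size))
        cj, rj = int(xj // cell_size), int(yj // cell_size)
        weights = {}
        for row in range(rj - reach, rj + reach + 1):
            for col in range(cj - reach, cj + reach + 1):
                if (col, row) in mask:
                    continue  # masked cells receive no population
                d_ij = math.hypot((col + 0.5) * cell_size - xj,
                                  (row + 0.5) * cell_size - yj)
                if d_ij < radius:
                    weights[(col, row)] = 1.0 - d_ij / radius  # assumed decay
        # A centroid in a dense cluster contributes only to its own cell.
        if not weights and (cj, rj) not in mask:
            weights[(cj, rj)] = 1.0

        total = sum(weights.values())
        for (col, row), wt in weights.items():
            # Shares falling beyond the map edge are lost, as in the appendix.
            if 0 <= col < ncols and 0 <= row < nrows:
                surface[row][col] += pop * wt / total
    return surface


# Hypothetical example: one isolated centroid and a tight pair, 200 m cells.
centroids = [(1000.0, 1000.0, 500), (300.0, 300.0, 200), (350.0, 300.0, 300)]
surface = distribute_centroids(centroids, ncols=10, nrows=10,
                               cell_size=200.0, window=600.0)
print(round(sum(map(sum, surface)), 6), surface[1][1])
```

In this invented example the isolated centroid is spread over the cells within 600 m of it, while the two clustered centroids fall back to the single cell that contains them. Other decay functions or scaling rules could be substituted without changing the overall structure.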