GeoDa™ 0.9.5-i Release Notes

Luc Anselin

Spatial Analysis Laboratory
Department of Agricultural and Consumer Economics
University of Illinois, Urbana-Champaign
Urbana, IL 61801
http://sal.agecon.uiuc.edu/

Center for Spatially Integrated Social Science
http://www.csiss.org/

Revised, January 20, 2004

Copyright © 2003-2004 Luc Anselin, All Rights Reserved

Contents

Preface  1
What’s New in GeoDa 0.9.5-i  3
    New Features  3
    Refinements and Improvements of Existing Features  4
    Bug Fixes  6
Menu Structure and Toolbar Buttons  7
    Overview  7
    Menu Items  9
Manipulating Spatial Data  13
    Creating Grid Polygon Shape Files  13
    Creating Polygon Shape Files from BND Input  15
    Creating Spatial Weights  18
    Thiessen Polygons  20
Mapping  23
    Cartogram  23
    Map Movie  26
Exploratory Data Analysis  29
    Parallel Coordinate Plot  29
    3D Scatter Plot  33
    Conditional Plot  38
    Histogram  43
    Box Plot  43
Spatial Regression Analysis  45
    Regression Interface  46
    Ordinary Least Squares with Diagnostics  48
    Maximum Likelihood in Spatial Lag Model  53
    Maximum Likelihood in Spatial Error Model  55
Bibliography  58

List of Figures

1  The initial menu and toolbar  7
2  Opening window after loading the SIDS2 sample data set  8
3  The complete menu and toolbar buttons  8
4  The tools menu item  9
5  The methods menu item  9
6  Map menu  10
7  Map toolbar  10
8  Explore menu  11
9  Explore toolbar  11
10  The table menu item  11
11  Space menu  12
12  Space toolbar  12
13  The creating grid dialog  13
14  A 10 by 10 regular lattice  14
15  Format of bounding box text input file  14
16  Create a grid from bounding box in a shape file  15
17  North Carolina counties with matching 5 by 20 regular lattice  16
18  Shape file from a boundary text file  16
19  Boundary file format  16
20  Columbus shape and table from text boundary file  17
21  Options for higher order contiguity  18
22  Distance cutoff in miles  19
23  Distance in k nearest neighbor weights files  19
24  Weights characteristics with islands  20
25  Bounding box option for Thiessen polygons  20
26  Default bounding box for Thiessen polygons  22
27  Polygon-based bounding box for Thiessen polygons  22
28  Circular cartogram for North Carolina Sids rates (SIDR74)  24
29  Selection of outlier hinge in cartogram  24
30  Outliers in Sids rate cartogram using a hinge of 3  25
31  Improving the layout of the cartogram  25
32  Outliers in Sids rate cartogram linked to base map  26
33  Starting a cumulative map movie  27
34  Pausing a cumulative map movie  28
35  A completed cumulative map movie  28
36  Variable selection for PCP  30
37  PCP variables selected  30
38  PCP for Columbus variables  30
39  PCP change variable order  31
40  PCP options  31
41  PCP using standardized variables  32
42  Brushing the PCP  32
43  Variable selection for 3D scatter plot  33
44  3D scatter plot initial view  34
45  Rotated 3D scatter plot  34
46  Selection box in 3D scatter plot  35
47  Brushing the 3D scatter plot using the slider  36
48  Free form brushing of the 3D scatter plot  36
49  Linking between 3D scatter plot and other windows  37
50  Linking between map and 3D scatter plot  37
51  Types of conditional plots  38
52  Conditional plot variable selection  39
53  Starting up the conditional plots  39
54  Conditional map plot  39
55  Moving the handles in the conditional plot  40
56  Conditional box plot  41
57  Conditional histogram  42
58  Conditional scatter plot  43
59  New look histogram  44
60  New look box plot  44
61  Regression analysis output settings  46
62  Variable selection for regression analysis  47
63  Spatial weights file selection  48
64  Starting the regression analysis, classic model  49
65  Saving predicted values and residuals, classic model  50
66  Selecting variable names for saved predicted values and residuals, classic model  50
67  Predicted values and residuals added to table  51
68  Finishing the regression analysis, classic model  51
69  Results, classic model  52
70  Spatial regression analysis, lag model  53
71  Results, spatial lag model  54
72  Spatial regression analysis, error model  55
73  Results, spatial error model  56

Preface

These release notes pertain to the third official release of the GeoDa™ software for geodata analysis, an upgrade to Version 0.9.5-i, released on January 23, 2004. The first release dates back to February 5, 2003. The notes complement the GeoDa™ 0.9 User’s Guide (Anselin 2003) that accompanied the second official release of the software, Version 0.9.3, released on June 4, 2003. In the remainder of the release notes, that document will be referred to as the User’s Guide. Many important aspects of the use of the software are not repeated here.
All the basic functions, background information on the software and the full text of all relevant licenses are included in the User’s Guide. The current release notes only document additions and changes to the software, and should be used together with the Version 0.9 User’s Guide.

The development behind this release of GeoDa™ has been facilitated by the continued research support through the U.S. National Science Foundation grant BCS-9978058 to the Center for Spatially Integrated Social Science (CSISS), and by grant RO1 CA 95949-01 from the National Cancer Institute. Funding sources for earlier versions of the software and its antecedents can be found in the User’s Guide.

Many thanks go to the students in the Fall 2003 classes of ACE 492SA, Spatial Analysis, and ACE 492SE, Spatial Econometrics, offered through the Department of Agricultural and Consumer Economics, University of Illinois, Urbana-Champaign, for being such good sports in serving as guinea pigs for various iterations of what came to be called “version 095i.” GeoDa’s growing user community contributed considerably as well, with bug reports, requests for features and other useful feedback from too many users to be listed individually. Their continued interest is greatly appreciated.

Trademarks and Licenses

• GeoDa™ is a trademark of Luc Anselin, All Rights Reserved.
• GeoDa incorporates licensed libraries from ESRI’s MapObjects LT2; ESRI, ArcView, ArcGIS and MapObjects are trademarks of Environmental Systems Research Institute, Redlands, CA
• GeoDa incorporates code derived from publicly available sources under various generous licenses (the detailed licenses are listed in the Appendix of the User’s Guide):
  – the MFC Grid Control 2.24 by Chris Mauder
  – the ANN Code by David Mount and Sunil Arya
  – the Thiessen polygon algorithm of Yasuaki Oishi
• other companies and products mentioned herein are trademarks or registered trademarks of their respective trademark owners

The GeoDa Team

• Project Director: Luc Anselin
• Software Design and Development: Luc Anselin, Ibnu Syabri, Youngihn Kho and Oleg Smirnov
• Technical Documentation and Training Materials: Luc Anselin and Julia Koschinsky

What’s New in GeoDa 0.9.5-i

GeoDa 0.9.5-i contains several minor improvements and bug fixes to the previous version, as well as totally new functionality for mapping (cartogram), exploratory data analysis (parallel coordinate plot, 3D scatter plot and conditional plots) and spatial regression. A brief outline of the major changes and innovations is given next.¹ A more complete discussion of the features, methodological background and relevant user interface is given in the remaining sections.

¹ To highlight the new items, they are given blue section headings in the remainder of the release notes.

New Features

Data Manipulation

• the creation of polygon shape files for regular lattices or grids from basic user input on the structure of the lattice
• the construction of polygon shape files from boundary information contained in an ASCII input file

Mapping

• a circular cartogram, implementing Dorling’s cellular automata algorithm (Dorling 1996), fully linked and brushable
• conditional maps (see conditional plots)

Exploratory Data Analysis

• parallel coordinate plot (PCP) for multivariate data exploration, with linking and brushing
• three-dimensional scatter plot, with linking and brushing
• conditional plots, using two conditioning variables to explore the distribution of a third variable
  – conditional map, box plot, histogram and scatter plot

Spatial Regression

• Ordinary Least Squares regression with full diagnostics for spatial effects (Moran’s I, Lagrange Multiplier statistics), as well as the usual tests against heteroskedasticity and non-normality
• Maximum Likelihood estimation of the spatial lag and the spatial error models, with asymptotic inference

Refinements and Improvements of Existing Features

User Interface

• the standard menu structure has been slightly reorganized
  – new items for Table, Space, and Regress
  – the spatial autocorrelation analysis was moved from the earlier Explore menu to Space
• an item for Methods has been added to the opening menu to allow regression analysis without starting a project (i.e., without loading a shape file into the project)
• the toolbar buttons have been slightly reorganized
  – new toolbar buttons have been added for the Cartogram, PCP, Conditional Plot, and the 3-D Scatter Plot
  – a new dockable toolbar is included with buttons to activate the various map functions (quantile map, box map, standard deviational map, percentile map, cartogram and map movie)
  – the spatial autocorrelation toolbar buttons were separated from the EDA toolbar
• the toolbar buttons for opening a new map and duplicating a map have a new look
• the default for the map window now shows the legend; in earlier versions the user needed to explicitly open the legend pane by dragging the left side of the window to the right

Spatial Data Manipulation

• a custom bounding box can be specified in the creation of Thiessen polygons; previously, only the bounding box for the points themselves was used
• higher order contiguity calculation now contains an option for the inclusion of all lower order neighbors (the previous default) or the computation of a “pure” higher order
contiguity
• distance weights files and k-nearest neighbor weights now contain the “correct” distance between the points as the third column of the GWT file; previously this value was rescaled and not useful for interpretation
• the “treshold” typo has been fixed
• the weights characteristics histogram has a new look and more flexible classifications, with islands properly included as having zero neighbors

Mapping

• the default selection tool for point shape files is now the circle (previously, it was a rectangle)
• the Map Movie was thoroughly revised, with a new interface, allowing interactive starting and stopping (pause) as well as rewind and step through

Exploratory Data Analysis

• the Histogram sports a new look and uses a continuous color ramp
• the Box Plot has been slightly redesigned and shows the median more distinctly

Table

• Table functions can now be invoked from the main menu; previously, this was only possible by right-clicking in the table
• tables can be saved after the deletion of columns (variables)

Bug Fixes

• the classification for the percentile map is now correct; in the earlier versions, an extra observation may have been included in the top percentile
• the computation of the standard deviation is fixed; this could affect many functions, including the classifications in the standard deviational map and the computations in the Moran scatter plot and the Local Moran
• the coordination between LISA maps for different variables is fixed; there were situations where all the maps became identical when a second variable was analyzed
• problems with selection in the map when using different selection tools are fixed; switching between selection tools could previously produce inconsistent selections
• various minor bugs in table and rate calculations were fixed
• several issues related to “weird” out of memory errors when loading tables or shape files were fixed; the out of memory error had in fact nothing to do with memory, but indicated problems with file formats;
all known such problems have been fixed (including the one where the file name could not start with a “T”)

Menu Structure and Toolbar Buttons

The menu structure has been slightly reorganized and new toolbars have been added to facilitate map construction and spatial autocorrelation analysis. Below, an overview of the main structure is given, followed by a detailed look at the changes in and additions to individual menus and toolbars.

Overview

As in Version 0.9.3, the window that appears after the program has been launched contains a simplified menu that allows access to Tools, such as spatial weights construction and spatial data transformations, without having to explicitly start a project (and load a shape file). A Methods item has been added to this menu, to invoke the spatial regression functionality directly. This is especially useful when analyzing larger data sets, since it avoids the need to update all linked windows, including potentially a very large data table. The initial menu is shown in Figure 1. As before, only two items on the toolbar are active, the first of which is used to launch a project, as illustrated in the figure.

Figure 1: The initial menu and toolbar

After opening the project, the usual dialog requests the file name of the shape file and the Key variable. After clicking on the OK button, a map window is opened, showing the base map for the analyses, as in Figure 2. The main difference with earlier versions is that the default window shows (part of) the legend pane on the left hand side. As before, this can be resized by dragging the separator between the two panes (the legend pane and the map pane) to the right or left.

Figure 2: Opening window after loading the SIDS2 sample data set

With a shape file loaded, the complete menu and all toolbars are active, as in Figure 3. The menu bar contains three new items: Table, Space and Regress.
The toolbar has two new dockable sets of buttons, Space and Map, and a slightly reorganized Explore toolbar. The Weights toolbar has been moved to the left. Two icons on the Edit toolbar sport a new look.

Figure 3: The complete menu and toolbar buttons

Menu Items

Tools Menu

The Tools menu is available both with and without a loaded shape file and is identical in both cases. As shown in Figure 4, there are two new items. Tools > Shape > Polygons from Grid constructs a polygon shape file for a regular lattice or grid, based on simple user input, such as the coordinates of the lower-left and upper-right corners, the number of rows and the number of columns (see p. 13). Tools > Shape > Polygons from BND creates a polygon shape file based on the boundary coordinates contained in an ASCII input file (see p. 15).

Figure 4: The tools menu item

Methods Menu

The Methods menu is only available when no shape file has been loaded into the project. Its only use is to invoke the spatial regression functionality, as shown in Figure 5. The interface is identical to that used in the Regress menu inside a project (see p. 45).

Figure 5: The methods menu item

There is a major difference between the use of the regression functionality through the menu shown in Figure 5 and through the Regress item in the main menu in Figure 3. When invoked without a shape file with the Methods menu, the regression analysis reads the data directly from the dBase file, without showing the table contents in the window. This avoids the overhead required for the linking and brushing and is typically the only practical way to analyze large data sets (10,000 observations and more).

Map Menu

The Map menu (Figure 6) contains one new item, the Cartogram (see p. 23). The Map toolbar, shown in Figure 7, is new.
It contains buttons to invoke the familiar choropleth map types (from left to right: quantile, percentile, standard deviational, and box map, with two fences), the cartogram and the map movie (both single and cumulative, see p. 26).

Figure 6: Map menu

Figure 7: Map toolbar

Explore Menu

The Explore menu includes three new items: the Parallel Coordinate Plot (see p. 29), the 3D Scatter Plot (see p. 33), and the Conditional Plot (see p. 38), as shown in Figure 8. The spatial autocorrelation analysis items and the table have been moved to their own menu (see p. 12). The Explore toolbar, Figure 9, contains the icons for the six EDA functions, as well as a button to activate the data Table.

Figure 8: Explore menu

Figure 9: Explore toolbar

Table Menu

The Table menu (Figure 10) contains all the operations on table elements.

Figure 10: The table menu item

Note that the items in the Table menu are identical to what is obtained when right clicking in an active table.

Space Menu

The Space menu is new and groups the functions to carry out spatial autocorrelation analysis, as illustrated in Figure 11. In previous versions, these were included in the Explore menu. The matching toolbar buttons (Figure 12) are combined in a separately dockable toolbar.

Figure 11: Space menu

Figure 12: Space toolbar

Manipulating Spatial Data

Two new spatial data input functions have been added to the Tools menu. There are also minor changes in the spatial weights calculations, and a new option was added to the construction of Thiessen polygons.

Creating Grid Polygon Shape Files

Tools > Shape > Polygons from Grid gives the ability to construct a polygon shape file for a regular lattice or grid from simple user input. The regular grid is either square or rectangular and has the observation numbers starting in the upper left corner and increasing to the right, and then down, row by row. Figure 13 illustrates the main dialog.
Figure 13: The creating grid dialog

The simplest approach is to enter the coordinates for the lower left and upper right corner and to specify the number of rows and columns, as shown in Figure 13. For the example given there, the result is a 10 by 10 regular lattice, as in Figure 14. The shape file contains three fields: an identifier (POLYID), the area of the grid cell (AREA), and the perimeter (PERIMETER). Any other data need to be added by means of the table join functionality.

Figure 14: A 10 by 10 regular lattice

A second approach is to read the bounding box information from a text file. This file has a very simple format, as shown in Figure 15. It contains the number of rows, the number of columns, the X,Y coordinates for the lower left corner, and the X,Y coordinates for the upper right corner. These items can be on the same line, separated by white space (space or tab), or on consecutive lines; a comma-separated file does not work.

Figure 15: Format of bounding box text input file

Yet a third approach bases the grids on the bounding box associated with a shape file. Note that this approach is only correct for projected shapes. When the coordinates are unprojected lat-lon, there may be distortions for larger extents. The corner coordinates of the bounding box (as read from the shape file) determine the extent of the lattice. The size of the individual grid cells follows from the number of rows and columns specified, as shown in Figure 16, using the extent of the SIDS shape file as the bounding box.

Figure 16: Create a grid from bounding box in a shape file

The result is illustrated in Figure 17. To illustrate the effect of the bounding box choice, the original outline of the North Carolina counties is superimposed on the 5 by 20 lattice, after applying an Edit > Add Layer command. Note the slight distortion in the grid cells, due to the fact that the SIDS shape file is unprojected.
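The grid construction described in this section can be sketched in a few lines of Python. This is an illustration only and not GeoDa's own code: the function names parse_bbox_text and grid_cells are hypothetical, the order of the six values in the text file follows the description of Figure 15, and starting POLYID at 1 is an assumption. Cells are numbered from the upper left corner, increasing to the right and then down, row by row.

```python
def parse_bbox_text(text):
    """Read rows, columns, lower-left X,Y and upper-right X,Y.

    Values may be separated by spaces, tabs or line breaks.
    A comma-separated file is rejected with a ValueError.
    """
    tokens = text.split()
    if len(tokens) != 6:
        raise ValueError("expected 6 white-space separated values")
    rows, cols = int(tokens[0]), int(tokens[1])
    xmin, ymin, xmax, ymax = (float(t) for t in tokens[2:])
    return rows, cols, (xmin, ymin), (xmax, ymax)


def grid_cells(rows, cols, lower_left, upper_right):
    """Return one record per grid cell, ordered from the upper left corner."""
    xmin, ymin = lower_left
    xmax, ymax = upper_right
    dx = (xmax - xmin) / cols   # cell width
    dy = (ymax - ymin) / rows   # cell height
    cells = []
    poly_id = 1                 # assumption: identifiers start at 1
    for r in range(rows):       # r = 0 is the top row
        top = ymax - r * dy
        bottom = top - dy
        for c in range(cols):   # c = 0 is the leftmost column
            left = xmin + c * dx
            right = left + dx
            cells.append({
                "POLYID": poly_id,
                "AREA": dx * dy,
                "PERIMETER": 2 * (dx + dy),
                "corners": [(left, bottom), (left, top),
                            (right, top), (right, bottom)],
            })
            poly_id += 1
    return cells
```

For example, grid_cells(*parse_bbox_text("10 10 0 0 10 10")) yields 100 unit cells. AREA and PERIMETER are computed in the units of the input coordinates, which is why the result is only strictly meaningful for projected coordinates.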
Creating Polygon Shape Files from BND Input

Tools > Shape > Polygons from BND creates a polygon shape file from the boundary information contained in a text input file. The dialog, as shown in Figure 18, requires the name of the output shape file and the input text file. The input file must follow a very specific format, similar to the formats used in the shape to BND function. The only format supported so far is the “1a” BND format, as specified in GeoDa’s shape output function. The format is spelled out when the help feature is invoked, by clicking the question mark in the dialog shown in Figure 18, yielding Figure 19.

Figure 17: North Carolina counties with matching 5 by 20 regular lattice

Figure 18: Shape file from a boundary text file

Figure 19: Boundary file format

The supported file format for the text file consists of a header line, containing the number of observations and the variable name for the Key variable, separated by a comma. Next, for each observation follows a line with the ID and the number of vertices that define the polygon, again comma-separated. Then, the X,Y coordinates are given, comma-separated and on a separate line for each point. This is repeated for each polygon in the data set. For example, the contents of the input file for the Columbus data would be:

    49,POLYID
    1,14
    8.62413,14.237
    8.5597,14.7424
    8.80945,14.7344
    ...
    8.6429,14.0897
    8.63259,14.1706
    8.62583,14.2237
    2,46
    ...

The resulting shape file can be loaded into GeoDa in the usual way. For example, using the input text file for Columbus yields the shape file shown in Figure 20. The data table, also illustrated in the figure, contains four fields: the original identifier (POLYID), AREA, PERIMETER, and a simple sequential identifier (RECORD ID).

Figure 20: Columbus shape and table from text boundary file

Creating Spatial Weights

The Weights functionality in the Tools menu has been revised slightly.
This affects higher order contiguity computation, distance-based weights and the weights characteristics.

Higher Order Contiguity

Tools > Weights > Create invokes the usual dialog. There is a new check box below the selection of the order of contiguity, as shown in Figure 21. Selecting this option includes all the lower order neighbors up to the order specified. The default (check box left unchecked) only computes “pure” higher order contiguity, which does not include the lower order neighbors.

Figure 21: Options for higher order contiguity

Creating Distance Weights

The distance weights calculation now uses the correct distance metric, both in the user interface and in the resulting weights file. The distance units depend on the units for the coordinates of the base map. When those points are stored as unprojected lat-lon decimal degrees, the resulting distance will be in miles. Previously, the distance shown was rescaled and did not have a meaningful interpretation.

In Figure 22, the cut off distance shown in the interface using the North Carolina counties is (approximately) 29.9 miles. The distances calculated are included as the third column in the GWT file, both for distance-based contiguity and for k-nearest neighbors. For example, in Figure 23, the distances are listed (in miles) for the 4 nearest neighbors in the North Carolina example. Note that in the current version of GeoDa, the distances themselves are not used; only the resulting contiguity information is taken into account.

Figure 22: Distance cutoff in miles

Figure 23: Distance in k nearest neighbor weights files

Weights Characteristics

The design of the histogram used to depict the connectivity structure in spatial weights has been revised. The classification into discrete categories has been made more flexible and allows the number of categories to be adjusted through the Options > Intervals command (in the Options menu or by right clicking on the histogram).
Also, a continuous color ramp is used for the histogram bars (see also p. 43). Islands are properly identified and shown as polygons with 0 contiguities. In Figure 24, this is illustrated for distance-based contiguity using a cut off distance of 28 miles for the North Carolina counties. As shown in Figure 22, this is less than the necessary distance to ensure connectivity for all counties. As a result, two counties are identified as islands. Their location is shown by linking with the base map.

Figure 24: Weights characteristics with islands

Thiessen Polygons

Tools > Shape > Points to Polygons brings up a dialog to specify the options for the creation of Thiessen polygons from a point shape file. A new option has been included, which allows the use of an external bounding box to determine the extent of the enclosing “rectangle” for the polygons. In the interface, a check box selects this option, which requires a shape file to be specified, as in Figure 25.

Figure 25: Bounding box option for Thiessen polygons

The difference between the default and the use of this option is illustrated in Figures 26 and 27, using the centroids of the North Carolina counties as the input point shape file. The default (Figure 26) uses the bounding box for the point file, which has the extreme points on the boundary. Typically, the resulting rectangle will be smaller than the extent of the original counties. In Figure 27, the county polygon shape file was specified as the bounding box. Note that there is now some space between the centroids and the outer boundary of the rectangle. The latter is identical to the bounding box of the county shape file, facilitating overlay in a GIS.

While this option provides a degree of flexibility in setting the bounding box, it does not allow for an external shape bounding box that would be internal to the default for the point shape. In other words, the bounding box will never exclude points from the Thiessen polygons.
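The constraint that the enclosing rectangle never excludes points can be illustrated with a small sketch. It assumes, purely for illustration, that the effective rectangle is the union of the point bounding box and the external bounding box; the release notes do not state how GeoDa handles an external box that is too small, and the function names below are hypothetical.

```python
def bounding_box(points):
    """Return (xmin, ymin, xmax, ymax) for a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def enclosing_box(points, external=None):
    """Rectangle used to close off the Thiessen polygons.

    Assumption: an external box can only enlarge the default box,
    so no input point ever falls outside the result.
    """
    xmin, ymin, xmax, ymax = bounding_box(points)
    if external is None:
        return (xmin, ymin, xmax, ymax)
    exmin, eymin, exmax, eymax = external
    return (min(xmin, exmin), min(ymin, eymin),
            max(xmax, exmax), max(ymax, eymax))
```

With an external box that contains all points, the result equals that external box, which matches the overlay behavior described for Figure 27. With an external box that lies inside the point extent, the default box is returned unchanged.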
Figure 26: Default bounding box for Thiessen polygons

Figure 27: Polygon-based bounding box for Thiessen polygons

Mapping

A Cartogram has been added as a new type of map and the Map Movie functionality has been fine tuned considerably.

Cartogram

A cartogram is a map where the original layout of the areal units is replaced by a layout in which the size of the area is proportional to a given variable. GeoDa implements a so-called circular cartogram, in which the original irregular polygons are replaced by circles. The placement of the circles is such that the original pattern is mimicked as much as possible, both in terms of absolute location and in terms of relative location (neighbors, or topology). This is based on a non-linear cellular automata algorithm due to Dorling (1996). The size (area) of the circles is proportional to the value of the selected variable.

The cartogram is invoked by selecting Map > Cartogram from the menu or by clicking on the cartogram toolbar button. In the usual fashion, the variable selection dialog appears. After selecting the variable and clicking on the OK button, the cartogram is drawn. For example, in Figure 28 a cartogram is shown for the 1974 Sids rates (SIDR74) for North Carolina counties.

The cartogram uses a color code to provide additional information about specific values, such as negative values, zero and outliers. The default color is green. Negative values are shown as black and zeros as transparent (white in the default background). Upper outliers are red and lower outliers are blue. The default hinge used to identify outliers is 1.5, which results in four such observations in Figure 28. The default for the outlier criterion can be changed in three ways: in the Options menu, by right clicking in the cartogram, or by clicking on the matching Box Map toolbar button. The resulting dialog is illustrated in Figure 29. Selecting 3 as the value results in a cartogram with only one outlier, as in Figure 30.
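The effect of the hinge on the number of outliers, and the link between circle area and variable value, can be sketched as follows. This is a hedged illustration rather than GeoDa's code: the function names are hypothetical, the quartile rule (Python's "inclusive" method) is an assumption since the notes do not say how GeoDa computes quartiles, and the input values are made up.

```python
import math
import statistics


def outliers(values, hinge=1.5):
    """Split values into lower and upper outliers using box plot fences.

    The fences are Q1 - hinge * IQR and Q3 + hinge * IQR.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - hinge * iqr
    upper_fence = q3 + hinge * iqr
    return {
        "lower": [v for v in values if v < lower_fence],
        "upper": [v for v in values if v > upper_fence],
    }


def circle_radius(value, scale=1.0):
    """Radius of a circle whose area equals scale * value."""
    return math.sqrt(scale * value / math.pi)


# Made-up data: two large values stand out with a hinge of 1.5,
# only the largest one remains with a hinge of 3.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 30]
print(outliers(data, hinge=1.5)["upper"])
print(outliers(data, hinge=3.0)["upper"])
```

Because the area, not the radius, is proportional to the value, a value four times as large gives a circle with twice the radius.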
Figure 28: Circular cartogram for North Carolina Sids rates (SIDR74)

Figure 29: Selection of outlier hinge in cartogram

Figure 30: Outliers in Sids rate cartogram using a hinge of 3

Figure 31: Improving the layout of the cartogram

The cartogram uses a nonlinear algorithm to position and size the circles, which does not necessarily converge to an acceptable solution after the default number of iterations. An option is provided to compute an additional 100, 500 or 1000 iterations and improve upon the current solution, as illustrated in Figure 31.

The cartogram is treated in the same way as other windows when it comes to brushing and linking. Any selection in another window will also be highlighted in the cartogram, and vice versa. For example, Figure 32 shows the outliers in the cartogram linked to their actual locations in the North Carolina county map.

Figure 32: Outliers in Sids rate cartogram linked to base map

Map Movie

The Map Movie is an attempt at providing a simple form of map animation in GeoDa. This is accomplished by highlighting locations according to their order for a given variable, from low to high. This gives the same effect as brushing a box plot from the bottom to the top, one observation at a time. The Map Movie is implemented either in a Cumulative form or in a Single form. In the Cumulative version, the observations are added to a cumulative selection set, which ultimately covers the whole map. In contrast, in the Single form, only one location is shown at any time.

The Map Movie is invoked from the main menu by selecting Map > Map Movie > Cumulative, or Map > Map Movie > Single, or by clicking the toolbar button. Once a variable is chosen in the usual dialog, the map movie window opens, as in Figure 33 for the Columbus neighborhoods. This consists of some controls at the top and the usual areal outline.

Figure 33: Starting a cumulative map movie

There are five main controls and one slider bar.
The Play button starts (or re-starts) the movie. The speed at which locations are shown on the map depends on the setting of the slider bar. The actual speed is a function of the machine clock speed and is hardware dependent. Moving the button on the slider bar to the left speeds the movie up, moving it to the right slows it down. The Pause button stops the movie, as in Figure 34, and Reset clears the map. After the movie has been paused (or at the start), the arrow buttons, >> and <<, step through the movie one observation at a time, either forward (>>) or backward (<<). At the end of a cumulative map movie, all locations are selected, as in Figure 35.

The map movie is linked to all the other windows in the current project. However, this linking is only one-way: the selected locations that appear in the map movie are also highlighted in the other windows, but selections in other windows do not affect the map movie (this would defeat the purpose of the animation). More importantly, any change to the selections in other windows during the operation of a map movie will break the linking mechanism. For the same reason, there is no brushing in the map movie.

Finally, while it is possible to do so, it is not a good idea to run two map movies at the same time. The movies themselves will run fine, but the linking mechanism will be inconsistent.

Figure 34: Pausing a cumulative map movie

Figure 35: A completed cumulative map movie

Exploratory Data Analysis

GeoDa’s functionality for exploratory data analysis has been extended with three new types of dynamically linked graphs: the parallel coordinate plot, the three dimensional scatter plot, and four conditional plots (conditional map, box plot, histogram and scatter plot). In addition, the histogram and box plot graphs were redesigned slightly.

Parallel Coordinate Plot

The Parallel Coordinate Plot (PCP) is a method to explore multivariate relationships.
Each variable under consideration is drawn as a parallel line (axis) on which the (coordinates of the) observations are recorded as points. The matching points for each observation are connected and form a line. As a result, there are as many lines as observations in the PCP. Background on the fundamental ideas and methodological issues can be found in, among others, Inselberg (1985) and Wegman (1990).

The PCP can be used to discover “clusters” among observations when their lines show similar patterns (i.e., group together in a distinct way in the graph). In addition, a common pattern in the slopes of the lines connecting coordinates on different variable axes indicates the nature of the “correlation” between those variables (positive or negative, or no patterning). The PCP is linked to all the other graphs and maps and can be brushed.

The Parallel Coordinate Plot is launched by selecting it from the main menu, using Explore > Parallel Coordinate Plot, or by clicking on the PCP toolbar button. This opens the PCP variable selection dialog, as in Figure 36. Variables are included by selecting them in the left hand side panel and using the > arrow button. Alternatively, >> selects all variables, but this is usually not advised for a PCP. The selection can be edited by means of the reverse button. Click on OK (Figure 37) to launch the plot, which yields the PCP shown in Figure 38.

Figure 36: Variable selection for PCP

Figure 37: PCP variables selected

Figure 38: PCP for Columbus variables

A closer look at the graph shows the range for each variable listed in parentheses next to the variable name (on the left hand side). The order of the axes (variables) can be changed by clicking on the small dot next to the variable name (as in Figure 39) and dragging it to “drop” it on top of another variable. As a result, the two axes switch places in the plot. Rearranging the order of variables in this manner can sometimes facilitate the discovery of clusters and patterns.
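The construction described above can be sketched as follows. This is a minimal illustration rather than GeoDa's implementation. It assumes horizontal axes stacked from top to bottom, each scaled to the range of its own variable, and it returns one polyline per observation. The function names `pcp_polylines` and `swap_axes`, the variable name `VAR3` and the sample values are hypothetical.

```python
def pcp_polylines(columns, order):
    """Return one polyline per observation for a parallel coordinate plot.

    columns: dict mapping variable name -> list of values (equal lengths)
    order:   list of variable names, one axis per name, top to bottom
    Each vertex is (position along the axis in [0, 1], axis row index).
    Assumption: every axis is scaled to its own minimum-maximum range.
    """
    n_obs = len(columns[order[0]])
    lines = [[] for _ in range(n_obs)]
    for row, name in enumerate(order):
        values = columns[name]
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0  # avoid division by zero for a constant variable
        for i, v in enumerate(values):
            lines[i].append(((v - lo) / span, row))
    return lines


def swap_axes(order, a, b):
    """Return a new axis order in which variables a and b have switched places."""
    new_order = list(order)
    i, j = new_order.index(a), new_order.index(b)
    new_order[i], new_order[j] = new_order[j], new_order[i]
    return new_order


# Illustrative values only, not the actual Columbus data.
columns = {"CRIME": [10, 20, 30], "INC": [5, 15, 10], "VAR3": [100, 50, 75]}
order = ["CRIME", "INC", "VAR3"]
lines = pcp_polylines(columns, order)
```

Observations whose polylines run close together across several axes are candidates for a cluster, and redrawing with a reordered axis list corresponds to dragging one variable onto another.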
The PCP implemented in GeoDa has a limited number of options, which are invoked by right clicking on the graph, as illustrated in Figure 40. The first three of these are standard options for any graph: saving the image as a bitmap file, adding the selected observations as a dummy variable to the table, and changing the Background Color. The latter is often useful for better visibility of selected observations, since the default selection color of yellow is not easy to see on the default white background of the plot.

The last two of the five PCP options are non-standard. They pertain to the scale used for the horizontal axes. The default is to keep the variables in their original scales (this is not necessarily a good idea when the scales are very different). The alternative is to convert the variables to standard deviational units, which is obtained with the Standardize Data Set option. This is a toggle switch, so one of the two is always selected. Figure 41 illustrates the standardization on a dark grey background.

The PCP can be brushed like any other graph. A rectangular selection can be moved over the lines, as in Figure 42. This selects the matching observations in all the other open graphs and maps.

Figure 39: PCP change variable order

Figure 40: PCP options

Figure 41: PCP using standardized variables

Figure 42: Brushing the PCP

3D Scatter Plot

Multivariate data exploration in GeoDa is further facilitated by the inclusion of a three-dimensional scatter plot. This feature is still somewhat experimental and may not be totally stable at this point. It implements the usual 3-D point manipulations, such as rotating, zooming and translation of the graph, as well as linking and brushing.

The 3D Scatter Plot is started as Explore > 3D Scatter Plot from the menu, or by clicking the matching toolbar button. This brings up the Axis Selection dialog, as shown in Figure 43.
For each of the axes in the plot, the variable is selected from the drop down list in the usual way. Clicking OK generates the initial view of the 3-D plot, as in Figure 44. Note the position of the axes, with the z-axis coming out towards the viewer (the axes are color coded to facilitate keeping track of them during rotation and translation).

Figure 43: Variable selection for 3D scatter plot

The plot is manipulated by means of the mouse buttons. The left button is used to rotate the plot, the right button to zoom in or out (by moving the mouse up or down), and both buttons to translate the plot (move it up or down, or sideways). Figure 45 shows a rotation, where the z-axis is made vertical (the highest crime locations are the most vertical) and the x-y axes form the horizontal plane.

Figure 45 also illustrates the projection of the points onto one of the side planes. In the left hand side of the interface, the check box next to Project x-y is checked, which yields the points on the horizontal plane. In the illustration, since the X and Y axes are the coordinates, these are the locations of the Columbus neighborhood centroids.

Figure 44: 3D scatter plot initial view

Figure 45: Rotated 3D scatter plot

The selection of observations in the 3D scatter plot is implemented by means of a three-dimensional selection box or volume. Checking the Select box in the left hand pane generates the default volume. This can be resized by moving the sliders on the right hand side for each of the dimensions, as shown in Figure 46. The selected points (spheres) are highlighted in yellow.

Figure 46: Selection box in 3D scatter plot

The selection can be changed (brushing) in two different ways. In one, the sliders on the left hand side in the pane next to each of the dimensions can be moved to change the position of the selection box along this axis.
For example, moving the slider for the X-axis, as shown in Figure 47, will change the position of the box along the X dimension, but will keep its position along the two other dimensions fixed. Alternatively, CTRL-left mouse button allows free movement of the selection box in all dimensions (Figure 48).

The selected points in the 3D Scatter Plot are linked to all the other graphs and maps. This is slightly different from the standard approach, in the sense that the direction of selection matters. When the Select check box is activated in the 3D plot, the points selected are highlighted in the other plots. However, this is not continuous (as in other brushing): the selection is refreshed each time the brush stops, i.e., each time the red box on the plot stops moving. This is illustrated in Figure 49. Alternatively, when brushing is carried out in a different map or graph, this invalidates the Select check box in the 3D plot. The selection from the other graphs is highlighted in yellow in the 3D plot, but without the red selection box, as shown in Figure 50.

Figure 47: Brushing the 3D scatter plot using the slider

Figure 48: Free form brushing of the 3D scatter plot

Figure 49: Linking between 3D scatter plot and other windows

Figure 50: Linking between map and 3D scatter plot

Conditional Plot

The conditional plots are yet another way to carry out multivariate data exploration. The main principle behind these plots is to use two conditioning variables to subset the data sample into distinct categories. The observations in each of these categories fall into a specific range for the conditioning variables. A separate graph or map is drawn for a third variable in each of the subsets. The fundamental ideas behind this approach are outlined in Becker et al. (1996) and Carr et al. (2002), among others. In GeoDa, each of the conditioning variables can have three subsets, yielding a total of nine subgraphs.
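The three-by-three conditioning can be sketched in Python as follows. This is an illustration under stated assumptions, not GeoDa's code: each conditioning variable is cut by two break points, an observation that falls exactly on a break point is assigned to the higher category, and the function name `assign_cells` and the sample values are hypothetical.

```python
from bisect import bisect_right


def assign_cells(x, y, x_breaks, y_breaks):
    """Group observation indices into the nine cells of a conditional plot.

    x, y:       values of the two conditioning variables
    x_breaks:   two ascending break points for x (three categories)
    y_breaks:   two ascending break points for y (three categories)
    Returns a dict keyed by (column, row), with (0, 0) the lowest x and y category.
    Assumption: a value equal to a break point goes to the higher category.
    """
    cells = {(col, row): [] for col in range(3) for row in range(3)}
    for i, (xv, yv) in enumerate(zip(x, y)):
        col = bisect_right(x_breaks, xv)  # 0, 1 or 2
        row = bisect_right(y_breaks, yv)  # 0, 1 or 2
        cells[(col, row)].append(i)
    return cells


# Illustrative coordinates only.
x = [1, 5, 9, 2]
y = [1, 5, 9, 8]
cells = assign_cells(x, y, x_breaks=[3, 6], y_breaks=[3, 6])
```

Each cell's list of indices is the subset for which the third variable would then be mapped or graphed. Changing the break points redistributes the observations over the cells without changing the total.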
Four types of conditional plots are supported: a conditional map, conditional box plots, conditional histograms and conditional scatter plots.

The conditional plots are invoked as Explore > Conditional Plot from the menu, or by clicking the matching toolbar button. This brings up a simple dialog to select the type of graph, as in Figure 51. With the radio button checked next to the desired plot type, clicking OK brings up the variable selection dialog. Variables are moved to the respective axes by selecting them from the drop down list and clicking on the matching > button, as shown in Figure 52. After the variables are entered for all axes, OK (Figure 53) will start the selected graph. Since the map, box plot and histogram are univariate plots, only three axes are required. For the conditional scatter plot, a fourth axis is needed (the third is for the dependent variable, or vertical axis in the scatter plot, the fourth for the explanatory variable, or horizontal axis in the scatter plot).

Figure 51: Types of conditional plots

Figure 52: Conditional plot variable selection

Figure 53: Starting up the conditional plots

Figure 54: Conditional map plot

The four types of conditional plots are illustrated using the Columbus example and a very simple form of conditioning. The X-axis is for the X coordinates and the Y-axis for the Y coordinates. In other words, the nine subplots are for selected locations that fit the specified X-Y range. This is shown in Figure 54 for a choropleth map of the variable CRIME. Note that a continuous color ramp is used for the choropleth map.

The categories of the conditioning variables can be changed by moving the handles on the X-axis to the right or left, and on the Y-axis up or down. This will alter the number of observations falling in each cell and thus highlight how the pattern of the variable under consideration changes in different subsets of the data.
For example, in Figure 55, additional neighborhoods are included in the second Y level by moving the Y handle lower. This is easiest to see in the second highest cell on the left hand side, which was empty in Figure 54. Also, moving the handles to the right (X axis) or up (Y axis) collapses the categories together. If this is done for all handles, the plot in the lower left corner will be for the complete data set.

Figure 55: Moving the handles in the conditional plot

In Figure 56, the conditioning is illustrated for the box plots. In each of the cells, a new box plot is drawn, using the range for the complete data set as the reference (the height of the box is the same in each cell, and provides a reference with respect to the complete data set). However, the distribution in each cell is potentially different, with different medians (the red horizontal line), fences and outliers. The box plots follow the new format (see also the Box Plot section) and show the number of observations in each cell in parentheses. When there are fewer than five observations in the cell (as in the upper right corner of Figure 56), no box plot is drawn.

Figure 56: Conditional box plot

A similar approach is taken for the conditional histogram, shown in Figure 57. The categories in the histogram are fixed and pertain to the complete distribution. In order to change this, they need to be adjusted in each cell individually. Each of the cells in the conditional plot shows the observations that meet the conditioning criteria and where they stack up on the histogram. Each histogram bar shows the number of observations in that class at the top.

Figure 57: Conditional histogram

Finally, the conditional scatter plot is illustrated in Figure 58 for the variables CRIME and INC. In each cell, a regression line and its slope are given if at least two observations are present. The points themselves are always plotted.
For example, in the upper right cell of Figure 58, there is only one observation (one point in the scatter plot). Different slopes in the different cells suggest an interaction effect between the conditioning variables and the linear relation between the two variables considered. If there is no such interaction, then the slopes should be the same in all cells.

Figure 58: Conditional scatter plot

Histogram

The histogram now uses a different color scheme for the histogram bars. Instead of random color assignment, a continuous color ramp is used, as illustrated in Figure 59.

Box Plot

The box plot has been redesigned as well. Instead of a blue dot, the median is now shown as a red line that sticks out slightly on both sides of the box. In addition, the number of observations is listed in parentheses at the upper right hand corner, as illustrated in Figure 60.

Figure 59: New look histogram

Figure 60: New look box plot

Spatial Regression Analysis

GeoDa now includes some spatial regression functionality. In the current version, this is still fairly limited and experimental, but it works. The user interface in particular is still rudimentary. The basic diagnostics for spatial autocorrelation, heteroskedasticity and non-normality are implemented for the standard ordinary least squares regression. Estimation of spatial lag and spatial error models is supported by means of the Maximum Likelihood method. An extensive overview of the relevant methodology is beyond the scope of this document, but can be found in Anselin and Bera (1998).

The estimation techniques implemented for the Maximum Likelihood approach are based on the algorithms outlined in Smirnov and Anselin (2001). These algorithms were developed to address the estimation of spatial regression models in very large data sets. GeoDa has been successfully applied to spatial regression in a data set of 330,000 observations (estimation and inference were complete in a few minutes).
A spatial regression using the 3000+ US counties takes a few seconds. The asymptotic inference consists of a Likelihood Ratio test as well as an estimate of the asymptotic covariance matrix, using a new algorithm developed by Smirnov (2003).

All methods use sparse weights in either GAL or GWT format. However, so far, estimation only works for weights that reflect a symmetric spatial arrangement, such as contiguity weights or distance based weights (row-standardized), but not for k-nearest neighbor weights.

The regression functionality can be invoked in two different ways. In the opening screen, without loading a shape file, it is activated by selecting Methods > Regress (see also the Menu Items section). This is the suggested approach for large data sets (1,000 observations and up) since it avoids the overhead due to the linking of a large data table. In smaller data sets, the regression can also be started within a project, by selecting Regress on the main menu. This approach is more appropriate when predicted values and residuals will be used in mapping and further exploratory analysis.

Regression Interface

The Regress function starts with a dialog to set some basic parameters for the results and output, as illustrated in Figure 61. The Report Title can be ignored; the Output file name is the name of the text file to which the results will be written. The default is Regression.OLS, which will be the file name used unless a different name is specified, even when the analysis is for a lag or error model.
The next three items determine some additional information that may be included in the output file:

• the Predicted Value and Residual: note that this is not the same as the option to save these values to the data table; it only affects the listed output in the output file

• the Coefficient Variance Matrix: note that the (asymptotic) standard errors are reported with the coefficient estimates; this option pertains to the complete variance-covariance matrix (including the covariances)

• the Moran’s I z-value: the default is that this value is not reported since the computations involved are substantially slower than those for the Lagrange Multiplier statistics (in Figure 61 this option has been checked)

Figure 61: Regression analysis output settings

Clicking the OK button will invoke the variable specification dialog for the regression model.

The variable selection dialog is still rudimentary. It uses the >, >>, < and << buttons to move variables from the drop down list to the Dependent Variable box and the Independent Variables list, as shown in Figure 62.

Figure 62: Variable selection for regression analysis

The final step in setting up the regression specification consists of selecting a spatial weights file, as shown in Figure 63. This uses the usual file selection dialog. Also note that a constant term is included by default. Only in very rare circumstances would it make sense to uncheck that box.

The process outlined above is the same for all three regression models: first specify the output options, then select the variables, and close by choosing the spatial weights file.

Figure 63: Spatial weights file selection

Ordinary Least Squares with Diagnostics

The model specification is concluded by selecting one of the three radio buttons in the dialog and clicking Run to start the analysis. For example, for the classic regression model this is as shown in Figure 64. After the analysis is over, there are three important choices.
The Save button (Figure 65) adds the predicted values and residuals to the data table (e.g., for mapping or exploratory analysis). This must be done before any other options are selected. Clicking this button brings up a dialog to select variable names for these two items, as in Figure 66. Clicking OK finishes this process and the new variables are added to the data table (Figure 67).

Checking the OK button (Figure 68) brings up the result file. Note that if several models will be run using the same specification, this button should not be checked. In the current setup, once the OK button is selected, the regression analysis is over.

The result file, shown in Figure 69, contains the coefficient estimates, measures of fit and the diagnostics. These include the Bera-Jarque test for non-normality, the Breusch-Pagan, Koenker-Bassett and White tests for heteroskedasticity, and six test statistics for spatial autocorrelation: Moran’s I (including a z-value when that option has been checked), Lagrange Multiplier tests for lag, error and both forms, as well as their robust counterparts.

Figure 64: Starting the regression analysis, classic model

Figure 65: Saving predicted values and residuals, classic model

Figure 66: Selecting variable names for saved predicted values and residuals, classic model

Figure 67: Predicted values and residuals added to table

Figure 68: Finishing the regression analysis, classic model

Figure 69: Results, classic model

Maximum Likelihood in Spatial Lag Model

The process for the spatial lag model is essentially the same as for the classic model. The only difference is that the Spatial Lag radio button must be checked, as in Figure 70. The Save and OK buttons work in the same way. For the spatial lag model, there is a distinction between the residual and the prediction error.
The latter is the difference between the observed value and the predicted value that uses only exogenous variables, rather than treating the spatial lag Wy as observed.

Figure 70: Spatial regression analysis, lag model

The results reported consist of the estimated coefficients with their asymptotic standard errors and t-tests, measures of fit (log likelihood, AIC and SC), a Breusch-Pagan test for heteroskedasticity, and a Likelihood Ratio test on the spatial lag parameter.

Figure 71: Results, spatial lag model

Maximum Likelihood in Spatial Error Model

The process for the spatial error model is essentially the same as for the classic and lag models. The only difference is that the Spatial Error radio button must be checked, as in Figure 72. The Save and OK buttons work in the same way. For the spatial error model, the prediction error is the difference between observed and predicted y, whereas the “residuals” are the spatially filtered residuals.

Figure 72: Spatial regression analysis, error model

The results reported consist of the estimated coefficients with their asymptotic standard errors and t-tests, measures of fit (log likelihood, AIC and SC), a Breusch-Pagan test for heteroskedasticity, and a Likelihood Ratio test on the spatial error parameter.

Figure 73: Results, spatial error model

Bibliography

Anselin, L. (2003). GeoDa 0.9 User’s Guide. Spatial Analysis Laboratory (SAL). Department of Agricultural and Consumer Economics, University of Illinois, Urbana-Champaign, IL.

Anselin, L. and Bera, A. (1998). Spatial dependence in linear regression models with an introduction to spatial econometrics. In Ullah, A. and Giles, D. E., editors, Handbook of Applied Economic Statistics, pages 237–289. Marcel Dekker, New York.

Becker, R. A., Cleveland, W., and Shyu, M.-J. (1996). The visual design and control of Trellis displays. Journal of Computational and Graphical Statistics, 5:123–155.

Carr, D. B., Chen, J., Bell, S., Pickle, L., and Zhang, Y. (2002).
Interactive linked micromap plots and dynamically conditioned choropleth maps. In Anselin, L. and Rey, S., editors, New Tools for Spatial Data Analysis: Proceedings of the Specialist Meeting. Center for Spatially Integrated Social Science (CSISS), University of California, Santa Barbara. CD-ROM.

Dorling, D. (1996). Area Cartograms: Their Use and Creation. CATMOG 59, Institute of British Geographers.

Inselberg, A. (1985). The plane with parallel coordinates. Visual Computer, 1:69–91.

Smirnov, O. (2003). Computation of the information matrix for models of spatial interaction. Technical report, Regional Economics Applications Laboratory (REAL), University of Illinois, Urbana-Champaign, IL.

Smirnov, O. and Anselin, L. (2001). Fast maximum likelihood estimation of very large spatial autoregressive models: A characteristic polynomial approach. Computational Statistics and Data Analysis, 35:301–319.

Wegman, E. J. (1990). Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association, 85:664–675.