Model Selection in Geographically Weighted Regression

National Centre for Geocomputation
National University of Ireland, Maynooth, IRELAND

The Generic Problem

We are commonly faced with the decision of which subset of a larger set of potential explanatory variables to include within a model.

- choose too many: multicollinearity effects
- choose too few: omitted variable bias

The Generic Solution

Standard techniques exist in global modelling that assist with this problem, e.g. stepwise regression procedures, where the basic decision is whether or not to include a variable.

However, in local modelling the situation is more complex because the options are:

- variable x significant globally; significant spatial variability
- variable x significant globally; no significant spatial variability
- variable x not significant globally; significant spatial variability
- variable x not significant globally; no significant spatial variability

This leads to different types of models…

Global model:
y_i = β0 + β1 x_i1 + β2 x_i2 + … + βn x_in + ε_i

Full GWR model:
y_i = β0(u_i, v_i) + β1(u_i, v_i) x_i1 + β2(u_i, v_i) x_i2 + … + βn(u_i, v_i) x_in + ε_i

Semi-Parametric GWR (SGWR) model:
a mix of the two, with some coefficients global (fixed) as in the global model and the others varying over space as in the full GWR model

So, the question is…

Faced with a set of potential explanatory variables, how do we decide in GWR which variables to include, and in what form should they be included?

Brute force (trying every possible model) isn't currently a solution…

For instance, with just 18 variables there would be 524,288 possible models. Assuming it takes on average 5 minutes to run a GWR calibration on each model, that is about 2.6 million minutes of computation: it would take roughly 5 years to examine every possible model.

An alternative procedure, with an example of the Irish Famine

Step 1: select a subset of variables based on common sense/expert opinion
Step 2: check multicollinearity amongst the remaining variables and reject further variables if necessary
Step 3: run the STEPAIC procedure for GWR 3.x to eliminate any redundant variables.
This needs to be done locally and NOT globally.
Step 4: use the Monte Carlo option in GWR 3.x to identify which relationships are global and which are spatial
Step 5: run the semi-parametric option in GWR 4.x with relationships specified from Step 4 to calibrate the model

Note: Current work on replacing Steps 4 and 5 by a single AIC-based procedure in GWR 4.x

Population change 1821 to 1911

Year    Population
1821    6,801,827
1831    7,767,401
1841    8,196,597
1851    6,574,278
1861    5,798,967
1871    5,412,377
1881    5,174,836
1891    4,706,162
1901    4,458,775
1911    4,381,951

[Chart: population level by census year, 1821 to 2006; vertical axis 0 to 9,000,000]

What caused this variation?

Mapping Population change at 3,440 ED Level

Not small numbers

Why did some areas suffer greater decline than others?

To answer this, we have assembled over 100 potential explanatory variables at the ED level.

Step 1: Data Selection through expert opinion and Literature (100+ to 14)

Over 100 potential explanatory variables reduced to 14 (highly skewed variables transformed to logs):

- ln value per hectare 1841
- % uninhabited dwellings 1841
- Persons per building 1841
- % of ED under agriculture 1851
- % of cultivated land under grain 1851
- ln Proximity to workhouses
- % of crop land under potatoes 1851
- ln Pop 1841 / cropped land 1851
- MF pop ratio 1841
- Mean Elevation
- % pop living in towns 1841
- ln accessibility to urban areas
- ln dist to coast
- ln Av Holding size

Step 2: Data Reduction through eliminating redundancy (14 to 12)

Global regression model calibrated with the 14 variables and potential collinearity problems checked through VIFs.

No correlations were very high (highest was 0.74) and no VIF exceeded 5, although two exceeded 4: these were lnValue and percAgric. In both cases they were correlated with each other and also with lnPopCrop and lnUrbanAcc. Both percAgric and lnValue had counter-intuitive signs in the global model.

percAgric removed and the R-squared only decreased very marginally, but lnValue still had a counter-intuitive sign and its VIF was the highest at 3.5.

lnValue removed and the R-squared decreased again only very marginally. No VIFs were now greater than 3.

Step 3: Perform StepAIC using GWR 3.x (12 to 9)

Step 3.1: Place each variable in turn in a GWR model of % pop change 1841-51 regressed on a single explanatory variable. Calculate AIC in each case (12 runs). Select the variable yielding the lowest AIC.

Step 3.2: Place each of the remaining N-1 explanatory variables in a GWR model regressed on the variable selected at Step 3.1 and the new variable. Calculate the change in AIC between Step 3.1 and Step 3.2 (11 runs). Select the variable yielding the greatest reduction in AIC from that in Step 3.1. Add this variable to the model.

Step 3.3: As above with the remaining N-2 variables.

Continue until the AIC reduction falls to less than 3.

[Chart: AIC during calibration. Vertical axis: current smallest AIC (21500 to 22000); horizontal axis: added variable]

Global regression result

< Diagnostic information >
Residual sum of squares:           313891.565910
Number of parameters:              10
(Note: this num does not include an error variance term for a Gaussian model)
ML based global sigma estimate:    10.065819
Unbiased global sigma estimate:    10.082104
Log-likelihood:                    23099.208210
Classic AIC:                       23121.208210
AICc:                              23121.293758
BIC/MDL:                           23187.631842
CV:                                102.119192
R square:                          0.295515
Adjusted R square:                 0.293233

Variable              Estimate      Standard Error   t(Est/SE)
--------------------  ------------  ---------------  ------------
Intercept              44.830862     8.116669          5.523308
PCorn_Cult             -0.335679     0.021453        -15.646997
MEAN_ELEV               0.007511     0.003040          2.471006
lnPopCrop             -14.919852     0.536873        -27.790273
lnPotCult               1.682499     0.463698          3.628434
lnWorkhouse            -4.008397     0.496028         -8.080986
lnAHS                  -6.730804     0.443978        -15.160222
lnDistCoast            -0.625783     0.181567         -3.446568
lnUrbanAcc             -4.872450     0.823394         -5.917522
Percent_urban_41        0.183011     0.020085          9.111967

Step 4: Check Monte Carlo results for significance of spatial variation

GWR ESTIMATION

Fitting Geographically Weighted Regression Model...
Number of observations............ 3098
Number of independent variables... 10
(Intercept is variable 1)
Number of nearest neighbours...... 143
Number of locations to fit model.. 3098

Diagnostic information...
Residual sum of squares......... 132561.647700
Effective number of parameters.. 479.272452
Sigma........................... 7.114818
Akaike Information Criterion.... 21565.942622
Coefficient of Determination.... 0.702484
Adjusted r-square............... 0.648013

** Results written to .txt file

Test for spatial variability of parameters

Tests based on the Monte Carlo significance test procedure due to Hope [1968,JRSB,30(3),582-598]

Parameter    P-value
----------   ------------------
Intercept    0.00000 ***
Percent_     0.33000 n/s
PCorn_Cu     0.00000 ***
MEAN_ELE     0.00000 ***
lnPopCro     0.00000 ***
lnPotCul     0.00000 ***
lnWorkho     0.00000 ***
lnAHS        0.00000 ***
lnDistCo     0.00000 ***
lnUrbanA     0.00000 ***

*** = significant at .1% level
**  = significant at 1% level
*   = significant at 5% level

Therefore need to calibrate a semiparametric GWR model

Step 5: Calibrate final model with GWR 4.x: semi-parametric GWR

<< Fixed coefficients >>

Variable              Estimate      Standard Error   t(Estimate/SE)
--------------------  ------------  ---------------  --------------
Percent_urban_41        0.152553     0.016718          9.125142

<< Geographically varying coefficients >>

Estimates of varying coefficients have been saved in the following file.
Listwise output file: C:\Documents and Settings\Administrator\Desktop\Famine Run\GWRlistwise_Famine_9.csv

Summary statistics for varying coefficients

Variable              Mean          STD
--------------------  ------------  ------------
Intercept              63.124587    148.367433
PCorn_Cult             -0.268054      0.231437
MEAN_ELEV              -0.007160      0.044809
lnPopCrop             -12.142006      7.752879
lnPotCult               6.459728      7.208971
lnWorkhouse            -0.001007      4.718654
lnAHS                  -4.417651      5.124817
lnDistCoast            -0.574063      5.980647
lnUrbanAcc             -8.812155     16.201102

GWR (Geographically weighted regression) result

Bandwidth and geographic ranges
Bandwidth size: 143.735569

Coordinate   Min             Max              Range
-----------  --------------  ---------------  ---------------
X-coord      31562.850000    363700.560000    332137.710000
Y-coord      24792.950000    456391.560000    431598.610000

Diagnostic information
Residual sum of squares:                                  136727.552393
Effective number of parameters (model: trace(S)):         428.946300
Effective number of parameters (variance: trace(S'S)):    297.771190
Degree of freedom (model: n - trace(S)):                  2669.053700
Degree of freedom (residual: n - 2trace(S) + trace(S'S)): 2537.878590
ML based sigma estimate:                                  6.643353
Unbiased sigma estimate:                                  7.339942
Log-likelihood:                                           20524.592640
Classic AIC:                                              21384.485240
AICc:                                                     21523.427900 (was 21565 under full GWR)
BIC/MDL:                                                  23980.721141
CV:                                                       61.540472
R square:                                                 0.693135
Adjusted R square:                                        0.625381

[Map: population density on cropped land]
As population density on cropped land increases = the effects of the famine are more severe

Note: critical value of t adjusted for multiple hypothesis testing using Fotheringham adjustment

[Map: mean elevation]
As mean elevation increases = the effects of the famine are more severe (in some areas)
As mean elevation increases = the effects of the famine are less severe (in others)

[Map: proximity to towns]
Proximity to towns increases = the famine more severe (in some areas)
Proximity to towns increases = the famine less severe (in others)

[Map: distance to the coast]
Distance to the coast increases = the famine less severe (in some areas)
Distance to the coast increases = the famine more severe (in others)

[Map: Local Goodness-of-fit]
Model performing relatively well
Model performing relatively poorly

Summary

Model selection is a widespread challenge in spatial data handling.

It is a particular issue in semi-parametric GWR because relationships can be either global or local.

Here we present a method for model selection in GWR using both the existing GWR 3.x software and new software, GWR 4.x, which allows calibration of semi-parametric GWR models.

The method is applied to data on the Irish Famine and the results indicate intriguing spatial variations in the determinants of population loss during the Famine decade.
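The forward selection used in Step 3 (add the variable that lowers AIC the most, stop once the reduction falls below 3) can be sketched as follows. This is a minimal sketch and not the GWR 3.x implementation: `forward_select` and `aic_of` are hypothetical names, and `aic_of` stands in for a full GWR calibration that returns the AIC of a model containing the given variables. The toy scoring function, its variable names and its numbers are invented purely to exercise the loop.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def forward_select(
    candidates: Sequence[str],
    aic_of: Callable[[Tuple[str, ...]], float],
    min_drop: float = 3.0,
) -> Tuple[List[str], Optional[float]]:
    """Greedy forward selection driven by AIC.

    aic_of(variables) is assumed to calibrate a model with exactly those
    explanatory variables and return its AIC (hypothetical callback).
    """
    selected: List[str] = []
    remaining = list(candidates)
    current_aic: Optional[float] = None
    while remaining:
        # One calibration per remaining candidate (12 runs, then 11, ...).
        trials = [(aic_of(tuple(selected + [v])), v) for v in remaining]
        best_aic, best_var = min(trials)
        # The first variable is always accepted; afterwards stop as soon as
        # the best achievable AIC reduction is smaller than min_drop.
        if current_aic is not None and current_aic - best_aic < min_drop:
            break
        selected.append(best_var)
        remaining.remove(best_var)
        current_aic = best_aic
    return selected, current_aic


# Toy stand-in for a GWR calibration (invented numbers, not famine results):
# each variable lowers a baseline AIC of 100 by a fixed amount.
TOY_GAINS = {"x1": 10.0, "x2": 20.0, "x3": 2.0}


def toy_aic(variables: Tuple[str, ...]) -> float:
    return 100.0 - sum(TOY_GAINS[v] for v in variables)


if __name__ == "__main__":
    print(forward_select(["x1", "x2", "x3"], toy_aic))
```

Passing the calibration in as a callback keeps the selection loop independent of the software that actually fits each model; in the workflow described here that role is played by GWR 3.x runs.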
End of Presentation

[Map: holding size]
As holding size increases = the famine more severe (in some areas)
As holding size increases = the famine less severe (in others)

[Map: proximity to workhouses]
Closer to the workhouse = effects of the famine are more severe (in some areas)
Closer to the workhouse = effects of the famine are less severe (in others)
Global t = 7.0
Close proximity to workhouses → less severe effects
Workhouses didn't have any significant effect on population change in the rest of the country

[Map: potatoes]
As percentage of cropped land under potatoes increases = the effects of the famine are more severe (in some areas)
As percentage of cropped land under potatoes increases = the effects of the famine are less severe (in others)

[Map: grain]
As percentage of grain on cropped land increases = the effects of the famine are more severe

Note: critical value of t adjusted for multiple hypothesis testing using Fotheringham adjustment

Data available at Electoral Division level

Census 1851 (1841)
- Population change
- Population density
- Male-Female Ratio
- Number of Inhabited Houses
- Number of Uninhabited Houses
- Valuation per Hectare
- % ED urbanised
- Institutional population 1851

Agricultural Census 1851
- % under agriculture
- Holding size
- Wheat
- Oats
- Barley
- Potatoes
- Flax
- Meadow and Clover

Other Variables
- Mean elevation
- Distance to the coast or water bodies
- Proximity to urban centres
- Distance to ports
- Distance to workhouses

Geographically Weighted Regression

[Diagram: a regression point surrounded by data points]
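The regression point / data point idea can be illustrated with a small sketch: at each regression point, data points are weighted by their distance from it and a weighted least squares fit gives that location's coefficients. The sketch below makes several assumptions that are not stated in the slides: a bi-square kernel, an adaptive bandwidth equal to the distance to the k-th nearest data point, a single explanatory variable, and invented toy data. `local_fit` is a hypothetical name, not a GWR 3.x or GWR 4.x function.

```python
import math
from typing import List, Tuple

# Each data point: (east, north, x, y)
DataPoint = Tuple[float, float, float, float]


def local_fit(u: float, v: float, data: List[DataPoint], k: int) -> Tuple[float, float]:
    """Estimate b0(u, v) and b1(u, v) in y = b0 + b1 * x at regression point (u, v).

    Assumes k >= 2 and that the k-th nearest data point is not at distance 0.
    """
    dists = [math.hypot(e - u, n - v) for e, n, _, _ in data]
    bandwidth = sorted(dists)[k - 1]  # distance to the k-th nearest data point
    # Bi-square kernel (an assumption): weight 1 at the regression point,
    # falling smoothly to 0 at the bandwidth and beyond.
    w = [(1 - (d / bandwidth) ** 2) ** 2 if d < bandwidth else 0.0 for d in dists]
    sw = sum(w)
    xbar = sum(wi * p[2] for wi, p in zip(w, data)) / sw
    ybar = sum(wi * p[3] for wi, p in zip(w, data)) / sw
    sxy = sum(wi * (p[2] - xbar) * (p[3] - ybar) for wi, p in zip(w, data))
    sxx = sum(wi * (p[2] - xbar) ** 2 for wi, p in zip(w, data))
    b1 = sxy / sxx            # weighted least squares slope
    b0 = ybar - b1 * xbar     # weighted least squares intercept
    return b0, b1


# Invented toy data: ten locations along a west-east line. The relationship
# is y = 1 + 2x in the west (east < 5) and y = 1 - 2x in the east.
data: List[DataPoint] = []
for east in range(10):
    x = float(east % 3)
    slope = 2.0 if east < 5 else -2.0
    data.append((float(east), 0.0, x, 1.0 + slope * x))

if __name__ == "__main__":
    print("west:", local_fit(0.0, 0.0, data, k=4))
    print("east:", local_fit(9.0, 0.0, data, k=4))
```

With this toy data the fit at the western end recovers a positive slope and the fit at the eastern end a negative one, which is the kind of sign reversal across space that the maps in this presentation display; a single global regression would average the two away.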