Brute force (trying every possible model) isn’t currently a solution…

Model Selection in Geographically Weighted Regression
National Centre for Geocomputation
National University of Ireland, Maynooth, IRELAND
The Generic Problem
We are commonly faced with the decision of which subset of a larger set of potential explanatory variables to include within a model.
- choose too many: multicollinearity effects
- choose too few: omitted variable bias
The Generic Solution
Standard techniques exist in global modelling that assist with this problem, e.g. stepwise regression procedures, where the basic decision is whether or not to include a variable.
However, in local modelling the situation is more complex because the options are:
- variable x significant globally; significant spatial variability
- variable x significant globally; no significant spatial variability
- variable x not significant globally; significant spatial variability
- variable x not significant globally; no significant spatial variability
This leads to different types of models…
Global model:
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_n x_{in} + \varepsilon_i$
Full GWR model:
$y_i = \beta_0(u_i, v_i) + \beta_1(u_i, v_i)\, x_{i1} + \beta_2(u_i, v_i)\, x_{i2} + \dots + \beta_n(u_i, v_i)\, x_{in} + \varepsilon_i$
Semi-Parametric GWR (SGWR) model: a mixture of global and geographically varying relationships
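The slides give no equation for the SGWR model. One common way to write it, offered here as an assumption rather than the authors' own notation, splits the explanatory variables into a set $G$ whose coefficients are fixed over space and a set $L$ whose coefficients vary with location $(u_i, v_i)$. The index sets $G$ and $L$ and the symbol $\gamma_j$ are introduced only for this illustration; the intercept can sit in either set.

```latex
y_i = \sum_{j \in G} \gamma_j \, x_{ij}
    + \sum_{k \in L} \beta_k(u_i, v_i) \, x_{ik}
    + \varepsilon_i
```

With $G$ empty this reduces to the full GWR model, and with $L$ empty it reduces to the global model.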
So, the question is…
Faced with a set of potential explanatory variables, how do we decide in GWR which variables to include and in what form they should be included?
Brute force (trying every possible model) isn’t currently a solution…
For instance, with just 18 variables there would be 524,288 possible models.
Assuming it takes on average 5 minutes to run a GWR calibration on each model, it would take about 5 years to examine every possible model.
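The arithmetic behind the five-year figure can be checked in a few lines. This is a minimal sketch using only the two numbers quoted above; the variable names are illustrative. Note that 524,288 equals 2**19, that is, 19 independent include/exclude choices. The slide does not say how the count was derived, and 18 include/exclude choices on their own would give 2**18 = 262,144.

```python
# Figures quoted on the slide
n_models = 524_288        # candidate models
minutes_per_model = 5     # assumed average time for one GWR calibration

total_minutes = n_models * minutes_per_model
total_years = total_minutes / (60 * 24 * 365.25)
print(f"{total_minutes:,} minutes is about {total_years:.2f} years")

# The quoted count corresponds to 19 binary choices, not 18
print(n_models == 2**19, 2**18)
```

Allowing each variable to be excluded, global, or local (three states instead of two) would make the search space far larger still, which is why a stepwise procedure is attractive.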
An alternative procedure…
Step 1: select a subset of variables based on common sense/expert opinion
Step 2: check multicollinearity amongst the remaining variables and reject further variables if necessary
Step 3: run the STEPAIC procedure for GWR 3.x to eliminate any redundant variables. This needs to be done locally and NOT globally
Step 4: use the Monte Carlo option in GWR 3.x to identify which relationships are global and which vary spatially
Step 5: run the semi-parametric option in GWR 4.x, with relationships specified from Step 4, to calibrate the model
Note: Current work on replacing Steps 4 and 5 by a single AIC-based procedure in GWR 4.x
An Example of the Irish Famine
Population change 1821 to 1911
Year    Population
1821    6,801,827
1831    7,767,401
1841    8,196,597
1851    6,574,278
1861    5,798,967
1871    5,412,377
1881    5,174,836
1891    4,706,162
1901    4,458,775
1911    4,381,951
[Figure: population level (0 to 9,000,000) by census year, 1821 to 2006]
Mapping Population change at 3,440 ED Level
[Map annotations: "What caused this variation?"; "Not small numbers"]
Why did some areas suffer greater decline than others?
To answer this, we have assembled over 100 potential explanatory variables at the ED level.
Step 1: Data Selection through expert opinion and literature (100+ to 14)
Over 100 potential explanatory variables reduced to 14 (highly skewed variables transformed to logs):
- ln value per hectare 1841
- % uninhabited dwellings 1841
- Persons per building 1841
- % of ED under agriculture 1851
- % of cultivated land under grain 1851
- ln Proximity to workhouses
- % of crop land under potatoes 1851
- Dist to coast
- ln Pop 1841 / cropped land 1851
- MF pop ratio 1841
- Mean Elevation
- % pop living in towns 1841
- ln accessibility to urban areas
- ln dist to coast
- ln Av Holding size
Step 2: Data Reduction through eliminating redundancy (14 to 12)
A global regression model was calibrated with the 14 variables and potential collinearity problems were checked through VIFs. No correlations were very high (the highest was 0.74) and no VIF exceeded 5, although two exceeded 4: lnValue and perc Agric. These two were correlated with each other and also with lnPopCrop and lnUrbanAcc, and both had counter-intuitive signs in the global model.
Perc Agric was removed and the R-squared decreased only very marginally, but lnValue still had a counter-intuitive sign and its VIF was the highest at 3.5.
lnValue was removed and the R-squared again decreased only very marginally. No VIFs were now greater than 3.
Step 3: Perform StepAIC using GWR 3.x (12 to 9)
Step 3.1: Place each variable in turn in a GWR model of % pop change 1841-51 regressed on a single explanatory variable. Calculate the AIC in each case (12 runs). Select the variable yielding the lowest AIC.
Step 3.2: Place each of the remaining N-1 explanatory variables in turn in a GWR model containing the variable selected at Step 3.1 plus the new variable. Calculate the change in AIC between Step 3.1 and Step 3.2 (11 runs). Select the variable yielding the greatest reduction in AIC from that in Step 3.1. Add this variable to the model.
Step 3.3: As above with the remaining N-2 variables. Continue until the AIC reduction falls to less than 3.
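The forward selection loop in Step 3 can be sketched as follows. This is an illustration under stated assumptions, not the GWR 3.x STEPAIC code: the scoring function `ols_aic` is an ordinary least-squares stand-in for a GWR calibration, and the names `forward_step_aic`, `score` and `min_drop`, as well as the synthetic data, are hypothetical. In practice `score` would be replaced by a function that calibrates a GWR model and returns its AIC.

```python
import numpy as np

def ols_aic(X, y):
    """AIC of an ordinary least-squares fit (stand-in for a GWR calibration)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])            # add an intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = float(np.sum((y - Xd @ beta) ** 2))
    k = Xd.shape[1] + 1                              # coefficients + error variance
    return n * np.log(rss / n) + n * np.log(2 * np.pi) + n + 2 * k

def forward_step_aic(X, y, names, score=ols_aic, min_drop=3.0):
    """Add one variable per round; stop when the best AIC reduction is below min_drop."""
    selected, remaining = [], list(range(X.shape[1]))
    best_aic = np.inf
    history = []
    while remaining:
        # One calibration per remaining candidate, each added to the current model
        trials = [(score(X[:, selected + [j]], y), j) for j in remaining]
        aic, j = min(trials)
        if selected and best_aic - aic < min_drop:
            break                                    # improvement too small: stop
        selected.append(j)
        remaining.remove(j)
        best_aic = aic
        history.append((names[j], aic))
    return history

# Synthetic data (assumption): only x0 and x2 actually influence y
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 4))
y = 3.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(size=n)

history = forward_step_aic(X, y, names=["x0", "x1", "x2", "x3"])
for name, aic in history:
    print(f"added {name}: AIC = {aic:.1f}")
```

The first round always admits the single best variable (Step 3.1); later rounds apply the threshold of 3 to the AIC reduction, as in Step 3.3.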
[Figure: AIC during calibration. Y axis: Current Smallest AIC (ticks from 21500 to 22000); X axis: Added Variable]
Step 4: Check Monte Carlo results for significance of spatial variation
*****************************************************************************
Global regression result
*****************************************************************************
< Diagnostic information >
Residual sum of squares:            313891.565910
Number of parameters:               10
(Note: this num does not include an error variance term for a Gaussian model)
ML based global sigma estimate:     10.065819
Unbiased global sigma estimate:     10.082104
Log-likelihood:                     23099.208210
Classic AIC:                        23121.208210
AICc:                               23121.293758
BIC/MDL:                            23187.631842
CV:                                 102.119192
R square:                           0.295515
Adjusted R square:                  0.293233

Variable             Estimate        Standard Error  t(Est/SE)
-------------------- --------------- --------------- ---------------
Intercept            44.830862       8.116669        5.523308
PCorn_Cult           -0.335679       0.021453        -15.646997
MEAN_ELEV            0.007511        0.003040        2.471006
lnPopCrop            -14.919852      0.536873        -27.790273
lnPotCult            1.682499        0.463698        3.628434
lnWorkhouse          -4.008397       0.496028        -8.080986
lnAHS                -6.730804       0.443978        -15.160222
lnDistCoast          -0.625783       0.181567        -3.446568
lnUrbanAcc           -4.872450       0.823394        -5.917522
Percent_urban_41     0.183011        0.020085        9.111967
**********************************************************
*                    GWR ESTIMATION                      *
**********************************************************
Fitting Geographically Weighted Regression Model...
Number of observations............ 3098
Number of independent variables... 10
(Intercept is variable 1)
Number of nearest neighbours...... 143
Number of locations to fit model.. 3098
Diagnostic information...
Residual sum of squares......... 132561.647700
Effective number of parameters.. 479.272452
Sigma........................... 7.114818
Akaike Information Criterion.... 21565.942622
Coefficient of Determination.... 0.702484
Adjusted r-square............... 0.648013
** Results written to .txt file
*************************************************
*  Test for spatial variability of parameters   *
*************************************************
Tests based on the Monte Carlo significance test
procedure due to Hope [1968,JRSB,30(3),582-598]
Parameter    P-value
----------   ------------------
Intercept    0.00000 ***
Percent_     0.33000 n/s
PCorn_Cu     0.00000 ***
MEAN_ELE     0.00000 ***
lnPopCro     0.00000 ***
lnPotCul     0.00000 ***
lnWorkho     0.00000 ***
lnAHS        0.00000 ***
lnDistCo     0.00000 ***
lnUrbanA     0.00000 ***
*** = significant at .1% level
** = significant at 1% level
* = significant at 5% level
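Therefore we need to calibrate a semi-parametric GWR model: Percent_urban_41 shows no significant spatial variation and is treated as global, while the other relationships are allowed to vary over space.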
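The idea behind a Monte Carlo test of spatial variability can be sketched as below: the spread of the local estimates of each coefficient is compared with the spread obtained when the observations are randomly reassigned to locations. This is a simplified illustration, not the GWR 3.x implementation. The fixed Gaussian kernel, the bandwidth of 0.2, the use of the standard deviation as the test statistic, the function names `local_betas` and `monte_carlo_variability`, and the synthetic data are all assumptions (the run above used 143 nearest neighbours).

```python
import numpy as np

def local_betas(coords, X, y, bandwidth):
    """Locally weighted least-squares coefficients at every observation point."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])              # intercept + predictors
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-0.5 * d2 / bandwidth**2)               # n x n Gaussian kernel weights
    betas = np.empty((n, Xd.shape[1]))
    for i in range(n):
        XtW = Xd.T * W[i]                              # weight each observation
        betas[i] = np.linalg.solve(XtW @ Xd, XtW @ y)
    return betas

def monte_carlo_variability(coords, X, y, bandwidth, n_perm=99, seed=0):
    """p-value per coefficient: share of layouts (observed one included) whose
    spread of local estimates is at least as large as the observed spread."""
    rng = np.random.default_rng(seed)
    observed = local_betas(coords, X, y, bandwidth).std(axis=0)
    count = np.ones_like(observed)                     # the observed layout counts once
    for _ in range(n_perm):
        shuffled = coords[rng.permutation(len(y))]     # reassign locations at random
        simulated = local_betas(shuffled, X, y, bandwidth).std(axis=0)
        count += simulated >= observed
    return count / (n_perm + 1)

# Synthetic data (assumption): the slope drifts from west to east
rng = np.random.default_rng(42)
n = 150
coords = rng.uniform(0, 1, size=(n, 2))
x = rng.normal(size=(n, 1))
slope = 1.0 + 4.0 * coords[:, 0]
y = 2.0 + slope * x[:, 0] + rng.normal(scale=0.5, size=n)

p_values = monte_carlo_variability(coords, x, y, bandwidth=0.2)
print(p_values)    # order: [intercept, slope]
```

A small p-value suggests the coefficient varies over space and should be local; a large one, as for Percent_urban_41 above, suggests it can be held global.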
Step 5: Calibrate final model with GWR 4.x (semi-parametric GWR)
***********************************************************
<< Fixed coefficients >>
***********************************************************
Variable             Estimate        Standard Error  t(Estimate/SE)
-------------------- --------------- --------------- ---------------
Percent_urban_41     0.152553        0.016718        9.125142
***********************************************************
<< Geographically varying coefficients >>
***********************************************************
Estimates of varying coefficients have been saved in the following file.
Listwise output file: C:\Documents and Settings\Administrator\Desktop\Famine Run\GWRlistwise_Famine_9.csv
Summary statistics for varying coefficients
Variable             Mean            STD
-------------------- --------------- ---------------
Intercept            63.124587       148.367433
PCorn_Cult           -0.268054       0.231437
MEAN_ELEV            -0.007160       0.044809
lnPopCrop            -12.142006      7.752879
lnPotCult            6.459728        7.208971
lnWorkhouse          -0.001007       4.718654
lnAHS                -4.417651       5.124817
lnDistCoast          -0.574063       5.980647
lnUrbanAcc           -8.812155       16.201102
*****************************************************************************
GWR (Geographically weighted regression) result
*****************************************************************************
Bandwidth and geographic ranges
Bandwidth size:                     143.735569
Coordinate      Min             Max             Range
--------------- --------------- --------------- ---------------
X-coord         31562.850000    363700.560000   332137.710000
Y-coord         24792.950000    456391.560000   431598.610000
Diagnostic information
Residual sum of squares:                                    136727.552393
Effective number of parameters (model: trace(S)):           428.946300
Effective number of parameters (variance: trace(S'S)):      297.771190
Degree of freedom (model: n - trace(S)):                    2669.053700
Degree of freedom (residual: n - 2trace(S) + trace(S'S)):   2537.878590
ML based sigma estimate:                                    6.643353
Unbiased sigma estimate:                                    7.339942
Log-likelihood:                                             20524.592640
Classic AIC:                                                21384.485240
AICc:                                                       21523.427900   (was 21565 under full GWR)
BIC/MDL:                                                    23980.721141
CV:                                                         61.540472
R square:                                                   0.693135
Adjusted R square:                                          0.625381
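The AIC and AICc values in the GWR 4.x listings can be reproduced from the other reported diagnostics. The sketch below rests on two assumptions that happen to match the printed numbers: the value labelled "Log-likelihood" is treated as a deviance (-2 log-likelihood), and the parameter count is the reported number of parameters plus one for the error variance. The function name `aic_and_aicc` is hypothetical, and n = 3098 is taken from the GWR 3.x log.

```python
def aic_and_aicc(deviance, n_params, n_obs):
    """Classic AIC and small-sample corrected AICc from a -2 log-likelihood value."""
    k = n_params + 1                      # add one for the error variance term
    aic = deviance + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n_obs - k - 1)
    return aic, aicc

n_obs = 3098                              # observations, from the GWR 3.x log

# Global model: 10 parameters
global_aic, global_aicc = aic_and_aicc(23099.208210, 10, n_obs)

# Semi-parametric GWR: effective number of parameters trace(S)
sgwr_aic, sgwr_aicc = aic_and_aicc(20524.592640, 428.946300, n_obs)

print(f"global: AIC = {global_aic:.3f}, AICc = {global_aicc:.3f}")
print(f"SGWR:   AIC = {sgwr_aic:.3f}, AICc = {sgwr_aicc:.3f}")

# Difference from the full GWR value reported by GWR 3.x
print(f"drop relative to full GWR: {21565.942622 - sgwr_aicc:.1f}")
```

GWR 3.x and GWR 4.x may not compute the criterion in exactly the same way, so the drop relative to full GWR is indicative only; it is nevertheless well above the threshold of 3 used in Step 3.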
[Maps of local parameter estimates; annotations grouped by variable]
Note: critical value of t adjusted for multiple hypothesis testing using Fotheringham adjustment
Population density on cropped land:
- As population density on cropped land increases, the effects of the famine are more severe
Mean elevation:
- As mean elevation increases, the effects of the famine are more severe
- As mean elevation increases, the effects of the famine are less severe
Proximity to towns:
- As proximity to towns increases, the famine is more severe
- As proximity to towns increases, the famine is less severe
Distance to the coast:
- As distance to the coast increases, the famine is less severe
- As distance to the coast increases, the famine is more severe
Holding size:
- As holding size increases, the famine is more severe
- As holding size increases, the famine is less severe
Proximity to workhouses:
- Closer to the workhouse, the effects of the famine are more severe
- Closer to the workhouse, the effects of the famine are less severe
Potatoes:
- As the percentage of cropped land under potatoes increases, the effects of the famine are more severe
- As the percentage of cropped land under potatoes increases, the effects of the famine are less severe
Grain:
- As the percentage of cropped land under grain increases, the effects of the famine are more severe
Global t = 7.0
Close proximity to workhouses → less severe effects
Workhouses didn’t have any significant effect on population change in the rest of the country
Local Goodness-of-fit
[Map annotations: model performing relatively well; model performing relatively poorly]
Summary
Model selection is a widespread challenge in spatial data handling.
It is a particular issue in semi-parametric GWR because relationships can be either global or local.
Here we present a method for model selection in GWR using both the existing GWR 3.x software and new software, GWR 4.x, which allows calibration of semi-parametric GWR models.
The method is applied to data on the Irish Famine and the results indicate intriguing spatial variations in the determinants of population loss during the Famine decade.
End of Presentation
Data available at Electoral Division level
Census 1851 (1841):
- Population change
- Population density
- Male-Female Ratio
- Number of Inhabited Houses
- Number of Uninhabited Houses
- Valuation per Hectare
- % ED urbanised
- Institutional population 1851
Agricultural Census 1851:
- % under agriculture
- Holding size
- Wheat
- Oats
- Barley
- Potatoes
- Flax
- Meadow and Clover
Other Variables:
- Mean elevation
- Distance to the coast or water bodies
- Proximity to urban centres
- Distance to ports
- Distance to workhouses
[Figure: Geographically Weighted Regression, showing regression points and data points]