
Using AMOEBA to Create a Spatial Weights
Matrix and Identify Spatial Clusters, and a
Comparison to Other Clustering Algorithms
Arthur Getis* and Jared Aldstadt**
*San Diego State University
**SDSU/UCSB Joint PhD Program
Paper presented at the
Regional Research Institute, West Virginia University
Morgantown, West Virginia
December 8, 2005
AMOEBA
• A design for the construction of a spatial weights matrix using
empirical data.
• Multidirectional: Searches for spatial association in all specified
directions.
• Optimal: Optimum in the sense that the scale is local (the finest
scale) and the analysis reveals all spatial association.
• Ecotope-Based: The ecotope is a specialized region (a
particular habitat) within a larger region.
• Algorithm: The algorithm for finding the ecotope is based on an
analytical system that often finds highly irregular (amoeba-like)
sub-regions of spatial association.
The Issues
Question 1
• How does one create an appropriate
spatial weights matrix?
Question 2
• Can we have confidence in the
identification of spatial clusters?
Question 1
• How does one create an appropriate
spatial weights matrix?
The Spatial Weights Matrix
In a regression context
W is the formal expression of spatial
dependence between spatial units (the spatial
effects).
Used in, for example: y = ρWy + Xβ + ε
The Typical W Matrix
        j ------->
        1      2      3      ...    n
i=1     w11    w12    w13    ...    w1n
i=2     w21    w22    ...
i=3     w31    ...
...
i=n     wn1    ...                  wnn
Some Traditional W Schemes
• Contiguity
• Inverse Distances
• Lengths of Shared Borders, Perimeters
• nth Nearest Neighbor Distance
• All Centroids within d
• Ranked Distances
• Network Links
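Two of the schemes listed above, inverse distances and all centroids within d, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the four centroid coordinates are made up for the example, and row standardization is shown as one common option rather than a step the presentation prescribes.

```python
import math

def distance(p, q):
    """Euclidean distance between two (x, y) centroids."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def inverse_distance_weights(points):
    """w_ij = 1 / d_ij for i != j; the diagonal is 0.

    Assumes no two centroids coincide (d_ij > 0 for i != j).
    """
    n = len(points)
    return [[0.0 if i == j else 1.0 / distance(points[i], points[j])
             for j in range(n)] for i in range(n)]

def distance_band_weights(points, d):
    """Binary weights: w_ij = 1 when centroid j lies within d of i."""
    n = len(points)
    return [[1.0 if i != j and distance(points[i], points[j]) <= d else 0.0
             for j in range(n)] for i in range(n)]

def row_standardize(w):
    """Scale each row to sum to 1; all-zero rows are left unchanged."""
    out = []
    for row in w:
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else list(row))
    return out

# Hypothetical centroids, chosen only for illustration.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
w_inv = inverse_distance_weights(points)
w_band = distance_band_weights(points, d=2.0)
w_std = row_standardize(w_band)
```

The last centroid has no neighbor within d = 2, so its row in w_band is all zeros.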
Commentators on W
• Anselin: Outlined the problem
• Dacey: varying results given schemes
• Cliff and Ord: rook’s and queen’s cases
• Griffith: better under-specified
• Florax & Rey: over-specification reduces power
• Kooijman: maximize Moran’s I
• Openshaw: computer search for best model
• Bartels: binary defensible
• Hammersley-Clifford: near neighbors in Markov
• Tiefelsdorf, Griffith, Boots: standardization
• Florax and Graff: bias due to matrix sparseness
• GEODA listserv
Some Recent W Schemes
• Fotheringham, Brunsdon, and Charlton’s
bandwidth distance decay (1996)
• LeSage’s Gaussian distance decline (1999)
• McMillen’s tri-cube distance decline (1998)
• Getis and Aldstadt’s local statistics model (2001,
2002)
• Fotheringham, Charlton, Brunsdon’s optimize
bandwidth (2002)
• LeSage’s Bayesian approach (2003)
• Aldstadt and Getis’ AMOEBA (2003)
W Theory or Reality?
• Exogenous versus endogenous
• Estimation versus prediction
• Model driven versus data driven
• The AMOEBA approach
AMOEBA: Critical Number of Links
Identification
Local statistics’ values are computed around each observation
as the number of links (d) increases. When the absolute values
fail to rise, the cluster diameter is reached.
First peak equalsG i*dc .
2.5
|Gi*|
2
1.5
1
0.5
0
0
1
2
3
Distance
Links
4
5
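The stopping rule described on this slide can be sketched as a search for the first peak in the sequence of |Gi*| values. The sketch assumes the local statistic has already been computed for d = 0, 1, 2, ... links around one observation; the function name and the example values are hypothetical and are not read from the figure.

```python
def critical_links(g_values):
    """Return d_c, the number of links at the first peak of |Gi*|.

    g_values[d] is the local statistic computed with all units
    within d links of observation i (d = 0 is i alone).
    """
    d_c = 0
    for d in range(1, len(g_values)):
        if abs(g_values[d]) > abs(g_values[d - 1]):
            d_c = d      # the statistic is still rising
        else:
            break        # it failed to rise: the first peak is reached
    return d_c

# Hypothetical values, chosen only for illustration.
g_star = [0.8, 1.4, 2.1, 1.9, 2.3, 1.0]
d_c = critical_links(g_star)
```

Here the search stops at d = 2, the first peak, even though a larger value appears at d = 4.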
AMOEBA: Weight Calculation
When dc > 0,
\[
w_{ij} = \frac{P(z \le Z_{d_c}) - P(z \le Z_{d_{ij}})}{P(z \le Z_{d_c}) - P(z \le Z_{0})}, \quad \text{for all } j \text{ where } d_{ij} \le d_c,
\]
\[
w_{ij} = 0, \quad \text{otherwise.}
\]
When dc = 0, wij = 0 for all j.
P(z) is the cumulative probability associated with the standard variate
of the normal distribution
Weights vary between 0 and 1.
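A minimal Python sketch of the weight calculation for a single focus unit i follows. It assumes the standardized local statistic Z is already available for each number of links from 0 to dc. The function names, the dictionary layout, and the example values are hypothetical. The comparison dij <= dc is also an assumption; a strict inequality gives the same weights, because the formula yields 0 when dij = dc.

```python
import math

def normal_cdf(z):
    """P(Z <= z) for a standard normal variate."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def amoeba_weights(z_by_links, d_c, links_to_j):
    """Weights w_ij for one focus unit i.

    z_by_links[d] : standardized local statistic at d links (d = 0..d_c)
    d_c           : critical number of links for unit i
    links_to_j    : dict mapping each other unit j to d_ij
    """
    if d_c == 0:
        return {j: 0.0 for j in links_to_j}
    p_c = normal_cdf(z_by_links[d_c])
    p_0 = normal_cdf(z_by_links[0])
    denom = p_c - p_0
    weights = {}
    for j, d_ij in links_to_j.items():
        if d_ij <= d_c and denom != 0.0:
            weights[j] = (p_c - normal_cdf(z_by_links[d_ij])) / denom
        else:
            weights[j] = 0.0   # beyond the critical number of links
    return weights

# Hypothetical inputs: Z rises to its first peak at 2 links.
z_values = [0.5, 1.2, 2.0]
w_i = amoeba_weights(z_values, d_c=2, links_to_j={"a": 1, "b": 2, "c": 3})
```

Unit "a", one link away, receives a weight strictly between 0 and 1; units "b" and "c" receive 0.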
AMOEBA: Links Designations
dij is the number of links from the focus spatial unit i to another spatial
unit j
dc is the critical number of links: the number of links from i beyond
which no further autocorrelation exists.
AMOEBA as W and U in an
Autoregressive Spatial Lag Model
A row of the weights matrix may consist entirely of zeros, indicating
that there is no local spatial autocorrelation surrounding that
observation. To compensate for this zero-row effect, we create a dummy
variable, U, that takes the value 1 for every observation with no
dependence structure and 0 otherwise.
y = θWy + αU + Xβ + ε
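As an illustration of how U follows from W, the sketch below flags the all-zero rows of a weights matrix and also forms the spatial lag Wy that enters the model. The 4 x 4 matrix, the y values, and the function names are hypothetical; estimating the model itself is not attempted here.

```python
def zero_row_dummy(w):
    """U_i = 1 when row i of W is all zeros, else 0."""
    return [1 if all(v == 0 for v in row) else 0 for row in w]

def spatial_lag(w, y):
    """Return Wy: for each i, the weighted sum of y over its neighbors."""
    return [sum(w_ij * y_j for w_ij, y_j in zip(row, y)) for row in w]

# Hypothetical weights: the third unit has no dependence structure.
W = [
    [0.0, 0.6, 0.0, 0.2],
    [0.5, 0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0],
    [0.4, 0.1, 0.0, 0.0],
]
y = [1.0, 2.0, 3.0, 4.0]

U = zero_row_dummy(W)
Wy = spatial_lag(W, y)
```

The third unit gets U = 1 and a spatial lag of 0, so its level is absorbed by the coefficient on U rather than by the lag term.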
AMOEBA as W and U in an
Autoregressive Spatial Error Model
y = αU + Xβ + (I − κW)⁻¹ε
AMOEBA: The non-spatial and spatial
matrices
 i   U   W (row i)
 1   0   0       w1,2    w1,3    w1,4    w1,5    w1,6    w1,7    w1,8    w1,9    w1,10   w1,11   w1,12   w1,13   w1,14
 2   0   w2,1    0       w2,3    w2,4    w2,5    w2,6    w2,7    w2,8    w2,9    w2,10   w2,11   w2,12   w2,13   w2,14
 3   0   w3,1    w3,2    0       w3,4    w3,5    w3,6    w3,7    w3,8    w3,9    w3,10   w3,11   w3,12   w3,13   w3,14
 4   0   w4,1    w4,2    w4,3    0       w4,5    w4,6    w4,7    w4,8    w4,9    w4,10   w4,11   w4,12   w4,13   w4,14
 5   1   0       0       0       0       0       0       0       0       0       0       0       0       0       0
 6   1   0       0       0       0       0       0       0       0       0       0       0       0       0       0
 7   0   w7,1    w7,2    w7,3    w7,4    w7,5    w7,6    0       w7,8    w7,9    w7,10   w7,11   w7,12   w7,13   w7,14
 8   0   w8,1    w8,2    w8,3    w8,4    w8,5    w8,6    w8,7    0       w8,9    w8,10   w8,11   w8,12   w8,13   w8,14
 9   0   w9,1    w9,2    w9,3    w9,4    w9,5    w9,6    w9,7    w9,8    0       w9,10   w9,11   w9,12   w9,13   w9,14
10   1   0       0       0       0       0       0       0       0       0       0       0       0       0       0
11   0   w11,1   w11,2   w11,3   w11,4   w11,5   w11,6   w11,7   w11,8   w11,9   w11,10  0       w11,12  w11,13  w11,14
12   0   w12,1   w12,2   w12,3   w12,4   w12,5   w12,6   w12,7   w12,8   w12,9   w12,10  w12,11  0       w12,13  w12,14
13   1   0       0       0       0       0       0       0       0       0       0       0       0       0       0
14   0   w14,1   w14,2   w14,3   w14,4   w14,5   w14,6   w14,7   w14,8   w14,9   w14,10  w14,11  w14,12  w14,13  0
Generalized AMOEBA
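\[
\begin{bmatrix} Y_c \\ Y_0 \end{bmatrix}
= \alpha \begin{bmatrix} 1_c \\ 1_0 \end{bmatrix}
+ \rho \begin{bmatrix} W_{cc} & W_{c0} \\ 0 & 0 \end{bmatrix}
\begin{bmatrix} Y_c \\ Y_0 \end{bmatrix}
+ \beta \begin{bmatrix} 0 \\ 1_0 \end{bmatrix}
+ \begin{bmatrix} \varepsilon_c \\ \varepsilon_0 \end{bmatrix}
\]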
Total Fertility Rates
An Example
• Amman, Jordan
• 1994 (data by census units)
[Map: regional setting, with labels for Lebanon, Syria, Iraq, the Mediterranean Sea, the Palestinian Authority, Gaza, Israel, Egypt, and Saudi Arabia.]
Explanatory Variables
Regressor social variables
• 1. Percent of females with higher education (called
“hi-ed”)
• 2. Percent females married (called “married”)
Ordinary Least Squares
No W or U
AIC: 165.35
t VALUES
  constant      6.266
  hi-ed       -14.344
  married       1.261
AMOEBA in Spatial Error Models
                 Contiguity    AMOEBA Gi    AMOEBA Ii    AMOEBA ci
AIC                 167.352       79.159      147.043      101.100
t VALUES
  constant            6.499        6.499        7.201        6.342
  hi-ed             -13.040      -11.550      -13.316       -4.680
  married             1.164        1.978        1.227        1.154
  lambda              1.634       98.792        1.187       14.005
  non-spatial             -       12.588       -4.048        7.089
Comparison of Spatial Contiguity and
AMOEBA Spatial Error Model
Spatial Error Model:
• Gi AMOEBA has an AIC much lower than contiguity
(79.159 versus 167.352).
• All AMOEBA models are an improvement over
contiguity.
• Gi AMOEBA has an extremely high lambda and non-spatial
vector: a good descriptor of spatial and non-spatial effects.
• Gi AMOEBA shows social variables to be
significant in explaining TFR.
AMOEBA in Spatial Lag Models
                 Contiguity    AMOEBA Gi    AMOEBA Ii    AMOEBA ci
AIC                 160.625       108.27      148.481      123.881
t VALUES
  constant            5.419        3.866        5.068        4.742
  hi-ed              -9.927       -0.087       -9.051       -8.642
  married             1.164        2.160        1.341        1.201
  Rho                -0.005        7.435        1.819        5.443
  Non-Spatial             -        7.594       -1.657        8.058
Comparison of Spatial Contiguity and
AMOEBA Spatial Lag Model
• Again, all AMOEBA models have a lower AIC than contiguity;
Gi AMOEBA is best.
• All variables significant.
Question 2
• Can we have confidence in the
identification of spatial clusters?
Problems with Spatial Clusters
• Not explicit (what is a cluster?)
• Are they statistically significant? (degree of
confidence)
• What is the appropriate spatial scale?
• Often arbitrary, too general
• Over and under identification
• Appropriate shape (too circular, ellipsoidal)
• In general, the believability problem
AMOEBA Procedure I
For each observation i, local statistic values (e.g., Gi*, Z[Ii], Z[ci]) are
obtained for all combinations of near neighbors j of i within distance
d of i. The j observations in the set that maximizes the local statistic
become members of the ecotope, together with the ith observation.
[Figure: the focus cell (0) and its neighbors at 1 link.]
AMOEBA Subsequent Procedures
The procedure is repeated at increasing distances from i. At each
distance d from i, only the j observations that are contiguous to the
already existing ecotope are evaluated. Again, using the local
statistic, all combinations together with the already existing ecotope
members are evaluated. The new set of j observations that
maximizes the local statistic joins the ecotope.
[Figure: cells labeled by their number of links from the focus cell (0), out to 6 links.]
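The two procedure slides can be sketched together as a loop that grows an ecotope outward from a focus unit. This is an illustration under several assumptions, not the authors' implementation: the local statistic is Gi* with binary weights, units that were evaluated and left out are not reconsidered, growth stops when no combination raises the statistic, and the seven-unit chain of values is made up. The exhaustive search over combinations is exponential in the number of frontier units, so the sketch only suits small examples.

```python
import itertools
import math

def g_star(values, members):
    """Getis-Ord Gi* with binary weights over the units in `members`."""
    n_all, n_in = len(values), len(members)
    mean = sum(values) / n_all
    s = math.sqrt(sum(v * v for v in values) / n_all - mean * mean)
    spread = (n_all * n_in - n_in * n_in) / (n_all - 1)
    if s == 0.0 or spread <= 0.0:
        return 0.0
    total = sum(values[j] for j in members)
    return (total - mean * n_in) / (s * math.sqrt(spread))

def grow_ecotope(values, neighbors, i):
    """Grow an ecotope outward from unit i, one link at a time."""
    ecotope = {i}
    current = g_star(values, ecotope)
    sign = 1.0 if current >= 0 else -1.0   # search for a high or a low cluster
    seen = {i}                             # units already evaluated (assumption)
    while True:
        # Only units contiguous to the existing ecotope are candidates.
        frontier = {j for m in ecotope for j in neighbors[m]} - seen
        if not frontier:
            break
        seen |= frontier
        best, best_set = sign * current, None
        # Evaluate every non-empty combination of frontier units.
        for r in range(1, len(frontier) + 1):
            for combo in itertools.combinations(sorted(frontier), r):
                score = sign * g_star(values, ecotope | set(combo))
                if score > best:
                    best, best_set = score, set(combo)
        if best_set is None:
            break                          # the statistic fails to rise
        ecotope |= best_set
        current = sign * best
    return ecotope, current

# Hypothetical chain of seven units with a run of high values in the middle.
values = [0.1, -0.3, 2.0, 2.5, 2.2, -0.2, 0.0]
neighbors = {k: [j for j in (k - 1, k + 1) if 0 <= j < len(values)]
             for k in range(len(values))}
ecotope, g = grow_ecotope(values, neighbors, i=3)
```

Starting from unit 3, the first step adds both neighbors, and the second step adds nothing because every further combination lowers the statistic.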
Hypothetical Clusters
[Figure: hypothetical clusters; panels annotated mean = 0, variance = 1.]
AMOEBA Examples 1 to 5
[Figures: each of the five examples shows four panels: LSM, AMOEBA Gi, AMOEBA Ii, and AMOEBA ci.]
Heterogeneous Clusters
These data are like those used in the GA paper.
Homogeneous Clusters
These are the same six clusters, with radii 2, 4, and 6. The high clusters have a mean of 0.5 and the low clusters
have a mean of -0.5. These means are added to random values drawn from the Normal(0,1) distribution.
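A surface of this kind can be simulated with a short Python sketch. The grid size, the cluster centers, the random seed, and the split into one high and one low cluster per radius are assumptions made for illustration; the slides give only the radii, the cluster means, and the Normal(0,1) background.

```python
import random

def simulate_surface(size, clusters, seed=0):
    """Return (surface, shifts) for a size x size grid.

    surface : N(0, 1) noise plus the cluster mean of each cell
    shifts  : the cluster mean alone (0.0 outside every cluster)
    clusters: list of (row, col, radius, shift) tuples; every cell whose
              center lies within `radius` of (row, col) receives `shift`.
    """
    rng = random.Random(seed)
    shifts = [[0.0] * size for _ in range(size)]
    for r0, c0, radius, shift in clusters:
        for r in range(size):
            for c in range(size):
                if (r - r0) ** 2 + (c - c0) ** 2 <= radius ** 2:
                    shifts[r][c] = shift
    surface = [[rng.gauss(0.0, 1.0) + shifts[r][c] for c in range(size)]
               for r in range(size)]
    return surface, shifts

# Hypothetical layout: six non-overlapping clusters on a 40 x 40 grid.
clusters = [
    (5, 5, 2, 0.5), (8, 20, 4, 0.5), (10, 33, 6, 0.5),       # high clusters
    (25, 5, 2, -0.5), (28, 20, 4, -0.5), (30, 32, 6, -0.5),  # low clusters
]
surface, shifts = simulate_surface(40, clusters, seed=1)
```

With a shift of only half a standard deviation, individual cells inside a cluster are hard to tell from the background, which is what makes this a demanding test for a clustering method.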
Peaked Clusters
Real World Example
• Clustering of dengue hemorrhagic fever
in Thailand by province and by month.
• 14 years of data: 168 monthly
observations
STARS: A GIS System
• Rey, Sergio. Space-Time Analysis of
Regional Systems (STARS). Available as
an open source program on the Internet.
Other Clustering Algorithms
• SaTScan by Kulldorff (1997, v4.0 2004),
(Communications in Statistics)
• FleXScan by Tango and Takahashi (2004,
2005) (International Journal of Health
Geographics)
Bases of Clustering Methods
AMOEBA
Based on values of the local statistic as d increases in many
directions from an index location.
SaTScan
Based on a moving circle of varying radii searching for the circle
that is the least likely to have occurred by chance.
FleXScan
Based on spatial scan statistic used on irregularly shaped windows
formed by connecting adjacent neighbors.
Clustering Methods Tests
AMOEBA
• H0: The sum of the observed values within ecotopes is greater
(lesser) than expected by chance.
• The p value is calculated based on the location of the local statistic
values of the observed ecotope within Monte Carlo permutations.
SaTScan
• H0: The sum of observed cases within the circular search region is
proportional to the population size.
• The p value is calculated based on Poisson realizations using the
global rate.
FleXScan
• H0 and p: Same as SaTScan, but within the irregular search region.
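The Monte Carlo idea behind the AMOEBA p value can be sketched as a permutation test. The sketch makes simplifying assumptions: the statistic is the plain sum of values inside a fixed ecotope, values are permuted across all locations, and the test is one-sided for a high-valued ecotope. The slides do not spell out the permutation scheme, so the function name, the default of 999 permutations, and the example data are hypothetical.

```python
import random

def permutation_p_value(values, ecotope, n_perm=999, seed=0):
    """One-sided pseudo p value for a high-valued ecotope.

    The observed sum over `ecotope` is ranked among the sums obtained
    after randomly permuting all values across locations.
    """
    rng = random.Random(seed)
    observed = sum(values[j] for j in ecotope)
    shuffled = list(values)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if sum(shuffled[j] for j in ecotope) >= observed:
            at_least += 1
    # The observed arrangement counts as one of the realizations.
    return (at_least + 1) / (n_perm + 1)

# Hypothetical data: the three largest values sit inside the ecotope.
values = [0.125, -0.25, 2.0, 2.5, 2.25, -0.25, 0.0]
p = permutation_p_value(values, ecotope={2, 3, 4})
```

Because the ecotope already holds the three largest values, only permutations that place those same three values inside it can match the observed sum, so p comes out small.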
Clustering Comparison
                              High Risk Provinces       Low Risk Provinces
                              Cluster    No Cluster     Cluster    No Cluster
Expected (Relative Risk)           38             0           0           178
AMOEBA                             34             4           0           178
SaTScan                            35             3          21           157
FleXScan                           35             3           3           175
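The comparison can be summarized with a little arithmetic. The sketch below assumes the sixteen counts are read column by column, so that each method has four counts: high-risk provinces inside and outside a cluster, then low-risk provinces inside and outside a cluster. The labels "sensitivity" and "specificity" are introduced here for illustration and do not appear in the presentation.

```python
# Counts per method, read as
# (high-risk in cluster, high-risk not in cluster,
#  low-risk in cluster,  low-risk not in cluster).
counts = {
    "Expected": (38, 0, 0, 178),
    "AMOEBA":   (34, 4, 0, 178),
    "SaTScan":  (35, 3, 21, 157),
    "FleXScan": (35, 3, 3, 175),
}

def rates(high_in, high_out, low_in, low_out):
    """Share of high-risk provinces captured and low-risk provinces excluded."""
    sensitivity = high_in / (high_in + high_out)
    specificity = low_out / (low_in + low_out)
    return sensitivity, specificity

summary = {name: rates(*c) for name, c in counts.items()}
for name, (sens, spec) in summary.items():
    print(f"{name:9s} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Under this reading, every row sums to 216, AMOEBA places no low-risk province in a cluster, and SaTScan places 21 there.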