duczmal_JSM2003

A simulated annealing strategy
for the detection of arbitrarily
shaped spatial clusters
LUIZ DUCZMAL and
RENATO ASSUNÇÃO
Department of Statistics
Universidade Federal de Minas Gerais
Laboratório de Estatística Espacial (LESTE)
30161-970 – Belo Horizonte – MG, Brazil
ABSTRACT
We propose a new graph-based strategy for the detection of spatial clusters
of arbitrary geometric form in a map of geo-referenced populations and
cases. Our test statistic is based on the likelihood ratio test previously
formulated by Kulldorff and Nagarwalla for circular clusters. A new
technique of adaptive simulated annealing is developed, focused on the
problem of finding the local maxima of a certain likelihood function over
the space of the connected subgraphs of the graph associated with the
regions of interest. Given a map with n regions, on average this algorithm
finds a quasi-optimal solution after analyzing s n log(n) subgraphs, where
s depends on the uniformity of the case density in the map. The algorithm is
applied to a study of homicide cluster detection in a large Brazilian
metropolitan area.
KEYWORDS: Spatial cluster detection, simulated annealing,
likelihood ratio test, disease clusters, hot-spot detection.
z = a region formed by contiguous areas inside the map
Z = the collection of all regions of a map
p = the probability that an individual is a case in z
q = the probability that an individual is a case outside z
$H_0: p = q$
$H_1: z \in Z,\ p > q$
N = total map population
C = total map cases
$n_z$ = the population inside the region z
$c_z$ = the cases inside the region z
KULLDORFF’S SPATIAL SCAN STATISTIC
$$L(z, p, q) = p^{c_z} (1 - p)^{n_z - c_z} \, q^{C - c_z} (1 - q)^{(N - n_z) - (C - c_z)}$$
$$T = \frac{\sup_{z \in Z,\ p > q} L(z, p, q)}{\sup_{p = q} L(z, p, q)} \qquad (p, q \in [0, 1]).$$
$$\sup_{p = q} L(z, p, q) = \sup_{p \in [0, 1]} p^{C} (1 - p)^{N - C} = \frac{C^{C} (N - C)^{N - C}}{N^{N}} = L_0$$
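$$L(z) = \sup_{p > q} L(z, p, q) =
\begin{cases}
\left(\dfrac{c_z}{n_z}\right)^{c_z} \left(\dfrac{n_z - c_z}{n_z}\right)^{n_z - c_z} \left(\dfrac{C - c_z}{N - n_z}\right)^{C - c_z} \left(\dfrac{(N - n_z) - (C - c_z)}{N - n_z}\right)^{(N - n_z) - (C - c_z)}, & \text{if } \dfrac{c_z}{n_z} > \dfrac{C - c_z}{N - n_z}, \\[2ex]
L_0, & \text{otherwise.}
\end{cases}$$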
$$T = \sup_{z} L(z) / L_0$$
The objective is to find the region z that attains this supremum,
that is, the region that maximizes L(z) / L_0.
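The likelihood ratio above is easier to evaluate on the log scale, since the raw likelihoods underflow for realistic populations. The following is a minimal sketch, not the authors' code: the function names `log_likelihood_ratio` and `_xlogy` are illustrative, and the convention 0 ln 0 = 0 is an assumption made to handle zones with zero cases or zero non-cases.

```python
import math


def _xlogy(x, y):
    """Return x * ln(y), using the convention 0 * ln(0) = 0 (an assumption)."""
    return 0.0 if x == 0 else x * math.log(y)


def log_likelihood_ratio(c_z, n_z, C, N):
    """ln(L(z) / L0) for a zone with c_z cases among n_z individuals.

    C and N are the total cases and total population of the map.
    Returns 0.0 when the rate inside the zone does not exceed the
    rate outside, because L(z) = L0 in that case.
    """
    if not (0 < n_z < N and 0 <= c_z <= min(n_z, C) and C - c_z <= N - n_z):
        raise ValueError("inconsistent counts")
    out_pop = N - n_z    # population outside z
    out_cases = C - c_z  # cases outside z
    # Compare c_z / n_z with (C - c_z) / (N - n_z) without dividing.
    if c_z * out_pop <= out_cases * n_z:
        return 0.0
    log_lz = (
        _xlogy(c_z, c_z / n_z)
        + _xlogy(n_z - c_z, (n_z - c_z) / n_z)
        + _xlogy(out_cases, out_cases / out_pop)
        + _xlogy(out_pop - out_cases, (out_pop - out_cases) / out_pop)
    )
    log_l0 = _xlogy(C, C / N) + _xlogy(N - C, (N - C) / N)
    return log_lz - log_l0
```

With the counts reported later for the Belo Horizonte clusters (90 cases among 285,162 inhabitants, and 111 cases among 227,598 inhabitants, with C = 273 and N = 2,189,630), this sketch gives values close to the reported ln(T) of 35.93 and 84.65.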
The greedy algorithm
[Slide sequence, figures not reproduced: starting from an initial configuration, each step marks the highest-LLR choice among the neighbors of the current cluster and then picks that neighbor. In one step the highest-LLR choice is the removal of a region from the cluster. The sequence ends when the algorithm stops.]
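The greedy walk shown in the slides can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a neighbor is taken to be any connected subgraph obtained by adding one adjacent area or removing one area, and the walk is assumed to stop when no neighbor has a strictly higher LLR (the slides only say that the algorithm stops). The names `llr`, `neighbors` and `greedy_search`, and the dictionaries `adj`, `pop` and `cases`, are hypothetical.

```python
import math


def llr(c_z, n_z, C, N):
    """ln(L(z)/L0); 0 when the inside rate does not exceed the outside rate."""
    if n_z == 0 or n_z == N or c_z * (N - n_z) <= (C - c_z) * n_z:
        return 0.0

    def xlogy(x, y):
        return 0.0 if x == 0 else x * math.log(y)

    m, d = N - n_z, C - c_z  # population and cases outside the zone
    inside = xlogy(c_z, c_z / n_z) + xlogy(n_z - c_z, (n_z - c_z) / n_z)
    outside = xlogy(d, d / m) + xlogy(m - d, (m - d) / m)
    null = xlogy(C, C / N) + xlogy(N - C, (N - C) / N)
    return inside + outside - null


def is_connected(zone, adj):
    """Depth-first check that the areas in `zone` induce a connected subgraph."""
    start = next(iter(zone))
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w in zone and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == len(zone)


def neighbors(zone, adj):
    """Connected subgraphs reachable by adding or removing a single area."""
    border = {w for v in zone for w in adj[v]} - zone
    result = [zone | {w} for w in border]
    for v in zone:
        smaller = zone - {v}
        if smaller and is_connected(smaller, adj):
            result.append(smaller)
    return result


def greedy_search(start, adj, pop, cases):
    """Move to the highest-LLR neighbor until no neighbor improves the LLR."""
    N, C = sum(pop.values()), sum(cases.values())

    def score(zone):
        return llr(sum(cases[v] for v in zone), sum(pop[v] for v in zone), C, N)

    current = frozenset(start)
    while True:
        best = max(neighbors(current, adj), key=score, default=None)
        if best is None or score(best) <= score(current):
            return current, score(current)
        current = best
```

Because each accepted move strictly increases the LLR, the walk terminates, but it can end in a local maximum, which is the limitation the simulated annealing variant is meant to address.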
The S.A. algorithm
[Slide sequence, figures not reproduced: starting from an initial configuration, each step marks the highest-LLR choice and usually picks the highest-LLR neighbor. In one step S.A. chooses another neighbor instead, and the S.A.-chosen neighbor is picked. The following steps again pick the highest-LLR neighbor until the algorithm stops. The final slide also marks the greedy algorithm's choice.]
• High temperature: Uniform random
choice of neighbors.
• Medium temperature: Random choice
with chances proportional to the logarithm
of the likelihood ratio of the neighbors.
• Low temperature: Always choose the
neighbor with the highest likelihood ratio.
• High temperature: Higher mobility, with
no strong preferred direction.
• Medium temperature: Has a higher
probability of choosing a direction with a high
likelihood ratio, but without discarding
other directions.
• Low temperature: Deterministic, always
choosing the neighbor with the highest
likelihood ratio.
• F(G,high) : returns a neighbor chosen uniformly
at random.
• F(G,medium) : returns a neighbor chosen at
random with probability proportional to the
logarithm of the likelihood ratio of the neighbors.
• F(G,low) : returns the neighbor with the highest
likelihood ratio.
• H(G) : returns a neighbor chosen at random
among the neighbors of the area chosen in the last
step.
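The three temperature regimes of F can be sketched as a single selection function. This is a minimal illustration, assuming the neighbors and their log-likelihood ratios are already available as parallel lists; the name `choose_neighbor`, the string labels for the temperature, and the fallback to a uniform draw when every log-likelihood ratio is zero are assumptions not stated in the slides. H(G) is omitted because it needs the area chosen in the previous step.

```python
import random


def choose_neighbor(neighbors, llr_values, temperature, rng=random):
    """Pick one neighbor according to the temperature regime.

    neighbors   -- list of candidate subgraphs
    llr_values  -- list of ln(likelihood ratio), one per neighbor (all >= 0)
    temperature -- "high", "medium" or "low" (hypothetical labels)
    rng         -- the random module or a random.Random instance
    """
    if temperature == "high":
        # Uniform random choice among the neighbors.
        return rng.choice(neighbors)
    if temperature == "medium":
        # Probability proportional to the log-likelihood ratio.
        weights = [max(v, 0.0) for v in llr_values]
        if sum(weights) == 0:
            # Assumed fallback: all weights are zero, so draw uniformly.
            return rng.choice(neighbors)
        return rng.choices(neighbors, weights=weights, k=1)[0]
    if temperature == "low":
        # Deterministic: the neighbor with the highest likelihood ratio.
        best = max(range(len(neighbors)), key=llr_values.__getitem__)
        return neighbors[best]
    raise ValueError("unknown temperature: %r" % (temperature,))
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing surveys started from the same initial subgraph.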
• Whether a neighbor with a higher L-value was found (hL=1) or not (hL=0)
at the current step;
• The number (cs) of consecutive steps in which no new subgraph
with L-value > 1 was found;
• The number (vb) of times that the current
subgraph has been visited before in the survey;
• The number (cv) of vertices in common between
the current subgraph and the highest-valued one found so far
in the survey.
The basic survey algorithm
select randomly an initial connected subgraph G;
do {
    find the set N(G) of all connected subgraphs that are neighbors of G;
    compute L for all new subgraphs in N(G);
    compute hL, cs, vb, cv, cs_threshold and cs_threshold_2;
    if (cs > cs_threshold_2) G := F(G, high);
    else {
        if ((hL = 0) and (vb > vb_threshold_2)) G := F(G, medium);
        else if ((hL = 0) or (vb > vb_threshold_2)) G := F(G, low);
        else G := H(G);
    }
} while ((cs <= cs_threshold) and (vb <= vb_threshold));
cs_threshold = cv
cs_threshold_2 = cv / 2
vb_threshold is fixed; it was determined empirically and lies
between 6 and 10 in most situations
vb_threshold_2 = vb_threshold / 2
Thus F(G, high) is adopted if the current subgraph has a relatively low
L-value, has been visited many times, and for several steps of the survey the
L-values of the subgraphs have not increased.
F(G, medium) is used if the current subgraph has a relatively low L-value
and has been visited many times, but there has been an increase in the
L-value for some recently surveyed subgraph.
F(G, low) is used if there has been an increase in the L-value for some
recently surveyed subgraph, but at least one of the following conditions
is true: the current subgraph has a relatively low L-value, or it has been
visited many times.
Finally, H(G) is applied when the current subgraph has a relatively high
L-value, has not been visited many times, and there has been an increase in
the L-value for some recently surveyed subgraph.
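The branching of the survey loop can be written as a small decision function, which makes it easy to check which move is selected for a given state. This is a sketch that mirrors the conditions of the survey algorithm as stated; the function names and return labels are hypothetical, and the default `vb_threshold=8` is only an assumed value inside the 6 to 10 range mentioned in the slides.

```python
def select_move(hL, cs, vb, cv, vb_threshold=8):
    """Return which move the survey applies at the current step.

    hL -- 1 if a neighbor with a higher L-value was found, else 0
    cs -- consecutive steps without a new subgraph with L-value > 1
    vb -- number of previous visits to the current subgraph
    cv -- vertices shared with the best subgraph found so far
    """
    cs_threshold_2 = cv / 2
    vb_threshold_2 = vb_threshold / 2
    if cs > cs_threshold_2:
        return "F_high"
    if hL == 0 and vb > vb_threshold_2:
        return "F_medium"
    if hL == 0 or vb > vb_threshold_2:
        return "F_low"
    return "H"


def should_continue(cs, vb, cv, vb_threshold=8):
    """Loop condition of the survey: cs <= cs_threshold and vb <= vb_threshold."""
    cs_threshold = cv
    return cs <= cs_threshold and vb <= vb_threshold
```

For example, `select_move(hL=1, cs=1, vb=2, cv=6)` falls through every test and selects H(G), while raising `cs` above `cv / 2` selects F(G, high) regardless of the other values.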
The initial subgraph for the first run may be chosen by
Kulldorff's SaTScan algorithm.
The basic survey algorithm is then repeated several times,
with randomly chosen initial subgraphs.
Ribeirão das Neves City
Brazil - Year 2000
259,065 inhabitants
50 homicide attempts
112 regions
Eliane Rocha & Luiz Duczmal
Testing the significance of the most likely cluster
The whole process is then repeated several hundred
times with random allocations of the cases, each area receiving
cases with probability proportional to its population.
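The significance test described above can be sketched as a Monte Carlo loop. This is an illustration under assumptions: `search_statistic` is a hypothetical callable that runs the cluster search on a case allocation and returns the maximum LLR found, and the rank-based p-value (exceed + 1) / (runs + 1) is a common Monte Carlo convention that the slides do not state explicitly.

```python
import random


def monte_carlo_p_value(observed, search_statistic, pop, total_cases,
                        runs=999, seed=0):
    """Estimate the p-value of `observed` under the null hypothesis.

    Each replication allocates `total_cases` cases at random, every case
    landing in an area with probability proportional to its population,
    and then recomputes the search statistic on the simulated map.
    """
    rng = random.Random(seed)
    areas = list(pop)
    weights = [pop[a] for a in areas]
    exceed = 0
    for _ in range(runs):
        simulated = dict.fromkeys(areas, 0)
        for a in rng.choices(areas, weights=weights, k=total_cases):
            simulated[a] += 1
        if search_statistic(simulated) >= observed:
            exceed += 1
    # Assumed convention: the observed map counts as one of the replications.
    return (exceed + 1) / (runs + 1)
```

In practice `search_statistic` would wrap the full simulated annealing survey, so the cost of the test is the cost of one survey multiplied by the number of replications.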
[Figure: histogram of LLR values from 10,000 runs of the S.A. algorithm under the null hypothesis; horizontal axis: LLR.]
The average number of transitions from state 0 to state k is asymptotically given by
$s^{-1} k \ln(k)$.
The modified Cereal Box Problem
A cereal company randomly distributes n
kinds of gifts, one gift in each of its boxes.
Little Joe wants to collect all n gifts, but
his little sister throws away the gift from each box
with probability (1-s).
On average, how many boxes are needed in
order to complete Joe's collection?
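A short calculation answers the question: when j kinds of gifts are still missing, a box yields a new, kept gift with probability s j / n, so the expected wait is n / (s j); summing over j = 1, ..., n gives (n / s)(1 + 1/2 + ... + 1/n), which grows like s^{-1} n ln(n). The sketch below evaluates this expression and checks it by simulation; the function names are illustrative and the gifts are assumed to be equally likely.

```python
import random


def expected_boxes(n, s):
    """Exact expectation (n / s) * H_n, with H_n the n-th harmonic number."""
    return (n / s) * sum(1.0 / j for j in range(1, n + 1))


def simulate_boxes(n, s, rng):
    """Open boxes until all n gifts are collected; each gift is kept with probability s."""
    collected = set()
    boxes = 0
    while len(collected) < n:
        boxes += 1
        gift = rng.randrange(n)   # gift kinds assumed equally likely
        if rng.random() < s:      # the sister lets this gift through
            collected.add(gift)
    return boxes


def mean_simulated_boxes(n, s, trials=2000, seed=0):
    """Average number of boxes over repeated simulated collections."""
    rng = random.Random(seed)
    return sum(simulate_boxes(n, s, rng) for _ in range(trials)) / trials
```

For instance, `expected_boxes(2, 0.5)` is (2 / 0.5)(1 + 1/2) = 6, and the simulated mean for moderate n stays close to the exact expectation.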
FIGURE 6: The homicide incidence map in the city of Belo Horizonte. Legend: homicides per 100,000 inhabitants, with legend values 0, 5, 8, 10, 13, 17, 21, 29, 41, 64 and >80.
Circular cluster found by
Kulldorff's SaTScan algorithm
Belo Horizonte city metropolitan area
Map of homicide cases, year 1995
Total population: 2,189,630
Total cases: 273
Overall case density (per 100,000): 12.47
Number of areas: 240
Yellow indicates areas with zero cases
Arbitrarily shaped cluster found by
the Simulated Annealing algorithm
algorithm            cluster size  population  cases  density (per 100,000)  ln(T)  mean{ln(T)}  p-value
Kulldorff            27            285,162     90     31.56                  35.93  5.14         0.000
Simulated Annealing  24            227,598     111    48.77                  84.65  19.91        0.000
area  population  cases  density (per 100,000)
A     17,363      1      5.76
B     18,257      2      10.95
C     21,655      2      9.24
Future work
• Power tests;
• Penalty functions;
• How to speed up the convergence?
• How to reach a good sub-optimal solution?
• How do we know if a good solution is attained?
• Space-time clusters.