On Spatial Aspects of Data

A Primer for Bayesian Spatial Probits
With an Application to Deforestation in Madagascar
Companion Paper for the Policy Research Report on Forests, Environment, and Livelihoods
Timothy S. Thomas
The World Bank
Development Economics Research Group
The findings, interpretations, and conclusions expressed in this paper do not necessarily reflect the views of the Executive Directors of The World Bank or the governments they represent. The World Bank does not guarantee the
accuracy of the data included in this work. The boundaries, colors, denominations, and other information shown on
any map in this work do not imply any judgment on the part of The World Bank concerning the legal status of any
territory or the endorsement or acceptance of such boundaries.
19 June 2007, Washington, DC
Version 3.1
Introduction
This paper is a companion paper for the World Bank’s Policy Research Report (PRR) on tropical deforestation, At Loggerheads? Agricultural Expansion, Poverty Reduction, and Environment in the Tropical Forests (Chomitz, 2007). This paper describes how to use simulation to
solve regressions on dichotomous variables that may be impacted by spatial lags or spatially autocorrelated errors. Most of what is written can be found elsewhere, but this consolidates the details in one place, provides a convenient background to the freely available MatLab programs
that the author has written for Bayesian spatial analysis1, and provides a sort of technical appendix to another companion paper to the PRR, called "Impact of Economic Policy Options on Deforestation in Madagascar".
Just as researchers necessarily account for correlations in data that have a time dimension, we must likewise account for correlations in data that have a spatial or geographic dimension. The standard assumption in regressions is that observations are independent of one
another. But in many cases, it seems more likely that observations that are "near" in a geographic (or sometimes even in a social) sense either influence each other more than an observation that is far away, or are more influenced by the same unmeasured force than an observation that is far away.
Many studies simply ignore the potential interaction between observations, but doing so
could lead to inefficient and sometimes inconsistent estimates. For example, any analysis
whose units of observation are administrative units (like countries and states) probably ought to account for the geographic nature of the variables -- and yet most studies of this nature do not take
1 For the programs, go to www.timthomas.net.
spatial nearness into account. Nevertheless, spatial analyses are becoming more mainstream.
Anselin (1980, 1988), Hepple (1979), and Hordijk and Paelinck (1976) were among the first to
apply spatial methodologies to economic data. The field of spatial statistics developed even earlier (Moran 1948, 1950a, 1950b; Geary 1954; Cliff and Ord 1972, 1973). But von Thünen was
one of the earliest to understand the importance of geographic location in economic analysis
(Von Thünen 1826, 1966; Samuelson 1983).
Even if we do not believe that we need to take spatial nearness into consideration due to the
dependent variables, in some cases we ought to consider it for the sake of the explanatory variables. Many explanatory variables we use in our analyses are geographic in nature, like nearness
to roads or cities; rainfall; or soil characteristics. When we use an explanatory variable that is
inherently spatial, there is a reasonable chance that it is measured with some imprecision. It is
not unusual for GIS2 data to have imprecisely defined boundaries (e.g., for a soils map, where
one type of soil ends and the other begins is often approximate), or for information from one GIS
layer to be slightly misaligned geographically with another layer. This type of measurement error is likely to induce spatial error into the problem, which ultimately ought to be reflected in the
residual of the regression analysis.
Weights matrices
One of the main differences between time series analysis and spatial analysis is that time is
linear and is ordered in the sense that we can always know which observations precede and follow any given observation. We know that the year 2003 comes after the year 2002, and 2001
2 GIS is short for "geographical information system", which in practice is digital data which is geographically referenced (has x and y coordinates).
comes immediately before it. This is not the case in spatial analysis, where it is normal for one
observation to have several neighbors, and there is no natural ordering. For example, Canada
and Mexico are neighbors of the United States, but the United States is also a neighbor of Canada.
In order to specify the relationship between observations, most spatial analyses use weights
matrices. The weights matrix tells how observation i, represented by values in row i, relates to
another observation j, given in column j. By convention, the weights matrix assumes that the
weight of observation i on observation i is 0 for all i.
Most analyses assume many zeroes in the weights matrix. In part, this is because we would
expect the direct interaction between distant observations to be very small if not zero, and in part because the mathematical operations required for this type of analysis -- including storing an N by
N matrix, and having to invert it -- limit the size of the analysis if the matrix is not sparse (i.e.,
having many zeroes). The memory requirements of square matrices can be considerable. In the
case we will consider later, we have 5,661 data points. The square matrices A and B have
32,046,921 elements, and if the computer uses 8 bytes per element, then it requires 256,375,368
bytes for storing each one. The matrix we use, however, is sparse, and to save it in sparse format
takes less than 5,142,506 bytes – and this is larger than most weight matrices, because on average there are 47 neighbors for each observation in our example. With only 4 neighbors, it would
be less than 500 kilobytes, while the full (non-sparse) matrix would still take up 256 MB of
memory.
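The storage arithmetic above can be checked directly. The sketch below uses a random sparse matrix only as a stand-in for a real weights matrix with roughly 47 nonzeros per row:

```python
import numpy as np
from scipy import sparse

n = 5661
dense_bytes = n * n * 8                      # 8-byte doubles, dense storage
# stand-in sparse matrix with about 47 nonzeros per row
W = sparse.random(n, n, density=47 / n, format="csr", random_state=0)
sparse_bytes = W.data.nbytes + W.indices.nbytes + W.indptr.nbytes

print(dense_bytes)    # 256375368 bytes, as in the text
print(sparse_bytes)   # a few megabytes in compressed sparse row format
```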
The literature often talks about a row standardized weights matrix, which simply means each
row is normalized so that it sums to 1. This alleviates the problem with "holes" in the log determinant (Bell and Bockstael 2000), and it allows us to interpret the spatial parameter as a correlation measure, which in the analogy to time series, could range from -1 to 1 (and in most spatial
analyses, would be expected to range from 0 to 1).
Some standard weights matrices (before row-standardization) are as follows:

• Each neighbor (sharing any portion of a border) is given a weight of 1. This is particularly used with polygons, and less so with points. See Figure 1 for an example of each type. The figure on the left represents firaisana boundaries in Madagascar (these are like counties). The figure on the right represents points evenly spaced on a 20 kilometer grid.

[ Insert Figure 1 near here ]

• N nearest neighbors are given a weight of 1, where N is arbitrarily chosen by the researcher.

• Distance-weighted neighbors (so that the nearest neighbors are given the highest weight, perhaps by using inverse distance or inverse distance squared). Note that in order for the analysis to use sparse matrices -- an important memory saver when there are more than a thousand observations -- the researcher usually assigns a maximum distance, beyond which the value in the matrix is 0.
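A sketch of the second and third schemes, plus row standardization, for a handful of points (Euclidean distances; all values illustrative):

```python
import numpy as np

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)

# N nearest neighbors (here N = 2): weight 1 for the two closest points
k = 2
W_knn = np.zeros_like(d)
for i in range(len(pts)):
    nearest = np.argsort(d[i])[1:k + 1]      # skip self at distance 0
    W_knn[i, nearest] = 1.0

# distance-weighted neighbors (inverse distance) with a maximum-distance cutoff
cutoff = 2.0
W_dist = np.zeros_like(d)
mask = (d > 0) & (d <= cutoff)
W_dist[mask] = 1.0 / d[mask]

# row standardization: each row divided by its sum
row_sums = W_knn.sum(axis=1, keepdims=True)
W_row = W_knn / row_sums
```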
While it makes sense to row-standardize the first two types of weights matrices, Bell and
Bockstael discuss the advantages and disadvantages of doing so to weights matrices based on
distance weights. The advantage is related to the eigenvalues, but the disadvantage is that one
would expect that absolute distance ought to be an important measure of influence of each
neighbor, and that these influences should be consistent across all observations.
In agreement with Bell and Bockstael, let's consider an example of how row standardization
in this case goes against intuition about how neighbors ought to influence each other. Suppose
that observation 1 has only two neighbors, each at the 10 kilometer cut-off. In other words, observation 1 is fairly isolated. In my work with deforestation studies, this might represent a small
clump of trees somewhat distant from other clumps and larger contiguous forests. Suppose observation 2 has eight neighbors at a 1 kilometer distance, and many more would be expected in
the 2 to 10 kilometer distance. In my deforestation work, this would be trees in the middle of a
large contiguous forest. To simplify the mathematical calculation, however, let’s assume that
there are no neighbors in the 2 to 10 kilometer range. Row standardization would tell us that
each of the 8 neighbors of observation 2 has an influence weighted at 0.125, while observation 1's two neighbors, at a great distance of 10 kilometers each
from observation 1, have weights of 0.5 each. As a result, the row standardized weights matrix
tells us that observation 1's neighbors have much stronger influence on observation 1 than do
each of observation 2's neighbors on observation 2, even though the latter are considerably closer.
An ad hoc solution to this dilemma is what I call "matrix-standardization". In matrix standardization, we sum all of the rows, and then divide each element in the matrix by the maximum
of the sums. In this way, no row of the "matrix-standardized" weights matrix will sum to greater
than 1, and the spatial parameter ought to still be able to be thought of as a correlation measure,
though technically no longer restricted to be less than 1 in absolute value.
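A minimal sketch of this matrix standardization, contrasted with row standardization, for a toy 3 x 3 weights matrix:

```python
import numpy as np

W = np.array([[0.0, 0.1, 0.1],   # isolated observation: two weak neighbors
              [1.0, 0.0, 1.0],   # well-connected observation
              [0.5, 1.0, 0.0]])

# matrix standardization: divide every element by the maximum row sum,
# so no row sums to more than 1 but relative distances are preserved
W_mat = W / W.sum(axis=1).max()

# row standardization, for contrast, forces every row to sum to 1 and
# erases the difference between isolated and well-connected rows
W_row = W / W.sum(axis=1, keepdims=True)
```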
Conley (1996, 1999) and others (see Hepple 1995a for examples) propose that, instead of using
weights matrices to explain how the errors are correlated, we could assume that the residual covariances are a function of distance only. Conley then nonparametrically estimates the covariance matrix (which, as we shall see shortly, is a function of the inverse of the weights matrix in
the normal approach to spatial econometrics). Conley assumes homoskedasticity, which differs
from the approach using the weights matrix, because that approach generally implies heteroscedasticity, though not necessarily a large amount.
In this paper, we will use the method of Anselin and others to impose weights matrices. Incidentally, Anselin (2002, p. 257) shows a simple example of polygons in a plane, the
non-row-standardized neighbors matrix resulting from the arrangement of the polygons, and the
row-standardized weights matrix. This may be helpful to anyone having difficulty picturing
what I am describing here.
Gibbs Sampling in the Spatial Probit
Those who are interested in learning more about Bayesian econometrics can reference a number of useful texts (Lancaster 2004; Koop 2003; Gill 2002; Carlin and Louis 2000; Gelman et al.
1995). These texts include information about the Gibbs Sampler. Other texts are more focused
on sampling methodologies and Markov Chain Monte Carlo (MCMC) methods, of which the
Gibbs Sampler is one (Robert and Casella 1999; Gilks et al. 1996). The best book to learn
Bayesian methods for spatial regressions is LeSage's web book (1999a). Much of what is presented in this paper is outlined in LeSage (1999a, especially pages 99-104, 107-110, and 152-154) and Fleming (2004, although I present more of the details, especially in regard to the full
spatial model). Beron and Vijverberg (2004) is also a good source.
The Gibbs Sampling algorithm is both simple and powerful. It requires that we be able to
specify the distribution of each parameter conditional on the other parameters. The algorithm
begins by taking a draw from the conditional distribution associated with the first parameter or
set of parameters. For example, we will show shortly that β (the parameter vector multiplied by the
explanatory variables, X) is distributed normally, with a known mean and variance that depend
upon the other parameters. A draw simply means randomly selecting a value from the normal distribution with the given mean and variance. The algorithm proceeds sequentially through each parameter, taking a draw from each of the conditional distributions. After a draw for each parameter has been made, the iteration is complete, and the resulting parameter simulations are
saved.3
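As a concrete (non-spatial) illustration of this loop, here is a minimal Gibbs sampler for an ordinary linear regression with flat priors, cycling between the conditional for β (normal) and the conditional for σ² (inverse chi-squared); the data are simulated purely for the example:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y          # OLS mean of the beta conditional
sigma2, draws = 1.0, []
for it in range(1500):
    # draw beta | sigma2: normal with mean beta_hat, variance sigma2 (X'X)^-1
    beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
    # draw sigma2 | beta: residual sum of squares over a chi-squared(n) draw
    e = y - X @ beta
    sigma2 = (e @ e) / rng.chisquare(n)
    if it >= 500:                     # discard the "burn in" iterations
        draws.append(beta)

beta_mean = np.mean(draws, axis=0)
```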
The advantage of the Gibbs Sampler for use with the probits and tobits -- and especially the
spatial probit -- is that it simulates the continuous latent variable and then treats the data like a
regular linear regression, in the sense that it treats the simulated variable as if it were the actual
variable (Albert and Chib 1993). We might think of each value for the latent variable as being a
parameter. Most of what I write in this paper is applicable to spatial analyses whether the dependent variable is known or unknown.
3 They are saved after the "burn in" period is finished. "Burn in" is the set of omitted iterations that take the sampler from the starting point, which was likely to be far from the true parameters, to a point where the simulations are likely at or near the true parameters.

There are three types of spatial models that we will discuss here:

• The spatially autocorrelated lag model (SAL, and sometimes SAR) is given by y = ρWy + Xβ + u, where u is iid normal, with mean of 0 and variance of σ²V, where V reflects heteroscedasticity and is therefore diagonal, but not all ones. The SAL is equivalently written as y = (I – ρW)-1Xβ + (I – ρW)-1u. The weights matrix is W, and the spatial parameter is ρ.

• The spatially autocorrelated error model (SAE, and sometimes SEM) is given by y = Xβ + ε, where ε = λWε + u, and where u is iid normal, with mean of 0 and variance of σ²V. This is rewritten as y = Xβ + (I – λW)-1u, which could also be written as y = λWy + Xβ – λWXβ + u. Here, the spatial parameter is λ.

• The full spatial model (SAC), which combines the other two, is given by y = ρW1y + Xβ + ε, where ε = λW2ε + u, and where u is iid normal, with mean of 0 and variance of σ²V. This is rewritten as y = (I – ρW1)-1Xβ + (I – ρW1)-1(I – λW2)-1u. Sometimes W1 = W2.

The mathematics can be notationally challenging, so to generalize, we follow LeSage's simplifying notation, in which all three models can be written as y = A-1Xβ + A-1B-1u, or Ay = Xβ + B-1u, or BAy = BXβ + u. Here, A = I in the SEM, and (I – ρW1) in the SAR and SAC; B = I in the SAR, and (I – λW2) in the SEM and SAC. This is equivalent to saying that ρ is set equal to 0 in the SEM, while λ is set equal to 0 in the SAR, and that A is always equal to (I – ρW1), while B
is always equal to (I – λW2).
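Under this notation, the A and B matrices for the three models can be assembled with sparse matrices. The following sketch uses a toy 4-observation chain and illustrative values of ρ and λ:

```python
import numpy as np
from scipy import sparse

n = 4
Wd = np.array([[0, 1, 0, 0],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [0, 0, 1, 0]], dtype=float)
Wd = Wd / Wd.sum(axis=1, keepdims=True)    # row-standardize
W = sparse.csr_matrix(Wd)
I = sparse.identity(n, format="csr")
rho, lam = 0.5, 0.3                        # illustrative spatial parameters

A_sar, B_sar = I - rho * W, I              # SAR/SAL: lambda set to 0
A_sem, B_sem = I, I - lam * W              # SEM/SAE: rho set to 0
A_sac, B_sac = I - rho * W, I - lam * W    # SAC: both spatial terms
```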
Consider the multivariate regression from a Bayesian approach, using Gibbs sampling. The likelihood function can be written as

\[ L(\beta,\sigma) \propto (2\pi\sigma^2)^{-N/2}\,|V|^{-1/2}\,|A|\,|B|\, \exp\!\left[-\frac{1}{2\sigma^2}(Ay - X\beta)'B'V^{-1}B(Ay - X\beta)\right] \]

or, using more shorthand notation, as

\[ L(\beta,\sigma) \propto (2\pi\sigma^2)^{-N/2}\,|V|^{-1/2}\,|A|\,|B|\, \exp\!\left[-\frac{1}{2\sigma^2}\,e'B'V^{-1}Be\right] \]

where we define e = Ay – Xβ.

This can be re-written as a likelihood function for y*, which is

\[ L(\beta,\sigma) \propto (2\pi\sigma^2)^{-N/2}\,|V|^{-1/2}\,|A|\,|B|\, \exp\!\left[-\frac{1}{2\sigma^2}(y^* - X^*\beta)'(y^* - X^*\beta)\right] \]

where X* = PBX, y* = PBAy, and V-1 = P′P. Since we assumed that V reflects only heteroscedasticity, the off-diagonal terms are all 0, and P is equal to the square root of the reciprocal of
each element of V. That is, Pii = 1 / sqrt(Vii); and Pij = Vij = 0 if i ≠ j.
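A small numerical sketch of this transformation, with an illustrative upper-triangular stand-in for B and an arbitrary diagonal V (not estimates from any real model):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 2
X = rng.normal(size=(n, k))
y = rng.normal(size=n)
A = np.eye(n)                                      # SEM case: A = I
B = np.eye(n) - 0.4 * np.diag(np.ones(n - 1), 1)   # toy stand-in for I - lam*W
V = np.array([1.0, 2.0, 0.5, 1.5, 1.0, 3.0])       # heteroscedastic scalings
P = np.diag(1.0 / np.sqrt(V))                      # so that P'P = V^{-1}

y_star = P @ B @ A @ y
X_star = P @ B @ X
```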
Koop (p. 36) factors a likelihood similar to this into something that looks like a product of a normal distribution for β and a gamma or chi-squared distribution for the inverse of σ². In our case, it would give us

\[ L \propto |V|^{-1/2}\,|A|\,|B|\,(2\pi\sigma^2)^{-k/2} \exp\!\left[-\frac{1}{2\sigma^2}(\beta-\hat\beta)'X^{*\prime}X^{*}(\beta-\hat\beta)\right] (\sigma^2)^{-(N-k)/2} \exp\!\left[-\frac{(N-k)s^2}{2\sigma^2}\right] \]

where N is the number of observations, k is the length of β,

\[ \hat\beta = (X^{*\prime}X^{*})^{-1}X^{*\prime}y^{*} \]

and

\[ s^2 = \frac{(y^{*} - X^{*}\hat\beta)'(y^{*} - X^{*}\hat\beta)}{N-k} \]

Here, β has a mean of β̂ and a variance of σ²(X*′X*)-1; and the inverse of σ² is distributed with N – k degrees of freedom.
In the Bayesian approach, the posterior is equal to the prior times the likelihood function. The prior is simply a measure of our beliefs about the parameters before we do the experiment (i.e., the data analysis). Often we have very few expectations about the values of the parameters, so in the case of β we might use a prior with mean equal to the zero vector, a large variance, and no covariance. Our objective in the Bayesian approach is to find the parameters of the posterior.
Suppose we assume a prior for β that is distributed normally with a mean vector of c and a covariance matrix of σ²T. The prior can be written as

\[ p(\beta) \propto (2\pi\sigma^2)^{-k/2}\,|T|^{-1/2} \exp\!\left[-\frac{1}{2\sigma^2}(\beta-c)'T^{-1}(\beta-c)\right] \]

Looking only at the kernel of the posterior with respect to β, we get

\[ p \propto \exp\!\left[-\frac{1}{2\sigma^2}(\beta-\hat\beta)'X^{*\prime}X^{*}(\beta-\hat\beta) - \frac{1}{2\sigma^2}(\beta-c)'T^{-1}(\beta-c)\right] \]

Multiplying and rearranging terms, we find that the kernel of the posterior is proportional to

\[ \exp\!\left[-\tfrac{1}{2}(\beta-\tilde\beta)'H(\beta-\tilde\beta)\right] \]

where

\[ H = \frac{X^{*\prime}X^{*}}{\sigma^2} + \frac{T^{-1}}{\sigma^2} \qquad\qquad \tilde\beta = H^{-1}\left(\frac{X^{*\prime}y^{*}}{\sigma^2} + \frac{T^{-1}c}{\sigma^2}\right) \]

This shows that β is normally distributed with a mean of β̃ and a variance of H-1.

If we do not have very strong prior beliefs about β, then the covariance matrix given by σ²T would be large. When this is the case, the second term in H becomes small relative to the first term, and H approaches X*′X*/σ². At the same time, the second term in brackets in the expression for β̃ becomes small relative to the first bracketed term, so that β̃ approaches
(X*′X*)-1X*′y*.
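A sketch of the resulting Gibbs draw for β, using a diffuse prior (c = 0, T large) so that β̃ collapses toward the least-squares value; the data are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 50, 3
X_star = rng.normal(size=(n, k))
y_star = X_star @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

sigma2 = 1.0
c = np.zeros(k)                 # prior mean
T = np.eye(k) * 1e6             # very diffuse prior covariance (times sigma2)
T_inv = np.linalg.inv(T)

# posterior precision H and mean beta_tilde, as in the expressions above
H = (X_star.T @ X_star) / sigma2 + T_inv / sigma2
beta_tilde = np.linalg.solve(H, X_star.T @ y_star / sigma2 + T_inv @ c / sigma2)
# one Gibbs draw: beta ~ N(beta_tilde, H^{-1})
beta_draw = rng.multivariate_normal(beta_tilde, np.linalg.inv(H))
```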
LeSage (1999a), following Geweke (1993), says that

\[ \frac{(y^{*} - X^{*}\hat\beta)'(y^{*} - X^{*}\hat\beta)}{\sigma^2} \]

is distributed as a chi-squared with N degrees of freedom. However, if we assume an improper prior proportional to σ-2 for σ², and multiply this prior by the term in the second bracket of Koop's rearrangement of the normal likelihood function for β and σ, we see that the resulting posterior appears to be distributed with N – k degrees of freedom instead, since the chi-squared distribution is given by

\[ p(z) \propto \frac{z^{d/2-1}\,e^{-z/2}}{2^{d/2}\,\Gamma(d/2)} \]

where d is the degrees of freedom and z is the measure in question. The reason for the difference, according to Geweke, is that in the conditional, which is what we are interested in with
Gibbs Sampling, we gain k degrees of freedom because we conditioned on the β.
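Treating the residual sum of squares as given, a draw of σ² from this conditional can be sketched as follows (residuals simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500
e = rng.normal(scale=2.0, size=N)        # stand-in residuals y* - X*beta
# sigma^2 draw: residual sum of squares over a chi-squared(N) variate,
# i.e., a scaled inverse chi-squared with N degrees of freedom
sigma2_draw = (e @ e) / rng.chisquare(N)
```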
Let X̃ = BX and ỹ = BAy. Assume that our prior is that r / Vii is distributed as chi-squared with r degrees of freedom. Then we find from Geweke that

\[ \frac{(\tilde y_i - \tilde X_i\hat\beta)^2/\sigma^2 + r}{V_{ii}} \]

is distributed as a chi-squared with r + 1 degrees of freedom. Geweke (1993) shows that this very special kind of heteroscedasticity is equivalent to assuming that the error is distributed as a t-distribution with r + 1 degrees of freedom. That is, by making a prior assumption about the nature of the heteroskedasticity, we are able to control the degrees of freedom in the error term. Choosing a large value for r makes the t-distribution behave like a normal distribution, while
choosing a value near 4 makes the t-distribution behave like a logistic distribution.
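A sketch of the corresponding draws of the Vii terms, using the χ²(r + 1) conditional above with illustrative residuals and r = 4:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, sigma2 = 1000, 4.0, 1.0
e = rng.standard_t(df=r + 1, size=n)     # t-distributed errors, per Geweke
# V_ii | rest: (e_i^2 / sigma^2 + r) / V_ii ~ chi-squared(r + 1),
# so draw the chi-squared variate and solve for V_ii
V = (e**2 / sigma2 + r) / rng.chisquare(r + 1, size=n)
```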
It would be helpful to discuss the covariance terms, σ² and V. In the maximum likelihood probit, one is unable to compute σ because it is only found in a ratio with β, and β is not found apart from σ. However, when simulating the latent variable y, the β is found apart from σ, so we are able to compute it, if we choose to. If our interest is in comparing the results to the maximum likelihood results, then there may be value in fixing σ to be equal to 1. Otherwise, we could get a better fit if we let it vary, at least in the case of homoscedasticity.

When considering heteroscedasticity with maximum likelihood, we cannot estimate a parameterized covariance matrix and σ simultaneously. For example, we could not hope to estimate a σi defined as σ0·exp(γ0 + γ1Zi), since ln(σ0) and γ0 are collinear (you can see this by taking the log of this proposed expression for σi). In the case of the Gibbs Sampler, however, we do not parameterize the covariance matrix V. Instead, we assume that it is drawn from a chi-squared distribution with r degrees of freedom, where we choose r based on the degree of heteroscedasticity that we believe is in the data (small values of r represent greater heteroscedasticity). Since this distribution has a mean of r and a variance of 2r, unless we allow for the possibility that σ is not equal to 1, the covariance matrix is fixed in scale. From that perspective, it is entirely reasonable and desirable to simulate both σ² and V. On the other hand, the mean of the diagonal elements of V will not be 1, in general, though it will likely be near 1, and so the meaning of σ² is less clear. If the true variance has a mean much different from 1, allowing σ² to vary would detect this. In the end, the choice depends a lot on preferences, on whether multicollinearity interferes with jointly estimating σ² and V, and on whether we need to compare the parameters to maximum likelihood probit estimates. I did note in checking the linear model that restricting σ² to be equal to 1 yields very bad results, but for our case where the dependent variable is dichotomous, we will likely restrict σ² to be 1.
Simulating the Latent Variable
Recall that y = A-1Xβ + A-1B-1u. Because E[A-1B-1u] is 0 when u is iid with mean 0 and variance V, E(y) is A-1Xβ. However, while the mean of the error term is zero, its variance is not simple. The variance is given by Σ = A-1B-1V(B-1)′(A-1)′, and the inverse of the variance is given by Ω = Σ-1 = A′B′V-1BA. This tells us that the error term is distributed as a multivariate normal (MVN).

We have just shown very simply that all we need to do to simulate the latent dependent variable y is to take a draw from a MVN. All that follows in this section is the mathematics involved in re-framing how to express the variance of this particular MVN as something that is computationally easier (i.e., quicker) for a computer to calculate. Anticipating the need for a shorter notation for the variance and its inverse, let Σ be used for the variance, and Ω = Σ-1 = A′B′V-1BA be used for the inverse variance.

The pdf for the MVN distribution is given by4

\[ f(y) = (2\pi)^{-N/2}\,|\Sigma|^{-1/2} \exp\!\left[-\tfrac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)\right] \tag{1} \]
If we partition y into two blocks, y1 and y2 (which may both be vectors, or one a vector and the other a scalar), then the joint distribution is given by

\[ f(y_1,y_2) = (2\pi)^{-N/2} \begin{vmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{vmatrix}^{-1/2} \exp\!\left[-\tfrac{1}{2} \begin{pmatrix} y_1-\mu_1 \\ y_2-\mu_2 \end{pmatrix}' \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} \begin{pmatrix} y_1-\mu_1 \\ y_2-\mu_2 \end{pmatrix} \right] \tag{2} \]

4 Just to be clear, for our specific problem, μ = A-1Xβ.

The determinant of a partitioned matrix can be written as
11
12
 21  22
1
1
  22 11  12  22  21  11  22   2111 12
(3)
The inverse of a partitioned matrix can be written as

\[ \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}^{-1} = \begin{pmatrix} (\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1} & -(\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1}\Sigma_{12}\Sigma_{22}^{-1} \\ -(\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12})^{-1}\Sigma_{21}\Sigma_{11}^{-1} & (\Sigma_{22} - \Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12})^{-1} \end{pmatrix} \tag{4} \]
Plugging the equations for the determinant and inverse of a partitioned matrix into the equation for the joint distribution and rearranging terms (this is no small exercise!), we verify that the joint distribution is equal to the conditional distribution times the marginal distribution. That is, f(y1,y2) = f(y1|y2) f(y2), where

\[ f(y_1 \mid y_2) = (2\pi)^{-N_1/2}\,|\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}|^{-1/2} \exp\!\Big[-\tfrac{1}{2}\big(y_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(y_2-\mu_2)\big)' (\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})^{-1} \big(y_1 - \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}(y_2-\mu_2)\big)\Big] \tag{5} \]

and

\[ f(y_2) = (2\pi)^{-N_2/2}\,|\Sigma_{22}|^{-1/2} \exp\!\left[-\tfrac{1}{2}(y_2-\mu_2)'\Sigma_{22}^{-1}(y_2-\mu_2)\right] \tag{6} \]

Equation (5) is the formula for a MVN with mean μ1 + Σ12Σ22-1(y2 – μ2) and variance Σ11 – Σ12Σ22-1Σ21. Equation (6) is the formula for a MVN with mean of μ2 and variance of Σ22.
We defined Ω = Σ-1, which in the SEM is equal to (I – λW)′V-1(I – λW). Using (4), we can write a number of equalities defining the submatrices of interest. They include

\[ \Omega_{11}^{-1} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \tag{7a} \]

\[ \Sigma_{12}\Sigma_{22}^{-1} = -\Omega_{11}^{-1}\Omega_{12} \tag{7b} \]

Using the equations of (7), we see that we can express (5) as a MVN with mean μ1 – Ω11-1Ω12(y2 – μ2) and variance Ω11-1. Note that when the first block contains a single observation, Ω11-1 is a scalar, the reciprocal of Ω11, and the MVN is a simple single-variable normal distribution. It is very important to note that Ω is much easier to work with and compute than Σ, since V-1 is the inverse of a diagonal matrix, and to compute the inverse, we simply invert each element on the diagonal of V. Multiplication by A and B is easy, since both matrices are sparse. On the other hand, Σ is difficult to compute since we have to take the inverses of A and B, which are difficult to compute and not necessarily sparse. Working with the inverse variance directly turns an otherwise intractable problem for large matrices into an easy one. After computing the mean and variance of the conditional distribution, we take a draw from the left- or right-truncated normal distribution, depending on whether we observe a 0 or 1 for the observation. Note that the simulated latent variables use simulated values for the latent dependent variable from the previous iteration of the Gibbs Sampler.
Let us focus on Ω12(y2 – μ2). If Ω11 is 1 x 1, then Ω12 is 1 x (n–1) and y2 – μ2 is (n–1) x 1, the product of which is a 1 x 1 matrix. In programming, we like to use vectors and matrices instead of loops, especially when the number of loops is potentially large. So we ask ourselves, "How does the ith value in the vector Ω(y – μ) compare to Ω12(y2 – μ2)?"

In order to do this, we need to switch from block matrix notation to row-column notation. This is fairly simple. When referring to the ith row (and ith column), the reader should note this is block 1, and all other rows except i (same for the columns) will be block 2. The ith value in Ω(y – μ), which is a column vector, is simply Ωi*(y – μ), where the * means all columns of Ω (so Ωi* is the ith row of Ω). We see that Ω12(y2 – μ2) = Ωi*(y – μ) – Ωii(yi – μi), where Ω12 is simply the ith row of Ω with the ith column omitted. That means that the means in (5) can be equivalently (and more usefully) written as μ – [Ωii-1]Ω(y – μ) + (y – μ) = μ + (I – [Ωii-1]Ω)(y – μ), where [Ωii-1] is an n x n diagonal matrix, with diagonal elements consisting of the reciprocal of the corresponding diagonal element in Ω, and all zeroes off-diagonal. The difference between this notation and (5) is that (5) represents one conditional, while this equation represents
all n conditionals stacked. The diagonalized variance matrix for all n conditionals is [Ωii-1].
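The stacked formulas can be checked numerically against the textbook partitioned-normal result for a single observation. The example below uses a small dense Ω = A′B′V-1BA with B = V = I for simplicity:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W = W / W.sum(axis=1, keepdims=True)   # row-standardized chain weights
A = np.eye(n) - 0.4 * W                # SAR part; B = I, V = I here
Omega = A.T @ A                        # inverse variance (precision)
Sigma = np.linalg.inv(Omega)           # variance
mu = rng.normal(size=n)
y = rng.normal(size=n)

# all n conditional means/variances at once, via the precision matrix
d = 1.0 / np.diag(Omega)               # the [Omega_ii^{-1}] diagonal
cond_mean = mu + (np.eye(n) - d[:, None] * Omega) @ (y - mu)
cond_var = d

# direct check for i = 2 using Sigma partitioned the usual way
i = 2
idx = [j for j in range(n) if j != i]
S12 = Sigma[i, idx]
S22inv = np.linalg.inv(Sigma[np.ix_(idx, idx)])
m_direct = mu[i] + S12 @ S22inv @ (y[idx] - mu[idx])
v_direct = Sigma[i, i] - S12 @ S22inv @ S12
```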
Now that we have solved the problem of taking a draw from a MVN, the most time consuming operation is computing A-1Xβ for the SAL and SAC. It is not necessary to compute A-1 to solve this problem, but only A-1Xβ, which can be calculated by Gaussian reduction, much faster than computing the inverse and multiplying by Xβ. If A is not too large (say, less than 1,000 by 1,000), then we could use Gaussian reduction to draw the uncorrelated residual and multiply by A-1 and B-1, as necessary, rather than using the method developed in the preceding pages.
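In scipy, the Gaussian reduction step is a sparse solve. The sketch below computes A-1Xβ without ever forming A-1 (toy matrices, with an illustrative ρ = 0.5):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(6)
n = 200
W = sparse.diags([np.ones(n - 1), np.ones(n - 1)], [-1, 1], format="csr")
W = sparse.csr_matrix(W.multiply(1.0 / W.sum(axis=1)))    # row-standardize
A = sparse.identity(n, format="csr") - 0.5 * W            # I - rho*W
Xb = rng.normal(size=n)                                   # stand-in for X*beta

t = spsolve(A, Xb)   # solves A t = X beta, so t = A^{-1} X beta
```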
The last parameters that we need to find the conditionals for are ρ and λ. Because ρ is distributed as

\[ L(\rho) \propto (2\pi\sigma^2)^{-N/2}\,|V|^{-1/2}\,|I-\rho W_1|\,|B|\, \exp\!\left[-\frac{1}{2\sigma^2}\big((I-\rho W_1)y - X\beta\big)'B'V^{-1}B\big((I-\rho W_1)y - X\beta\big)\right] \]

and λ is distributed as

\[ L(\lambda) \propto (2\pi\sigma^2)^{-N/2}\,|V|^{-1/2}\,|A|\,|I-\lambda W_2|\, \exp\!\left[-\frac{1}{2\sigma^2}(Ay - X\beta)'(I-\lambda W_2)'V^{-1}(I-\lambda W_2)(Ay - X\beta)\right] \]

we cannot generate a draw from a standard distribution. Instead, we must use the Metropolis-Hastings algorithm. We will not go into details here, but it is a standard accept-reject algorithm
used in simulation. See LeSage (1999a) for more information.
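For readers who want the flavor of the accept-reject step, here is a minimal random-walk Metropolis-Hastings sketch for a single bounded parameter; the target is a toy log-density, not the spatial likelihood itself:

```python
import numpy as np

def log_target(r):
    # toy unnormalized log-density supported on (-1, 1)
    if not -1.0 < r < 1.0:
        return -np.inf
    return np.log(1.0 - r**2) - 5.0 * (r - 0.3) ** 2

rng = np.random.default_rng(9)
r, draws = 0.0, []
for _ in range(5000):
    prop = r + rng.normal(scale=0.2)      # random-walk proposal
    # accept with probability min(1, target(prop)/target(r))
    if np.log(rng.uniform()) < log_target(prop) - log_target(r):
        r = prop
    draws.append(r)                       # on rejection, repeat current r
```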
Marginal effects
Beron and Vijverberg (2004) show how marginals are computed, but for very large matrices, their method may tax the computational capacity of many computers. Here, I parallel their presentation, but use notation which allows computation even with large datasets. I also discuss some alternative values which can be said to be the marginal effects in the probit.
Let's begin by taking a closer look at the details of the spatial probit. First, we assume that there is a latent or unobserved dependent variable, y, such that y = A-1Xβ + A-1B-1u, where, as before, A = I in the SEM, and (I – ρW1) in the SAR and SAC; B = I in the SAR, and (I – λW2) in the SEM and SAC. We also note that by assumption, u is iid, with mean 0 and variance σ²V, where V is a diagonal matrix, with elements taken from a t-distribution. Let there be a matrix P such that P′P = V-1. Instead of y, we observe a variable z equal to 1 if y > 0, and equal to 0 if y ≤ 0. Let us define the n by n matrix S = A-1B-1P-1 and the n by 1 vector T = A-1Xβ. Define v = Pu, where v is iid, with mean 0 and variance σ². For the ith observation (the ith row of y and T) we can now write

\[ y_i = T_i + \sum_{j=1}^{n} S_{ij} v_j \]

Because they are independently distributed by assumption, the covariance of vj and vk is 0 if j does not equal k. If each vj is from a normal distribution, the residual term, which for simplicity we will call εi, is normally distributed with mean 0 and variance

\[ \sigma_i^2 = \sigma^2 \sum_{j=1}^{n} S_{ij}^2 \]

In general, Σj Sij² will differ from Σj Skj² when i and k are different, and therefore
there will be a degree of heteroskedasticity in the covariance matrix. While this appears similar to the development we did earlier for simulating the latent variable y, it differs in that for the earlier analysis, we focused on the conditional distribution of each yi given all other yj, j not equal to i. Here, we look at the joint distribution of all y.
Since zi = 1 implies that yi > 0, we know that Ti + εi > 0, which in turn implies that εi > –Ti. This tells us that when zi = 1, εi is drawn from a truncated normal distribution, and that P(zi = 1) = Φ((0 – (–Ti)) / σi) = Φ(Ti / σi). Similarly, we find that P(zi = 0) = 1 – Φ(Ti / σi). We also note that E(zi) = 1 · P(zi = 1) + 0 · P(zi = 0), which means that E(zi) = Φ(Ti / σi). For completeness, it seems appropriate to mention here that E(yi | zi = 1) = Ti + σiφ(Ti / σi) / Φ(Ti / σi), and that E(yi | zi = 0) = Ti – σiφ(Ti / σi) / [1 – Φ(Ti / σi)].
We are often interested in knowing the effect of a marginal change in one of the explanatory variables on the outcome. In the case of the probit, the outcome we are typically interested in is E(z). In the non-spatial probit, ∂E(zi)/∂xj = β̂j φ(Xiβ̂), where the "^" means the parameter estimate for the true parameter, xj is the jth variable in the explanatory variable matrix X, and Xi is the ith row of X. In practice, the marginal effect of a change in xj on z is given by ∂E(z)/∂xj = β̂j φ(X̄β̂), where X̄ denotes the row vector consisting of the mean of each explanatory variable. While relatively easy to calculate, this is not the only possible estimate of the marginal effect of xj on z, nor is it the best. The normal pdf of a mean is not equal to the mean of the pdf, in general. It would seem more appropriate to compute the marginal effect of a change in xj on z for each observation, and then average the values, to give

\[ \frac{\hat\beta_j}{n} \sum_{i=1}^{n} \phi(X_i\hat\beta) \]
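The difference between the "marginal effect at the mean" and the "mean marginal effect" is easy to see numerically (illustrative coefficients and simulated X):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_hat = np.array([0.2, 1.5])        # illustrative estimates
j = 1

# "marginal effect at the mean": pdf evaluated at the mean of X
me_at_mean = beta_hat[j] * norm.pdf(X.mean(axis=0) @ beta_hat)
# "mean marginal effect": pdf evaluated at each observation, then averaged
mean_me = beta_hat[j] * norm.pdf(X @ beta_hat).mean()
```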
This is particularly important once we consider the heteroskedasticity found in the spatial probit. We find that in the case of the SEM probit, ∂E(zi)/∂xij = (β̂j/σ̂i) φ(Xiβ̂/σ̂i). Using the same idea that we used above when computing the "mean marginal effect" for the non-spatial probit, we would compute the mean marginal effect for the spatial probit with autocorrelated error as

\[ \frac{\hat\beta_j}{n} \sum_{i=1}^{n} \frac{1}{\hat\sigma_i}\,\phi\!\left(\frac{X_i\hat\beta}{\hat\sigma_i}\right) \]

This method can only be applied if we can compute each σi, which at first pass appears to require us to compute S = A-1B-1P-1.
The marginal is much more difficult to calculate in the SAR and SAC. In the SEM, E(z_i) = Φ(X_i β / σ_i), but in the SAR and SAC, E(z_i) = Φ(T_i / σ_i) = Φ((A⁻¹Xβ)_i / σ_i) = Φ(((I – ρW₁)⁻¹Xβ)_i / σ_i). The (I – ρW₁)⁻¹ ensures that x_ij will impact not only z_i but also several other – and perhaps all other – z. Define R = A⁻¹, and define Q = B⁻¹. Then y_i = R_i Xβ + η_i, where η_i = Σ_{j=1}^n R_ij u_j in the SAR model, and η_i = Σ_{j=1}^n (Σ_{k=1}^n R_ik Q_kj) u_j in the SAC model. This means that η_i is distributed normally with mean 0 and variance σ_i² equal to σ² Σ_{j=1}^n R_ij² for the SAR and σ² Σ_{j=1}^n (Σ_{k=1}^n R_ik Q_kj)² for the SAC.
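These variance formulas can be checked directly against the full covariance matrices on a small example (a Python sketch with an invented weights matrix; for simplicity the same W serves both the lag and error processes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho, lam, sigma = 8, 0.4, 0.3, 1.5   # illustrative values

# a small row-standardized weights matrix, invented for the check
W = rng.random((n, n))
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)

R = np.linalg.inv(np.eye(n) - rho * W)  # R = A^{-1}
Q = np.linalg.inv(np.eye(n) - lam * W)  # Q = B^{-1}

# SAR: eta = R u, so sigma_i^2 = sigma^2 * sum_j R_ij^2
sar_var = sigma**2 * (R**2).sum(axis=1)
# SAC: eta = (R Q) u, so sigma_i^2 = sigma^2 * sum_j (sum_k R_ik Q_kj)^2
sac_var = sigma**2 * ((R @ Q)**2).sum(axis=1)

# compare against the diagonals of the full covariance matrices
print(np.allclose(sar_var, np.diag(sigma**2 * R @ R.T)))              # True
print(np.allclose(sac_var, np.diag(sigma**2 * (R @ Q) @ (R @ Q).T)))  # True
```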
We can now write

    E(z_i) = Φ(R_i Xβ / σ_i) = Φ((Σ_{k=1}^K Σ_{m=1}^n R_im X_mk β_k) / σ_i).

From this we derive the marginal effect of a change in the lth value of variable x_j on E(z_i) as

    ∂E(z_i)/∂x_lj = R_il (β̂_j / σ̂_i) φ(R_i Xβ̂ / σ̂_i),

where l ranges from 1 to n, and j ranges from 1 to K. In the SEM, this derivative
was equal to 0 if i did not equal l. Here it is not necessarily equal to 0. However, what this
means is that we would need to report nK marginals for each observation i, a total of n²K to compute and report. Even if we averaged across the n observations, we would still have nK marginals.
A more sensible alternative definition of "the marginal" might be the change in E(z_i) resulting from a marginal change to all observations of x_j. This would be written as

    ∂E(z_i)/∂x_j = (β̂_j / σ̂_i) φ(R_i Xβ̂ / σ̂_i) Σ_{m=1}^n R_im.

In order for us to have the marginal for E(z) instead of E(z_i), we have to average across all observations to give

    ∂E(z)/∂x_j = β̂_j n⁻¹ Σ_{i=1}^n [(1/σ̂_i) φ(R_i Xβ̂ / σ̂_i) Σ_{m=1}^n R_im].
Note that our ability to compute the marginal at first seemed tied to our ability to compute the inverses of two potentially large matrices, A and B. However, using Gaussian reduction and working with submatrices, it is possible to compute σ_i and the summation of the R_im without having to invert (and store) the entire matrix. The formula above is a generalized formula that includes the spatial error model, once we recognize that in the SEM, R = I, so that R_i Xβ = X_i β, and the sum of all elements of row i in R, when R = I, is simply 1.
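A sketch of this computational shortcut (Python/SciPy here; the author's actual implementation is in MatLab, and the sparse weights matrix below is randomly generated purely for illustration): both the row sums of R and any single σ_i come from sparse linear solves against A, without ever forming A⁻¹.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(3)
n, rho, sigma = 300, 0.5, 1.0

# a sparse, row-standardized weights matrix, invented for illustration
W = rng.random((n, n)) * (rng.random((n, n)) < 0.02)
np.fill_diagonal(W, 0.0)
rowsum = W.sum(axis=1)
nz = rowsum > 0
W[nz] = W[nz] / rowsum[nz, None]
A = csc_matrix(np.eye(n) - rho * W)

# Row sums of R = A^{-1}: since R @ 1 solves A s = 1,
# a single sparse solve replaces the full inverse.
row_sums_R = spsolve(A, np.ones(n))

# sigma_i^2 = sigma^2 ||e_i' A^{-1}||^2 = sigma^2 ||(A')^{-1} e_i||^2:
# one sparse solve per observation i that is actually needed.
i = 7
v = spsolve(A.T.tocsc(), np.eye(n)[i])
sigma_i = sigma * np.linalg.norm(v)
```

Each quantity costs one sparse solve, so memory stays proportional to the number of nonzeros in A rather than to n².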
An Application to Deforestation in the Spiny Forests of Madagascar
Thomas (2007) did a detailed study of deforestation in Madagascar. We will consider an excerpt from that study, so that we can demonstrate some of the techniques developed in this paper. We will narrow our study by considering only the spiny forests of Madagascar, intentionally excluding the geographically larger moist forests also included in that study (see Figure 2). Furthermore, while the analysis used here includes the same large number of explanatory variables as the full report, we will not publish all of the variables, only a few that we can focus on. Readers interested in the full results should consult Thomas (2007). The abbreviated results are displayed in Table 1.
One of the main objectives of the study was to see what effect poverty had on deforestation. The study used mean expenditure per capita at the firaisana (county) level as an indicator of
poverty in a particular location. Table 1 shows that in the model with no spatial effects, areas
having higher levels of per capita expenditure (i.e., lower levels of poverty) have lower probability of deforestation in the spiny forests of Madagascar between 1990 and 2000. This result is
statistically significant at the 10 percent level. The result supports the general idea that poverty
increases deforestation pressures. The magnitude of this effect, however, is small, as revealed by
the marginal effects and from the difference in probability computed at the 5th percentile and
95th percentile of expenditure per capita shown in the data. We see that when we include spatial
effects -- whether spatial error or spatial lag -- then the impact of poverty on deforestation is no
longer statistically significant, though the quantitative effect is the same or larger (but still
small). The direction of the effect remains the same in all models.
I tried to compute the full spatial model, but the SAC was very sensitive to starting values and converged to very different parameters depending upon the initial assumptions. This is similar to what sometimes happens when maximizing a likelihood function, where the values converge to local maxima rather than the global maximum. This is a warning to be careful to try different starting points, and not to rely on the SAC to differentiate consistently between the spatial error and the spatial lag models.
The author of the original study posits that nearness to roads ought to lead to higher probabilities of deforestation, and that this effect ought to be strong. However, the analysis shows mixed
results in support of this hypothesis. In the model with no spatial effects, increasing distance
from the road increases the probability of deforestation, contrary to the hypothesis. This result is
statistically significant at the 10 percent level, but the quantitative effect is very small for both
the marginal effect and the change in probability computed by going from the 5th percentile of
road distance to the 95th percentile. The SAR model supports this counter-intuitive result, increasing the statistical significance on the parameter estimate to the 1 percent level. The quantitative effects are slightly larger, but still quite small. The SEM model, however, contradicts this
result. It shows that the parameter is negative, as anticipated. It is not statistically significant,
however, nor is it quantitatively large.
With contradictory results such as we have for the influence of this particular variable, it would be very helpful to know which model is the "true" model. In traditional frequentist econometrics, we often use the likelihood function to help us decide between models, especially when the models are nested. Because the parameter estimates are stochastic in the Bayesian approach, so is the value of the log likelihood, which therefore cannot be used to test the statistical significance of nested models the way the likelihood ratio test is used in frequentist econometrics.
Should we be concerned about the spatial error parameter being so close to 1 in our estimates of the SEM model? If we used a row-standardized weights matrix, where each row in the weights matrix sums to 1, then we should be very concerned. If the spatial parameter is 1 in such a model, then the process is not stationary, and the inverse of I – W -- equal to I + W + W² + W³ + ... -- is undefined: I – W is singular. In this model, using the spatial error parameter estimated in the SEM model, we find that the diagonal elements of the inverse of I – λ̂_err W range from 1 to 1.144, while the off-diagonal elements never get larger than 0.2. That the variance matrix does not get very large, as it would if we had a row-standardized matrix with a spatial parameter near 1, is in part due to the average row sum in the matrix-standardized weights matrix being only 0.642 instead of 1, and the largest individual weight in the matrix being 0.0782.
While these contradictory results for the sign on the distance to roads parameters are disappointing because they leave us wondering which is the true story about roads, it is important to
keep in mind that the quantitative impact of roads in any of these models was quite small, indicating that roads in themselves were not an important part of this story. But does it make sense
that deforestation would NOT be influenced by roads, even though our previously stated belief
was that it should? To be most accurate, we believe that deforestation ought to take place near
roads before it takes place far from roads, all other things being equal. Since deforestation is a
dynamic process, if the roads have not changed significantly over a long period of time, most of
the valuable land near roads would already be deforested before our study's starting period, and therefore what remains may still be there due to circumstances we did not measure in our dataset, such as landowners' preferences to keep some trees, local parks, steep slopes, or poor soils. While we controlled for many of these effects, our data was coarse, and there could have
been, for example, steep but highly localized slopes that did not show up in our data.
The uncertainty in choosing which model is correct points to the need to develop the area of
Bayesian model comparison in the spatial probit. This entails computing the marginal likelihood, which Hepple (1995a, 1995b, 2003) has written about extensively for the linear spatial
case. Koop (2003, pp 157-161) outlines how to compute the marginal likelihood for cases with
simulated latent variables, based on Chib's (1995) pioneering paper. To run such a program is
time consuming, and it is not entirely straightforward to do. As such, it will be the subject of future work.
Nearness to non-forested areas is an important factor in deforestation according to all regressions in this study. It is a strongly statistically significant effect in all models, and in the models with spatial effects, it is quantitatively large in both the marginal effect and the effect found from comparing the response when far from non-forested areas to the response when near them. This seems to point to the importance of controlling for spatial effects, because without doing so, the influence of all of the explanatory variables is quantitatively small.
One of our concerns with spatial effects in the probit model is that even if we only have spatial error (instead of spatial lag), the parameter estimates could become biased, unlike the effect
in the linear model. This is because of the potential heteroskedasticity introduced by the weights
matrix.5 In this particular study, we might have expected very little heteroskedasticity associated
with the matrix-standardized inverse squared weights matrix (with a 10 kilometer cutoff). For
example, with a spatial parameter of λ = 0.7, the diagonal elements of (I – λW)⁻² only ranged from 1 to 1.123, and the largest off-diagonal element was 0.211. With little heteroskedasticity, we
would expect little parameter bias in a probit model. Large analyses, in general, will have very
little heteroskedasticity induced by the weights matrix, especially if the spatial parameter is not
too large. An exception is the case of isolated clumps, where a small group has no relationship
linking it to the wider group. Since inverses of block diagonal matrices are equal to the inverse
of each block, the heteroskedasticity can be substantial in such cases. For example, if you were
to add a small clump of 2 observations to our current example in the spiny forests of Madagascar
(where each observation in the clump has only the other observation in the clump as a neighbor),
we would find that the diagonal elements in the small clump portion of the (I – λW)⁻² matrix have elements of 5.729, which is a much different variance from the large contiguous block, where all diagonal elements are very close to 1.

5 Note that the off-diagonal elements do not lead to inconsistency in the parameter estimates; only the diagonal elements do. See Fleming, pp. 1, 7, 29, and, to a lesser extent, Anselin (1992), p. 252.
Nevertheless, since our spatial parameter is relatively large in the SEM model, we find that the diagonal elements range from 1 to 2.339, which means that there is a reasonable degree of heteroskedasticity from the weights matrix. Because we used Geweke's method of modeling heteroskedasticity, setting the degrees of freedom equal to 4 to simulate a t-distribution with very fat tails, the diagonal elements of the covariance matrix ranged from 1.7 to 136.8. This high degree of heteroskedasticity explains how the SEM parameters can differ so significantly from the parameters in the model with no spatial effects.
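Geweke's (1993) device models each error as normal with its own variance scale v_i, where r/v_i is chi-square with r degrees of freedom; marginally the error is then Student-t with r degrees of freedom. A sketch (Python, with r = 4 as in the text) showing the fat tails and the wide spread of effective variances:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
r, n = 4, 200_000
v = r / rng.chisquare(r, size=n)          # scale mixture: r / v_i ~ chi2(r)
e = rng.standard_normal(n) * np.sqrt(v)   # marginally e_i ~ t with r df
print(np.quantile(e, 0.99), stats.t.ppf(0.99, df=r))  # tails agree
print(v.min(), v.max())                   # the v_i spread over a wide range
```

Small r produces occasional very large v_i, which is what allows the effective variances in the application to range as widely as 1.7 to 136.8.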
Conclusions
The main purpose of this report was to lay a mathematical foundation for using Bayesian
methods to solve probit models with spatial effects. We included an application of this method
abstracted and condensed from a paper by Thomas (2007) dealing with deforestation in Madagascar.
In the course of this paper, we derived some important results used in Bayesian methodology -- results that perhaps are not published elsewhere, especially the ones dealing with simulating the latent variable. While affirming the derivation of the marginal presented by Beron and Vijverberg (2004), the formulas computed here enable us to handle larger matrices than the general formula presented by those authors.
In our application, we compared probit models with alternative spatial structures. We used
“matrix-standardized” weights matrices, a compromise suggested here to deal with the Bell and
Bockstael critique of row-standardization when distance ought to matter. The models compared
in our example sometimes gave contradictory results, and we explained how log likelihood values were no help in resolving the contradiction. We also explained how we tried to estimate a
SAC model, but discovered that it was highly sensitive to starting values, and therefore was not
helpful in deciding between the spatial lag and the spatial error models. As we concluded, we
pointed to the workable solution (a matter for future research) of using marginal likelihoods to
determine which model fits the data the best.
We saw in the spiny forests of Madagascar how poverty had a very small positive effect on deforestation probability; how roads had very small but mixed effects on deforestation probability (depending on the model); and how nearness to non-forest was a strongly positive determinant of deforestation probability.
Acknowledgements
I am extremely indebted to Jim LeSage, not only for writing an excellent text that is available on the web, but also for programming almost everything he teaches in his book and making those MatLab m-files available for download. Thanks to Ken Chomitz for very helpful feedback on an earlier draft.
References
Albert, James H. and Siddhartha Chib. 1993. “Bayesian Analysis of Binary and Polychotomous
Response Data”, Journal of the American Statistical Association 88:669-679.
Anselin, Luc. 1980. Estimation Methods for Spatial Autoregressive Structures. Regional Science Dissertation and Monograph Series 8. Field of Regional Science, Cornell University,
Ithaca, N.Y.
Anselin, Luc. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer Academic.
Anselin, Luc. 1999. “Spatial Econometrics,” working paper, Bruton Center, University of Texas at Dallas.
Anselin, Luc. 2002. "Under the Hood: Issues in the Specification and Interpretation of Spatial
Regression Models", Agricultural Economics 27:247-267.
Barry, Ronald and Pace, R. Kelly. 1999. “Monte Carlo Estimates of the Log Determinant of
Large Sparse Matrices”, Linear Algebra and Its Applications 289:41-54.
Bell, Kathleen P. and Nancy E. Bockstael. 2000. “Applying the Generalized-Moments Estimation Approach to Spatial Problems Involving Microlevel Data,” The Review of Economics
and Statistics 82(1):72-82.
Beron, Kurt J. and Wim P. M. Vijverberg. 2000. “Probit in a Spatial Context: A Monte Carlo
Approach,” in Advances in Spatial Econometrics, ed. by Luc Anselin and Raymond J.G.M.
Florax. New York: Springer-Verlag.
Beron, Kurt J. and Wim P. M. Vijverberg. 2004. “Probit in a Spatial Context: A Monte Carlo
Analysis”, in Advances in Spatial Econometrics: Methodology, Tools,and Applications, ed.
by Luc Anselin, Raymond J.G.M. Florax, and Sergio J. Rey. New York: Springer-Verlag,
169-195.
Carlin, Bradley P. and Thomas A. Louis. 2000. Bayes and Empirical Bayes Methods for Data
Analysis 2nd ed. New York: Chapman and Hall.
Chib, Siddhartha. 1995. “Marginal Likelihood from the Gibbs Output”, Journal of the American
Statistical Association 90(432, December):1313-1321.
Chomitz, Kenneth M., with Piet Buys, Giacomo De Luca, Timothy S. Thomas, and Sheila
Wertz-Kanounnikoff. 2007. At Loggerheads? Agricultural Expansion, Poverty Reduction,
and Environment in the Tropical Forests. Washington: The World Bank.
Cliff, A. and J. K. Ord. 1972. "Testing for spatial autocorrelation among regression residuals",
Geographical Analysis 4:267–284.
Cliff, A. and J. K. Ord. 1973. Spatial Autocorrelation. London: Pion.
Conley, T.G. 1996. "Econometric Modelling of Cross-sectional Dependence". Ph.D. Dissertation. Department of Economics, University of Chicago, Chicago, IL.
Conley, T.G. 1999. "GMM Estimation with Cross Sectional Dependence", Journal of Econometrics 92:1-45.
Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete
Data Via the EM Algorithm”, Journal of the Royal Statistical Society B 39:1-38.
Fleming, Mark. 2004. “Techniques for Estimating Spatially Dependent Discrete Choice Models”, in Advances in Spatial Econometrics: Methodology, Tools,and Applications, ed. by Luc
Anselin, Raymond J.G.M. Florax, and Sergio J. Rey. New York: Springer-Verlag, 145-168.
Geary, R. C. 1954. "The Contiguity Ratio and Statistical Mapping", The Incorporated Statistician 5:115-145.
Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 1995. Bayesian Data
Analysis. London: Chapman and Hall.
Geman, S., and D. Geman. 1984. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian
Restoration of Images”, IEEE Transactions on Pattern Analysis and Machine Intelligence
6:721-741.
Geweke, John. 1993. “Bayesian Treatment of the Independent Student-t Linear Model”, Journal
of Applied Econometrics 8:19-40.
Gilks, Walter R., Sylvia Richardson, and David J. Spiegelhalter, eds. 1996. Markov Chain
Monte Carlo in Practice. London: Chapman and Hall.
Gill, Jeff. 2002. Bayesian Methods: A Social and Behavioral Sciences Approach. Boca Raton,
FL: Chapman & Hall/CRC.
Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice
Hall.
Hastings, W. K. 1970. “Monte Carlo Sampling Methods Using Markov Chains and Their Applications,” Biometrika 57:97-109.
Hepple, Leslie W. 1979. "Bayesian Analysis of the Linear Model with Spatial Dependence", in
Exploratory and Explanatory Statistical Analysis of Spatial Data, ed. By C. P. A. Bartels and
R. H. Ketellapper. The Hague: Martinus Nijhoff, pp 179-199.
Hepple, Leslie W. 1995a. "Bayesian Techniques in Spatial and Network Econometrics: 1. Model Comparison and Posterior Odds", Environment and Planning A 27:447-469.
Hepple, Leslie W. 1995b. "Bayesian Techniques in Spatial and Network Econometrics: 2. Computational Methods and Algorithms", Environment and Planning A 27:615-644.
Hepple, Leslie W. 2003. "Bayesian Model Choice in Spatial Econometrics", paper for LSU
Spatial Econometrics Conference, Baton Rouge, 7-11 November 2003.
Hordijk, L. and J. Paelinck. 1976. "Some Principles and Results in Spatial Econometrics", Recherches Economiques de Louvain 42:175–97.
Kelejian, Harry H. and Ingmar R. Prucha. 1999. "A Generalized Moments Estimator for the Autoregressive Parameter in a Spatial Model", International Economic Review 40(2):509-533.
Koop, Gary. 2003. Bayesian Econometrics. Chichester: Wiley.
Kreyszig, Erwin. 1993. Advanced Engineering Mathematics, 7th ed. New York: Wiley.
Lancaster, Tony. 2004. An Introduction to Modern Bayesian Econometrics. Malden, MA:
Blackwell.
LeSage, James P. 1997. “Bayesian Estimation of Spatial Autoregressive Models”, International
Regional Science Review 20(1&2):113-129.
LeSage, James P. 1998. “Spatial Econometrics”, mimeo. Downloaded July 2003 from
http://www.spatial-econometrics.com/html/wbook.pdf.
LeSage, James P. 1999a. “The Theory and Practice of Spatial Econometrics”, mimeo. Downloaded July 2003 from http://www.spatial-econometrics.com/html/sbook.pdf.
LeSage, James P. 1999b. “Applied Econometrics Using MATLAB”, mimeo. Downloaded July
2003 from http://www.spatial-econometrics.com/html/mbook.pdf.
LeSage, James P. 2000. “Bayesian Estimation of Limited Dependent Variable Spatial Autoregressive Models”, Geographical Analysis 32(1):19-35.
McMillen, Daniel P. 1992. “Probit with Spatial Autocorrelation,” Journal of Regional Science
35(3):417-436.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. 1953. "Equation of State Calculations by Fast Computing Machines", Journal of Chemical Physics 21:1087-1092.
Moran, P. A. P. 1948. "The interpretation of statistical maps", Biometrika 35:255–260.
Moran, P. A. P. 1950a. "Notes on continuous stochastic phenomena", Biometrika 37:17–23.
Moran, P. A. P. 1950b. "A test for the serial dependence of residuals", Biometrika 37:178–181.
Pace, R. Kelly and Ronald Barry. 1998. “Quick Computations of Spatially Autoregressive Estimators”, Geographical Analysis 29(3):232-247.
Robert, Christian P. and George Casella. 1999. Monte Carlo Statistical Methods. New York:
Springer.
Samuelson, Paul. 1983. "Thünen at Two Hundred", Journal of Economic Literature 21(4, December):1468-1488.
Thomas, Timothy S. 2007. "Impact of Economic Policy Options on Deforestation in Madagascar", a Companion Paper for the Policy Research Report on Forests, Environment, and Livelihoods, The World Bank, Washington, D.C.
Vijverberg, Wim P. M. 1997. “Monte Carlo Evaluation of Multivariate Normal Probabilities,”
Journal of Econometrics 76:281-307.
von Thünen, Johann Heinrich. 1966. Isolated State (an English edition of Der Isolierte Staat
published in 1826). Translated by Carla M. Wartenberg. Edited with an introduction by Peter
Hall. Oxford and New York: Pergamon Press.
Whittle, P. 1954. "On Stationary Processes in the Plane", Biometrika 41:434–449.
World Wildlife Fund (WWF). 2001. "Terrestrial Ecoregions GIS Database". Available at http://www.worldwildlife.org/science/data/terreco.cfm. Downloaded May 25, 2005.
Figure 1. Possible ways for geographic data to be used in a statistical
analysis: administrative units and a regular grid of points
Figure 2. Biomes of Madagascar (WWF)
Table 1 Probit results

No spatial
Variable                             Param        Std Dev    Prob from   Marginal   Effect of change from
                                                             simul       effect     5th to 95th percentile
Constant                             -4.2451      0.3631     0.000       -0.5624
Expenditure per capita               -1.06E-03    6.57E-04   0.055       -0.0001    0.0026
Population density                   -1.59E-02    5.39E-03   0.000       -0.0021    0.0059
Road, distance to                     7.57E-06    4.51E-06   0.051        0.0000    0.0031
Distance to non-forest: 1km-2km       0.5508      0.2105     0.005        0.0730    0.0136
Distance to non-forest: 500m-1km      0.6688      0.1965     0.000        0.0886    0.0184
Distance to non-forest: <500m         0.9771      0.1927     0.000        0.1295    0.0355
ρ_lag (spatial parameter)             None
λ_err (spatial parameter)             None
log likelihood (using latent var.)   -7,631.763
log likelihood (probit form)         -1,482.173

SAR
Constant                             -2.6607      0.3331     0.000       -0.6508
Expenditure per capita               -5.67E-04    5.22E-04   0.139       -0.0001    0.0090
Population density                   -9.69E-03    2.48E-03   0.000       -0.0024    0.0310
Road, distance to                     8.95E-06    3.39E-06   0.006        0.0000    0.0234
Distance to non-forest: 1km-2km       0.7039      0.2676     0.000        0.1722    0.0959
Distance to non-forest: 500m-1km      0.6660      0.2604     0.001        0.1629    0.0893
Distance to non-forest: <500m         0.8573      0.2380     0.000        0.2097    0.1244
ρ_lag (spatial parameter)             0.8010      0.0360     0.000
λ_err (spatial parameter)             None
log likelihood (using latent var.)   -7,674.402
log likelihood (probit form)         -1,427.657

SEM
Constant                             -3.5800      0.4599     0.000       -0.4553
Expenditure per capita               -4.92E-04    9.48E-04   0.311       -0.0001    0.0040
Population density                   -2.58E-03    2.95E-03   0.217       -0.0003    0.0047
Road, distance to                    -1.05E-05    9.00E-06   0.120        0.0000    0.0125
Distance to non-forest: 1km-2km       0.7046      0.2663     0.002        0.0896    0.0530
Distance to non-forest: 500m-1km      0.7325      0.2668     0.002        0.0932    0.0560
Distance to non-forest: <500m         1.0751      0.2661     0.000        0.1367    0.0981
ρ_lag (spatial parameter)             None
λ_err (spatial parameter)             1.0023      0.0270     0.000
log likelihood (using latent var.)   -7,707.396
log likelihood (probit form)         -1,608.253
Table 2 Summary statistics for data

Variable                            Binary     Mean     Std dev   Min      5th         Median   95th         Max
                                    variable                               percentile           percentile
Expenditure per capita              0            236      56      139.2     153.6       232.3     314.9        611
Population density                  0           15.2      21.3      1.324     5.684       9.097    43.682      628.4
Road, distance to                   0          9,477    7,998      0        707         7,403    25,943      44,090
Distance to non-forest: 1km-2km     1          0.113     0.316     0          0           0         1            1
Distance to non-forest: 500m-1km    1          0.180     0.384     0          0           0         1            1
Distance to non-forest: <500m       1          0.659     0.474     0          0           0         1            1