A Primer for Bayesian Spatial Probits With an Application to Deforestation in Madagascar Companion Paper for the Policy Research Report on Forests, Environment, and Livelihoods Timothy S. Thomas The World Bank Development Economics Research Group The findings, interpretations, and conclusions expressed in this paper do not necessarily reflect the views of the Executive Directors of The World Bank or the governments they represent. The World Bank does not guarantee the accuracy of the data included in this work. The boundaries, colors, denominations, and other information shown on any map in this work do not imply any judgment on the part of The World Bank concerning the legal status of any territory or the endorsement or acceptance of such boundaries. 19 June 2007, Washington, DC Version 3.1 1 Introduction This paper is a companion paper for the World Bank’s Policy Research Report (PRR) on tropical deforestation, At Loggerheads? Agricultural Expansion, Poverty Reduction, and Environment in the Tropical Forests (Chomitz, 2007). This paper describes how to use simulation to solve regressions on dichotomous variables that may be impacted by spatial lags or spatially autocorrelated errors. Most of what is written can be found elsewhere, but this consolidates the details in one place, provides a convenient background to the freely available MatLab programs that the author has written for Bayesian spatial analysis1, and provides a sort of technical appendix to another companion paper to the PRR, called "Impact of Economic Policy Options on Deforestation in Madagascar". Just as researchers necessarily account for correlations between data that has a time dimension, we must likewise account for correlations between data that has a spatial or geographic dimension. The standard assumption in regressions is that observations are independent of one another. But it many cases, it seems a more likely that observations that are "near" in a geographic (or sometimes even in a social) sense either influence each other moreso than an obseration that is far, or that they are more influenced by the same unmeasured force than an observation that is far. Many studies just ignore the potential interaction between observations of data, but to do so could lead to inefficient and sometimes inconsistent estimates. For example, any analysis which has as units of observation administrative units (like countries and states) probably ought to account for the geographic nature of the variables -- and yet most studies of this nature do not take 1 For the programs, go to www.timthomas.net. 2 spatial nearness into account. Nevertheless, spatial analyses are becoming more mainstream. Anselin (1980, 1988), Hepple (1979), and Hordjik and Paelinck (1976) were among the first to apply spatial methodologies to economic data. The field of spatial statistics developed even earlier (Moran 1948, 1950a, 1950b; Geary 1954; Cliff and Ord 1972, 1973). But von Thünen was one of the earliest to understand the importance of geographic location in economic analysis (Von Thünen 1826, 1966; Samuelson 1983). Even if we do not believe that we need to take spatial nearness into consideration due to the dependent variables, in some cases we ought to consider it for the sake of the explanatory variables. Many explanatory variables we use in our analyses are geographic in nature, like nearness to roads or cities; rainfall; or soil characteristics. When we use an explanatory variable that is inherently spatial, there is a reasonable chance that it is measured with some imprecision. It is not unusual for GIS2 data to have imprecisely defined boundaries (e.g., for a soils map, where one type of soil ends and the other begins is often approximate), or for information from one GIS layer to be slightly misaligned geographically with another layer. This type of measurement error is likely to induce spatial error into the problem, which ultimately ought to be reflected in the residual of the regression analysis. Weights matrices One of the main differences between time series analysis and spatial analysis is that time is linear and is ordered in the sense that we can always know which observations precede and follow any given observation. We know that the year 2003 comes after the year 2002, and 2001 2 GIS is short for "geographical information system", which in practice is digital data which is geographically referenced (has x and y coordinates). 3 comes immediately before it. This is not the case in spatial analysis, where it is normal for one observation to have several neighbors, and there is no natural ordering. For example, Canada and Mexico are neighbors of the United States, but the United States is also a neighbor of Canada. In order to specify the relationship between observations, most spatial analyses use weights matrices. The weights matrix tells how observation i, represented by values in row i, relate to another observation j, given in column j. By convention, the weights matrix assumes that the weight of observation i on observation i is 0 for all i. Most analyses assume many zeroes in the weights matrix. In part, this is because we would expect the direct interaction between distant observations as being very small if not zero, and because the mathematical operations required for this type of analysis -- including storing an N by N matrix, and having to invert it -- limit the size of the analysis if the matrix is not sparse (i.e., having many zeroes). The memory requirements of square matrices can be considerable. In the case we will consider later, we have 5,661 data points. The square matrices A and B have 32,046,921 elements, and if the computer uses 8 bytes per element, then it requires 256,375,368 bytes for storing each one. The matrix we use, however, is sparse, and to save it in sparse format takes less than 5,142,506 bytes – and this is larger than most weight matrices, because on average there are 47 neighbors for each observation in our example. With only 4 neighbors, it would be less than 500 kilobytes, while the full (non-sparse) matrix would still take up 256 MB of memory. The literature often talks about a row standardized weights matrix, which simply means each row is normalized so that it sums to 1. This alleviates the problem with "holes" in the log determinant (Bell and Bockstael 2000), and it allows us to interpret the spatial parameter as a correla- 4 tion measure, which in the analogy to time series, could range from -1 to 1 (and in most spatial analyses, would be expected to range from 0 to 1). Some standard weights matrices (before row-standardization) are as follows: Each neighbor (sharing any portion of a border) is given a weight of 1. This is particularly used with polygons, and less so with points. See Figure 1 for an example of each type. The figure on the left represents firaisana boundaries in Madagascar (these are like counties). The figure on the right represents points evenly spaced on a 20 kilometer grid. [ Insert Figure 1 near here ] N nearest neighbors are given a weight of 1, where N is arbitrarily chosen by the researcher. Distance-weighted neighbors (so that the nearest neighbors are given the highest weight, perhaps by using inverse distance or inverse distance squared). Note that in order for the analysis to use sparse matrices -- an important memory saver when there are more than a thousand observations -- the researcher usually assigns a maximum distance, beyond which the value in the matrix is 0. While it makes sense to row-standardize the first two types of weights matrices, Bell and Bockstael discuss the advantages and disadvantages of doing so to weights matrices based on distance weights. The advantage is related to the eigenvalues, but the disadvantage is that one would expect that absolute distance ought to be an important measure of influence of each neighbor, and that these influences should be consistent across all observations. 5 In agreement with Bell and Bockstael, let's consider an example of how row standardization in this case goes against intuition about how neighbors ought to influence each other. Suppose that observation 1 has only two neighbors, each at the 10 kilometer cut-off. In other words, observation 1 is fairly isolated. In my work with deforestation studies, this might represent a small clump of trees somewhat distant from other clumps and larger contiguous forests. Suppose observation 2 has eight neighbors at a 1 kilometer distance, and many more would be expected in the 2 to 10 kilometer distance. In my deforestation work, this would be trees in the middle of a large contiguous forest. To simplify the mathematical calculation, however, let’s assume that there are no neighbors in the 2 to 10 kilometer range. Row standardization would tell us that each of the 8 neighbors of observation 2 would have an influence weighted at 0.125 (with rowstandardization), while observation 1's two neighbors, at a great distance of 10 kilometers each from observation 1, have weights of 0.5 each. As a result, the row standardized weights matrix tells us that observation 1's neighbors have much stronger influence on observaton 1 than do each of observation 2's neighbors on observation 2, even though they are considerably closer. An ad hoc solution to this dilemma is what I call "matrix-standardization". In matrix standardization, we sum all of the rows, and then divide each element in the matrix by the maximum of the sums. In this way, no row of the "matrix-standardized" weights matrix will sum to greater than 1, and the spatial parameter ought to still be able to be thought of as a correlation measure, though technically no longer restricted to be less than 1 in absolute value. Conley (1996, 1999) and others (see Hepple 1995a for examples) assume that instead of using weights matrices to explain how the errors are correlated, we could assume that the residual covariances were a function of distance only. Conley then nonparametrically estimates the covariance matrix (which, as we shall see shortly, is a function of the inverse of the weights matrix in 6 the normal approach to spatial econometrics). Conley assumes homoskedasticity, which differs from the approach using the weights matrix, because that approach generally implies heteroscedasticity, though not necessarily a large amount. In this paper, we will use the method of Anselin and others to impose weights matrices. Incidentally, on p. 257 of Anselin (2002), he shows a simple example of polygons in a plane, the non-row-standardized neighbors matrix resulting from the arrangement of the polygons, and the row-standardized weights matrix. This may be helpful to anyone having difficulty picturing what I am describing here. Gibbs Sampling in the Spatial Probit Those who are interested in learning more about Bayesian econometrics can reference a number of useful texts (Lancaster 2004; Koop 2003; Gill 2002; Carlin and Louis 2000; Gelman et al. 1995). These texts include information about the Gibbs Sampler. Other texts are more focused on sampling methodologies and Markov Chain Monte Carlo (MCMC) methods, of which the Gibbs Sampler is one (Robert and Casella 1999; Gilks et al. 1996). The best book to learn Bayesian methods for spatial regressions is LeSage’s web book (1999a). Much of what is presented in this paper is outlined in LeSage (1999a, especially pages 99-104, 107-110, and 152154) and Fleming (2004, although I present more of the details, especially in regard to the full spatial model). Beron and Vijverberg (2004) is also a good source. The Gibbs Sampling algorithm is both simple and powerful. It requires that we be able to specify the distribution of each parameter conditional on the other parameters. The algorithm begins by taking a draw from the conditional distribution associated with the first parameter or set of parameters. For example, we will show shortly that b (the parameter multilplied times the 7 explanatory variables, X) is distributed normally, with a known mean and variance that depends upon the other parameters. A draw is simply randomly selecting a value from the normal distribution with the given mean and variance. The algorithm proceeds sequentially through each parameter by taking a draw from each of the conditional distributions. After a draw for each parameter has been made, the iteration is completed, and the resulting parameter simulations are saved.3 The advantage of the Gibbs Sampler for use with the probits and tobits -- and especially the spatial probit -- is that it simulates the continuous latent variable and then treats the data like a regular linear regression, in the sense that it treats the simulated variable as if it were the actual variable (Albert and Chib 1993). We might think of each value for the latent variable as being a parameter. Most of what I write in this paper is applicable both to spatial analyses where the dependent variable is known and unknown. There are three types of spatial models that we will discuss here: The spatially autocorrelated lag model (SAL, and sometimes SAR) is given by y = Wy + X+ u, where u is iid normal, with mean of 0 and variance of 2V, where V reflects heteroscedasticity and is therefore diagonal, but not all ones. The SAL is equivalently written as y = (I – W)-1X + (I – W)-1u. The weights matrix is W, and the spatial parameter is . The spatially autocorrelated error model (SAE, and sometimes SEM) is given by y = X + , where = W + u, and where u is iid normal, with mean of 0 and variance of 2V. They are saved after the “burn in” period is finished. "Burn in" is the set of omitted iterations that take it from the starting point, which was likely to be far from the true parameters, to a point where the simulations are likely at or near the true parameters. 3 8 This is rewritten as y = X + (I – W)-1u, which could also be written as y = Wy + X – WX + u. Here, the spatial parameter is . The full spatial model (SAC), which combines the other two, and is given by y = W1y + X + , where = W2 + u, and where u is iid normal, with mean of 0 and variance of 2V. This is rewritten as y = (I – W1)-1X + (I – W1)-1(I – W2)-1u. Sometimes W1 = W2. The mathematics can be notationally challenging, so to generalize, we follow LeSage's simplifying notation, in which all three models can be written as y = A-1X + A-1B-1u, or Ay = X + B-1u, or BAy = BX + u. Here, A = I in the SEM, and (I – W1) in the SAR and SAC; B = I in the SAR, and (I – W2) in the SEM and SAC. This is equivalent to saying that is set equal to 0 in the SEM, while is set equal to 0 in the SAR, and that A is always equal to (I – W1), while B is always equal to (I – W2). Consider the multivariate regression from a Bayesian approach, using Gibbs sampling. The likelihood function can be written as L 2 2 N 2 V 1 2 A B exp 1 2 Ay X B V 1 B Ay X 2 or, using more shorthhand notation, as L 2 2 N 2 V 1 2 2 e BV Be A B exp 1 1 2 where we define e = Ay – X. This can be re-written as a likelihood function for y*, which is L 2 2 N 2 V 1 2 A B exp 1 2 y * X * y * X * 2 9 where X* = PBX, y* = PBAy, and V-1 = P’P. Since we assumed that V reflected only heteroscedasticity, the off-diagonal terms are all 0, and P is equal to the square root of the reciprocal of each element of V. That is, Pii = 1 / sqrt(Vii); and Pij = Vij = 0 if i j. Koop (p. 36) factors a likelihood similar to this into something that looks like a product of a normal distribution for and a gamma or chi-squared distribution for the inverse of 2. In our case, it would give us LV 1 2 A B 2 2 k 2 exp 1 2 ˆ X * X * ˆ 2 N k s 2 2 N k 2 2 exp 2 2 where N is the number of observations, k is the length of , 1 ˆ X * X * X * y * and s 2 y * X * ˆ y * X * ˆ N k Here, has a mean of ˆ and a variance of 2(X*′X*)-1; and the inverse of 2 is distributed with N–k degrees of freedom. In the Bayesian approach, the posterior is equal to the prior times the likelihood function. The prior simply is a measure of our beliefs about the parameters before we do the experiment (i.e., do the data analysis). Often we have very few expectations about the value of the parameters, so in the case of b we might have a prior that it is equal to the zero vector, with a large variance and no covariance. Our objective in the Bayesian approach is to find the parameters for the posterior. 10 Suppose we assume a prior for that is distributed normally with a mean vector of c and a covariance matrix of 2T. The prior can be written as p 2 2 k 2 T 1 2 exp 1 2 c T 1 c 2 Looking only at kernel of the posterior with respect to for the posterior we get p exp 1 2 ˆ X * X * ˆ 1 2 c T 1 c 2 2 Multiplying and rearranging terms, we find that the kernel of the posterior is proportional to ~ ~ exp 1 H 2 where H X * X * 2 T 1 2 H 1 X * y * 2 T 1c 2 ~ ~ This shows that is normally distributed with a mean of and a variance of H-1. If we do not have very strong prior beliefs about , then the covariance matrix given by 2T would be large. When this is the case, the second term in H becomes small relative to the first term, and H approaches X*′X*/2. At the same time, the second term in brackets in the ~ ~ expression for becomes small relative to the first bracketed term, so that approaches (X*′X*)-1X*′y*. LeSage (1999a), following Geweke (1993), says that in y * X * ˆ y * X * ˆ 2 11 2 is distributed as a chi-squared with N degrees of freedom. However, if we assume an improper prior of 2 for 2, and multiply this prior by the term in the second bracket of Koop’s rearrangement of the normal likelihood function for and , we see that the resulting posterior appears to be distributed with N – k degrees of freedom instead, since the chi-squared distribution is given by z d 21e z 2 d 22 d 2 where d is the degrees of freedom and z is the measure in question. The reason for the difference, according to Geweke, is that in the conditional, which is what we are interested in with Gibbs Sampling, we gain k degrees of freedom because we conditioned on the . ~ Let X = BX and ~y = BAy. Assume that our prior is that r / Vii is disributed as chi-squared with r degrees of freedom. Then we find from Geweke that ~ ~ yi X i ˆ 2 2 r Vii is distributed as a chi-squared with r + 1 degrees of freedom. Geweke (1993) shows that this very special kind of heteroscedasticity is equivalent to assuming that the error is distributed as a t-distribution with r + 1 degrees of freedom. That is, by making a prior assumption about the nature of the heteroskedasticity, we are able to control the degrees of freedom in the error term. Choosing a large value for r makes the t-distribution behave like a normal distribution, while choosing a value near 4 makes the t-distribution behave similar to a logistic distribution. It would be helpful to discuss the covariance terms, 2 and V. In the maximum likelihood probit, one is unable to compute because it is only found in a ratio with , and is not found apart from . However, when simulating the latent variable y, the is found apart from , so we are able to compute it, if we choose to. If our interest is in comparing the results to the maxi12 mum likelihood results, then there may be value in fixing to be equal to 1. Otherwise, we could get a better fit if we let it vary, at least in the case of homoscedasticity. When considering heteroscedasticity with maximum likelihood, we cannot estimate a parameterized covariance matrix and simultaneously. For example, we could not hope to estimate a i defined as 0*exp(0 + 1Zi), since ln(0) and 0 are colinear (you can see this by taking the log of this proposed expression for i). In the case of the Gibbs Sampler, however, we do not parameterize the covariance matrix V. Instead, we assume that it is drawn from a chi-squared distribution of r degrees of freedom, where we choose r based on the degree of heteroscedasticty that we believe is in the data (small values of r represent greater heteroscedasticity). Since this distribution has a mean of r and a variance of 2r, unless we allow for the possibility that is not equal to 1, then the covariance matrix is fixed in scale. From that perspective, it is entirely reasonable and desirable to simulate both 2 and V. On the other hand, the mean of the diagonal elements of V will not be 1, in general, though it will likely be near 1, and so the meaning of 2 is less clear. If the true variance has a mean much different than 1, allowing 2 to vary would detect this. In the end, the choice depends a lot on preferences, on whether multicollinearity interferes with jointly estimating 2 and V, and whether we need to compare the parameters to maximum likelihood probit estimates. I did note in checking the linear model, that restricting 2 to be equal to 1 yields very bad results, but for our case where the dependent variable is dichotomous, we will likely restrict 2 to be 1. 13 Simulating the Latent Variable Recall that y = A-1X + A-1B-1u. Because E[A-1B-1u] is 0 when u is iid with mean 0 and variance V, E(y) is A-1X. However, while the mean is zero, the variance is not simple. The variance A A BV is given by = A1 B 1V B 1 1 1 , and the inverse of the variance be given by 1 BA = -1 = A'B'V-1BA. This tells us that the error term is distributed as a multivariate normal (MVN). We have just shown very simply that all we need to do to simulate the latent dependent variable y is to take a draw from a MVN. All that follows in this section is the mathematics involved in re-framing how to express the variance of this particular MVN to something that is computational easier (i.e., quicker) for a computer to calculate. Anticipating the need for a shorter notation for the variance and it's inverse, let be used for the variance, and = -1 = A'B'V-1BA be used for the inverse variance. The pdf for the MVN distributions is given by4 f y 2 N 2 1 2 exp 1 2 y 1 y (1) If we partition y into two blocks, y1 and y2 (which may both be vectors or one a vector and the other a scalar), then the joint distribution is given by f y1 , y 2 2 N 2 11 12 21 22 1 2 1 y1 1 11 12 y1 1 exp 1 2 y 2 2 21 22 y 2 2 The determinant of a partitioned matrix can be written as 4 Just to be clear, for our specific problem, = A-1X. 14 (2) 11 12 21 22 1 1 22 11 12 22 21 11 22 2111 12 (3) The inverse of a partitioned matrix can be written as 11 21 12 22 1 1 1 11 12 22 21 1 1 22 21 11 12 22 21 1 1 1 1 12 22 1 22 22 21 11 12 22 21 1 1 1 1 1 11 12 22 21 11 12 21 11 11 1 1 1 22 2111 12 2111 1 11 12 22 21 1 1 1 12 22 1 1 11 12 22 2111 12 1 22 2111 12 1 1 (4) Plugging the equations for determinant and inverse of a partitioned matrix into the equation for the joint distribution and rearranging terms (this is no small exercise!), we verify that the joint distribution is equal to the conditional distribution times the marginal distribution. That is, f(y1,y2) = f(y1|y2) f(y2), where f y1 y 2 2 N1 2 1 11 12 22 21 1 11 1 2 12 22 21 1 exp 1 2 y1 1 12 22 y 2 2 y 1 1 1 12 22 1 y2 2 (5) and f y 2 2 N2 2 22 1 2 1 exp 1 2 y 2 2 22 y 2 2 Equation (5) is the formula for a MVN with mean 1 + 1222-1(y2 – 2 and variance 11 – 1222-121. Equation (6) is the formula for a MVN with mean of 2 and variance of 22. We defined = -1, which in the SEM is equal to (I – W)'V-1(I – W). Using (4), we can write a number of equalities defining the submatrices of interest. They include 15 (6) 1 1 11 11 12 22 21 12 22 1 (7a) 1 11 12 (7b) Using the equations of (7), we see that we can express (5) as a MVN with mean 1 – 11-112 (y2 – 2) and variance 11-1. Note that 11-1 is a scalar, the reciprocal of 11, and the MVN is a simple single-variable normal distribution. It is very important to note is much easier to work with and compute than , since V-1 is the inverse of a diagonal matrix, and to compute the inverse, we simply invert each element on the diagonal of V. Multiplication by A and B is easy, since both matrices are sparse. On the other hand, is difficult to compute since we have to take the inverses of A and B, which are difficult to compute and not necessarily sparse. Working with the inverse variance directly makes an otherwise intractable problem for large matrices into an easy one. After computing the mean and variance of the conditional distribution, we take a draw from the left or right truncated normal distribution, depending on whether we observe a 0 or 1 for the observation. Note that the simulated latent variables use simulated values for the latent dependent variable from the previous iteration of the Gibbs Sampler. Let us focus on 12 (y2 – 2). If 11 is 1x1, then 12 is a 1 x (n-1) and y2 – 2 is an (n-1) x 1, the product of which is a 1 x 1 matrix. In programming, we like to use vectors and matrices instead of loops, especially when the number of loops is potentially large. So we ask ourselves, “How does the ith value in the vector (y – ) compare to 12 (y2 – 2)?” In order to do this, we need to switch from block matrix notation to row-column notation. This is fairly simple. When referring to the ith row (and ith column), the reader should note this is block 1, and all other rows except i (same for the columns) will be block 2. The ith value in (y – ) —which is a column vector—is simply i*(y – ), where the * means all columns of 16 (so i* is the ith row of ). We see that 12 (y2 – 2) = i*(y – ) – ii(yi – i), where 12 is simply the ith row of with the ith column omitted. That means that the means in (5) can be equivalently (and more usefully) written as – [ii-1](y – ) + (y – ) = + (I – [ii-1] (y – ), where [ii-1] is an n x n diagonal matrix, with diagonal elements consisting of the reciprocal of the corresponding diagonal element in , and all zeroes off-diagonal. The difference between this notation and (5) is that (5) represents one conditional, while this equation represents all n conditionals stacked. The diagonalized variance matrix for all n conditionals is [ii-1]. The most time consuming operation now that we have solved the problem of taking a draw from a MVN is in computing A-1X for the SAL and SAC. It is not necessary to compute A-1 to solve this problem, but only A-1X, which can be calculated by Gaussian reduction, which is much faster than computing the inverse and multiplying by X. If A is not too large (say, less than 1,000 by 1,000), then we could use Gaussian reduction to draw the uncorrelated residual and multiply by A-1 and B-1, as necessary, rather than using the method developed in the preceding pages. The last parameters that we need to find the conditional for are and . Because is distributed as L 2 2 N 2 V 1 2 I W1 B exp 1 2 I W1 y X B V 1 BI W1 y X 2 and is distributed as L 2 2 N 2 V 1 2 A I W2 exp 1 2 Ay X I W2 V 1 I W2 Ay X 2 17 we cannot generate a draw from a standard distribution. Instead, we must use the MetropolisHastings algorithm. We will not go into details here, but it is a standard accept-reject algorithm used in simulation. See LeSage (1999a) for more information. Marginal effects Beron and Vijverberg (2004) show how marginals are computed, but for very large matrices, their method may tax the computation capacity of many computers. Here, I parallel their presentation, but use notation which allows computation even with large datasets. I also discuss some alternative values which can be said to be the marginal effects in the probit. Let's begin by taking a closer look at the details of the spatial probit. First, we assume that there is a latent or unobserved dependent variable, y, such that y = A-1X + A-1B-1u, where, as before, A = I in the SEM, and (I – W1) in the SAR and SAC; B = I in the SAR, and (I – W2) in the SEM and SAC. We also note that by assumption, u is iid, with mean 0 and variance 2V, where V is a diagonal matrix, with elements taken from a t-distribution. Let there be a matrix P such that P'P = V-1. Instead of y, we observe a variable z equal to 1 if y > 0, and equal to 0 if y 0. Let us define two new n by n matrices S = A-1B-1P-1 and T = A-1X. Define v = Pu, where v is iid, with mean 0 and variance 2. For the ith observation (the ith row of y and T) we can now n write yi Ti S ij v j . Because they are independently distributed by assumption, the covarij 1 ance of vj and vk is 0 if j does not equal k. If each vj is from a normal distribution, the residual term, which for simplicity we will call i, is normally distributed with mean 0 and variance i2 = n 2 S ij2 . In general, j 1 n Sij2 will differ from j 1 n S j 1 2 kj , when i and k are different, and therefore 18 there will be a degree of heteroskedasticity in the covariance matrix. While this appears similar to the development we did earlier for simulating the latent variable y, it differs in that for the earlier analysis, we focused on the conditional distribution of each yi given all other yj, j not equal to i. Here, we look at the joint distribution of all y. Since zi = 1 implies that yi > 0, we know that Ti + i > 0, which in turn implies that i > –Ti. This tells us that when zi = 1, i is drawn from a truncated normal distribution, and that E(zi | zi = 1) = ((0 – (–Ti)) / i) = (Ti / i). Similarly, we find that E(zi | zi = 0) = 1 – (Ti / i). We also note that E(zi) = 1 • E(zi | zi = 1) + 0 • E(zi | zi = 0), which means that E(zi) = (Ti / i). For completeness, it seems appropriate to mention here that E(yi | zi = 1) = Ti + i(Ti / i) / (Ti / i), and that E(yi | zi = 0) = Ti – i(Ti / i) / [1 – (Ti / i)]. We are often interested in knowing the effect of a marginal change in one of the explanatory variables on the outcome. In the case of the probit, the outcome we are typically interested in is E(z). In the non-spatial probit, E z i x j ˆ j X i ˆ , where the "^" means the parameter estimate for the true parameter, xj is the jth variable in the explanatory variable matrix X, and Xi is the ith row of X. In practice, the marginal effect of a change in xj on z is given by E z x j ˆ j Xˆ , where X denotes the row vector consisting of the mean of each explanatory variable. While relatively easy to calculate, it seems that this is not the only possible estimate of the marginal effect of xj on z, nor is it the best. The normal pdf of a mean is not equal to the mean of the pdf, in general. It would seem more appropriate to compute the marginal effect of a change in xj on z for each observation, and then average the values, to give ˆ n X ˆ . n j i 1 i 19 This is particularly important once we consider heteroskedasticity which is found in the spa- tial probit. We find that in the case of the SEM probit, E z i xij ˆ j ˆ i X i ˆ ˆ i . Using the same idea that we used above when computing the "mean marginal effect" for the non-spatial probit, we would compute the mean marginal effect for the spatial probit with autocorrelated er- n ror as ˆ j n 1 ˆ i X i ˆ ˆ i . This method can only be applied if we can compute each i, i 1 which at first pass appears to require us to compute S = A-1B-1P-1. The marginal is much more difficult to calculate in the SAR and SAC. In the SEM, E(zi) = (Xi / i), but in the SAR and SAC, E(zi) = (Ti / i) = (A-1Xi / i) = ((I – W1)-1Xi / i). The (I – W1)-1 insures that xij will impact not only zi but also several other – and perhaps all n other – z. Define R = A-1, and define Q = B-1. Then yi = RiX + i, where i Rij u j in the j 1 n n SAR model, and i Rik Qkj u j in the SAC model. This means that i is distributed j 1 k 1 n normally with mean of 0 and variance i2 equal to 2 Rij2 for the SAR and j 1 n Rik Qkj j 1 k 1 n 2 for the SAC. K n We now can write E z i Ri X i Rim X mk k i . From this we derive the k 1 m1 marginal effect of a change in the lth value of variable xj on E(zi) as E z i xlj = Ril ˆ j ˆ i Ri Xˆ ˆ i , where l ranges from 1 to n, and j ranges from 1 to K. In the SEM, this derivative was equal to 0 if i did not equal l. Here it is not necessarily equal to 0. However, what this 20 means is that we would need to report nK marginals for each observation i, a total of n2K to compute and report. Even if we averaged across the n observations, we still have nK marginals. A more sensible alternative definition of "the marginal" might be the change in E(zi) resulting from a marginal change to all observations of xj. This would be written as E z i x j = ˆ j ˆ i Ri Xˆ ˆ i R n m 1 im . In order for us to have the marginal for E(z) instead of E(zi), we have to average across all observations to give E z x = ̂ j n j 1 ˆ R Xˆ ˆ R n i 1 n i i i m 1 im . Note that our ability to compute the marginal at first seemed related to our ability to compute the inverse for two potentially large matrices A and B. However, using Gaussian reduction and working with submatrices, it is possible to compute i and the summation of Rim without having to invert (and store) the entire matrix. We see this formula above is a generalized formula that includes the spatial error model, once we recognize that in the SEM, R = I, so that RiX = Xi, and the sum of all elements of row i in R, when R = I, is simply 1. An Application to Deforestation in the Spiny Forests of Madagascar Thomas (2007) did a detailed study of deforestation in Madagascar. We will consider an excerpt from that study, so that we can demonstrate some of the techniques developed in this paper. We will narrow our study by considering only the spiny forests of Madagascar, intentionally excluding the geographically larger moist forests also included in their study (see Figure 2). Furthermore, while the analysis used here in this report includes the same large number of explanatory variables as in the full report, we will not publish all of the variables, but only a few that we 21 can focus on. Readers interested in the full results should consult Thomas (2007). The abbreviated results are displayed in Table 1. One of the main objectives of the authors was to see what effect poverty had on deforestation. The authors used mean expenditure per capita at the firaisana (county) level as an indicator of poverty in a particular location. Table 1 shows that in the model with no spatial effects, areas having higher levels of per capita expenditure (i.e., lower levels of poverty) have lower probability of deforestation in the spiny forests of Madagascar between 1990 and 2000. This result is statistically significant at the 10 percent level. The result supports the general idea that poverty increases deforestation pressures. The magnitude of this effect, however, is small, as revealed by the marginal effects and from the difference in probability computed at the 5th percentile and 95th percentile of expenditure per capita shown in the data. We see that when we include spatial effects -- whether spatial error or spatial lag -- then the impact of poverty on deforestation is no longer statistically significant, though the quantitative effect is the same or larger (but still small). The direction of the effect remains the same in all models. I tried to compute the full spatial model, but the SAC was very sensitive to starting values, and converged to very different parameters, depending upon the initial assumptions. This is similar to what sometimes happens when maximizing a likelihood function, and the values converge to local maxima, but not global maxima. This is a warning to us to be careful to try different starting points, and not to rely on the SAC to be able to differentiate consistently between the spatial error and the spatial lag model. The author of the original study posits that nearness to roads ought to lead to higher probabilities of deforestation, and that this effect ought to be strong. However, the analysis shows mixed results in support of this hypothesis. In the model with no spatial effects, increasing distance 22 from the road increases the probability of deforestation, contrary to the hypothesis. This result is statistically significant at the 10 percent level, but the quantitative effect is very small for both the marginal effect and the change in probability computed by going from the 5th percentile of road distance to the 95th percentile. The SAR model supports this counter-intuitive result, increasing the statistical significance on the parameter estimate to the 1 percent level. The quantitative effects are slightly larger, but still quite small. The SEM model, however, contradicts this result. It shows that the parameter is negative, as anticipated. It is not statistically significant, however, nor is it quantitatively large. With contradictory results such as we have for the influence of this particular variable, it would be very helpful to know which model is the "true" model. In traditional frequentist econometrics, we often use the likelihood function to help us decide the "true" model, especially when the models are nested. Because the parameter estimates are stochastic in the Bayesian approach, so is the value of the log likelihood, and the value of the log likelihood is not helpful in computing statistical significance of nested models, as we do with the likelihood ratio test in frequentist econometrics. Should we be concerned about the spatial error parameter being so close to 1 in our estimates of the SEM model? If we used a row standardized weights matrix, where each row in the weights matrix sums to 1, then we should be very concerned. If the spatial parameter is 1 in such a model, then the process is not stationary, and the inverse of I – W -- equal to I + W + W2 + W3 + ... -- is undefined and I – W is singular. In this model, using the spatial error parameter estimated in the SEM model, we find that the diagonal elements of the inverse of I – errWerr range from 1 to 1.144, while the off-diagonal elements never get larger than 0.2. That the variance matrix does not get very large as it would if we had a row-standardized matrix with a spatial parameter 23 near 1 is in part due to the average row sum in the matrix-standardized weights matrix being only 0.642 instead of 1, and the largest individual weight in the matrix being 0.0782. While these contradictory results for the sign on the distance to roads parameters are disappointing because they leave us wondering which is the true story about roads, it is important to keep in mind that the quantitative impact of roads in any of these models was quite small, indicating that roads in themselves were not an important part of this story. But does it make sense that deforestation would NOT be influenced by roads, even though our previously stated belief was that it should? To be most accurate, we believe that deforestation ought to take place near roads before it takes place far from roads, all other things being equal. Since deforestation is a dynamic process, if the roads have not changed significantly over a long period of time, most of the valuable land near roads would already be deforested before our study's starting period, and therefore what remains may be still there due to circumstances which we did not measure for our dataset, such as landowner's preference to keep some trees; local parks; steep slopes; or poor soils. While we controlled for many of these effects, our data was coarse, and there could have been, for example, steep but highly localized slopes that did not show up in our data. The uncertainty in choosing which model is correct points to the need to develop the area of Bayesian model comparison in the spatial probit. This entails computing the marginal likelihood, which Hepple (1995a, 1995b, 2003) has written about extensively for the linear spatial case. Koop (2003, pp 157-161) outlines how to compute the marginal likelihood for cases with simulated latent variables, based on Chib's (1995) pioneering paper. To run such a program is time consuming, and it is not entirely straightforward to do. As such, it will be the subject of future work. 24 Nearness to non-forested areas is an important factor in deforestation according to all regressions in this study. It is a strongly statistically significant effect in all models, and in the models with spatial effects, it is quantitatively large in both the marginal effect and the effect found from comparing the response for not being near non-forested areas to the response for being near deforested areas. This seems to point to the importance of controlling for spatial effects, because without doing so, the influence of all of the explanatory variables is quantitatively small. One of our concerns with spatial effects in the probit model is that even if we only have spatial error (instead of spatial lag), the parameter estimates could become biased, unlike the effect in the linear model. This is because of the potential heteroskedasticity introduced by the weights matrix.5 In this particular study, we might have expected very little heteroskedasticity associated with the matrix-standardized inverse squared weights matrix (with a 10 kilometer cutoff). For example, with a spatial parameter of 0.7, the diagonal elements of (I – W)-2 only ranged from 1 to 1.123, and the largest off-diagonal element was 0.211. With little heteroskedasticity, we would expect little parameter bias in a probit model. Large analyses, in general, will have very little heteroskedasticity induced by the weights matrix, especially if the spatial parameter is not too large. An exception is the case of isolated clumps, where a small group has no relationship linking it to the wider group. Since inverses of block diagonal matrices are equal to the inverse of each block, the heteroskedasticity can be substantial in such cases. For example, if you were to add a small clump of 2 observations to our current example in the spiny forests of Madagascar (where each observation in the clump has only the other observation in the clump as a neighbor), we would find that the diagonal elements in the small clump portion of the (I – W)-2 matrix 5 Note that the idea that the off-diagonal elements don't lead to inconsistency in parameter estimates, only the ondiagonal, see Fleming pp 1, 7, 29; and to a lesser extent Anselin (1992) p 252. 25 have elements of 5.729, which is a much different variance from the large contiguous block with all diagonal elements very close to 1. Nevertheless, since our spatial parameter is relatively large in the SEM model, we find that the diagonal elements range from 1 to 2.339, which means that there is a reasonable degree of heteroskedasticity from the weights matrix. Because we used Geweke's method of heteroskedasticity, setting the degrees of freedom equal to 4 to simulate a t-distribution with very fat tails, the diagonal elements of the covariance matrix ranged from 1.7 to 136.8. This high degree of heteroskedasticity explains how the SEM parameters can differ so significantly from the parameters in the no spatial effects model. Conclusions The main purpose of this report was to lay a mathematical foundation for using Bayesian methods to solve probit models with spatial effects. We included an application of this method abstracted and condensed from a paper by Thomas (2007) dealing with deforestation in Madagascar. In the course of this paper, we derived some important results used in Bayesian methodology -- results that perhaps are not published elsewhere, especially the ones dealing with simulating the latent variable. While affirming the derivation of the marginal presented by Beron and Vijverberg (2004), the formulas computed here enable us handle larger matrices than the general formula presented by those authors. In our application, we compared probit models with alternative spatial structures. We used “matrix-standardized” weights matrices, a compromise suggested here to deal with the Bell and Bockstael critique of row-standardization when distance ought to matter. The models compared 26 in our example sometimes gave contradictory results, and we explained how log likelihood values were no help in resolve the contradiction. We also explained how we tried to estimate a SAC model, but discovered that it was highly sensitive to starting values, and therefore was not helpful in deciding between the spatial lag and the spatial error models. As we concluded, we pointed to the workable solution (a matter for future research) of using marginal likelihoods to determine which model fits the data the best. We saw in the spiny forests of Madadagascar how poverty had a very small positive effect on deforestation probability; how roads had very small but mixed effects on deforestation probability (depending on the model); and how nearness to nonforest was a very positive determinant of deforestation probability. Acknowledgements I am extremely indebted to Jim LeSage for not only writing an excellent text that is available on the web, but for programming almost everything he teaches in his book, and making those MatLab m-files available for download, as well. Thanks to Ken Chomitz for very helpful feedback on an earlier draft. References Albert, James H. and Siddhartha Chib. 1993. “Bayesian Analysis of Binary and Polychotomous Response Data”, Journal of the American Statistical Association 88:669-679. Anselin, Luc. 1980. Estimation Methods for Spatial Autoregressive Structures. Regional Science Dissertation and Monograph Series 8. Field of Regional Science, Cornell University, Ithaca, N.Y. Anselin, Luc. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer Academic. Anselin, Luc. 1999. “Spatial Econometrics,” working paper, Bruton Center, University of Texas at Dallas. 27 Anselin, Luc. 2002. "Under the Hood: Issues in the Specification and Interpretation of Spatial Regression Models", Agricultural Economics 27:247-267. Barry, Ronald and Pace, R. Kelly. 1999. “Monte Carlo Estimates of the Log Determinant of Large Sparse Matrices”, Linear Algebra and Its Applications 289:41-54. Bell, Kathleen P. and Nancy E. Bockstael. 2000. “Applying the Generalized-Moments Estimation Approach to Spatial Problems Involving Microlevel Data,” The Review of Economics and Statistics 82(1):72-82. Beron, Kurt J. and Wim P. M. Vijverberg. 2000. “Probit in a Spatial Context: A Monte Carlo Approach,” in Advances in Spatial Econometrics, ed. by Luc Anselin and Raymond J.G.M. Florax. New York: Springer-Verlag. Beron, Kurt J. and Wim P. M. Vijverberg. 2004. “Probit in a Spatial Context: A Monte Carlo Analysis”, in Advances in Spatial Econometrics: Methodology, Tools,and Applications, ed. by Luc Anselin, Raymond J.G.M. Florax, and Sergio J. Rey. New York: Springer-Verlag, 169-195. Carlin, Bradley P. and Thomas A. Louis. 2000. Bayes and Empirical Bayes Methods for Data Analysis 2nd ed. New York: Chapman and Hall. Chib, Siddhartha. 1995. “Marginal Likelihood from the Gibbs Output”, Journal of the American Statistical Association 90(432, December):1313-1321. Chomitz, Kenneth M., with Piet Buys, Giacomo De Luca, Timothy S. Thomas, and Sheila Wertz-Kanounnikoff. 2007. At Loggerheads? Agricultural Expansion, Poverty Reduction, and Environment in the Tropical Forests. Washington: The World Bank. Cliff, A. and J. K. Ord. 1972. "Testing for spatial autocorrelation among regression residuals", Geographical Analysis 4:267–284. Cliff, A. and J. K. Ord. 1973. Spatial Autocorrelation. London: Pion. Conley, T.G. 1996. "Econometric Modelling of Cross-sectional Dependence". Ph.D. Dissertation. Department of Economics, University of Chicago, Chicago, IL. Conley, T.G. 1999. "GMM Estimation with Cross Sectional Dependence", Journal of Econometrics 92:1-45. Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. “Maximum Likelihood from Incomplete Data Via the EM Algorithm”, Journal of the Royal Statistical Society B 39:1-38. Fleming, Mark. 2004. “Techniques for Estimating Spatially Dependent Discrete Choice Models”, in Advances in Spatial Econometrics: Methodology, Tools,and Applications, ed. by Luc Anselin, Raymond J.G.M. Florax, and Sergio J. Rey. New York: Springer-Verlag, 145-168. Geary, R. C. 1954. "The Contiguity Ratio and Statistical Mapping", The Incorporated Statistician 5:115-145. Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 1995. Bayesian Data Analysis. London: Chapman and Hall. 28 Geman, S., and D. Geman. 1984. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images”, IEEE Transactions on Pattern Analysis and Machine Intelligence 6:721-741. Geweke, John. 1993. “Bayesian Treatment of the Independent Student-t Linear Model”, Journal of Applied Econometrics 8:19-40. Gilks, Walter R., Sylvia Richardson, and David J. Spiegelhalter, eds. 1996. Markov Chain Monte Carlo in Practice. London: Chapman and Hall. Gill, Jeff. 2002. Bayesian Methods: A Social and Behavioral Sciences Approach. Boca Raton, FL: Chapman & Hall/CRC. Greene, William H. 2003. Econometric Analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall. Hastings, W. K. 1970. “Monte Carlo Sampling Methods Using Markov Chains and Their Applications,” Biometrika 57:97-109. Hepple, Leslie W. 1979. "Bayesian Analysis of the Linear Model with Spatial Dependence", in Exploratory and Explanatory Statistical Analysis of Spatial Data, ed. By C. P. A. Bartels and R. H. Ketellapper. The Hague: Martinus Nijhoff, pp 179-199. Hepple, Leslie W. 1995a. "Bayesian Techniques in Spatial and Network Econometrics: 1. Model Comparison and Posterior Odds", Environment and Planning A 27:447-469 Hepple, Leslie W. 1995b. " Bayesian Techniques in Spatial and Network Econometrics: 2. Computational Methods and Algorithms", Environment and Planning A 27:615-644. Hepple, Leslie W. 2003. "Bayesian Model Choice in Spatial Econometrics", paper for LSU Spatial Econometrics Conference, Baton Rouge, 7-11 November 2003. Hordijk, L. and J. Paelinck. 1976. "Some Principles and Results in Spatial Econometrics", Recherches Economiques de Louvain 42:175–97. Kelejian, Harry H. and Ingmar R. Prucha. 1999. “A Generalized Moments Estimator for the Autoregressive Parameter in a Spatial Model”, International Economic Review 40(2):509533. Koop, Gary. 2003. Bayesian Econometrics. Chichester: Wiley. Kreyszig, Erwin. 1993. Advanced Engineering Mathematics, 7th ed. New York: Wiley. Lancaster, Tony. 2004. An Introduction to Modern Bayesian Econometrics. Malden, MA: Blackwell. LeSage, James P. 1997. “Bayesian Estimation of Spatial Autoregressive Models”, International Regional Science Review 20(1&2):113-129. LeSage, James P. 1998. “Spatial Econometrics”, mimeo. Downloaded July 2003 from http://www.spatial-econometrics.com/html/wbook.pdf. LeSage, James P. 1999a. “The Theory and Practice of Spatial Econometrics”, mimeo. Downloaded July 2003 from http://www.spatial-econometrics.com/html/sbook.pdf. 29 LeSage, James P. 1999b. “Applied Econometrics Using MATLAB”, mimeo. Downloaded July 2003 from http://www.spatial-econometrics.com/html/mbook.pdf. LeSage, James P. 2000. “Bayesian Estimation of Limited Dependent Variable Spatial Autoregressive Models”, Geographical Analysis 32(1):19-35. McMillen, Daniel P. 1992. “Probit with Spatial Autocorrelation,” Journal of Regional Science 35(3):417-436. Metropolis, N., W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. 1953. “Equation and State Calculations by Fast Computing Machines”, Journal of Chemical Physics 21:10871092. Moran, P. A. P. 1948. "The interpretation of statistical maps", Biometrika 35:255–260. Moran, P. A. P. 1950a. "Notes on continuous stochastic phenomena", Biometrika 37:17–23. Moran, P. A. P. 1950b. "A test for the serial dependence of residuals", Biometrika 37:178–181. Pace, R. Kelly and Ronald Barry. 1998. “Quick Computations of Spatially Autoregressive Estimators”, Geographical Analysis 29(3):232-247. Robert, Christian P. and George Casella. 1999. Monte Carlo Statistical Methods. New York: Springer. Samuelson, Paul. 1983. “Thunen at Two Hundred”, Journal of Economic Literature 21(4:December):1468-1488. Thomas, Timothy S. 2007. "Impact of Economic Policy Options on Deforestation in Madagascar", a Companion Paper for the Policy Research Report on Forests, Environment, and Livelihoods, The World Bank, Washington, D.C. Vijverberg, Wim P. M. 1997. “Monte Carlo Evaluation of Multivariate Normal Probabilities,” Journal of Econometrics 76:281-307. von Thünen, Johann Heinrich. 1966. Isolated State (an English edition of Der Isolierte Staat published in 1826). Translated by Carla M. Wartenberg. Edited with an introduction by Peter Hall. Oxford and New York: Pergamon Press. Whittle, P. 1954. "On Stationary Processes in the Plane", Biometrika 41:434–449. World Wildlife Federation (WWF). 2001. "Terrestrial Ecoregions GIS Database". Available at http://www.worldwildlife.org/science/data/terreco.cfm. Downloaded May 25, 2005. 30 Figure 1. Possible ways for geographic data to be used in a statistical analysis: administrative units and a regular grid of points 31 Figure 2. Biomes of Madagascar (WWF) 32 Table 1 Probit results Variable Param Std Dev Prob from simul Marginal effect Effect of change from 5th to 95th percentile -0.5624 -0.0001 -0.0021 0.0000 0.0730 0.0886 0.1295 0.0026 0.0059 0.0031 0.0136 0.0184 0.0355 -0.6508 -0.0001 -0.0024 0.0000 0.1722 0.1629 0.2097 0.0090 0.0310 0.0234 0.0959 0.0893 0.1244 -0.4553 -0.0001 -0.0003 0.0000 0.0896 0.0932 0.1367 0.0040 0.0047 0.0125 0.0530 0.0560 0.0981 No spatial Constant Expenditure per capita Population density Road, distance to Distance to non-forest: 1km-2km Distance to non-forest: 500m-1km Distance to non-forest: <500m lag (spatial parameter) err (spatial parameter) log likelihood (using latent variable) log likelihood (probit form) -4.2451 -1.06E-03 -1.59E-02 7.57E-06 0.5508 0.6688 0.9771 None None -7,631.763 -1,482.173 0.3631 6.57E-04 5.39E-03 4.51E-06 0.2105 0.1965 0.1927 0.000 0.055 0.000 0.051 0.005 0.000 0.000 SAR Constant Expenditure per capita Population density Road, distance to Distance to non-forest: 1km-2km Distance to non-forest: 500m-1km Distance to non-forest: <500m lag (spatial parameter) err (spatial parameter) log likelihood (using latent variable) log likelihood (probit form) -2.6607 -5.67E-04 -9.69E-03 8.95E-06 0.7039 0.6660 0.8573 0.8010 None -7,674.402 -1,427.657 0.3331 5.22E-04 2.48E-03 3.39E-06 0.2676 0.2604 0.2380 0.0360 0.000 0.139 0.000 0.006 0.000 0.001 0.000 0.000 SEM Constant Expenditure per capita Population density Road, distance to Distance to non-forest: 1km-2km Distance to non-forest: 500m-1km Distance to non-forest: <500m lag (spatial parameter) err (spatial parameter) log likelihood (using latent variable) log likelihood (probit form) -3.5800 -4.92E-04 -2.58E-03 -1.05E-05 0.7046 0.7325 1.0751 None 1.0023 -7,707.396 -1,608.253 33 0.4599 9.48E-04 2.95E-03 9.00E-06 0.2663 0.2668 0.2661 0.000 0.311 0.217 0.120 0.002 0.002 0.000 0.0270 0.000 Table 2 Summary statistics for data Variable Expenditure per capita Population density Road, distance to Distance to non-forest: 1km-2km Distance to non-forest: 500m-1km Distance to non-forest: <500m Binary variable Mean Std dev Min 5th percentile Median 95th percentile Max 0 0 0 1 1 1 236 15.2 9,477 0.113 0.180 0.659 56 21.3 7,998 0.316 0.384 0.474 139.2 1.324 0 0 0 0 153.6 5.684 707 0 0 0 232.3 9.097 7,403 0 0 0 314.9 43.682 25,943 1 1 1 611 628.4 44,090 1 1 1 34
© Copyright 2026 Paperzz