The Modifiable Areal Unit Problem in a Regression Model Whose Independent Variable Is a Distance from a Predetermined Point

Naoto Tagashira and Atsuyuki Okabe

Geographical Analysis, Volume 34, Number 1, January 2002, pp. 1-20 (Article). Published by The Ohio State University Press. DOI: https://doi.org/10.1353/geo.2002.0006

This paper deals with the modifiable areal unit problem in the context of a regression model where a dependent variable is an attribute value (say, income) of an atomic data unit (say, a household) and an independent variable is a distance from a predetermined point (say, a central business district) to the atomic data unit (a disaggregated model). We apply this disaggregated model to spatially aggregated data in which the dependent variable is the average income over a spatial unit and the independent variable is the average distance from each household in a spatial unit to the predetermined point (an aggregated model). First, estimating the slope coefficient by the least squares method, we prove that the variance of the estimator for the slope coefficient in the aggregated model is larger than that in the disaggregated model. Second, focusing on variations in the variance of the estimator for the slope coefficient in the aggregated model with respect to the number of zones, we obtain the number of zones in which the variance is close to that in the disaggregated model. Third, we obtain the zoning system that has the minimum variance for a fixed number of zones. We also calculate the maximum variance in order to examine the range of the variance.

When we investigate urban phenomena with a regression model, we often use a distance variable as an independent variable.
For example, let us consider the analysis of an income distribution in a city. A possible hypothesis is that the income of a household varies with respect to the distance from the central business district (CBD) to the household. To test this hypothesis, we need the income of each household and the distance from each household to the CBD. However, individual income data are usually unavailable. Thus we often use data aggregated over zones, say, census tracts. This aggregation is problematic, because the results calculated from aggregate level data are unlikely to correspond to those from individual level data. In addition, the results from aggregate level data might vary according to the degree of aggregation. The former problem is known as the ecological fallacy and the latter as the modifiable areal unit problem. The modifiable areal unit problem contains two problems: the scale problem and the aggregation problem (Openshaw 1984b). The scale problem results from the difference in the number of zones partitioning a region. The aggregation problem results from the particular configuration for a fixed number of zones.

[Author note: Naoto Tagashira is a research engineer at the Central Research Institute of Electric Power Industry. Atsuyuki Okabe is a professor in the Center for Spatial Information Science, University of Tokyo. Submitted: 9/09/99. Revised version accepted: 3/22/01.]

This paper deals with the ecological fallacy problem, the scale problem, and the aggregation problem in the context of a regression model where a dependent variable is an attribute value (say, income) of an atomic data unit (say, a household) and an independent variable is a distance from a predetermined point (say, the CBD) to the atomic data unit (we call this model a disaggregated model).
We apply this disaggregated model to spatially aggregated data in which the dependent variable is the average income over a spatial unit and the independent variable is the average distance from each household in a spatial unit to the predetermined point (we call this model an aggregated model). Okabe and Tagashira (1996) discuss the difference between the expected value of the estimator for the slope coefficient in the aggregated model and that in the disaggregated model. In this paper, we focus on the difference between the variance of the estimator for the slope coefficient in the aggregated model and that in the disaggregated model. The first objective of this paper is to examine which model has the smaller variance. If the answer is the disaggregated model, the second objective is to obtain the number of zones and the configuration of zones for which the variance of the estimator for the slope coefficient in the aggregated model is as close as possible to that in the disaggregated model. This result would be very useful in practice, because it would show under what conditions we can safely use the aggregated model.

There are many related studies. Gehlke and Biehl (1934), Robinson (1950), and Yule and Kendall (1950) focus on variations in a correlation coefficient with respect to the number of zones. Among them, Robinson (1950) in particular gives empirical evidence that the ecological fallacy problem can in fact occur. Their studies are extended to variations in estimators for coefficients of a regression model (for example, Blalock 1964; Sawicki 1973; Clark and Avery 1976; Openshaw 1978, 1984a, 1984b; Arbia 1989; and Fotheringham and Wong 1991). In their models, however, variables are mostly aspatial attribute values and a distance variable is not always explicitly treated.
The distance variable is explicitly considered in spatial interaction models (for example, Curry 1972; Johnston 1973, 1975, 1976; Cliff, Martin, and Ord 1974, 1975; Curry, Griffith, and Sheppard 1975; Masser and Brown 1975, 1978; Okabe 1977; Openshaw 1977; Slater 1985; Batty and Sikdar 1982a, b, c, d; and Amrhein and Flowerdew 1992). These studies, however, cannot be applied to the regression model dealt with in this paper. The studies most relevant to this paper are Openshaw (1978, 1984a, 1984b). He proposed some zone-design criteria (Openshaw 1978) and examined the ecological fallacy through empirical experiments (Openshaw 1984a). In addition, he showed how the goodness of fit was affected by the choice of zones (Openshaw 1984b). This paper builds on his work.

This paper consists of five sections including this introductory section. In the next section, we prove that the variance of the estimator for the slope coefficient in the aggregated model is larger than that in the disaggregated model. In section 2, focusing on the scale problem, we examine the number of zones for which the variance of the estimator for the slope coefficient in the aggregated model is close to that in the disaggregated model. In section 3, we turn to the aggregation problem, and obtain the zoning system that has the minimum variance for a fixed number of zones. We also calculate the maximum variance in order to examine the range of the variance. In the last section, we summarize the major results of this paper.

1. MODEL

Suppose that there are $n$ atomic units (say, households) in a region $S$. Let $y_i$ be an attribute value (say, income) of the $i$th atomic unit, and $x_i$ be the distance between the $i$th atomic unit and one predetermined center (say, the CBD). We assume that the relationship between $y_i$ and $x_i$ is truly represented by a general regression model,

$$ y_i = f(x_i) + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1} $$

$$ E(\varepsilon_i) = 0, \quad i = 1, \ldots, n. \tag{2} $$

Suppose next that the region $S$ is divided into $m$ subregions, $S_1, \ldots, S_m$, which are mutually exclusive and collectively exhaustive. Let $I_k$ be the set of indices of atomic units in a subregion $S_k$, let $n_k$ be the number of atomic units in $S_k$, and let

$$ y_{(k)} = \frac{1}{n_k} \sum_{i \in I_k} y_i, \quad x_{(k)} = \frac{1}{n_k} \sum_{i \in I_k} x_i, \quad \varepsilon_{(k)} = \frac{1}{n_k} \sum_{i \in I_k} \varepsilon_i, \tag{3} $$

which are the average values of $y_i$, $x_i$, and $\varepsilon_i$ across the atomic units in the subregion $S_k$, respectively. When the above model is applied to spatially aggregated data, one might expect that the equation

$$ y_{(k)} = f(x_{(k)}) + \varepsilon_{(k)}, \quad k = 1, \ldots, m, \tag{4} $$

also holds. However, it is well known that the above equation holds if and only if the general model is specified as a linear regression model. For example, Okabe and Tagashira (1996) provide the proof. Thus when we want to deal with a nonlinear model, we should transform the nonlinear model into a linear model through an appropriate transformation. Since we are concerned with a model that satisfies equation (4), we assume from now on that the relationship between $y_i$ and $x_i$ is truly given by the ordinary linear regression model, that is,

Model 1 (Disaggregated Model):

$$ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, n, \tag{5} $$

$$ E(\varepsilon_i) = 0, \quad i = 1, \ldots, n, \tag{6} $$

$$ \mathrm{Var}(\varepsilon_i) = \sigma^2, \quad i = 1, \ldots, n, \tag{7} $$

$$ \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0, \quad i \neq j, \quad i, j = 1, \ldots, n. \tag{8} $$

We notice, of course, that the assumptions (7) and (8) are sometimes violated in actual application. In particular, aggregation bias with spatial autocorrelation is discussed in depth (Cliff and Ord 1973; Anselin 1988). Yet we assume the ordinary linear regression model here because we want to focus on the variance of the estimator for the slope coefficient without the added complexity brought by relaxing these assumptions. Then, using the ordinary least squares method, we can get unbiased estimators.
Aggregating this disaggregated model with respect to households in each subregion $S_k$, we obtain

Model 2 (Aggregated Model):

$$ y_{(k)} = \beta_0 + \beta_1 x_{(k)} + \varepsilon_{(k)}, \quad k = 1, \ldots, m, \tag{9} $$

$$ E(\varepsilon_{(k)}) = \frac{1}{n_k} \sum_{i \in I_k} E(\varepsilon_i) = 0, \quad k = 1, \ldots, m, \tag{10} $$

$$ \mathrm{Var}(\varepsilon_{(k)}) = \frac{1}{n_k^2} \sum_{i \in I_k} \mathrm{Var}(\varepsilon_i) = \frac{\sigma^2}{n_k}, \quad k = 1, \ldots, m, \tag{11} $$

$$ \mathrm{Cov}(\varepsilon_{(k)}, \varepsilon_{(h)}) = \frac{1}{n_k n_h} \sum_{i \in I_k} \sum_{j \in I_h} \mathrm{Cov}(\varepsilon_i, \varepsilon_j) = 0, \quad k \neq h, \quad k, h = 1, \ldots, m. \tag{12} $$

Note that the aggregated model has the same coefficients as the disaggregated model. Also note that the disaggregated model has homoskedasticity [equation (7)], whereas the aggregated model has heteroskedasticity [equation (11)]. Thus the aggregated model is not the ordinary linear regression model. However, if we apply the weighted least squares method with weights proportional to $\mathrm{Var}(\varepsilon_{(k)})^{-1}$, that is, $n_k$, to the estimation of coefficients in the aggregated model, we can obtain unbiased estimators. Therefore, the expected values of the coefficient estimators in the aggregated model are the same as those in the disaggregated model. From now on, in order to distinguish the coefficients in the disaggregated model from those in the aggregated model, we denote the coefficients in the disaggregated model by $\beta'_0$ and $\beta'_1$.

1.1 Comparison of the Variance of the Estimator for the Slope Coefficient in the Aggregated Model with That in the Disaggregated Model

Let us next obtain the variance of the estimator for the slope coefficient in equation (9). Since the aggregated model has heteroskedasticity [equation (11)], we apply the weighted least squares method with weights given by $n_k$ to the estimation of $\beta_1$. The estimator, $\hat\beta_1$, of $\beta_1$ is

$$ \hat\beta_1 = \frac{\sum_{k=1}^{m} n_k (x_{(k)} - \bar{x})(y_{(k)} - \bar{y})}{\sum_{k=1}^{m} n_k (x_{(k)} - \bar{x})^2}, \tag{13} $$

where $\bar{x}$ and $\bar{y}$ are the averages of $x_{(k)}$ and $y_{(k)}$ in the region $S$, weighted by $n_k$, respectively.
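The weighted least squares estimator of equation (13) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `wls_slope`, the sample distances, and the noise-free coefficients 2 and 3 are assumptions chosen so that the result is easy to check by hand.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (distance x_i, attribute value y_i) of one atomic unit


def wls_slope(zones: Sequence[Sequence[Point]]) -> float:
    """Slope estimator of equation (13): WLS on zone averages with weights n_k."""
    n_k: List[int] = [len(zone) for zone in zones]
    x_k = [sum(x for x, _ in zone) / len(zone) for zone in zones]
    y_k = [sum(y for _, y in zone) / len(zone) for zone in zones]
    n = sum(n_k)
    # Weighted averages over the whole region S.
    x_bar = sum(w * x for w, x in zip(n_k, x_k)) / n
    y_bar = sum(w * y for w, y in zip(n_k, y_k)) / n
    numerator = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(n_k, x_k, y_k))
    denominator = sum(w * (x - x_bar) ** 2 for w, x in zip(n_k, x_k))
    return numerator / denominator


# Hypothetical noise-free data y = 2 + 3x, split into zones of unequal size.
xs = [0.1, 0.2, 0.4, 0.5, 0.7, 0.9]
data = [(x, 2.0 + 3.0 * x) for x in xs]
zones = [data[0:1], data[1:4], data[4:6]]
slope = wls_slope(zones)
```

Because the zone averages of a linear relation lie on the same line, `slope` recovers the disaggregated coefficient 3 here, which illustrates the remark that the aggregated model has the same coefficients as the disaggregated model.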
From this equation, the variance of $\hat\beta_1$ is given by

$$ \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{\sum_{k=1}^{m} n_k x_{(k)}^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2} = \frac{\sigma^2}{n} \cdot \frac{1}{\sum_{k=1}^{m} p_k x_{(k)}^2 - (AD)^2}, \tag{14} $$

where $p_k$ denotes the ratio of the number of atomic units in $S_k$ to the number of atomic units in $S$, that is, $n_k/n$, and $AD$ is the average distance between the predetermined point and all atomic units. Since the above equation contains $n_k$ and $x_{(k)}$, we notice that the variance of the estimator is influenced by the choice of subregions.

In order to compare the variance of the estimator for the slope coefficient in the aggregated model to that in the disaggregated model, let us next calculate the variance of the estimator in the disaggregated model. Using the ordinary least squares method, we obtain

$$ \mathrm{Var}(\hat\beta'_1) = \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2} = \frac{\sigma^2}{n} \cdot \frac{1}{AD_2 - (AD)^2}, \tag{15} $$

where $AD_2$ is the average of the squared distance from the predetermined point to all atomic data units. Since equation (14) can be rewritten as

$$ \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{n} \cdot \frac{1}{\frac{1}{n} \sum_{k=1}^{m} \frac{\left(\sum_{i \in I_k} x_i\right)^2}{n_k} - (AD)^2}, \tag{16} $$

we can examine which model has the smaller variance by comparing $\sum_{k=1}^{m} \left(\sum_{i \in I_k} x_i\right)^2 / n_k$ with $\sum_{i=1}^{n} x_i^2$. Using the Cauchy-Schwarz inequality, we notice that the following inequality holds for every $k$:

$$ \frac{\left(\sum_{i \in I_k} x_i\right)^2}{n_k} \leq \sum_{i \in I_k} x_i^2. \tag{17} $$

The equality is satisfied when $n_k = 1$ or all atomic units in $S_k$ have the same distance. Summing the above inequality over all $S_k$, we obtain

$$ \sum_{k=1}^{m} \frac{\left(\sum_{i \in I_k} x_i\right)^2}{n_k} \leq \sum_{i=1}^{n} x_i^2. \tag{18} $$

Therefore, the variance of the estimator for the slope coefficient in the disaggregated model is smaller than that in the aggregated model except in special cases in which each subregion $S_k$ has one atomic unit or all atomic units in $S_k$ have the same distance [this result corresponds with Cramer (1964)]. Thus the aggregated model is no better than the disaggregated model. In practice, however, if the difference between them is small, we may use the aggregated model.
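The comparison in equations (14) to (18) can be checked on a small example. The sketch below assumes $\sigma^2 = 1$; the function name `slope_variances` and the distances 1 to 6 are illustrative choices, not data from the paper.

```python
from typing import List, Sequence, Tuple


def slope_variances(zones: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (aggregated, disaggregated) slope-estimator variances for sigma^2 = 1.

    `zones` holds the distances x_i of the atomic units, grouped by subregion S_k.
    """
    xs: List[float] = [x for zone in zones for x in zone]
    n = len(xs)
    squared_total = sum(xs) ** 2 / n  # (sum of x_i)^2 / n, shared by (14) and (15)
    # Equation (15): denominator uses the sum of x_i^2 over all atomic units.
    disaggregated = 1.0 / (sum(x * x for x in xs) - squared_total)
    # Equation (16): denominator uses sum over zones of (zone total)^2 / n_k.
    aggregated = 1.0 / (sum(sum(zone) ** 2 / len(zone) for zone in zones) - squared_total)
    return aggregated, disaggregated


# Six hypothetical atomic units at distances 1..6, grouped into three zones of two.
agg, dis = slope_variances([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
ratio = agg / dis  # 17.5 / 16 = 1.09375

# Equality case of (17): every zone contains exactly one atomic unit.
agg_single, dis_single = slope_variances([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
```

In this example the aggregated variance exceeds the disaggregated one by about 9.4 percent, and the two coincide when each zone holds a single unit, as the equality condition of inequality (17) states.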
In the next section, we attempt to find the number of zones for which the variance of the estimator for the slope coefficient in the aggregated model is close to that in the disaggregated model. To compare the variances of the estimators, we use the relative efficiency defined by

$$ \mathrm{Re}(\hat\beta_1 / \hat\beta'_1) = \frac{\mathrm{Var}(\hat\beta_1)}{\mathrm{Var}(\hat\beta'_1)} = \frac{AD_2 - (AD)^2}{\sum_{k=1}^{m} p_k x_{(k)}^2 - (AD)^2}. \tag{19} $$

2. THE SCALE PROBLEM

In this section, we examine variations in the variance of the estimator for the slope coefficient in the aggregated model with respect to the number of zones and obtain the number of zones for which the variance is close to that in the disaggregated model.

2.1 Concentric Zones

When we analyze urban phenomena, we often use concentric zones. Historically, this zoning stems from Burgess's (1925) concentric zone theory, followed by factorial ecologists (Herbert and Johnston 1976). Clark (1951) also assumed concentric zones centered at the CBD when he examined the population density in a city. It is hence worth examining the variance when subregions are given by concentric zones.

Suppose that $S$ is represented by a circle. Then we can assume that the radius of the circle is 1 without loss of generality. The reason is that both the denominator and the numerator of equation (19) contain terms of the same power of the distance. Thus, although the relative distance value affects the relative efficiency, the absolute distance value does not. We also deal with variations in the variance with the number of zones, but the absolute distance does not matter to the trend of these variations.

Suppose next that $S_k$ is represented by a ring centered at the predetermined center whose inner boundary is given by the circle with radius $r_{k-1}$ and whose outer boundary is given by the circle with radius $r_k$ (note that $S_1$ is given by a disk with radius $r_1$) and that the width of a ring is the same for all $k$, that is, $r_k = k/m$.
Let $(x, \theta)$ be the polar coordinates of a point in $S$ where the origin is placed at the predetermined center, $O$, and let $g(x)$ be the distribution function of atomic units at $(x, \theta)$ (note that the distribution function is assumed to be isotropic from $O$). Then $p_k$, $x_{(k)}$, $AD$, and $AD_2$ are calculated as follows.

$$ p_k = \frac{\iint_{S_k} x g(x)\, d\theta\, dx}{\iint_{S} x g(x)\, d\theta\, dx} = \frac{\int_{r_{k-1}}^{r_k} x g(x)\, dx}{\int_0^1 x g(x)\, dx}, \tag{20} $$

$$ x_{(k)} = \frac{\iint_{S_k} x^2 g(x)\, d\theta\, dx}{\iint_{S_k} x g(x)\, d\theta\, dx} = \frac{\int_{r_{k-1}}^{r_k} x^2 g(x)\, dx}{\int_{r_{k-1}}^{r_k} x g(x)\, dx}, \tag{21} $$

$$ AD = \frac{\iint_{S} x^2 g(x)\, d\theta\, dx}{\iint_{S} x g(x)\, d\theta\, dx} = \frac{\int_0^1 x^2 g(x)\, dx}{\int_0^1 x g(x)\, dx}, \tag{22} $$

$$ AD_2 = \frac{\iint_{S} x^3 g(x)\, d\theta\, dx}{\iint_{S} x g(x)\, d\theta\, dx} = \frac{\int_0^1 x^3 g(x)\, dx}{\int_0^1 x g(x)\, dx}. \tag{23} $$

2.1.1 Uniform Distribution Function. First, we consider the case in which atomic units are uniformly distributed over $S$. Substituting $r_k = k/m$ into equations (20), (21), and (22), we obtain

$$ p_k = \frac{1}{m^2}(2k - 1), \tag{24} $$

$$ x_{(k)} = \frac{2(3k^2 - 3k + 1)}{3m(2k - 1)}, \tag{25} $$

$$ AD = \frac{2}{3}. \tag{26} $$

Then we obtain the variance of $\hat\beta_1$ as

$$ \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{n} \cdot \frac{72}{4 + h(m)}, \tag{27} $$

where

$$ h(m) = \frac{-6m^2 + \psi(m + 1/2) + \gamma + \log 4}{m^4}, \tag{28} $$

$\psi(x)$ is the digamma function, and $\gamma$ is Euler's constant. We next prove the following theorem.

THEOREM 1: Assume that a region $S$ consists of equal-width concentric zones and that atomic units are uniformly distributed over $S$. Then, if the disaggregated model is expressed by the ordinary regression model whose independent variable is a distance from the center, the variance of the estimator for the slope coefficient in the aggregated model decreases with an increase in the number of zones.

PROOF. From equation (27), it is obvious that if $h(m)$ increases as $m$ increases, the variance of $\hat\beta_1$ decreases as $m$ increases. Hence we wish to prove that $\Delta h = h(m + 1) - h(m)$ is positive for $m \geq 1$. Using the property of the digamma function given by

$$ \psi\left(m + \frac{3}{2}\right) = \psi\left(m + \frac{1}{2}\right) + \frac{1}{m + \frac{1}{2}}, \tag{29} $$

$\Delta h$ is written as

$$ \Delta h = 6\left\{\frac{1}{m^2} - \frac{1}{(m+1)^2}\right\} - \left\{\psi\left(m + \frac{1}{2}\right) + \gamma + \log 4\right\}\left\{\frac{1}{m^4} - \frac{1}{(m+1)^4}\right\} + \frac{2}{(m+1)^4 (2m+1)}. \tag{30} $$

From the definition of the digamma function, the relation

$$ \psi\left(m + \frac{1}{2}\right) + \gamma + \log 4 = 2 \sum_{k=1}^{m} \frac{1}{2k - 1} \leq 2m \tag{31} $$

holds for $m \geq 1$. Thus

$$ \Delta h \geq 6\left\{\frac{1}{m^2} - \frac{1}{(m+1)^2}\right\} - 2m\left\{\frac{1}{m^4} - \frac{1}{(m+1)^4}\right\} + \frac{2}{(m+1)^4 (2m+1)} > 0 \tag{32} $$

holds for $m \geq 1$. This completes the proof.

Having understood that the estimator variance decreases with an increase in the number of zones, and that the disaggregated model has a smaller variance than the aggregated model, we now wish to know the number of zones for which the variance of the estimator for the slope coefficient in the aggregated model is close to that in the disaggregated model. Under the hypotheses introduced in this section, the variance of the estimator for the slope coefficient in the disaggregated model is

$$ \mathrm{Var}(\hat\beta'_1) = \frac{18 \sigma^2}{n}. \tag{33} $$

Thus the relative efficiency is written as

$$ \mathrm{Re}(\hat\beta_1 / \hat\beta'_1) = \frac{4m^4}{4m^4 - 6m^2 + \psi(m + 1/2) + \gamma + \log 4}. \tag{34} $$

We show the numerical values of $\mathrm{Re}(\hat\beta_1/\hat\beta'_1)$ in Table 1. For three zones, for example, the relative efficiency is 1.1865. This means that the variance of the estimator for the slope coefficient in the aggregated model is larger than that in the disaggregated model by 18.65 percent. If we require the relative efficiency to be less than 1.05, we should prepare six zones or more. If we want to make the relative efficiency smaller, we need a larger number of zones. In order to make it less than 1.01, we need to have thirteen zones or more.

2.1.2 Negative Exponential Distribution Function. Up to now we have assumed that the atomic data units are distributed uniformly. This assumption, however, is sometimes unrealistic. Thus we next assume a more realistic distribution.
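The closed form (34) and the zone counts quoted above for the uniform case can be checked numerically. The sketch below is an illustration under stated assumptions: it replaces $\psi(m + 1/2) + \gamma + \log 4$ by the finite sum $2\sum_{k=1}^{m} 1/(2k-1)$ from equation (31) so that no special-function library is needed, and the helper names are hypothetical.

```python
def re_from_zones(m: int) -> float:
    """Relative efficiency (19) from the zone values (24)-(26), uniform density."""
    ad, ad2 = 2.0 / 3.0, 0.5  # AD from (26); AD_2 = 1/2 follows from (23) with g constant
    weighted = 0.0
    for k in range(1, m + 1):
        p_k = (2 * k - 1) / m**2                                  # equation (24)
        x_k = 2.0 * (3 * k * k - 3 * k + 1) / (3 * m * (2 * k - 1))  # equation (25)
        weighted += p_k * x_k**2
    return (ad2 - ad**2) / (weighted - ad**2)


def re_closed_form(m: int) -> float:
    """Relative efficiency (34), with the digamma term written as a finite sum (31)."""
    digamma_term = 2.0 * sum(1.0 / (2 * k - 1) for k in range(1, m + 1))
    return 4.0 * m**4 / (4.0 * m**4 - 6.0 * m**2 + digamma_term)


def min_zones(threshold: float, limit: int = 1000) -> int:
    """Smallest m >= 2 whose relative efficiency falls below `threshold`."""
    for m in range(2, limit + 1):
        if re_closed_form(m) < threshold:
            return m
    raise ValueError("threshold not reached within limit")


re_three = re_closed_form(3)
zones_for_5_percent = min_zones(1.05)
zones_for_1_percent = min_zones(1.01)
```

Both routes give the same relative efficiency for every $m$, which is a useful consistency check on equations (24) to (28). The search starts at $m = 2$ because a single zone leaves the slope unidentified (the denominator of equation (34) vanishes).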
TABLE 1
Variation in $\mathrm{Re}(\hat\beta_1/\hat\beta'_1)$ with the Number of Concentric Zones Where Atomic Units Are Distributed Uniformly

Number of zones    Re
 3                 1.1865
 4                 1.0995
 5                 1.0622
 6                 1.0427
 7                 1.0311
 8                 1.0237
 9                 1.0187
10                 1.0151
11                 1.0125
12                 1.0105
13                 1.0089
14                 1.0077
15                 1.0067
16                 1.0059
17                 1.0052
18                 1.0046
19                 1.0042
20                 1.0038

When atomic units refer to households, the spatial distribution is likely to follow Clark's (1951) law given by

$$ g(x) = a \exp(-bx), \quad a, b > 0, \quad (x, \theta) \in S. \tag{35} $$

From equation (19), since the absolute number of atomic units does not affect the relative efficiency, we can put $a = 1$ without loss of generality. Using equations (20) and (21), the values of $p_k$ and $x_{(k)}$ are written as

$$ p_k = \frac{\exp\left(-\frac{bk}{m}\right)\left(1 + \frac{bk}{m}\right) - \exp\left(-\frac{b(k-1)}{m}\right)\left(1 + \frac{b(k-1)}{m}\right)}{\exp(-b)(1 + b) - 1} \tag{36} $$

and

$$ x_{(k)} = \frac{\exp\left(-\frac{b(k-1)}{m}\right)\left\{\frac{(k-1)^2}{m^2} + \frac{2(k-1)}{bm} + \frac{2}{b^2}\right\} - \exp\left(-\frac{bk}{m}\right)\left\{\frac{k^2}{m^2} + \frac{2k}{bm} + \frac{2}{b^2}\right\}}{\exp\left(-\frac{b(k-1)}{m}\right)\left\{\frac{k-1}{m} + \frac{1}{b}\right\} - \exp\left(-\frac{bk}{m}\right)\left\{\frac{k}{m} + \frac{1}{b}\right\}}, \tag{37} $$

respectively. Since $\mathrm{Var}(\hat\beta_1)$ becomes complicated, we numerically investigate the values of $\mathrm{Var}(\hat\beta_1)$ with $b = 1, 2, 4, 6$. Note that $b = 1, 2, 4, 6$ implies that the densities at 1/2, which is half of the radius of the region, are about 61 percent, 37 percent, 14 percent, and 5 percent of the density at the center and that the densities at 1, which is the fringe of the region, are about 37 percent, 14 percent, 2 percent, and 0.2 percent of the density at the center, respectively. We think $b = 6$ is large enough to represent a significant decrease in the distribution of atomic data units that we encounter in actual application.

The values of $\mathrm{Re}(\hat\beta_1/\hat\beta'_1)$ are shown in Table 2. We notice from this table that the values of the relative efficiency decrease with an increase in $m$ for all values of $b$.
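The relative efficiency under Clark's law can be evaluated directly from equations (19) to (23). The sketch below is an assumption-laden illustration rather than the authors' procedure: it uses the elementary closed form of $\int_0^r x^n e^{-bx}\,dx$ for $n = 1, 2, 3$, and the function names `moment` and `relative_efficiency_exp` are hypothetical.

```python
import math


def moment(n: int, r: float, b: float) -> float:
    """Closed form of the integral of x^n * exp(-b x) over [0, r], for integer n >= 0."""
    tail = sum(
        math.factorial(n) / math.factorial(j) * r**j / b ** (n - j + 1)
        for j in range(n + 1)
    )
    return math.factorial(n) / b ** (n + 1) - math.exp(-b * r) * tail


def relative_efficiency_exp(m: int, b: float) -> float:
    """Relative efficiency (19) for m equal-width rings and g(x) = exp(-b x)."""
    total = moment(1, 1.0, b)            # denominator of (20), (22), (23)
    ad = moment(2, 1.0, b) / total       # equation (22)
    ad2 = moment(3, 1.0, b) / total      # equation (23)
    weighted = 0.0
    for k in range(1, m + 1):
        lo, hi = (k - 1) / m, k / m
        mass = moment(1, hi, b) - moment(1, lo, b)   # numerator of p_k, equation (20)
        first = moment(2, hi, b) - moment(2, lo, b)  # numerator of x_(k), equation (21)
        # p_k * x_(k)^2 = (mass / total) * (first / mass)^2
        weighted += first**2 / (mass * total)
    return (ad2 - ad**2) / (weighted - ad**2)


re_b1_m3 = relative_efficiency_exp(3, 1.0)
```

For $b = 1$ and three zones this gives a value of about 1.164, and the function can be looped over $m$ and $b$ to examine how the relative efficiency changes with the number of zones.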
TABLE 2
Variation in $\mathrm{Re}(\hat\beta_1/\hat\beta'_1)$ with the Number of Concentric Zones Where Atomic Units Are Distributed According to $g(x, \theta) = \exp(-bx)$ Where $b$ = 1.0, 2.0, 4.0, 6.0

Number of zones    b = 1.0    b = 2.0    b = 4.0    b = 6.0
 3                 1.1637     1.1526     1.1623     1.2176
 4                 1.0881     1.0828     1.0888     1.1178
 5                 1.0554     1.0523     1.0565     1.0749
 6                 1.0381     1.0361     1.0392     1.0521
 7                 1.0278     1.0264     1.0288     1.0384
 8                 1.0212     1.0202     1.0221     1.0295
 9                 1.0168     1.0159     1.0175     1.0234
10                 1.0136     1.0129     1.0142     1.0190
11                 1.0112     1.0107     1.0117     1.0157
12                 1.0094     1.0090     1.0099     1.0133
13                 1.0080     1.0076     1.0084     1.0113
14                 1.0069     1.0066     1.0073     1.0098
15                 1.0060     1.0057     1.0063     1.0085
16                 1.0053     1.0050     1.0056     1.0075
17                 1.0047     1.0045     1.0049     1.0067
18                 1.0042     1.0040     1.0044     1.0059
19                 1.0037     1.0036     1.0040     1.0053
20                 1.0034     1.0032     1.0036     1.0048

Since the variance of $\hat\beta'_1$ is constant, we also notice that the variance of $\hat\beta_1$ decreases with an increase in $m$. Table 2 also shows that for $b = 1, 2, 4$, if the number of zones is the same, the values of the relative efficiency are smaller than those for the uniform function. This means that in the case of the negative exponential function whose curvature is not so large, we obtain a variance closer to that of the disaggregated model than in the case of the uniform function. For $b = 6$ (the density declines quite rapidly), however, the relative efficiency is larger than that for the uniform function. Thus, we need a larger number of zones if we want to make the variance as close to that of the disaggregated model as in the uniform case. For example, in order to make the relative efficiency less than 1.01, we have to prepare fourteen zones or more.

2.2 Square Grid Zones

We next assume that a region is square shaped and that subregions are given by square grid zones (see Figure 1). As we stated in the case of the concentric zones, since the absolute distance value does not affect the following discussion, we assume that half of the side length of the region is 1 without loss of generality.
2.2.1 Uniform Distribution Function. First, suppose that atomic data units are distributed uniformly. In this case, we obtain

$$ p_k = \frac{1}{m}. \tag{38} $$

The value of $x_{(k)}$ is given by

$$ x_{(k)} = \frac{L(k)}{3\sqrt{m}}, \tag{39} $$

where the detailed explanation of $L(k)$ can be found in Appendix 1.

[FIG. 1. Square Grid Zones]

Substitution of these equations into equation (14) gives

$$ \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{n} \cdot \frac{9}{\frac{1}{m^2} \sum_{k=1}^{m} L(k)^2 - \left\{\sqrt{2} + \log(\sqrt{2} + 1)\right\}^2}. \tag{40} $$

Using equation (15), we also obtain the variance of the estimator for the slope coefficient in the disaggregated model as

$$ \mathrm{Var}(\hat\beta'_1) = \frac{\sigma^2}{n} \cdot \frac{9}{4 - 2\sqrt{2} \log(\sqrt{2} + 1) - \left\{\log(\sqrt{2} + 1)\right\}^2}. \tag{41} $$

The values of $\mathrm{Re}(\hat\beta_1/\hat\beta'_1)$ are shown in Table 3. We notice from this table that $\mathrm{Var}(\hat\beta_1)$ decreases with an increase in $m$. If we have only sixteen zones, the variance of the estimator for the slope coefficient in the aggregated model is larger than that in the disaggregated model by 33.81 percent. If we require a smaller variance, a larger number of zones is needed. For example, 484 zones lead to a relative efficiency of less than 1.01.

Now, let us compare the results of Table 3 with those of Table 1. We first reconsider the square grid zones, taking Figure 1 as an example. There are thirty-six zones in Figure 1. However, there are some zones that have the same distance variable and the same number of atomic units. For example, there are seven other zones that have the same distance variable and the same number of atomic units as the $S_k$ shown in Figure 1 (the diagonally shaded zones). When we use the weighted least squares method, if different zones have the same distance and the same number of atomic units, these zones correspond to one value of the independent variable. Considering this matter, we notice that Figure 1 has thirty-six zones, but that it corresponds to only six distinct values of the independent variable. Thus let us compare the result of the six zones in Table 1 with that of thirty-six zones in Table 3.
We can find that the relative efficiency in Table 3 is larger than that in Table 1. This finding is true for other numbers of zones. This means that the square grid zones are not better than the equal-width concentric zones in the case of the same number of independent variables when we estimate the slope coefficient in the aggregated model whose independent variable is the distance from the center.

TABLE 3
Variation in $\mathrm{Re}(\hat\beta_1/\hat\beta'_1)$ with the Number of Square Grid Zones Where Atomic Units Are Distributed Uniformly

Number of zones    Re
  16               1.3381
  36               1.1274
  64               1.0681
 100               1.0426
 144               1.0293
 196               1.0213
 256               1.0163
 324               1.0128
 400               1.0104
 484               1.0085
 576               1.0072
 676               1.0061
 784               1.0053
 900               1.0046
1024               1.0040
1156               1.0036
1296               1.0032
1444               1.0029
1600               1.0026

Let us next consider why the square grid zones are not better than the equal-width concentric zones. We can transform equation (14) as

$$ \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{n} \cdot \frac{1}{\sum_{h < k} p_k p_h (x_{(k)} - x_{(h)})^2}. \tag{42} $$

Then we notice that if we want to lessen the variance, we need to enlarge the differences between the distance variables, weighted by the shares of atomic units contained in the zones. In the case of the square grid zones, zones are not divided on the basis of the distances from the center. Thus, even if the $i$th atomic unit has a larger distance than the $j$th atomic unit, the $i$th unit is sometimes assigned to the $S_k$ that has a smaller average distance than the $S_h$ to which the $j$th unit is assigned (we call this the assignment loss in this paper). Due to this assignment loss, the difference between $x_{(k)}$ and $x_{(h)}$ is smaller than the difference in a case where we divide the area containing $S_k$ and $S_h$ so as not to cause the assignment loss. On the other hand, since the concentric zones are made on the basis of the distances from the center, the assignment loss does not occur.
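The comparison between square grid zones and equal-width concentric zones in the uniform case can be reproduced with a short computation. The sketch below is an illustration under assumptions: instead of $L(k)$ from Appendix 1, it uses the standard closed form for the integral of the distance $\sqrt{x^2 + y^2}$ over a rectangle with one corner at the origin, exploits the fourfold symmetry of the square, and requires an even number of cells per side. All function names are hypothetical.

```python
import math


def corner_integral(a: float, b: float) -> float:
    """Integral of sqrt(x^2 + y^2) over the rectangle [0, a] x [0, b]."""
    if a == 0.0 or b == 0.0:
        return 0.0
    d = math.hypot(a, b)
    return a * b * d / 3.0 + (a**3 * math.log((b + d) / a) + b**3 * math.log((a + d) / b)) / 6.0


def grid_relative_efficiency(cells_per_side: int) -> float:
    """Relative efficiency (19) for a square of half-side 1 cut into s x s equal cells."""
    if cells_per_side % 2 != 0:
        raise ValueError("this sketch assumes an even number of cells per side")
    half = cells_per_side // 2
    h = 2.0 / cells_per_side
    ad = corner_integral(1.0, 1.0)  # the quadrant [0, 1]^2 has area 1, so this is AD
    ad2 = 2.0 / 3.0                 # mean of x^2 + y^2 over [0, 1]^2
    weighted = 0.0
    for i in range(half):
        for j in range(half):
            x0, x1, y0, y1 = i * h, (i + 1) * h, j * h, (j + 1) * h
            # Inclusion-exclusion gives the integral of the distance over one cell.
            cell = (corner_integral(x1, y1) - corner_integral(x0, y1)
                    - corner_integral(x1, y0) + corner_integral(x0, y0))
            mean_distance = cell / (h * h)
            weighted += (h * h) * mean_distance**2  # cell share within the quadrant
    return (ad2 - ad**2) / (weighted - ad**2)


def concentric_relative_efficiency(m: int) -> float:
    """Equation (34), with the digamma term written as a finite sum."""
    digamma_term = 2.0 * sum(1.0 / (2 * k - 1) for k in range(1, m + 1))
    return 4.0 * m**4 / (4.0 * m**4 - 6.0 * m**2 + digamma_term)


re_grid_16 = grid_relative_efficiency(4)
re_grid_36 = grid_relative_efficiency(6)
re_rings_6 = concentric_relative_efficiency(6)
```

With thirty-six grid cells the relative efficiency stays well above that of six concentric rings, which is the gap the text attributes to the assignment loss.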
We, of course, notice that since the values of $p_k$ in the square grid zones are different from those of the equal-width concentric zones, the influence of the weights is not clear. However, Table 1 and Table 3 imply that, due to the assignment loss, we need a larger number of independent variables in the case of the square grid zones than in the equal-width concentric zones to obtain the same relative efficiency.

2.2.2 Negative Exponential Distribution Function. Let us next assume that the distribution of the atomic units is represented by Clark's law, that is, the negative exponential function. Using the method shown in Okabe and Tagashira (1996), we can calculate the values of $p_k$ and $x_{(k)}$. Then we can obtain the relative efficiency of the estimator for the slope coefficient in the aggregated model to that in the disaggregated model in the case of $b = 1, 2, 4$. Note that $b = 1, 2, 4$ implies that the densities at $\sqrt{2}/2$, which is half of the distance from the center to the farthest point of this region, are about 49 percent, 24 percent, and 6 percent of the density at the center and that the densities at $\sqrt{2}$, which is the farthest point of this region, are about 24 percent, 6 percent, and 0.3 percent of the density at the center, respectively. We think that $b = 4$ is large enough to represent a significant decrease in the distribution of atomic data units in the case of the square grid zones.

Table 4 shows that the variance of the estimator for the slope coefficient in the aggregated model decreases with an increase in the number of zones. Table 4 also shows that for $b = 1, 2$, if the number of zones is the same, the values of the relative efficiency are smaller than those for the uniform function. For $b = 4$, however, the relative efficiency is larger than that for the uniform density. If we want to make the relative efficiency less than 1.01, for example, we have to prepare 576 zones.
Comparing the results of Table 4 with those of Table 2, we can find that the relative efficiency in the square grid zones is larger than that in the concentric zones for the corresponding number of zones. The reason would be similar to that in the case of the uniform function.

TABLE 4
Variation in $\mathrm{Re}(\hat\beta_1/\hat\beta'_1)$ with the Number of Square Grid Zones Where Atomic Units Are Distributed According to $g(x, \theta) = \exp(-bx)$ Where $b$ = 1.0, 2.0, 4.0

Number of zones    b = 1.0    b = 2.0    b = 4.0
  16               1.3178     1.3224     1.4231
  36               1.1198     1.1208     1.1535
  64               1.0641     1.0645     1.0809
 100               1.0401     1.0404     1.0503
 144               1.0275     1.0277     1.0344
 196               1.0201     1.0202     1.0251
 256               1.0153     1.0154     1.0191
 324               1.0121     1.0122     1.0150
 400               1.0098     1.0098     1.0121
 484               1.0081     1.0081     1.0100
 576               1.0068     1.0068     1.0084
 676               1.0058     1.0058     1.0071
 784               1.0050     1.0050     1.0061
 900               1.0043     1.0043     1.0053
1024               1.0038     1.0038     1.0047
1156               1.0034     1.0034     1.0041
1296               1.0030     1.0030     1.0037
1444               1.0027     1.0027     1.0033
1600               1.0024     1.0024     1.0030

3. THE AGGREGATION PROBLEM

In this section, we investigate the effects of different zoning systems for a fixed number of zones. We first calculate the zoning system that has the minimum variance, which is referred to as the min-var zoning system in this paper. This zoning is obtained by solving the following minimization problem:

$$ \min_{p_k} \ \mathrm{Var}(\hat\beta_1) \tag{43} $$

subject to

$$ \sum_{k=1}^{m} p_k = 1. \tag{44} $$

For comparative purposes, we compute the zoning system that has the maximum variance, which is referred to as the max-var zoning system. This zoning system is obtained by maximizing $\mathrm{Var}(\hat\beta_1)$ under the constraint (44). In addition, in the case of the concentric zones, we also compare the equal-width concentric zones with the min-var and the max-var systems. In the case of the grid zones, we compare the square grid zones with them.

3.1 Concentric Zones

We deal with the uniform and the negative exponential functions as the distribution functions of the atomic data units.
Then we can write the values of $p_k$ and $x_{(k)}$ in terms of the outer radius of $S_k$, that is, $r_k$. In the case of the uniform distribution,

$$ p_k = r_k^2 - r_{k-1}^2 \tag{45} $$

and

$$ x_{(k)} = \frac{2}{3} \cdot \frac{r_k^2 + r_k r_{k-1} + r_{k-1}^2}{r_k + r_{k-1}}. \tag{46} $$

For the negative exponential distribution,

$$ p_k = \frac{\exp(-b r_k)(1 + b r_k) - \exp(-b r_{k-1})(1 + b r_{k-1})}{\exp(-b)(1 + b) - 1} \tag{47} $$

and

$$ x_{(k)} = \frac{\exp(-b r_{k-1})\left(r_{k-1}^2 + 2 r_{k-1}/b + 2/b^2\right) - \exp(-b r_k)\left(r_k^2 + 2 r_k/b + 2/b^2\right)}{\exp(-b r_{k-1})\left(r_{k-1} + 1/b\right) - \exp(-b r_k)\left(r_k + 1/b\right)}. \tag{48} $$

Hence, we use $r_k$ as a variable of the minimization problem instead of $p_k$. When we solve the minimization problem, we first calculate the values of $\mathrm{Var}(\hat\beta_1)$ by changing $r_k$ ($k < m$) at intervals of 0.01 under the constraint $r_{k-1} < r_k$. For example, for three zones, the values of $r_k$ change such that $r_1 = 0.01, 0.02, \ldots, 0.98$; $r_2 = r_1 + 0.01, r_1 + 0.02, \ldots, 0.99$; and $r_3 = 1.00$. Comparing all these variances, we find the combination of $r_k$ that leads to the minimum variance. Similarly, we employ $r_k$ as a variable of the maximization problem and detect the maximum variance. However, in the case of the maximization problem, the widths of some zones, that is, the values of $r_k - r_{k-1}$, in the solution become 0.01. Since we think this is sometimes too small in actual situations, we also obtain the solution at intervals of 0.1.

We calculate the min-var and the max-var zoning systems for $m = 3, 4, 5$ and for $b = 0$ (uniform), 1, 2, 4, 6, and show the min-var systems in Table 5. Before we focus on the results of Table 5, let us reconsider the min-var systems. As we mentioned before, if we want to lessen the variance, we need to enlarge the difference between the distance variables with weights given by the number of atomic units contained in the zones. If the difference were not weighted by the number of atomic units, the min-var system for $m = 3$, for example, would give $r_1 = 0.01$ and $r_2 = 0.99$, and the numbers of atomic units contained in $S_1$ and $S_3$ would be very small. However, since the difference is weighted by the number of atomic units, the numbers of atomic units contained in the zones are not so small.
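The exhaustive search described above can be sketched for three zones and the uniform distribution, using equations (45) and (46). This is a minimal illustration with hypothetical function names; it minimizes the relative efficiency (19), which is equivalent to minimizing $\mathrm{Var}(\hat\beta_1)$ because the disaggregated variance is constant.

```python
from typing import Optional, Sequence, Tuple


def relative_efficiency_uniform(radii: Sequence[float]) -> float:
    """Relative efficiency (19) for concentric zones with outer radii `radii`, uniform density."""
    ad, ad2 = 2.0 / 3.0, 0.5
    weighted, inner = 0.0, 0.0
    for outer in radii:
        p_k = outer**2 - inner**2                                                 # equation (45)
        x_k = 2.0 * (outer**2 + outer * inner + inner**2) / (3.0 * (outer + inner))  # equation (46)
        weighted += p_k * x_k**2
        inner = outer
    return (ad2 - ad**2) / (weighted - ad**2)


def min_var_three_zones(steps: int = 100) -> Tuple[float, Tuple[float, float, float]]:
    """Grid search over r1 < r2 < r3 = 1 at intervals of 1/steps."""
    best: Optional[Tuple[float, Tuple[float, float, float]]] = None
    for i in range(1, steps - 1):          # r1 = 0.01, ..., 0.98 when steps = 100
        for j in range(i + 1, steps):      # r2 = r1 + 0.01, ..., 0.99
            radii = (i / steps, j / steps, 1.0)
            value = relative_efficiency_uniform(radii)
            if best is None or value < best[0]:
                best = (value, radii)
    assert best is not None
    return best


best_re, best_radii = min_var_three_zones()
equal_width_re = relative_efficiency_uniform((1.0 / 3.0, 2.0 / 3.0, 1.0))
```

The radii found this way can be compared with the $m = 3$, $b = 0$ row of Table 5, and `best_re` can be compared with the equal-width value in Table 1. Replacing the minimum by a maximum in the loop gives the corresponding max-var search.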
For example, for $m = 3$, $b = 0$, the values of $r_k$ are $r_1 = 0.46$ and $r_2 = 0.75$. These values of $r_k$ decrease with an increase in $b$. The reason is that, since more atomic units are located near the center as $b$ increases, we need to make the values of $r_k$ ($k < m$) smaller in order to have each zone contain a certain number of atomic units. The same trend holds for the other numbers of zones.

TABLE 5
Min-var Zoning Systems in Concentric Zones Where Atomic Units Are Distributed According to g(x, θ) ∝ exp(−bx), Where b = 0.0, 1.0, 2.0, 4.0, 6.0

| Number of zones | b | r1 | r2 | r3 | r4 | r5 |
|---|---|---|---|---|---|---|
| 3 | 0.0 | 0.46 | 0.75 | 1.00 | | |
| 3 | 1.0 | 0.43 | 0.72 | 1.00 | | |
| 3 | 2.0 | 0.39 | 0.69 | 1.00 | | |
| 3 | 4.0 | 0.33 | 0.62 | 1.00 | | |
| 3 | 6.0 | 0.27 | 0.54 | 1.00 | | |
| 4 | 0.0 | 0.38 | 0.61 | 0.81 | 1.00 | |
| 4 | 1.0 | 0.34 | 0.57 | 0.79 | 1.00 | |
| 4 | 2.0 | 0.31 | 0.53 | 0.76 | 1.00 | |
| 4 | 4.0 | 0.25 | 0.46 | 0.70 | 1.00 | |
| 4 | 6.0 | 0.21 | 0.39 | 0.62 | 1.00 | |
| 5 | 0.0 | 0.32 | 0.52 | 0.69 | 0.85 | 1.00 |
| 5 | 1.0 | 0.29 | 0.48 | 0.66 | 0.83 | 1.00 |
| 5 | 2.0 | 0.26 | 0.45 | 0.63 | 0.81 | 1.00 |
| 5 | 4.0 | 0.21 | 0.37 | 0.55 | 0.75 | 1.00 |
| 5 | 6.0 | 0.17 | 0.31 | 0.47 | 0.68 | 1.00 |

As for the max-var zoning systems, each zone width becomes the interval value except for one zone. For example, for $m = 5$, $b = 0$, the values of $r_k$ are $r_1 = 0.01$, $r_2 = 0.02$, $r_3 = 0.03$, and $r_4 = 0.04$ under the intervals of 0.01. This is because, in contrast to the min-var systems, the max-var systems decrease the difference between the distance variables.

We next show the relative efficiency of the estimator for the slope coefficient in the aggregated model to that in the disaggregated model in Table 6. Let us first look at the relative efficiency of the max-var zoning system. Under the interval of 0.01, the values of the relative efficiency of the max-var systems are much larger than those of the min-var systems. Although the values of the relative efficiency under the intervals of 0.1 are also larger than those of the min-var systems, they are much lower than those under the intervals of 0.01.
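The dependence of the max-var solution on the search interval can be illustrated with a short Python sketch for the uniform case. It rests on the same assumption as before, namely that the relative efficiency is the total variance of the distance divided by the weighted between-zone variance of the zone means; the function names are hypothetical and the code is not the authors' program.

```python
from itertools import combinations


def relative_efficiency_uniform(radii):
    """Re for concentric zones on the unit disk under uniform density.

    Assumption: Re = Var(x) / sum_k p_k * (x(k) - mean)**2, with
    mean = 2/3 and Var(x) = 1/18 for the distance on the unit disk.
    """
    mean, variance = 2.0 / 3.0, 1.0 / 18.0
    between, lo = 0.0, 0.0
    for hi in radii:
        p_k = hi**2 - lo**2                                        # equation (45)
        x_k = 2.0 * (hi**2 + hi * lo + lo**2) / (3.0 * (hi + lo))  # equation (46)
        between += p_k * (x_k - mean) ** 2
        lo = hi
    return variance / between


def max_var_zoning(m, steps):
    """Search radii on the grid {1/steps, ..., (steps - 1)/steps}."""
    best_re, best_radii = 0.0, None
    for idx in combinations(range(1, steps), m - 1):
        radii = [i / steps for i in idx] + [1.0]
        re = relative_efficiency_uniform(radii)
        if re > best_re:
            best_re, best_radii = re, radii
    return best_re, best_radii


for steps in (100, 10):  # intervals of 0.01 and 0.1
    re, radii = max_var_zoning(3, steps)
    print(f"interval {1 / steps:g}: radii {radii}, Re = {re:.4f}")
```

For three zones the search collapses the two inner zones to the smallest allowed width in both cases, so the result is driven almost entirely by the chosen interval. This is why the max-var values are best read as an upper bound for a given minimum zone width rather than as a property of the region.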
In addition, Table 6 shows that the values of the relative efficiency under the intervals of 0.01 in the negative exponential function are smaller than those in the uniform function. Thus we have to be more careful about the aggregation problem in the case of the uniform function than in the case of the exponential function. However, the relative efficiency under the intervals of 0.1 decreases with an increase in $b$ until $b = 2$ or 4 and then turns to an increase. The value under the intervals of 0.01 might also increase if we examined a larger value of $b$. Hence, note that if we deal with a negative exponential function that decreases steeply, the relative efficiency may become large.

Let us next move on to the relative efficiency in the case of the equal-width concentric zones. The values of the relative efficiency are much smaller than those of the max-var systems. In particular, for $b = 2$ and 4, the values are close to those of the min-var systems. The reason is that for $b = 2$ and 4, the min-var zoning systems are very similar to the equal-width zones (see Table 5).

3.2 Rectangular Grid Zones

Next suppose that a region is square shaped and is divided into $m$ rectangles (see Figure 2).
To make calculation simpler, we assume that zones are determined by $r_1, r_2, \ldots, r_{\sqrt{m}/2}$ as shown in Figure 2. Then, since the values of $p_k$ and $x(k)$ can be written in terms of $r_k$, we can solve the minimization and the maximization problems in the same way as in the case of the concentric zones.

TABLE 6
Re(β̂1/β̂1′) for the Min-var Zoning, the Max-var Zoning, and Equal-Width Zones in Concentric Zones Where Atomic Units Are Distributed According to g(x, θ) ∝ exp(−bx), Where b = 0.0, 1.0, 2.0, 4.0, 6.0

| Number of zones | b | Min-var zoning | Max-var zoning (interval 0.01) | Max-var zoning (interval 0.1) | Equal-width zones |
|---|---|---|---|---|---|
| 3 | 0.0 | 1.1480 | 325.24 | 4.6642 | 1.1865 |
| 3 | 1.0 | 1.1445 | 233.76 | 3.8017 | 1.1637 |
| 3 | 2.0 | 1.1446 | 174.22 | 3.2055 | 1.1526 |
| 3 | 4.0 | 1.1561 | 104.94 | 3.0478 | 1.1623 |
| 3 | 6.0 | 1.1785 | 71.505 | 4.1526 | 1.2176 |
| 4 | 0.0 | 1.0800 | 147.47 | 2.5347 | 1.0995 |
| 4 | 1.0 | 1.0780 | 106.85 | 2.1806 | 1.0881 |
| 4 | 2.0 | 1.0780 | 80.319 | 1.9301 | 1.0828 |
| 4 | 4.0 | 1.0842 | 49.268 | 2.0315 | 1.0888 |
| 4 | 6.0 | 1.0967 | 41.348 | 2.5718 | 1.1178 |
| 5 | 0.0 | 1.0504 | 84.619 | 1.7487 | 1.0622 |
| 5 | 1.0 | 1.0492 | 61.807 | 1.5736 | 1.0554 |
| 5 | 2.0 | 1.0492 | 46.844 | 1.4629 | 1.0523 |
| 5 | 4.0 | 1.0530 | 29.237 | 1.5571 | 1.0565 |
| 5 | 6.0 | 1.0610 | 28.895 | 1.8353 | 1.0749 |

Table 7 shows the min-var systems for $m = 16, 36, 64, 100$ and $b = 0$ (uniform density), 1, 2, 4. The values of $r_k$ (other than the outermost one, which is fixed at 1.00) decrease with an increase in $b$. In addition, the min-var zoning systems are close to the square grid zones for $b = 1$ and 2. As for the max-var zoning system, each zone width becomes the interval value except for one zone. These findings are similar to those in the case of the concentric zones.

The values of the relative efficiency are shown in Table 8. Looking at the max-var systems, we notice that the values vary according to the interval values and that under the intervals of 0.01 the values of the relative efficiency are much larger than those of the min-var systems. We also notice that the relative efficiency of the max-var systems decreases with an increase in $b$ until $b = 1$ or 2 and then turns to an increase.
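For the uniform case, the grid-zone calculation can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the region is taken to be the square $[-1, 1]^2$ with the predetermined point at its center, the relative efficiency is again taken to be the total variance of the distance divided by the weighted between-zone variance of the zone means, and the integral of the distance over a rectangle uses a standard closed-form antiderivative instead of the paper's $L(k)$. The function names are hypothetical.

```python
import math


def corner_integral(a, b):
    """Integral of sqrt(x**2 + y**2) over the rectangle [0, a] x [0, b]."""
    if a == 0.0 or b == 0.0:
        return 0.0
    d = math.hypot(a, b)
    return (2.0 * a * b * d
            + a**3 * math.log((b + d) / a)
            + b**3 * math.log((a + d) / b)) / 6.0


def relative_efficiency_grid(cuts):
    """Re for a rectangular grid on [-1, 1]^2 under uniform density.

    `cuts` holds the breakpoints r_1 < ... < 1.0 used on both axes of the
    upper right quarter; the other quarters follow by symmetry, so working
    on one quarter (unit area) leaves the ratio unchanged.
    """
    edges = [0.0] + list(cuts)
    mean = corner_integral(1.0, 1.0)   # mean distance over the quarter
    variance = 2.0 / 3.0 - mean**2     # E[x^2 + y^2] = 2/3 on the unit square
    between = 0.0
    for x0, x1 in zip(edges, edges[1:]):
        for y0, y1 in zip(edges, edges[1:]):
            area = (x1 - x0) * (y1 - y0)
            integral = (corner_integral(x1, y1) - corner_integral(x0, y1)
                        - corner_integral(x1, y0) + corner_integral(x0, y0))
            between += area * (integral / area - mean) ** 2
    return variance / between


def min_var_grid_16(steps=100):
    """Search r_1 for m = 16, where each half-axis is cut at r_1 and 1.0."""
    return min((relative_efficiency_grid([i / steps, 1.0]), i / steps)
               for i in range(1, steps))
```

With these assumptions, `relative_efficiency_grid([0.5, 1.0])` gives about 1.3381, the square grid value for sixteen zones and $b = 0$ in Table 8, and `min_var_grid_16()` returns a minimum close to 1.3159 near $r_1 = 0.56$, in line with Tables 7 and 8. Extending the sketch to the negative exponential density would require numerical integration over each rectangle, which is left out here.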
Thus we have to deal with the aggregation problem carefully in the case of the uniform function and in the case of an exponential function that decreases steeply. Focusing on the square grid zones, we notice that the relative efficiency is much smaller than that of the max-var systems. In particular, $b = 1$ and 2 lead to values that are fairly close to those of the min-var systems. The reason is that the min-var zoning systems are very similar to the square grid zones for $b = 1$ and 2 (see Table 7).

FIG. 2. Rectangular Grid Zones

TABLE 7
Min-var Zoning Systems in Rectangular Grid Zones Where Atomic Units Are Distributed According to g(x, θ) ∝ exp(−bx), Where b = 0.0, 1.0, 2.0, 4.0

| Number of zones | b | r1 | r2 | r3 | r4 | r5 |
|---|---|---|---|---|---|---|
| 16 | 0.0 | 0.56 | 1.00 | | | |
| 16 | 1.0 | 0.52 | 1.00 | | | |
| 16 | 2.0 | 0.49 | 1.00 | | | |
| 16 | 4.0 | 0.40 | 1.00 | | | |
| 36 | 0.0 | 0.40 | 0.71 | 1.00 | | |
| 36 | 1.0 | 0.37 | 0.68 | 1.00 | | |
| 36 | 2.0 | 0.34 | 0.65 | 1.00 | | |
| 36 | 4.0 | 0.27 | 0.57 | 1.00 | | |
| 64 | 0.0 | 0.33 | 0.57 | 0.79 | 1.00 | |
| 64 | 1.0 | 0.29 | 0.53 | 0.76 | 1.00 | |
| 64 | 2.0 | 0.26 | 0.49 | 0.73 | 1.00 | |
| 64 | 4.0 | 0.21 | 0.41 | 0.66 | 1.00 | |
| 100 | 0.0 | 0.27 | 0.47 | 0.65 | 0.83 | 1.00 |
| 100 | 1.0 | 0.24 | 0.43 | 0.62 | 0.81 | 1.00 |
| 100 | 2.0 | 0.22 | 0.40 | 0.58 | 0.78 | 1.00 |
| 100 | 4.0 | 0.17 | 0.33 | 0.51 | 0.72 | 1.00 |

TABLE 8
Re(β̂1/β̂1′) for the Min-var Zoning, the Max-var Zoning, and Square Grid Zones in Rectangular Grid Zones Where Atomic Units Are Distributed According to g(x, θ) ∝ exp(−bx), Where b = 0.0, 1.0, 2.0, 4.0

| Number of zones | b | Min-var zoning | Max-var zoning (interval 0.01) | Max-var zoning (interval 0.1) | Square grid zones |
|---|---|---|---|---|---|
| 16 | 0.0 | 1.3159 | 56.943 | 5.3220 | 1.3381 |
| 16 | 1.0 | 1.3145 | 47.129 | 4.5333 | 1.3178 |
| 16 | 2.0 | 1.3213 | 39.654 | 4.1041 | 1.3224 |
| 16 | 4.0 | 1.3578 | 64.105 | 6.2631 | 1.4231 |
| 36 | 0.0 | 1.1176 | 28.146 | 2.6247 | 1.1274 |
| 36 | 1.0 | 1.1171 | 23.347 | 2.3179 | 1.1198 |
| 36 | 2.0 | 1.1198 | 19.782 | 2.2238 | 1.1208 |
| 36 | 4.0 | 1.1346 | 31.978 | 3.0781 | 1.1535 |
| 64 | 0.0 | 1.0625 | 18.572 | 1.7946 | 1.0681 |
| 64 | 1.0 | 1.0622 | 15.446 | 1.6374 | 1.0641 |
| 64 | 2.0 | 1.0637 | 13.229 | 1.6261 | 1.0645 |
| 64 | 4.0 | 1.0718 | 21.254 | 2.0490 | 1.0809 |
| 100 | 0.0 | 1.0389 | 13.801 | 1.4180 | 1.0426 |
| 100 | 1.0 | 1.0388 | 11.512 | 1.3320 | 1.0401 |
| 100 | 2.0 | 1.0397 | 9.9839 | 1.3363 | 1.0404 |
| 100 | 4.0 | 1.0448 | 15.891 | 1.5623 | 1.0503 |

Now let us compare the values of the relative efficiency of the min-var and the max-var systems in Table 8 with those in Table 6. We have already explained that thirty-six zones in the grid zones correspond to six independent variables in terms of the weighted least squares method. In the same way, sixteen zones correspond to three variables. We therefore compare the results of sixteen grid zones with those of three concentric zones. We have also mentioned that the square grid zones are not better than the equal-width concentric zones in the case of the same number of independent variables. Similarly, the relative efficiency of the min-var systems in the grid zones is larger than that in the concentric zones. This finding is also true of the max-var systems under the intervals of 0.1. However, the values of the max-var systems under the intervals of 0.01 in the grid zones are smaller than those in the concentric zones. The reason is that although sixteen grid zones correspond to three independent variables, sixteen zones have only one variable, $r_1$, for the maximization problem. Since the degree of freedom is too small for the maximization, we cannot make the difference between the distance variables extremely small. On the other hand, since the concentric zones have the same number of maximization variables as independent variables, we can make the difference between the distance variables significantly smaller.

4. CONCLUSIONS

The problem considered in this paper is the modifiable areal unit problem in the regression model where the dependent variable is the attribute value of the atomic data unit and the independent variable is the distance from the center of the region to the atomic data unit.
We applied this model to spatially aggregated data and examined the effects of the zones used for aggregating data on the variance of the estimator for the slope coefficient in the aggregated model. In this last section, we discuss the implications of the results we have obtained.

If we could use individual-level data, it would be better to estimate the coefficient in the disaggregated model. However, individual-level data are usually unavailable. Thus we need to estimate the coefficient using the aggregated model. But, as we mentioned in section 1, the variance of the estimator for the slope coefficient in the aggregated model is larger than that in the disaggregated model. Hence we calculated the relative efficiency of the estimator for the slope coefficient in the aggregated model to that in the disaggregated model with respect to the number of zones.

We first dealt with the equal-width concentric zones. When atomic data units are distributed uniformly, the relative efficiency decreases with an increase in the number of zones (Theorem 1). For example, six or more zones bring the relative efficiency below 1.05 (Table 1). This means that if we prepare six or more zones, the variance of the estimator for the slope coefficient in the aggregated model is larger than that in the disaggregated model by less than 5 percent. If we want to make the relative efficiency closer to 1.0, thirteen or more zones achieve a relative efficiency of less than 1.01.

When the atomic units are distributed following Clark’s law, or the negative exponential function, the relative efficiency varies according to the degree of concavity of the function. If the concavity is not large, the relative efficiency in the negative exponential function is smaller than that in the uniform function. However, when the concavity is very large, the relative efficiency is larger than that in the uniform density. These values of the relative efficiency are shown in Table 2.

In the case of the square grid zones with the uniform function, we need at least 484 zones in order to achieve a relative efficiency of less than 1.01 (Table 3). In the case of the negative exponential function with small concavity, the relative efficiency is smaller than that in the uniform function. But when the concavity is very large, we need more zones in order to obtain the same efficiency as in the uniform function. These values of the relative efficiency are shown in Table 4.

In section 3, we obtained the zoning systems that achieve the minimum and the maximum estimator variance. The maximum estimator variance is much larger than the minimum variance, especially in the uniform distribution or the exponential distribution with large concavity. However, the variance in the equal-width concentric zones or the square grid zones is much closer to the minimum variance than to the maximum variance. In practical situations, it is difficult to obtain the zoning system that yields the minimum variance. Thus using the equal-width concentric zones or the square grid zones is a practical alternative. Furthermore, as the number of zones increases, the variance of the estimator for the slope coefficient in the aggregated model approaches that in the disaggregated model, and we can use Tables 1–4 to relate the relative efficiency to the number of zones.

APPENDIX 1

To simplify the explanation of $L(k)$, we first assume that a subregion $S_k$ is located in the upper right quarter of the region, as in Figure 1. If $S_k$ is located in the $(X_k + 1)$th column and in the $(Y_k + 1)$th row counted from the origin (the center), $L(k)$ in equation (39) is given by

$$
\begin{aligned}
L(k) ={}& X_k^3 \log\frac{Y_k + T_k}{Y_k + 1 + U_k} + (X_k + 1)^3 \log\frac{Y_k + 1 + W_k}{Y_k + V_k} \\
&+ Y_k^3 \log\frac{T_k + X_k}{V_k + X_k + 1} + (Y_k + 1)^3 \log\frac{W_k + 1 + X_k}{U_k + X_k} \\
&+ 2\left\{-X_k U_k (Y_k + 1) + T_k X_k Y_k - Y_k V_k (X_k + 1) + W_k (X_k + 1)(Y_k + 1)\right\},
\end{aligned}
$$

where

$$
T_k = \sqrt{X_k^2 + Y_k^2}, \quad U_k = \sqrt{X_k^2 + (Y_k + 1)^2}, \quad V_k = \sqrt{(X_k + 1)^2 + Y_k^2}, \quad W_k = \sqrt{(X_k + 1)^2 + (Y_k + 1)^2}.
$$
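The logarithmic and product terms in $L(k)$ have the shape one obtains by integrating the distance from the origin over a grid cell. As a hedged sketch (the function $F$ and the unit-width cell coordinates are notation introduced here for illustration, and the normalization used in equation (39) is not reproduced), the standard antiderivative and its inclusion-exclusion over one cell are:

```latex
% Integral of the distance from the origin over the rectangle [0,a] x [0,b]
F(a,b) = \int_0^a\!\!\int_0^b \sqrt{x^2 + y^2}\,dy\,dx
       = \frac{1}{6}\left[\,2ab\,d + a^3 \log\frac{b + d}{a}
         + b^3 \log\frac{a + d}{b}\right],
\qquad d = \sqrt{a^2 + b^2}.

% Integral over the cell [X_k, X_k + 1] x [Y_k, Y_k + 1] by inclusion-exclusion
\int_{X_k}^{X_k+1}\!\!\int_{Y_k}^{Y_k+1} \sqrt{x^2 + y^2}\,dy\,dx
  = F(X_k + 1, Y_k + 1) - F(X_k, Y_k + 1) - F(X_k + 1, Y_k) + F(X_k, Y_k).
```

Expanding the four terms and collecting the logarithms by their cubic coefficients $X_k^3$, $(X_k+1)^3$, $Y_k^3$, and $(Y_k+1)^3$ gives four logarithmic terms plus a bracket of products of the form (column edge) × (row edge) × (corner distance). Under the assumption that the cells have unit width, this appears to be one sixth of an expression of the same form as $L(k)$. As a check on the antiderivative, $F(1,1) = (\sqrt{2} + \log(1 + \sqrt{2}))/3 \approx 0.7652$, the mean distance from a corner of the unit square.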
Since subregions located in the upper right quarter of the region and those in other quarters are symmetrical with respect to the horizontal axis, the vertical axis, or the center, we can calculate the values of $L(k)$ for subregions in the other quarters in the same way by moving them to the corresponding locations in the upper right quarter.

LITERATURE CITED

Amrhein, C. G., and R. Flowerdew (1992). “The Effect of Data Aggregation on a Poisson Regression Model of Canadian Migration.” Environment and Planning A 24, 1381–91.
Anselin, L. (1988). Spatial Econometrics: Methods and Models. Dordrecht: Kluwer Academic Publishers.
Arbia, G. (1989). “Statistical Effect of Spatial Data Transformations: A Proposed General Framework.” In Accuracy of Spatial Databases, edited by M. F. Goodchild and S. Gopal, pp. 249–59. London: Taylor and Francis.
Batty, M., and P. K. Sikdar (1982a). “Spatial Aggregation in Gravity Models. 1. An Information-Theoretic Framework.” Environment and Planning A 14, 377–405.
_____ (1982b). “Spatial Aggregation in Gravity Models. 2. One-Dimensional Population Density Models.” Environment and Planning A 14, 525–53.
_____ (1982c). “Spatial Aggregation in Gravity Models. 3. Two-Dimensional Trip Distribution and Location Models.” Environment and Planning A 14, 629–58.
_____ (1982d). “Spatial Aggregation in Gravity Models. 4. Generalizations and Large-Scale Applications.” Environment and Planning A 14, 795–822.
Blalock, H. M. (1964). Causal Inferences in Nonexperimental Research. Chapel Hill: University of North Carolina Press.
Burgess, E. W. (1925). “The Growth of the City: An Introduction to a Research Project.” In The City, edited by R. E. Park, E. W. Burgess, and D. McKenzie. Chicago: Chicago University Press.
Clark, C. (1951). “Urban Population Density.” Journal of the Royal Statistical Society A 114, Part 4, 490–96.
Clark, W.A.V., and K. L. Avery (1976). “The Effects of Data Aggregation in Statistical Analysis.” Geographical Analysis 8, 428–38.
Cliff, A. D., R. L. Martin, and J. K. Ord (1974). “Evaluating the Friction of Distance Parameter in Gravity Models.” Regional Studies 8, 281–86.
_____ (1975). “Map Pattern and Friction of Distance Parameters: Reply to Comments by R. J. Johnston, and by L. Curry, D. A. Griffith and E. S. Sheppard.” Regional Studies 9, 285–88.
Cliff, A. D., and J. K. Ord (1973). Spatial Autocorrelation. London: Pion.
Cramer, J. S. (1964). “Efficient Grouping, Regression, and Correlation in Engel Curve Analysis.” American Statistical Association Journal 59, 233–50.
Curry, L. (1972). “A Spatial Analysis of Gravity Flows.” Regional Studies 6, 131–47.
Curry, L., D. A. Griffith, and E. S. Sheppard (1975). “Those Gravity Parameters Again.” Regional Studies 9, 289–96.
Fotheringham, A. S., and D.W.S. Wong (1991). “The Modifiable Areal Unit Problem in Multivariate Statistical Analysis.” Environment and Planning A 23, 1025–44.
Gehlke, C. E., and K. Biehl (1934). “Certain Effects of Grouping upon the Size of the Correlation Coefficient in Census Tract Material.” Journal of the American Statistical Association, Supplement 29, 169–70.
Herbert, D. T., and R. J. Johnston, eds. (1976). Social Areas in Cities. Vol. I, Spatial Processes and Form. Chichester: John Wiley.
Johnston, R. J. (1973). “On Frictions of Distance and Regression Coefficients.” Area 5, 187–91.
_____ (1975). “Map Pattern and Friction of Distance Parameters: A Comment.” Regional Studies 9, 281–83.
_____ (1976). “On Regression Coefficients in Comparative Studies of the ‘Frictions of Distance.’” Tijdschrift voor Econ. en Soc. Geografie 67, 15–28.
Masser, I., and P.J.B. Brown (1975). “Hierarchical Aggregation Procedures for Interaction Data.” Environment and Planning A 7, 509–23.
_____ (1978). Spatial Representation and Spatial Interaction. London: Martinus Nijhoff.
Newling, B. (1969). “The Spatial Variation of Urban Population Densities.” Geographical Review 59.
Okabe, A. (1977). “Spatial Aggregation Bias in Trip Distribution Probabilities: The Case of the Opportunity Model.” Transportation Research 11, 197–202.
Okabe, A., and N. Tagashira (1996). “Spatial Aggregation Bias in a Regression Model Containing a Distance Variable.” Geographical Systems 2, 83–101.
Openshaw, S. (1977). “Optimal Zoning Systems for Spatial Interaction Models.” Environment and Planning A 9, 169–84.
_____ (1978). “An Empirical Study of Some Zone-Design Criteria.” Environment and Planning A 10, 781–94.
_____ (1984a). “Ecological Fallacies and the Analysis of Areal Census Data.” Environment and Planning A 16, 17–31.
_____ (1984b). The Modifiable Areal Unit Problem. Concepts and Techniques in Modern Geography, No. 38. Norwich: Geo Books.
Robinson, W. S. (1950). “Ecological Correlations and the Behavior of Individuals.” American Sociological Review 15, 351–57.
Sawicki, D. S. (1973). “Studies of Aggregated Areal Data: Problems of Statistical Inference.” Land Economics 69, 109–14.
Slater, P. B. (1985). “Point-to-Point Migration Functions and Gravity Model Renormalization: Approaches to Aggregation in Spatial Interaction Modeling.” Environment and Planning A 17, 1025–44.
Yule, G. U., and M. G. Kendall (1950). An Introduction to the Theory of Statistics. London: Griffin.