Testing for IIA in the Multinomial Logit Model

Simon Cheng, University of Connecticut, Storrs
J. Scott Long, Indiana University, Bloomington

Sociological Methods & Research, Volume 35, Number 4, May 2007, 583-600. © 2007 Sage Publications. DOI: 10.1177/0049124106292361.

The multinomial logit model is perhaps the most commonly used regression model for nominal outcomes in the social sciences. A concern raised by many researchers, however, is the assumption of the independence of irrelevant alternatives (IIA) that is implicit in the model. In this article, the authors undertake a series of Monte Carlo simulations to evaluate the three most commonly discussed tests of IIA. Results suggest that the size properties of the most common IIA tests depend on the data structure for the independent variables. These findings are consistent with an earlier impression that, even in well-specified models, IIA tests often reject the assumption when the alternatives seem distinct and often fail to reject IIA when the alternatives can reasonably be viewed as close substitutes. The authors conclude that tests of the IIA assumption that are based on the estimation of a restricted choice set are unsatisfactory for applied work.

Keywords: IIA; independence of irrelevant alternatives; multinomial logit

Authors' Note: Direct all correspondence to Simon Cheng, University of Connecticut, Department of Sociology, Storrs, CT 06269-2068; e-mail: [email protected]. We thank Tim Fry, Mark Harris, David Weakliem, and two reviewers for their comments. For computing code related to this article, see web.uconn.edu/simoncheng/research iia.htm.

The multinomial logit model (MNLM) is perhaps the most commonly used regression model for nominal outcomes. The model is easy to estimate, and interpretation is straightforward, albeit complicated by the large number of parameters involved. A concern raised by many researchers is the assumption of the independence of irrelevant alternatives that is implicit in the model (e.g., Alvarez and Nagler 1995; Dow and Endersby 2004; Fry and Harris 1996, 1998; Keane 1992; Lacy and Burden 1999; Mokhtarian and Bagley 2000; Pels, Nijkamp, and Rietveld 2001). The independence of irrelevant alternatives (IIA) means that, all else being equal, a person's choice between two alternative outcomes is unaffected by what other choices are available. McFadden's (1974) commonly used example illustrating why this assumption can be unrealistic involves a commuter's choice among modes of transportation. Suppose that a person can travel to work either by car or by a red bus. Assume that the probability of each mode of travel is 1/2, so that the odds of taking a car rather than a red bus are (1/2)/(1/2) = 1. IIA requires that if a new alternative becomes available, the probabilities for the prior choices must adjust by precisely the amount necessary to retain the original odds. Now, suppose that the alternatives are expanded to include travel on a blue bus, where this bus is identical to the red bus except for color. We would expect that the probability of taking a red bus would equal that of taking a blue bus.
In such a case, the only way to maintain the original odds of taking a car versus a red bus would be if a car is chosen with probability 1/3, a red bus 1/3, and a blue bus 1/3. By this logic, we could effectively eliminate the use of cars by increasing the number of colors used by bus companies. Obviously, it is more likely that the original bus riders would divide evenly between taking red and blue buses. But this more realistic scenario violates the IIA assumption, since the odds of a car versus a red bus would become (1/2)/(1/4) = 2 ≠ 1.

Train (2003) points out that the above example is rather extreme and unlikely to occur in serious, substantive research. It is also important to keep in mind that violations of IIA are not inherent in the choices themselves. That is to say, for a given set of choices, the IIA property could be violated for one specification of the independent variables but not for some other specification. As discussed by McFadden, Train, and Tye (1981), the IIA property implies that the variables omitted from the model are independent random variables in a way that is analogous to the assumption of independent error terms in the linear regression model.

Two basic types of tests can be used to test for violations of IIA: choice set partitioning tests and model-based tests. Choice set partitioning tests compare the results from the full MNLM estimated with all outcomes (i.e., choices) to the results from a restricted estimation that includes only some of the outcomes. IIA holds when the estimated coefficients of the full model are statistically similar to those of the restricted one. If the test statistic is significant, the assumption of IIA is rejected, and the conclusion is that the MNLM is inappropriate. The first test of IIA was proposed by McFadden et al. (1981). This likelihood ratio test, hereafter MTT, compares the value of the log-likelihood equation from the restricted estimation to the value obtained by substituting the estimates from the full model into the log-likelihood equation for the restricted estimation. Small and Hsiao (1985) demonstrated that the MTT test is asymptotically biased and proposed an alternative likelihood ratio test, known as the Small and Hsiao test, that eliminates this bias. A third IIA test, proposed by Hausman and McFadden (1984), compares the estimates from the full and restricted models. The most commonly used tests are the Hausman and McFadden (HM) test and the Small and Hsiao (SH) test, which are frequently discussed in econometrics texts (e.g., Greene 2003; Train 2003) and can be easily computed using standard software (Zhang and Hoffman 1993).

Model-based tests are computed by estimating a more general model that does not impose the IIA assumption and testing the constraints that lead to IIA. The most commonly discussed alternative models are multinomial probit, nested logit, and mixed logit (see Train 2003 for an excellent discussion of these models). When these alternative models are used, IIA can be tested by comparing the unrestricted model to a model that imposes constraints leading to IIA. Unfortunately, these models are computationally more difficult and are less familiar to applied researchers. As a consequence, these tests are rarely seen in substantive applications.
In the case of the multinomial probit, issues of identification also make application difficult (Keane 1992). These tests are not considered further in our article.

Evaluation of statistical tests typically involves assessment of their size and power properties. In assessing size properties, the nominal significance level of a test (e.g., .05, .10) is compared with the empirical significance level in a data structure that does not violate the assumption being evaluated. The empirical significance level is defined as the proportion of times that the correct null hypothesis is rejected over a large number of replications. If the size properties of a test are appropriate, the power of the test is evaluated by assessing the proportion of times that the test rejects the null hypothesis using a data structure that violates the assumption. The more powerful the test, the higher the proportion of tests that detect a violation of the assumption.

In two recent articles, Fry and Harris (1996, 1998) use Monte Carlo simulations to evaluate six choice set partitioning tests of IIA, including the MTT, SH, and HM tests. The first article provided evidence that these tests have poor size properties and that critical values based on asymptotic theory may be inappropriate. In their second article, they find that the SH test is oversized and that the HM test is reasonably well sized. Although the MTT test is found to be undersized, it has the greatest power when using empirical critical values. These values are the 95th percentile of the test statistics from 1,000 simulations on samples from a population in which IIA is not violated. Fry and Harris conclude that multiple tests should be used, that inference should be based on empirical critical values, and that a size-adjusted MTT be used. They also point out that their findings from the simulations in the 1998 article to some degree contradict their findings in the 1996 article, a point that we address below.

Our own experience with these tests, reinforced by responding to researchers who encountered anomalies when using the IIA tests implemented in Stata by Long and Freese (2005), suggests that problems with IIA tests cannot be corrected with size adjustments or by using alternative forms of the test. In a variety of substantive applications, we have found that even with reasonable model specifications, these tests often reject IIA when the alternatives seem distinct and do not reject IIA when the alternatives can reasonably be viewed as close substitutes. Moreover, variations of IIA tests applied to the same data using the same model often provide inconsistent results regarding the violation of IIA in the full model (see Fry and Harris 1998 for an example). To assess these impressions more formally, we ran a series of Monte Carlo simulations to evaluate the MTT, SH, and HM tests. Because our simulations show that the size properties of these tests are inadequate, we do not consider the power of these tests. We note, however, that even if a test has adequate size properties, its power properties could still be poor. See Brooks, Fry, and Harris (1997, 1998) for a discussion of the power properties of IIA tests.
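To make the notion of empirical size concrete, the following minimal Python sketch (an illustration added here, not code from the article) computes an empirical significance level as the proportion of replications in which a correctly specified null hypothesis is rejected. The `p_values` array is a placeholder standing in for the p values produced by one IIA test across repeated simulated samples.

```python
import numpy as np

# Placeholder p values standing in for one IIA test across 500 Monte Carlo
# replications drawn from a population in which IIA holds.
rng = np.random.default_rng(42)
p_values = rng.uniform(0, 1, size=500)

nominal_level = 0.05
empirical_size = np.mean(p_values < nominal_level)  # proportion of false rejections
print(f"Empirical size at the {nominal_level} level: {empirical_size:.3f}")
```

A well-sized test should produce an empirical size close to the nominal level; the size distortions reported below correspond to empirical sizes far above .05.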
As shown below, our simulations suggest that the size properties of these IIA tests depend on the data structure for the independent variables. With some structures, the size properties are reasonable, while in others, they are extremely inflated. Since in substantive applications it is not possible to determine whether the data structure leads to unreasonable results, we conclude that the tests are not useful for assessing IIA (see note 1). We also consider the use of size-adjusted critical values, as suggested by Fry and Harris (1996, 1998). Our simulations find that these values depend on the data structure. This makes the application of the test computationally intensive and largely impractical.

We begin with a formal statement of the MNLM and the three IIA tests. We then describe our simulations and present the results. We conclude with a discussion of the implications of these findings.

The Multinomial Logit Model

Let y be the dependent variable with J outcomes numbered from 1 to J. Let x be a vector of K independent variables plus a constant for the intercept. The probability of observing outcome m for a given x is

$$\Pr(y = m \mid x) = \frac{\exp(x\beta_m)}{\sum_{j=1}^{J} \exp(x\beta_j)} \quad \text{for } m = 1, \ldots, J. \quad (1)$$

The vector $\beta_m = (\beta_{0m} \; \cdots \; \beta_{km} \; \cdots \; \beta_{Km})'$ includes the intercept $\beta_{0m}$ and the coefficients $\beta_{km}$ for the effect of $x_k$ on outcome m. To identify the model, we assume without loss of generality that $\beta_1 = 0$. The model can also be written in terms of the odds for each pair of options m and n:

$$\Omega_{m \mid n}(x) = \exp\left(x[\beta_m - \beta_n]\right). \quad (2)$$

Equation (2) shows that the odds of choosing m versus n do not depend on which other outcomes are possible. That is, the odds are determined only by the coefficient vectors for m and n, namely $\beta_m$ and $\beta_n$. This is the independence of irrelevant alternatives property, or simply IIA.

Testing IIA

The MNLM can be viewed as the simultaneous estimation of binary logits for all pairs of outcome categories. While efficient estimation of the model requires that all pairs be estimated simultaneously, which imposes certain logical constraints among the parameters, Begg and Gray (1984) show that consistent but inefficient estimates can be obtained by estimating a series of binary logits. For example, an MNLM with three outcomes could be estimated with two binary logits, the first comparing outcomes 1 and 2 and the second comparing outcomes 1 and 3. Choice set partitioning tests of IIA essentially involve comparing the estimates based on all outcomes simultaneously to those based on a restricted choice set. We now formally describe the tests.

The full model is given in equation (1), with estimates $\hat{\beta}^f_m$. The superscript f indicates that the estimates are from the full model that includes all outcomes. The restricted estimation is identical to the full model except that the equation for outcome J is excluded:

$$\Pr(y = m \mid x) = \frac{\exp(x\beta_m)}{\sum_{j=1}^{J-1} \exp(x\beta_j)} \quad \text{for } m = 1, \ldots, J-1, \quad (3)$$

where we assume that $\beta_1 = 0$. While we have dropped outcome J, any other outcome could have been dropped. Under IIA, the estimates $\hat{\beta}^r_m$ from the restricted choice set are consistent but inefficient, while the estimates $\hat{\beta}^f_m$ from the full model are consistent and efficient.
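As an informal illustration (not part of the original article), the short Python sketch below evaluates equation (1) for a hypothetical three-outcome model with made-up coefficients and shows that the pairwise odds in equation (2) are unchanged when an outcome is dropped from the choice set. This is the IIA property that makes the restricted estimates consistent when the assumption holds.

```python
import numpy as np

# Hypothetical coefficient vectors [intercept, slope] for a three-outcome MNLM;
# beta_1 is fixed at zero for identification, as in equation (1).
betas = {1: np.array([0.0, 0.0]),
         2: np.array([0.5, 1.0]),
         3: np.array([-0.2, 0.8])}

x = np.array([1.0, 1.5])  # [constant, x1] for a single observation

def mnl_probs(x, betas, choice_set):
    """Equation (1) evaluated over the outcomes in choice_set."""
    expxb = {m: np.exp(x @ betas[m]) for m in choice_set}
    denom = sum(expxb.values())
    return {m: expxb[m] / denom for m in choice_set}

p_full = mnl_probs(x, betas, choice_set=[1, 2, 3])
p_restricted = mnl_probs(x, betas, choice_set=[1, 2])  # outcome 3 dropped

# Equation (2): the odds of 2 versus 1 depend only on beta_2 - beta_1, so they
# are the same whether or not outcome 3 is available.
print(p_full[2] / p_full[1],
      p_restricted[2] / p_restricted[1],
      np.exp(x @ (betas[2] - betas[1])))
```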
The various tests of IIA involve comparing the estimates from the full model to those from the restricted estimation. To define these tests, the estimates from the restricted choice set are stacked in the vector $\hat{\beta}^r = (\hat{\beta}^{r\prime}_2 \; \cdots \; \hat{\beta}^{r\prime}_{J-1})'$, with the corresponding estimates from the full model stacked in $\hat{\beta}^f = (\hat{\beta}^{f\prime}_2 \; \cdots \; \hat{\beta}^{f\prime}_{J-1})'$. Note that $\hat{\beta}^f$ does not include $\hat{\beta}^f_J$, since it was not estimated in the restricted estimation.

McFadden, Train, and Tye Test

The approximate likelihood ratio test of IIA proposed by McFadden et al. (1981) is defined as

$$\mathrm{MTT} = -2\left[L^r(\hat{\beta}^f) - L^r(\hat{\beta}^r)\right],$$

where $L^r$ is the log-likelihood function for the restricted estimation. Quite simply, the test compares the value of the log-likelihood equation from the restricted estimation to the value obtained by plugging the estimates from the full model into the log-likelihood from the restricted model. When IIA holds, MTT is distributed as chi-square with degrees of freedom equal to the number of rows in $\hat{\beta}^r$.

Small and Hsiao Test

Small and Hsiao (1985) show that MTT is asymptotically biased toward accepting the null hypothesis, which has been empirically confirmed in studies such as Fry and Harris (1996, 1998). Small and Hsiao proposed a modified version of MTT to avoid this bias. First, the sample is randomly divided into subsamples A and B of roughly equal size. The full model from equation (1) is estimated on both subsamples, with the estimates contained in $\hat{\beta}^f_A$ and $\hat{\beta}^f_B$. The weighted average of the coefficients from the two samples is defined as

$$\hat{\beta}^f_{AB} = \frac{1}{\sqrt{2}}\,\hat{\beta}^f_A + \left(1 - \frac{1}{\sqrt{2}}\right)\hat{\beta}^f_B.$$

A restricted subsample is created from subsample B by eliminating all cases with a given value of the dependent variable (in our case, category J). The restricted choice set is estimated using the restricted subsample, yielding the estimates $\hat{\beta}^r_B$ with the likelihood function $L^r$. The Small-Hsiao statistic is

$$\mathrm{SH} = -2\left[L^r(\hat{\beta}^f_{AB}) - L^r(\hat{\beta}^r_B)\right].$$

SH is asymptotically distributed as chi-square with degrees of freedom equal to the number of parameters in the restricted choice set.

Hausman and McFadden Test

Hausman and McFadden (1984) proposed a Hausman test (Hausman 1978) that compares the estimates $\hat{\beta}^f$, which are consistent and efficient if the null hypothesis is true, to the consistent but inefficient estimates $\hat{\beta}^r$. The HM test is defined as

$$\mathrm{HM} = \left(\hat{\beta}^r - \hat{\beta}^f\right)'\left[\widehat{\mathrm{Var}}(\hat{\beta}^r) - \widehat{\mathrm{Var}}(\hat{\beta}^f)\right]^{-1}\left(\hat{\beta}^r - \hat{\beta}^f\right),$$

where $\widehat{\mathrm{Var}}(\hat{\beta}^r)$ and $\widehat{\mathrm{Var}}(\hat{\beta}^f)$ are the estimated covariance matrices. If IIA holds, HM is asymptotically distributed as chi-square with degrees of freedom equal to the number of rows in $\hat{\beta}^r$. Significant values of HM indicate that the IIA assumption has been violated. Hausman and McFadden (1984:1226) note that HM can be negative if $\widehat{\mathrm{Var}}(\hat{\beta}^r) - \widehat{\mathrm{Var}}(\hat{\beta}^f)$ is not positive semidefinite, but they conclude that this is evidence that IIA holds. We use this decision rule in the results we present below.
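The HM statistic is straightforward to compute once the stacked estimates and their covariance matrices are available. The following minimal Python sketch (an illustration with placeholder numbers, not code from the article) implements the formula above; in applied work these tests are usually obtained from existing routines, such as the Stata implementation by Long and Freese (2005) mentioned earlier.

```python
import numpy as np

def hausman_mcfadden(b_r, b_f, V_r, V_f):
    """Hausman-McFadden statistic comparing restricted (b_r, V_r) and full
    (b_f, V_f) estimates of the same stacked coefficient vector."""
    d = b_r - b_f
    dV = V_r - V_f                            # may fail to be positive semidefinite
    stat = float(d @ np.linalg.pinv(dV) @ d)  # pseudo-inverse guards against a singular dV
    return stat, len(d)                       # statistic and its nominal degrees of freedom

# Placeholder numbers standing in for estimates from a fitted full and restricted MNLM.
b_f = np.array([0.48, 0.95])
b_r = np.array([0.52, 1.02])
V_f = np.array([[0.010, 0.002], [0.002, 0.015]])
V_r = np.array([[0.016, 0.003], [0.003, 0.024]])

hm, df = hausman_mcfadden(b_r, b_f, V_r, V_f)
print(hm, df)  # negative values are treated as evidence that IIA holds (see text)
```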
Alternative Forms of Each Test

Multiple variants of each test can be created by eliminating different alternatives to create the restricted choice set. For example, if we use a restricted estimation that excludes a single category, as we do in our simulations, there are J versions of each test. Version 1 excludes the first category to create the restricted estimation, version 2 excludes the second category, and so on. While the resulting tests are asymptotically equivalent, results can differ substantially in finite samples, as shown below.

Generation of Data

To examine the size properties of the MTT, HM, and SH tests, we conducted Monte Carlo simulations using eight artificial data sets in which the IIA assumption was not violated. These artificial data sets were constructed to reflect scenarios that might occur in real survey data: both continuous and categorical covariates, different degrees of collinearity among the covariates, different values of the βs, and small cells in the cross-tabulation between the outcome variable and dichotomous covariates. For each data structure, we generated 150,000 observations with a three-category outcome variable and three independent variables. The independent variables were constructed as follows.

1. x1 is drawn from a uniform distribution on the interval from 1 to 2.
2. x2c is a continuous variable constructed by adding the uniform random variable used for x1 to a normal random variable. The relative weights of the uniform and normal components varied across data sets to change the amount of collinearity between x1 and x2. To create categorical covariates and sparse cells in the cross-tabulation between the outcome y and categorical covariates, we dichotomized x2c to create the binary variable x2d. These variables are discussed further below.
3. x3 is a skewed variable constructed by adding a random variable drawn from a chi-square distribution with one degree of freedom to the uniform random variable used to construct x1. Again, we varied the relative weights.

The outcome y was constructed as follows (a code sketch of this procedure follows Table 1).

1. Select values for the βs in equation (1).
2. Compute predicted probabilities for each of the 150,000 observations using

$$\Pr(y = m \mid x) = \frac{\exp(\beta_{m0} + \beta_{m1}x_1 + \beta_{m2}x_2 + \beta_{m3}x_3)}{\sum_{j=1}^{3} \exp(\beta_{j0} + \beta_{j1}x_1 + \beta_{j2}x_2 + \beta_{j3}x_3)} \quad \text{for } m = 2, 3, \quad (4)$$

where $\Pr(y = 1 \mid x) = 1 - \Pr(y = 2 \mid x) - \Pr(y = 3 \mid x)$.
3. Generate a uniform random number on the interval from 0 to 1 for each observation in each data set. If this random number is less than Pr(y = 1) computed with equation (4), then y = 1. If the number is between Pr(y = 1) and Pr(y = 1) + Pr(y = 2), then y = 2; otherwise, y = 3.

In Data Sets 1, 2, and 3, all of the xs are continuous. These data sets differ in the degree of collinearity among the xs, with the maximum correlations ranging from .62 in Data Set 1 to .82 in Data Set 3. Data Set 4 was created by dichotomizing x2, with 47 percent of the cases equal to 1. Data Sets 5 through 8 are discussed in the "IIA Tests in Data With Sparse Cells" section. Table 1 summarizes the data sets used in our simulations.

Table 1
Descriptive Statistics for Data Sets Used in Simulations

Data Set   % y=1   % y=2   % y=3   Mean x1   Mean x2   Mean x3   r(x1,x2)   r(x1,x3)   r(x2,x3)
   1        21.6    57.7    20.8     1.50      1.50      1.60       .82        .56        .46
   2        15.6    69.0    15.4     1.50      2.00      2.10       .89        .72        .63
   3        12.1    76.9    11.1     1.50      2.49      2.60       .92        .81        .74
   4        41.0    33.9    25.1     1.50      0.47      1.60       .73        .56        .41
   5        30.7    40.9    28.4     1.50      0.88      1.60       .45        .56        .26
   6        15.6    70.1    14.4     1.50      0.88      1.60       .45        .56        .26
   7        32.8    34.2    33.0     1.50      0.88      1.60       .46        .56        .26
   8        48.8    25.1    26.1     1.50      0.88      1.60       .45        .56        .26
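The following Python sketch (an illustration, not the authors' code) mirrors the outcome-generation procedure described above for a single data structure. The mixing weights and β values are placeholders chosen only to produce data of the same general form; the article's actual values are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2007)
n = 150_000

# Covariates built roughly as described above; the mixing weights are illustrative.
x1 = rng.uniform(1, 2, n)
x2 = 0.5 * x1 + 0.5 * rng.normal(size=n)           # continuous x2 correlated with x1
x3 = 0.5 * x1 + 0.5 * rng.chisquare(df=1, size=n)  # skewed x3 correlated with x1

# Placeholder coefficients [b_m0, b_m1, b_m2, b_m3] for outcomes 2 and 3;
# outcome 1 is the reference with all coefficients equal to zero.
B2 = np.array([0.25, 0.50, -0.25, 0.30])
B3 = np.array([-0.50, 0.75, 0.25, -0.20])

X = np.column_stack([np.ones(n), x1, x2, x3])
num2, num3 = np.exp(X @ B2), np.exp(X @ B3)
denom = 1.0 + num2 + num3            # exp(X @ 0) = 1 for the reference outcome
p1, p2 = 1.0 / denom, num2 / denom   # equation (4); p3 = 1 - p1 - p2

# Assign y by comparing a uniform draw with the cumulative probabilities.
u = rng.uniform(size=n)
y = np.where(u < p1, 1, np.where(u < p1 + p2, 2, 3))
print(np.bincount(y)[1:] / n)        # observed shares of the three outcomes
```

Because the data are generated from the MNLM itself, IIA holds in these populations by construction.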
Design of Simulations

For each data set, simulations were run for sample sizes of n = 150, 250, 350, 500, 1,000, and 2,000 (see note 2). The simulations involved these steps:

1. Draw a random sample of size N with replacement from the population.
2. For this sample, estimate the MNLM with outcome y and predictors x1, x2, and x3.
3. Using the estimates from Step 2, compute three variations of the MTT, HM, and SH tests, excluding in turn the first, second, and third category from the restricted estimation. The test statistics and p values are saved for later analysis.

These steps were repeated 500 times for each sample size in each data set. To determine the empirical size of each test, we computed the percentage of times that each test rejected the null hypothesis that IIA held in the population at the .05 and .10 levels of significance. Since the results at the .10 level are consistent with those at the .05 level, they are not reported. For the HM test, we followed Hausman and McFadden's (1984) suggestion that negative chi-square values be recorded as 0 with a corresponding p value of 1 (see note 3).

Our analysis begins by examining the three IIA tests using the first four data structures and shows that the size properties are affected by the amount of collinearity and depend on which version of the test is used. Because the undersized properties of the MTT test are highly consistent with those reported in earlier research, we present only the results for the HM and SH tests. While the SH test has seemingly reasonable size properties with samples of 500 or more in data structures with different degrees of collinearity, we show that the presence of sparse cells can lead to severe size distortion for sample sizes up to 2,000, the largest we present. Using these findings as a guide, we then consider the MTT and illustrate the practical problems with using empirical critical values.

IIA Tests in Data With Varying Degrees of Collinearity

The results of the simulations for the HM test are presented in Figure 1, which shows the percentage of times the HM test rejected the null hypothesis of no violation of the IIA assumption at the .05 level (see note 4). For each data structure, three versions of the HM test were computed, excluding either the first, second, or third outcome category. The percentage listed in the title of the graph for Data Set 4 indicates that 10.6 percent of the cases fall in the smallest cell of the cross-tabulation between y and x2. The numbers on the lines within each graph indicate the deleted category for the test being presented. The results illustrate, first, that the HM test does not reliably converge to its appropriate size even when the sample size is 2,000. Second, the properties of the test depend on which outcome category is deleted in the restricted estimation. For example, in Data Set 2, the test approaches its nominal .05 level when Category 2 is excluded but levels off around .15 when Category 1 is excluded.
In a substantial proportion of the samples, the resulting HM test statistic was negative. Even with a sample size of 1,000, 21 to 49 percent of the test statistics were negative. Overall, our results indicate that the HM test is not a viable test for assessing IIA.

[Figure 1. Size Properties of the Hausman-McFadden Test of the Independence of Irrelevant Alternatives. Panels: Data 1: low collinearity; Data 2: medium collinearity; Data 3: high collinearity; Data 4: binary x2, 10.6%. Each panel plots the percentage of rejections at the .05 level against sample size (0 to 2,000), with lines labeled by the excluded category (1, 2, 3).]

As shown in Figure 2, the SH test approximates its nominal size as the sample increases to 500 or 1,000. The magnitude of the departures from the nominal size, and the sample size at which these distortions are largely removed, depend on the degree of collinearity in the data. For example, with high collinearity, the size properties are quite poor with samples smaller than 500, and a sample of at least 1,000 is required before the distortions are nearly eliminated. We also found evidence of a practical problem that is often encountered when applying these tests to real-world data. There are six ways to compute the SH test in our example. Each outcome category can be the base category in the MNLM used to compute the test, and for each base category there are two variations of the test, depending on which nonbase category is removed. While using Category 1 as the base category when excluding Category 3 specifies the same model as using Category 2 as the base when excluding Category 3, the results from the SH test will differ because they depend on a particular draw of random numbers. In more than 33 percent of the samples of 500, at least one of the six possible SH tests provided a conclusion inconsistent with the others. Even in samples of 1,000, inconsistencies were found in 28 percent of the samples.

[Figure 2. Size Properties of the Small-Hsiao Test of the Independence of Irrelevant Alternatives. Panels: Data 1: low collinearity; Data 2: medium collinearity; Data 3: high collinearity; Data 4: binary x2, 10.6%. Each panel plots the percentage of rejections at the .05 level against sample size (0 to 2,000), with lines labeled by the excluded category (1, 2, 3).]

Even greater problems were encountered when we explored data structures with sparse cells.
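To make the dependence on the random split concrete, the sketch below (an illustration with placeholder inputs, not the authors' code) computes the SH statistic from the split-sample coefficient average and the restricted-sample log-likelihood for a three-category example in which category 3 is excluded. Because the statistic depends on how the sample is partitioned into subsamples A and B, different random splits generally yield different values, which is the source of the inconsistencies just described.

```python
import numpy as np
from scipy.optimize import minimize

def restricted_loglik(beta, y, X):
    """Binary-logit log-likelihood for the restricted choice set {1, 2}
    (outcome 3 removed); beta holds the coefficients of outcome 2 versus 1."""
    eta = X @ beta
    return np.sum((y == 2) * eta - np.log1p(np.exp(eta)))

rng = np.random.default_rng(1)

# Placeholder restricted subsample from B (cases with y == 3 already dropped).
n = 200
X_B = np.column_stack([np.ones(n), rng.uniform(1, 2, n)])
p2 = 1 / (1 + np.exp(-(X_B @ np.array([0.5, 0.9]))))   # illustrative true model
y_B = np.where(rng.uniform(size=n) < p2, 2, 1)

# Placeholder estimates standing in for the full model fit on subsamples A and B.
beta_fA = np.array([0.40, 0.85])
beta_fB = np.array([0.58, 0.98])

w = 1 / np.sqrt(2)
beta_fAB = w * beta_fA + (1 - w) * beta_fB              # Small-Hsiao weighted average

# Restricted-model MLE on the restricted subsample from B.
fit = minimize(lambda b: -restricted_loglik(b, y_B, X_B), x0=np.zeros(2))
beta_rB = fit.x

SH = -2 * (restricted_loglik(beta_fAB, y_B, X_B) - restricted_loglik(beta_rB, y_B, X_B))
print(SH)  # referred to a chi-square with df equal to the number of parameters in beta_rB
```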
IIA Tests in Data With Sparse Cells

In our early experiments with a variety of data structures, we occasionally obtained results that showed severe size distortion, as illustrated in Figure 3 for Data Structures 7 and 9. In these cases, the size distortion for the SH test increased with sample size for some variations of the test, and there were substantial differences in the percentage of times that different versions of the test rejected the true null. Further analysis revealed that these results were due to the presence of sampling zeros in the cross-tabulation between the outcome y and the dichotomous variable x2d. This problem is similar to the size distortion of likelihood chi-square statistics in contingency tables with sparse cells (Larntz 1978).

[Figure 3. Illustration of Severe Size Distortion. Panels: HM test; SH test. Each panel plots the percentage of rejections against sample size (0 to 2,000), with lines labeled by the excluded category (1, 2, 3). Note: HM = Hausman-McFadden; SH = Small-Hsiao.]

To explore this finding more systematically, we constructed four data sets in which the percentage of cases in the smallest cell of the cross-tabulation of y and x2 varied from 1.8 percent to less than .1 percent. Such small percentages could easily occur in data where one of the independent variables indicates membership in an underrepresented group, where an outcome category has few cases, or where a combination of multiple binary variables makes some outcome category rare for certain combinations of independent characteristics.

In drawing small samples from data structures with sparse cells, it was common to draw a sample in which there was a zero cell in the y × x2 table. In such cases, the MNLM can be estimated, but a singularity occurs in $\widehat{\mathrm{Cov}}(\hat{\beta})$. A researcher who encounters this situation when building a model is likely to respecify the model to remove the singularity, either by dropping one of the independent variables or by collapsing categories. We adapted our simulations to reflect this scenario: if a zero cell was encountered, we drew a replacement sample.

Figure 4 presents the results of our simulations for the SH test in data sets with sparse cells. Again, the percentages listed in the title of each graph indicate the percentage of cases in the smallest y × x2 cell in the population data structure. The percentage of tests that reject the null depends greatly on the excluded category. In some cases, the tests show extreme size distortion, rejecting the correct null 50 percent of the time even with samples of 1,000.

[Figure 4. Size Properties of the Small-Hsiao Test of the Independence of Irrelevant Alternatives in the Presence of Sparse Cells. Panels: Data 5: binary x2, 1.8%; Data 6: binary x2, .64%; Data 7: binary x2, .10%; Data 8: binary x2, .02%. Each panel plots the percentage of rejections against sample size (0 to 2,000), with lines labeled by the excluded category (1, 2, 3).]
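The zero-cell screening described above can be sketched as follows (an illustration with a placeholder population, not the authors' code): a bootstrap sample is drawn, the y-by-x2d cross-tabulation is checked for empty cells, and the sample is redrawn whenever a sampling zero appears.

```python
import numpy as np

rng = np.random.default_rng(7)

# Placeholder population with a sparse y-by-x2d cell (binary x2d, three-category y);
# in the article, y is generated from the MNLM rather than drawn independently.
N_pop = 150_000
x2d = rng.binomial(1, 0.02, N_pop)                      # rare group
y = rng.choice([1, 2, 3], size=N_pop, p=[0.3, 0.5, 0.2])

def draw_sample_without_zero_cells(y, x2d, n, rng, max_tries=1000):
    """Draw a sample of size n with replacement, redrawing whenever the
    sampled y-by-x2d cross-tabulation contains an empty cell."""
    for _ in range(max_tries):
        idx = rng.integers(0, len(y), size=n)            # sampling with replacement
        table = np.zeros((3, 2), dtype=int)
        np.add.at(table, (y[idx] - 1, x2d[idx]), 1)      # 3 x 2 cross-tabulation
        if table.min() > 0:                              # no sampling zeros
            return idx
    raise RuntimeError("no sample without empty cells found")

idx = draw_sample_without_zero_cells(y, x2d, n=150, rng=rng)
print(np.bincount(y[idx])[1:], x2d[idx].mean())
```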
In supplementary analyses, we extended the simulations to larger sample sizes (6,000, 8,000, and 10,000) and restricted the analysis to random samples with at least five observations in the smallest cell. In both cases, the size distortion persisted, again confirming our earlier finding that the size properties of the IIA tests are highly dependent on the data structure for the independent variables. The results for the HM test (not shown) are similar to those for Data Structures 1 through 4: The size of the test does not converge as the sample size increases, and the percentage rejected depends on the category excluded in the restricted model. These findings could explain why Fry and Harris (1996, 1998) found contradictory results for the size properties of the HM and SH tests in their two simulations.

Empirical Critical Values for the MTT Test

Fry and Harris (1998) explored the use of size-adjusted tests. For these tests, the critical value is set to the 95th percentile of the test statistic computed in the simulation. Based on power, they recommend the MTT as the preferred test. They state, "Furthermore, where possible, we would recommend that a simulation experiment be conducted to obtain empirical (size-corrected) critical values for use in inference concerning the IIA property" (p. 419).

Our results suggest that the sampling distribution of MTT is highly dependent on the data structure. For example, Table 2 shows the empirical critical values generated for the MTT test in our eight data structures. Even though Structures 1 through 3 are very similar, differing only in their degree of collinearity, the values differ substantially relative to the small variances in the distributions of the MTT tests (e.g., for Data Set 1, the standard deviations for the three tests are .03, .36, and .21). The variability in the computed empirical critical values suggests that the MTT test may not be effective even with size adjustments.

Table 2
Empirical Critical Values by Data Structure for n = 500

             Excluded Category
Data Set     1       2       3
   1         0.16    1.03    0.70
   2         0.11    1.59    0.69
   3         0.08    1.81    0.70
   4         0.68    0.25    0.23
   5         0.29    0.13    0.13
   6         0.26    1.19    0.09
   7         0.87    0.24    0.14
   8         0.31    0.19    0.14

Furthermore, even if a researcher decides that the size-adjusted MTT test is appropriate, we caution that Fry and Harris's advice requires researchers to obtain the empirical critical values from their own simulations using their own data. We believe that this makes the size-adjusted MTT impractical in most substantive applications.
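For completeness, the following minimal sketch (an illustration with placeholder statistics, not the authors' code) shows how an empirical, size-corrected critical value is obtained: it is simply the 95th percentile of the test statistics simulated under a data structure in which IIA holds.

```python
import numpy as np

rng = np.random.default_rng(5)

# Placeholder MTT statistics from 1,000 simulated samples drawn from a population
# in which IIA holds; in practice these come from re-running the full simulation
# on the researcher's own data structure.
mtt_stats = rng.chisquare(df=1, size=1000) * 0.3   # stand-in values only

empirical_cv = np.percentile(mtt_stats, 95)        # size-corrected 5% critical value
print(f"Empirical 5% critical value: {empirical_cv:.2f}")
```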
Conclusion

Our overall conclusion, based on the simulations shown above and our evaluation of other data structures, is that tests of the IIA assumption that are based on the estimation of a restricted choice set are unsatisfactory for applied work. The Hausman-McFadden test shows substantial size distortion that is unaffected by sample size in our simulations. The Small-Hsiao test has reasonable size properties in some data sets but shows severe size distortion even in large samples when there are sparse cells in the cross-tabulation of the outcome variable with a binary independent variable. While our simulations are based on relatively simple models with three outcomes and three independent variables, we suspect that simulations using more complex models that more closely approximate real-world applications would uncover additional problems with these tests. Furthermore, even if a researcher decided to use these tests, the problem of inconsistent results from different variations of the test is likely to arise. The MTT test with empirically based critical values, as suggested by Fry and Harris (1996, 1998), also has limitations that make its use impractical in substantive applications.

Overall, it appears that the best advice regarding concern about IIA goes back to an early statement by McFadden (1974), who wrote that the multinomial and conditional logit models should only be used in cases where the outcome categories "can plausibly be assumed to be distinct and weighed independently in the eyes of each decision maker." Similarly, Amemiya (1981:1517) suggests that the MNLM works well when the alternatives are dissimilar. Care in specifying the model so that it involves distinct outcomes that are not substitutes for one another seems to be reasonable, albeit unfortunately ambiguous, advice. The generalized extreme value models (e.g., nested logit, paired combinatorial logit) and the mixed logit model (see Train 2003) show great promise as models that do not impose the IIA assumption, but they require intensive computation to estimate and involve more complicated data structures.

Notes

1. While our simulations are based on the multinomial logit model, the results for the IIA tests should also apply to the conditional logit model.
2. We also ran simulations with sample sizes of 200, 300, 400, and 450. The results were consistent with those presented in our figures.
3. Supplementary analyses suggest that 20 to 60 percent of the resulting chi-square values from the Hausman-McFadden test were negative, but the incidence decreases as sample size increases. There is no clear relationship between the type of data structure and the percentage of tests with negative chi-square values.
4. The scales of the figures are fixed to make comparisons across figures easier.

References

Alvarez, R. Michael and Jonathan Nagler. 1995. "Economics, Issues and the Perot Candidacy: Voter Choice in the 1992 Presidential Election." American Journal of Political Science 39:714-44.
Amemiya, Takeshi. 1981. "Qualitative Response Models: A Survey." Journal of Economic Literature 19:1483-1536.
Begg, Colin B. and Robert Gray. 1984. "Calculation of Polychotomous Logistic Regression Parameters Using Individualized Regressions." Biometrika 71:11-8.
Brooks, Robert D., Tim R. L. Fry, and Mark N. Harris. 1997. "The Size and Power Properties of Combining Choice Set Partition Tests for the IIA Property in the Logit Model." Journal of Quantitative Economics 13:45-61.
———. 1998. "Combining Choice Set Partition Tests for IIA: Some Results in the Four Alternative Setting." Journal of Quantitative Economics 14:1-9.
Dow, Jay K. and James W. Endersby. 2004. "Multinomial Probit and Multinomial Logit: A Comparison of Choice Models for Voting Research." Electoral Studies 23:107-22.
Fry, Tim R. L. and Mark N. Harris. 1996. "A Monte Carlo Study of Tests for the Independence of Irrelevant Alternatives Property." Transportation Research Part B: Methodological 30:19-30.
———. 1998. "Testing for Independence of Irrelevant Alternatives: Some Empirical Results." Sociological Methods & Research 26:401-23.
Greene, William H. 2003. Econometric Analysis. 5th ed. New York: Prentice Hall.
Hausman, Jerry A. 1978. "Specification Tests in Econometrics." Econometrica 46:1251-71.
Hausman, Jerry A. and Daniel McFadden. 1984. "Specification Tests for the Multinomial Logit Model." Econometrica 52:1219-40.
Keane, Michael P. 1992. "A Note on Identification in the Multinomial Probit Model." Journal of Business and Economic Statistics 10:193-200.
Lacy, Dean and Barry C. Burden. 1999. "The Vote-Stealing and Turnout Effects of Ross Perot in the 1992 U.S. Presidential Election." American Journal of Political Science 43:233-55.
Larntz, Kinley. 1978. "Small Sample Comparisons of Exact Levels of Chi-Squared Goodness-of-Fit Statistics." Journal of the American Statistical Association 73:253-63.
Long, J. Scott and Jeremy Freese. 2005. Regression Models for Categorical Dependent Variables Using Stata. 2nd ed. College Station, TX: Stata Press.
McFadden, Daniel. 1974. "Conditional Logit Analysis of Qualitative Choice Behavior." Pp. 105-42 in Frontiers of Econometrics, edited by P. Zarembka. New York: Academic Press.
McFadden, Daniel, Kenneth Train, and William B. Tye. 1981. "An Application of Diagnostic Tests for the Independence From Irrelevant Alternatives Property of the Multinomial Logit Model." Transportation Research Board Record 637:39-46.
Mokhtarian, Patricia L. and Michael N. Bagley. 2000. "Modeling Employees' Perceptions and Proportional Preferences of Work Locations: The Regular Workplace and Telecommuting Alternatives." Transportation Research Part A: Policy and Practice 34:223-42.
Pels, Eric, Peter Nijkamp, and Piet Rietveld. 2001. "Airport and Airline Choice in a Multiple Airport Region: An Empirical Analysis for the San Francisco Bay Area." Regional Studies 35:1-9.
Small, Kenneth A. and Cheng Hsiao. 1985. "Multinomial Logit Specification Tests." International Economic Review 26:619-27.
Train, Kenneth. 2003. Discrete Choice Methods With Simulation. New York: Cambridge University Press.
Zhang, Junsen and Saul D. Hoffman. 1993. "Discrete-Choice Logit Models: Testing the IIA Property." Sociological Methods & Research 22:193-213.

Simon Cheng is an assistant professor of sociology at the University of Connecticut. His research focuses on race and ethnicity, sociology of education, family, quantitative methods, and political economy.
He is currently working on a new mixture model that allows researchers to adjust for potential misidentification of group membership in survey data containing drastically unequal subsample sizes, as well as on research that examines multiracial students' schooling experiences. His most recent publication focuses on resource allocation to young children from biracial families (forthcoming in American Journal of Sociology).

J. Scott Long is Chancellor's Professor of Sociology and Statistics at Indiana University–Bloomington. His research focuses on gender differences in the scientific career, stigma and mental health, aging and labor force participation, human sexuality, and statistical methods. His recent research on the scientific career was published as From Scarcity to Visibility by the National Academy of Sciences. He is past editor of Sociological Methods & Research and the recipient of the American Sociological Association's Paul F. Lazarsfeld Memorial Award for Distinguished Contributions in the Field of Sociological Methodology. He is the author of Confirmatory Factor Analysis, Covariance Structure Analysis, Regression Models for Categorical and Limited Dependent Variables, Regression Models for Categorical and Limited Dependent Variables With Stata (with Jeremy Freese), and several edited volumes.