Auto-probit Model for Multiple Regimes of Network Effects

Bin Zhang, Heinz College, iLab, Carnegie Mellon University
David Krackhardt, Heinz College, iLab, Carnegie Mellon University
Andrew C. Thomas, Department of Statistics, iLab, Carnegie Mellon University
Patrick Doreian, Department of Sociology, University of Pittsburgh, and Faculty of Social Sciences, University of Ljubljana
Ramayya Krishnan, Heinz College, iLab, Carnegie Mellon University

Heinz College Working Paper, Carnegie Mellon University, 2011

Abstract

Many researchers believe that consumers' decisions are driven not only by their personal tastes but also by the decisions of the people in their networks. Social scientists, moreover, are often interested in consumers' dichotomous choices, so an auto-probit model that accommodates multiple networks is very useful. Current methods for investigating multiple autocorrelated network effects on the same group of actors embedded in social networks are primitive. A few solutions exist for two networks (e.g. Doreian 1989), but they do not extend easily to three or more, and even fewer solutions exist when the dependent variable is binary. We develop two solutions, Expectation-Maximization (E-M) and hierarchical Bayesian, for auto-probit models that accommodate multiple network structures. Both solutions are among the first of their kind. We also study the behavior of the solutions, such as the impact of the prior distributions, the network structures, and the sizes of the network effects on parameter estimation, using both real and simulated data.

Contents

1 Introduction
2 Literature
3 Method
3.1 Model Specification
3.2 Expectation-Maximization Model
3.3 Hierarchical Bayesian Solution
3.4 Validation of Bayesian Software
4 Experiments
5 Conclusion
A E-M solution implementation
A.1 Derivation
A.2 Expectation step
A.3 Maximization step
B Markov chain Monte Carlo estimation
C Solution diagnostics

1 Introduction

In earlier times, researchers believed that consumers' preferences and choices were determined only by their own attributes, such as age, education, and income (Kamakura and Russell, 1989; Allenby and Rossi, 1998).
More and more researchers have come to realize that these decisions are determined not only by an individual's personal taste, but also by the decisions of the people he or she is connected to, i.e. the decisions of neighbors in the individual's social networks. This effect has been called "peer influence" (Duncan et al., 1968), "neighborhood effects" (Case, 1991), and "conformity" (Bernheim, 1994), among other names. Autocorrelation models with a network structure term were developed to allow researchers to investigate such problems. Eventually such models became insufficient, because an actor is very often under the influence of multiple networks, and we have very limited tools for comparing which network effect plays the more influential role in an individual's decision. A few models do include two network autocorrelation terms, such as Doreian's (1989) model with two regimes of network effects. Social scientists, moreover, are usually more interested in consumers' dichotomous choices, such as whether or not to purchase a product or to adopt a technology. With such a dichotomous dependent variable and more than two network effects to compare (for example, when we want to know which of a group of actors' networks, friendship, colleague, or family, plays the most influential role in their decision to purchase an iPad or not), no existing model can solve the problem.

We develop a probit model with multiple network autocorrelation terms to study competing network effects. We first use the Expectation-Maximization (E-M) algorithm, which is in the maximum likelihood tradition that has been widely adopted, and then hierarchical Bayesian methods, to develop two solutions. Both solutions are among the first of their kind. We also study the behavior of both solutions, for example how sensitive the estimates are to changes in the parameters' prior distributions. Preliminary experiments show that the E-M method cannot obtain the variance-covariance matrix of the parameters, so the hierarchical Bayesian solution is the only option. Our software is validated using the posterior quantiles method (Cook et al., 2006). We also study whether the solutions return correctly estimated parameters using both real and simulated data.

The rest of the paper is organized as follows. We discuss the literature on network effects models in Section 2. Our two solutions for the multi-network auto-probit model, E-M and hierarchical Bayesian, are presented in Section 3. In Section 4 we present the results of experiments for software validation and for observing the behavior of the parameter estimates. Conclusions and suggestions for future work complete the paper in Section 5.

2 Literature

The theoretical foundation for our research is social influence on the diffusion process. Diffusion is the "process by which an innovation is communicated through certain channels over time among the members of a social system ... a special type of communication concerned with the spread of messages that are perceived as new ideas" (Rogers, 1962). Network models have been widely used to study diffusion since the Bass (1969) model. The history of network models used to study the diffusion of innovations is reviewed by Valente (2005), who categorizes their evolution into three stages: macro models, spatial autocorrelation models, and network (effect) models. In the early era, the probability of adoption was related only to the time an actor had been exposed to the object of diffusion.
In 1969, Bass proposed a famous model that includes both a rate of innovation and a rate of imitation. This model can estimate both the influence from the social network and innovativeness, and is shown in the equations below. Let Y_{t-1} be the proportion that has adopted by time period t - 1; y_t be the proportion of new adoptions at t; b_0 be the coefficient of innovation, which is the probability of initial adoption; and b_1 be the coefficient of imitation.

y_t = (b_0 + b_1 Y_{t-1})(1 - Y_{t-1})
y_t = b_0 + (b_1 - b_0) Y_{t-1} - b_1 Y_{t-1}^2

Bass' model is still at the population level. Its assumption is that everyone in the social network has the same probability of interacting. Such an assumption is not realistic, because in a large social network the probability of two randomly chosen nodes being connected is not the same for all pairs. It seems fair to assume that people who are physically closer communicate more and exert greater influence on each other, so spatial autocorrelation was brought into the models used in the literature. The simultaneous autoregressive (SAR) model is widely used; the general methodology is described in Anselin (1988) and Cressie (1993). The SAR model is an autocorrelation model of the following form. We observe actor i's choice y_i (i = 1, ..., n), and X is a matrix of explanatory variables.

y = Xβ + θ
θ = ρWθ + ε

where β = (β_1, ..., β_m)^T is the coefficient vector of the explanatory variables and m is the number of explanatory variables; ε = (ε_1, ..., ε_n), and each ε_i follows a normal distribution Normal(0, σ_i²). The most common method used to fit SAR models has been maximum likelihood; see Ord (1975), Doreian (1980, 1982), and Smirnov (2005).

SAR is still a model that measures diffusion at the population level: it does not account for whether one actor is more or less likely to adopt based on his network position, nor does it show how the network structure influences diffusion. So researchers turned to network models to reflect these influences more accurately. The diffusion network model explains that initial adoption is based on an actor's innovativeness and exposure to sources of influence, and that this influence originates from alters who have already adopted and are able to persuade nonusers to adopt.

Before moving on, we need to take a look at event history analysis, which offers some quite useful tools for network analysis. Sometimes the factors that affect adoption also affect the formation of the network. It is then necessary to collect data at different time periods (panel data) and bring in event history analysis. The purpose of event history analysis is to explain why certain individuals are at a higher risk of experiencing the event of interest than others. The most commonly used methods include failure-time models, survival models, and hazard models. (An event is a transition from one status to another, e.g. from non-adoption to adoption.)

Network influences are captured by contagion models. Social contagion is the interpersonal connection over which an innovation is transmitted (Burt, 1987). The probability of an actor's adoption increases when the number or proportion of adopters in his network increases. The network exposure is defined below, where E_i is the proportion of actor i's neighbors who have adopted, y is the adoption variable, and w is the social network structure matrix:

E_i = ( Σ_{j=1}^{N} w_{ij} y_j ) / ( Σ_{j=1}^{N} w_{ij} )

Currently most network effect models can accommodate only one network, for example Burt's model (1987) and Leenders' model (1997).
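As a concrete illustration of the exposure measure above, the following is a minimal Python sketch. The adjacency matrix `w` and adoption vector `y` are hypothetical example data, not data from the paper.

```python
import numpy as np

def network_exposure(w, y):
    """Proportion of each actor's (weighted) neighbors who have adopted."""
    w = np.asarray(w, dtype=float)
    y = np.asarray(y, dtype=float)
    row_sums = w.sum(axis=1)
    row_sums[row_sums == 0] = 1.0   # guard against isolates with no neighbors
    return (w @ y) / row_sums

# Hypothetical 4-actor example: actors 0 and 2 have adopted (y = 1).
w = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]])
y = np.array([1, 0, 1, 0])
print(network_exposure(w, y))   # e.g. actor 0's exposure is 1/2
```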
However, an actor is very often under the influence of multiple networks, for example networks of friends and of colleagues. If a research question requires investigating which of several network effects plays the more significant role in consumers' decisions, none of these models works, so a model that can accommodate two or more networks is necessary. Cohesion and structural equivalence are two competing social network models for explaining the diffusion of innovation. While considerable work has been done on these models, the question of which of them explains diffusion has not been resolved. Doreian (1989) introduced a two-regime network effects autocorrelation model. This method allows us to investigate the effects of two networks on consumers' choices. The network autocorrelation model takes both the interdependence of actors and their attributes, such as demographics, into consideration. The interdependence is described by a weight matrix. Doreian's model can capture both an actor's intrinsic opinion and the influence from alters in his social network. The model is described below:

y = Xβ + ρ_1 W_1 y + ρ_2 W_2 y + ε

where y is the (continuous) dependent variable; X is a matrix of explanatory variables; the Ws represent the social structures underlying each autoregressive regime; ρ_1 and ρ_2 are the parameters of the two network effects, respectively; and ε is a normally distributed disturbance term.

However, all of the autocorrelation models mentioned above have a continuous dependent variable. Fujimoto and Valente (2011) developed a plausible solution by using logistic regression to investigate network influence on whether an adolescent uses alcohol or not. The model is shown below:

logit(y) = Xβ + ρWy + ε

where y is the dichotomous dependent variable, X are the independent variables and β the corresponding parameters, W is the social network structure and ρ the corresponding parameter, and ε is the error term. This kind of method is called "quick and dirty" (QAD) by Doreian (1982). Although it supports a binary dependent variable and multiple networks, this model does not satisfy the independence assumption of logistic regression: the observations are not independent. So the estimation results are biased.

3 Method

Although some network effects models are available, there are still few models with a dichotomous dependent variable. One notable work is Yang and Allenby (2003). They developed a hierarchical Bayesian autoregressive mixture model. Their model can only accommodate one network effect, although this single network effect can be compounded: Yang and Allenby define each component network to be associated with an explanatory variable, with the component coefficients summing to 1. They thus assume that all component network coefficients have the same sign and are statistically significant at the same time. Such assumptions do not hold if the effect of any of the component networks is statistically insignificant. We therefore propose a variant auto-probit model that accommodates multiple regimes of network effects for the same group of actors. We then provide two solutions for our model. The first is an E-M method, which is in the maximum likelihood tradition, and the second is a hierarchical Bayesian method. Detailed descriptions of both estimation procedures are given in Appendices A and B.

3.1 Model Specification

Assume we observe the choices of actors y_i (y_i ∈ {0, 1}, i = 1, ..., n) who are embedded in multiple networks. The choice is dichotomous, y_i = 0 or 1, and is driven by a latent variable z_i.
The probability that an actor makes the choice is determined by:

y = I(z > 0)
z = Xβ + θ + ε,   ε ∼ Normal_n(0, I_n)
θ = Σ_{i=1}^{k} ρ_i W_i θ + u,   u ∼ Normal_n(0, σ² I_n)

where z is the latent preference of the consumers: if it is larger than or equal to the threshold 0, the consumer chooses y = 1; if it is smaller than 0, the consumer chooses y = 0. X is an n × m covariate matrix, such as [1 X_0]; these covariates are the exogenous characteristics of the consumers. β is the m × 1 coefficient vector associated with X. θ is the autocorrelation term; it can be described as the aggregation of multiple network structures W_i and coefficients ρ_i, i = 1, ..., k. The Ws are network structures describing the connections among consumers, and the ρs are the coefficients of the Ws describing the effect size of each network. The error of the model has two parts, ε and u, so it is modeled as an augmented error: ε is the unobservable error term of z and u is the error term of θ. The benefit of this augmented model is that the latent error term u accounts for the nonzero covariances in the latent variable z; if we marginalize over θ, all the unobserved interdependency is isolated. The augmented error results in latent preferences with nonzero covariance:

z ∼ Normal(Xβ, Q)

where

Q = I_n + σ² (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-1} (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-T}.

From the specification of Q one can already sense that this is a computationally demanding problem. The network structures describing the relationships among actors are represented by the matrices W. It is important that we develop models allowing multiple competing network effects; existing models allow only one network structure, yet a group of actors can have multiple relationships, and their connections can be explained by different network theories. The Ws can be defined on the basis of relevant theories, for example W_i describing cohesion and W_j describing structural equivalence, or by different network relationships, such as W_i describing friendship and W_j describing colleagueship. The coefficient ρ_i describes the effect size of the corresponding matrix W_i. By accommodating multiple networks in an auto-probit model we can compare the effects of competing network structures for the same group of actors embedded in social networks.
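To make the covariance structure above concrete, here is a minimal simulation sketch of the latent-variable model of Section 3.1. The network matrices, sample size, and parameter values are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 2

# Illustrative covariates and two symmetric 0/1 network matrices (assumed data).
X = np.column_stack([np.ones(n), rng.normal(size=n)])
W = []
for _ in range(k):
    A = np.triu(rng.binomial(1, 0.05, (n, n)).astype(float), 1)
    W.append(A + A.T)               # symmetric, no self-ties

beta = np.array([0.5, 1.0])         # assumed coefficient values
rho = np.array([0.10, 0.05])        # assumed network effect sizes
sigma2 = 1.0                        # assumed variance of u

# B = I - sum_i rho_i * W_i must be invertible for theta to be well defined.
B = np.eye(n) - sum(r * Wi for r, Wi in zip(rho, W))
Binv = np.linalg.inv(B)

# Q = I + sigma^2 * B^{-1} B^{-T}: the implied covariance of the latent z.
Q = np.eye(n) + sigma2 * Binv @ Binv.T

z = rng.multivariate_normal(X @ beta, Q)   # latent preferences
y = (z > 0).astype(int)                    # observed dichotomous choices
print(y.mean())
```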
3.2 Expectation-Maximization Model

We first develop a solution using the maximum likelihood (MLE) approach. Since z is latent, we treat it as unobserved data. The E-M algorithm is one of the most widely used methods for this kind of problem. A detailed description of our solution for k regimes of network effects is given in Appendix A. Our method proceeds as follows: first estimate the value of the unobserved z given the current parameter set φ (φ = {β, ρ, σ²}), and use it to complete the data set {y, X, z}. We then use this completed data to estimate a new φ by maximizing the expectation of the likelihood of the complete data. We first initialize the parameters:

β_i ∼ Normal(ν_β, Ω_β)
ρ_j ∼ Normal(ν_ρ, Ω_ρ)
σ² ∼ Gamma(a, b)

where i = 1, ..., m and j = 1, ..., k. We then calculate the conditional expectation of the complete-data log likelihood in the E-step:

Q(φ | φ^(t)) = E_{z|y,φ^(t)}[log L(φ | z, y)]
= −(n/2) log 2π − (1/2) log|Q| − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} q̌_ij (E[z_i z_j] − E[z_i] X_j β − E[z_j] X_i β + X_i β X_j β)

where t is the step number, Q = Var(z) = I_n + σ² (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-1} (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-T}, and q̌_ij is an element of the matrix Q^{-1}. In the M-step, we maximize Q(φ | φ^(t)) to obtain β^(t+1), ρ^(t+1), and Σ^(t+1) (Σ = σ²) for the next step:

β^(t+1) = argmax_β Q(β | β^(t))
ρ^(t+1) = argmax_ρ Q(ρ | ρ^(t))
Σ^(t+1) = argmax_Σ Q(Σ | Σ^(t))

We replace φ^(t) with φ^(t+1) and repeat the E-step and M-step until all the parameters converge. Parameter estimates from the E-M algorithm converge to the MLE estimates (Wu, 1983). It is worth noting that the analytical solutions for the parameters are very complicated. For example:

β^(t+1) = argmax_β Q(φ | φ^(t))
∂ log Q(φ | φ^(t)) / ∂β = ∂/∂β [ −(1/2) (z − Xβ)^T Q^{-1} (z − Xβ) ]
β̂ = (X^T Q^{-1} X)^{-1} X^T Q^{-1} R

where R = E[z]. Although the analytical solution for β can still be obtained, it is complicated, and the situation gets even worse for the remaining parameters. Consider ρ:

ρ^(t+1) = argmax_ρ Q(φ | φ^(t))

Writing ρ = {ρ_1, ..., ρ_k} and, without loss of generality, considering ρ_1:

ρ_1^(t+1) = argmax_{ρ_1} Q(φ | φ^(t))
∂ log Q(φ | φ^(t)) / ∂ρ_1 = ∂/∂ρ_1 [ −(1/2) log|Q| − (1/2) (z − Xβ)^T Q^{-1} (z − Xβ) ]
∂/∂ρ_1 (z − Xβ)^T Q^{-1} (z − Xβ) = ∂/∂ρ_1 ( z^T Q^{-1} z − z^T Q^{-1} Xβ − β^T X^T Q^{-1} z + β^T X^T Q^{-1} Xβ )

Establishing an analytical solution for this parameter is possible only for very simple specifications; with multiple ρs an analytical solution is impossible, so we have to use numerical methods. We finally come to σ². Let σ² = Σ:

Σ^(t+1) = argmax_Σ Q(φ | φ^(t))
∂ log L / ∂Σ = ∂/∂Σ [ −(1/2) log|Q| − (1/2) (z − Xβ)^T Q^{-1} (z − Xβ) ]     (1)

The first term on the right-hand side of Equation (1) is:

∂/∂Σ log|Q| = ∂/∂Σ log| I_n + Σ (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-1} (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-T} |

The second term is:

∂/∂Σ (z − Xβ)^T Q^{-1} (z − Xβ) = ∂/∂Σ (z − Xβ)^T [ I_n + Σ (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-1} (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-T} ]^{-1} (z − Xβ)

This is again not solvable analytically, and numerical methods are needed to obtain the estimators for all parameters. Even though the maximization can be carried out numerically, E-M still cannot be applied in this situation: the mode of σ², the variance of the error term of the autocorrelation term θ, is at 0 (see Figure 1), so its maximum likelihood estimate is 0, which makes the variance-covariance matrix singular. Thus we have to find another solution.

Figure 1: Distribution of σ², the variance of θ, estimated by the E-M solution

3.3 Hierarchical Bayesian Solution

We then turn to Bayesian methods. Since the observed choice of a consumer is determined by his or her unobserved preference, the problem has a hierarchical structure, so it is natural to use a hierarchical Bayesian method. The model specification is the same as for the E-M solution: y is the observed dichotomous variable, modeled through the probit link, and z is the latent variable. The estimation (an MCMC method) is done by sequentially generating draws from a series of distributions. MCMC requires the specification of prior distributions for the model parameters and the derivation of the full conditional distributions of the parameters. The full conditional distributions of all the parameters we need to estimate are presented in the Rotational Conditional Maximization and/or Sampling (RCMS) table (Thomas, 2009), Table 1 below. The prior distributions of the model parameters are specified as follows; we generally adopt the priors proposed by Smith and LeSage (2004).

Table 1: RCMS table for the hierarchical Bayesian solution

Parameter   Density                   Draw Type
z           TrunNormal_n(Xβ + θ)      Single
β           Normal_n(ν_β, Ω_β)        Parallel
θ           Normal_n(ν_θ, Ω_θ)        Parallel
σ²          Gamma(a, b)               Parallel
ρ_i         Metropolis step           Sequential

β ∼ Normal(ν_β, Ω_β)
σ² ∼ Gamma(a, b)
ρ ∼ Normal(ν_ρ, Ω_ρ)

β follows a normal distribution with mean ν_β and variance Ω_β.
σ² follows an inverted gamma distribution with parameters a and b, and ρ follows a normal distribution. We then use Markov chain Monte Carlo (MCMC) to generate draws from the conditional posterior distributions of the parameters. A detailed description of the implementation of our method, including the conditional distributions of all parameters, is given in Appendix B.

3.4 Validation of Bayesian Software

One challenge of Bayesian methods is obtaining an error-free implementation. Bayesian solutions are usually complex, and the lack of ready-made software leads many researchers to develop their own, which increases the chance of error. Unfortunately most such implementations are not validated, and many contain errors and do not return correct estimates, so it is necessary to confirm that the code returns correct results. Compared with the literature on Bayesian software itself, the literature on validating it is sparse. We use the posterior quantiles method (Cook et al., 2006) to validate our software. The basic idea is that we draw a parameter θ from its prior distribution p(θ) and then generate data from the distribution p(y | θ). If the software is correctly coded, a 95% credible interval should contain the true parameter with probability 95%. The details are as follows. Assume we want to estimate the parameter θ in the Bayesian model p(θ | y) ∝ p(y | θ) p(θ), where p(θ) is the prior distribution of θ, p(y | θ) is the distribution of the data, and p(θ | y) is the posterior distribution. Define the quantile as:

q(θ_0) = P(Θ < θ_0)

where θ_0 is the true value drawn from the prior distribution and Θ is a series of draws from the posterior distribution. The estimated quantile is:

q̂(θ_0) = P̂(θ < θ_0) = (1/N) Σ_{i=1}^{N} I(θ_i < θ_0)

where the θ_i are the posterior draws generated by the software to be tested and N is the number of MCMC draws. The quantile is the probability that a posterior sample is smaller than the true value, and the estimated quantile is the fraction of posterior draws generated by the software that are smaller than the true value. If the software is correctly coded, then the distribution of the quantiles q̂(θ_0) should approach Uniform(0, 1) as N → ∞ (Cook et al., 2006). The whole process up to this point is defined as one replication. If we run a number of replications, we expect the q̂(θ_0) to be uniformly distributed, meaning the posterior draws are randomly distributed around the true value.

We now describe the simulations we ran. Assume the model we want to estimate is:

z = X_1 β_1 + X_2 β_2 + θ + ε
θ = ρ_1 W_1 θ + ρ_2 W_2 θ + u     (2)

The data are generated from a normal distribution:

X ∼ Normal(1, 4)

We then specified a conjugate prior distribution for each parameter and used Metropolis-Hastings sampling to simulate the posterior distributions:

β ∼ Normal(0, 1)
σ² ∼ InvGamma(5, 10)
ρ ∼ Normal(0.05, 0.05²)

We performed a simulation with 10 replications to validate our hierarchical Bayesian MCMC software. The generated sample size for X is 50, so the network structure matrices W are 50 by 50. In each replication we generated 20,000 draws from the posterior distribution of all the parameters in φ (φ = {β_1, β_2, ρ_1, ρ_2, σ²}) and kept one of every 20 draws, so we finally have 1000 draws for each parameter. We then count the number of draws larger than the true parameter value in each replication.
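A minimal sketch of this counting is shown below, assuming the per-replication posterior draws are already available as arrays. The variable names, the stand-in draws, and the uniformity check are illustrative, not taken from the paper's code.

```python
import numpy as np
from scipy import stats

def posterior_quantile(draws, true_value):
    """Fraction of posterior draws that fall below the true (prior-drawn) value."""
    return float(np.mean(np.asarray(draws) < true_value))

# Placeholder loop: in practice, `truth` would be drawn from the prior and
# `draws` would be the 1000 thinned MCMC draws of that parameter in one replication.
rng = np.random.default_rng(1)
quantiles = []
for replication in range(10):
    for parameter in range(5):
        truth = rng.normal()
        draws = truth + rng.normal(size=1000)   # stand-in for real posterior draws
        quantiles.append(posterior_quantile(draws, truth))

# If the sampler is coded correctly, the pooled quantiles should look Uniform(0, 1).
print(stats.kstest(quantiles, "uniform"))
```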
If the software is correctly written, each estimated value should be randomly distributed around the true value, so the number of estimates larger than the true value should be uniformly distributed across the 10 replications. We pooled all of these quantiles for the five parameters, 50 in total, and the sorted results are shown in Figure 2. The X-axis indexes the 50 replication-parameter combinations; the Y-axis is the number of draws larger than the true parameter in each replication. The red line represents the uniform distribution. As we can see, the combined results for the five parameters are uniformly distributed around the true values, confirming that our Bayesian software is correctly written.

Figure 2: Distribution of sorted quantiles of the parameters β_1, β_2, ρ_1, ρ_2, σ² over 10 replications of the posterior quantiles experiment

4 Experiments

Before using real data to validate the accuracy of the parameter estimates of our Bayesian solution, we want to make sure our solution generates robust estimates from the posterior distribution, so we first diagnose our MCMC. We study the sensitivity to the parameters' prior distributions first, because the solution will not have a proper posterior distribution without a proper prior, even if the code is correctly written. We first choose a flat prior distribution for ρ, ρ ∼ Normal(0, 100). As shown in Figure 3(a), the posterior draws of ρ have strong autocorrelation. We then choose a narrow prior distribution for ρ, ρ ∼ Normal(0.05, 0.05²); the posterior draws for ρ are shown in Figure 3(b), and the draws tend to be random. So the posterior distribution of ρ is sensitive to its prior distribution. In order to generate well-mixed draws, we choose Normal(0.05, 0.05²) as the prior distribution for ρ.

Figure 3: Prior sensitivity for parameter ρ, hierarchical Bayesian solution: (a) ρ ∼ Normal(0, 100); (b) ρ ∼ Normal(0.05, 0.05²)

Second, since the ρs are generated using sequential draws, autocorrelation exists between consecutive draws, and sometimes the autocorrelation is still high even when the lag between two draws is large. The autocorrelation plot for ρ_1 (Figure 4) shows that two ρ_1 draws have strong correlation even when the lag is larger than 300, so thinning of the draws is necessary.

Figure 4: Autocorrelation plot of ρ

We use Yang and Allenby (2003)'s Japanese car data to validate the accuracy of the parameter estimates of our Bayesian solution. The data consist of 857 actors' midsize car purchase information. The dependent variable is whether an actor purchased a Japanese car or not, where 1 stands for purchased and 0 otherwise. All the car models in the data are substitutable and have roughly similar prices. Researchers are interested in whether actors' preferences for Japanese cars are interdependent or not. The interdependence in the network is measured by geographical location:

W_ij = 1 if i and j have the same zip code; 0 otherwise.

Explanatory variables include actors' demographic information, such as age, annual household income, ethnic group, and education, and other information, such as the price of the car, whether optional accessories were purchased for the car, and the latitude and longitude of the actor's location. The comparison of the coefficient estimates from Yang and Allenby's code and our Bayesian solution is shown in Figure 5. In order to make a fair comparison, we set all the network effects except the first one to the n × n zero matrix. Our W_1 has the same definition as Yang and Allenby's W.
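A minimal sketch of how such a cohesion matrix might be built from a vector of zip codes follows; the `zips` array is a hypothetical example, not detail taken from Yang and Allenby's data.

```python
import numpy as np

def zipcode_network(zip_codes):
    """W_ij = 1 if actors i and j share a zip code (i != j), else 0."""
    z = np.asarray(zip_codes)
    W = (z[:, None] == z[None, :]).astype(float)
    np.fill_diagonal(W, 0.0)        # no self-ties
    return W

# Hypothetical example with 6 actors in 3 zip codes.
zips = ["15213", "15213", "15232", "15232", "15232", "15217"]
W1 = zipcode_network(zips)
print(W1)
```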
For the third method, we add one more network structure, W_2, the structural equivalence of two consumers. We use Euclidean distance to measure structural equivalence. In a directed network with unweighted edges, the Euclidean distance between two nodes i and j is computed from the squared differences between the ties that i and j send to (and, more generally, receive from) each other node. The distance is shown in equation (3):

d_ij = sqrt( Σ_{k=1, k≠i,j}^{N} (A_ik − A_jk)² )     (3)

where A_ik = 1 if nodes i and k are neighbors and 0 otherwise. The larger the distance d between nodes i and j, the less structurally equivalent they are. We take the inverse of d_ij plus one in order to construct a measure that is positively related to equivalence:

s_ij = 1 / (d_ij + 1)

The comparison is shown in Figure 5. Each box contains the estimates of one parameter from the three methods: the left one is from Yang and Allenby's code, the middle one is from NAP (our multi-network auto-probit model) with one network, and the right one is from NAP with two networks. All the coefficient estimates (the β̂_i, ρ̂, and σ̂²) from the three methods have similar means, standard deviations, and credible intervals. These results confirm again that NAP returns correct estimates of the parameters in the model.

Figure 5: Coefficient estimates comparison

One interesting finding is that the second network, structural equivalence, has a significant negative effect. This suggests a diminishing cluster effect: when the number of people in a cluster gets bigger, the influence does not grow proportionally. When the structural equivalence between two customers is large, they tend to be in the same community (zip code); when that community is large, i.e. it contains many customers, the two customers have more common neighbors and thus more identical components in their tie vectors.

5 Conclusion

We introduced an auto-probit model to study the binary choices of a group of actors who have multiple network relationships among them. We specified the model in both E-M and hierarchical Bayesian frameworks and developed estimation solutions for both of them. We found that the E-M solution cannot estimate the parameters, so only the hierarchical Bayesian solution can be used here. We also validated our Bayesian solution using the posterior quantiles method, and the results show that our software returns accurate estimates. Finally, we compared the estimates returned by Yang and Allenby's method, NAP with one network effect (cohesion), and NAP with two network effects (cohesion and structural equivalence), using real data. The experiments showed that all three returned the same estimates, confirming that our software returns correct parameter estimates.

Our future plans include, first, running our software on more benchmark data with better-defined network structures. We also want to run more experiments with simulated populations to evaluate the properties of the solution, for example letting W have different features, such as networks with randomly distributed edges, clustered edges, or skewed degree distributions. Second, assuming Wθ has a strong effect, we will vary ρ's true value from small to large and observe whether our solution can capture the variation. Third, we want to compare our program with QAD, because although the parameter estimates returned by QAD are known to be biased, we do not know how different they are from the true values. Finally, we also want to study how multicollinearity between the Xs, and between X and Wθ, affects the estimated results.
Acknowledgement

This work was supported in part by AT&T and the iLab at Heinz College, Carnegie Mellon University.

A E-M solution implementation

A.1 Derivation

(I_n − Σ_{i=1}^{k} ρ_i W_i) θ = u
θ = (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-1} u
θ ∼ Normal( 0, σ² (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-1} (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-T} )

We then get the distribution of z | β, ρ, σ²:

z ∼ Normal(Xβ, Q), where Q = I_n + σ² (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-1} (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-T}

The joint distribution of y and z can be factored as:

p(y|z) p(z|β, ρ, σ²) = p(y, z|β, ρ, σ²) = p(z|y; β, ρ, σ²) p(y)     (4)

The right-hand side of equation (4) involves two distributions we already have, as shown below. Note that although z follows a normal distribution, z|y follows a truncated normal distribution.

p(y) = [ (1/√(2π)) exp( −(1/2) (z − Xβ)^T (z − Xβ) ) I(z > 0) ] / Φ(Xβ)
     = [ (1/√(2π)) exp( −(1/2) Σ_{i=1}^{n} (z_i − x_i β)² ) I(z > 0) ] / Φ(Xβ)

z | β, ρ, σ² ∼ Normal(Xβ, Q)
z | y, X; β, ρ, σ² ∼ TrunNormal(Xβ, Q)

We now show how the algorithm works using a simple example. Assume we only consider the parameter β:

p(β, z|y) = p(β|z, y) p(z|y)
z | y, X; β ∼ TrunNormal(Xβ, Q)

Assuming Var(z) = 1,

L(β|z) = (1/√(2π)) exp( −(1/2) Σ_{i=1}^{n} (z_i − X_i β)² )
β̂ = (X^T X)^{-1} X^T R     (R = E[z])

We then include the other parameters, ρ and σ², with Var(z) = Q:

E[z]^(t+1) = E[z | y, β^(t)] = f(β^(t), y)

log L(β, ρ, σ² | z) = log p(z | β, ρ, σ²)
= log( 1/√(2π|Q|) ) − (1/2) (z − Xβ)^T Q^{-1} (z − Xβ)

If we decompose the matrices above as vector products, then:

= log( 1/√(2π|Q|) ) − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} (z_i − X_i β) q̌_ij (z_j − X_j β)
= log( 1/√(2π|Q|) ) − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} q̌_ij (z_i − X_i β)(z_j − X_j β)
= log( 1/√(2π|Q|) ) − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} q̌_ij (z_i z_j − z_i X_j β − z_j X_i β + X_i β X_j β)     (5)

where q̌_ij is an element of Q̌ = Q^{-1}, and

E[z | θ, y] = ∫ z p(z) dz = R

A.2 Expectation step

The E-M algorithm consists of two steps; the first is the expectation step.

Q(φ | φ^(t)) = E_{z|y,φ^(t)}[log L(φ | z, y)]
= E[ log( 1/√(2π|Q|) ) ] − (1/2) E[ (z − Xβ)^T Q^{-1} (z − Xβ) ]
= n log(1/√(2π)) − (1/2) log|Q| − (1/2) E[ (z − Xβ)^T Q^{-1} (z − Xβ) ]
= −(n/2) log 2π − (1/2) log|Q| − (1/2) E[ Σ_{i=1}^{n} Σ_{j=1}^{n} q̌_ij (z_i z_j − z_i X_j β − z_j X_i β + X_i β X_j β) ]
= −(n/2) log 2π − (1/2) log|Q| − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} q̌_ij (E[z_i z_j] − E[z_i] X_j β − E[z_j] X_i β + X_i β X_j β)

where φ is the parameter set and t is the step number.

A.3 Maximization step

The second step of the E-M algorithm is maximization.

β^(t+1) = argmax_β Q(φ | φ^(t))
        = argmax_β [ log( 1/√(2π|Q|) ) − (1/2) (z − Xβ)^T Q^{-1} (z − Xβ) ]     (6)

If we solve Equation (6) analytically, then:

∂ log L / ∂β = ∂/∂β [ −(1/2) (z − Xβ)^T Q^{-1} (z − Xβ) ]
∂/∂β (z − Xβ)^T Q^{-1} (z − Xβ) = ∂/∂β ( z^T Q^{-1} z − z^T Q^{-1} Xβ − β^T X^T Q^{-1} z + β^T X^T Q^{-1} Xβ )
= −z^T Q^{-1} X − X^T Q^{-1} z + X^T Q^{-1} Xβ     (7)

Setting Equation (7) to 0:

−z^T Q^{-1} X − X^T Q^{-1} z + X^T Q^{-1} Xβ = 0
X^T Q^{-1} Xβ = z^T Q^{-1} X + X^T Q^{-1} z
β = (X^T Q^{-1} X)^{-1} z^T Q^{-1} X + (X^T Q^{-1} X)^{-1} X^T Q^{-1} z
β̂ = (X^T Q^{-1} X)^{-1} X^T Q^{-1} R

ρ^(t+1) = argmax_ρ Q(φ | φ^(t))

Assume ρ = {ρ_1, ..., ρ_k}; without loss of generality, consider ρ_1:

ρ_1^(t+1) = argmax_{ρ_1} Q(φ | φ^(t))
∂ log L / ∂ρ_1 = ∂/∂ρ_1 [ −(1/2) log|Q| − (1/2) (z − Xβ)^T Q^{-1} (z − Xβ) ]
∂/∂ρ_1 log|Q| = −tr(W_1 Q^{-1})
∂/∂ρ_1 (z − Xβ)^T Q^{-1} (z − Xβ) = ∂/∂ρ_1 ( z^T Q^{-1} z − z^T Q^{-1} Xβ − β^T X^T Q^{-1} z + β^T X^T Q^{-1} Xβ )

It is impossible to obtain an analytical solution for ρ_i.
Let σ² = Σ:

Σ^(t+1) = argmax_Σ Q(φ | φ^(t))
∂ log L / ∂Σ = ∂/∂Σ [ −(1/2) log|Q| − (1/2) (z − Xβ)^T Q^{-1} (z − Xβ) ]     (8)

The first term on the right-hand side of Equation (8) is:

∂/∂Σ log|Q| = ∂/∂Σ log| I_n + Σ (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-1} (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-T} |

The second term is:

∂/∂Σ (z − Xβ)^T Q^{-1} (z − Xβ) = ∂/∂Σ (z − Xβ)^T [ I_n + Σ (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-1} (I_n − Σ_{i=1}^{k} ρ_i W_i)^{-T} ]^{-1} (z − Xβ)

This is again not solvable analytically.

B Markov chain Monte Carlo estimation

The Markov chain Monte Carlo method generates a chain of draws from the conditional posterior distributions of the parameters. Our solution consists of the following steps.

Step 1. Generate z; z follows a truncated normal distribution:

z ∼ TrunNormal_n(Xβ + θ, I_n)

where I_n is the n × n identity matrix. If y_i = 1 then z_i ≥ 0; if y_i = 0 then z_i < 0.

Step 2. Generate β, β ∼ Normal(ν_β, Ω_β).

1. Define β_0 = (0, ..., 0)^T.
2. Define D = hI_m, a baseline variance matrix corresponding to the prior p(β), where h is a large constant, e.g. 400, and I_m is the m × m identity matrix (m being the number of covariates). Then D^{-1} = (1/400) I_m = σ_0² I_m, where σ_0² = 1/400 is a small number close to 0 compared with the variance of a Normal(0, 1).
3. Ω_β = (D^{-1} + X^T X)^{-1}. This is because

z = Xβ + θ + ε
z − θ − ε = Xβ
β = X^{-1}(z − θ − ε)

hence β ∼ Normal( X^{-1}(z − θ), (X^T X)^{-1} ). We add the prior precision as an offset to the initial variance, so Ω_β = (D^{-1} + X^T X)^{-1}.
4. Then ν_β can be represented by ν_β = Ω_β ( X^T (z − θ) + D^{-1} β_0 ).

Step 3. Generate θ, θ ∼ Normal(ν_θ, Ω_θ).

1. First define B = I_n − ρ_1 W_1 − ρ_2 W_2:

θ = ρ_1 W_1 θ + ρ_2 W_2 θ + u
θ − ρ_1 W_1 θ − ρ_2 W_2 θ = u
(I_n − ρ_1 W_1 − ρ_2 W_2) θ = u
Bθ = u

2. Then Ω_θ = ( I_n + B^T B / σ² )^{-1}. Since Bθ = u, θ = B^{-1} u, and Var(u) = σ² I_n,

Var(θ) = Var(B^{-1} u) = B^{-1} (B^{-1})^T Var(u) = (B^T B)^{-1} σ² = ( B^T B / σ² )^{-1}

We then add an offset I_n to B^T B / σ², so Var(θ) = Ω_θ = ( I_n + B^T B / σ² )^{-1}.

3. ν_θ = Ω_θ (z − Xβ). This follows because

z = Xβ + θ + ε
z = Xβ + (I_n − ρ_1 W_1 − ρ_2 W_2)^{-1} u + ε
θ = (z − Xβ) − ε

Step 4. Generate σ², σ² ∼ Gamma(a, b), with

a = s_0 + n/2
b = ( θ^T B^T B θ + 2 q_0 ) / 2

where s_0 = 5 and n is the size of the data.

Step 5. Finally we generate the coefficients of the Ws, the ρ_i, using Metropolis-Hastings sampling: ρ_i^new = ρ_i^old + Δ_i, where Δ_i ∼ Normal(0, 0.01). The acceptance probability α is:

α = min( [ |B_new| exp( −(1/(2σ²)) θ^T B_new^T B_new θ ) ] / [ |B_old| exp( −(1/(2σ²)) θ^T B_old^T B_old θ ) ], 1 )

C Solution diagnostics

We ran an MCMC experiment to confirm that there is no autocorrelation among the draws of each parameter. In this experiment, we set the length of the MCMC chain to 30,000, the burn-in to 10,000, and the thinning to 20; thinning is used to remove the autocorrelation between draws. The trace plots of the 1000 draws remaining after burn-in and thinning are shown in Figure 6.

Figure 6: Trace plots for a two-network auto-probit model

There are 12 plots in total; each depicts the draws for one parameter. The first 9 plots, from left to right and top to bottom, are the traces of the β_i, the coefficients of the independent variables. Each point represents the value of the estimated coefficient β̂_i, and the red line represents the mean. We observe that all β̂_i are randomly distributed around their means, and the means are significant, showing that the estimation results are valid. The 10th and 11th plots are for the two estimated network effect coefficients ρ̂_1 and ρ̂_2. Both ρ̂_i are also significant and randomly distributed around their means. The only coefficient showing autocorrelation is σ².
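As an illustration of the Metropolis step for the ρ_i described in Appendix B (Step 5), here is a minimal Python sketch. It updates all the ρ_i jointly for brevity, whereas the paper draws them sequentially; the proposal standard deviation, θ, σ², and the network matrices are placeholders, and this is not the code used for the experiments.

```python
import numpy as np

def metropolis_rho_step(rho, W, theta, sigma2, rng, sd=0.1):
    """One random-walk Metropolis update of the network-effect coefficients rho.

    rho    : (k,) current coefficient values
    W      : list of k (n, n) network matrices
    theta  : (n,) current draw of the autocorrelated term
    sigma2 : current draw of the variance of u
    """
    n = W[0].shape[0]

    def log_target(r):
        B = np.eye(n) - sum(ri * Wi for ri, Wi in zip(r, W))
        sign, logdet = np.linalg.slogdet(B)
        if sign <= 0:                # B singular or with negative determinant: reject
            return -np.inf
        return logdet - (theta @ B.T @ B @ theta) / (2.0 * sigma2)

    proposal = rho + rng.normal(0.0, sd, size=rho.shape)
    if np.log(rng.uniform()) < log_target(proposal) - log_target(rho):
        return proposal
    return rho
```

The `log_target` guard against a non-invertible B connects directly to the validity region for (ρ_1, ρ_2) discussed next.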
Note that not all values of ρ_1 and ρ_2 make B (B = I_n − ρ_1 W_1 − ρ_2 W_2) invertible. The plot below shows the relationship between the values of ρ_1 and ρ_2 and the invertibility of B: the green area is where B is invertible, and the red area is where it is not. If we limit the draws to the green area, ρ_1 and ρ_2 will be correlated. When we draw ρ_1 and ρ_2 from a bivariate normal, there is no correlation between them (see Figure 7). We conclude that any correlation between ρ_1 and ρ_2 comes from the definitions of W_1 and W_2, not from the prior.

Figure 7: Scatter plot of ρ_1 and ρ_2 over the valid region where B is invertible

References

Allenby, G. M. and Rossi, P. E. (1998). Marketing models of consumer heterogeneity. Journal of Econometrics, 89(1-2):57–78.

Anselin, L. (1988). Spatial Econometrics: Methods and Models. Studies in Operational Regional Science. Springer, 1st edition.

Bass, F. M. (1969). A new product growth for model consumer durables. Management Science, 15(5):215–227.

Bernheim, B. D. (1994). A theory of conformity. Journal of Political Economy, 102(5):841–877.

Burt, R. S. (1987). Social contagion and innovation: Cohesion versus structural equivalence. American Journal of Sociology, 92(6):1287–1335.

Case, A. C. (1991). Spatial patterns in household demand. Econometrica, 59(4):953–965.

Cook, S. R., Gelman, A., and Rubin, D. B. (2006). Validation of software for Bayesian models using posterior quantiles. Journal of Computational and Graphical Statistics, 15:675–692.

Cressie, N. A. C. (1993). Statistics for Spatial Data. Probability and Statistics series. Wiley-Interscience, revised edition.

Doreian, P. (1980). Linear models with spatially distributed data: Spatial disturbances or spatial effects. Sociological Methods and Research, 9(1):29–60.

Doreian, P. (1982). Maximum likelihood methods for linear models: Spatial effects and spatial disturbance terms. Sociological Methods and Research, 10(3):243–269.

Doreian, P. (1989). Two regimes of network effects autocorrelation. In The Small World. Ablex Publishing.

Duncan, O. D., Haller, A. O., and Portes, A. (1968). Peer influences on aspirations: A reinterpretation. The American Journal of Sociology, 74(2):119–137.

Fujimoto, K. and Valente, T. W. (2011). Network influence on adolescent alcohol use: Relational, positional, and affiliation-based peer influence.

Kamakura, W. A. and Russell, G. J. (1989). A probabilistic choice model for market segmentation and elasticity structure. Journal of Marketing Research, 26(4):379–390.

Ord, K. (1975). Estimation methods for models of spatial interaction. Journal of the American Statistical Association, 70(349):120–126.

Rogers, E. M. (1962). Diffusion of Innovations. Free Press.

Smirnov, O. A. (2005). Computation of the information matrix for models with spatial interaction on a lattice. Journal of Computational and Graphical Statistics, 14(4):910–927.

Smith, T. E. and LeSage, J. P. (2004). A Bayesian probit model with spatial dependencies. In Pace, K. R. and LeSage, J. P., editors, Advances in Econometrics, Volume 18: Spatial and Spatiotemporal Econometrics, pages 127–160. Elsevier.

Thomas, A. C. (2009). Hierarchical Models for Relational Data. PhD dissertation, Harvard University, Cambridge, MA.

Valente, T. W. (2005). Models and methods for innovation diffusion. In Carrington, P. J., Scott, J., and Wasserman, S., editors, Models and Methods in Social Network Analysis. Cambridge University Press.

Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103.
Yang, S. and Allenby, G. M. (2003). Modeling interdependent consumer preferences. Journal of Marketing Research, 40:282–294.