Auto-probit Model for Multiple Regimes of Network Effects

Bin Zhang, Heinz College, iLab, Carnegie Mellon University
David Krackhardt, Heinz College, iLab, Carnegie Mellon University
Andrew C. Thomas, Department of Statistics, iLab, Carnegie Mellon University
Patrick Doreian, Department of Sociology, University of Pittsburgh, and Faculty of Social Sciences, University of Ljubljana
Ramayya Krishnan, Heinz College, iLab, Carnegie Mellon University

Heinz College Working Paper, Carnegie Mellon University, 2011
Abstract

Many researchers believe that consumers' decisions are determined not only by their personal tastes but also by the decisions of people in their networks. Social scientists, moreover, are often interested in consumers' dichotomous choices, so an auto-probit model that accommodates multiple networks is very useful. Current methods for investigating multiple autocorrelated network effects on the same group of actors embedded in social networks are, however, primitive. A few solutions exist for two networks (e.g., Doreian 1989), but they do not extend easily to three or more, and even fewer solutions exist when the dependent variable is binary. We develop two solutions, Expectation-Maximization (E-M) and hierarchical Bayesian, for auto-probit models that accommodate multiple network structures; both are among the first of their kind. We also study the behavior of the solutions, such as the impact of prior distributions, network structures, and sizes of network effects on parameter estimation, using both real and simulated data.
Contents

1 Introduction
2 Literature
3 Method
   3.1 Model Specification
   3.2 Expectation-Maximization Model
   3.3 Hierarchical Bayesian Solution
   3.4 Validation of Bayesian Software
4 Experiments
5 Conclusion
A E-M solution implementation
   A.1 Derivation
   A.2 Expectation step
   A.3 Maximization step
B Markov chain Monte Carlo estimation
C Solution diagnostics
1 Introduction

In earlier times, researchers believed that consumers' preferences and choices are determined only by their own attributes, such as age, education, and income (Kamakura and Russell, 1989; Allenby and Rossi, 1998). More and more researchers have since realized that those decisions are driven not only by individuals' personal tastes but also by the decisions of the people they are connected to, i.e. the decisions of neighbors in their social networks. Such effects have been called "peer influence" (Duncan et al., 1968), "neighborhood effects" (Case, 1991), and "conformity" (Bernheim, 1994), among other names. Autocorrelation models with a network structure term were developed to allow researchers to investigate such problems. Eventually such models became insufficient, because an actor is very often under the influence of multiple networks, and if we want to compare which network effect plays a more influential role in an individual's decision, we have very limited tools: some models include two network autocorrelation terms, such as Doreian's (1989) two-regimes network effects autocorrelation model, but not more. Social scientists, moreover, are usually interested in consumers' dichotomous choices, such as whether or not to purchase a product or adopt a technology. With such a dichotomous dependent variable, if we have more than two network effects to compare (for example, whether the friendship, colleague, or family network of a group of actors plays the most influential role in their decision to purchase an iPad), no existing model can solve the problem.

We develop a probit model with multiple network autocorrelation terms to study these competing network effects. We derive two solutions: first an Expectation-Maximization (E-M) algorithm, which is in the tradition of maximum likelihood, a widely adopted method, and then a hierarchical Bayesian solution. Both solutions are among the first of their kind. We also study the behavior of both solutions, for example how sensitive the estimates are to changes in the parameters' prior distributions. Preliminary experiments show that the E-M method cannot obtain the variance-covariance matrix of the parameters, so the hierarchical Bayesian solution is the only practical option. Our software is validated using the posterior quantiles method (Cook et al., 2006). We also study whether the solutions return correctly estimated parameters using real and simulated data.

The rest of the paper is organized as follows. We discuss the literature on network effects models in Section 2. Our two solutions for the multi-network auto-probit model, E-M and hierarchical Bayesian, are presented in Section 3. In Section 4 we present the results of experiments for software validation and for observing the behavior of the parameter estimates. Conclusions and suggestions for future work complete the paper in Section 5.
2 Literature
The theoretical foundation for our research is social influence on the diffusion process. Diffusion is
the “process by which an innovation is communicated through certain channels over time among the
members of a social system ... a special type of communication concerned with the spread of messages
that are perceived as new ideas” (Rogers, 1962). The network model has been widely used to study
diffusion since the Bass (1969) model. A history of network models used to study diffusion of innovations
is reviewed by Valente (2005). He categorized the evolution of network models into three stages: macro models, spatial autocorrelation models, and network (effect) models. In the early era, the probability of adoption was related only to the time that an actor was exposed to the object of diffusion. In 1969, Bass proposed a famous model that includes both a rate of innovation and a rate of imitation. This model can estimate both the influence from the social network and innovativeness, and is shown in the equations below. Let $Y_{t-1}$ be the proportion that has adopted by time period $t-1$; $y_t$ be the proportion of new adoptions at $t$; $b_0$ be the coefficient of innovation, which is the probability of initial adoption; and $b_1$ be the coefficient of imitation.

\[
\frac{y_t}{1 - Y_{t-1}} = b_0 + b_1 Y_{t-1}
\]
\[
y_t = b_0 + (b_1 - b_0)\,Y_{t-1} - b_1 Y_{t-1}^2
\]
Bass's model is still at the population level. Its assumption is that everyone in the social network has the same probability of interacting. Such an assumption is not realistic: in a large social network, the probability that two randomly chosen nodes are connected is not the same for all pairs. It seems fair to assume that people who are physically closer communicate more and exert greater influence on each other, so spatial autocorrelation was brought into the models used in the literature. Simultaneous autoregressive (SAR) models are widely used; the general methodology is described in Anselin (1988) and Cressie (1993). The SAR model is an autocorrelation model of the following form. We observe actor $i$'s choice $y_i$, $(i = 1, ..., n)$, and $X$ contains the explanatory variables:

\[
y = X\beta + \theta
\]
\[
\theta = \rho W \theta + \epsilon
\]

where the coefficient vector of the explanatory variables is $\beta = (\beta_1, ..., \beta_m)^{\top}$ and $m$ is the number of explanatory variables; $\epsilon = (\epsilon_1, ..., \epsilon_n)$, and each $\epsilon_i$ follows a Normal$(0, \sigma_i^2)$ distribution. The most common method used to fit SAR models has been maximum likelihood; see Ord (1975), Doreian (1980, 1982), and Smirnov (2005).
SAR is still a model that measures diffusion at the population level; it does not account for whether one actor is more or less likely to adopt based on his network position, nor does it show how network structure influences diffusion. So researchers turned to network models to more accurately reflect these influences. The diffusion network model posits that initial adoption is based on an actor's innovativeness and exposure to sources of influence, and that this influence originates from alters who have already adopted and are able to persuade nonusers to adopt. Before moving on, we need to take a look at event history analysis, which offers some quite useful tools for network analysis. Sometimes the factors that affect adoption also affect the formation of the network, so it is necessary to collect data at different time periods (panel data) and bring in event history analysis. The purpose of event history analysis is to explain why certain individuals are at higher risk of experiencing the event of interest than others. (An event is a transition from one status to another, e.g. from non-adoption to adoption.) The most commonly used methods include failure-time, survival, and hazard models.

Network influences are captured by contagion models. Social contagion is the interpersonal connection over which innovation is transmitted (Burt, 1987). The probability of each actor's adoption increases
when the number or proportion of the adopters in his network increases. The network exposure is
defined as below, where $E_i$ is the proportion of actor $i$'s neighbors who have adopted, $y$ is the adoption variable, and $W$ is the social network structure matrix:

\[
E_i = \frac{\sum_{j=1}^{N} w_{ij}\, y_j}{\sum_{i=1}^{N} w_i}
\]
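For concreteness, a small numerical sketch (in Python; not part of the original study) of this exposure calculation is given below. The toy adjacency matrix `W` and adoption vector `y` are assumptions made only for illustration, and the denominator is interpreted as actor i's own row sum (the usual row-normalization).

```python
import numpy as np

# Toy 4-actor network (assumed data): W[i, j] = 1 if j is a neighbor of i.
W = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
y = np.array([1, 0, 1, 0], dtype=float)   # current adoption status

degree = W.sum(axis=1)
# Exposure E_i: share of i's neighbors who have already adopted.
E = (W @ y) / np.where(degree > 0, degree, 1.0)
print(E)   # actor 0 has neighbors {1, 2}, one of whom adopted -> 0.5
```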
Currently most network effect models can accommodate only one network, for example Burt's model (1987) and Leenders' model (1997). However, an actor is very often under the influence of multiple networks, for example friends and colleagues. If a study requires investigating which effect among multiple networks plays a more significant role in consumers' decisions, none of these models will work, so a model that can accommodate two or more networks is necessary. Cohesion and structural equivalence are two competing social network models for explaining the diffusion of innovation. While considerable work has been done on these models, the question of which network model explains diffusion has not been resolved.
Doreian (1989) introduced a two-regimes network effects autocorrelation model. This method allows us to investigate the effects of two networks on consumers' choices. The network autocorrelation model takes into consideration both the interdependence of actors and their attributes, such as demographics; the interdependence is described by a weight matrix. Doreian's model can capture both an actor's intrinsic opinion and the influence from alters in his social network. The model is:

\[
y = X\beta + \rho_1 W_1 y + \rho_2 W_2 y + \epsilon
\]

where $y$ is the dependent variable, an integer variable; $X$ is a vector of explanatory variables; the $W$s represent the social structures underlying each autoregressive regime; $\rho_1$ and $\rho_2$ are the parameters of the two network effects, respectively; and $\epsilon$ is a normally distributed disturbance term.
However, all of the autocorrelation models mentioned above have a continuous dependent variable. Fujimoto and Valente (2011) developed a plausible solution using logistic regression to investigate network influence on whether an adolescent uses alcohol or not. The model is shown below:

\[
\mathrm{logit}(y) = X\beta + \rho W y + \epsilon
\]

where $y$ is the dichotomous dependent variable; $X$ are the independent variables and $\beta$ the corresponding parameters; $W$ is the social network structure and $\rho$ the corresponding parameter; and $\epsilon$ is the error term. This kind of method is called "quick and dirty" (QAD) by Doreian (1982). Although it supports a binary dependent variable and multiple networks, this model does not satisfy the independence assumption of logistic regression: the observations are not independent, so the estimation results are biased.
3 Method

Although some network effect models are available, there are still few models with a dichotomous dependent variable. One notable work is Yang and Allenby (2003). They developed a hierarchical Bayesian autoregressive mixture model. Their model can only accommodate one network effect, although this single network effect can be a compound of several components: Yang and Allenby defined each component network to be associated with an explanatory variable, with the component coefficients summing to 1. They thus assumed that all component network coefficients must have the same sign and be statistically significant at the same time. Such assumptions do not hold if the effect of any of the component networks is statistically insignificant. We therefore propose a variant auto-probit model that accommodates multiple regimes of network effects for the same group of actors, and we provide two solutions for it. The first is an E-M method, which is in the maximum likelihood tradition, and the second is a hierarchical Bayesian method. Detailed descriptions of both estimation procedures are given in Appendices A and B.
3.1 Model Specification

Assume we observe the choices of actors $y_i$, ($y_i \in \{0, 1\}$, $i = 1, ..., n$), who are embedded in multiple networks. The choice is dichotomous, $y_i = 0$ or $1$, and is driven by a latent variable $z_i$. The model for the actors' choices is:

\[
y = I(z > 0)
\]
\[
z = X\beta + \theta + \epsilon, \qquad \epsilon \sim \mathrm{Normal}_n(0, I_n)
\]
\[
\theta = \sum_{i=1}^{k} \rho_i W_i \theta + u, \qquad u \sim \mathrm{Normal}_n(0, \sigma^2 I_n)
\]
where $z$ is the latent preference of consumers: if it is larger than or equal to the threshold 0, the consumer chooses $y = 1$; if it is smaller than 0, the consumer chooses $y = 0$. $X$ is an $n \times m$ covariate matrix, such as $[1\; X_0]$; these covariates are the exogenous characteristics of consumers. $\beta$ is the $m \times 1$ coefficient vector associated with $X$. $\theta$ is the autocorrelation term, which can be described as the aggregation of multiple network structures $W_i$ and coefficients $\rho_i$, $i = 1, ..., k$. The $W$s are network structures describing connections among consumers, and the $\rho$s are the coefficients of the $W$s describing the effect size of each network. The error of the model has two parts, $\epsilon$ and $u$, so it is modeled as an augmented error: $\epsilon$ is the unobservable error term of $z$ and $u$ is the error term of $\theta$. The benefit of such an augmented model is that the latent error term $u$ accounts for the nonzero covariances in the latent variable $z$; if we marginalize over $\theta$, all the unobserved interdependency is isolated.
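To make the data-generating process concrete, the following sketch simulates one draw of (z, y) from the model above for two small, randomly generated networks. It is only an illustration: the sample size, seed, parameter values, and the way the W matrices are generated and row-normalized are all assumptions, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, k = 50, 2, 2                       # actors, covariates, networks
X = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])
beta = np.array([0.5, 1.0])
rho = np.array([0.10, 0.05])
sigma2 = 1.0

# Two assumed row-normalized network matrices W_1, W_2.
W = []
for _ in range(k):
    A = (rng.random((n, n)) < 0.1).astype(float)
    np.fill_diagonal(A, 0.0)
    A /= np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    W.append(A)

# theta = (I - sum_i rho_i W_i)^{-1} u,  with u ~ N(0, sigma^2 I)
B = np.eye(n) - sum(r * Wi for r, Wi in zip(rho, W))
u = rng.normal(scale=np.sqrt(sigma2), size=n)
theta = np.linalg.solve(B, u)

z = X @ beta + theta + rng.normal(size=n)    # latent preference
y = (z > 0).astype(int)                      # observed dichotomous choice
```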
The augmented error results in latent preferences with nonzero covariance:

\[
z \sim \mathrm{Normal}(X\beta, Q), \qquad
Q = I_n + \sigma^2 \left(I_n - \sum_{i=1}^{k} \rho_i W_i\right)^{-1} \left[\left(I_n - \sum_{i=1}^{k} \rho_i W_i\right)^{-1}\right]^{\top}
\]

From the specification of $Q$ we can sense that, computationally, this is a significant problem.
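A sketch of how Q could be formed numerically, reusing the quantities from the simulation sketch above, illustrates the burden: a dense n × n inverse that has to be recomputed whenever any ρ_i changes. The function below is illustrative only.

```python
import numpy as np

def latent_covariance(W_list, rho, sigma2):
    """Q = I + sigma^2 * B^{-1} (B^{-1})^T, with B = I - sum_i rho_i W_i.

    Forming B^{-1} densely costs O(n^3) and must be repeated every time
    any rho_i changes, which is why Q is the computational bottleneck.
    """
    n = W_list[0].shape[0]
    B = np.eye(n) - sum(r * Wi for r, Wi in zip(rho, W_list))
    B_inv = np.linalg.inv(B)
    return np.eye(n) + sigma2 * B_inv @ B_inv.T
```

Continuing the earlier simulation sketch, `latent_covariance(W, rho, sigma2)` would return the 50 × 50 covariance matrix of z.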
The network structures describing the relationships among actors are represented by the matrices $W$. It is important that we develop models allowing multiple, competing network effects; so far, existing models allow only one network structure. Since a group of actors can have multiple relationships, and connections can be explained by different network theories, the $W$s can be defined on the basis of relevant theories (for example, $W_i$ describing cohesion and $W_j$ describing structural equivalence) or by different network relationships (such as $W_i$ describing friendship and $W_j$ describing colleagueship). The coefficient $\rho_i$ describes the effect size of the corresponding matrix $W_i$. By accommodating multiple networks in an auto-probit model, we can compare the effects of competing network structures for the same group of actors embedded in social networks.
3.2 Expectation-Maximization Model

We first develop a solution using a maximum likelihood (MLE) approach. Since $z$ is latent, we treat it as unobserved data; the E-M algorithm is one of the most commonly used methods for this kind of problem. A detailed description of our solution for $k$ regimes of network effects is in Appendix A.

Our method proceeds as follows: first estimate the value of the unobserved $z$ given the current parameter set $\phi = \{\beta, \rho, \sigma^2\}$ and use it to complete the data set $\{y, X, z\}$; then use this completed data to estimate a new $\phi$ by maximizing the expectation of the complete-data likelihood.

We first initialize the parameters:

\[
\beta_i \sim \mathrm{Normal}(\nu_\beta, \Omega_\beta), \qquad
\rho_j \sim \mathrm{Normal}(\nu_\rho, \Omega_\rho), \qquad
\sigma^2 \sim \mathrm{Gamma}(a, b)
\]

where $i = 1, ..., m$ and $j = 1, ..., k$.
We then calculate the conditional expectation in the E-step:

\[
Q(\phi \mid \phi^{(t)}) = E_{z \mid y, \phi^{(t)}}[\log L(\phi \mid z, y)]
= -\frac{n}{2}\log 2\pi - \frac{1}{2}\log |Q|
- \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \check{q}_{ij}\left(E[z_i z_j] - E[z_i]X_j\beta - E[z_j]X_i\beta + X_i X_j \beta^2\right)
\]

where $t$ is the step number, $Q = \mathrm{Var}(z) = I_n + \sigma^2 \left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1}\left[\left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1}\right]^{\top}$, and $\check{q}_{ij}$ is an element of the matrix $Q^{-1}$.

In the M-step, we maximize $Q(\phi \mid \phi^{(t)})$ to obtain $\beta^{(t+1)}$, $\rho^{(t+1)}$, and $\Sigma^{(t+1)}$ (where $\Sigma = \sigma^2$) for the next step:

\[
\beta^{(t+1)} = \arg\max_{\beta} Q(\beta \mid \beta^{(t)}), \qquad
\rho^{(t+1)} = \arg\max_{\rho} Q(\rho \mid \rho^{(t)}), \qquad
\Sigma^{(t+1)} = \arg\max_{\Sigma} Q(\Sigma \mid \Sigma^{(t)})
\]

We replace $\phi^{(t)}$ with $\phi^{(t+1)}$ and repeat the E-step and M-step until all the parameters converge. Parameter estimates from the E-M algorithm converge to the MLE estimates (Wu, 1983).
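A schematic of this iteration is sketched below. The `e_step` and `m_step` callables are assumed helpers standing in for the expectations and maximizations derived in Appendix A; the stopping rule and tolerance are likewise assumptions.

```python
import numpy as np

def em_fit(y, X, W_list, phi0, e_step, m_step, max_iter=200, tol=1e-6):
    """Generic E-M iteration.  `e_step` and `m_step` are caller-supplied
    functions implementing the expectations and maximizations derived in
    Appendix A; phi is the parameter dict {'beta': ..., 'rho': ..., 'sigma2': ...}."""
    phi = dict(phi0)
    for _ in range(max_iter):
        moments = e_step(y, X, W_list, phi)    # E[z], E[z z^T] given y and current phi
        phi_new = m_step(X, W_list, moments)   # argmax of Q(phi | phi_t)
        change = max(float(np.max(np.abs(np.asarray(phi_new[key]) - np.asarray(phi[key]))))
                     for key in phi)
        phi = phi_new
        if change < tol:                       # stop once all parameters have converged
            break
    return phi
```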
It is worth noting that the analytical solution for all the parameters is very complicated. For example,

\[
\beta^{(t+1)} = \arg\max_{\beta} Q(\phi \mid \phi^{(t)})
\]
\[
\frac{\partial \log Q(\phi \mid \phi^{(t)})}{\partial \beta}
= \frac{\partial}{\partial \beta}\left[-\frac{1}{2}(z - X\beta)^{\top} Q^{-1} (z - X\beta)\right]
\]
\[
\hat{\beta} = \left(X^{\top} Q^{-1} X\right)^{-1} X^{\top} Q^{-1} R
\]

where $R = E(z)$.
Although the analytical solution for $\beta$ can still be obtained, it is complicated, and the situation gets even worse for the remaining parameters. Consider $\rho$:

\[
\rho^{(t+1)} = \arg\max_{\rho} Q(\phi \mid \phi^{(t)})
\]

Writing $\rho = \{\rho_1, ..., \rho_k\}$ and, without loss of generality, considering $\rho_1$:

\[
\rho_1^{(t+1)} = \arg\max_{\rho_1} Q(\phi \mid \phi^{(t)})
\]
\[
\frac{\partial \log Q(\phi \mid \phi^{(t)})}{\partial \rho_1}
= \frac{\partial}{\partial \rho_1}\left[-\frac{1}{2}\log |Q| - \frac{1}{2}(z - X\beta)^{\top} Q^{-1} (z - X\beta)\right]
\]
\[
\frac{\partial}{\partial \rho_1}(z - X\beta)^{\top} Q^{-1} (z - X\beta)
= \frac{\partial}{\partial \rho_1}\left(z^{\top} Q^{-1} z - z^{\top} Q^{-1} X\beta - \beta^{\top} X^{\top} Q^{-1} z + \beta^{\top} X^{\top} Q^{-1} X\beta\right)
\]

Establishing an analytical solution for this parameter is possible only for very simple specifications; given multiple $\rho$s, an analytical solution is impossible, so we have to use a numerical method.
We finally come to $\sigma^2$. Let $\sigma^2 = \Sigma$:

\[
\Sigma^{(t+1)} = \arg\max_{\Sigma} Q(\phi \mid \phi^{(t)})
\]
\[
\frac{\partial \log L}{\partial \Sigma}
= \frac{\partial}{\partial \Sigma}\left[-\frac{1}{2}\log |Q| - \frac{1}{2}(z - X\beta)^{\top} Q^{-1} (z - X\beta)\right] \tag{1}
\]

The first term on the right-hand side of Equation (1) is:

\[
\frac{\partial}{\partial \Sigma}\log |Q|
= \frac{\partial}{\partial \Sigma}\log \left| I_n + \Sigma \left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1}\left[\left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1}\right]^{\top} \right|
\]

The second term is:

\[
\frac{\partial}{\partial \Sigma}(z - X\beta)^{\top} Q^{-1} (z - X\beta)
= \frac{\partial}{\partial \Sigma}\,(z - X\beta)^{\top} \left\{ I_n + \Sigma \left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1}\left[\left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1}\right]^{\top} \right\}^{-1} (z - X\beta)
\]

This is again not solvable analytically, and numerical methods are needed to get the estimators for all the parameters. However, even though the maximization can be carried out numerically, E-M still cannot be applied in this situation. This is because the mode of $\sigma^2$, the variance of the error term of the autocorrelation term $\theta$, is at 0 (see Figure 1), so its maximum likelihood estimate is 0, which makes the variance-covariance matrix singular. Thus we have to find another solution.

Figure 1: Distribution of σ², the variance of θ, estimated by the E-M solution
3.3 Hierarchical Bayesian Solution

We then turn to Bayesian methods. Since the observed choice of a consumer is determined by his or her unobserved preference, the problem has a hierarchical structure, so it is natural to use a hierarchical Bayesian method. The model specification is the same as for the E-M solution: $y$ is the observed dichotomous variable, modeled through a probit function, and $z$ is the latent variable. Estimation (by MCMC) is done by sequentially generating draws from a series of distributions. MCMC requires specification of prior distributions for the model parameters and derivation of the full conditional distribution of each parameter. The full conditional distributions of all the parameters we need to estimate are presented in the Rotational Conditional Maximization and/or Sampling (RCMS) table (Thomas, 2009) below.

Table 1: RCMS table for the hierarchical Bayesian solution

Parameter   Density                      Draw Type
z           TrunNormal_n(Xβ + θ)         Single
β           Normal_n(ν_β, Ω_β)           Parallel
θ           Normal_n(ν_θ, Ω_θ)           Parallel
σ²          Gamma(a, b)                  Parallel
ρ_i         Metropolis step              Sequential

The prior distributions of the model parameters are specified as follows. We generally adopted the priors proposed by Smith and LeSage (2004):

\[
\beta \sim \mathrm{Normal}(\nu_\beta, \Omega_\beta), \qquad
\sigma^2 \sim \mathrm{Gamma}(a, b), \qquad
\rho \sim \mathrm{Normal}(\nu_\rho, \Omega_\rho)
\]

$\beta$ follows a normal distribution with mean $\nu_\beta$ and variance $\Omega_\beta$; $\sigma^2$ follows an inverted gamma distribution with parameters $a$ and $b$; and $\rho$ follows a normal distribution. We then use Markov chain Monte Carlo (MCMC) to generate draws from the conditional posterior distributions of the parameters. A detailed description of the implementation, including the conditional distributions of all parameters, is given in Appendix B.
3.4 Validation of Bayesian Software

One challenge of Bayesian methods is getting an error-free solution. Bayesian solutions usually have high complexity, and the lack of off-the-shelf software causes many researchers to develop their own, which increases the chance of error. Unfortunately most such implementations are not validated, and many contain errors and do not return correct estimates, so it is necessary to confirm that the code returns correct results. Compared with the availability of Bayesian software, its validation has received relatively little attention in the literature. We use the posterior quantiles method (Cook et al., 2006) to validate our software. The basic idea is to draw a parameter θ from its prior distribution p(θ) and then generate data from p(y | θ); if the software is correctly coded, the 95% credible interval should contain the true parameter with probability 95%. The details are described as follows.

Assume we want to estimate the parameter θ in the Bayesian model p(θ | y) ∝ p(y | θ)p(θ), where p(θ) is the prior distribution of θ, p(y | θ) is the distribution of the data, and p(θ | y) is the posterior distribution.
Define the quantile as

\[
q(\theta_0) = P(\Theta < \theta_0)
\]

where $\theta_0$ is the true value drawn from the prior distribution and $\Theta$ is a draw from the posterior distribution. The estimated quantile is

\[
\hat{q}(\theta_0) = \hat{P}(\theta < \theta_0) = \frac{1}{N}\sum_{i=1}^{N} I(\theta_i < \theta_0)
\]

where the $\theta_i$ are posterior draws generated by the software under test and $N$ is the number of MCMC draws. The quantile is the probability that a posterior sample is smaller than the true value, and the estimated quantile is the fraction of posterior draws generated by the software that are smaller than the true value. If the software is correctly coded, then the distribution of the quantile $\hat{q}(\theta_0)$ approaches Uniform(0, 1) as $N \to \infty$ (Cook et al., 2006). The whole process up to this point is defined as one replication. If we run a number of replications, we expect to observe the $\hat{q}(\theta_0)$ uniformly distributed, meaning that the posterior draws are randomly distributed around the true value.
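A compact sketch of this validation loop is shown below. `draw_prior`, `simulate_data`, and `fit_posterior` are assumed stand-ins for the prior sampler, the data simulator, and the software under test; the sketch only illustrates the Cook et al. (2006) procedure, not the original validation code.

```python
import numpy as np

def posterior_quantile_check(n_reps, n_draws, draw_prior, simulate_data, fit_posterior):
    """One quantile per replication: the fraction of posterior draws below the
    true parameter value.  Under correct software these should look Uniform(0, 1)."""
    quantiles = []
    for _ in range(n_reps):
        theta0 = draw_prior()               # true value drawn from the prior
        y = simulate_data(theta0)           # data generated at theta0
        draws = fit_posterior(y, n_draws)   # posterior draws from the software under test
        quantiles.append(np.mean(draws < theta0))
    return np.array(quantiles)

# Sorted quantiles lying close to the diagonal of a uniform Q-Q plot indicate
# correctly written software; systematic departures indicate a coding error.
```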
We now describe the simulations we ran. Assume the model we want to estimate is:

\[
z = X_1\beta_1 + X_2\beta_2 + \theta + \epsilon, \qquad
\theta = \rho_1 W_1 \theta + \rho_2 W_2 \theta + u \tag{2}
\]

The data are generated using a normal distribution:

\[
X \sim \mathrm{Normal}(1, 4)
\]

We then specified a conjugate prior distribution for each parameter and used Metropolis-Hastings sampling to simulate the posterior distributions:

\[
\beta \sim \mathrm{Normal}(0, 1), \qquad
\sigma^2 \sim \mathrm{InvGamma}(5, 10), \qquad
\rho \sim \mathrm{Normal}(0.05, 0.05^2)
\]
We performed a simulation with 10 replications to validate our hierarchical Bayesian MCMC software. The generated sample size for X is 50, so the size of each network structure W is 50 by 50. In each replication we generated 20,000 draws from the posterior distribution of all the parameters in φ (φ = {β1, β2, ρ1, ρ2, σ²}) and kept one of every 20 draws, so that we finally have 1,000 draws for each parameter. We then count the number of draws larger than the true parameter value in each replication. If the software is correctly written, each estimate should be randomly distributed around the true value, so the number of estimates larger than the true value should be uniformly distributed across the 10 replications. We pooled the quantiles for the five parameters, 50 in total, and the sorted results are shown in Figure 2. The X-axis indexes the 50 replication-parameter combinations; the Y-axis is the number of draws larger than the true parameter in each replication; and the red line represents the uniform distribution. As we can see, the combined results for the five parameters are uniformly distributed around the true values, confirming that our Bayesian software is correctly written.
Figure 2: Distribution of sorted quantiles of the parameters β1, β2, ρ1, ρ2, σ², from 10 replications of the posterior quantiles experiment
4 Experiments

Before using real data to validate the accuracy of the parameter estimates from our Bayesian solution, we want to make sure the solution generates robust estimates from the posterior distribution, so we first diagnose our MCMC. We begin by studying the sensitivity of the estimates to the prior distributions of the parameters, because the solution will not have a proper posterior distribution without a proper prior even if the code is correctly written. We first choose a flat prior distribution for ρ, ρ ∼ Normal(0, 100). As shown in Figure 3(a), the posterior draws of ρ have strong autocorrelation. We then choose a narrower prior distribution for ρ, ρ ∼ Normal(0.05, 0.05²); the posterior draws for ρ are shown in Figure 3(b), and the draws now tend to be random. So the posterior distribution of ρ is sensitive to its prior distribution. In order to generate well-mixed draws, we choose Normal(0.05, 0.05²) as the prior distribution for ρ.

Second, since the ρs are generated by sequential draws, autocorrelation exists between consecutive draws, and sometimes the autocorrelation remains high even when the lag between two draws is large. The autocorrelation plot for ρ1 (Figure 4) shows that two ρ1 draws have strong correlation even when the lag is larger than 300, so thinning of the draws is necessary.
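A minimal way to check this, assuming the raw draws for a parameter are stored in a one-dimensional array, is sketched below; the lag values and thinning interval are illustrative.

```python
import numpy as np

def autocorr(chain, lag):
    """Sample autocorrelation of an MCMC chain at a given lag."""
    x = chain - chain.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

def thin(chain, step=20):
    """Keep every `step`-th draw to reduce autocorrelation between kept draws."""
    return chain[::step]

# Example usage (assumed chain of rho_1 draws):
# for lag in (1, 50, 300):
#     print(lag, autocorr(rho1_draws, lag), autocorr(thin(rho1_draws), lag))
```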
Figure 3: Prior sensitivity for parameter ρ, hierarchical Bayesian solution. (a) ρ ∼ Normal(0, 100); (b) ρ ∼ Normal(0.05, 0.05²)

We use Yang and Allenby's (2003) Japanese car data to validate the accuracy of the parameter estimates from our Bayesian solution. The data consist of 857 actors' midsize car purchase information. The dependent variable is whether an actor purchased a Japanese car or not, where 1 stands for purchased and 0 otherwise. All the car models in the data are substitutable and have roughly similar prices. Researchers are interested in whether the preferences for Japanese cars among actors are interdependent or not. The interdependence in the network is measured by geographical location:

\[
W_{ij} =
\begin{cases}
1, & \text{if } i \text{ and } j \text{ have the same zip code;} \\
0, & \text{otherwise.}
\end{cases}
\]

Explanatory variables include actors' demographic information, such as age, annual household income, ethnic group, and education, and other information such as the price of the car, whether optional accessories were purchased for the car, and the latitude and longitude of the actor's location.
The comparison of the coefficient estimates from Yang and Allenby's code and from our Bayesian solution is shown in Figure 5. In order to make a fair comparison, we set all the network effects except the first one to the 0_{n,n} matrix; our W1 has the same definition as Yang and Allenby's W. For the third method, we add one more network structure, W2, the structural equivalence of two consumers. We use Euclidean distance to measure structural equivalence. In a directed network with unweighted edges, the Euclidean distance between two nodes i and j is computed from the differences between their connection profiles, i.e. the ties from i and j to all other nodes (and, in general, the ties from all other nodes to i and j). The distance is shown in equation (3):

\[
d_{ij} = \sqrt{\sum_{k=1,\, k \neq i,j}^{N} (A_{ik} - A_{jk})^2} \tag{3}
\]

where

\[
A_{ik} =
\begin{cases}
1, & \text{if nodes } i \text{ and } k \text{ are neighbors} \\
0, & \text{otherwise}
\end{cases}
\]

Figure 4: Autocorrelation plot of ρ

The larger d is between nodes i and j, the less structurally equivalent they are. We take the inverse of d_{ij} plus one in order to construct a measure with a positive relationship to role equivalence:

\[
s_{ij} = \frac{1}{d_{ij} + 1}
\]
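For illustration, the two weight matrices could be constructed along the following lines, assuming a vector of zip codes and a binary adjacency matrix as inputs. For simplicity the structural-equivalence sketch compares only out-tie profiles (rows of A); a fully directed version would also include the in-tie columns, as in the definition above.

```python
import numpy as np

def zipcode_matrix(zipcodes):
    """W1[i, j] = 1 when actors i and j share a zip code (diagonal set to 0)."""
    z = np.asarray(zipcodes)
    W1 = (z[:, None] == z[None, :]).astype(float)
    np.fill_diagonal(W1, 0.0)
    return W1

def structural_equivalence_matrix(A):
    """s_ij = 1 / (d_ij + 1), where d_ij is the Euclidean distance between the
    adjacency profiles of i and j (rows of A, excluding columns i and j)."""
    n = A.shape[0]
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            keep = [k for k in range(n) if k not in (i, j)]
            d = np.sqrt(np.sum((A[i, keep] - A[j, keep]) ** 2))
            S[i, j] = 1.0 / (d + 1.0)
    return S
```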
The comparison is shown in Figure 5. Each box contains the estimates of one parameter from three methods: the left one is from Yang and Allenby's code, the middle one is from NAP with one network, and the right one is from NAP with two networks. All the coefficient estimates, β̂_i, ρ̂_1, and σ̂², of the three methods have similar means, standard deviations, and credible intervals. Such results confirm again that NAP returns correct estimates of the parameters in the model. One interesting finding is that the effect size of the second network, structural equivalence, is significantly negative, which suggests a diminishing cluster effect: when the number of people in the cluster gets bigger, the influence does not grow proportionally. When the structural equivalence between two customers is large, they are in the same community (zip code); when the size of that community, i.e. the number of customers, is large, the customers have more common neighbors and thus more identical components in their adjacency vectors.
Figure 5: Coefficient estimates comparison

5 Conclusion

We introduced an auto-probit model to study the binary choices of a group of actors that have multiple network relationships among them. We specified the model with both E-M and hierarchical Bayesian methods and developed estimation solutions for both. We found that the E-M solution cannot estimate the parameters, so only the hierarchical Bayesian solution can be used here. We also validated our Bayesian solution using the posterior quantiles method, and the results show that our software returns accurate estimates. Finally, using real data, we compared the estimates returned by Yang and Allenby's method, by NAP with one network effect (cohesion), and by NAP with two network effects (cohesion and structural equivalence). The experiments showed that all three return the same estimates, confirming that our software returns correct parameter estimates. Our future plans include, first, running our software on more benchmark data with better-defined network structures. We also want to run more experiments with simulated populations to evaluate the properties of the solution, for example letting W have different features, such as networks with randomly distributed edges, clustered edges, or skewed edge distributions. Second, assuming Wθ has a strong effect, we will vary the true value of ρ from small to large and observe whether our solution can capture the variation. Third, we want to compare our program with QAD, because although it is known that the parameter estimates returned by QAD are biased, we do not know how different they are from the true values. Finally, we also want to study how multicollinearity among the Xs, and between X and Wθ, affects the estimated results.
Acknowledgement
This work was supported in part by AT&T and the iLab at Heinz College, Carnegie Mellon University.
A E-M solution implementation

A.1 Derivation

\[
\left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)\theta = u
\qquad\Longrightarrow\qquad
\theta = \left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1} u
\]
\[
\theta \sim \mathrm{Normal}\!\left(0,\; \sigma^2 \left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1}\left[\left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1}\right]^{\top}\right)
\]

We then get the distribution of $z \mid \beta, \rho, \sigma^2$:

\[
z \sim \mathrm{Normal}(X\beta, Q), \qquad
Q = I_n + \sigma^2 \left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1}\left[\left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1}\right]^{\top}
\]

The joint distribution of $y$ and $z$ can be transformed as:

\[
p(y \mid z)\, p(z \mid \beta, \rho, \sigma^2) = p(y, z \mid \beta, \rho, \sigma^2) = p(z \mid y; \beta, \rho, \sigma^2)\, p(y) \tag{4}
\]
The right-hand side of equation (4) consists of two distributions we already have, as shown below. Note that although $z$ follows a normal distribution, $z \mid y$ follows a truncated normal distribution.

\[
p(y) = \frac{\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}(z - X\beta)^{\top}(z - X\beta)\right) I(z > 0)}{\Phi(X\beta)}
= \frac{\frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}\sum_{i=1}^{n}(z_i - x_i\beta)^2\right) I(z > 0)}{\Phi(X\beta)}
\]
\[
z \mid \beta, \rho, \sigma^2 \sim \mathrm{Normal}(X\beta, Q), \qquad
z \mid y, X; \beta, \rho, \sigma^2 \sim \mathrm{TrunNormal}(X\beta, Q)
\]
We now show how the algorithm works using a simple example. Assume we consider only the parameter $\beta$:

\[
p(\beta, z \mid y) = p(\beta \mid z, y)\, p(z \mid y), \qquad
z \mid y, X; \beta \sim \mathrm{TrunNormal}(X\beta, Q)
\]

Assume $\mathrm{Var}(z) = 1$. Then

\[
L(\beta \mid z) = \frac{1}{\sqrt{2\pi}}\exp\!\left(-\frac{1}{2}\sum_{i=1}^{n}(z_i - X_i\beta)^2\right)
\]
\[
\hat{\beta} = (X^{\top}X)^{-1}X^{\top}R \qquad (R = E[z])
\]

Then we include more parameters, $\rho$ and $\sigma^2$, with $\mathrm{Var}(z) = Q$:

\[
E[z]^{(t+1)} = E[z \mid y, \beta^{(t)}] = f(\beta^{(t)}, y)
\]
\[
\begin{aligned}
\log L(\beta, \rho, \sigma^2 \mid z) &= \log p(z \mid \beta, \rho, \sigma^2)
= \log \prod_{i=1}^{n} p(z_i \mid \beta, \rho, \sigma^2) \\
&= \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi|Q|}} - \frac{1}{2}(z - X\beta)^{\top}Q^{-1}(z - X\beta) \\
&= \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi|Q|}} - \frac{1}{2}\left(z^{\top}Q^{-1}z - z^{\top}Q^{-1}X\beta - \beta^{\top}X^{\top}Q^{-1}z + \beta^{\top}X^{\top}Q^{-1}X\beta\right)
\end{aligned} \tag{5}
\]

If we decompose the matrices above as vector products, then:

\[
\begin{aligned}
(5) &= \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi|Q|}} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(z_i - X_i\beta)\,\check{q}_{ij}\,(z_j - X_j\beta) \\
&= \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi|Q|}} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\check{q}_{ij}\,(z_i - X_i\beta)(z_j - X_j\beta) \\
&= \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi|Q|}} - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\check{q}_{ij}\,(z_i z_j - z_i X_j\beta - z_j X_i\beta + X_i X_j\beta^2)
\end{aligned}
\]

where $\check{q}_{ij}$ is the element of $\check{Q}$, and $\check{Q} = Q^{-1}$. Also,

\[
E[z \mid \theta, y] = \int z\, p(z)\, dz = R
\]

A.2 Expectation step
The E-M algorithm consists of two steps; the first is the expectation step.

\[
\begin{aligned}
Q(\phi \mid \phi^{(t)}) &= E_{z \mid y, \phi^{(t)}}[\log L(\phi \mid z, y)] \\
&= E\!\left[\sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi|Q|}}\right] - \frac{1}{2}E\!\left[(z - X\beta)^{\top}Q^{-1}(z - X\beta)\right] \\
&= n \log \frac{1}{\sqrt{2\pi}} - \frac{1}{2}\log|Q| - \frac{1}{2}E\!\left[(z - X\beta)^{\top}Q^{-1}(z - X\beta)\right] \\
&= -\frac{n}{2}\log 2\pi - \frac{1}{2}\log|Q| - \frac{1}{2}E\!\left[\sum_{i=1}^{n}\sum_{j=1}^{n}\check{q}_{ij}\,(z_i z_j - z_i X_j\beta - z_j X_i\beta + X_i X_j\beta^2)\right] \\
&= -\frac{n}{2}\log 2\pi - \frac{1}{2}\log|Q| - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\check{q}_{ij}\left(E[z_i z_j] - E[z_i]X_j\beta - E[z_j]X_i\beta + X_i X_j\beta^2\right)
\end{aligned}
\]

where $\phi$ is the parameter set and $t$ is the step number.
A.3 Maximization step

The second step of the E-M algorithm is the maximization step.

\[
\beta^{(t+1)} = \arg\max_{\beta} Q(\phi \mid \phi^{(t)})
= \arg\max_{\beta} \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi|Q|}} - \frac{1}{2}(z - X\beta)^{\top}Q^{-1}(z - X\beta) \tag{6}
\]

If we directly use an analytical method to solve Equation (6) above, then:

\[
\frac{\partial \log L}{\partial \beta} = \frac{\partial}{\partial \beta}\left[-\frac{1}{2}(z - X\beta)^{\top}Q^{-1}(z - X\beta)\right]
\]
\[
\frac{\partial}{\partial \beta}(z - X\beta)^{\top}Q^{-1}(z - X\beta)
= \frac{\partial}{\partial \beta}\left(z^{\top}Q^{-1}z - z^{\top}Q^{-1}X\beta - \beta^{\top}X^{\top}Q^{-1}z + \beta^{\top}X^{\top}Q^{-1}X\beta\right)
= -z^{\top}Q^{-1}X - X^{\top}Q^{-1}z + X^{\top}Q^{-1}X\beta \tag{7}
\]

Setting Equation (7) to 0, then:

\[
\begin{aligned}
-z^{\top}Q^{-1}X - X^{\top}Q^{-1}z + X^{\top}Q^{-1}X\beta &= 0 \\
X^{\top}Q^{-1}X\beta &= z^{\top}Q^{-1}X + X^{\top}Q^{-1}z \\
\beta &= (X^{\top}Q^{-1}X)^{-1}z^{\top}Q^{-1}X + (X^{\top}Q^{-1}X)^{-1}X^{\top}Q^{-1}z \\
\hat{\beta} &= \left(X^{\top}Q^{-1}X\right)^{-1}X^{\top}Q^{-1}R
\end{aligned}
\]
\[
\rho^{(t+1)} = \arg\max_{\rho} Q(\phi \mid \phi^{(t)})
\]

Assume $\rho = \{\rho_1, ..., \rho_k\}$; without loss of generality, consider $\rho_1$:

\[
\rho_1^{(t+1)} = \arg\max_{\rho_1} Q(\phi \mid \phi^{(t)})
\]
\[
\frac{\partial \log L}{\partial \rho_1} = \frac{\partial}{\partial \rho_1}\left[-\frac{1}{2}\log|Q| - \frac{1}{2}(z - X\beta)^{\top}Q^{-1}(z - X\beta)\right]
\]
\[
\frac{\partial}{\partial \rho_1}\log|Q| = -\operatorname{tr}(W_1 Q^{-1})
\]
\[
\frac{\partial}{\partial \rho_1}(z - X\beta)^{\top}Q^{-1}(z - X\beta)
= \frac{\partial}{\partial \rho_1}\left(z^{\top}Q^{-1}z - z^{\top}Q^{-1}X\beta - \beta^{\top}X^{\top}Q^{-1}z + \beta^{\top}X^{\top}Q^{-1}X\beta\right)
\]

It is impossible to obtain an analytical solution for $\rho_i$.
Let $\sigma^2 = \Sigma$:

\[
\Sigma^{(t+1)} = \arg\max_{\Sigma} Q(\phi \mid \phi^{(t)})
\]
\[
\frac{\partial \log L}{\partial \Sigma} = \frac{\partial}{\partial \Sigma}\left[-\frac{1}{2}\log|Q| - \frac{1}{2}(z - X\beta)^{\top}Q^{-1}(z - X\beta)\right] \tag{8}
\]

The first term on the right-hand side of the equation above is:

\[
\frac{\partial}{\partial \Sigma}\log|Q|
= \frac{\partial}{\partial \Sigma}\log\left| I_n + \Sigma\left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1}\left[\left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1}\right]^{\top} \right|
\]

The second term is:

\[
\frac{\partial}{\partial \Sigma}(z - X\beta)^{\top}Q^{-1}(z - X\beta)
= \frac{\partial}{\partial \Sigma}\,(z - X\beta)^{\top}\left\{ I_n + \Sigma\left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1}\left[\left(I_n - \sum_{i=1}^{k}\rho_i W_i\right)^{-1}\right]^{\top} \right\}^{-1}(z - X\beta)
\]

This is again not solvable by analytical methods.
B Markov chain Monte Carlo estimation

The Markov chain Monte Carlo method generates a chain of draws from the conditional posterior distributions of the parameters. Our solution consists of the following steps.

Step 1. Generate z; z follows a truncated normal distribution:

\[
z \sim \mathrm{TrunNormal}_n(X\beta + \theta, I_n)
\]

where $I_n$ is the $n \times n$ identity matrix. If $y_i = 1$, then $z_i \geq 0$; if $y_i = 0$, then $z_i < 0$.
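A sketch of this draw, assuming scipy is available, is given below; it is an illustration of Step 1 rather than the original implementation.

```python
import numpy as np
from scipy.stats import truncnorm

def draw_z(y, mean):
    """Draw each z_i from N(mean_i, 1) truncated to [0, inf) when y_i = 1
    and to (-inf, 0) when y_i = 0 (Step 1 above); mean = X @ beta + theta."""
    lower = np.where(y == 1, 0.0, -np.inf)
    upper = np.where(y == 1, np.inf, 0.0)
    # truncnorm takes bounds standardized with respect to loc and scale.
    a = (lower - mean) / 1.0
    b = (upper - mean) / 1.0
    return truncnorm.rvs(a, b, loc=mean, scale=1.0)
```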
Step 2. Generate β, β ∼ Normal(ν_β, Ω_β).

1. Define β_0 = (0, 0, ..., 0)⊤.

2. Define D = hI, a baseline variance matrix corresponding to the prior p(β), where h is a large constant, e.g. 400, and I is an identity matrix. Then

\[
D^{-1} =
\begin{pmatrix}
\sigma_0^2 & 0 & \cdots & 0 \\
0 & \sigma_0^2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sigma_0^2
\end{pmatrix}
=
\begin{pmatrix}
\frac{1}{400} & 0 & \cdots & 0 \\
0 & \frac{1}{400} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \frac{1}{400}
\end{pmatrix}
\]

We set $\sigma_0^2 = \frac{1}{400}$, a small number close to 0 compared with a Normal(0, 1) prior, for which $\sigma_0^2$ would be 1.

3. $\Omega_\beta = \left(D^{-1} + X^{\top}X\right)^{-1}$. This is because:

\[
z = X\beta + \theta + \epsilon
\;\Longrightarrow\;
z - \theta - \epsilon = X\beta
\;\Longrightarrow\;
\beta = X^{-1}(z - \theta - \epsilon)
\;\Longrightarrow\;
\beta \sim \mathrm{Normal}\!\left(X^{-1}(z - \theta),\, (X^{\top}X)^{-1}\right)
\]

We need an offset for the prior variance, so $\Omega_\beta = \left(D^{-1} + X^{\top}X\right)^{-1}$.

4. Then $\nu_\beta$ can be represented as $\nu_\beta = \Omega_\beta\left(X^{\top}(z - \theta) + D^{-1}\beta_0\right)$.
1. First, define B = In − ρ1 W1 − ρ2 W2
θ = ρ1 W1 θ + ρ2 W2 θ + u
θ − ρ1 W1 θ − ρ2 W2 θ = u
(In − ρ1 W1 − ρ2 W2 )θ = u
Bθ = u
20
2. Then Ωθ =
B> B
In +
σ2
−1
Bθ = u
θ = B−1 u
Since Var(u) = σ 2 In
Var(θ) = Var(B−1 u)
>
= B−1 B−1 Var(u)
= (B> B)−1 σ 2 In
−1
= σ −2 B> B
> −1
B B
=
σ2
−1
B> B
B> B
. So Var(θ) = Ωθ = In +
We then add an offset In to
σ2
σ2
3. ν θ = Ωθ (z − Xβ)
z = Xβ + θ + z = Xβ + (In − ρ1 W1 − ρ2 W2 )−1 u + θ = (z − Xβ) − 7
Step 4. Generate σ², σ² ∼ Gamma(a, b), with

\[
a = s_0 + \frac{n}{2}, \qquad
b = \frac{\theta^{\top}B^{\top}B\theta}{2} + q_0
\]

where $s_0 = 5$ and $n$ is the size of the data.
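Putting Steps 2, 3, and 4 together, the three conditional draws could be sketched as follows. The identity matrix in D is taken to have the dimension of β, the prior mean β_0 is 0, and the default constants (h = 400, s_0 = 5, q_0 = 10) follow the values mentioned above or in the simulation priors; all of this is an assumed reading of the steps, not the original code.

```python
import numpy as np

def draw_beta(z, theta, X, rng, h=400.0):
    """Step 2: beta | z, theta ~ N(nu, Omega), Omega = (D^{-1} + X'X)^{-1}, D = h*I."""
    m = X.shape[1]
    Omega = np.linalg.inv(np.eye(m) / h + X.T @ X)
    nu = Omega @ (X.T @ (z - theta))              # prior mean beta_0 = 0
    return rng.multivariate_normal(nu, Omega)

def draw_theta(z, X, beta, B, sigma2, rng):
    """Step 3: theta | . ~ N(nu, Omega), Omega = (I + B'B / sigma2)^{-1}."""
    n = B.shape[0]
    Omega = np.linalg.inv(np.eye(n) + B.T @ B / sigma2)
    nu = Omega @ (z - X @ beta)
    return rng.multivariate_normal(nu, Omega)

def draw_sigma2(theta, B, rng, s0=5.0, q0=10.0):
    """Step 4: sigma2 | theta, rho ~ InvGamma(a, b) with a = s0 + n/2 and
    b = theta' B'B theta / 2 + q0; drawn as the reciprocal of a Gamma draw."""
    a = s0 + theta.shape[0] / 2.0
    b = theta @ B.T @ B @ theta / 2.0 + q0
    return 1.0 / rng.gamma(a, 1.0 / b)
```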
Step 5. Finally, we generate the coefficient of each W, ρ_i, using Metropolis-Hastings sampling:

\[
\rho_i^{\mathrm{new}} = \rho_i^{\mathrm{old}} + \Delta_i, \qquad \Delta_i \sim \mathrm{Normal}(0, 0.01)
\]

The acceptance probability α is obtained as:

\[
\alpha = \min\left(
\frac{|B_{\mathrm{new}}|\exp\!\left(-\frac{1}{2\sigma^2}\theta^{\top}B_{\mathrm{new}}^{\top}B_{\mathrm{new}}\theta\right)}
{|B_{\mathrm{old}}|\exp\!\left(-\frac{1}{2\sigma^2}\theta^{\top}B_{\mathrm{old}}^{\top}B_{\mathrm{old}}\theta\right)},\; 1\right)
\]
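A sketch of one such Metropolis update for a single ρ_i, holding the other ρs fixed, is given below; the 0.01 proposal variance is the value stated above, while the function structure, the log-scale computation, and the rejection of non-invertible proposals are assumptions.

```python
import numpy as np

def draw_rho_i(i, rho, theta, W_list, sigma2, rng):
    """Random-walk Metropolis update for rho_i, holding the other rhos fixed."""
    n = theta.shape[0]

    def log_target(r):
        rho_try = rho.copy()
        rho_try[i] = r
        B = np.eye(n) - sum(rj * Wj for rj, Wj in zip(rho_try, W_list))
        sign, logdet = np.linalg.slogdet(B)
        if sign <= 0:          # reject proposals that leave the invertible region
            return -np.inf
        # log of |B| exp(-theta' B'B theta / (2 sigma^2)); a prior term for rho
        # could be added here if a proper prior is used.
        return logdet - (theta @ B.T @ B @ theta) / (2.0 * sigma2)

    r_new = rho[i] + rng.normal(scale=np.sqrt(0.01))   # proposal variance 0.01
    log_alpha = log_target(r_new) - log_target(rho[i])
    if np.log(rng.random()) < log_alpha:
        rho[i] = r_new
    return rho
```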
C Solution diagnostics

We run an MCMC experiment to confirm there is no autocorrelation among the draws of each parameter. In this experiment, we set the length of the MCMC chain to 30,000, the burn-in to 10,000, and the thinning interval to 20, which is used to remove the autocorrelation between draws. The trace plots for the 1,000 draws after burn-in and thinning are shown in Figure 6 below.

Figure 6: Trace plot of a two-network auto-probit model

We have 12 plots in total; each depicts the draws for a particular parameter. The first 9 plots, from left to right and top to bottom, are the traces for the β_i, the coefficients of the independent variables. Each point represents the value of the estimated coefficient β̂_i, and the red line represents the mean. We observe that all β̂_i are randomly distributed around their means, and the means are significant, showing that the estimation results are valid. The 10th and 11th plots are for the two estimated network effect coefficients ρ̂_1 and ρ̂_2. We find that both ρ̂_i are also significant and randomly distributed around their means. The only coefficient showing autocorrelation is σ².
Note that not all values of ρ1 and ρ2 make B (B = I_n − ρ1W1 − ρ2W2) invertible. The plot below shows the relationship between the values of ρ1 and ρ2 and the invertibility of B: the green area is where B is invertible, and the red area is where it is not. If we limit the draws to the green area, we will have correlated ρ1 and ρ2. When we draw ρ1 and ρ2 using a bivariate normal, there is no correlation between them (see Figure 7). We conclude that any correlation between ρ1 and ρ2 comes from the definitions of W1 and W2, not from the prior, which is uncorrelated.
Figure 7: Scatter plot of ρ1 and ρ2 on the valid region for an invertible B
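One way to map this region numerically, assuming two fixed weight matrices, is to scan a grid of (ρ1, ρ2) values and test whether B is well conditioned; the sketch below is illustrative only.

```python
import numpy as np

def invertibility_grid(W1, W2, rho_grid):
    """Return a boolean matrix: True where B = I - rho1*W1 - rho2*W2 is
    numerically invertible (finite and reasonably conditioned)."""
    n = W1.shape[0]
    ok = np.zeros((len(rho_grid), len(rho_grid)), dtype=bool)
    for a, r1 in enumerate(rho_grid):
        for b, r2 in enumerate(rho_grid):
            B = np.eye(n) - r1 * W1 - r2 * W2
            ok[a, b] = np.linalg.cond(B) < 1e12
    return ok

# Example: rho_grid = np.linspace(-1.5, 1.5, 61); plotting `ok` visualizes the
# green (invertible) region described in the text.
```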
References
Allenby, G. M. and Rossi, P. E. (1998). Marketing models of consumer heterogeneity. Journal of
Econometrics, 89(1-2):57–78.
Anselin, L. (1988). Spatial econometrics: Methods and models. Studies in Operational Regional Science.
Springer, 1st edition.
Bass, F. M. (1969). A new product growth for model consumer durables. Management Science,
15(5):215–227.
Bernheim, B. D. (1994). A theory of conformity. Journal of Political Economy, 102(5):841–77.
Burt, R. S. (1987). Social contagion and innovation: Cohesion versus structural equivalence. American
Journal of Sociology, 92(6):1287.
Case, A. C. (1991). Spatial patterns in household demand. Econometrica, 59(4):953–965.
Cook, S. R., Gelman, A., and Rubin, D. B. (2006). Validation of software for bayesian models using
posterior quantiles. Journal of Computational and Graphical Statistics, 15:675–692.
Cressie, N. A. C. (1993). Statistics for spatial data. Probability and Statistics series. Wiley-Interscience,
revised edition.
Doreian, P. (1980). Linear models with spatially distributed data: Spatial disturbances or spatial effects.
Sociological Methods and Research, 9(1):29–60.
Doreian, P. (1982). Maximum likelihood methods for linear models: Spatial effects and spatial disturbance terms. Sociological Methods and Research, 10(3):243–269.
Doreian, P. (1989). Two Regimes of Network Effects Autocorrelation. The Small World. Ablex Publishing.
Duncan, O. D., Haller, A. O., and Portes, A. (1968). Peer influences on aspirations: A reinterpretation.
The American Journal of Sociology, 74(2):119–137.
Fujimoto, K. and Valente, T. W. (2011). Network influence on adolescent alcohol use: Relational, positional, and affiliation-based peer influence.
Kamakura, W. A. and Russell, G. J. (1989). A probabilistic choice model for market segmentation and
elasticity structure. Journal of Marketing Research, 26(4):379–390.
Ord, K. (1975). Estimation methods for models of spatial interaction. Journal of the American Statistical
Association, 70(349):120–126.
Rogers, E. M. (1962). Diffusion of Innovations. Free Press.
Smirnov, O. A. (2005). Computation of the information matrix for models with spatial interaction on
a lattice. Journal of Computational and Graphical Statistics, 14(4):910–927.
Smith, T. E. and LeSage, J. P. (2004). A bayesian probit model with spatial dependencies. In Pace,
K. R. and LeSage, J. P., editors, Advances in Econometrics: Volume 18: Spatial and Spatiotemporal
Econometrics, pages 127–160. Elsevier.
Thomas, A. C. (2009). Hierarchical Models for Relational Data. Phd dissertation, Harvard University,
Cambridge, MA.
Valente, T. W. (2005). Models and methods for innovation diffusion. In Carrington, P. J., Scott, J.,
and Wasserman, S., editors, Models and Methods in Social Network Analysis. Cambridge University
Press.
Wu, C. F. J. (1983). On the convergence properties of the em algorithm. The Annals of Statistics,
11(1):95–103.
Yang, S. and Allenby, G. M. (2003). Modeling interdependent consumer preferences. Journal of Marketing Research, XL:282–294.