Operations Research
Publication details, including instructions for authors and subscription information: http://pubsonline.informs.org

Technical Note—Stochastic Sequential Decision-Making with a Random Number of Jobs
Alexander G. Nikolaev, Sheldon H. Jacobson

To cite this article: Alexander G. Nikolaev, Sheldon H. Jacobson (2010) Technical Note—Stochastic Sequential Decision-Making with a Random Number of Jobs. Operations Research 58(4, Part 1 of 2):1023–1027. http://dx.doi.org/10.1287/opre.1090.0778

Copyright © 2010, INFORMS
TECHNICAL NOTE
Stochastic Sequential Decision-Making with a Random Number of Jobs

Alexander G. Nikolaev
Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois 60201, [email protected]

Sheldon H. Jacobson
Department of Computer Science, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, [email protected]

This paper addresses a class of problems in which available resources must be optimally allocated to a random number of jobs with stochastic parameters. Optimal policies are presented for variations of the sequential stochastic assignment problem and the dynamic stochastic knapsack problem in which the number of arriving jobs is unknown until after the final arrival, and the job parameters are assumed to be independent but not identically distributed random variables.

Subject classifications: sequential assignment; dynamic stochastic knapsack; policy.
Area of review: Stochastic Models.
History: Received June 2008; revision received April 2009; accepted August 2009. Published online in Articles in Advance February 5, 2010.

1. Introduction

Sequential resource allocation problems with uncertainty have received much attention in the literature. This paper focuses on two problems in this field: the sequential stochastic assignment problem (SSAP) and the dynamic stochastic knapsack problem (DSKP).

Derman et al. (1972) introduced the SSAP: Given a known finite number of jobs with independent and identically distributed (i.i.d.) reward values that arrive sequentially, one at a time, how should these jobs be assigned to workers with known finite success rates, where the assignment of each job must be determined nonanticipatively at the time the job arrives? Derman et al. (1972) establish an optimal policy that maximizes the total expected reward, where the reward is the sum of products of job values and worker success rates over all assignments. Theoretical extensions to the investigation by Derman et al. (1972) include scenarios in which various continuous distributions of job arrival times are considered (Albright 1974, Sakaguchi 1972, Righter 1987). Other variations and applications of the SSAP have been addressed by Derman et al. (1975), Nakai (1986), Su and Zenios (2005), Nikolaev et al. (2007), and McLay et al. (2009). Kennedy (1986) established the most general result for the SSAP by removing the assumption of independence and proving that threshold policies are optimal for any problem of this type, although the thresholds that define such policies may be random variables and difficult to compute.

The DSKP was first defined by Papastavrou et al. (1996): Given a limited fixed resource capacity, and jobs with i.i.d. weights and reward values that arrive sequentially, one at a time, how should the available resource be allocated by nonanticipatively accepting or rejecting jobs? Papastavrou et al. (1996) analyze the DSKP formulated over a time horizon of a given number of discrete periods, with a fixed constant probability of a job arrival in each period, for different forms of the joint probability distribution function of job weights and values. Kleywegt and Papastavrou (1998, 2001) consider Poisson arrivals in the DSKP. Other variations and applications of the DSKP have been discussed by Prastakos (1983), Lu et al. (1999), and Van Slyke and Young (2000).

This paper uses conditioning arguments and the results of Kennedy (1986) to consider extensions of the SSAP and the DSKP in which the number of jobs is unknown until after the final arrival but follows a given discrete distribution with either finite or infinite support. The arriving jobs are assumed to be independent but not necessarily identically distributed. Note that an optimal policy for a more restricted version of the SSAP was presented by Sakaguchi (1983) in a different form and without a formal proof; this paper formally proves and extends those results. Also, the original finite-horizon formulation of the DSKP (Papastavrou et al. 1996) considers discrete arrivals with a random number of jobs that follows a binomial distribution; this paper generalizes these results to include other discrete distributions.

The paper is organized as follows. Section 2 shows how the results of Kennedy (1986) can be used to address the SSAP extension. Section 3 presents a dynamic programming (DP) algorithm to solve the DSKP extension. Section 4 offers concluding comments.

2. SSAP with a Random Number of Jobs

This section addresses the SSAP with a random number of jobs. First, the case in which the distribution of the number of jobs has finite support is considered. This result is then extended to the infinite-support case.

2.1. Finite Case

The base problem (BP) is formally stated.

Given. M ∈ ℤ₊ workers available to perform N jobs; a fixed success rate p_w associated with worker w = 1, 2, ..., M; a probability mass function (pmf) {P_n} for the number of jobs, whose values are independent and arrive sequentially, one at a time; for each job j = 1, 2, ..., N, a job-value cumulative distribution function (cdf) F_j(x_j).

Objective. Find a policy π* that determines the assignment of jobs to workers, A_{wj} ∈ {0, 1}, w = 1, 2, ..., M, j = 1, 2, ..., N, such that Σ_{j=1}^{N} A_{wj} ≤ 1 for w = 1, 2, ..., M, Σ_{w=1}^{M} A_{wj} ≤ 1 for j = 1, 2, ..., N, and E_{P_n, {F_j}}[ Σ_{w=1}^{M} Σ_{j=1}^{N} p_w A_{wj} X_j ] is maximized.

The main challenge presented by BP is the randomness in the number of arriving jobs. To address this challenge, an auxiliary problem (AP) can be created in which the number of jobs is fixed but the job values are dependent. Using BP, the AP is constructed as follows. Fix the number of workers at N_max, the largest value that N can take on (i.e., Σ_{n=0}^{N_max} P_n = 1). If N_max ≤ M, set p'_i = p_i for i = 1, 2, ..., N_max. If N_max > M, set p'_i = p_i for i = 1, 2, ..., M and p'_{M+1} = p'_{M+2} = ··· = p'_{N_max} = 0. Also, let X'_1 = X_1, and for any j = 2, ..., N_max, set

$$
X'_j = \begin{cases}
X_j & \text{with probability } \sum_{i=j}^{N_{\max}} P_i \Big/ \sum_{i=j-1}^{N_{\max}} P_i, & \text{if } X'_{j-1} > 0,\\
0 & \text{with probability } P_{j-1} \Big/ \sum_{i=j-1}^{N_{\max}} P_i, & \text{if } X'_{j-1} > 0,\\
0 & & \text{if } X'_{j-1} = 0.
\end{cases} \tag{1}
$$

The AP is now formally stated.

Given. N_max workers available to perform N_max jobs; a fixed success rate p'_w associated with worker w = 1, 2, ..., N_max; N_max jobs with values X'_j arriving sequentially, one at a time; for each job j = 1, 2, ..., N_max, the job-value cdf F'_j(x_j) induced by (1).

Objective. Find a policy π'* that determines the assignment of jobs to workers, A'_{wj} ∈ {0, 1}, w = 1, 2, ..., N_max, j = 1, 2, ..., N_max, such that Σ_{j=1}^{N_max} A'_{wj} ≤ 1 for w = 1, 2, ..., N_max, Σ_{w=1}^{N_max} A'_{wj} ≤ 1 for j = 1, 2, ..., N_max, and E_{{F'_j}}[ Σ_{w=1}^{N_max} Σ_{j=1}^{N_max} p'_w A'_{wj} X'_j ] is maximized.

By design, BP and AP are closely related. Because X_j > 0 for any j = 1, 2, ..., N, by (1) the first N jobs in BP and AP have the same values. Also, the values of the subsequent jobs j = N+1, N+2, ..., N_max in AP are equal to zero, and no additional reward can be earned (i.e., the event {job j does not arrive} in BP corresponds to {X'_j = 0} in AP). Theorem 1 establishes that if an optimal policy for AP is available, then an optimal policy for BP can be obtained.

Theorem 1. Let π'* be an optimal policy for AP. Then an optimal policy π* for BP is obtained using two rules: (1) whenever a job arrives and π'* assigns a worker with success rate zero, discard the job; (2) whenever a job arrives and π'* assigns a worker with success rate p > 0, assign a worker with the same success rate.

Proof. See e-companion.
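To make the finite-support construction concrete, the following sketch (a rough illustration of ours, not the authors' code) computes the fixed policy thresholds for the AP under the additional assumption of i.i.d. Uniform[0, 1] job values, using the fixed-breakpoint recursion stated in Theorem 3 below; with P_1 = P_2 = P_3 = P_4 = 1/4 it reproduces the breakpoints of the illustrative example in §2.3. Function names and the dictionary layout are assumptions of this sketch.

```python
from fractions import Fraction as F

def clamp_mean_uniform(lo, hi):
    """E[median(lo, X, hi)] for X ~ Uniform[0, 1], with 0 <= lo <= hi <= 1."""
    return hi - hi * hi / 2 + lo * lo / 2

def ap_breakpoints(P):
    """Fixed breakpoints of the optimal AP policy, assuming i.i.d. Uniform[0,1]
    job values.  P[i] = P(N = i + 1).  Returns b with b[n] = [b_{1,n+1}, b_{2,n+1}, ...],
    the thresholds in force when job n arrives, highest first."""
    n_max = len(P)
    S = [sum(P[i:], F(0)) for i in range(n_max)]          # S[n-1] = P(N >= n)
    b = {n_max: []}               # the last possible job takes the best remaining worker
    for n in range(n_max - 1, 0, -1):
        r = S[n] / S[n - 1]                               # P(N >= n+1) / P(N >= n)
        nxt = b[n + 1]
        cur = []
        for m in range(1, n_max - n + 1):
            hi = nxt[m - 2] if m >= 2 else F(1)           # b_{m-1,n+2}, capped at 1
            lo = nxt[m - 1] if m - 1 < len(nxt) else F(0)  # b_{m,n+2}, floored at 0
            cur.append(r * clamp_mean_uniform(lo, hi))
        b[n] = cur
    return b
```

For the example of §2.3, `ap_breakpoints([F(1, 4)] * 4)` yields 1/4 for job 3, then 17/48 and 7/48 for job 2, and approximately 0.422, 0.2266, and 0.1014 for job 1.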
(An electronic companion to this paper is available as part of the online version at http://or.journal.informs.org/.)

To determine an optimal policy for AP, the result of Kennedy (1986) can be applied. Using the notation introduced for AP, let the job values X'_j, j = 1, 2, ..., N_max, be any (not necessarily i.i.d.) random variables. For any n = 1, 2, ..., N_max and m = 0, 1, ..., N_max, define random variables Z^{N_max}_{m,n} such that

(i) Z^{N_max}_{0,n} ≡ +∞ for 1 ≤ n ≤ N_max;
(ii) Z^{N_max}_{m,n} ≡ −∞ for m > N_max − n + 1;
(iii) Z^{N_max}_{1,N_max} = X'_{N_max};
(iv) for 1 ≤ m ≤ N_max − n + 1 and n ≤ N_max − 1,

$$
Z^{N_{\max}}_{m,n} = \bigl(X'_n \vee E[Z^{N_{\max}}_{m,n+1} \mid \mathcal{F}_n]\bigr) \wedge E[Z^{N_{\max}}_{m-1,n+1} \mid \mathcal{F}_n],
$$

where 𝓕_n, n = 1, 2, ..., N_max − 1, is the sigma-field over all possible realizations of the vector (X'_1, ..., X'_n), ∨ denotes the maximum, and ∧ denotes the minimum.

For any n = 1, 2, ..., N_max and m = 1, ..., N_max, the random variable Z^{N_max}_{m,n} represents the expected value of the job to which the mth most skilled (mth best) worker is expected to be assigned upon the arrival and assignment of job n. At the time when job n with value x_n arrives, the following hold:
• If job n is assigned to the mth best worker, then Z^{N_max}_{m,n} is equal to x_n.
• If job n is assigned to a more skilled worker than the mth best, then the mth best worker becomes the (m−1)th best, and Z^{N_max}_{m,n} is equal to E[Z^{N_max}_{m-1,n+1} | 𝓕_n].
• If job n is assigned to a less skilled worker than the mth best, then the mth best worker remains the mth best, and Z^{N_max}_{m,n} is equal to E[Z^{N_max}_{m,n+1} | 𝓕_n].

Theorem 2 shows that it makes sense to assign job n to a more skilled worker than the mth best only if x_n is greater than E[Z^{N_max}_{m-1,n+1} | 𝓕_n], and to assign job n to a less skilled worker than the mth best only if x_n is less than E[Z^{N_max}_{m,n+1} | 𝓕_n].

Theorem 2 (Kennedy 1986). Whenever job n = 1, 2, ..., N_max − 1 arrives, the line segment (−∞, +∞) ⊂ ℝ is partitioned into N_max − n + 1 random intervals defined by the breakpoints +∞, E[Z^{N_max}_{1,n+1} | 𝓕_n], E[Z^{N_max}_{2,n+1} | 𝓕_n], ..., E[Z^{N_max}_{N_max−n,n+1} | 𝓕_n], −∞. Then, the optimal assignment policy is to assign the nth job to the worker with the mth highest success rate (available at the time of the assignment) if x_n lies in the mth highest of these N_max − n + 1 intervals or, equivalently, if Z^{N_max}_{m,n} = x_n.

Theorem 2 establishes the form of an optimal policy for any problem whose objective function is given as the expectation of a summation of products. However, this result has seen limited use because finding the conditional expectations of the recursively defined random variables Z^{N_max}_{m,n}, 1 ≤ m ≤ N_max − n + 1, n ≤ N_max − 1, is computationally intractable in many cases, especially when the X'_j, j = 1, 2, ..., N_max, are dependent. For any n = 1, 2, ..., N_max − 1, conditioning on the sigma-field 𝓕_n implies that the interval breakpoints depend on the sequence of values of jobs 1 through n, and hence, for any such sequence, the breakpoint values may be different. However, if the nature of the dependency is as defined in (1), then a closed-form optimal assignment policy for AP can be obtained.

Theorem 3. Whenever job n = 1, 2, ..., N_max − 1 arrives in AP, the optimal assignment policy is to assign the nth job to the worker with the mth highest success rate (available at the time the assignment decision has to be made) if x_n lies in the mth highest of the intervals defined by the fixed breakpoints

+∞, E[Z^{N_max}_{1,n+1} | X'_n > 0], E[Z^{N_max}_{2,n+1} | X'_n > 0], ..., E[Z^{N_max}_{N_max−n,n+1} | X'_n > 0], −∞.

Writing b_{m,n+1} = E[Z^{N_max}_{m,n+1} | X'_n > 0] (with b_{0,·} = +∞ and b_{m,·} = −∞ for m > N_max − n − 1), these breakpoints are computed recursively:

$$
b_{m,n+1} = \frac{\sum_{i=n+1}^{N_{\max}} P_i}{\sum_{i=n}^{N_{\max}} P_i}
\Bigl( F_{n+1}(b_{m,n+2})\, b_{m,n+2}
+ \int_{b_{m,n+2}}^{b_{m-1,n+2}} x \, dF_{n+1}(x)
+ \bigl(1 - F_{n+1}(b_{m-1,n+2})\bigr)\, b_{m-1,n+2} \Bigr). \tag{2}
$$

Proof. See e-companion.

The backward recursion (2) begins with the last, N_max-th job, for which the breakpoints are 0 and +∞ (therefore, the job is assigned to the best remaining available worker). Next, the breakpoints for job N_max − 1 are 0, (P_{N_max} / (P_{N_max−1} + P_{N_max})) E[X_{N_max}], and +∞. To compute the breakpoints for all N_max jobs, proceed in the same manner, down to job 1.

2.2. Infinite Case

The results of Theorems 1 and 3 can be extended to the case in which the pmf for the number of jobs in BP has infinite support. In this case, the proof of Theorem 1 is unchanged. The rewards earned in BP and AP by making assignments for any pair of sequences s and s' (see Theorem 1), respectively, remain the same, because every such sequence has only a finite number of jobs. Therefore, solving AP, where the pmf of the number of jobs has infinite support, solves BP. Kennedy (1986) establishes the form of an optimal policy for such problems, as summarized in Theorem 4.

Theorem 4 (Kennedy 1986). Assume that E[sup_n X'_n] < +∞ and lim_{n→+∞} X'_n = 0. Then the infinite sequence {Z^{N_max}_{m,n}}_{N_max=1}^{+∞} converges to a finite limit Z_{m,n} ≡ lim_{N_max→+∞} Z^{N_max}_{m,n}, and Theorem 2 holds with the breakpoints expressed as +∞, E[Z_{1,n+1} | 𝓕_n], E[Z_{2,n+1} | 𝓕_n], ..., −∞.

Theorem 4 establishes that finding an optimal policy for AP (and, using Theorem 1, BP), where the pmf for the number of jobs has infinite support, can be approached by considering a sequence of APs with fixed (bounded) N_max and letting N_max → +∞. Note that the distributions of job values in such APs (see (1)) depend on the pmfs of the number of jobs, and hence it is necessary to define the pmf P^{N_max} of the number of jobs for each AP with N_max = 1, 2, .... To match the setup described in Kennedy (1986), the distribution of the value of job j = 1, 2, ... has to be the same in each of those APs with N_max = 1, 2, .... To satisfy this requirement, set P^{N_max}_i = P_i / Σ_{k=1}^{N_max} P_k for any i = 1, 2, ..., N_max and N_max = 1, 2, ....

2.3. Illustrative Example

Theorems 3 and 4 describe the necessary computations involved in deriving optimal policies for the SSAP with a random number of jobs. An example is provided to illustrate how these computations are performed.

Example. Given: M = 4; P_1 = P_2 = P_3 = P_4 = 1/4; F_j(x) = x, 0 ≤ x ≤ 1, for j = 1, 2, 3, 4.

For this example, N_max = 4. Define b_{m,n+1} = E[Z^{N_max}_{m,n+1} | X'_n > 0] for 1 ≤ m ≤ N_max − n and 1 ≤ n ≤ N_max − 1. By (2),

n = 3: b_{1,4} = (Σ_{i=4} P_i / Σ_{i=3} P_i) ∫₀¹ x dF_4(x) = ((1/4)/(2/4)) · (1/2) = 1/4;

n = 2: b_{2,3} = (Σ_{i=3} P_i / Σ_{i=2} P_i) [ ∫₀^{b_{1,4}} x dF_3(x) + (1 − F_3(b_{1,4})) b_{1,4} ] = (2/3)(1/32 + 3/16) = 7/48,
b_{1,3} = (Σ_{i=3} P_i / Σ_{i=2} P_i) [ F_3(b_{1,4}) b_{1,4} + ∫_{b_{1,4}}^1 x dF_3(x) ] = (2/3)(17/32) = 17/48;

n = 1: b_{3,2} ≈ 0.1014, b_{2,2} ≈ 0.2266, b_{1,2} ≈ 0.4220.

The derived optimal policy can be compared with the policy that would be optimal if the number of jobs were not random. According to an optimal policy for the SSAP (Derman et al. 1972) with N = 4, with job values distributed as in the example, the interval breakpoints that determine the assignments for the third, second, and first arriving jobs (respectively) would be

n = 3: a_{1,4} = 0.5;
n = 2: a_{2,3} = 18/48, a_{1,3} = 30/48;
n = 1: a_{3,2} ≈ 0.3047, a_{2,2} = 0.5, a_{1,2} ≈ 0.6953.

The interval breakpoint values obtained in the example's solution are smaller, which means that workers with higher success rates are used earlier, at all assignment stages, than in the solution to the corresponding instance of the SSAP with a known number of arrivals.

3. DSKP with a Random Number of Jobs

This section analyzes the DSKP with a random number of jobs and presents a dynamic program that leads to the derivation of an optimal assignment policy. The DSKP is formally stated.

Given.
Resource of capacity C available for allocation to N jobs; pmf {P_n} for the number of jobs, with independent weights and values, arriving sequentially, one at a time; for each job j = 1, 2, ..., N, a joint cdf F_j(w, x) for the job weight and value.

Objective. Find a policy π* that determines the assignments A_j ∈ {0, 1}, j = 1, 2, ..., N, such that Σ_{j=1}^{N} A_j W_j ≤ C and E_{P_n, {F_j}}[ Σ_{j=1}^{N} A_j X_j ] is maximized.

For any j = 1, 2, ... and c ∈ [0, C], let V_j(c) denote the optimal accumulated reward from the allocation of resource capacity c to jobs j, j+1, ..., N, and let E[V_j(c)] denote the optimal conditional expected accumulated reward from the allocation of resource capacity c to jobs j, j+1, ..., N, given that job j−1 has arrived. By definition, E[V_1(C)] = E_{P_n, {F_j}}[ Σ_{j=1}^{N} A_j X_j ] under π*. Theorem 5 establishes an assignment policy that guarantees the optimal expected resource allocation.

Theorem 5. Suppose that the remaining resource capacity is c, and job j with weight w_j and value x_j arrives. Then, it is optimal to set

$$
A_j = \begin{cases}
1 & \text{if } x_j + E[V_{j+1}(c - w_j)] \ge E[V_{j+1}(c)] \text{ and } w_j \le c,\\
0 & \text{if } x_j + E[V_{j+1}(c - w_j)] < E[V_{j+1}(c)] \text{ or } w_j > c.
\end{cases} \tag{3}
$$

Note that the quantity x_j + E[V_{j+1}(c − w_j)] depends on the parameters (weight and value) of job j. These parameters are known at the time the assignment decision for job j is to be made. Therefore, each optimal assignment decision described by (3) is determined by E[V_{j+1}(c)] and E[V_{j+1}(c − w_j)]. Theorem 5 follows from the fundamental argument of DP: each assignment must maximize the sum of an immediate reward and the expected future reward. Note that rule (3) is of the same form as in Papastavrou et al. (1996), except that the E[V_j(c)], j = 1, 2, ..., c ∈ [0, C], are conditional expectations. This allows one to include the pmf of N in the DP formulation and hence determine the optimal allocation policy for the case with a random number of jobs.

The expected values E[V_j(c)], j = 1, 2, ..., c ∈ [0, C], can be computed using a DP recursion.
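Rule (3) is straightforward to apply once the conditional expectations E[V_{j+1}(·)] are available. A minimal sketch of ours (the function name and the stand-in for E[V_{j+1}] are assumptions, not the authors' code):

```python
def accept_job(x_j, w_j, c, ev_next):
    """Decision rule (3): accept job j iff it fits (w_j <= c) and its value plus
    the expected future reward at the reduced capacity is at least the expected
    future reward from keeping capacity c intact.  ev_next maps a remaining
    capacity to E[V_{j+1}(capacity)]."""
    return w_j <= c and x_j + ev_next(c - w_j) >= ev_next(c)
```

For instance, with a hypothetical linear continuation value ev_next(c) = 0.5c, a job of weight 1 is accepted exactly when its value is at least 0.5.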
However, the recursion and its boundary conditions depend on the number of arriving jobs, which is random. First, the case where the pmf of N has finite support is considered. Then the result is extended to the case where the pmf of N has infinite support.

3.1. Finite Case

Theorem 6. The optimal expected accumulated reward E[V_1(C)] can be computed using the recursion

$$
E[V_j(c)] = \frac{\sum_{i=j}^{N_{\max}} P_i}{\sum_{i=j-1}^{N_{\max}} P_i}
\Bigl( P\bigl(W_j \le c,\; R_j + E[V_{j+1}(c - W_j)] \ge E[V_{j+1}(c)]\bigr)
\, E\bigl[ R_j + E[V_{j+1}(c - W_j)] \;\big|\; W_j \le c,\; R_j + E[V_{j+1}(c - W_j)] \ge E[V_{j+1}(c)] \bigr] \\
+ P\bigl(W_j \le c,\; R_j + E[V_{j+1}(c - W_j)] < E[V_{j+1}(c)]\bigr)\, E[V_{j+1}(c)]
+ P(W_j > c)\, E[V_{j+1}(c)] \Bigr) \tag{4}
$$

with boundary conditions E[V_j(c)] = 0 for any c and j > N_max, where W_j and R_j denote the weight and value of job j.

Proof. See e-companion.

3.2. Infinite Case

The result of Theorem 5 can be extended to the case where the pmf for the number of jobs in the DSKP has infinite support. For any j = 1, 2, ..., c ∈ [0, C], and N_max = 1, 2, ..., let E[V_j^{N_max}(c)] denote the optimal conditional expected accumulated reward from the allocation of resource capacity c to jobs j, j+1, ..., N_max, given that job j−1 has arrived, in the DSKP with the pmf of the number of jobs P^{N_max}_i = P_i / Σ_{k=1}^{N_max} P_k for any i = 1, 2, ..., N_max.

Theorem 7. Assume that B ≡ E[sup_j X_j / W_j] < +∞ and P(N < +∞) = 1. Then for any j = 1, 2, ... and c ∈ [0, C], the infinite sequence {E[V_j^{N_max}(c)]}_{N_max=1}^{+∞} converges to the finite limit E[V_j(c)] ≡ lim_{N_max→+∞} E[V_j^{N_max}(c)], and Theorem 5 establishes an optimal policy for the DSKP, where the pmf of the number of jobs has infinite support, with E[V_j(c)] given by these limits, j = 1, 2, ..., c ∈ [0, C].

Proof. See e-companion.

Theorem 7 establishes that an optimal policy for the DSKP, where the pmf of the number of jobs has infinite support, can be obtained by sequentially solving a sequence of DSKPs with finite support. First, consider only two jobs, then three, and so on. Then evaluate the limits lim_{N_max→+∞} E[V_j^{N_max}(c)], j = 1, 2, ..., c ∈ [0, C]. Finally, apply Theorem 5 to establish an optimal resource allocation policy.
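For integer capacities and finite discrete per-job (weight, value) distributions, recursion (4) can be sketched as follows. This is an illustration under our own simplifying assumptions, not the authors' implementation; the function name and data layout (a list of (weight, value, probability) triples per job) are assumptions of this sketch.

```python
def dskp_expected_values(P, jobs, C):
    """Backward recursion (4).  P[j] = P(N = j) for j = 0..n_max; jobs[j-1] is
    the discrete (weight, value, prob) distribution of job j; capacities are
    integers 0..C.  Returns EV with EV[j][c] = conditional expected optimal
    reward from jobs j, j+1, ..., given that job j-1 has arrived, and the
    boundary EV[j][c] = 0 for j > n_max."""
    n_max = len(P) - 1
    S = [sum(P[j:]) for j in range(n_max + 1)]        # S[j] = P(N >= j)
    EV = {j: [0.0] * (C + 1) for j in range(1, n_max + 2)}
    for j in range(n_max, 0, -1):
        r = S[j] / S[j - 1]                           # P(N >= j) / P(N >= j - 1)
        for c in range(C + 1):
            acc = 0.0
            for w, x, p in jobs[j - 1]:
                if w <= c:                            # job fits: take the better option
                    acc += p * max(x + EV[j + 1][c - w], EV[j + 1][c])
                else:                                 # job does not fit: reject
                    acc += p * EV[j + 1][c]
            EV[j][c] = r * acc
    return EV
```

As a hand-checkable instance: with P(N=1) = P(N=2) = 1/2, capacity C = 1, and each job of weight 1 and value 1 or 2 with probability 1/2 each, the recursion gives E[V_2(1)] = 0.75 (job 2 arrives with probability 1/2) and E[V_1(1)] = 1.5.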
4. Conclusion

This paper analyzes the SSAP and the DSKP under the assumption that the number of arriving jobs is random and follows a given discrete distribution. Optimal assignment policies are provided with proofs. Conditioning arguments are key to the solutions of both problems. Note that the complexity of the proposed algorithms is the same as the complexity of the original algorithms introduced by Derman et al. (1972) and Papastavrou et al. (1996). Note also that the DSKP, where the pmf of the number of jobs has infinite support, can be solved by alternative methods, such as a total-reward Markov decision process; further research is required to assess and compare the performance of these methods. Other challenges include discrete sequential assignment problems in which job values depend on each other and/or on the workers to whom the jobs are assigned. Also, the proposed models assume that the sequences in which the jobs, with their respective value cdfs, arrive are fixed and known. Identifying optimal resource allocation policies for the cases in which such sequences could be random is another hard yet important problem.

5. Electronic Companion

An electronic companion to this paper is available as part of the online version that can be found at http://or.journal.informs.org/.

Acknowledgments

This research has been supported by the U.S. Air Force Office of Scientific Research under grant FA9550-07-1-0232 and the National Science Foundation under grant CMMI-0900226. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the United States Government, the U.S. Air Force Office of Scientific Research, or the National Science Foundation.

References

Albright, S. C. 1974. Optimal sequential assignments with random arrival times.
Management Sci. 21(1) 60–67.
Derman, C., G. J. Lieberman, S. M. Ross. 1972. A sequential stochastic assignment problem. Management Sci. 18(7) 349–355.
Derman, C., G. J. Lieberman, S. M. Ross. 1975. A stochastic sequential allocation model. Oper. Res. 23(6) 1120–1130.
Kennedy, D. P. 1986. Optimal sequential assignment. Math. Oper. Res. 11(4) 619–626.
Kleywegt, A. J., J. D. Papastavrou. 1998. The dynamic and stochastic knapsack problem. Oper. Res. 46(1) 17–35.
Kleywegt, A. J., J. D. Papastavrou. 2001. The dynamic and stochastic knapsack problem with random sized items. Oper. Res. 49(1) 26–41.
Lu, L. L., S. Y. Chiu, L. A. Cox Jr. 1999. Optimal project selection: Stochastic knapsack with finite time horizon. J. Oper. Res. Soc. 50(6) 645–650.
McLay, L. A., S. H. Jacobson, A. G. Nikolaev. 2009. A sequential stochastic passenger screening problem for aviation security. IIE Trans. 41(6) 575–591.
Nakai, T. 1986. A sequential stochastic assignment problem in a partially observable Markov chain. Math. Oper. Res. 11(2) 230–240.
Nikolaev, A. G., S. H. Jacobson, L. A. McLay. 2007. A sequential stochastic security system design problem for aviation security. Transportation Sci. 41(2) 182–194.
Papastavrou, J. D., S. Rajagopalan, A. J. Kleywegt. 1996. The dynamic and stochastic knapsack problem with deadlines. Management Sci. 42(12) 1706–1718.
Prastakos, G. P. 1983. Optimal sequential investment decisions under conditions of uncertainty. Management Sci. 29(1) 118–134.
Righter, R. L. 1987. The stochastic sequential assignment problem with random deadlines. Probab. Engrg. Inform. Sci. 1(2) 189–202.
Sakaguchi, M. 1972. A sequential assignment problem for randomly arriving jobs. Rep. Statist. Appl. Res. 19 99–109.
Sakaguchi, M. 1983. A sequential stochastic assignment problem with an unknown number of jobs. Mathematica Japonica 29(2) 141–152.
Su, X., S. A. Zenios. 2005. Patient choice in kidney allocation: A sequential stochastic assignment model. Oper. Res. 53(3) 443–455.
Van Slyke, R., Y. Young. 2000. Finite horizon stochastic knapsacks with applications to yield management. Oper. Res. 48(1) 155–172.

Transportation Research Part A 44 (2010) 182–193

Evaluating the impact of legislation prohibiting hand-held cell phone use while driving

Alexander G. Nikolaev (a), Matthew J. Robbins (b), Sheldon H. Jacobson (c,*)
(a) Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, United States
(b) Department of Industrial and Enterprise Systems Engineering, University of Illinois, Urbana, IL, United States
(c) Department of Computer Science, University of Illinois, Urbana, IL, United States

Article history: Received 9 April 2009; received in revised form 21 December 2009; accepted 16 January 2010.
Keywords: Cell phone; Driving safety; Legislation; Automobile accident.

Abstract. As of November 2008, the number of cell phone subscribers in the US exceeded 267 million, nearly three times the 97 million subscribers in June 2000. This rapid growth in cell phone use has led to concerns regarding its impact on driver performance and road safety. Numerous legislative efforts are under way to restrict hand-held cell phone use while driving. Since 1999, every state has considered such legislation, but few have passed primary enforcement laws. As of 2008, six states, the District of Columbia (DC), and the Virgin Islands have laws banning the use of hand-held cell phones while driving. A review of the literature suggests that in laboratory settings, hand-held cell phone use impairs driver performance by increasing tension, delaying reaction time, and decreasing awareness. However, there exists insufficient evidence to prove that hand-held cell phone use increases automobile-accident-risk.
In contrast to other research in this area, which uses questionnaires, tests, and simulators, this study analyzes the impact of hand-held cell phone use on driving safety based on historical automobile-accident-risk-related data and statistics, which would be of interest to transportation policy-makers. To this end, a pre-law and post-law comparison of automobile accident rate measures provides one way to assess the effect of hand-held cell phone bans on driving safety; this paper provides such an analysis using public domain data sources. A discussion of what additional data are required to build convincing arguments in support of or against such legislation is also provided.

© 2010 Elsevier Ltd. All rights reserved.

* Corresponding author. E-mail address: [email protected] (S.H. Jacobson). doi:10.1016/j.tra.2010.01.006

1. Introduction

As of 2008, the Cellular Telecommunications and Internet Association (CTIA) reported that the number of cell phone subscribers in the US exceeded 267 million. The latest data available from the National Highway Traffic Safety Administration (NHTSA) estimated that in 2007, about 11% of the population used a phone while driving at some point during the day, as reported in USA Today (O'Donnell, 2009). Earlier studies revealed that approximately one-half of interviewed drivers reported using cell phones while driving, either to make outgoing calls or to take incoming calls, spending an average of 4.5 min per call (Royal, 2003). Hand-held cell phones are believed to be an important factor in driver distraction (Williams, 2007). Driver distraction is thought to be the cause of nearly 80% of automobile accidents and 65% of near-accidents (Klauer et al., 2006), resulting in approximately 2,600 deaths, 330,000 moderate to critical injuries, and 1.5 million instances of property damage annually in the US (Cohen and Graham, 2003). Nonetheless, car cell phones have been marketed for nearly half a century and continue to be viewed by many as a high-profile product, as evidenced by a recent article in the New York Times (Richtel, 2009). Indeed, these facts are drawing a significant amount of public attention to the issue of hand-held cell phone use while driving.

Hand-held cell phone use while driving imposes no fewer than three tasks upon drivers: locating/glancing at the phone, which draws the eyes away from the road; reaching for the phone and dialing, which impairs control of the vehicle; and conversing via the phone, which distracts attention from driving (Klauer et al., 2006). Dialing a hand-held cell phone is a particularly dangerous task that forces a driver to take their eyes off the road and thereby increases the risk of accidents and near-accidents. The CTIA safe driving tips include never dialing a telephone or taking notes while driving (CTIA, 2008a). Cell phone use while driving has been considered and studied as a primary factor in automobile accidents, due to the high frequency of this activity (NHTSA, 1997). Numerous investigations have been undertaken to determine whether hand-held cell phone use impairs driver performance. Such efforts are typically based on simulators, tests, questionnaires, telephone surveys, and observations. Redelmeier and Tibshirani (1997) associate hand-held cell phone use with automobile accidents by analyzing questionnaire responses of 699 drivers as well as phone and police records. They suggest that the resulting automobile-accident-risk is equivalent to the impairment resulting from legal intoxication. Caird et al. (2008) and Horrey and Wickens (2006) show that the costs associated with cell phone use while driving are seen in reaction time tasks, with smaller costs in performance on lane keeping and tracking tasks.
Strayer and Drews (2004) report that hand-held cell phone use while driving increases braking times by 18%, increases following distances by 12%, and increases the time for speed resumption after braking by 17%. The NHTSA used a driving simulator to investigate the effects of hand-held cell phone use while performing four tasks: car following, lead-vehicle braking, lead-vehicle cut-in, and merging. They observed that hand-held cell phone use while driving impairs driver performance, slows the response to lead-vehicle speed changes during car following, and degrades automobile control (Ranney, 2005). The growing use of cell phones and the associated research on how they impact driver performance have led many, including some state legislators, to question their safety while driving. Royal (2003) reports that 71% of drivers support restrictions on hand-held cell phones and 57% approve a ban on hand-held cell phone use while driving, although most drivers who do use cell phones oppose such outright bans or traffic fines on hands-free cell phones. Acknowledging a potential negative impact of hand-held cell phone use while driving, a number of legislative initiatives have passed that ban hand-held cell phone use while driving. In fact, since 1999, every state has considered such legislation (Sundeen, 2004). In 2001, New York became the first state to enact such a law. Since that time, similar bans have taken effect in New Jersey, DC, Connecticut, Utah, California, Washington, and the Virgin Islands, all of them primary enforcement laws (except in Utah, where the law is primary only with regard to text messaging); primary enforcement allows law enforcement officers to ticket drivers for using a hand-held cell phone while driving without any other traffic violation (Governors Highway Safety Association, 2008).
A number of states (e.g., Illinois) restrict hand-held cell phone use by requiring that sound travel unimpaired to at least one ear or that drivers keep at least one hand on the steering wheel at all times (Sundeen, 2001). In addition to state statutes, local ordinances have been passed that prohibit hand-held cell phone use while driving in certain counties, cities, towns, and municipalities. For example, Chicago, Illinois, implemented such a policy in 2005. A total of 28 jurisdictions enforce such local ordinances in Florida, Illinois, Massachusetts, Michigan, New Jersey, New Mexico, New York, Ohio, Pennsylvania, and Utah (Cellular News, 2008). However, no state or local ordinance completely bans all types of cell phones (hand-held and hands-free) while driving, though many prohibit cell phone use by certain segments of the population (Glassbrenner and Ye, 2007). For example, California enforces an all-type cell phone ban for school bus drivers and drivers under 18 years of age (AAA Auto Insurance, 2008). While proponents believe that laws banning hand-held cell phone use while driving may reduce driver distraction and improve driver performance, opponents of such laws believe that it is premature to act. Although research suggests that multi-tasking impairs driver performance, there is still insufficient evidence to definitively prove that hand-held cell phone use increases automobile-accident-risk (McCartt et al., 2006; Williams, 2007; Olson, 2003). Note that in this domain, definitive proofs are practically impossible to obtain, given the inability of researchers to conduct controlled experiments in which the dependent variables are accidents, property damage, personal injuries, and even death.
A study on distracted driving, released by the NHTSA and the Virginia Tech Transportation Institute (Dingus et al., 2006; Klauer et al., 2006), suggests that drivers talking or listening to a wireless device are no more likely to be involved in an accident or near-accident than those not involved in such activities. Of course, the safety and highway travel benefits provided by cell phones, especially for public health and safety considerations, cannot be overlooked (Lissy et al., 2000). For example, cell phones can reduce emergency response time to automobile accidents (Savage et al., 2000). Moreover, given that legislation narrowly aimed at cell phone use does not adequately address the larger issue of driver distraction, the CTIA believes that education is a more effective approach to enhance drivers' awareness and responsibility (CTIA, 2008b). A number of safety and elected officials agree with this sentiment, including the Chairman of the Governors Highway Safety Administration (CTIA, 2008b). In support of this position, in 2008, the CTIA, along with Sprint Nextel, Cingular Wireless, Dobson Cellular Systems, and other wireless companies, developed programs and sponsored public service announcement campaigns designed to educate drivers on distraction while operating vehicles. In addition to education, the cell phone industry has focused on enhancing driving safety beyond the issue of hands-free operation, by eliminating in-hand manipulation and reducing distractions while driving (Goodman et al., 1997). Recent research and technological advances in this area are providing innovative solutions to the problem of distracted drivers, such as hands-free car kits and the "Polite Phone" prototype, which uses Bluetooth technology to provide a voice-command interface between a car and a cell phone and enable hands-free voice dialing, answering, and hanging up (Auto News, 2006; Funponsel Network, 2005).
However, early reports failed to observe a significant risk reduction due to the use of this new technology (Strayer et al., 2003; McEvoy et al., 2005). An important question to ask is: are bans on hand-held cell phone use while driving effective for reducing automobile-accident-risk, and do such laws make the roads safer? Although a significant amount of research has investigated the effect of hand-held cell phone use on automobile-accident-risk, there are no definitive conclusions on the issue. This paper focuses specifically on traffic safety both before and after hand-held cell phone bans, to explore whether such laws have any meaningful effect. Note that the issue of compliance is very important for such a study. In this paper, it is assumed, just as lawmakers assume, that the bans do make many drivers refrain from using hand-held cell phones while driving. The main contribution of this paper is to provide statistical measures in support of or against laws banning hand-held cell phone use while driving, based on their historical (statistical) impact on road safety, and to suggest what additional data are necessary to establish such connections. The paper is organized as follows: Section 2 describes the available data and the statistical methods that can be used to conduct comparative studies of automobile accident rates in selected territorial units between pre-law and post-law time periods. Section 3 presents the observed results on the effects of law enforcement on improving driving safety. Section 4 summarizes the findings, discusses the limitations of the presented analysis, and points out possible directions for further research on this issue.

2. Methods

This section describes the data and the tools that can be used to compare automobile accident rates in selected territorial units for the time periods before and after hand-held cell phone ban laws were enacted. There is a dearth of systematically collected data on automobile accident rates in the United States that can be used to study the consequences of hand-held cell phone ban laws. Most territorial units have passed such laws only recently, and hence cannot be used as reliable testbeds for drawing any significant, long-lasting conclusions. In other cases, the ban laws have been passed individually by only a limited number of minor territorial units (e.g., isolated single counties), which makes it difficult to put the observed corresponding accident rate changes in a meaningful perspective. This paper seeks to conduct a statistically sound, comprehensive analysis of pre-law and post-law periods, and focuses on the data for New York State, where a state-wide ban on hand-held cell phone use while driving began in November of 2001 (the first in the US) and has been in effect for over 8 years. For these reasons, New York data represent the only reliable source for evaluating the effect of hand-held cell phone ban laws in the United States. Due to a change in the definition of property damage automobile accidents in New York State regulations in 1997 and again in 2003, the number of property damage automobile accidents, and hence the total number of automobile accidents, cannot be used as a measure for evaluating the effectiveness of the ban. Therefore, for all 62 counties in New York State, the measures of traffic safety adopted in this study are the number of fatal automobile accidents per 100,000 licensed drivers per year and the number of personal injury accidents per 1000 licensed drivers per year.
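The two measures can be computed directly from raw counts. The sketch below uses hypothetical figures for a single county and year, for illustration only, not actual New York State data:

```python
# Hypothetical one-year counts for a single county (illustrative values only).
fatal_accidents = 45
personal_injury_accidents = 5200
licensed_drivers = 410_000

# The two traffic-safety measures adopted in the study:
# fatal accidents per 100,000 licensed drivers per year, and
# personal injury accidents per 1000 licensed drivers per year.
fatal_rate = fatal_accidents / licensed_drivers * 100_000
injury_rate = personal_injury_accidents / licensed_drivers * 1_000
```

Computing the measures per licensed driver, rather than using raw counts, makes counties of very different sizes directly comparable.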
To allow for a proper comparison between time periods, 1997–2001 is treated as the pre-law time period and 2002–2007 is treated as the post-law time period. Note that these two accident rate measures are positively correlated, yet differ by the severity of the tallied accidents' consequences. Note also that some counties passed local ban laws prior to the enactment of the state-wide law. If anything, this consideration makes the results even stronger for any such county where the accident rates are found to have dropped. The main portion of the analysis is conducted by testing the hypothesis that the New York state-wide hand-held cell phone ban had no impact on the described measures. A one-tailed t-test is applied to determine whether the expected values of these measures show a statistically significant decrease after the law was enacted. First, to verify that the data used are normally distributed, the Shapiro–Wilk test is conducted. Second, in order to determine the proper statistical test to apply, it must be established whether the variances of the compared populations (the data collected over the two time periods) are the same for each of the two measures. To assess this, a two-sided F-test is used. Third, for those localities where the null hypothesis of equal variances is not rejected at a 5% significance level, a one-sided t-test for samples with equal variances is used to determine whether the measures described above have the same means in the two time periods versus having larger means before hand-held cell phone ban laws were enacted. For those localities where the null hypothesis of equal variances is rejected at a 5% significance level, a one-sided t-test for samples with unequal variances is used instead.

3. Results

This section reports the results of the comparisons of two automobile accident measures in all New York State counties for the time periods before and after hand-held cell phone ban laws were enacted.
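The three-step testing procedure described in Section 2 can be sketched as follows. This is a minimal illustration using SciPy with hypothetical accident-rate samples; the function name and return format are assumptions for the sketch, not from the study:

```python
import numpy as np
from scipy import stats

def compare_periods(pre, post, alpha=0.05):
    """Three-step comparison of pre-law vs. post-law accident-rate samples:
    1) Shapiro-Wilk normality check on each sample,
    2) two-sided F-test for equality of variances,
    3) one-sided two-sample t-test (pooled if equal variances were not
       rejected at level alpha, Welch otherwise), H1: mean(pre) > mean(post).
    """
    pre, post = np.asarray(pre, float), np.asarray(post, float)

    # Step 1: normality (p-values reported alongside the main test).
    sw_pre, sw_post = stats.shapiro(pre)[1], stats.shapiro(post)[1]

    # Step 2: two-sided F-test on the ratio of sample variances.
    f = np.var(pre, ddof=1) / np.var(post, ddof=1)
    d1, d2 = len(pre) - 1, len(post) - 1
    tail = stats.f.sf(f, d1, d2) if f > 1 else stats.f.cdf(f, d1, d2)
    p_f = float(min(1.0, 2 * tail))
    pooled = bool(p_f >= alpha)  # equal variances not rejected -> pooled test

    # Step 3: one-sided t-test; halve the two-sided p-value when t > 0.
    t, p_two = stats.ttest_ind(pre, post, equal_var=pooled)
    p_one = p_two / 2 if t > 0 else 1 - p_two / 2
    return {"pooled": pooled, "t": float(t), "p": float(p_one),
            "shapiro_p": (sw_pre, sw_post), "f_test_p": p_f}
```

For a given county, `pre` would hold the five annual rates for 1997–2001 and `post` the six annual rates for 2002–2007; a `p` below 0.05 would reject the no-effect hypothesis for that county.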
The automobile accident rates data as well as the number of licensed drivers by county are all published by the New York State Department of Motor Vehicles (2008a–c). The relevant data for each individual county of New York State are summarized in Tables 1 and 2. In particular, the two measures of interest, fatal accidents per 100,000 licensed drivers and personal injury accidents per 1000 licensed drivers, are reported for years 1997 through 2007. The counties are arranged in decreasing order by licensed driver density, computed as the number of licensed drivers per square mile (averaged over the 11 years comprising the pre-law and post-law periods). The last columns in Tables 1 and 2 give the p-values for the hypothesis test of equal variances in the pre-law and post-law accident rate measures, for each county.

Table 1. Fatal accidents per 100,000 licensed drivers for New York counties. [Full table not reproduced: for each of the 62 counties, ordered by decreasing licensed driver density, the table lists the annual rates for 1997–2007 and the p-value of the equal-variance test.]

Tables 3 and 4 present the results of the t-tests for each individual county, reporting the test type, the standardized t-statistic values, and the p-values. A drop in the number of fatal accidents per 100,000 licensed drivers per year has been observed from the selected pre-law period to the post-law period in 46 counties. A drop in the number of personal injury

Table 2. Personal injury accidents per 1000 licensed drivers for New York counties.
[Full table not reproduced: for each of the 62 counties, the table lists the annual personal injury accident rates for 1997–2007 and the p-value of the equal-variance test.]

accidents per 1000 licensed drivers per year has been observed in all 62 counties. According to Table 3, which looks at the number of fatal automobile accidents per year per 100,000 licensed drivers, a total of 10 out of 62 counties have p-values lower than 0.05 in the t-tests, providing sufficient evidence for the rejection of the "no effect" hypotheses at the 5% level of significance. According to Table 4, which looks at the number of personal injury automobile accidents per year per 1000 licensed drivers, a total of 46 out of 62 counties have p-values lower than 0.05 in the t-tests. Fig. 1 presents the personal injury accident rate standardized t-statistic values for the hypothesis tests for all counties, plotted against licensed driver density.

Table 3. Post-law and pre-law comparison – fatal injury accidents per 100,000 licensed drivers. [Full table not reproduced: for each county, the table lists the test type (pooled or not pooled), the standardized t-statistic, and the p-value.]

Table 4. Post-law and pre-law comparison – personal injury accidents per 1000 licensed drivers.
[Full table not reproduced: for each county, the table lists the test type (pooled or not pooled), the standardized t-statistic, and the p-value.]

A condensed version of further results is given in Tables 5 and 6, where a summary of the t-test results is presented for three different cases of pooled groups of counties. In the first case, the measures of all the counties in New York are pooled in order to obtain a statewide result. In the second case, the measures of the counties are pooled according to geopolitical designation in order to examine results for New York City and upstate New York. In the third case, the measures of the counties are pooled according to licensed driver density values. In particular, a k-means clustering algorithm (Seber, 1984) is used to form 10 groups of counties with similar licensed driver density values. The algorithm selects group membership in order to minimize the total intra-group Euclidean distance between a county's density value and its group's mean density value. Each table reports the difference in its respective measure from the selected pre-law period to the post-law period, the pooled sample standard deviation (when appropriate), the number of data points in the samples, the values of the test statistic (T is distributed as Student's t with n_pre + n_post − 2 degrees of freedom), and the p-values. For most of the pooled groups, the hypothesis of equal variances of accident rate measures between the pre-law and post-law periods was not rejected. Those groups that rejected the hypothesis of equal variances have a dash in the Sp column.

Fig. 1. Personal injury accident t-test statistic by county, pre-law to post-law. [Figure not reproduced.]

Table 5. Post-law and pre-law comparison – fatal injury accidents per 100,000 licensed drivers. [The mean-difference and Sp columns are garbled in the source and omitted.]

Group                        (n_pre, n_post)    T        p-value
NY state (1–62)              (310, 372)         2.1362   0.0165
NY city (1–5)                (25, 30)           3.3787   0.0008
Upstate (6–62)               (285, 342)         1.8869   0.0298
NY county (1)                (5, 6)             2.4282   0.0190
Kings (2)                    (5, 6)             4.6581   0.0006
Bronx–Queens (3–4)           (10, 12)           2.8979   0.0044
Richmond (5)                 (5, 6)             0.5757   0.2895
Nassau (6)                   (5, 6)             1.0108   0.1693
Westchester–Suffolk (7–9)    (15, 18)           0.5182   0.3040
Monroe–Schenectady (10–12)   (15, 18)           1.4086   0.0845
Onondaga–Dutchess (13–18)    (30, 36)           1.1473   0.1278
Broome–Wayne (19–27)         (45, 54)           0.4273   0.3351
Seneca–Hamilton (28–62)      (175, 210)         1.7512   0.0404

Table 6. Post-law and pre-law comparison – personal injury accidents per 1000 licensed drivers. [Same layout as Table 5; mean-difference and Sp columns omitted.]

Group                        (n_pre, n_post)    T        p-value
NY state (1–62)              (310, 372)         6.1656   0.0000
NY city (1–5)                (25, 30)           4.4459   0.0000
Upstate (6–62)               (285, 342)         8.7473   0.0000
NY county (1)                (5, 6)             6.4023   0.0001
Kings (2)                    (5, 6)             5.4407   0.0002
Bronx–Queens (3–4)           (10, 12)           3.6979   0.0007
Richmond (5)                 (5, 6)             4.2763   0.0010
Nassau (6)                   (5, 6)             3.6932   0.0052
Westchester–Suffolk (7–9)    (15, 18)           5.1424   0.0000
Monroe–Schenectady (10–12)   (15, 18)           2.3196   0.0136
Onondaga–Dutchess (13–18)    (30, 36)           4.5398   0.0000
Broome–Wayne (19–27)         (45, 54)           5.4646   0.0000
Seneca–Hamilton (28–62)      (175, 210)         7.3070   0.0000

4. Discussion

As the number of drivers that use cell phones while driving grows, the interest in linking hand-held cell phone use while driving and road safety increases.
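The density-based grouping used to form the pooled comparisons in Tables 5 and 6 is a one-dimensional k-means clustering, as described in Section 3. A minimal sketch of Lloyd's algorithm in NumPy follows; the data in the usage note are hypothetical, not the actual county densities:

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100, seed=0):
    """Lloyd's algorithm for one-dimensional data: alternately assign each
    value to its nearest centroid and move each centroid to the mean of its
    group, reducing the total intra-group distance to the group mean."""
    x = np.asarray(values, float)
    rng = np.random.default_rng(seed)
    centroids = rng.choice(np.unique(x), size=k, replace=False)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for each value.
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        # Update step: move each centroid to its group mean (keep empty
        # clusters where they are).
        new = np.array([x[labels == j].mean() if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

Applied to the 62 licensed-driver density values with k = 10, this kind of procedure yields groups of counties with similar densities, whose pooled measures can then be compared as in Tables 5 and 6.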
As more technologies, including cameras, music, text messaging, and internet browsing, become available on mobile devices, they may become an even greater source of driver distraction. As of 2009, more than 250 bills prohibiting or restricting cell phone use while driving are pending in 42 state legislatures, despite disagreement over the risks cell phones pose and the effectiveness of enforcement (O'Donnell, 2009). This paper conducts a comparative analysis of two automobile accident rate measures in the counties of New York State for the periods before and after the state-wide hand-held cell phone ban law was enacted. Section 4.1 summarizes the findings, Section 4.2 discusses the limitations of the presented analysis of the effects of law enforcement on improving driving safety, and Section 4.3 points out possible directions for further research on this subject.

4.1. Summary

The results presented in Section 3 indicate that after banning hand-held cell phone use while driving, 46 counties in New York experienced lower fatal automobile accident rates, 10 of which did so at a statistically significant level, and all 62 counties experienced lower personal injury automobile accident rates, 46 of which did so at a statistically significant level. The analysis strongly suggests that the mean fatal accident rate measure decreased at a significant level for New York State (p-value of 0.0165, see Table 5), for New York City and upstate New York (p-values of 0.0008 and 0.0298, respectively, see Table 5), and for four of the 10 groups partitioned by similar licensed driver density. Three of these four groups contained high density New York City counties (New York County, Bronx, and Queens with p-values of 0.0190, 0.0006, and 0.0044, respectively, see Table 5). The fourth group contained the lowest density subset of upstate New York counties (Seneca–Hamilton, with a p-value of 0.0404, see Table 5).
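The pooled two-sample t-test behind the p-values cited above can be sketched in a few lines. The rates below are made-up illustration values, not the paper's county data; this is a minimal sketch of the standard pooled-variance test, not the authors' code.

```python
import math

def pooled_t(pre, post):
    """Two-sample t-test with pooled variance.

    Returns (T, degrees_of_freedom, pooled_sd); under the null hypothesis
    of equal means, T follows a t distribution with n_pre + n_post - 2
    degrees of freedom, as in Tables 5 and 6.
    """
    n1, n2 = len(pre), len(post)
    m1, m2 = sum(pre) / n1, sum(post) / n2
    v1 = sum((x - m1) ** 2 for x in pre) / (n1 - 1)   # sample variances
    v2 = sum((x - m2) ** 2 for x in post) / (n2 - 1)
    sp = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2, sp

# Hypothetical pre-law and post-law accident rates for one pooled group:
pre = [2.1, 1.8, 2.4, 2.0, 2.2]
post = [1.6, 1.5, 1.9, 1.4, 1.7]
t, df, sp = pooled_t(pre, post)
```

A positive T indicates a drop in the mean rate from the pre-law to the post-law period; the p-value would then be read from the t distribution with `df` degrees of freedom.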
The mean personal injury accident rate measure decreased at a significant level for all groups in each of the three cases examined (see Table 6). Moreover, it has been observed that, in general, the personal injury accident rate decrease is more substantial in counties with a high density of licensed drivers (see Fig. 1). Overall, the personal injury accident rate proved to be a more appropriate measure than the fatal accident rate for the analysis.

4.2. Limitations

Several issues limit the statistical validity of the presented analysis and hamper one's ability to definitively establish the effect of laws banning hand-held cell phone use while driving on automobile accident rates using publicly available, historical data as the basis for analysis. First, one should take care not to project the results of this analysis, based only on New York data, to the national level, given that each state, county, city, and town has its own unique highway and roadway transportation network whose particular design must be considered. Second, hand-held cell phone ban legislation may not be the only factor affecting automobile accident rates. This observation makes it difficult to judge whether the changes in automobile accident rates in counties with hand-held cell phone bans are primarily attributable to the bans, or to other factors, including but not limited to road construction, safety education, introduction of new automobile safety features, and/or changes in alcohol and illegal substance control policies. Considerations of such confounding factors should ideally be included in the analysis, but unfortunately, the relevant data are unavailable, both because they are absent from public records and because of the proprietary concerns of the companies that use such data for their business interests.
Also, the impact of traffic safety improvement initiatives, such as the "Safe Streets NYC" program in New York City (Bloomberg and Sadik-Khan, 2007), should be taken into account. Third, proper enforcement of hand-held cell phone ban laws, and hence driver compliance, is an important issue. McCartt and Geary (2003, 2004) reported that the rate of hand-held cell phone use while driving in New York dropped from 2.3% (before the law was enacted) to 1.1% in the first few months immediately following the enactment of the law. However, this rate rebounded to 2.1% about a year later. Since the initial drop in hand-held cell phone use while driving was not sustained, it is possible that the reduction in automobile accident rates in New York may be due to other factors. Fourth, data linking the number of cell phone subscribers to automobile accident rates suggest that increased cell phone use does not translate into increased automobile accident rates. In particular, there has been exponential growth in the number of cell phone subscribers since the late 1980s, while automobile accident rates in the US during this same time period have remained at a fairly constant level (see Fig. 2). Driving statistics from the National Center for Statistics and Analysis of the NHTSA reveal that from 1994 to 2004, the number of cell phone subscribers increased 655%, with their average monthly minutes-of-use increasing 3900%, while reported annual automobile accident rates decreased by approximately 5% over the same time period (Information Please Database, 2007; NHTSA, 2008). These facts should not go unnoticed, even though it is likely that changes in transportation policy and advances in automotive safety between 1988 and 2006 have influenced the accident rates. As of February 2007, sixteen states had published data on the number of automobile accidents that cited hand-held cell phones or radios as a causal factor.
These data indicate that hand-held cell phone use is reported as a factor in less than 1% of automobile accidents (Sundeen, 2007).

Fig. 2. Automobile accident rates and the number of cell phone subscribers in the US, 1988–2006.

Although such data are controversial and potentially unreliable, due to the challenge in knowing the precise cause of accidents and how such information is reported, they do suggest that hand-held cell phone use may account for a negligible percentage of automobile accidents, which means that if such accidents could be completely eliminated by hand-held cell phone ban laws, there would be only a slight reduction in the total number of automobile accidents.

4.3. Future research directions

A large body of literature suggests that hand-held cell phone use while driving impairs driver performance (Ranney, 2005; Strayer et al., 2006; Sundeen, 2001; Redelmeier and Tibshirani, 1997). Drivers using hand-held cell phones have slower reaction times, longer following distances, and take longer to recover speed compared to drivers who do not use hand-held cell phones (Strayer and Drews, 2004). Although studies using driving simulators and test tracks indicate that hand-held cell phone use negatively impacts driver performance, the results drawn from experiments in such controlled environments cannot directly measure the impact of hand-held cell phone use on accident rates (Hedlund, 2006). Indeed, there is insufficient evidence to broadly assert that hand-held cell phone use results in higher accident rates or that hand-held cell phone bans decrease accident rates (Williams, 2007).
Several organizations, including CTIA and AAA Auto Insurance, believe it is premature to ban hand-held cell phone use while driving. They argue that road safety can be improved more effectively through education and ease-of-use cell phone designs, rather than legislation. Studies conducted in actual driving conditions, not only in laboratory environments, are needed to provide convincing evidence that hand-held cell phone use while driving impairs driver performance, and hence, increases automobile accident rates. However, staging a set of potentially dangerous situations on the road just to evaluate the driver’s ability to avoid a collision is unthinkable, and hence, the statistical approach taken in this paper may be the only one where data from actual accidents can be used to answer questions regarding cell phone use while driving. Although at this point one should be cautious about drawing conclusions from the current analysis (for reasons described in Section 4.2), the approach taken in this paper looks very promising for providing useful information on the need for hand-held cell phone ban laws. In order to conduct a more substantive and conclusive analysis, the data that would allow for blocking the confounding factors are required. Also, the property damage automobile accident rate should be considered as another, more appropriate measure of safety than fatal or personal injury accident rates. A measure that ideally would replace the density of licensed drivers in the analysis is the daily vehicle throughput per square mile of a county’s land. Moreover, in order to investigate the effects of restricting hand-held cell phone use while driving, wider-coverage data related to cell phone usage and road safety are needed to support additional research on this important problem. 
Such data could include the fraction of drivers actually using hand-held cell phones while driving, the total amount of time that hand-held cell phones are used while driving, and the fraction of automobile accidents that are directly attributable to hand-held cell phone use. Note that a wealth of the described data lies in the hands of insurance companies, which have a clear interest in the correct evaluation of the impact of cell phone ban laws on driving safety, if only for the sake of gaining a competitive edge over their rivals. Moreover, national and state transportation policy-makers would welcome a fair and unbiased analysis of such data, to put to rest the growing debate on this issue and allow for appropriate national and state legislation policies and decisions to be made.

Given more data, a logical next step from the statistical point of view is to conduct a time series cross-sectional multivariate regression analysis and employ analysis of variance techniques to establish whether laws prohibiting hand-held cell phone use while driving have a significant effect on the driving environment. The authors do not intend to stop short of finding the truth and actively seek potential collaborations with interested parties.

Acknowledgements

The computational work was done in the Simulation and Optimization Laboratory housed within the Department of Computer Science at the University of Illinois. The views expressed in this paper are those of the authors and do not reflect the official policy or position of the United States Air Force, Department of Defense, or the United States Government. The authors wish to thank the editor and two anonymous referees for their helpful comments and suggestions, which have significantly improved the manuscript. The authors would also like to thank Qianyi C. Zhao for her contributions during the initial stages of this research.
References

AAA Auto Insurance, Public Affairs, 2008. State Distracted Driving Laws Chart. <http://www.aaapublicaffairs.com/Assets/Files/200891214340.DistractedDrivingLaws8.08.doc> (accessed 28.01.09).
Auto News, 2006. Hands-free Phone Kits Help Reduce Cell Phone Distraction. <http://www.motortrend.com/auto_news/112_news060906_hands_free_cell_phone_use/index.html> (accessed 18.02.09).
Bloomberg, M.R., Sadik-Khan, J., 2007. Safe Streets NYC. Traffic Safety Improvements in New York City. <http://www.SafeNY.com> (accessed 18.10.09).
Caird, J.K., Willness, C.R., Steel, P., Scialfa, C., 2008. A meta-analysis of the effects of cell phones on driver performance. Accident Analysis and Prevention 40 (4), 1282–1293.
Cellular News, 2008. Countries that Ban Cell Phones while Driving. <http://www.cellular-news.com/car_bans/> (accessed 28.01.09).
Cohen, J.T., Graham, J.D., 2003. A revised economic analysis of restrictions on the use of cell phones while driving. Risk Analysis 23 (1), 5–17.
CTIA, 2008a. Safe Driving Tips. <http://www.ctia.org/consumer_info/safety/index.cfm/AID/10369> (accessed 28.01.09).
CTIA, 2008b. Safe Driving: CTIA Position. <http://www.ctia.org/advocacy/policy_topics/topic.cfm/TID/17> (accessed 28.01.09).
Dingus, T.A., Klauer, S.G., Neale, V.L., Petersen, A., Lee, S.E., Sudweeks, J., Perez, M.A., Hankey, J., Ramsey, D., Gupta, S., Bucher, C., Doerzaph, Z.R., Jermeland, J., Knipling, R.R., 2006. The 100-car Naturalistic Driving Study, Phase II – Results of the 100-car Field Experiment. NHTSA, DOT HS 810 593. <http://www.nhtsa.gov/portal/nhtsa_static_file_downloader.jsp?file=/staticfiles/DOT/NHTSA/NRD/Multimedia/PDFs/Crash%20Avoidance/2006/100CarMain.pdf> (accessed 28.01.09).
Funponsel Network, 2005. Motorola Develops 'Polite' Phone. <http://www.funponsel.com/blog/news/motorola-develops-polite-phone.html> (accessed 18.02.09).
Glassbrenner, D., Ye, J.Q., 2007. Driver Cell Phone Use in 2006 – Overall Results. NHTSA, DOT HS 810 790 (July).
Goodman, M.J., Benel, D., Lerner, N., Wierwille, W., Tijerina, L., Bents, F., 1997. An Investigation of the Safety Implications of Wireless Communications in Vehicles. US Dept. of Transportation, NHTSA. <http://www.nhtsa.dot.gov/people/injury/research/wireless/> (accessed 28.01.09).
Governors Highway Safety Association, 2008. Cell Phone Driving Laws. <http://www.ghsa.org/html/stateinfo/laws/cellphone_laws.html#4> (accessed 28.01.09).
Hedlund, J.H., 2006. Countermeasures that Work: A Highway Safety Countermeasures Guide for State Highway Safety Offices. NHTSA, Washington, DC, DOT HS 809 980 (January). <http://www.nhtsa.dot.gov/people/injury/airbags/Countermeasures/pages/0Introduction.htm> (accessed 28.01.09).
Horrey, W.J., Wickens, C.D., 2006. Examining the impact of cell phone conversations on driving using meta-analytic techniques. Journal of Experimental Psychology: Applied 12 (2), 67–78.
Information Please Database, 2007. Cell Phone Subscribers in the US, 1985–2005. <http://www.infoplease.com/ipa/A0933563.html> (accessed 28.01.09).
Klauer, S.G., Dingus, T.A., Neale, V.L., Sudweeks, J.D., Ramsey, D.J., 2006. The Impact of Driver Inattention on Near-crash/Crash Risk: An Analysis Using the 100-car Naturalistic Driving Study Data. NHTSA, DOT HS 810 5940 (April). <http://www.noys.org/Driver%20Inattention%20Report.pdf> (accessed 28.01.09).
Lissy, S.K., Cohen, J.T., Park, M.Y., Graham, J.D., 2000. Cellular Phone Use while Driving: Risks and Benefits. Harvard Center for Risk Analysis, Harvard School of Public Health. <http://www-nrd.nhtsa.dot.gov/departments/nrd-13/driver-distraction/PDF/Harvard.PDF> (accessed 28.01.09).
McCartt, A.T., Geary, L.L., 2003. Drivers' use of handheld cell phones before and after New York State's cell phone law. Preventive Medicine 36, 629–635.
McCartt, A.T., Geary, L.L., 2004. Long term effects of New York State's law on drivers' handheld cell phone use. Injury Prevention 10, 11–15.
McCartt, A.T., Hellinga, L.A., Braitman, K.A., 2006. Cell phones and driving: review of research. Traffic Injury Prevention 7 (2), 89–106.
McEvoy, S.P., Stevenson, M.R., McCartt, A.T., Woodward, M., Haworth, C., Palamara, P., Cercarelli, R., 2005. Role of mobile phones in motor vehicle crashes resulting in hospital attendance: a case-crossover study. British Medical Journal 331 (7514), 428–430.
NHTSA, 1997. Traffic Safety Facts 1996: Young Drivers. NHTSA, Washington, DC.
NHTSA, 2008. Table 2-17: Motor Vehicle Safety Data, National Transportation Statistics. <http://www.bts.gov/publications/national_transportation_statistics/html/table_02_17.html> (accessed 28.01.09).
New York State Department of Motor Vehicles, 2008a. Ticket and Crash Data Reports by County in 2001–2006. <http://www.nysgtsc.state.ny.us/hsdata.htm> (accessed 28.01.09).
New York State Department of Motor Vehicles, 2008b. The Number of All New York State Vehicle Registrations Considered Active in 1997–2006. <http://www.nydmv.state.ny.us/stats-arc.htm> (accessed 28.01.09).
New York State Department of Motor Vehicles, 2008c. Driver Licenses and Vehicle Registrations by County in 1997–2006. <http://www.nydmv.state.ny.us/stats-arc.htm> (accessed 28.01.09).
O'Donnell, J., 2009. Efforts to Limit Cellphone Use while Driving Grow. USA Today, March 30, 2009.
Olson, R.K., 2003. Cell Phone Bans for Drivers: Wise Legislation? International Risk Management Institute. <http://www.irmi.com/Expert/Articles/2003/Olson05-a.aspx> (accessed 28.01.09).
Ranney, T.A., 2005. Examination of the Distraction Effects of Wireless Phone Interfaces Using the National Advanced Driving Simulator – Final Report on the Freeway Study. NHTSA, DOT HS 809 787.
Redelmeier, D.A., Tibshirani, R.J., 1997. Association between cellular-telephone calls and motor vehicle collisions. The New England Journal of Medicine 336, 453–458.
Richtel, M., 2009. Promoting the Car Phone, Despite Risks. New York Times, December 7, 2009.
Royal, D., 2003. Volume I: Findings. National Survey of Distracted and Drowsy Driving Attitudes and Behavior: 2002. NHTSA, DOT HS 809 566.
Savage, M.A., Goehring, J.B., Mejeur, J., Reed, J.B., Sundeen, M., 2000. State Traffic Safety Legislative Summary 2000. Transportation Series (National Conference of State Legislatures). <http://www.ncsl.org/programs/transportation/trafsafe00.htm> (accessed 28.01.09).
Seber, G.A.F., 1984. Multivariate Observations. John Wiley and Sons, Hoboken, NJ.
Strayer, D.L., Drews, F.A., 2004. Profiles in driver distraction: effects of cell phone conversations on younger and older drivers. Human Factors 46 (4), 640–649.
Strayer, D.L., Drews, F.A., Johnston, W.A., 2003. Cell phone-induced failures of visual attention during simulated driving. Journal of Experimental Psychology: Applied 9 (1), 23–32.
Strayer, D.L., Drews, F.A., Crouch, D.J., 2006. A comparison of the cell phone driver and the drunk driver. Human Factors 48 (2), 381–391.
Sundeen, M., 2001. Driving while calling – what's the legal limit? State Legislatures 27 (9), 24–26.
Sundeen, M., 2004. Cell Phones and Highway Safety: 2003 State Legislative Update. National Conference of State Legislatures. <http://www.ncsl.org/print/transportation/cellphoneupdate12-03.pdf> (accessed 18.02.09).
Sundeen, M., 2007. Cell Phones and Highway Safety: 2006 State Legislative Update. National Conference of State Legislatures. <http://www.ncsl.org/print/transportation/2006cellphone.pdf> (accessed 28.01.09).
Williams, A.F., 2007. Contribution of the components of graduated licensing to crash reductions. Journal of Safety Research 38 (2), 177–184.
Operations Research
Balance Optimization Subset Selection (BOSS): An Alternative Approach for Causal Inference with Observational Data
Alexander G. Nikolaev, Sheldon H. Jacobson, Wendy K. Tam Cho, Jason J. Sauppe, Edward C. Sewell
To cite this article: Alexander G. Nikolaev, Sheldon H. Jacobson, Wendy K. Tam Cho, Jason J. Sauppe, Edward C. Sewell, (2013) Balance Optimization Subset Selection (BOSS): An Alternative Approach for Causal Inference with Observational Data. Operations Research 61(2):398–412. http://dx.doi.org/10.1287/opre.1120.1118
OPERATIONS RESEARCH
Vol. 61, No. 2, March–April 2013, pp. 398–412
ISSN 0030-364X (print), ISSN 1526-5463 (online)
http://dx.doi.org/10.1287/opre.1120.1118
© 2013 INFORMS

Balance Optimization Subset Selection (BOSS): An Alternative Approach for Causal Inference with Observational Data

Alexander G. Nikolaev, Department of Industrial and Systems Engineering, University at Buffalo (SUNY), Buffalo, New York 14260, [email protected]
Sheldon H. Jacobson, Department of Computer Science, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, [email protected]
Wendy K. Tam Cho, Departments of Political Science and Statistics and the National Center for Supercomputing Applications, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, [email protected]
Jason J. Sauppe, Department of Computer Science, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, [email protected]
Edward C. Sewell, Department of Mathematics and Statistics, Southern Illinois University Edwardsville, Edwardsville, Illinois 62026, [email protected]

Scientists in all disciplines attempt to identify and document causal relationships. Those not fortunate enough to be able to design and implement randomized control trials must resort to observational studies. To make causal inferences outside the experimental realm, researchers attempt to control for bias sources by postprocessing observational data. Finding the subset of data most conducive to unbiased or least biased treatment effect estimation is a challenging, complex problem. However, the rise in computational power and algorithmic sophistication leads to an operations research solution that circumvents many of the challenges presented by methods employed over the past 30 years.
Subject classifications: causal inference; balance optimization; subset selection.
Area of review: Optimization.
History: Received September 2011; revisions received February 2012, July 2012; accepted September 2012. Published online in Articles in Advance March 19, 2013.

1. Problem Description

Randomized experiments have been used by a diverse swath of researchers to isolate treatment effects and establish causal relationships. Such experiments have informed our understanding of medicine (e.g., the effect of drugs, the causes of cancer, the benefit of vitamins), and have been instrumental in the implementation of public policy (e.g., shedding insight on the effect of racial campaign appeals, testing the effect of get-out-the-vote appeals, determining the impact of new voting technologies). The randomized experimental framework is best suited for exploring causal inferences. In an experiment, a study population is chosen (ideally) at random, or otherwise by a careful selection of a convenient sample. Another random process determines whether or not each unit will receive a treatment. Because randomization ensures that the treatment and control units are identical in distribution, save that the treatment units have received a treatment, the treatment effect can then be defined as the difference in response (measurable outcome) between the units in the treatment group and those in the control group. In addition to offering tools for measuring estimation accuracy (e.g., calculating p-values, confidence intervals), randomization is powerful because it allows the effect of treatment to be isolated from that of confounding factors.

There are numerous situations where conducting a randomized experiment is impractical or not even possible (due to ethical dilemmas). For example, to determine whether smoking causes lung cancer, it would not be possible to randomly select people to smoke. Similarly, although it would be beneficial to understand the perils of radiation exposure, randomly choosing people and exposing them to high levels of radiation is unethical. Although experiments cannot be conducted for these pressing and important research queries, one can often collect observational data. So, although we would not expose people to situations that might put their health in peril, because these situations do occur, we can observe people who choose to smoke or find people who have been inadvertently exposed to radiation. This type of data is called observational data because it is observed (rather than created via experiments). Observational data are both more prevalent than experimental data and available for a larger set of important queries. Indeed, there are already many instances of research attempting to make causal inferences using observational data. In the health field, for example, studies have examined the impact of generic substitution of presumptively chemically equivalent drugs (Rubin 1991), the consequences of in utero exposure to phenobarbital on intelligence deficits (Reinisch et al. 1995), and the effect of maternal smoking on birthweight (da Veiga and Wilder 2008).
Public policy applications of causal analysis have included the impact of different voting technologies for counting votes (Herron and Wand 2007), the varying role of information on voters in mature versus new democracies (Sekhon 2004), and the effect of electoral rules on the presence of the elderly in national legislatures (Terrie 2008). At the same time, there is no consensus on how best to proceed if one wishes to make causal inferences with observational data. The critical difference between experiments and observational studies is that in experiments, because units are randomly assigned to a treatment, the distributions of their covariates (attributes) in the treatment and control groups are identical, isolating the effect of treatment and permitting its determination in expectation. Although various mechanisms have been proposed for random assignment in the statistical literature to handle such issues (Morris 1985), working with observational data sets requires a different set of tools. It is well recognized that confounding effects in a data set may exist due to both observed (those reflected in the data set) and unobserved covariates. Dealing with unobservable covariates is a fundamental challenge for causal inference and requires additional information to supplement the available data, whereas the effects of observed covariates can be isolated by data postprocessing, which has received significant interest from practitioners as reported above. A large body of literature has been sparked by the works of Rubin and Rosenbaum, the first to present definitions, assumptions, and discussions to arrive at a technically sound formulation of the causal inference problem with observational data (see individual references in the text below). This paper makes a contribution to this already rich literature, offering an alternative approach to causal analysis. 
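The contrast between the two settings can be made concrete with a minimal simulation (hypothetical numbers, not from the paper): when assignment is a coin flip independent of a unit's characteristics, a simple difference in mean responses recovers the treatment effect.

```python
import random

random.seed(7)

N = 20000
TRUE_EFFECT = 2.0  # fixed effect added to a unit's response when treated

# Each unit's baseline response, drawn independently of assignment:
baseline = [random.gauss(10.0, 1.0) for _ in range(N)]

# Random assignment: a fair coin flip per unit, independent of baseline.
treated = [random.random() < 0.5 for _ in range(N)]
response = [b + TRUE_EFFECT * t for b, t in zip(baseline, treated)]

n_t = sum(treated)
mean_t = sum(r for r, t in zip(response, treated) if t) / n_t
mean_c = sum(r for r, t in zip(response, treated) if not t) / (N - n_t)
estimate = mean_t - mean_c  # difference in means, close to TRUE_EFFECT
```

If assignment instead depended on `baseline` (as it effectively does in observational data, where units self-select into treatment), the same difference in means would absorb that dependence as bias.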
In order to analyze observational data, where treatment assignment has already been made (a priori nonrandomly), one must postprocess the data with respect to the observed covariates so as to remove confounding effects by creating treatment and control groups with statistically indistinguishable distributions of their covariates. How to best postprocess observational data and assess the success of this venture is an open question. To transition from a randomized experimental setting to an observational setting, the nuances and similarities of each must be examined. For unit u, let Y_u^1 (Y_u^0) denote a treated (untreated) response; T_u, a treatment indicator (1 means treated, 0 means not treated); and X_u = {X_1u, X_2u, ..., X_Ku}, a vector of values for K covariates. In both experimental and observational settings, a population of units is under consideration. For a particular unit u, the causal effect of the treatment (relative to the control) is defined as the difference in response that results from receiving and not receiving the treatment, Y_u^1 − Y_u^0. The fundamental problem of causal inference is that it is impossible to observe both values Y_u^1 and Y_u^0 on the same unit u (Holland 1986) (e.g., a person either smokes or does not smoke). The outcome of an observation of a unit is termed the observed response, T_u Y_u^1 + (1 − T_u) Y_u^0. The Rubin causal model (Rubin 1974, 1978) reconceptualizes this causal inference framework so that the response under either treatment or control, but not both, needs to be observed for each unit. That is, one statistical solution to the fundamental problem of causal inference is to shift to an examination of an average causal effect over all units in the population, E[Y_u^1 − Y_u^0] = E[Y_u^1] − E[Y_u^0], where E[Y_u^1] is computed from the treatment group and E[Y_u^0] is computed from the control group. An important consideration is how one determines which units will inform the values of Y_u^1 and Y_u^0.
In an observational study, one observes some pool of units who have received a treatment, giving E[Y_u^1 | T = 1], and some pool of units who have not received a treatment, giving E[Y_u^0 | T = 0]. In general, E[Y_u^1] ≠ E[Y_u^1 | T = 1] and E[Y_u^0] ≠ E[Y_u^0 | T = 0]. Moreover, the average treatment effect (ATE), E[Y_u^1 − Y_u^0], is not the same as the average treatment effect for the treated (ATT), E[Y_u^1 | T = 1] − E[Y_u^0 | T = 1]. By design, ATE and ATT are interchangeable if the independence assumption holds. That is, if exposure to treatment (T = 1) or control (T = 0) is statistically independent of response and covariate values, then the units have been properly randomized into treatment and control pools, rendering ATE and ATT to be the same. This situation is not typically the case in observational studies because units are not randomly placed into treatment and control pools. Instead, ATT = E[Y_u^1 | T = 1] − E[Y_u^0 | T = 1] = E[Y_u^1 | T = 1] − E[Y_u^0 | T = 0] + B, where selection bias is present, defined as B ≡ E[Y_u^0 | T = 0] − E[Y_u^0 | T = 1]. One approach for estimating treatment effects outside the experimental realm relies on multivariate statistical techniques, which fall under the broad rubric of matching methods (Rubin 2006). The core of these methods is to employ tools to match units based on their covariate similarity. This results in each treatment unit being matched with a control unit. If the matching venture is successful, then treatment and control groups are obtained such that the two groups are similar in their covariates, differing only on the treatment indicator value, thereby reducing the bias in the estimation of treatment effects. Although this set of techniques has been widely used, there remains a lack of consensus on how best to achieve matching or how to assess the success of a matching process.
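The ATT decomposition just described can be verified on a toy population with fully known (hypothetical) potential outcomes; the identity ATT = (naive observed difference) + B holds exactly, and the example shows the naive difference can even have the wrong sign.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Each unit: (Y1, Y0, T). Treatment take-up is nonrandom here:
# units with higher untreated response Y0 were less likely to be treated.
units = [
    (7.0, 5.0, 1),
    (6.5, 4.5, 1),
    (6.0, 4.0, 1),
    (9.0, 8.0, 0),
    (8.5, 7.5, 0),
    (8.0, 7.0, 0),
]

ate = mean([y1 - y0 for y1, y0, _ in units])                 # E[Y1 - Y0]
att = mean([y1 - y0 for y1, y0, t in units if t == 1])       # effect on treated
# Naive observed difference E[Y1 | T=1] - E[Y0 | T=0]:
naive = mean([y1 for y1, _, t in units if t == 1]) \
    - mean([y0 for _, y0, t in units if t == 0])
# Selection bias B = E[Y0 | T=0] - E[Y0 | T=1]:
bias = mean([y0 for _, y0, t in units if t == 0]) \
    - mean([y0 for _, y0, t in units if t == 1])
```

Here the true ATT is 2.0, yet the naive comparison of observed groups gives −1.0; the gap is exactly the selection bias B = 3.0.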
Nikolaev et al.: BOSS: An Alternative Approach for Causal Inference. Operations Research 61(2), pp. 398–412, © 2013 INFORMS.

However, a generally accepted principle is that balance on the covariates leads to minimal bias in the estimated treatment effect (Rosenbaum and Rubin 1985). Here, balance has been loosely understood as similarity between the distributions of covariates in the treatment and control groups. Therefore, whereas most researchers agree that a reasonable goal of matching procedures is to obtain balance, there remains disagreement on how to measure balance, making it difficult to assess how a particular matched group compares to other possible matched groups that achieve varying levels of balance. The resulting lack of guidance is a critical omission, because different matched sets can lead to conflicting conclusions. Interestingly, few of the existing matching methods directly attempt to obtain optimal covariate balance, despite claiming that covariate balance is the measure by which to judge the success of the matching procedure. Instead, researchers perform some type of matching (e.g., propensity score matching, Mahalanobis matching), check to see whether the groups appear to be roughly similar, and, if unsatisfied, modify parameters of the matching procedure (e.g., distance metric weights or regression model specification) and repeat (see Figure 1). The point at which to end this iterative procedure is at the discretion of the researcher. By design, researchers are unable to objectively assess the quality of their final matched groups, because the benchmark, the matched groups with optimal balance, is unknown.
Recognizing this issue, recent work by Diamond and Sekhon (2010) attempts to streamline the "match, check balance, adjust, and repeat as needed" process by using a genetic algorithm to adjust the parameters and weights used in the matching algorithm so as to obtain matched samples with the best possible balance measure. Other researchers have also begun to move toward the idea of directly optimizing balance within a matched-samples framework. In particular, Rosenbaum et al. (2007) introduce the notion of fine balance, which "refers to exactly balancing a nominal variable, often one with many categories, without trying to match individuals on this variable" (Rosenbaum et al. 2007, p. 75). This relaxation from exact individual matches on a covariate to equal proportions of individuals in the treatment and control groups for each value of the covariate is central to the approach proposed in this paper. Whereas Rosenbaum et al. (2007) consider fine balance for one (nominal) covariate, with matches required on the rest, this paper extends the concept to all covariates. Another recent effort introduced entropy balancing (Hainmueller 2012), which uses a maximum entropy reweighting scheme to adjust weights for each of the control individuals in order to meet user-specified balance constraints placed on the moments of the covariate distributions. For more background on the idea of weighting observations in a data set, see Hellerstein and Imbens (1999). Matching treatment and control units on an individual level is one method to achieve covariate balance; however, it is not a guarantee. We argue that although the focus in the causal inference literature has been on matching, the matching itself of treatment units to control units is not necessary. Notable publications that support the idea of conducting causal analysis on an aggregate, group level include Abadie and Gardeazabal (2003) and Abadie et al. (2010).
Matching is not the only way to reduce selection bias, and arguably not even the best way, because one is not interested in unit matches per se, but in creating control and treatment groups that are statistically indistinguishable in the covariates (i.e., featuring covariate balance). This observation suggests that a shift is possible in how treatment and control groups are created. To realize such a shift, §2 motivates and presents the Balance Optimization Subset Selection (BOSS) approach to the problem of causal inference based on observational data. Section 3 reports computational results from one BOSS algorithm for the estimation of the treatment effect in a simulated problem. Section 4 offers concluding remarks, discusses the potential of the BOSS approach, raises some theoretical and practical challenges, and outlines several topics for future investigation within the operations research community. Note that the main contribution of this paper is conceptual and theoretical. The goal of §2 is to present the problem of causal inference in a new light, opening up a field where optimization tools developed within the operations research community can make an impact. By motivating and formalizing an alternative approach to a problem of great importance to multiple domains of modern science, this paper is intended as a seed for more applied, computationally oriented literature. Section 3 is not meant to be comprehensive; rather, it illustrates that the proposed theory shifts the problem at hand into the computational realm, and it supports the call for more intense, goal-driven computational research on BOSS. The electronic companion to this paper is available as supplemental material at http://dx.doi.org/10.1287/opre.1120.1118.

Figure 1. Matching methods logic.
[Figure 1 flowchart: choose/adjust regression or matching model parameters; run a matching algorithm to find a solution; check whether the covariates in the treatment and control groups are balanced; if not, adjust and repeat; if yes, report a treatment effect estimate and bootstrap for its variance.]

2. BOSS Approach

The presented approach offers an alternative perspective on causal inference using observational data. It exploits the idea that covariate balance leads to minimized bias in the estimated treatment effect by directly optimizing a balance measure, without requiring matched samples. As noted in §1, although the success of matching methods is assessed by the degree of balance achieved, very few of the current matching methods directly optimize balance, resorting instead to different types of optimization problems (e.g., optimal parameter estimation for regression models, optimal assignment for unit matching with calipers). Traditional matching methods simply report balance statistics, without a guide to assessing whether the reported balance could be improved upon, is good, or is even sufficient. There may be no standard metric for the degree of balance achieved; nevertheless, a discussion of balance is always presented and perceived as a final verdict validating a conducted analysis. This simple observation highlights that the problem at hand is a balance optimization problem, not a matching problem. Matching is one method to obtain balance, but it unnecessarily restricts the solution space and lacks a measure of balance optimality. Indeed, the end goal is balance, not matching, and hence, optimizing balance measures directly is reasonable and preferred. The BOSS approach to causal inference with observational data reformulates the problem as one of balance optimization (Cho et al. 2011).
In so doing, the problem is transformed from one of matching individual units to a subset selection problem, and it exploits operations research methodologies (in particular, discrete optimization) that are ideally suited to model and address the balance optimization problem. In essence, BOSS inverts the direction of the solution methodology and redefines the problem structure to directly pursue the goal of covariate balance (see Figure 2). Note that this subset selection approach comes at the cost of losing the qualitative information carried by individual matches, which may be useful in some practical situations; however, group-based average quantities can be estimated more precisely.

Figure 2. BOSS logic. [Flowchart: choose a balance measure (a statistic for testing distribution fit); run a BOSS algorithm to find multiple solutions minimizing the balance measure; report the balance and compute the mean and variance of the treatment effect.]

To motivate the subset selection problem and explain why balance on covariates is required for unbiased estimation of the treatment effect, a formal problem formulation is presented.

2.1. The Value of Covariate Balance

Let $S_N \equiv \{u_i\}_{i=1}^N$ denote a set of $N$ observed units. Define the average treated response $\bar{Y}_{S_N}^1 = (1/N) \sum_{u \in S_N} Y_u^1$ and the average untreated response $\bar{Y}_{S_N}^0 = (1/N) \sum_{u \in S_N} Y_u^0$. Given a set of units that have received treatment, treatment pool $T$; a set of units that have not received treatment, control pool $C$; and a set of $K$ covariates, a pair of subsets for comparison is identified: treatment group $S_N^T \subset T$ and control group $S_N^C \subset C$. To understand the value of covariate balance in causal inference, the following assumption is required (Rosenbaum and Rubin 1983).

Assumption 1 (Strong Ignorability for Groups). Consider a population of all groups of size $N$, where $S_N \equiv \{u_i\}_{i=1}^N$ denotes any such group of $N$ observed units, which are either entirely treated (i.e., $\{T_u = 1\}_{u \in S_N}$) or entirely untreated (i.e., $\{T_u = 0\}_{u \in S_N}$). For any set of covariates $\{X_u\}_{u \in S_N}$, assume

$$(\bar{Y}_{S_N}^1, \bar{Y}_{S_N}^0) \perp\!\!\!\perp \{T_u\}_{u \in S_N} \mid \{X_u\}_{u \in S_N}, \quad (1)$$

$$0 < P(\{T_u = 1\}_{u \in S_N} \mid \{X_u\}_{u \in S_N}) < 1. \quad (2)$$

Expression (1) means that for any group of units, its average responses are independent of treatment, given the units' covariate values. The symbol $\perp\!\!\!\perp$ signifies conditional independence (Dawid 1979). This implies that the $K$ observed covariates include all the covariates, dependent on the treatment assignment $T_u$, that have causal effects on the responses $Y_u^1$ and $Y_u^0$, for every unit $u$. Additionally, by expression (2), each group with a given set of its units' covariate values is assumed to have a positive probability of appearing in either the treatment pool or the control pool. These assumptions are made throughout the statistical literature, albeit for individual units (Rosenbaum and Rubin 1983); Assumption 1 is equivalent to the original assumption of Rosenbaum and Rubin (1983) when $N = 1$. The following proposition captures the objective of any method of postprocessing observational data for causal inference.

Proposition 1. Assume that Assumption 1 holds. From the treatment pool, randomly select treatment group $S_N^T$. Next, randomly select groups of size $N$ from the control pool until a control group $S_N^C$ is identified such that $\{X_u\}_{u \in S_N^C}$ and $\{X_u\}_{u \in S_N^T}$ are identically distributed. Then,

$$E(\bar{Y}_{S_N^T}^1 - \bar{Y}_{S_N^C}^0) = \mathrm{ATT}. \quad (3)$$

Proof. The described mechanism for the selection of $S_N^T$, and subsequently $S_N^C$, ensures that

$$E(\bar{Y}_{S_N^T}^1) = E_x\big[E(\bar{Y}_{S_N}^1 \mid \{T_u = 1\}_{u \in S_N} \cap \{X_u\}_{u \in S_N} = x) \,\big|\, \{T_u = 1\}_{u \in S_N}\big]$$

and

$$E(\bar{Y}_{S_N^C}^0) = E_x\big[E(\bar{Y}_{S_N}^0 \mid \{T_u = 0\}_{u \in S_N} \cap \{X_u\}_{u \in S_N} = x) \,\big|\, \{T_u = 1\}_{u \in S_N}\big].$$

By definition,

$$\mathrm{ATT} = E(\bar{Y}_{S_N}^1 - \bar{Y}_{S_N}^0 \mid \{T_u = 1\}_{u \in S_N}).$$

By conditioning,

$$\mathrm{ATT} = E_x\big[E(\bar{Y}_{S_N}^1 - \bar{Y}_{S_N}^0 \mid \{T_u = 1\}_{u \in S_N} \cap \{X_u\}_{u \in S_N} = x) \,\big|\, \{T_u = 1\}_{u \in S_N}\big],$$

and under Assumption 1,

$$\mathrm{ATT} = E_x\big[E(\bar{Y}_{S_N}^1 \mid \{T_u = 1\}_{u \in S_N} \cap \{X_u\}_{u \in S_N} = x) \,\big|\, \{T_u = 1\}_{u \in S_N}\big] - E_x\big[E(\bar{Y}_{S_N}^0 \mid \{T_u = 0\}_{u \in S_N} \cap \{X_u\}_{u \in S_N} = x) \,\big|\, \{T_u = 1\}_{u \in S_N}\big],$$

which completes the proof. □

From Proposition 1, the key to causal inference research is the ability to identify control groups whose joint distribution of covariates is identical to that of a treatment group. This translates into the property that the probability that (as a group) the units in $S_N^C$ could be treated is the same as the probability that the units in $S_N^T$ are treated. Note that for individual units (i.e., for $N = 1$), this probability is known as the propensity score. If the distributions of covariates in groups $S_N^T$ and $S_N^C$ are the same, then such groups are said to be optimally balanced on the set of $K$ covariates, rendering $P(\{T_u = 1\}_{u \in S_N^C}) = P(\{T_u = 1\}_{u \in S_N^T})$. The result of Proposition 1 is for groups of units, not individual units. If groups $S_N^T$ and $S_N^C$ have one unit each ($N = 1$), and these units are perfectly matched ($X_{u \in S_N^C} = X_{u \in S_N^T}$), then (3) holds. Similarly, in propensity score-based methods (Rosenbaum and Rubin 1983), regression is used to match units with the same estimated probabilities of being treated, again so that $P(\{T_u = 1\}_{u \in S_N^C}) = P(\{T_u = 1\}_{u \in S_N^T})$ for groups of such units. In all these methods, however, a value assessing covariate balance is judged only after the data have been postprocessed, with covariate balance not serving as a direct guide for optimal group selection. Although more rigorously designed propensity score models might mitigate this problem to some degree, such potential advances will require deeper statistical design research in the future.

2.2. Modeling and Optimization for Causal Inference

BOSS reframes the causal inference problem as a subset selection problem.
The goal is to randomly generate $S^T$, a subset of $T$, and to find $S^C$, a subset of $C$, such that a measure of balance, $M(S^T, S^C)$, is optimized. This discrete optimization problem can be addressed using operations research algorithms and heuristics. Moreover, this formulation lays the foundation for the development of a new analytical model that exploits the power of ever-increasing computational resources to assess, inform, and improve data analytic techniques. The BOSS conceptualization is flexible and falls within a general discrete optimization framework, and various measures of balance can be adapted into it. This paper provides a detailed statement of one instance of a balance optimization problem, using a balance measure for a binning model. An intuitive way of comparing distributions is a visual study of histograms based on their probability mass functions (Imai 2005); using goodness-of-fit test statistics based on histograms is a more precise and rigorous way of quantifying the difference between the covariate distributions of $S^T$ and $S^C$. More formally, for each covariate $k = 1, 2, \ldots, K$, its range $[L_k, U_k]$, with $L_k = \min_{u \in T \cup C} X_{ku}$ and $U_k = \max_{u \in T \cup C} X_{ku}$, can be broken up by thresholds $L_k = t_0^k < t_1^k < t_2^k < \cdots < t_{R(k)}^k = U_k$. The total number of thresholds $R(k)$ used for covariate $k = 1, 2, \ldots, K$ is typically the number of categories for discrete (categorical) variables and some positive integer for continuous variables. This is similar to the coarsening procedure proposed by Iacus et al. (2012) for coarsened exact matching. Let covariate cluster $D$ denote a subset of the set of covariates, $D \subseteq \{1, 2, \ldots, K\}$. For any covariate cluster $D = \{k_1, k_2, \ldots, k_m\}$ consisting of $m$ covariates, with $1 \le k_1 < k_2 < \cdots < k_m \le K$, define the set of bins $B^D$ as the set of intervals of the form $[t_{r_1-1}^{k_1}, t_{r_1}^{k_1}] \times [t_{r_2-1}^{k_2}, t_{r_2}^{k_2}] \times \cdots \times [t_{r_m-1}^{k_m}, t_{r_m}^{k_m}]$ that spans the entire joint range of values of the covariates in $D$.
Assuming a given fixed ordering of the elements in $B^D$, the individual bins are indexed $\{B_1^D, B_2^D, \ldots, B_{R_m}^D\}$, with $R_m \equiv \prod_{j=1}^m R(k_j)$. These bins are used to quantify the difference between the joint distributions of values of the covariates in $D$ for groups $S^T$ and $S^C$. Let $N(S, B_b^D)$ denote the number of units in group $S$ with the values of the covariates in $D$ contained in bin $B_b^D$, i.e., the number of units falling into bin $b$. The objective of the BOSS optimization problem is to minimize the difference between $N(S^C, B_b^D)$ and $N(S^T, B_b^D)$ over all of the bins for all covariate clusters of interest, where any objective function that simultaneously minimizes these differences can be used to evaluate the distribution fit. The Balance Optimization Subset Selection with Bins (BOSS-B) problem is now formally stated:

Given: $K$ covariates; a fixed integer $N$; set $S^T$ of size $N$, randomly selected from the set $T$ of units represented by vectors $\{X_{1u}, X_{2u}, \ldots, X_{Ku}\}$, $u \in T$; set $C$ of units represented by vectors $\{X_{1u}, X_{2u}, \ldots, X_{Ku}\}$, $u \in C$, with $|C| > N$; a set of covariate clusters $\mathcal{D}$; and bins $B^D$ for each $D \in \mathcal{D}$.

Objective: find a subset $S^C \subset C$ of size $N$ such that

$$\sum_{D \in \mathcal{D}} \; \sum_{b = 1, 2, \ldots, |B^D|} \frac{\big(N(S^C, B_b^D) - N(S^T, B_b^D)\big)^2}{\max\big(N(S^T, B_b^D), 1\big)} \quad (4)$$

is minimized.

BOSS-B is a balance optimization problem. It exemplifies how the BOSS approach can be used for causal inference, with one measure of balance $M(S^T, S^C)$ expressed by (4).
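The binning scheme and objective (4) can be sketched in a few lines; the helper names below are illustrative, and uniform bin thresholds over the pooled range are assumed:

```python
import numpy as np

def bin_indices(X, R, lo, hi):
    """Convert covariate values into per-covariate bin numbers 0..R-1.

    X is an (n_units, K) array; lo and hi are the pooled ranges
    [L_k, U_k] over T ∪ C, and thresholds are R uniform cut points.
    """
    span = np.where(hi > lo, hi - lo, 1.0)
    return np.minimum(((X - lo) / span * R).astype(int), R - 1)

def diff_sqr(bins_T, bins_C, clusters, R):
    """Objective (4): sum over clusters D and bins b of
    (N(S^C, b) - N(S^T, b))^2 / max(N(S^T, b), 1)."""
    total = 0.0
    for D in clusters:                        # D is a tuple of covariate indices
        idx_T = np.zeros(len(bins_T), dtype=int)
        idx_C = np.zeros(len(bins_C), dtype=int)
        for k in D:                           # flatten the m-dimensional bin address
            idx_T = idx_T * R + bins_T[:, k]
            idx_C = idx_C * R + bins_C[:, k]
        n_T = np.bincount(idx_T, minlength=R ** len(D))
        n_C = np.bincount(idx_C, minlength=R ** len(D))
        total += float(np.sum((n_C - n_T) ** 2 / np.maximum(n_T, 1)))
    return total
```

Passing `clusters=[(0,), (1,), (2,)]` balances the three marginal distributions, while `clusters=[(0, 1)]` balances one joint distribution; both units' bins should be computed from the same pooled ranges before splitting into $S^T$ and $S^C$.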
In BOSS-B, assignments of treatment and control units into groups are determined such that a finite number of preselected marginal and/or joint distributions of covariates are optimally balanced, thereby isolating the effect of treatment from the marginal and/or joint effects of these covariates and reducing bias in the estimated expected difference between the treatment and control responses. The objective function (4) is similar in form to the chi-square test statistic, which lends additional meaning to the formulation. As more distributions become simultaneously balanced, which occurs as the number of bins increases, more accurate estimates of the treatment effect can be obtained. However, as more bins are used and the histogram resolution increases, optimizing (4) becomes more difficult, because fewer and fewer control groups can be identified as similar to the treatment group. Additionally, the number of required bins for a covariate cluster grows exponentially with the number of covariates in that cluster. Fortunately, this exponential growth is mitigated by the fact that the number of occupied bins for any covariate cluster is at most $|T| + |C|$. The decision version of BOSS-B is NP-complete through a polynomial many-one reduction from the Exact Cover by 3-Sets problem, which is known to be NP-complete (Garey and Johnson 1979); hence, the optimization version of BOSS-B is NP-hard (see the online supplement for a formal proof). However, for small problem instances, algorithms such as simulated annealing suffice to deliver good results in reasonable computing time. Note also that algorithms solving an instance of BOSS often encounter a large number of optimal or nearly optimal solutions, depending on the binning scheme that is used. As one might intuitively guess, there exist multiple subsets of the treatment and control pools (i.e., solutions to a balance optimization problem) that yield optimal or nearly optimal balance.
Swapping out a single unit for another often produces only small changes in the balance function, and even fairly large differences in subsets can result in similar balance values. Accordingly, in addition to finding the optimal balance, it is helpful to also examine the subsets that produce similarly balanced covariates and to estimate the spread of the distribution of the treatment effect. For a given instance of BOSS-B, refer to solutions with an objective function value of zero in (4) as perfectly optimized. A perfectly balanced solution (i.e., one that has exactly the same joint distribution of covariates in the control group as in the treatment group) is perfectly optimized in any measure of balance, though the reverse is not necessarily true; for example, balance on all of the marginal distributions does not generally imply balance on the joint distribution.

Three sources of error are inherent in the application of BOSS-B: error due to noise in the response functions for $Y^1$ and $Y^0$; error due to bin size (or the number of bins used); and error due to a nonzero objective function value (when a perfectly optimized solution is not found or does not exist). The first source of error is present in all problems, resulting from the uncertainty inherent in all processes in nature, and hence cannot be eliminated. However, given Assumption 1, the noise in the response has zero mean and averages to zero for sufficiently large treatment and control groups. The other two sources of error are not so well behaved; however, under certain assumptions, their impact can be limited. Ideally, one would like to obtain $S_N^C \subset C$ featuring perfect balance on the joint distribution of all covariates, $D = \{1, 2, \ldots, K\}$. Note that this condition is equivalent to perfect individual matching, which, if possible, one could find in polynomial time (in the sizes of $T$ and $C$, and $N$) using an assignment algorithm. In practice, however, this is rarely achievable for large $N$; therefore, suboptimal solutions may need to be considered, which is why working with observational data is a challenge. Fortunately, perfect balance on the joint distribution of all covariates may not be necessary for accurate inference: most real-world causal inference problems can be addressed using groups that offer good, albeit not perfect, balance, or using groups that are perfectly balanced on a more limited set of marginal and/or joint distributions of covariates. Theorem 1 illustrates the latter point.

2.3. Theoretical Aspects of BOSS-B

This section discusses how solutions to a balance optimization problem can be used to obtain estimates of ATT, and how the estimation bias is reduced as a function of the covariate clusters in BOSS-B (more specifically, the number of bins) and the quality of the solutions achieved for a given measure of balance. Without loss of generality, assume that $S^T = T$: in most real-world observational studies, treated units are rare, and hence all available treated units are included in the treatment group. Therefore, a solution to BOSS-B is a control group that is selected out of a larger control pool of units.

Theorem 1. Suppose that for any unit $u$, the response $Y_u^{1(0)}$ can be expressed as a sum of functions of the individual covariates,

$$Y_u^{1(0)} = \sum_{k = 1, 2, \ldots, K} h_k^{1(0)}(X_{ku}) + \epsilon^{1(0)}, \quad (5)$$

where random variable $\epsilon^{1(0)}$ represents noise, with $E(\epsilon^{1(0)}) = 0$. Suppose also that each function $h_k^{1(0)}(X_{ku})$ is locally Lipschitz continuous, such that for each $k = 1, 2, \ldots, K$,

$$\big|h_k^{1(0)}(x_1) - h_k^{1(0)}(x_2)\big| \le L_k^{1(0)} |x_1 - x_2|, \quad (6)$$

where $L_k^{1(0)}$ is a positive Lipschitz constant for the function $h_k^{1(0)}$. Consider an instance of BOSS-B with $S_N^T = T$, $N = |T|$, and $\mathcal{D} = \{\{1\}, \{2\}, \ldots, \{K\}\}$.
The bias that arises in the estimation of ATT using the estimator $\bar{Y}_{S_N^T}^1 - \bar{Y}_{S_N^C}^0$, obtained from a perfectly optimized solution $S_N^C \subset C$, then converges to zero as the number of bins in the sets $B^D$, $D \in \mathcal{D}$, approaches infinity telescopically (i.e., as the number of bins is increased by uniform sequential subpartitioning).

Proof. Consider the control group $S_N^{C(1)}$, a perfectly optimized solution to an instance of BOSS-B with fixed sets of bins $B^D$, $D \in \mathcal{D}$. Also, consider control group $S_N^{C(2)}$, a perfectly optimized solution to the same instance of the BOSS-B problem in which bin $B_r^D \in B^D$, for some $D = \{k\} \in \mathcal{D}$, $k \in \{1, 2, \ldots, K\}$, and $r \in \{1, 2, \ldots, R(k)\}$, is partitioned to form bins $B_{r1}^D$ and $B_{r2}^D$ such that $B_{r1}^D \cap B_{r2}^D = \emptyset$ and $B_{r1}^D \cup B_{r2}^D = B_r^D$. Define the sets $I_r = \{i\colon i \in S_N^T,\ X_{ki} \in B_r^D\}$, $J_r^{(1)} = \{j\colon j \in S_N^{C(1)},\ X_{kj} \in B_r^D\}$, and $J_r^{(2)} = \{j\colon j \in S_N^{C(2)},\ X_{kj} \in B_r^D\}$. Let $\Delta_1$, $\Delta_2$, and $\Delta$ denote the volumes of bins $B_{r1}^D$, $B_{r2}^D$, and $B_r^D$, respectively. Also, let $Z$ denote the number of control units in $S_N^{C(1)}$ falling into bin $B_r^D$, and let $Z_1$, $Z_2$ denote the numbers of control units in $S_N^{C(2)}$ falling into bins $B_{r1}^D$ and $B_{r2}^D$, respectively. By design, $\Delta = \Delta_1 + \Delta_2$ and $Z = Z_1 + Z_2$, and $|J_{r1}^{(2)}| = Z_1$, $|J_{r2}^{(2)}| = Z_2$, and $|I_r| = |J_r^{(1)}| = Z$.

Proposition 1 describes an approach for selecting treatment and control groups that ensures that $\bar{Y}_{S_N^T}^1 - \bar{Y}_{S_N^C}^0$ is an unbiased estimator of ATT. Using this notation, observe that $(1/|I_r|) \sum_{i \in I_r} Y_i^1$ is an unbiased estimator of $E(\bar{Y}_{S_N}^1 \mid \{T_i = 1\}_{i \in I_r})$, by (5). However, in general, $(1/|J_r^{(1)}|) \sum_{j \in J_r^{(1)}} Y_j^0$ is not an unbiased estimator of $E(\bar{Y}_{S_N}^0 \mid \{T_i = 1\}_{i \in I_r})$, because the exact values of covariate $k$ for the control units falling into a single bin may differ from the values for the treatment units in the same bin. As such, an imbalance is created within bin $B_r^D$, because the treatment and control values are not identically distributed within the bin. This imbalance results in a contribution $B(B_r^D)$ to the bias in the estimation of $E(\bar{Y}_N^0 \mid \{T_i = 1\}_{i \in I_r})$ using $S_N^{C(1)}$,

$$B(B_r^D) \equiv \frac{1}{|J_r^{(1)}|} \sum_{j \in J_r^{(1)}} E(Y_j^0) - \frac{1}{|I_r|} \sum_{i \in I_r} E(Y_i^0).$$

From (5) and (6), pairing the $Z$ units of $I_r$ with the $Z$ units of $J_r^{(1)}$,

$$\big|B(B_r^D)\big| = \frac{1}{Z}\,\bigg|E\bigg(\sum_{j \in J_r^{(1)}} h_k^0(X_{kj}) - \sum_{i \in I_r} h_k^0(X_{ki})\bigg)\bigg| \le \frac{1}{Z} \sum_{i \in I_r,\, j \in J_r^{(1)}} \big|h_k^0(X_{kj}) - h_k^0(X_{ki})\big| \le L_k^0 \Delta \equiv U^{(1)},$$

where the middle sum runs over the paired units and $U^{(1)}$ is an upper bound on the bias $B(B_r^D)$. Similarly, by (5), an imbalance within bins $B_{r1}^D$ and $B_{r2}^D$ results in contributions $B(B_{r1}^D)$ and $B(B_{r2}^D)$, respectively, to the bias in the estimation of $E(\bar{Y}_N^0 \mid \{T_i = 1\}_{i \in I_r})$ using $S_N^{C(2)}$, with

$$\big|B(B_{r1}^D)\big| + \big|B(B_{r2}^D)\big| \le \frac{1}{Z}\bigg(\sum_{i \in I_{r1},\, j \in J_{r1}^{(2)}} \big|h_k^0(X_{kj}) - h_k^0(X_{ki})\big| + \sum_{i \in I_{r2},\, j \in J_{r2}^{(2)}} \big|h_k^0(X_{kj}) - h_k^0(X_{ki})\big|\bigg).$$

Therefore, by (6),

$$\big|B(B_{r1}^D)\big| + \big|B(B_{r2}^D)\big| \le L_k^0\, \frac{Z_1 \Delta_1 + Z_2 \Delta_2}{Z_1 + Z_2} \equiv U^{(2)},$$

which is an upper bound on the bias $B(B_{r1}^D) + B(B_{r2}^D)$. Observe that for $Z_1 > 0$, $Z_2 > 0$, $\Delta_1 > 0$, and $\Delta_2 > 0$, $Z_1 \Delta_1 + Z_2 \Delta_2 < Z \Delta$, and hence $U^{(2)} < U^{(1)}$. Moreover, if bin $B_r^D$ is subpartitioned uniformly, which implies $\Delta_1 = \Delta_2$, then $U^{(2)} = U^{(1)}/2$.

Generalizing this argument to a telescopically increasing number of subpartitioned bins, let $U$ denote the bias in the estimation of $E(\bar{Y}_{S_N}^0 \mid \{T_u = 1\}_{u \in S_N})$ when no optimization is conducted and $S_N^C \equiv C$. Because $U$ is finite, for a perfectly optimized solution $S_N^C$ to the instance of BOSS-B with the set of all bins $\mathcal{B} = \bigcup_{D \in \mathcal{D}} B^D$, the total bias can be bounded, and it converges to zero as the number of bins $|\mathcal{B}|$ approaches infinity:

$$B \equiv \frac{1}{N} \sum_{u \in S_N^C} Y_u^0 - E(\bar{Y}_N^0 \mid \{T_u = 1\}_{u \in S_N}) \le \sum_{b \in \mathcal{B}} B(b) \le \frac{U}{|\mathcal{B}|} \to 0. \qquad \square$$

Theorem 1 assumes that the response function (5) is separable, meaning that it can be represented as a sum of functions of individual covariates. Although such an assumption may appear restrictive, this class of functions subsumes the class of extensively studied separable models given by

$$Y_u = \beta_0 + \beta_1 \Phi(X_{1u}) + \beta_2 \Phi(X_{2u}) + \cdots + \beta_K \Phi(X_{Ku}) + \epsilon.$$

Furthermore, in the linear modeling literature, if the response function includes a term that is a function of two or more covariates, say $X_{k_1 u} \cdot X_{k_2 u}$, then the response function can be converted to a linear model by introducing a new covariate that is the product of covariates $k_1$ and $k_2$. More generally, if the response function depends on several covariates jointly, say on $(X_{k_1 u}, X_{k_2 u}, \ldots, X_{k_d u})$, with $1 \le k_1 < k_2 < \cdots < k_d \le K$, then the response function can be transformed to satisfy the assumptions of Theorem 1 by introducing a new covariate that is the joint of $X_{k_1 u}, X_{k_2 u}, \ldots, X_{k_d u}$. Theorem 1 shows that under (5) and (6), as the number of bins in the BOSS-B problem grows and perfectly optimized solutions are identified, $\bar{Y}_{S_N^T}^1 - \bar{Y}_{S_N^C}^0$ monotonically converges to $E(\bar{Y}_N^1 - \bar{Y}_N^0 \mid \{T_u = 1\}_{u \in S_N})$, and hence gives the minimally biased estimator of ATT that can be obtained using the available observed data.

3. Computational Analysis

This section illustrates the theory of §2 with a simple numerical example; its contribution is more illustrative than fundamental. By setting up a computational model for a limited problem and using a generic optimization algorithm to attack it, the reader can visually inspect the dynamics of the proposed balance optimization and the convergence of the proposed estimator to the treatment effect.
It also provides grounds to discuss future computational challenges for BOSS. The simulated experiments presented illustrate that as a balance measure approaches its optimal value, the bias in the estimate of the treatment effect decreases; additionally, as the number of bins increases, (4) allows for more accurate estimation of the treatment effect.

3.1. Experimental Setup

To illustrate the BOSS-B approach, two data sets were created, designated data3c10k and data10c10k. Each data set consists of a treatment group of 500 units and a control pool of 10,000 units, using 3 and 10 covariates, respectively. The data sets were created by first randomly generating a pool of 5,000 potential treatment individuals and a pool of 10,000 control individuals, with the covariate values for each unit drawn from a normal distribution. Once the units were generated, each unit $i$ was assigned a response value using the expression

$$Y_i^{1(0)} = 10 + 7X_{1i} + 6X_{2i} + 5X_{3i} - 3X_{4i} + 3X_{5i} + 2X_{6i} + X_{7i} - X_{8i} + 0.5X_{9i} + 0.1X_{10i} + \epsilon_i, \quad (7)$$

where $\epsilon_i \sim N(0, 2)$. (The extra covariate terms are omitted for data3c10k.) Under this formulation, there is no treatment effect (i.e., exposure to treatment has no effect on the response): ATT = 0. Once the individuals were created, a treatment group of 500 units was drawn randomly but nonuniformly from the pool of potential treatment individuals. Individuals with covariate values in the tails of the covariate distributions were drawn with higher probability than those with values in the center, ensuring that the resulting treatment and control groups had different covariate distributions. Figure 3 shows the initial distributions in the treatment group and control pool for covariates 1, 2, and 3 of data3c10k. In these histograms, covariate values are separated into 32 uniformly sized bins.
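The experimental setup can be sketched as follows. The exponential tail-weighting used to draw the treatment group is an illustrative stand-in for the paper's unspecified nonuniform sampling rule, and $N(0, 2)$ is read as variance 2; all helper names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 10
pool_T = rng.normal(size=(5_000, K))    # potential treatment individuals
pool_C = rng.normal(size=(10_000, K))   # control pool

def response(X, rng):
    """Response model (7); there is no treatment effect, so ATT = 0."""
    beta = np.array([7, 6, 5, -3, 3, 2, 1, -1, 0.5, 0.1])
    # Noise variance 2 (assuming N(0, 2) in the text denotes the variance).
    return 10 + X @ beta + rng.normal(scale=np.sqrt(2), size=len(X))

y_T = response(pool_T, rng)
y_C = response(pool_C, rng)

# Draw a 500-unit treatment group, favoring units in the covariate tails
# (illustrative weighting: probability grows with distance from the mean).
w = np.exp(np.abs(pool_T).sum(axis=1))
w /= w.sum()
treat_idx = rng.choice(len(pool_T), size=500, replace=False, p=w)
X_T = pool_T[treat_idx]
```

Drawing with these weights leaves the treatment group's covariate distributions visibly heavier-tailed than the control pool's, which is the imbalance the optimization must then remove.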
The number of control units in a bin was normalized by a factor of 1/20 to account for the difference in size between the treatment group and the control pool. The histograms indicate that the covariate distributions of the treatment group differ from those of the control pool, particularly for the first two covariates.

Optimization was performed using a simulated annealing algorithm (Kirkpatrick et al. 1983). In the experiments, the preselected treatment group was used, and the desired control group size was 500 units. The first step in the algorithm is to bin the data: each unit is converted from a vector of covariate values $\{X_{1i}, X_{2i}, \ldots, X_{Ki}\}$ into a vector of bin numbers $\{X'_{1i}, X'_{2i}, \ldots, X'_{Ki}\}$, where $X'_{ki} = j$ if and only if $t_{j-1}^k \le X_{ki} \le t_j^k$ (i.e., unit $i$ falls into bin $j$ for covariate $k$). In the experiments, the bin thresholds were uniformly spaced across the covariate distributions, with $R(k)$ set to a given value (an input parameter) for all covariates $k = 1, 2, \ldots, K$. Moreover, a unique covariate cluster was created for each individual covariate; by Theorem 1, these covariate clusters are sufficient for generating an accurate estimate of ATT because of the separability of the response function (7). After binning the data, the simulated annealing algorithm begins with an initial control group consisting of a random subset of 500 units from the control pool. At each iteration, the algorithm attempts a 1-exchange, replacing one unit in the control group with an unselected unit from the control pool. If the exchange improves (4), it is accepted unconditionally; otherwise, it is accepted with some probability according to the input parameters. A random restart is applied when little progress has been made in (4) for some number of iterations, or after the algorithm identifies a perfectly optimized control group. The algorithm terminates after performing a preset number of iterations.
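The 1-exchange search just described can be sketched as a generic simulated annealing routine (this is not the paper's Algorithm 1; the Metropolis acceptance rule, geometric cooling schedule, and all parameter names are illustrative assumptions, and random restarts are omitted for brevity):

```python
import numpy as np

def anneal_control_group(n_pool, objective, n_select, iters=20_000,
                         temp=1.0, cooling=0.999, seed=0):
    """1-exchange simulated annealing over size-n_select subsets of a
    control pool of n_pool units. objective(mask) returns the balance
    measure, e.g. (4), for the control group given by the boolean mask;
    lower is better, and zero means perfectly optimized."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(n_pool, dtype=bool)
    mask[rng.choice(n_pool, size=n_select, replace=False)] = True
    cur_val = objective(mask)
    best_mask, best_val = mask.copy(), cur_val
    for _ in range(iters):
        if best_val == 0.0:                        # perfectly optimized: stop
            break
        out_u = rng.choice(np.flatnonzero(mask))   # unit leaving the group
        in_u = rng.choice(np.flatnonzero(~mask))   # unit entering the group
        mask[out_u], mask[in_u] = False, True
        val = objective(mask)
        # Accept improving exchanges always, worsening ones with a
        # temperature-dependent probability (Metropolis rule).
        if val <= cur_val or rng.uniform() < np.exp((cur_val - val) / temp):
            cur_val = val
            if val < best_val:
                best_mask, best_val = mask.copy(), val
        else:
            mask[out_u], mask[in_u] = True, False  # undo the exchange
        temp *= cooling
    return best_mask, best_val
```

With `objective` set to the binned measure (4), each iteration evaluates one candidate 1-exchange, exactly as in the procedure above.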
For more details, see Algorithm 1 in the paper's online supplement.

3.2. Experimental Results

Several experiments were conducted on the two data sets (data3c10k and data10c10k) using uniformly spaced bins with R(k) = 4, 8, 16, and 32 for all k = 1, 2, …, K. This sequence was chosen because it forms a bin scheme where each successive set of bins simply subdivides the previous set of bins in half, creating a telescopic increase in the number of bins. For each data set and bin scheme, 25 runs of the simulated annealing algorithm were performed, with a different random seed used for each run. Throughout a run, every 50th identified control group or perfectly optimized control group was processed and stored, along with Kolmogorov–Smirnov (KS) two-sample goodness-of-fit test statistics for the treatment and control covariate distributions. For data sets with multiple covariates, the KS test statistic values were averaged over all the covariates. Upon completion of the experiments, any duplicated control groups were removed. This was implemented by assigning a hash number to each control group based on its units. Note that because the search process moves by 1-exchange, each successive control group that is reported by the algorithm will have a high degree of overlap with the previously reported control group. To prevent overlap among the perfectly optimized solutions, random restarts were performed after each perfectly optimized solution was identified. This facilitates the generation of perfectly optimized control groups with minimal overlap between them. Table 1 summarizes the features of optimal solutions obtained in solving the data3c10k instance. In the table, the objective function in (4) is referred to as Difference Squared (DiffSqr).
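The de-duplication step described above (assigning a hash number to each control group based on its units) can be implemented with any order-independent hash; `frozenset` hashing is one simple way to realize the idea, sketched here as an assumption rather than the authors' implementation:

```python
def group_hash(members):
    # Order-independent hash of a control group's unit ids, so that the
    # same set of units always maps to the same key for de-duplication.
    return hash(frozenset(members))

# Two orderings of the same group collide, as desired; distinct groups
# (almost surely) do not.
seen = set()
for group in ([5, 2, 9], [9, 5, 2], [1, 2, 3]):
    seen.add(group_hash(group))
```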
Nikolaev et al.: BOSS: An Alternative Approach for Causal Inference. Operations Research 61(2), pp. 398–412, © 2013 INFORMS. Downloaded from informs.org by [192.17.144.156] on 30 June 2014, at 06:17. For personal use only, all rights reserved.

Column Bins specifies the number of bins used (per covariate), and the column Observations reports the number of perfectly optimized solutions that were identified. The remaining two columns list the treatment effect and the KS two-sample test statistic (averaged over the covariates), respectively. No results are presented for data10c10k because perfectly optimized solutions were not obtained for this data set when more than four bins per covariate were used.

Figure 3. Initial covariate distributions of treatment group and control pool (normalized) for data3c10k. [Histograms of covariates 1, 2, and 3; covariate values are separated into 32 uniformly sized bins, and the vertical axis reports the number of individuals with covariate values in each bin range for the treatment group and the normalized control pool.]

Table 1. Optimal solutions for data3c10k with respect to DiffSqr objective.

                           Treatment effect        Kolmogorov–Smirnov
Bins    Observations       Mean        SD          Mean        SD
4          25,214          2.2904      0.2684      0.1155      0.0090
8          17,404          1.0434      0.1605      0.0825      0.0072
16          7,689          0.2380      0.1098      0.0369      0.0038
32            833          0.0122      0.0900      0.0274      0.0027
64              0          N/A         N/A         N/A         N/A

Table 1 shows that as the number of bins for each covariate increases, the estimator mean tends toward the true ATT value of zero. The KS test statistic values also indicate an increasingly higher level of balance in the covariate distributions of the treatment and control groups.
Table 2 shows the difference in covariate means for the treatment group and control pool, as well as the difference in covariate means for the treatment group and an optimized control group obtained by solving BOSS-B with R(k) = 32 for all k = 1, 2, …, K. Observe that the bias due to covariate imbalance in the treatment group and control pool is largely removed by the optimization. Next, for a given data set and number of bins, all recorded control groups were sorted by their scores in (4). Then, control groups in a fixed range of scores were aggregated and their estimated treatment effects and other relevant statistic values were averaged. Tables 3 and 4 display these average values obtained with R(k) = 32 for all k = 1, 2, …, K. Figures 4 and 5 show the trends for the treatment effect and its standard deviation as the objective function value decreases. In general, as the score for (4) approaches zero, the estimated treatment effect tends toward 0, the true ATT value. Despite the inability to obtain perfectly optimized solutions for data10c10k, accurate ATT estimates are still obtained when the objective function is close to 0.

Table 2. Difference of covariate means for covariates before and after optimization with R(k) = 32.

Data set      Covariate   Before optimization   After optimization
data3c10k         1             0.869                 0.009
                  2             0.862                 0.001
                  3             0.160                 0.007
data10c10k        1             0.539                 0.007
                  2             0.553                 0.014
                  3             0.420                 0.001
                  4            −0.355                 0.002
                  5             0.446                 0.028
                  6             0.346                 0.007
                  7             0.407                 0.010
                  8            −0.180                 0.005
                  9             0.208                 0.002
                 10             0.152                 0.009

Table 3. Solutions for data3c10k ranked by DiffSqr objective using 32 bins.

                               Treatment effect        Kolmogorov–Smirnov
OF range        Observations   Mean        SD          Mean        SD
≤1e−07               833       0.0122      0.0900      0.0274      0.0027
1e−07–1.0          4,377       0.0679      0.0950      0.0282      0.0028
1.0–2.0            4,675       0.1478      0.1111      0.0294      0.0029
2.0–3.0            3,747       0.2291      0.1173      0.0312      0.0032
3.0–4.0            3,098       0.2948      0.1183      0.0328      0.0034
4.0–5.0            2,751       0.3596      0.1233      0.0344      0.0035
5.0–6.0            2,308       0.4085      0.1304      0.0356      0.0035
6.0–7.0            2,022       0.4666      0.1303      0.0370      0.0036
7.0–8.0            1,873       0.5173      0.1306      0.0381      0.0037
8.0–9.0            1,670       0.5584      0.1315      0.0394      0.0037
9.0–10.0           1,544       0.5881      0.1355      0.0402      0.0038
10.0–20.0         10,937       0.7889      0.1790      0.0449      0.0047
20.0–30.0          8,313       1.1213      0.1828      0.0528      0.0044
30.0–40.0          7,009       1.4045      0.1974      0.0597      0.0046
40.0–50.0          6,148       1.6617      0.1956      0.0659      0.0045
50.0–60.0          5,416       1.8779      0.2050      0.0713      0.0047
60.0–70.0          4,910       2.0778      0.2125      0.0762      0.0048
70.0–80.0          4,437       2.2490      0.2160      0.0808      0.0049
80.0–90.0          3,920       2.4258      0.2159      0.0854      0.0049
90.0–100.0         3,745       2.5803      0.2250      0.0892      0.0052

Table 4. Solutions for data10c10k ranked by DiffSqr objective using 32 bins.

                               Treatment effect        Kolmogorov–Smirnov
OF range        Observations   Mean        SD          Mean        SD
≤2.0                   0       N/A         N/A         N/A         N/A
2.0–3.0                1       0.2168      0.0000      0.0260      0.0000
3.0–4.0               25       0.2409      0.1056      0.0251      0.0014
4.0–5.0              116       0.2809      0.1113      0.0251      0.0016
5.0–6.0              229       0.3567      0.1065      0.0255      0.0014
6.0–7.0              332       0.4024      0.1198      0.0259      0.0013
7.0–8.0              327       0.4467      0.1189      0.0262      0.0016
8.0–9.0              377       0.4914      0.1200      0.0267      0.0016
9.0–10.0             350       0.5159      0.1225      0.0271      0.0015
10.0–20.0          3,305       0.7416      0.1719      0.0295      0.0021
20.0–30.0          3,105       1.0607      0.1679      0.0328      0.0021
30.0–40.0          2,737       1.3523      0.1748      0.0359      0.0021
40.0–50.0          2,677       1.6002      0.1855      0.0384      0.0022
50.0–60.0          2,608       1.8155      0.1970      0.0409      0.0022
60.0–70.0          2,649       2.0576      0.1899      0.0434      0.0023
70.0–80.0          2,499       2.2616      0.1956      0.0456      0.0024
80.0–90.0          2,527       2.4404      0.2036      0.0477      0.0024
90.0–100.0         2,221       2.6453      0.2113      0.0499      0.0024

Figure 4. data3c10k with 32 bins: Average treatment effect (TE, with TE ± SD) for varying objective function ranges.

Figure 5. data3c10k with 32 bins: Average treatment effect (TE, with TE ± SD) for varying objective function ranges.

Note that in Figures 4 and 5, there is a break where the objective function range changes from increments of 1 to increments of 10 between 9–10 and 10–20. This break is shown with bars in the plot and on the axis. Also, results from control groups with scores for (4) that were greater than 100 are available in the online supplement.

3.3. Comparison with an Alternate Balance Measure

The BOSS framework is not limited to just the BOSS-B formulation presented in §2. Indeed, the goal of the BOSS framework is to handle any proposed measure of balance M(S_T, S_C). For example, one can use a difference of means as an optimization objective. Let μ(S, k) = (1/|S|) Σ_{s∈S} X_{ks} be the mean value of covariate k across the individuals in S.
Then, a BOSS objective is to find a control group S_C ⊂ C with |S_C| = |T| that minimizes

Σ_{k=1}^{K} |μ(S_C, k) − μ(T, k)|.   (8)

Note that such analysis was done by Rubin (1973) for one covariate, where it was referred to as mean matching. With BOSS objective (8), no preprocessing of the data is necessary, because no binning is performed (compared to BOSS-B). Table 5 shows the performance of objective (8), referred to as DOM for difference of means, in determining the treatment effect across a wide range of solutions obtained during the simulated annealing algorithm execution. As the score for (8) approaches 0, the estimated treatment effect tends toward the true treatment effect of 0, which is as expected given the linear nature of the response function (7). Results for control groups with scores for (8) greater than 1.00 are available in the online supplement. Observe that using (8) as a BOSS objective compared to (4) results in more accurate ATT estimation. This observation might lead one to assume that (8) is better than (4) at capturing balance. However, the KS scores are worse with (8), indicating that although the covariate means are close, the covariate distributions are not as balanced as those for the solutions obtained with (4). An additional set of experiments was performed to illustrate the importance of balancing the distributions.
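The difference-of-means objective (8) is inexpensive to evaluate, which is one reason no preprocessing is needed. A minimal sketch (array names are hypothetical; rows are units, columns are covariates):

```python
import numpy as np

def dom(treat_X, ctrl_X):
    # Objective (8): sum over covariates of the absolute difference in
    # covariate means between the treatment group and a candidate
    # control group.
    return float(np.abs(treat_X.mean(axis=0) - ctrl_X.mean(axis=0)).sum())
```

Identical groups score 0, and shifting every covariate of the control group by a constant raises the score by K times that constant.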
These experiments used a new data set, data3c10kn, created by taking the same individuals from data3c10k and using the response function

Y_i = 10 + e^{X_{1i}} + X_{2i}^2 + 0.1X_{3i}^3 + ε_i.   (9)

Table 5. Solutions for data10c10k ranked by DOM objective.

                               Treatment effect        Kolmogorov–Smirnov
OF range        Observations   Mean        SD          Mean        SD
≤0.001                 0       N/A         N/A         N/A         N/A
0.001–0.01        12,004       0.0596      0.0857      0.4101      0.0258
0.01–0.02         66,859       0.0789      0.0913      0.4167      0.0276
0.02–0.03         94,364       0.1115      0.0916      0.4201      0.0272
0.03–0.04         94,269       0.1548      0.0920      0.4200      0.0264
0.04–0.05         83,005       0.2015      0.0938      0.4199      0.0265
0.05–0.10        286,406       0.3434      0.1323      0.4236      0.0266
0.10–0.20        374,035       0.7421      0.2066      0.4419      0.0276
0.20–0.30        290,608       1.2774      0.2244      0.4721      0.0291
0.30–0.40        255,131       1.7747      0.2439      0.5027      0.0289
0.40–0.50        238,708       2.2529      0.2560      0.5347      0.0306
0.50–0.60        244,812       2.7030      0.2688      0.5667      0.0301
0.60–0.70        241,576       3.1296      0.2770      0.5999      0.0315
0.70–0.80        226,956       3.5528      0.2829      0.6350      0.0313
0.80–0.90        229,046       3.9600      0.2831      0.6688      0.0312
0.90–1.00        235,354       4.3380      0.2934      0.7032      0.0313

Five runs of the simulated annealing algorithm were performed with data3c10kn, using both (4) with R(k) = 32 for all k = 1, 2, …, K and (8). The best solutions obtained from these runs are reported in the first two rows of Table 6. In this case, the best solutions obtained with (4) lead to better estimates of ATT than those obtained with (8). Optimizing (4) results in more accurate estimation because Theorem 1 still holds for (9) due to the separability of the covariate terms. Moreover, the KS scores are better, indicating better balance for the covariate distributions. The function (8) can be improved by incorporating higher moments of the distributions, such as the variance. Let s²(S, k) = (1/(|S| − 1)) Σ_{s∈S} (X_{ks} − μ(S, k))² be the unbiased sample variance of covariate k across the individuals in S.
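The per-covariate moment computations just defined can be sketched as follows; `dom_dov` is an illustrative combination of first- and second-moment imbalance (absolute or squared mean differences plus absolute variance differences), not necessarily the authors' exact formulation:

```python
import numpy as np

def moments(X):
    # Per-covariate sample mean mu(S, k) and unbiased sample
    # variance s^2(S, k), with rows as units and columns as covariates.
    return X.mean(axis=0), X.var(axis=0, ddof=1)

def dom_dov(treat_X, ctrl_X, square_means=False):
    # Illustrative variance-augmented balance score: mean-difference term
    # (absolute, or squared when square_means=True) plus the summed
    # absolute differences of unbiased sample variances.
    mt, vt = moments(treat_X)
    mc, vc = moments(ctrl_X)
    d = mt - mc
    mean_term = float((d ** 2).sum()) if square_means else float(np.abs(d).sum())
    return mean_term + float(np.abs(vt - vc).sum())
```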
Then two additional BOSS objectives can be defined as

min Σ_{k=1}^{K} |μ(S_C, k) − μ(T, k)| + Σ_{k=1}^{K} |s²(S_C, k) − s²(T, k)|   (10)

and

min Σ_{k=1}^{K} (μ(S_C, k) − μ(T, k))² + Σ_{k=1}^{K} |s²(S_C, k) − s²(T, k)|.   (11)

These two objectives aim at finding control groups with the first and second moments of the covariate distribution as close as possible to those of the treatment group. Objectives (10) and (11) differ in the weight they place on the difference of means, with (11) squaring this difference for each covariate. For data3c10kn, the results of optimizing these two objectives (referred to as DOM + DOV and DOM2 + DOV) are much better than those obtained for (8), as shown in Table 6. In a similar manner, higher moments can be included in the objective being optimized. Including higher moments ensures that the two distributions are closer and closer together, which is exactly what the BOSS-B formulation aims to achieve, albeit in a more direct manner.

Table 6. Best solutions for data3c10kn for various objectives.

                                             Treatment effect        Kolmogorov–Smirnov
Objective      OF range      Observations    Mean        SD          Mean        SD
DiffSqr(32)    ≤1e−07             156        −0.0170     0.0875      0.0804      0.0078
DOM            ≤0.001           7,086        −1.3889     0.3395      0.2770      0.0226
DOM + DOV      ≤0.001             357         0.0392     0.0959      0.1669      0.0179
DOM2 + DOV     ≤0.001             403         0.0986     0.1057      0.1435      0.0121

3.4. Comparison with Matching Methods

To demonstrate the performance of BOSS with respect to existing matching methods, the Matching package (Sekhon 2011) was used. The package allows for matching based on propensity score, matching directly on the values of the covariates, or some combination of the two. For the purposes of testing, a standard logistic regression model was used to estimate the propensity score. Table 7 compares the best solutions (as defined by the objective function value, with ties broken arbitrarily) obtained by the BOSS procedure for objectives (4) with R(k) = 32 for all k = 1, 2, …, K, (8), (10), and (11) with the solutions returned by both propensity score matching and matching on the covariates for the data3c10kn data set (with the nonlinear response function (9)). Column Objective lists the method used to obtain the solution, column OF Score lists the function value of the best solution for the BOSS methods (no objective score is provided by the Matching package), column Treatment Effect lists the estimate of the treatment effect computed from the best solution, and columns Kolmogorov–Smirnov Mean and Max list the average and maximum values of the KS test statistic for the covariate distributions in the treatment group and the best control group.

Table 7. Comparison of single best solutions for BOSS and matching for data3c10kn.

                                                  Kolmogorov–Smirnov
Objective       OF score      Treatment effect    Mean      Max
DiffSqr(32)     0.0               −0.1142         0.025     0.026
DOM             1.50e−5           −0.9877         0.093     0.118
DOM + DOV       3.77e−4            0.0271         0.062     0.088
DOM2 + DOV      2.69e−4            0.1154         0.045     0.060
Prop. score     N/A               −1.3434         0.125     0.158
Cov. matching   N/A                0.0943         0.025     0.034

The propensity score model fares the worst in producing accurate estimates of the treatment effect, whereas direct matching and BOSS with objective functions (4), (10), and (11) all produce good results. The reason for the poor performance of the propensity score approach is the use of a linear model for estimating the propensity score, whereas the actual response function is nonlinear. A better model for estimating the propensity score would potentially improve these results. It should also be noted that the propensity score approach produces the worst balance as measured by the KS statistic, whereas BOSS with objective function (8) also produces unsatisfactory levels of balance, with BOSS with objective function (4) and covariate matching performing the best. A difficulty of matching on the covariates is that close matches become difficult to find as the number of covariates increases. To demonstrate this, the matching procedures were also run on the data10c10k data set. Table 8 shows the best solutions obtained by the BOSS approaches and the matching approaches. Because data10c10k uses a linear response function (7), both propensity score matching and BOSS with (8) perform better than they did in the previous case. This improvement occurs because balancing covariate means for a linear response function produces accurate ATT estimates. Estimating the propensity score with a linear model will accomplish this indirectly, whereas optimizing (8) will accomplish this directly. On the other hand, the effectiveness of covariate matching is greatly reduced due to the difficulty of finding close matches on 10 different covariates. Finally, BOSS with (4) is seen to produce the best covariate balance as measured by the KS test statistic, whereas the matching approaches produce the worst covariate balance.

Table 8. Comparison of single best solutions for BOSS and matching for data10c10k.

                                                  Kolmogorov–Smirnov
Objective       OF score      Treatment effect    Mean      Max
DiffSqr(32)     2.9502             0.2168         0.026     0.036
DOM             0.0029             0.1294         0.039     0.056
DOM + DOV       0.0157             0.1857         0.037     0.048
DOM2 + DOV      0.0158             0.1947         0.045     0.052
Prop. score     N/A               −0.1148         0.066     0.114
Cov. matching   N/A                2.818          0.067     0.088

3.5. Discussion of Results

Inspection of the reported results, with the goal of evaluating the potential effectiveness of the BOSS approach, shows that the conducted experiments illustrate the theory of §2 well.
The simulated annealing algorithm was able to perform well for BOSS-B and several other objectives, which suggests that specialized algorithms could be much more effective and efficient in finding optimal balance. Additionally, the BOSS approach performed favorably when compared with some of the existing matching methods proposed in the literature. The accurate estimates of ATT produced by BOSS in these experiments suggest that BOSS may be a viable approach to successfully determine whether or not a treatment effect exists in problems that approximate real-world scenarios for which observational data exist. For the BOSS-B formulation in particular, as R(k) increases, (4) provides a better measure of covariate balance, and hence a better estimate of the treatment effect. However, as R(k) increases, it also becomes more difficult to identify control groups that are perfectly optimized with respect to (4). Certainly there are improvements that can be made in terms of the optimization process, but determining the appropriate value for R(k), and even the appropriate bin thresholds, will be a major factor as well. For the former, Cochran (1968) states that for one covariate, subclassification with five categories is sufficient to remove about 90% of the existing bias under certain conditions. Rosenbaum and Rubin (1983) present similar results when subclassifying on the propensity score. Determining the appropriate locations for bin thresholds will depend on the nature of the data. See Iacus et al. (2012) for further discussion of these issues. Another issue is determining which covariate clusters to use. In the experiments presented here, the covariate clusters were chosen based on knowing the separability of the response function. In a real-world problem, the response function will almost certainly be unknown, and therefore some guesswork will be involved in appropriately picking the covariate clusters.
For the general BOSS problem, there remains significant work to be done in determining appropriate balance measures for optimization. In the simulated example problems considered here, the difference-of-means objective (8) was sufficient for a separable linear response function, but not for a separable nonlinear one. Although incorporating the variance into the objective (10) yielded more accurate results for the nonlinear response function, this may not always be the case. Determining exactly what balance measures should be optimized remains an open problem.

4. Research Directions

BOSS introduces a new paradigm for developing an analytical toolbox based on techniques from operations research, creating a solution methodology in which human bias, associated, for example, with defining distance measures for matching or guessing the form of a regression model, is eliminated, and the accuracy of treatment effect estimation is limited solely by the complexity of an optimization problem (NP-hard) and the available computational power. To make a connection between the balanced marginal distributions and the balanced joint distributions of covariates, the concept of copulas (Nelsen 1999) may be useful if a copula family can be designed to incorporate continuous and categorical covariate values simultaneously with a sizable number of parameters. In many cases, however, preserving the same covariance structure over the covariate values in the control and treatment groups might suffice. For example, if a treatment group consists only of pairs AA and BB, it would have the same marginal distributions as a control group with pairs AB and BA, because both A and B appear twice; the joint distributions, however, would not align. Examining covariance structures would identify and help alleviate this issue.
One approach would be to minimize the covariance matrix difference directly, incorporating it into BOSS as part of the objective function or as a constraint. Note that some widely used matching approaches (e.g., propensity score matching) operate under the Stable Unit Treatment Value Assumption (SUTVA), which is violated when observations on one unit are affected by the particular assignment of treatment to other units. The BOSS approach also relies on this strong assumption, even though it may not hold in real observational studies and randomized experiments. The issue of space traversal, or how well BOSS explores the space of available control groups, is also a rich area for future exploration. For algorithms that generate a large number of optimal or near-optimal solutions, ensuring that these solutions are sufficiently diverse will allow for better estimates of the distribution of the treatment effect. One way in which this can be accomplished is by iteratively running the BOSS algorithm, finding an optimal control group, removing the members of the control group from the control pool, and then rerunning the BOSS algorithm using the smaller control pool. Alternatively, control individuals can be prevented from being used in a control group after appearing in some number of other identified control groups. In problems with a large number of covariates and/or covariate clusters to balance, it is unlikely that perfectly optimized control groups exist when using even a moderate number of bins for each covariate. Therefore, further research on binning-based measures of balance is required, and bounds are needed on the quality of a control group when it is not perfectly optimized. In the simulated experiments reported in §3, it was observed that many control groups that were near-optimal led to the correct decision with regard to the effectiveness of treatment, although the exact dynamics of this phenomenon are not completely clear.
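The pool-shrinking diversification scheme described above can be sketched generically; `find_group` stands in for any BOSS search routine and is a hypothetical callable that returns a control group from the remaining pool, or None when no group can be formed:

```python
def diverse_groups(pool_ids, find_group, n_groups):
    # Iteratively search for a control group, then remove its members from
    # the pool before the next search, so successive groups cannot overlap.
    remaining = set(pool_ids)
    groups = []
    for _ in range(n_groups):
        group = find_group(sorted(remaining))
        if group is None:
            break
        groups.append(group)
        remaining -= set(group)
    return groups
```

With a toy search routine that simply takes the first two remaining units, a pool of six units yields three pairwise-disjoint groups before the pool is exhausted.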
Alternate ways to assess the quality of a control group, in addition to the objectives presented here, should also be considered. Additionally, developing algorithms to optimize directly on covariate balance measures such as the Kolmogorov–Smirnov two-sample test statistic, instead of using approximation techniques such as binning, is a promising direction. In the current implementation, using the KS score instead of objective (4) caused the search process to stall and fail to make significant progress. This suggests that a 1-exchange neighborhood is insufficient when used in conjunction with the KS score. For BOSS to be useful in practice, computational tools need to be developed that can analyze the distribution(s) of the designed estimator(s). Besides point estimation, social scientists often resort to hypothesis testing as well as building confidence intervals, tasks for which estimating the standard error becomes important. Although our computational investigations indicate that the distribution of the BOSS estimators presented in this paper appears to be Gaussian, more research is required to establish this result theoretically for the subset-selection-based approach. The challenges presented should be addressed simultaneously by research communities across various domains of science. Statisticians might be interested in developing a copula approach for the balancing of joint distributions, whereas operations researchers and computer scientists might work on more efficient optimization algorithms. Opportunities for interdisciplinary collaboration may prove to be fruitful as this research direction continues to expand and evolve.
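For reference, the two-sample KS statistic used as the balance diagnostic throughout §3 (and proposed above as a direct optimization target) is the largest vertical gap between the two empirical CDFs; a minimal sketch, computed directly from the samples without binning:

```python
import numpy as np

def ks_stat(a, b):
    # Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)| over
    # all observed values x, where F is the empirical CDF.
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())

def avg_ks(treat_X, ctrl_X):
    # Average the statistic over covariate columns, as in Section 3's tables.
    return float(np.mean([ks_stat(treat_X[:, k], ctrl_X[:, k])
                          for k in range(treat_X.shape[1])]))
```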
Supplemental Material

Supplemental material to this paper is available at http://dx.doi.org/10.1287/opre.1120.1118.

Acknowledgments

The authors thank Alexander Shapiro, the associate editor, and two anonymous referees for their helpful comments, which greatly improved the presentation of this paper and led to more substantial computational results. This research has been supported in part by the National Science Foundation [SES-0849223 and SES-0849170]. The second author was also supported in part by the Air Force Office of Scientific Research [FA9550-10-1-0387]. The fourth author was supported by the Department of Defense (DoD) through the National Defense Science and Engineering Graduate Fellowship (NDSEG) Program (32 CFR 168a). This material is based upon work supported in part by (while serving at) the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, the United States Air Force, or the United States Government. The computational work was conducted with support from the Simulation and Optimization Laboratory at the University of Illinois.

References

Abadie A, Gardeazabal J (2003) The economic costs of conflict: A case study of the Basque country. Amer. Econom. Rev. 93(1):112–132.
Abadie A, Diamond A, Hainmueller J (2010) Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. J. Amer. Statist. Assoc. 105(490):493–505.
Cho WKT, Sauppe JJ, Nikolaev AG, Jacobson SH, Sewell EC (2011) An optimization approach to matching and causal inference. Technical report, University of Illinois at Urbana–Champaign, Urbana, IL.
Cochran WG (1968) Effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics 24(2):295–313.
da Veiga PV, Wilder RP (2008) Maternal smoking during pregnancy and birthweight: A propensity score matching approach. Maternal and Child Health J. 12(2):194–203.
Dawid AP (1979) Conditional independence in statistical theory. J. Roy. Statist. Soc. Ser. B 41(1):1–31.
Diamond A, Sekhon JS (2010) Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Technical report, Department of Political Science, University of California, Berkeley, Berkeley, CA. Accessed July 2011, http://sekhon.berkeley.edu/papers/GenMatch.pdf.
Garey MR, Johnson DS (1979) Computers and Intractability: A Guide to the Theory of NP-Completeness (Freeman and Company, San Francisco).
Hainmueller J (2012) Entropy balancing: A multivariate reweighting method to produce balanced samples in observational studies. Political Anal. 20(1):25–46.
Hellerstein J, Imbens G (1999) Imposing moment restrictions from auxiliary data by weighting. Rev. Econom. Statist. 81(1):1–14.
Herron MC, Wand J (2007) Assessing partisan bias in voting technology: The case of the 2004 New Hampshire recount. Electoral Stud. 26(2):247–261.
Holland PW (1986) Statistics and causal inference. J. Amer. Statist. Assoc. 81(396):945–960.
Iacus SM, King G, Porro G (2012) Causal inference without balance checking: Coarsened exact matching. Political Anal. 20(1):1–24.
Imai K (2005) Do get-out-the-vote calls reduce turnout? The importance of statistical methods for field experiments. Amer. Political Sci. Rev. 99(2):283–300.
Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671–680.
Morris C (1985) A finite selection model for experimental design of the health insurance study. J. Econometrics 11(1):43–61.
Nelsen RB (1999) An Introduction to Copulas (Springer, New York).
Reinisch LM, Sanders SA, Mortensen EL, Rubin DB (1995) In utero exposure to phenobarbital and intelligence deficits in adult men. J. Amer. Medical Assoc. 274(19):1518–1525.
Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70(1):41–55.
Rosenbaum PR, Rubin DB (1985) Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Amer. Statist. 39(1):33–38.
Rosenbaum PR, Ross RN, Silber JH (2007) Minimum distance matched sampling with fine balance in an observational study of treatment for ovarian cancer. J. Amer. Statist. Assoc. 102(477):75–83.
Rubin DB (1973) Matching to remove bias in observational studies. Biometrics 29(1):159–183.
Rubin DB (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psych. 66(5):688–701.
Rubin DB (1978) Bayesian inference for causal effects: The role of randomization. Ann. Statist. 6(1):34–58.
Rubin DB (1991) Practical implications of modes of statistical inference for causal effects and the critical role of the assignment mechanism. Biometrics 47(4):1213–1234.
Rubin DB (2006) Matched Sampling for Causal Effects (Cambridge University Press, New York).
Sekhon JS (2004) The varying role of voter information across democratic societies. Working paper, Department of Political Science, University of California, Berkeley, Berkeley, CA. Accessed January 2012, http://sekhon.berkeley.edu/papers/SekhonInformation.pdf.
Sekhon JS (2011) Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. J. Statist. Software 42(7):1–52. Accessed January 2012, http://www.jstatsoft.org/v42/i07.
Terrie L (2008) Using matching to assess the effect of electoral rules on the presence of the elderly in national legislatures. Poster presented at 2008 Political Methodology Meetings, Society for Political Methodology, American Political Science Association, Washington, DC.

Alexander G. Nikolaev is an assistant professor in the Department of Industrial and Systems Engineering at the University at Buffalo. His research interests include stochastic optimization, statistical inference, and social network modeling.

Sheldon H. Jacobson is a professor and director of the Simulation and Optimization Laboratory in the Department of Computer Science at the University of Illinois. He has a diverse set of basic and applied research interests, including problems related to optimal decision making under uncertainty, discrete optimization, causal inference with observational data, aviation security, public health policy (immunization, transportation and obesity, cell phone ban effectiveness), March Madness bracketology, and forecasting the outcome of the United States presidential election.

Wendy K. Tam Cho is a professor in the Department of Political Science and Department of Statistics, and Senior Research Scientist at the National Center for Supercomputing Applications, all at the University of Illinois at Urbana–Champaign.

Jason J. Sauppe is a Ph.D. candidate in the Department of Computer Science at the University of Illinois. His current research interests include mathematical programming, discrete optimization, and approximation.

Edward C. Sewell is a Distinguished Research Professor of Mathematics and Statistics at Southern Illinois University at Edwardsville. His current research interests are combinatorial optimization and health applications.
Netw Model Anal Health Inform Bioinforma (2014) 3:69 DOI 10.1007/s13721-014-0069-7

ORIGINAL ARTICLE

Towards evaluating and enhancing the reach of online health forums for smoking cessation

Michael Stearns · Siddhartha Nambiar · Alexander Nikolaev · Alexander Semenov · Scott McIntosh

Received: 10 December 2013 / Revised: 29 August 2014 / Accepted: 6 September 2014
© Springer-Verlag Wien 2014

Abstract Online pro-health social networks facilitating smoking cessation through web-assisted interventions have flourished in the past decade. In order to properly evaluate and increase the impact of this form of treatment on society, one needs to understand and be able to quantify its reach, as defined within the widely adopted RE-AIM framework. In the online communication context, user engagement is an integral component of reach. This paper quantitatively studies the effect of engagement on the users of the Alt.Support.Stop-Smoking forum that served the needs of an online smoking cessation community for more than 10 years. The paper then demonstrates how online service evaluation and planning by social network analysts can be applied towards strategic interventions targeting increased user engagement in online health forums. To this end, the challenges and opportunities are identified in the development of thread recommendation systems for effective and efficient spread of healthy behaviors, in particular smoking cessation.

Author affiliations: M. Stearns, S. Nambiar, and A. Nikolaev (corresponding author, [email protected]), Department of Industrial and Systems Engineering, University at Buffalo (SUNY), Buffalo, NY, USA ([email protected]; [email protected]). A. Semenov, Department of Mathematical Information Technology, University of Jyväskylä, Jyväskylä, Finland ([email protected]). S. McIntosh, Department of Public Health Sciences, University of Rochester, Rochester, NY, USA ([email protected]).
Keywords Social network analysis · Smoking cessation · Online forum communication · RE-AIM framework · Reach · Engagement · Intervention modeling

1 Introduction

Tobacco use is one of several individual modifiable health behaviors, including poor diet, alcohol misuse, and physical inactivity, identified by the World Health Organization as leading risk factors for global disease burden (Lim et al. 2012; Narayan et al. 2010; Scarborough et al. 2011). The development of cost-effective public health initiatives, capable of reducing the rate of tobacco use at the population level, is of great importance. Web-assisted tobacco interventions (WATIs) represent one potential solution to this challenge, with the expansion of Internet access globally leading an increasing number of individuals to turn to them in place of, or as an adjuvant to, traditional forms of treatment (Selby et al. 2010). They provide a cost-effective medium for delivering targeted social support to a wide audience (Norman et al. 2008). Models and methods capable of describing the dynamics of social interactions and influence within such communities are primed to become part of the health policy-maker's toolbox.

Intentionally created online networks for smoking cessation have existed for over two decades, with newsgroups such as Alt.Support.Stop-Smoking and websites such as Quitnet, StopSmokingCenter, and WebCoach, among others (Cobb et al. 2011). These are dynamic, supervised systems allowing for various modes of communication (e.g., chat rooms, forums, private messaging), self-representation (e.g., personal profiles, blogs, journals), and affiliations (e.g., friend lists, private groups), ensuring that users can seek social support from distant friends "like them" in real time (Norman et al. 2008).
Recent research has established that modern online health communities for smoking cessation function as a form of treatment for their participants, significantly increasing abstinence rates and exhibiting levels of effectiveness similar to intensive face-to-face counseling (Shahab and McEwen 2009). Web- and computer-based smoking cessation programs for adult smokers were found effective: "in a random-effects meta-analysis of 22 eligible trials (9 web-based, and 13 offline computer-based interventions), the intervention group had a significant effect on cessation (relative risk (RR), 1.44; 95 % confidence interval (CI), 1.27–1.64)" (Myung et al. 2009). Similar successes were reported exclusively with web-based interventions for adolescents (RR, 1.40; 95 % CI, 1.13–1.72), where "the intervention group had a significantly larger cessation rate than that of the control group" (Crutzen et al. 2008).

Social Web (Web 2.0) technology alone does not guarantee a successful online community where members participate actively and develop lasting relationships (Iriberri and Leroy 2009). Adapting the well-established RE-AIM framework for sustainable interventions to online forums, the following five criteria can be distinguished (Glasgow et al. 2006). Reach is an individual-level measure of participation, referring to the percentage and risk characteristics of forum participants. Effectiveness is the degree to which the intervention achieves the intended outcomes, e.g., progress towards smoking cessation as a function of social interaction. Adoption refers to the proportion and representativeness of the settings that adopt an intervention, e.g., incorporating online community forums as a smoking cessation strategy. Implementation describes the extent to which an intervention is delivered as intended, e.g., online social forums have measurable interactions. Maintenance is the extent to which an intervention becomes routine, e.g., ongoing utilization and evolution of online forums.
The literature focusing on individual-level measures has paid much attention to evaluating the effectiveness of online forums as a treatment for smoking, finding that both intra-treatment and extra-treatment social support are associated with increased rates of smoking cessation (Crutzen et al. 2008). However, little to no research has been reported on measuring reach, which is tantamount to user engagement in the context of WATIs.

The contribution of this paper lies in enhancing our understanding of user engagement as a key component of the reach of online treatments, in particular, social support environments and interventions (e.g., WATIs). The paper illustrates how and why the lack of prescriptive, as opposed to descriptive, models is growing into a serious challenge in social network analysis today. By distilling the factors that influence user engagement, the present discussion looks for insights that could be applied to adapt thread recommendation research to the context of smoking cessation, with the aim of enhancing the reach of online smoking cessation communities. In particular, the paper discusses how targeted thread recommendations can be employed to assist the less experienced health forum users in order to achieve higher levels of user engagement. The paper expands on the argument that social network formation models based on actors' decisions do not allow for incorporating exogenous interventions, and as a remedy, proposes a strategy to explicitly model weak, acquaintance-type ties, which with time can turn into strong, friendship ties.

In order to motivate this line of inquiry, posting records from the Alt.Support.Stop-Smoking newsgroup are studied. The members of this online community, which was particularly active in the early 2000s, discussed topics pertaining to smoking cessation in the forum's threads. This paper reports the following: Sect.
2 encompasses a study of the online health community Alt.Support.Stop-Smoking and identifies the metrics reflecting the implications of user engagement; Sect. 3 details the challenges and opportunities surrounding the use of prescriptive social network modeling methods within smoking cessation communities; and Sect. 4 concludes the paper and offers directions for future research. It should be noted that the analysis and models presented in this paper are smoking-cessation specific and may not be immediately generalizable to digital health social networks addressing other conditions.

2 Data analysis

The Internet-based Alt.Support.Stop-Smoking forum, used in this study to distill measures to enable the monitoring of engagement patterns, is a Usenet newsgroup. Its structure is similar to other World Wide Web forums in that users can both read and post messages, which are stored and available for viewing in a hierarchical tree. Usenet is a distributed system, accessible via the Network News Transfer Protocol (NNTP) or, alternatively, using WWW front-ends such as Google Groups. The data analyzed in this paper were downloaded from a Usenet archive via NNTP in September 2013 and inserted into a PostgreSQL database. Complex data analyses were then conducted using custom-developed Java code. The de-identified data analyzed in the present study were derived from retrospective publicly available data. Per IRB procedures at the University at Buffalo, submission of a human subjects research protocol for ethical board review of this type of investigation was not required.

The Alt.Support.Stop-Smoking forum activity examined in this section spans the ten-year period from 8/1/2003 to 9/15/2013. During this time, 438,136 posts were made by 8,236 unique users in 48,518 threads.
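The aggregates reported above can be computed directly from per-post records. The authors used Java and PostgreSQL; as those are not shown in the paper, the following Python sketch is an illustrative stand-in, with the `Post` record layout inferred from the description of the dataset (timestamp, username, thread) and all function names hypothetical.

```python
from collections import namedtuple
from datetime import datetime

# Assumed record layout: each entry carries the post's timestamp,
# the author's unique forum username, and the thread identifier.
Post = namedtuple("Post", ["timestamp", "user", "thread"])

def forum_totals(posts):
    """Headline aggregates: total posts, unique users, unique threads."""
    return {
        "posts": len(posts),
        "users": len({p.user for p in posts}),
        "threads": len({p.thread for p in posts}),
    }

def monthly_activity(posts):
    """Per-month post counts and active-user counts; a user is active
    in a month if they made at least one post in that month."""
    months = {}
    for p in posts:
        key = (p.timestamp.year, p.timestamp.month)
        bucket = months.setdefault(key, {"posts": 0, "active": set()})
        bucket["posts"] += 1
        bucket["active"].add(p.user)
    return {k: {"posts": v["posts"], "active_users": len(v["active"])}
            for k, v in sorted(months.items())}
```

Applied to the full dataset, `forum_totals` would be expected to reproduce the reported 438,136 posts, 8,236 users, and 48,518 threads.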
Each of the 438,136 entries in the dataset corresponds to an individual post made by a user on the forum and comprises the timestamp of the post, the author's unique forum username, and the thread to which it was submitted. Note that user data in the Alt.Support.Stop-Smoking dataset were limited to posting records, and therefore only the activity of registered users with at least one post was analyzed. Owing to the difficulty in quantifying the difference in benefits between active posters and passive users ("lurkers"), user records of the latter were not included in the analysis.

The first step in analyzing the Alt.Support.Stop-Smoking data involved extracting and analyzing the aggregate forum metrics as a function of time. Figure 1a, b showcases moving averages for post and thread counts, and for new and active user counts, over the observed life of the forum, respectively. Users were considered to be active during a period if they were observed to have made one or more posts during that period.

[Fig. 1 a Moving average of posts and threads. b Moving average of new and active users]

As observed in the trends, the rapid growth experienced during the initial time period is short-lived; January 2004 marks the relative peak of the forum's activity, with 12,100 posts, 1,419 new threads, 1,490 active threads, 266 new users, and 439 active users. All of these measures are significantly higher than the overall averages observed in the dataset, where a typical month revealed 3,591 posts, 397 new threads, 445 active threads, 67 new users, and 165 active users. Over the 9 years following the forum's popularity peak, a gradual decline is observed in each of the four main aggregate forum metrics. In the last period covered by the dataset, 9/1/2013 to 9/15/2013, only five posts were submitted to the forum, made by four active users, in three active threads. It is worthwhile to try to understand the factors that precipitated this decline. Moreover, there is a need to study whether the application of calculated external pressures could enable the forum to reach more users over a longer period of time, thus increasing its cumulative health benefit. Accordingly, Sect. 2.1 offers user-specific analysis, enabling a deeper assessment of a typical forum user's behavior. The remainder of Sect. 2 is structured into the following subsections: Sect. 2.1 reports on user-level statistics; Sect. 2.2 studies how gradually developed strong ties affect user behavior; and Sect. 2.3 classifies users by type and identifies user subgroups that could potentially benefit from engagement-enhancing interventions.

2.1 User-specific analysis

The consideration of historical forum aggregates alone does not fully capture the underlying user activity and engagement patterns. To provide a more complete picture, individual user data must be analyzed. The frequency graphs in Fig. 2a, b indicate that, based on the Alt.Support.Stop-Smoking data, the forum content is concentrated in a relatively small cadre of highly involved, "core" users rather than being distributed evenly throughout a largely homogeneous user base. In the analyzed data, an average user contributes 53.2 posts during their forum lifetime (defined as the time elapsed between the user's first post and their last post). The average user contribution is skewed by a small group of users who account for the majority of posts made to the forum. The top 1 % of users (n = 83) accounted for 194,498 of the 438,136 total posts (44.39 %), the next 9 % (n = 741) accounted for 193,498 posts (44.16 %), and the bottom 90 % (n = 7,412) accounted for 50,140 posts (11.44 %). The distribution of thread creators has a similar shape, with the top 1 % of thread creators accounting for 22,707 of the forum's 48,518 total threads (46.8 %), the next 9 % accounting for 19,014 threads (39.2 %), and the bottom 90 % accounting for 6,797 threads (14.0 %).
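The contribution shares reported above can be reproduced from a per-user post-count table. A minimal sketch follows; the function name is hypothetical, and the exact rounding of the percentile cut indices may differ from the authors' binning.

```python
def contribution_shares(post_counts, cuts=(0.01, 0.10)):
    """Share of total posts contributed by the top 1 %, next 9 %, and
    bottom 90 % of users, given a list of per-user post counts."""
    ranked = sorted(post_counts, reverse=True)
    total = sum(ranked)
    n = len(ranked)
    top = sum(ranked[: round(n * cuts[0])])          # top 1 % of users
    nxt = sum(ranked[round(n * cuts[0]): round(n * cuts[1])])  # next 9 %
    bottom = total - top - nxt                        # remaining 90 %
    return top / total, nxt / total, bottom / total
```

Run over the 8,236 per-user counts, this computation would be expected to yield the reported 44.39 %, 44.16 %, and 11.44 % shares.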
These measures indicate that the most active users are responsible for a disproportionate amount of the forum's overall content. Previous analyses of online communities have observed a similar phenomenon, referred to as the 1 % rule or the 90–9–1 principle, in which 90 % of actors observe and do not participate, 9 % participate sparingly, and 1 % create the vast majority of new content (van Mierlo 2014).

[Fig. 2 a Distribution of the number of posts per user across the lifetime. b Distribution of thread creators (>5 threads created). Fig. 3 User active lifetime]

Overall, the majority of the users have short active forum lifetimes, with 4,557 users (55.33 % of the user base) having a lifetime of 1 day and only 634 users (7.7 % of the user base) having an observed lifetime over 1 year. Amongst the 100 most active posters, the observed average lifetime is 936.45 days. These observations imply there is a largely transient user base that enters and exits before having any opportunity for engagement. It is worthwhile to note that some short-term users were likely "bots" (automatic programs posting commercial ads) that were presumably banned by the server's administration. Figure 3 indicates that, as the forum grew older and its active user base became more static, fewer new members joined and even fewer elected to remain engaged. A plausible explanation for this phenomenon lies in the increased difficulty faced by new users in trying to integrate themselves into an established community, with the majority of active members enjoying already established friendship relationships. Young (2013) indicated that when users start to think that they can no longer influence the community, they will disengage.
Failure to reverse such patterns of user disengagement and barriers to entry can lead to the death of the forum, as the established user base dwindles and fewer new users join to take their place. Thus, it is necessary to determine how these friendship networks that initially served as a barrier to new users could instead be leveraged to engage them. To do so, the concept of friendship between users (how it arises and the influence it exerts on user behavior) must be defined.

2.2 Engagement-related analysis

As the Alt.Support.Stop-Smoking forum does not explicitly report on friendship ties between its members, they must be inferred heuristically. Online friendships capture the emergence of mutual recognition between two persons, arising from their repeated interactions. In this vein, Rheingold (2000) describes online communities as "cultural aggregations that emerge when enough people bump into each other often enough in cyberspace", while Preece (2001) defines them as "any virtual social space where people come together to get and give information or support, to learn, or to find company". Interaction instances, termed "weak ties" hereafter, between users were derived by analyzing posting patterns within threads. If a certain number of interactions or weak ties are observed between a pair of users, it can be surmised that a strong tie is formed between them, i.e., that they have become friends. When User#1 submits a post to a thread within 2 days of User#2, it is (by assumption) interpreted as an instance of interaction between them. The gain in recognition arising from such interactions is divided into two sub-components: User#1 adds a weak in-tie from User#2 while User#2 simultaneously adds a weak out-tie to User#1.
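The tie-derivation rule just described can be sketched in code. The 2-day interaction window and the threshold of 10 ties in each direction are taken from the text; the data structures and function names are illustrative, and the paper may count post pairs within a thread differently than this sketch does.

```python
from collections import defaultdict
from datetime import timedelta

WINDOW = timedelta(days=2)   # interaction window assumed in the paper
THRESHOLD = 10               # in-ties/out-ties required for friendship

def derive_ties(posts):
    """Count directed weak ties: an interaction (u, v) is recorded when
    user u posts in a thread within 2 days after a post by user v.
    `posts` is an iterable of objects with .timestamp, .user, .thread."""
    by_thread = defaultdict(list)
    for p in posts:
        by_thread[p.thread].append(p)
    ties = defaultdict(int)
    for thread_posts in by_thread.values():
        thread_posts.sort(key=lambda p: p.timestamp)
        for i, later in enumerate(thread_posts):
            for earlier in thread_posts[:i]:
                if (later.user != earlier.user
                        and later.timestamp - earlier.timestamp <= WINDOW):
                    ties[(later.user, earlier.user)] += 1
    return ties

def friends(ties, threshold=THRESHOLD):
    """Pairs with at least `threshold` weak ties in each direction."""
    return {frozenset((u, v)) for (u, v), n in ties.items()
            if n >= threshold and ties.get((v, u), 0) >= threshold}
```

The requirement of reciprocal ties in `friends` mirrors the paper's restriction of friendship to pairs with equitable, balanced interaction patterns.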
The reasoning for this division is based on an interpretation of how friendship germinates, being restricted to those pairs of users that demonstrate equitable and balanced interaction patterns. When the number of recorded in-ties/out-ties between a pair of users exceeds a specified threshold (10 of each in the present analysis), those users are assumed to have become friends, in the sense that they can distinguish each other from the general user body, and such recognition prompts them to communicate more. Following this logic, the analysis of the dataset's posting patterns reveals the distribution of friendship ties between forum users (see Fig. 4). As evidenced by user-based metrics, the forum's user base is highly segmented. The user with the largest friendship network has 395 friends, with only four other users exceeding a friend count of 300 and only 29 exceeding a count of 100. Unconnected, i.e., friendless, users comprise the largest segment of the forum's user base (n = 7,206 users). These users were not reached, and therefore were not affected by the forum to the extent where their experience/thoughts/social support could be helpful to others.

Having defined strong (friendship) ties, the assessment of the influence that such ties might exert on user behavior can proceed. To this end, two research questions were explored: (1) Do friendship ties (or lack thereof) influence a user's propensity to abandon the forum? and (2) Do friendship ties influence a user's posting behavior, in that users are more likely to post in threads created by their friends as compared to those created by non-friends? In order to answer the first question, active users during each time period (month) were divided into two groups: those who elected to leave during that period and those who elected to remain active. Analysis revealed 12,064 instances of user "survival" and 8,236 instances of user "death" (forum abandonment).
The average number of active friends for each user in these two groups was then determined. When the entire duration of the dataset was examined, it was discovered that, on average, surviving users had 8.244 active friends while outgoing users had 1.165 active friends. These results indicate that the presence of an active friendship network is highly correlated with a user's decision to stay or leave, with users having comparatively larger active networks being more likely to remain.

In order to answer the second question, data consisting of active threads and active users, for whom there existed at least one active thread created by a friend, were collected for each time period (month). The number of friend- and non-friend-threads to which users responded (among the active threads), and the number of posts made in each, were then obtained for each user. Of 998,884 opportunities to post in a friend's thread, 135,640 were used, with contributions submitted to 87,900 distinct threads. Conversely, of 3,487,854 opportunities to post in a non-friend's thread, 239,150 were used, with 160,652 distinct threads receiving contributions. This corresponds to an 8.8 % probability of a user posting in a friend-created thread and a 4.6 % probability of posting in a non-friend-created thread. When the gross number of posts made within these threads is considered, the observed counts correspond to an average of 0.1358 posts made by an average user per friend-thread and 0.0685 posts per non-friend-thread. This effect is not only statistically significant (which is not surprising, given the sample sizes) but, more importantly, practically significant.

2.3 Analysis of user engagement needs

In order to summarize and simultaneously provide a more in-depth analysis of user behavior patterns and engagement needs, the forum's user base is provisionally divided into four distinct groups.

[Fig. 4 Distribution of friendship network sizes (>0 friends). Fig. 5 Representative examples of different user types]

The drivers of the division are users' forum lifetimes and their observed levels of activity. Users may be generally divided into short-term users and long-term users, with each having two distinct subgroups. The four distinguished user types, along with representative examples of engagement-specific activity patterns, are shown in Fig. 5.

2.3.1 Short-term users

Quadrants II and III in Fig. 5 comprise those users having relatively short lifespans, frequently a week or less. Quadrant II and III users are differentiated from each other by their respective activity levels. Quadrant III users are those who join the forum, make a small number of initial contributions, and then leave for good. As shown in Sect. 2.1, such users make up a significant proportion of the forum's user base. Conversely, Quadrant II users post heavily immediately upon joining the forum, only to leave soon after. Although it is impossible to definitively identify the primary motivators of short-term users, research has suggested that they are composed largely of recent quitters seeking support while struggling with their quit attempt. In an analysis of an online smoking cessation community, Selby et al. (2010) found that seeking support and advice was the most common theme identified in first posts among both recent and longer-term quitters. In their analysis of 2,562 first posts to an online smoking cessation support group, approximately 54.7 % were made by individuals who had quit smoking within the past month, 8.9 % by those who had quit more than 1 month prior, and 24.9 % by those who had not yet quit smoking (Selby et al. 2010).
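The quadrant typology, driven by forum lifetime and activity level, can be operationalized as a simple classifier. The numeric cutoffs below are hypothetical placeholders, since the paper does not specify quadrant boundaries.

```python
def classify_user(lifetime_days, post_count,
                  lifetime_cut=365, activity_cut=100):
    """Map a user to a Fig. 5 quadrant by forum lifetime and activity.
    Cutoff values are illustrative assumptions, not taken from the paper."""
    long_term = lifetime_days >= lifetime_cut
    high_activity = post_count >= activity_cut
    if long_term and high_activity:
        return "I"    # long-term, high activity: core-users
    if long_term:
        return "IV"   # long-term, low activity: topic-driven users
    if high_activity:
        return "II"   # short-term, high activity
    return "III"      # short-term, low activity
```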
The analysis of posting patterns within the Alt.Support.Stop-Smoking community indicates that the typical user is narrowly focused, limiting their posting activity to a small number of threads, oftentimes their own. Of 8,236 unique users, 51.2 % (4,219/8,236) limited their posting activity to a single thread, and 73 % (6,009/8,236) to five or fewer threads. Considering the 4,219 users whose activity was confined to a single thread, 43.6 % (1,839/4,219) posted solely to the threads that they themselves had created, indicating that their primary motivator for participation is personal benefit. Although the short-term users may have received the benefit of social support during their time on the forum, the failure to retain them as contributing members can be considered an overall community loss. By leaving the forum soon after having joined, such users will not "return the favor" by providing social support to other users in the future. This behavioral pattern may not necessarily be considered a disservice to the exiting short-term user; the literature is split on the significance of continued and active participation during quit attempts (Preece et al. 2004; An et al. 2008). It could, however, be viewed as a disservice to other current and future members who will not benefit from the user's experience and insights.

2.3.2 Long-term users

Quadrants I and IV comprise users with long lifespans, often many years. Quadrant IV users demonstrate relatively low activity levels and small friendship networks, but nonetheless chose to remain active for an extended period of time. These users are likely heavily topic-driven, primarily posting to threads that serve their immediate needs or pertain to a personal interest. Quadrant I users demonstrate sustained high activity over long lifetimes. They are sometimes referred to in the literature as "core-users" or "super-users" (Young 2013).
Core-users are responsible for much of the forum content, as illustrated in Sect. 2.1, and form the backbone of the community, exerting a disproportionate level of influence relative to their overall numbers (O'Neill et al. 2014; van Mierlo et al. 2012). Participation of core-users is motivated more by community factors than by personal interest in specific topics. It has been found that many core-users are altruistic and truly serve the community: they are the first to greet newcomers and provide social support to other users. They may have benefited from the forum in the past and are motivated to "pay it forward". In a previous analysis of an online smoking cessation forum, it was found that the majority of responses (>50 %) to new users' first posts were made by members who had quit for a month or more, with only 1 % of first replies being made by members who had not yet quit (Selby et al. 2010).

Posting patterns observed in the Alt.Support.Stop-Smoking forum reflect the role played by core-users in the community. Lurkers are often hesitant to ask a question or seek support within an inactive community where they perceive the likelihood of a response to be low, with core-user activity giving lurkers the confidence to join in the conversation (Bishop 2007). As seen in Sect. 2.1, the rate at which new members posted to the community is positively correlated with the number of posts and active threads during that time, content for which core-users were largely responsible. Core-users' role as community ambassadors, typically being among the first to respond to newcomers, is another essential function. In the Alt.Support.Stop-Smoking dataset, 3,743 users started a new thread within 2 days of having joined the forum, with 45.0 % (1,686/3,743) receiving a prompt reply (within 3 h of their initial post).
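The prompt-reply statistic above can be derived from the post records. The 2-day joining window and 3-hour prompt-reply window come from the text; the record layout and function name are assumptions, and the sketch counts qualifying threads rather than users, which may differ slightly from the paper's accounting.

```python
from datetime import timedelta

JOIN_WINDOW = timedelta(days=2)     # new thread within 2 days of joining
PROMPT_WINDOW = timedelta(hours=3)  # first reply within 3 h is "prompt"

def prompt_reply_rate(posts):
    """Fraction of threads opened within 2 days of the opener's join
    (their first post) that received a reply within 3 hours.
    `posts` is a list of objects with .timestamp, .user, .thread."""
    posts = sorted(posts, key=lambda p: p.timestamp)
    join, thread_opener, prompt = {}, {}, {}
    for p in posts:
        join.setdefault(p.user, p.timestamp)      # join time = first post
        if p.thread not in thread_opener:
            thread_opener[p.thread] = p           # first post opens a thread
        else:
            opener = thread_opener[p.thread]
            if (p.user != opener.user
                    and p.timestamp - opener.timestamp <= PROMPT_WINDOW):
                prompt[p.thread] = True           # some reply arrived promptly
    qualifying = [t for t, opener in thread_opener.items()
                  if opener.timestamp - join[opener.user] <= JOIN_WINDOW]
    if not qualifying:
        return 0.0
    return sum(prompt.get(t, False) for t in qualifying) / len(qualifying)
```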
Initial responses to these threads were typically made by core-users, with the average post count and number of friends of first responders being equal to 1,681.18 and 71.39, respectively. Both of these values were significantly higher than the overall community averages of 53.2 posts and 2.505 friends, respectively (p < 0.001 in both cases). Following an initial post, the prompt engagement of newcomers by core-users was found to have a significant correlation with their future activity patterns. New users receiving a prompt reply to their first thread had an average lifespan of 114.52 days and an average post count over their lifespan of 93.36. Conversely, new users who did not receive a prompt reply to their first thread were found to have an average lifespan of 61.36 days and an average lifespan post count of 49.00. These differences are both statistically (p < 0.001) and practically significant, having 95 % confidence intervals of 22.4–66.3 days and 35.79–70.53 posts, respectively.

3 Directions and future considerations for increasing engagement in smoking cessation communities

This section builds upon the insights offered by Sect. 2, demonstrating how online service evaluation and planning by social network analysts can be applied towards strategic interventions targeting increased user engagement in online health forums. Calculated strategic management is essential for maintaining successful online communities where members actively participate and develop lasting relationships (Iriberri and Leroy 2009). Modeling the dynamics of interactions between core-users, regular users, and newcomers in online health forums would provide a technical foundation for modern pro-health engagement research. There is a gap in the literature regarding prescriptive models capable of monitoring, controlling, and improving user engagement in online health forums. One avenue towards accomplishing these goals is through targeted recommendations of threads to users.
Thread recommendation systems apply knowledge discovery techniques to match users to threads. Given the diverse interests and needs of forum users, coupled with the large amount of information that they must sift through on a typical forum, recommender systems present an essential tool for improving end-user retention and facilitating meaningful user interactions. Thread recommender systems serve to simultaneously satisfy users' information needs, by directing them to appropriate content, and their social needs, by connecting them to other users within the community.

There are a number of domain-specific considerations, not emphasized or even present in conventional thread recommendation tasks, that are essential for the development of effective health forum recommender systems. In contrast to conventional online forums, the participation of users in online health forums is primarily motivated by a desire to give and/or receive social support (White and Dorman 2001). Friendships between forum participants play an essential role in the provision of social support within such communities. Reading and participating in forum threads leads users to encounter other members like themselves with whom friendships can be built, thus enabling personalized support. Therefore, threads serve not solely as platforms for the dissemination of static content, but also as conduits for meaningful user interactions, with thread value being generated by, and representative of, its participants. Within this framework, each thread can be viewed as a resource for introducing new user ties and strengthening existing ones. The mechanisms by which friendships form between users, and the manner in which threads can be employed to facilitate the process, are essential components of the emerging methodology for health forum thread recommender systems.
The ensuing subsections discuss adjustments to current paradigms that can lead to models capable of informing and controlling online forum user engagement. They offer a more focused discussion of the domain-specific challenges confronting thread recommendation systems in online smoking cessation forums and of the use of social network structure as a means to motivate thread recommendation, and they present a new paradigm for modeling actor ties within a social network to better capture the manner in which friendships between users develop.

3.1 Thread recommendation within an online smoking cessation forum

Research on thread recommendation systems is just beginning to emerge (Tang et al. 2013). Traditional product recommendation systems have employed a combination of collaborative filtering and content-based approaches to match consumers with products, under the assumption that product appeal and consumer preference are static but initially unknown (Sarwar et al. 2001). Collaborative filtering methods identify and exploit consumer and product similarities to make predictions about user tastes or preferences. They may be reinforced via content-based approaches, which function by comparing consumer preferences to product features and thereby provide even more suitable recommendations. In addition to the traditional challenges confronting all recommender systems (e.g., cold starts and data sparsity) (Sarwar et al. 2001), user preferences/needs within online smoking cessation communities are dynamic, i.e., continually evolving, as users progress through health state changes (e.g., quitting stages). The process of changing smoking behavior has been subdivided by smoking cessation researchers into five distinct stages: precontemplation, contemplation, preparation, action, and maintenance (Prochaska and DiClemente 1984). Additionally, users may relapse, i.e., return to an earlier stage.
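As a concrete illustration of the collaborative-filtering idea described above, the following sketch scores threads for a target user from a binary user-thread participation map, using cosine similarity between users. It is a minimal textbook baseline, not the system envisioned here: per the discussion in the text, a health-forum deployment would additionally need to account for users' cessation stages and for thread evolution over time. All names are illustrative.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sets of thread ids (binary vectors)."""
    return len(a & b) / sqrt(len(a) * len(b)) if a and b else 0.0

def recommend(target, participation, k=5):
    """Rank threads the target user has not joined by the summed
    similarity of the users who did join them.
    `participation` maps user -> set of thread ids."""
    sims = {u: cosine(participation[target], ths)
            for u, ths in participation.items() if u != target}
    scores = {}
    for u, ths in participation.items():
        if u == target:
            continue
        for t in ths - participation[target]:
            scores[t] = scores.get(t, 0.0) + sims[u]
    return sorted(scores, key=scores.get, reverse=True)[:k]
```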
An effective health forum thread recommendation system should therefore tailor its recommendations to reflect users’ states of behavioral change (in this case, smoking cessation) in order to provide them with an appropriate level of support.

Netw Model Anal Health Inform Bioinforma (2014) 3:69

Forum threads are typically short-lived and quickly changing in their content/narrative, in contrast to the long lives and static characteristics of products being recommended in conventional settings. The content and narrative of a thread may evolve as contributions are made to it by users, introducing uncertainty into the very defining characteristic of a thread as a product. An effective thread recommendation system should capture and account for such uncertainty. Note that thread evolution affects not only its future contributors: the benefit/utility that a user derives from participating in a thread is not immediately realized upon their initial posting, resulting instead from the responses made by future contributors. Due to these dynamics, the benefit/utility of thread participation is unpredictable, being a function of time and depending on future, as yet unrealized, events.

3.2 Network structure considerations and complex behaviors

Threads in online smoking cessation forums facilitate user engagement, providing a platform for interactions between users and the provision of social support. A thread’s value lies not solely in its narrative, but in the opportunity that it provides users to directly communicate with one another. A user’s acceptance of a thread recommendation can be thought of as signifying that engagement is taking place. To reflect the importance of communication between users, and the role that threads serve in enabling it, an effective thread recommendation system must consider the social network structure both of the overall community and within the thread itself.
Social network analysis takes an expanded view of a social environment, allowing for inferences about how network structure both enables and drives behavior change (Cobb et al. 2011). Smoking cessation is an example of a complex adoptable behavior, which is differentiated from simpler behaviors in the social network literature (Centola and Macy 2007). The distinction between simple and complex behaviors is an essential consideration for an effective thread recommender system due to fundamental differences in how behaviors diffuse through a network. Simple behaviors, such as the adoption of a new technology or product, are spread farther and more quickly by networks having many long ties. Conversely, complex behaviors, such as smoking cessation, typically require a user to be in contact with multiple individuals capable of supporting them in their behavior change before it is adopted. Once adoption of a complex behavior has been realized, continued reinforcement is crucial to ensure that the newly adopted behavior persists and the user does not relapse back to their prior state. Research has shown that highly clustered networks are most effective in facilitating the adoption of complex behaviors within a community (Centola 2010). The consideration of network structure in thread recommendation tasks alters the manner in which a thread’s value is determined and the purpose which it ultimately serves. The relationship between a thread’s network structure and that of a user targeted by the recommender system directly influences the ability of the thread to provide the user with social support. A thread containing contributions from friends (recognizable peers) may be assumed to provide a greater level of social support to an individual. However, this is not to say that threads containing relatively few of a user’s friends are of no value to that user.
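The clustering property referenced above can be quantified with the standard local clustering coefficient. The sketch below is a generic computation, not a method from the cited studies; the adjacency-dictionary representation is an illustrative assumption.

```python
def local_clustering(adj, v):
    """Local clustering coefficient of node v: the fraction of pairs of
    v's neighbors that are themselves connected by a tie.

    `adj` maps each node to the set of its neighbors (undirected graph).
    """
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0  # coefficient is undefined for degree < 2; report 0 by convention
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2 * links / (k * (k - 1))
```

A recommender aiming to foster complex-behavior adoption could, for instance, monitor how a candidate recommendation would change such coefficients in a user's neighborhood.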
Rather than providing high levels of immediate social support, such threads provide a user with the opportunity to change their local network structure through the introduction of new ties and/or the strengthening of existing ones. In summary, threads possess the capacity to provide both immediate and future benefits to users. In order to reflect a thread’s capacity to influence a user’s local network structure, a prescriptive modeling framework capable of capturing the influence of outside interventions (in the form of thread recommendations) on network structure is required. Stochastic actor-based models are a popular methodology for modeling network evolution and predicting ties between actors. Within such networks, nodes represent social actors, e.g., forum users, while edges (ties) represent social relations between them such as friendship, trust, or cooperation. Ties between pairs of actors may be established, or existing ties dissolved, influenced by factors such as the actors’ structural positions within the network, actor characteristics (actor covariates), and their relationships with other nearby actors (dyadic covariates). However, existing stochastic actor-based models lack the means to analyze and quantify influence imposed on social networks from the outside: network ties are actor-initiated, i.e., they can only be changed myopically by the actors themselves. Formulation of an exogenous intervention strategy requires one to choose an aggregate, actor-based objective function and decision variables to optimize this function. The modeling challenge lies in the identification and application of external interventions (in the form of thread recommendations) that serve to modify a user’s local network in such a way as to benefit that user and/or those around them. When recommending a thread to a user for the purpose of altering their local network structure, the likelihood of such changes is an essential consideration.
The concept of link prediction may be applied to this task (Liben-Nowell and Kleinberg 2007). The link-prediction problem for a social network involves the identification of new links that are likely to appear in the future, given the network’s current structure and the characteristics of pairs of users (dyadic covariates). In order to modify the existing actor-oriented modeling paradigm to accommodate exogenous interventions, the manner in which actor ties are modeled must be revisited. Ties in traditional stochastic actor-based models are assumed to be binary, with relationships between actors either existing or not. To capture the dynamics of user interactions within a smoking cessation community, actor ties should instead be weighted, reflecting varying levels of friendship between actors and the build-up from weak ties to strong ones.

3.3 Weak and strong tie dynamics

While the first co-posting in the same thread by two users may only constitute a weak tie, repeated communication between them can lead to tie strengthening over time, eventually resulting in the establishment of a strong friendship tie. In this way, the altruistic behavior of core users can be employed to ‘‘push’’ the network towards a state characterized by higher levels of user engagement: users can be introduced to one another through co-referenced threads, thereby facilitating meaningful user interactions. A major deficiency of existing actor-oriented models lies in their inability to explicitly accommodate weak ties and their dynamics. One of the most significant premises upon which the actor-oriented models are built is that tie formation is a Boolean class of variables wherein a tie is either present or absent, and must be observable (Snijders et al. 2010). This paper posits that the problem of modeling exogenous interventions can be approached by considering two processes that together describe the formation of a social network.
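As a concrete illustration of link prediction over dyads, the sketch below scores non-adjacent user pairs by the Jaccard overlap of their neighborhoods, one of the standard predictors evaluated by Liben-Nowell and Kleinberg (2007). The adjacency representation and toy network are assumptions for illustration only.

```python
def jaccard_scores(adj):
    """Rank currently non-adjacent pairs by Jaccard overlap of neighbor sets.

    `adj` maps each node to the set of its neighbors (undirected network).
    Returns (pair, score) tuples sorted from most to least likely link.
    """
    nodes = sorted(adj)
    scores = {}
    for idx, a in enumerate(nodes):
        for b in nodes[idx + 1:]:
            if b in adj[a]:
                continue  # tie already exists; nothing to predict
            union = adj[a] | adj[b]
            if union:
                scores[(a, b)] = len(adj[a] & adj[b]) / len(union)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

In the recommender setting sketched in this section, such scores could feed the estimate of how likely a recommended thread is to convert a co-posting into a new weak tie.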
Process 1 expresses how an actor builds strong ties with other nearby actors, i.e., what drives their decisions about with whom to communicate more or less. However, such decisions are clearly made with respect to the actor’s acquaintances, with most other actors treated as strangers. Strangers’ attributes are unknown to an actor, and their influence on the actor’s decision-making mechanism, captured by Process 1, is minimal. This accentuates the importance of Process 2: building acquaintances, termed weak ties. This definition of a ‘‘weak tie’’ differs from that based on the structural holes theory (Walker et al. 1997). Therefore, models incorporating varying levels of ‘‘affinity’’ between actors are required, enabling more detailed analyses of transitions between weak and strong ties; these transitions may serve as a key underlying facilitator for the growth of health behavior online social networks. It is strong ties that people would report in a questionnaire, or that can be observed from time-stamped interaction records. Meanwhile, weak tie patterns are hidden unless they trivially span a whole (small) network. Weak ties can be traced online in certain situations: they are ‘‘follow’’-type ties as opposed to ‘‘friend’’-type ties. Therefore, it is crucial to explore approaches to learning weak tie formation dynamics in large networks simultaneously with strong tie dynamics. This will allow (1) the accurate expression of actor decision-making logic, i.e., estimation of Process 1 parameters by removing the bias of the tie patterns that actors are unaware of, and (2) the quantitative evaluation of social influence effects inside networks as well as effects of interventions imposed from the outside. While strong tie formation driven by actors themselves cannot be influenced from the outside, weak tie formation can. People cannot be expected to become friends just because a model-based tool says they should.
However, they can be introduced to each other, informed of congruent interests, invited to vote on or contribute to ‘‘hot’’ forum threads, etc. Such actions help build acquaintances, as they unobtrusively increase the probability that people will more quickly expand their friendship circles, begin communicating with newly found acquaintances, and eventually build stronger ties. A model incorporating weak ties can quantify weak influence effects and suggest feasible interventions to improve actor outcomes. It is only with time that a network actor (e.g., a smoking cessation forum user) expands the local neighborhood on which they will base decisions about building long-lasting relationships, getting engaged, or staying inactive and leaving for good. Thus, strong tie formation depends on weak ties. On the other hand, it is through communication with friends (i.e., people already trusted) that an actor will learn about other trustworthy actors, begin to distinguish those actors from strangers, and explore communication pathways to them. Thus, weak tie formation is facilitated by strong ties. A potential pathway to incorporating both strong and weak ties into a mathematical model lies in studying the behavior of an actor based on the actor’s local network structure, i.e., the actor’s acquaintances, under the assumption that part of the network is hidden, which is a critical omission in all the existing actor-oriented models the investigators are aware of. The exploration of a network, i.e., the discovery of its hidden parts that may contain useful information, then becomes an important task for an actor, where they may benefit from ‘‘outside’’ assistance.

4 Concluding remarks

Calls for the design and implementation of prescriptive social network analysis techniques for the growth and maintenance of online health communities continue to emerge.
The National Institutes of Health have called for research addressing ‘‘the emergence of collective behaviors that arise from individual elements or parts of a system working together’’ through an exploration of ‘‘complex and dynamic relationships among the parts of a system and between the system and its environment’’ (Marcus 2013). Recent papers, such as ‘‘The Spread of Behavior in an Online Social Network’’ (Centola 2010), have improved our understanding of how network structure influences the diffusion of complex behaviors. The present study contributes to this research direction by paving the way for the prescriptive modeling of behavior dynamics. This section touches upon some additional aspects of prescriptive social network modeling for reach enhancement of online pro-health communities, in particular, the treatment of lurkers and the recent trend towards using gamification for therapeutic purposes. A challenge facing the present and prior analyses of online health communities is that passive users (lurkers) are difficult to account for, although they have been found to make up a significant proportion of users in online health forums (Selby et al. 2010). Other research has indicated that lurkers enjoy many of the same benefits as active posters, with more than half of lurkers reporting that ‘‘just reading/browsing is enough’’ (Preece et al. 2004). User anonymity has been observed to play a significant role in web-assisted tobacco interventions (WATIs) and other online health communities. Although known contacts are potentially more influential than anonymous ones, typically having more detailed knowledge of a particular user’s needs and emotional state (Newman et al. 2011), many users are disinclined to discuss sensitive issues pertaining to habits and behavior on non-anonymous social networks such as Facebook (Ploderer et al. 2013; Morris et al. 2010).
While the empirical work presented in this paper relies on data from an online health forum with limited capabilities beyond posting, the more recent introduction of user-controlled features (profile creation, friendship assignment, thread tracking, etc.), the use of gamification as a modern treatment delivery mechanism (Primack et al. 2012), and mobile-based treatment delivery mechanisms (Whittaker et al. 2008; Stanton et al. 1999; Lawrance 2001) may require further effort to analyze how modern health portals, such as ‘‘medhelp.com’’, deliver treatment to their participants.

Acknowledgments

This work was supported in part by the Academy of Finland Grant #268078 ‘‘Mining social media sites’’ (MineSocMed) and the National Cancer Institute (R01CA152093-01 to S.M.). Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Academy of Finland, the National Cancer Institute or the National Institutes of Health.

References

An L, Schillo BA, Saul JE, Wendling AH, Klatt CM, Berg CJ, Ahluwalia JS, Kavanaugh AM, Christenson M, Luxenberg MG (2008) Utilization of smoking cessation informational, interactive, and online community resources as predictors of abstinence: cohort study. J Med Internet Res 10(5):e55
Bishop J (2007) Increasing participation in online communities: a framework for human-computer interaction. Comput Hum Behav 23(4):1881–1893
Centola D (2010) The spread of behavior in an online social network experiment. Science 329(5996):1194–1197
Centola D, Macy M (2007) Complex contagions and the weakness of long ties. Am J Sociol 113(3):702–734
Cobb NK, Graham AL, Abrams DB (2011) Online social networks and smoking cessation: a scientific research agenda.
J Med Internet Res 13:e119
Crutzen R, De Nooijer J, Candel MJ, de Vries NK (2008) Adolescents who intend to change multiple health behaviours choose greater exposure to an internet-delivered intervention. J Health Psychol 13:906–911
Glasgow RE, Klesges LM, Dzewaltowski DA, Estabrooks PA, Vogt TM (2006) Evaluating the impact of health promotion programs: using the RE-AIM framework to form summary measures for decision making involving complex issues. Health Educ Res 21(5):688–694
Iriberri A, Leroy G (2009) A life-cycle perspective on online community success. ACM Comput Surv 41(2):1–29
Lawrance KG (2001) Adolescent smokers’ preferred smoking cessation methods. Can J Public Health 92(6):423–426
Liben-Nowell D, Kleinberg J (2007) The link-prediction problem for social networks. J Am Soc Inform Sci Technol 58(7):1019–1031
Lim SS, Vos T, Flaxman AD, Danaei G, Shibuya K, Adair-Rohani et al (2012) A comparative risk assessment of burden of disease and injury attributable to 67 risk factors and risk factor clusters in 21 regions, 1990–2010. Lancet 380(9859):2224–2260
Marcus S (2013, Nov 4) Modeling social behavior funding opportunity [Web log comment]. Retrieved from https://loop.nigms.nih.gov/2013/11/modeling-social-behavior-funding-opportunity
Morris MR, Teevan J, Panovich K (2010) What do people ask their social networks, and why? A survey study of status message Q&A behavior. In: Proceedings CHI, pp 1739–1748
Myung SK, McDonnell DD, Kazinets G, Seo HG, Moskowitz JM (2009) Effects of web- and computer-based smoking cessation programs: meta-analysis of randomized controlled trials. Arch Intern Med 169(10):929–937
Narayan KM, Ali MK, Koplan JP (2010) Global noncommunicable diseases—where worlds meet. N Engl J Med 363(13):1196–1198
Newman MW, Lauterbach D, Munson SA, Resnick P, Morris ME (2011) It’s not that I don’t have problems, I’m just not putting them on Facebook: challenges and opportunities in using online social networks for health.
In: Proceedings CSCW, pp 341–350
Norman CD, McIntosh S, Selby P, Eysenbach G (2008) Web-assisted tobacco interventions: empowering change in the global fight for the public’s (e)health. J Med Internet Res 10:e28
O’Neill B, Ziebland S, Valderas J, Lupiáñez-Villanueva F (2014) User-generated online health content: a survey of internet users in the United Kingdom. J Med Internet Res 16(4)
Ploderer B, Smith W, Howard S, Pearce J, Borland R (2013) Patterns of support in an online community for smoking cessation. In: Proceedings of the 6th international conference on communities and technologies, pp 26–35
Preece J (2001) Sociability and usability in online communities: determining and measuring success. Behav Inform Technol 20(5):347–356
Preece J, Nonnecke B, Andrews D (2004) The top 5 reasons for lurking: improving community experiences for everyone. Comput Hum Behav 20(2):201–223
Primack BA, Carroll MV, McNamara M et al (2012) Role of video games in improving health-related outcomes: a systematic review. Am J Prev Med 42:630–638
Prochaska JO, DiClemente CC (1984) The transtheoretical approach: towards a systematic eclectic framework. Dow Jones Irwin, Homewood
Rheingold H (2000) The virtual community: homesteading on the electronic frontier. MIT Press, MA
Sarwar B, Karypis G, Konstan J, Riedl J (2001) Item-based collaborative filtering recommendation algorithms. In: Proceedings of the 10th international conference on World Wide Web
Scarborough P, Bhatnagar P, Wickramasinghe KK, Allender S, Foster C, Rayner M (2011) The economic burden of ill health due to diet, physical inactivity, smoking, alcohol and obesity in the UK: an update to 2006–07 NHS costs. J Public Health 33(4):527–535
Selby P, van Mierlo T, Cunningham J (2010) Online social and professional support for smokers trying to quit: an exploration of first time posts from 2,562 members.
J Med Internet Res 12(3):e34
Shahab L, McEwen A (2009) Online support for smoking cessation: a systematic review of the literature. Addiction 104(11):1792–1804
Snijders T, van de Bunt G, Steglich C (2010) Introduction to stochastic actor-based models for network dynamics. Soc Netw 32:44–60
Stanton WR, Lowe JB, Fisher KJ, Gillespie AM, Rose JM (1999) Beliefs about smoking cessation among out-of-school youth. Drug Alcohol Depend 54(3):251–258
Tang X, Zhang M, Yang CC (2013) Leveraging user interest to improve thread recommendation in online forum. In: 2013 international conference on social intelligence and technology, pp 11–19
van Mierlo T (2014) The 1% rule in four digital health social networks: an observational study. J Med Internet Res 16(2)
van Mierlo T, Voci S, Lee S, Fournier R, Selby P (2012) Superusers in social networks for smoking cessation: analysis of demographic characteristics and posting behavior from the Canadian Cancer Society’s smokers’ helpline online and StopSmokingCenter.net. J Med Internet Res 14(3):e66
Walker G, Kogut B, Shan W (1997) Social capital, structural holes and the formation of an industry network. Organ Sci 8(2):109–125
White M, Dorman SM (2001) Receiving social support online: implications for health education. Health Educ Res 16(6):693–707
Whittaker R, Maddison R, Rodgers A (2008) A multimedia mobile phone-based youth smoking cessation intervention: findings from content development and piloting studies. J Med Internet Res 10(5):e49
Young C (2013) Community management that works: how to build and sustain a thriving online health community. J Med Internet Res 15(6):e119

Social Structure Optimization in Team Formation

Alireza Farasat, Alexander G. Nikolaev
Department of Industrial and Systems Engineering, University at Buffalo (SUNY), Buffalo, NY, U.S.A.

Abstract

This paper presents a mathematical framework for treating the Team Formation Problem explicitly incorporating Social Structure (TFP-SS), the formulation of which relies on modern social network analysis theories and metrics. While recent research qualitatively establishes the dependence of team performance on team social structure, the presented framework introduces models that quantitatively exploit such dependence. Given a pool of individuals, the TFP-SS objective is to assign them to teams so as to achieve an optimal structure of individual attributes and social relations within the teams. The paper explores TFP-SS instances with measures based on such network structures as edges, full dyads, triplets, k-stars, etc., in undirected and directed networks. For an NP-hard instance of TFP-SS, an integer program is presented, followed by a powerful LK-TFP heuristic that performs variable-depth neighborhood search. The idea of such λ-opt sequential search was first employed by Lin and Kernighan, and refined by Helsgaun, for successfully treating large Traveling Salesman Problem instances, but has seen limited use in other applications. This paper describes LK-TFP as a tree search procedure and discusses the reasons for its effectiveness. Computational results for small, medium and large TFP-SS instances are reported using LK-TFP. The insights generated by the presented framework and directions for future research are discussed.

Keywords: team formation problem, social network analysis, combinatorial optimization, discrete optimization applications, Lin-Kernighan heuristic

Email addresses: [email protected] (Alireza Farasat), [email protected] (Alexander G. Nikolaev)

Preprint submitted to Computers and Operations Research, April 23, 2015

1.
Introduction

The success of a project, as well as the productivity of a whole organization, often depends on the effectiveness and efficiency of the work of participating teams (Agustín-Blas et al. 2011). The challenge of assembling successful teams can be addressed by formulating a problem of grouping individuals or assigning them to (sub)sets so as to optimize some outcome-related objectives (Zhang and Zhang 2013). The Team Formation Problem (TFP) has received attention from the operations research community over the past years (Chen and Lin 2004, Gaston et al. 2004, Fitzpatrick and Askin 2005, Wi et al. 2009). However, despite the common understanding that the social structure among members of the same team plays an important role in the team’s output, such consideration has not been explicitly taken into account in mathematical modeling, primarily due to the lack of quantitative means to do so (Lappas et al. 2009, Zhong et al. 2012). This paper addresses the challenge of developing a mathematical framework for incorporating social structure measures into TFP. It identifies the means to quantify social structure by assessing the impact of each individual’s local network on their work-related outcome. For example, such an outcome can be the amount of goods produced, the number of errors committed (self-reported or observed), a job satisfaction indicator, the frequency of conflicts at the workplace, etc. Rooted in social science theories, the presented framework allows one to build models for TFP explicitly incorporating Social Structure (TFP-SS). The class of TFP-SS models sheds light on team building strategies and also advances the emerging quantitative research on social theories and team outcomes (Manser 2009, Ceravolo et al. 2012). The presented framework elucidates the connection between work environment, social network theories and measurable team outcomes; see Figure 1.
Social network theories motivate the use of graph-based constructs, called network structures, for representing social relations: such network structures include edges, full dyads, k-stars, and (un)directed triplets, among others. Theories of social exchange, structural holes, homophily, reciprocity, transitivity and network evolution support the design of interpretable network structure measures as functions of network structures in TFP-SS (e.g., the number of transitive triplets in a given graph). The resulting models are useful for both descriptive and prescriptive purposes. Given historical work-related outcome data for differently structured teams, the researcher can quantify the impact of each theory on team performance by estimating the weight of each respective network structure measure in a model of the outcome. Then, by adjusting team roster decision variables, the outcome can be driven in the desired direction.

[Figure 1: The proposed framework for the TFP, connecting social network theories, network structure measures (NSMs), performance measurement data, a mathematical model, and a tractable program.]

This optimization-permitting ability, together with the reliance of the TFP-SS models on social theories, distinguishes the presented framework from the existing network clustering, community detection and clique problems literature (San Segundo et al. 2011, Pirim et al. 2012, San Segundo and Tapia 2014). Importantly, the presented framework allows one to more closely control individual team members’ local networks, which play a big role in information transmission, according to the structural holes theory. This paper also presents an extremely efficient LK-TFP algorithm for solving TFP-SS, based on variable-depth neighborhood search. To summarize, this paper presents a framework for formulating and solving team formation problems that employ information provided by the local social network of individuals.
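The optimization-permitting ability described above can be illustrated with a toy sketch: a team outcome modeled as a weighted sum of two NSMs (edge and triangle counts), improved by a single pass of pairwise-swap local search. This is not the paper's LK-TFP heuristic or its estimated weights; the edge set, teams, and weight values below are assumed purely for illustration.

```python
from itertools import combinations

def tie(e, a, b):
    """Undirected tie indicator: True if an edge connects a and b in either order."""
    return (a, b) in e or (b, a) in e

def team_outcome(team, e, w_edge, w_tri):
    """Outcome of one team as a weighted sum of two NSMs:
    the edge count and the triangle count within the team."""
    edges = sum(tie(e, a, b) for a, b in combinations(team, 2))
    tris = sum(tie(e, a, b) and tie(e, b, c) and tie(e, a, c)
               for a, b, c in combinations(team, 3))
    return w_edge * edges + w_tri * tris

def greedy_swap(teams, e, w_edge, w_tri):
    """One pass of pairwise-swap local search over a two-team roster:
    accept a member swap whenever it strictly improves the total outcome."""
    t1, t2 = [list(t) for t in teams]
    best = team_outcome(t1, e, w_edge, w_tri) + team_outcome(t2, e, w_edge, w_tri)
    for i in range(len(t1)):
        for j in range(len(t2)):
            t1[i], t2[j] = t2[j], t1[i]  # tentative swap
            val = team_outcome(t1, e, w_edge, w_tri) + team_outcome(t2, e, w_edge, w_tri)
            if val > best:
                best = val  # keep the improving swap
            else:
                t1[i], t2[j] = t2[j], t1[i]  # undo
    return (t1, t2), best
```

Iterating such passes to a local optimum yields a basic 1-swap heuristic; an LK-style search instead chains swaps to variable depth before deciding which prefix of the chain to keep.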
This framework addresses problems at a meeting point of social science and operations research that have significant practical appeal. The efforts in building and treating models for TFP-SS lead to the creation of a methodological toolbox that quantifies social and behavioral aspects of working in teams, particularly in professional nursing, rescue, police operations, sport teams and academic research. The paper’s key contributions are four-fold. (1) It motivates and justifies the use of mathematical programming and optimization techniques in the area of social science, where most problems have previously been addressed qualitatively through observations, experiments and basic statistical methods. (2) It presents a prescriptive, quantitative approach to a real-world application of social network analysis, as opposed to the existing descriptive studies. It also introduces a framework for the operations research, computer science and social science communities to model more complex problems in the area of social science from an operations research perspective. (3) It identifies the relation between established social science findings and team outcomes. It presents explicit, rigorous functions of social structures to evaluate the outcomes. It also describes how social structures and individual attributes can be incorporated into mathematical models of the outcome regardless of the network type (e.g., directed, undirected, weighted or unweighted). (4) It designs and tests methods for solving the TFP where the optimal social structure is sought within the teams. The paper presents both an exact method and an efficient heuristic exploiting the Lin-Kernighan-inspired variable-depth neighborhood search. The rest of the paper is organized as follows. Section 2 offers a review of existing models based on Social Network Analysis (SNA) and motivates a call for prescriptive models in this field.
Section 3 provides an overview of social network theories and defines relevant network structure measures that quantify social structure. Section 4 discusses the relation between work-related outcomes and social structure measures. Section 5 gives a formal statement of a special-case non-trivial instance of TFP-SS and studies this instance in greater detail. Section 6 presents the LK-TFP algorithm. Section 7 reports experimental results of solving TFP-SS instances of varied sizes with undirected and directed networks. Section 8 concludes the paper and discusses future research directions.

2. Emerging Prescriptive Research in SNA

The science of SNA encompasses a set of techniques for building models of networks and models on networks (see Wasserman and Faust (1994) for SNA motivation and a position statement). These techniques range from studying centrality measures (Borgatti 2005) to building complex probabilistic models describing network structure and formation (Albert and Barabási 2002, Robins et al. 2007, Aral et al. 2009). More recently, the domain of SNA has attracted the attention of exact science professionals whose expertise allowed for advances in modeling interactions between agents (Contractor et al. 2006, Newman 2006, Snijders et al. 2010). The main deficiency of the existing SNA tools is that they mostly offer descriptive insights (Nascimento and Pitsoulis 2013), rather than prescriptive capabilities. The dearth of models that could allow a decision-maker to optimally change or influence a social network structure accentuates the difficulty of handling such tasks and, at the same time, calls for filling this gap. The existing works in the area of optimization and prediction are notable (Squillante 2010, Leite et al. 2011, Bettinelli et al. 2013); however, they have focused on small, highly constrained tasks as opposed to introducing broad classes of problems and general methodologies for addressing them.
Of such prescriptive efforts, the models for finding subsets of influential individuals in networks are the most well-studied (Kempe et al. 2003, Goyal et al. 2010, Arulselvan et al. 2009). There exist models that incorporate such graph-based measures as network diameter, density, and centrality into TFP. However, again, most such studies are descriptive and focus on the impacts of social relations, expressed by SNA measures, on team performance (Balkundi et al. 2009, Manser 2009, Abbasi and Altmann 2011, Ceravolo et al. 2012). Existing prescriptive models considering a team’s social network use little of the information captured in the social network structure. Basic SNA concepts such as closeness, diameter, and minimum spanning tree have been employed in identifying a team of experts so as to minimize intra-team communication costs (Lappas et al. 2009, Dorn and Dustdar 2010, Shi and Hao 2013), and in some cases, individual member costs (Kargar et al. 2012, Juang et al. 2013). In 2003, in a study of 816 organization founding teams, Ruef et al. showed that homophily and network constraints are the key factors defining team composition (Ruef et al. 2003). In a more recent study of 2349 open-source software (OSS) development teams, Hahn et al. reported positive correlations between the developers’ decisions to join project teams, their collaborative ties with project initiators, and the perceived status of other (non-initiator) members (Hahn et al. 2008). Zhu et al. investigated the impacts of personal and dyadic motives on team formation (Zhu et al. 2013). They used Exponential Random Graph modeling to find that individuals first get interested in a project due to personal motives such as self-interest, mutual interest, collective action and coordination cost. The typical secondary considerations include dyad-based considerations explained by the social theories of homophily, swift trust, social exchange and co-evolution.
Given the qualitative evidence of network effects on team success, there is much value in conducting rigorous quantitative research on team formation. This paper makes the first effort to introduce a comprehensive framework for TFP based on social network theories. The ability to quantify social network structure is the key to this effort.

3. Social Science Theories and Network Structure Measures

In order to formulate a TFP that explicitly incorporates team social structure (TFP-SS), such structure needs to be quantified. This section details how this can be accomplished, relying on existing social science theories. Network structure measures (NSMs) are the key tools used in the presented framework to construct rigorous, closed-form functions of social structures in TFP. Earlier studies of the behavior of connected individuals sought social theories that could explain network formation mechanisms (Contractor et al. 2006, Robins et al. 2007, Snijders et al. 2010). These efforts resulted in the use of network structures, i.e., simple geometric constructs corresponding to the underlying social theories, in mathematical modeling. Prior to establishing how these social network theories can be useful for explaining outcomes in TFP, some definitions and notation are in order.

3.1. Definitions and Notation Used to Quantify Social Structure

Let graph $G(V, E)$, with a set of nodes $V$, $|V| = N$, and a set of edges $E$, represent a social network of individuals. With the notation $v_i$ used for node $i$, $i = 1, \ldots, N$, set $e_{ij} = 1(0)$ if there exists (does not exist) an edge between nodes $i$ and $j$. Let $w_{ij}$ denote the weight of an edge, which indicates the strength of a social tie. Define $N_G(v_i)$, $i = 1, \ldots, N$, as the local network of node $v_i$, i.e., the set of all neighbors of $i$ in $G$. Define $M$ as the total number of teams and $X_j \subset G$, $j = 1, \ldots, M$, as the network of the members of team $j$, with $|X_j| = n_j$ as the number of members in team $j$. Let $N_{X_j}(v_i)$, $i = 1, \ldots, N$, $j = 1, \ldots, M$, denote the local network, also known as the ego network (Everett and Borgatti 2005), of node $i$ in team $j$. With a slight abuse of notation, an individual represented by node $i$ in $G$ is said to belong to team $j$, $v_i \in X_j$, if node $i$ is contained in $X_j$. In an attributed graph, set the binary variable $I_{ri} = 1(0)$, $r \in \mathcal{A}$, if node $i$ has (does not have) the $r$th attribute (e.g., certain expertise or ability), where set $\mathcal{A} = \{1, 2, \ldots, A\}$ contains the indices of the attributes relevant to the problem at hand.

According to earlier works in other SNA applications, the meaningful part of a social environment (climate) in a community is captured by the community's social network. Social scientists theoretically explore the connections between individuals in a social network, which, in turn, can be expressed in a graph via basic network structures (see Figure 2). Using the introduced notation, a full dyad (also known as a reciprocal tie) with nodes $i$ and $p$ is the structure where $e_{ip} = 1$ and $e_{pi} = 1$. Similarly, in a directed graph, a triplet of nodes $i$, $p$ and $q$ is a three-cycle whenever $e_{ip} = e_{pq} = e_{qi} = 1$; in an undirected graph, such a triplet is simply called a triangle.

Figure 2: Examples of basic network structures (full dyad, 2-star, 3-star, triangle, 2-in star, 2-out star, transitive triplet, three-cycle, and their attributed variants) in undirected, directed and attributed networks.

Network structure measures (NSMs) are functions of network structures capturing the tendencies highlighted by social theories; they can also be viewed as properties of social network graphs. For instance, the number of reciprocated ties measures the tendency for reciprocity in a community (Snijders et al. 2010).
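To make these structure definitions concrete, a full dyad and a three-cycle can be checked directly on an adjacency representation. The sketch below is illustrative only; the helper names and the dict-of-dicts encoding of the adjacency matrix $e$ are not from the paper.

```python
# Illustrative helpers for the structure definitions above, using a
# directed adjacency matrix e stored as a dict of dicts:
# e[i][p] == 1 iff the directed edge (i, p) exists.

def is_full_dyad(e, i, p):
    """True iff e_ip = 1 and e_pi = 1 (a reciprocated tie)."""
    return e.get(i, {}).get(p, 0) == 1 and e.get(p, {}).get(i, 0) == 1

def is_three_cycle(e, i, p, q):
    """True iff e_ip = e_pq = e_qi = 1 (a directed three-cycle)."""
    return (e.get(i, {}).get(p, 0) == 1 and
            e.get(p, {}).get(q, 0) == 1 and
            e.get(q, {}).get(i, 0) == 1)

# Toy directed network: 1 <-> 2 is a full dyad; 1 -> 2 -> 3 -> 1 cycles.
e = {1: {2: 1}, 2: {1: 1, 3: 1}, 3: {1: 1}}
print(is_full_dyad(e, 1, 2))       # True
print(is_three_cycle(e, 1, 2, 3))  # True
```

In an undirected graph the same triplet test, applied to a symmetric adjacency matrix, detects a triangle.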
Let $F_l(N_{X_j}(v_i))$, $l \in L$, denote the $l$th NSM in the local network of node $i$ in team $j$, where $L$ is the set of indices of the network structure types of interest to a researcher. For example, the number of edges in the local network of node $v_i$ is found using the respective NSM as
$$F_{edge}(N_{X_j}(v_i)) = \sum_{p \in X_j,\, p \neq i} e_{ip}.$$
Social science theories and their respective NSMs are explored next to explain how one can interpret the observed NSM quantities in a team or an individual's local network.

3.2. Social Science Theories for Interpreting Team Network Structure

There are several theories related to social networks that may explain relations between the social structure within a team and the team's work-related outcome. Network theories rooted in social science elucidate the creation, maintenance, dissolution, and reconstitution of organizational networks (Contractor et al. 2006), and also interpret social structures from the communication and individual-attribute perspectives. The theories relevant in the team formation context include the (1) social exchange, (2) homophily, (3) transitivity, (4) contagion, (5) network evolution, and (6) structural holes theories. Their connection with network structure quantifiable by NSMs can be established as follows.

The social exchange theory states that the inclination to have a communication tie from individual A to individual B is predicated on the presence of a communication tie from individual B to individual A (Contractor et al. 2006, Zhu et al. 2013). The main concern of exchange and dependence theories is the mutual relationship between pairs of network actors, called reciprocity. Since reciprocity facilitates information, knowledge and experience sharing between team members, it is an indicator of cooperation in the team. The number of full dyads in the local network of node $v_i$ in team $j$ can be expressed as
$$F_{fulldyad}(N_{X_j}(v_i)) = \sum_{p \in X_j} e_{ip} e_{pi}.$$
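The edge and full-dyad NSMs above are straightforward to evaluate over a member's team. The following sketch is illustrative (function names and the dict-of-dicts adjacency encoding are ours, not the authors'):

```python
# Illustrative NSM computations on a directed adjacency matrix adj,
# stored as a dict of dicts: adj[i][p] == 1 iff the edge (i, p) exists.

def f_edge(adj, i, team):
    """F_edge: number of edges from node i to its teammates."""
    return sum(adj.get(i, {}).get(p, 0) for p in team if p != i)

def f_full_dyad(adj, i, team):
    """F_fulldyad: number of reciprocated ties e_ip * e_pi at node i."""
    return sum(adj.get(i, {}).get(p, 0) * adj.get(p, {}).get(i, 0)
               for p in team if p != i)

# Toy directed team network: 1 <-> 2 is a full dyad; 1 -> 3 is not.
adj = {1: {2: 1, 3: 1}, 2: {1: 1}, 3: {}}
team = [1, 2, 3]
print(f_edge(adj, 1, team))       # 2 edges leave node 1
print(f_full_dyad(adj, 1, team))  # 1 reciprocated tie (with node 2)
```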
Homophily as a node-level theory suggests that individuals with similar attributes are more likely to properly communicate with one another. This theory explicitly takes into account individuals' attributes such as gender, age, education, organization type, etc.; attributes such as professional skills, knowledge, intelligence, leadership skills, job satisfaction, problem-solving skills, flexibility and motivation are vital factors in team success (Zhang et al. 2001, Zhu et al. 2013). Network structures pertaining to the homophily theory must incorporate individual attributes into models, with corresponding measures of the form
$$F_{ego\text{-}alter}(N_{X_j}(v_i)) = \sum_{p \in X_j} e_{ip} e_{pi} I_{ri} I_{rp}, \quad r \in \mathcal{A}.$$

Transitivity is an important factor impacting team outcome due to the role of information flow and communication between team members. This theory stresses an inclination toward consistency in relationships within a community, and hence expects better-functioning teams to exhibit higher levels of transitivity (Contractor et al. 2006). Different triplet-type-based NSMs inform a researcher of different variations of transitivity (Robins et al. 2007). As an example, the number of three-cycles in a weighted graph can be expressed as
$$F_{weighted\,triangle}(N_{X_j}(v_i)) = \sum_{p, q \in X_j,\, p \neq q \neq i} w_{ip} w_{qi} w_{pq}\, e_{ip} e_{pq} e_{qi}.$$

The contagion theory focuses on the tendency to "follow the crowd" in social networks. Detected by the prominence of k-star structures, this tendency indicates the popularity of certain individuals in a network. In directed networks, k-in star and k-out star structures illustrate the level of popularity. The presence of these social structures implies that individuals with higher indegree, or higher outdegree, are more attractive to individuals looking to form new ties (Snijders et al. 2010). In the team formation context, high contagion signals a strong team core and team cohesiveness, but may also indicate over-reliance of a team on a single individual in performing tasks.
Individuals with a high degree of popularity also help to maintain an effective advice network within a team and facilitate the spread of information (Borgatti and Halgin 2011). The number of k-stars can be computed as
$$F_{k\text{-}star}(N_{X_j}(v_i)) = \sum_{p_1, \ldots, p_k \in X_j,\; p_1 \neq \cdots \neq p_k \neq i} e_{ip_1} e_{ip_2} \cdots e_{ip_k}, \quad \forall\, k = 2, \ldots, K, \; K \leq n_j - 1.$$

The network evolution theory states that social networks are dynamic, which means that ties emerge and change over time (Snijders et al. 2010, Zhu et al. 2013). These relational changes may be viewed as a function of the existing social structure in a network. This idea implies that all nodes in the network act to increase their personal utility, or "happiness". The relevance of this theory for TFP is conceptual, since it motivates the consideration of local networks in team performance studies.

The theory of structural holes (Ronald 1992) argues that the shape of a local (ego-centered) network influences the amount of information that the ego node receives. As a result, the ego is supported by more non-redundant information at any given time, which provides the ego with the capability of performing better or being perceived as the source of new ideas (Borgatti and Halgin 2011). A network position where an ego benefits from the information flow within a team is called a structural hole (Ronald 1992): the abundance of structural holes in a network can be assessed by counting the k-star and triangle structures in it.
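A k-star count at a center node can also be obtained combinatorially: any choice of $k$ distinct within-team neighbors of $v_i$ yields one k-star. The sketch below uses this unordered-selection convention, which can differ from the ordered-tuple sum above by a factor of $k!$; the function name and encoding are ours.

```python
from math import comb

def f_k_star(adj, i, team, k):
    """Count k-stars centered at node i inside its team as C(deg_i, k),
    i.e., the number of ways to choose k distinct within-team neighbors
    (unordered convention)."""
    deg = sum(adj.get(i, {}).get(p, 0) for p in team if p != i)
    return comb(deg, k)

# Undirected toy star: node 0 tied to 1, 2, 3 (symmetric adjacency).
adj = {0: {1: 1, 2: 1, 3: 1}, 1: {0: 1}, 2: {0: 1}, 3: {0: 1}}
team = [0, 1, 2, 3]
print(f_k_star(adj, 0, team, 2))  # C(3, 2) = 3 two-stars
print(f_k_star(adj, 0, team, 3))  # C(3, 3) = 1 three-star
```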
Table 1: Social network theories, social structures and team outcome

Social network theory | Impact on team outcome      | Social structures
Social exchange       | cooperation                 | full dyad
Homophily             | individual's attributes     | ego-alter
Transitivity          | information sharing in team | triangle, k-star, edge
Contagion             | leadership                  | k-star
Network evolution     | individual's attributes     | k-star, dyadic covariate
Structural holes      | personal performance        | triangle

Models based on the theories summarized in Table 1 have been previously used in other SNA applications, particularly in network formation studies. A large branch of the SNA literature develops exponential random graph models (Robins et al. 2007). Stochastic actor-based models, introduced by Snijders et al. (2010), are also widely used; they, too, successfully utilize network structures based on individuals' attributes. Snijders et al. (2010) were the first to focus research attention on individuals' local networks. Following the same logic, this paper primarily considers TFP-SS instances where the aggregate, additive utility of team members is maximized.

4. Expressing Work-related Measurements Using NSMs

The objective of TFP is to optimize some aggregate measure of team performance. An integral component of a TFP-SS framework should relate team outcome with measurable network effects. Social network theories motivate and justify the use of NSMs in quantifying social structure. The next step is to establish an explicit relation between NSMs and team performance (e.g., taken from available data). This task can be accomplished similarly to the way that other social network properties, such as diameter and centrality measures, have been exploited in the earlier TFP literature (Abbasi and Altmann 2011, Lappas et al. 2009). According to the theory of structural holes, in many situations the work-related outcome of each team can be represented as a function of the NSMs computed over the team members' local networks.
Note that, in general, relying exclusively on local networks in TFP-SS performance computations may be incorrect (e.g., this approach is not valid for evaluating consulting teams). However, with nursing, rescue, and police teams, the consideration of local networks can certainly be justified. In most real-world applications, data on the prior individual performance of team members is recorded and can be accessed. Therefore, a performance function $P(X_j) = H(N_{X_j}(v_i),\, v_i \in X_j)$ can be approximated using general parametric models and parameter estimation techniques. There exists a variety of techniques that can be used to estimate $H$ from empirical data: regression, spline interpolation, neural networks, and machine learning, among others. Given the abundance of available literature on this topic (including the papers referenced above), this paper does not focus on such methods in detail. Note only that in certain situations the dependencies between team member outcomes must be taken into account, in which case simple methods such as linear regression may not work (Aral et al. 2009), and more complicated estimation techniques must be explored.

Assume that each individual's outcome is recorded, and also that the information on their local network structure is available. For illustrative purposes, assume that $H(X_j)$ can be expressed as a linear function of the NSMs computed over each team member's local network (in team $j$). Although the linearity assumption may reduce the accuracy of the model, as discussed, it offers a simple way to aggregate social structures and estimate the overall work-related outcome for team $j$,
$$P(X_j) = \sum_{i=1}^{n_j} \sum_{l \in L} \theta_l F_l(N_{X_j}(v_i)). \qquad (1)$$
Recall that the NSMs do allow one to focus on individual local networks. In (1), $\theta_l$ is the weight quantifying the contribution of the network structure, i.e., the strength of its corresponding theory, represented by the corresponding NSM; each such weight should be estimated using the available data.
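Under the linearity assumption, Eq. (1) can be sketched directly in code. Everything below is illustrative: the NSM callables and the weights $\theta_l$ are placeholders that a modeler would select and estimate from data.

```python
# Sketch of the linear outcome model of Eq. (1): P(X_j) is a weighted
# sum of NSMs over each member's local network within team j.

def team_outcome(team, nsm_funcs, theta, adj):
    """P(X_j) = sum_i sum_l theta_l * F_l(N_{X_j}(v_i))."""
    return sum(theta[l] * f(adj, i, team)
               for i in team
               for l, f in nsm_funcs.items())

def f_edge(adj, i, team):
    """Edge-count NSM: within-team edges incident to node i."""
    return sum(adj.get(i, {}).get(p, 0) for p in team if p != i)

# Undirected triangle on {1, 2, 3}: each member has 2 within-team edges.
adj = {1: {2: 1, 3: 1}, 2: {1: 1, 3: 1}, 3: {1: 1, 2: 1}}
theta = {"edge": 1.0}  # placeholder weight; estimated from data in practice
print(team_outcome([1, 2, 3], {"edge": f_edge}, theta, adj))  # 3 * 2 = 6.0
```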
Thereafter, a TFP-SS instance can be formally stated.

5. TFP-SS Formulation

Consider the problem of partitioning a group of $N$ individuals embedded in a social network, represented by graph $G(V, E)$, into $M$ teams so as to achieve the best outcomes across all the teams. The TFP-SS encompasses a variety of models; to specify a problem instance, a researcher must consider the following modeling components.

Individual or team network: the objective is to optimize a criterion function over all teams considering either individuals' local networks or the teams' networks. This distinction should be made based on the adopted outcome evaluation approach. As such, the team's network may be preferred for forming consultant teams; on the other hand, the theory of structural holes and the network evolution theory support the idea of using individuals' local networks in problems where the team outcome is the sum or the average of the individual member outcomes (Zhu et al. 2013).

Set of NSMs: as part of formulating TFP-SS, a researcher should select the set $L$ of network structure types, corresponding to the social network theories relevant in the problem setting. For example, individual attributes, as well as the corresponding NSMs, may be less important in certain applications.

Objective function: with the outcome of each team evaluated as in (1), a researcher should define a proper objective function. In measuring team outcome, one might be interested in the following objectives (among others): (1) optimizing the average outcome across all teams,
$$\max_{X_j} \left\{ \frac{\sum_{j=1}^{M} P(X_j)}{M} \right\};$$
(2) maximizing the outcome of the weakest team,
$$\max_{X_j} \left\{ \min_{j=1,\ldots,M} P(X_j) \right\};$$
(3) minimizing the absolute deviation from the average outcome,
$$\min_{X_j} \left| P(X_j) - \frac{\sum_{j=1}^{M} P(X_j)}{M} \right|;$$
(4) minimizing the squared deviation from the average outcome,
$$\min_{X_j} \left[ P(X_j) - \frac{\sum_{j=1}^{M} P(X_j)}{M} \right]^2.$$

Network type: networks with different types of edges, e.g., (un)directed and weighted, may lead to different models.
For instance, 2-out stars are not even defined on undirected networks.

The goal of this paper is to present a general methodology for treating TFP-SS problems. In order to illustrate the application of the resulting framework, a special-case instance of TFP-SS is considered in detail.

5.1. TFP-SS: A Special-Case Instance

The presented framework is designed to generate and treat TFP-SS models using NSMs. In order to investigate the tractability of the resulting models, this section first considers a non-trivial TFP-SS special-case instance (TFP-SSS). The TFP-SSS is defined on an undirected, unweighted graph representing a social network of individuals with identical skills. The choice of network structure types included for modeling TFP-SSS is limited to edge, 2-star, 3-star and triangle, $L = \{edge,\, 2\text{-}star,\, 3\text{-}star,\, triangle\}$; these social structures are the most common in network formation modeling. In the ensuing computational study, experiments with TFP-SS instances based on directed networks are also included, so as to demonstrate the comprehensiveness and flexibility of the presented framework: the instances with directed networks use the full dyad, 2-in star, 2-out star, three-cycle and transitive triplet NSMs. As stated, TFP-SSS is a realistic problem relevant for assembling professional teams (e.g., nursing, rescue, police, security, sports, etc.), where a minimum level of expertise is uniformly required of all team members. Such teams usually complete tasks under stressful conditions, and the effectiveness of working within a team structure is more important in this case than small differences in individual qualification. In TFP-SSS, the average outcome of the teams is maximized based on the individuals' local networks, which is mathematically equivalent to maximizing the sum of the outcomes over all the teams,
$$\max \sum_{j=1}^{M} P(X_j). \qquad (2)$$
It should be noted that the local network of any given node includes the node itself, its immediate neighbors and the links between them all. To visualize a particular instance of TFP-SSS, consider a social network of size 16 in Figure 3.

Figure 3: Four teams in the given social network.

Table 2: Network structure measure values for nodes in Team 1

Node | No. of edges | No. of 2-stars | No. of 3-stars | No. of triangles
1    | 3            | 3              | 0              | 1
2    | 4            | 3              | 1              | 1
3    | 1            | 0              | 0              | 0
4    | 3            | 3              | 0              | 1

Quantifying the outcomes of four teams of size four amounts to computing NSMs as illustrated in Table 2 for the nodes in Team 1. The described instance of TFP-SSS is formally stated:

Instance: A graph $G(V, E)$, $|G| = N$; $n_j \in \mathbb{Z}^+$ for $1 \leq j \leq M$; a partition into disjoint sets $X_1, X_2, \ldots, X_M$, where $X_j \subseteq G(V, E)$ for $1 \leq j \leq M$; and $\theta_l \in \mathbb{R}$ for $l \in L$.

Question: Is there a partition of $V$ into $M$ disjoint subsets $X_1 \cup X_2 \cup \ldots \cup X_M$, with $|X_j| = n_j$, such that $\sum_{j=1}^{M} P(X_j)$ is maximized, where $P(X_j) = \sum_{i=1}^{n_j} \sum_{l \in L} \theta_l F_l(N_{X_j}(v_i))$ for $1 \leq j \leq M$ and $\sum_{j=1}^{M} n_j = N$?

The following theorem states that TFP-SSS is NP-hard.

Theorem 1. TFP-SSS with $M$ teams to be formed out of $N$ individuals is NP-hard.

Proof. The presented TFP-SS instance is NP-hard by a polynomial-time reduction from Partition into Triangles (see the Appendix).

The number of all feasible solutions in TFP-SSS is
$$\Gamma(N, M) = \binom{N}{n_1} \binom{N - n_1}{n_2} \cdots \binom{N - \sum_{j=1}^{M-1} n_j}{n_M},$$
with $n_1 + n_2 + \ldots + n_M = N$. The quantity $\Gamma(N, M)$ shows how quickly the number of feasible solutions grows as $N$ and $M$ increase. Since TFP-SSS is NP-hard, one cannot expect to obtain an exact algorithm that solves it in polynomial time. However, for small problems, an exact method can be designed to search for an optimal solution. The next section presents an Integer Programming (IP) formulation of TFP-SSS.
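The product of binomial coefficients defining $\Gamma(N, M)$ telescopes to the multinomial coefficient $N!/(n_1! \, n_2! \cdots n_M!)$, which is easy to evaluate; a small sketch (the function name is ours):

```python
from math import factorial

def gamma(team_sizes):
    """Gamma(N, M): number of ways to split N labeled individuals into
    labeled teams of the given sizes, i.e., N! / (n_1! n_2! ... n_M!)."""
    n = sum(team_sizes)
    g = factorial(n)
    for s in team_sizes:
        g //= factorial(s)
    return g

# The 16-node example of Figure 3, split into four teams of four:
print(gamma([4, 4, 4, 4]))  # 63063000 feasible assignments
```

Even this small instance admits over 63 million feasible partitions, illustrating the combinatorial growth noted above.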
5.2. An IP Formulation of TFP-SSS

Define a set of decision variables for TFP-SSS:
$$y_{ij} = \begin{cases} 1 & \text{if node } i \text{ is assigned to team } j, \\ 0 & \text{otherwise}, \end{cases} \quad \text{for every } i \in I \text{ and } j \in J,$$
where $I = \{1, \ldots, N\}$ and $J = \{1, \ldots, M\}$. The objective function of TFP-SSS is nonlinear, since it includes products of the decision variables:
$$\max \sum_{j=1}^{M} \sum_{i=1}^{N} \left( \theta_1 \sum_{p=1}^{N} e_{ip} y_{ij} y_{pj} + \theta_2 \sum_{p=1}^{N} \sum_{q=1}^{N} e_{ip} e_{iq} y_{ij} y_{pj} y_{qj} + \theta_3 \sum_{o=1}^{N} \sum_{p=1}^{N} \sum_{q=1}^{N} e_{io} e_{ip} e_{iq} y_{ij} y_{oj} y_{pj} y_{qj} + \theta_4 \sum_{p=1}^{N} \sum_{q=1}^{N} e_{ip} e_{pq} e_{qi} y_{ij} y_{pj} y_{qj} \right). \qquad (3)$$
To linearize (3), variables $w_{iopqj}$, $z_{ipqj}$ and $x_{ipj}$ are introduced to replace the terms $y_{ij} y_{oj} y_{pj} y_{qj}$, $y_{ij} y_{pj} y_{qj}$, and $y_{ij} y_{pj}$, respectively. An integer programming formulation of TFP-SSS is then
$$\max \sum_{j=1}^{M} \sum_{i=1}^{N} \left( \theta_1 \sum_{p=1}^{N} e_{ip} x_{ipj} + \theta_2 \sum_{p=1}^{N} \sum_{q=1}^{N} e_{ip} e_{iq} z_{ipqj} + \theta_3 \sum_{o=1}^{N} \sum_{p=1}^{N} \sum_{q=1}^{N} e_{io} e_{ip} e_{iq} w_{iopqj} + \theta_4 \sum_{p=1}^{N} \sum_{q=1}^{N} e_{ip} e_{pq} e_{qi} z_{ipqj} \right) \qquad (4)$$
subject to:
$$\sum_{i=1}^{N} y_{ij} = n_j \quad \forall\, j, \qquad (5)$$
$$\sum_{j=1}^{M} y_{ij} = 1 \quad \forall\, i, \qquad (6)$$
$$w_{iopqj} \geq y_{ij} + y_{oj} + y_{pj} + y_{qj} - 3 \quad \forall\, i, o, p, q, j,\; o \neq p \neq q, \qquad (7)$$
$$w_{iopqj} \leq \frac{y_{ij} + y_{oj} + y_{pj} + y_{qj}}{4} \quad \forall\, i, o, p, q, j,\; o \neq p \neq q, \qquad (8)$$
$$z_{ipqj} \geq y_{ij} + y_{pj} + y_{qj} - 2 \quad \forall\, i, p, q, j,\; p \neq q, \qquad (9)$$
$$z_{ipqj} \leq \frac{y_{ij} + y_{pj} + y_{qj}}{3} \quad \forall\, i, p, q, j,\; p \neq q, \qquad (10)$$
$$x_{ipj} \geq y_{ij} + y_{pj} - 1 \quad \forall\, i, p, j, \qquad (11)$$
$$x_{ipj} \leq \frac{y_{ij} + y_{pj}}{2} \quad \forall\, i, p, j, \qquad (12)$$
$$y_{ij},\, x_{ipj},\, z_{ipqj},\, w_{iopqj} \in \{0, 1\},$$
where $\theta_l$, $l \in L$, and $e_{ip}$ are known parameters and $o, p, q \in I$. The constraints in (5) ensure that team $j$, $j = 1, \ldots, M$, has exactly $n_j$ members (alternatively, these equality constraints can be replaced by upper-bound constraints). The constraints in (6) ensure that no individual is assigned to more than one team. The constraints in (7)-(12) are needed to linearize the model. Finally, all the decision variables are required to be binary. Note that the TFP-SS instances with directed networks have IP formulations similar to that of TFP-SSS. Observe that while the present model is linear, the number of constraints in it quickly increases with problem size.
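Each pair of linearization constraints combines a lower and an upper bound that, for binary $y$, leave exactly one admissible value for the auxiliary variable, namely, the product it replaces. This can be verified exhaustively for the two-, three- and four-factor cases (a standalone check, not the authors' code):

```python
from itertools import product

def forced_value(ys):
    """Given binary y-values, return the only binary value the paired
    lower/upper linearization constraints admit for their product term:
    aux >= sum(ys) - (len(ys) - 1)  and  aux <= sum(ys) / len(ys)."""
    k, s = len(ys), sum(ys)
    candidates = [a for a in (0, 1) if a >= s - (k - 1) and a <= s / k]
    assert len(candidates) == 1, "constraints leave aux undetermined"
    return candidates[0]

# For every binary assignment, the admissible aux equals the product:
# k = 2, 3, 4 correspond to x_ipj, z_ipqj and w_iopqj, respectively.
for k in (2, 3, 4):
    for ys in product((0, 1), repeat=k):
        prod = 1
        for y in ys:
            prod *= y
        assert forced_value(list(ys)) == prod
print("linearization forces aux = product for k = 2, 3, 4")
```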
For $N$ individuals to be assigned to $M$ teams, the IP has $M(N^4 - 5N^3 + 9N^2 - 4N)$ variables and $2M(N^4 - 5N^3 + 9N^2 - 5N) + M + N$ constraints. For example, with $N = 30$ and $M = 5$ (a small instance at first glance), the IP has 3,414,900 variables and 6,829,535 constraints; one can still expect the IP to solve successfully for smaller problems, however. Note that in most real-world applications, the number of individuals to be managed exceeds 50 (e.g., think of a nursing department in a typical health care center). Hence, an efficient sub-optimal algorithm is in order for dealing with such instances, as well as with other versions of TFP-SS.

6. Variable Depth Neighborhood Search Algorithm for TFP-SS

This section presents an efficient algorithm for TFP-SS, called LK-TFP. The idea of variable depth neighborhood search is well recognized for its success in treating the Traveling Salesman Problem (TSP). Proposed by Lin and Kernighan (Lin and Kernighan 1973), the idea was revised and implemented in LKH, an exceptionally effective heuristic for symmetric TSP, by Helsgaun (2000). The LKH algorithm performs λ-opt sequential moves, where in each step λ links in the network representing the current solution are replaced by another λ links (Helsgaun 2000). The variable depth neighborhood idea based on λ-opt search has found little success in applications beyond TSP and vehicle routing problems (Kothari and Ghosh 2013, Salari and Naji-Azimi 2012). Being similar to TSP in that the team assignments can be visualized as a Hamiltonian tour, TFP-SS appears to be a suitable problem for implementing the sequential move idea.

Figure 4: λ-moves for λ = 2, 3, 4 and 5.

6.1. LK-TFP Algorithm for TFP-SS

In LK-TFP, one feasible solution is within a λ-move of another if such a move improves the objective function value by exchanging λ individuals from different teams (similar to replacing λ links in TSP); see the examples in Figure 4.
Updating the objective function of TFP-SSS is a relatively expensive computational operation. A naive implementation re-computes the objective function in $O(n^4)$ operations, driven by recomputing the most complex network structure measure (in TFP-SSS, the 3-star structure). The idea of a branching algorithm for TFP-SS is to build a tree of solutions to be traversed (see Figure 5). The solutions at the branches of the tree are obtained by executing λ-moves. LK-TFP traverses the tree to arrive at a solution that improves the objective function, signifying the completion of the current sequential move. LK-TFP identifies good branches of the tree and avoids visiting too many non-improving solutions by cutting off the search space. To describe the tree traversal in detail, some terminology is required. Level $i$, $i \leq N$, of the tree encompasses all the solutions reachable by a single $i$-move from the initial solution (i.e., the one at the root of the tree). In the tree, the root node (a single node at level 1) represents a feasible solution of TFP-SS, which can be a random initial solution or a current best known solution, necessarily feasible. The internal tree nodes (at level $i$, $i \leq N$) represent solutions resulting from $i$-moves performed on the root solution. Importantly, each internal node also stores an incomplete, i.e., infeasible, solution. Leaf nodes (at the lowest possible tree level) are those where the algorithm must stop the search because of the pre-set branching rules (e.g., any team can have only one member participating in the same move). A branch between two or more nodes represents a directed correspondence between two or more solutions.
Let $ego_{ij}$ denote the individual that has been removed from a team, perturbing the solution at node $j$, $j \in J_i$, at level $i$, $i \in I$, of the tree, where $J_i$ and $I$ are the number of nodes at level $i$ of the tree and the number of levels in the tree, respectively.

Figure 5: General search tree structure of LK-TFP. The algorithm sequentially traverses the infeasible solution space until one infeasible solution with an acceptable (improving) feasible counterpart is found.

An individual is called a friend of the ego if there exists a social (i.e., network) link between them. The set of $ego_{ij}$'s friends is denoted by $F_{ij}$, $j \in J_i$, $i \in I$. When $ego_{ij}$ joins another team, one of this team's current members who is not friends with $ego_{ij}$ must leave the team. These candidates for leaving the team form a set denoted by $L_{ij}$, $j \in J_i$, $i \in I$. The algorithm starts by branching from the root, selecting individuals one by one to be the ego; each ego attempts to join friends on other teams. If no such friends exist, two individuals within the minimum distance from the ego are selected. Then, the leaving individuals are determined. For each leaving node in $L_{ij}$, a branch is added, pointing at a new node at the next level of the tree. For instance, suppose that ego node 1 is on team A, with its friend nodes 4 and 8 on teams B and C, respectively.
As node 1 is added to team B or C, with an individual who is not friends with node 1 leaving that team, a branch is added to the tree. As the potential leaving nodes join other teams (in turn), branches are added to the lower levels of the tree. At every tree node, e.g., with individual $j$ as the ego at level $i$, the algorithm studies two solutions: one feasible and one infeasible. Infeasible solutions are generated by replacing the ego with one of the leaving nodes (thus creating a duplicate of the latter). Feasible solutions are obtained by completing an infeasible move, i.e., having the ego replace the leaving node. Whenever an improving feasible solution is found, the algorithm is restarted, with the new best solution at the root of a new tree. If the feasible solution is non-improving, then the gain of the corresponding infeasible solution is computed as $g_{ij} = f(v_p) - f(v_q)$, where $g_{ij}$ is the gain of the current $i$-move at tree node $j$, and $f(v_p)$ and $f(v_q)$ are the incremental changes in the objective function incurred by switching individuals $p$ and $q$. The algorithm continues to branch if $g_{ij}$ is positive; otherwise, the tree node and all its subsequent branches are fathomed. Whenever a solution tree is completely traversed with no improving feasible solution found, the algorithm stops. In the computational experiments reported in this paper, LK-TFP was implemented with the depth-first tree search strategy, thus benefiting from more efficient memory usage. The algorithm's key steps are outlined in the pseudocode of Algorithms 1 and 2.

Algorithm 1 LK-TFP Algorithm for TFP-SS
public void main()
1. Initialize $X_j$, $j = 1, \ldots, M$, such that $\bigcup_{j=1}^{M} X_j = V$ and $\bigcap_{j=1}^{M} X_j = \emptyset$;
2. Calculate the objective function $\sum_{j=1}^{M} P(X_j)$;
3. for (t = 1; t ≤ Tmax; t++)
4.   depthNeighborhoodSearch(); /* executing a tree-based search */
5.   if (bestSolutionValue ≤ currentSolutionValue)
6.     recordBestSolution(); /* recording the best solution */
7.   end if
8. end for
end main

In the presented version, LK-TFP was found to be very efficient in solving TFP-SS instances. A discussion of the possible reasons for such performance follows.

Algorithm 2 Depth Neighborhood Search
public type depthNeighborhoodSearch()
1. while (notVisitedNodes.hasNext())
2.   for (int i = 1; i ≤ I; i++) /* level i */
3.     for (int j = 1; j ≤ Ji; j++) /* node j in level i */
4.       do
5.         find $ego_{ij} = v_q$, $F_{ij}$ and $v_p \in L_{ij}$;
6.         exchange $v_q$ and $v_p$; /* exchanging the ego */
7.         replace $v_q$ with $v_p$; /* replacing the ego */
8.         $g_{ij} = f(v_p) - f(v_q)$;
9.         $G_i = \sum_i g_{ij}$; /* calculate gain */
10.        depthFirstSearch(); /* depth-first search */
11.      while ($G_i > 0$)
12.    end for
13.  end for
14.  updateVisitedNodes(); /* update the list of the visited nodes */
15. end while

6.2. Performance Analysis of LK-TFP

LK-TFP is remarkably efficient in solving TFP-SS, similarly to LKH for TSP, and this warrants a discussion of the reasons for such performance. Algorithms using λ-opt search usually face a serious drawback: in order to provably find an optimum, n-opt should be applied with large n, which is computationally infeasible in non-trivial problem instances. LK-TFP avoids this difficulty by employing an intelligent search strategy. At each step, the algorithm checks the necessity of increasing the value of λ in λ-opt moves, and it considers a growing set of potential exchanges. If these exchanges improve the objective function and the exploration yields a better feasible solution, they are always accepted and λ increases by one. In terms of tree search, increasing λ is equivalent to going deeper into the search tree.
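For intuition, a greatly simplified stand-in for LK-TFP is sketched below: a first-improvement 2-move (pairwise-swap) local search on a plain within-team edge-count objective. LK-TFP itself chains such exchanges into deeper sequential moves guided by cumulative gain; everything here (names, objective, data) is illustrative only, not the authors' implementation.

```python
# Simplified 2-move (pairwise swap) local search on the edge-count
# objective of TFP-SSS. Adjacency: adj[i][p] == 1 iff edge (i, p) exists.

def objective(teams, adj):
    """Sum over teams of the number of within-team edges."""
    return sum(adj.get(i, {}).get(p, 0)
               for t in teams for i in t for p in t if i < p)

def try_improve(teams, adj):
    """Perform the first objective-improving pairwise swap, if any."""
    base = objective(teams, adj)
    for a in range(len(teams)):
        for b in range(a + 1, len(teams)):
            for ia in range(len(teams[a])):
                for ib in range(len(teams[b])):
                    teams[a][ia], teams[b][ib] = teams[b][ib], teams[a][ia]
                    if objective(teams, adj) > base:
                        return True
                    # undo the non-improving swap
                    teams[a][ia], teams[b][ib] = teams[b][ib], teams[a][ia]
    return False

def two_move_search(teams, adj):
    while try_improve(teams, adj):
        pass
    return teams

# Two 4-cliques {1,2,3,4} and {5,6,7,8}, deliberately split across teams.
adj = {i: {} for i in range(1, 9)}
for grp in ([1, 2, 3, 4], [5, 6, 7, 8]):
    for i in grp:
        for p in grp:
            if i != p:
                adj[i][p] = 1
teams = two_move_search([[1, 2, 5, 6], [3, 4, 7, 8]], adj)
print(objective(teams, adj))  # 12: both cliques reunited (6 edges each)
```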
Starting with a feasible solution, the algorithm repeatedly executes the exchanges guided by the incremental gain $g_i$ until either the whole tree is traversed and the algorithm stops, or an improving feasible solution is reached, in which case the algorithm is restarted. Additionally, the sequential exchange criterion (Helsgaun 2000) in LK-TFP enables internal nodes, which would otherwise be missed due to a premature fathoming, to be visited. To fathom a node, the positive gain criterion plays an important role. At each level of the tree, $g_i = f(v_j) - f(v_k)$ is the gain resulting from the exchange, and $G_i = \sum_{k=1}^{i} g_k$ is the cumulative gain obtained. Using $G_i$ as a branch-cutting criterion has a significant impact on the algorithm's efficiency. The gain consideration prevents the algorithm from traversing very deeply in the tree and enables more rapid fathoming when there is no improvement in exchanging. At first glance, this stopping criterion may appear to be too restrictive. However, thanks to multiple permutations, i.e., sequences in which the same exchanges can be performed, whenever an improving exchange is aborted, it is still guaranteed to be discovered on a branch under another internal node in the tree. Lin and Kernighan proved this by showing that for a sequence of numbers with a positive sum, there exists a cyclic permutation of these numbers such that every partial sum is positive (Helsgaun 2000).

Figure 6: An example of an improving sequence of moves.

Consider Figure 6, which elicits the relation of the gain $g_{ij}$ to the partial sum $G_i$ for a five-opt sequential move. Suppose a currently explored subsequence has a negative gain (e.g., $G_2 = -2 < 0$) resulting from move 2; if it is part of an improving larger sequence, then this subsequence will be found later by traversing the tree from another node (starting from node 5). It means that an improving sequential move will be found.
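The cyclic-permutation fact used above is easy to demonstrate: if a gain sequence has a positive total, some rotation of it keeps every partial sum positive, so an improving sequential move starting from another node always exists. A sketch (the sample gains are illustrative, not read off Figure 6):

```python
# Demonstration of the Lin-Kernighan cyclic-permutation fact: a gain
# sequence with a positive sum has a rotation whose partial sums are
# all positive.

def positive_rotation(gains):
    """Return a rotation of `gains` with all partial sums > 0, or None."""
    n = len(gains)
    for r in range(n):
        rotated = gains[r:] + gains[:r]
        partial, ok = 0, True
        for g in rotated:
            partial += g
            if partial <= 0:
                ok = False
                break
        if ok:
            return rotated
    return None

# An illustrative five-move gain sequence summing to +1:
gains = [1, -2, 2, -3, 3]
print(positive_rotation(gains))  # [3, 1, -2, 2, -3]
```

The first three rotations fail (their partial sums dip to or below zero), but starting the same exchanges from the last move succeeds, mirroring how an aborted subsequence resurfaces under another tree node.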
Furthermore, there are some other rules which are useful to increase the search efficiency. Recently performed exchanges are stored in memory for a small number of iterations. This memory helps to escape from local optima. Moreover, LK-TFP uses the two best friend rule similar to LK which restricts the search to the five nearest neighbours. The idea of this rule is to place friends in the same teams. This heuristic rule is based on expectations of which individuals are 21 more probably teammates in the optimal solution. In other words, this rule treats the problem like a puzzle and attempts to complete it by placing right individuals in teams in each step. After each run, the best solution is placed in the root of tree and the search process continues. Results of LK-TFP for different size problems are reported in the next section. 7. Computational Experiences with TFP-SS This section explores the performance of LK-TFP on small-, medium- and large-sized TFP-SSS instances, including those incorporating directed networks. These problems are solved by both IP (using CPLEX) and LK-TFP (implemented in JAVA). The experiments were performed using a desktop with Intel(R) Core(TM)i5 3GHz processor, 8GB RAM and 64 bit operating system. Overall, LK-TFP was executed at least 30 times for each instance to explore the variability of the results over different runs, and thus, gauge the robustness of the algorithm. CPLEX is only able to solve small size problems (less than 20 nodes) but for larger problems more memory is required. 7.1. TFP-SSS with Undirected Social Networks A majority of existing social network datasets are undirected. TFP-SSS was first solved for several real-world undirected networks including Zachary Karate Club (ZKC), Florentine Families (FF), Kapferer Mine (KM), Taro Exchange (TE), Western Electric Employees (WEE), Thurman Office (TO) and Bernard & Killworth (BK) (retrieved from http://vlado.fmf.uni-lj. 
si/pub/networks/data/), as well as for synthetic random networks. Without loss of generality, the weights θ_l, l ∈ L in (1) are set to be equal.

In the first experimental setup, the ZKC data, capturing a friendship network of 34 karate club members observed over a two-year period at a US university in the 1970s, is exploited (Girvan and Newman 2002). Figure 7 depicts the network divided into two teams of 17: node colors indicate team assignments.

[Figure 7: Dividing the ZKC network into two teams of size 17.]

Table 3 shows the results of both CPLEX and LK-TFP runs with the ZKC network and varied team sizes.

Table 3: Simulation results of the heuristic algorithm and CPLEX runs across the ZKC network (NA: not available).

Network  Teams  Optimum  LK-TFP Solution (Min/Ave/Max)  CPLEX Solution  LK-TFP Time (s) (Min/Ave/Max)  CPLEX Time (s)
ZKC      2      NA       872 / 874.8 / 877              NA              12 / 156.2 / 360               NA
ZKC      4      NA       103 / 123.75 / 141             NA              1 / 17.3 / 36                  NA
ZKC      5      NA       218 / 230.7 / 239              NA              67 / 379.15 / 731              NA

The FF network represents social relations, including business ties and marriage alliances, among 16 Renaissance Florentine families. Figures 8(A) and 8(B) depict the division of the FF network into three and four teams, respectively: nodes of the same color are on the same team.

[Figure 8: Dividing the Florentine Families network into (A) three and (B) four teams.]

Table 4 reports the performance of CPLEX and LK-TFP with the FF network, with a varied number of teams.

Table 4: Simulation results of the heuristic algorithm and CPLEX runs across the FF network.

Network  Teams  Optimum  LK-TFP Solution (Min/Ave/Max)  CPLEX Solution  LK-TFP Time (s) (Min/Ave/Max)  CPLEX Time (s)
FF       2      137      137 / 137 / 137                137             0.08 / 0.9 / 4                 327
FF       3      96       96 / 96 / 96                   96              1.3 / 2.95 / 8.5               511
FF       4      59       51 / 56.5 / 59                 59              3.5 / 10.5 / 21                1630

The Kapferer Mine network connects 15 miners working on the surface in a mining operation in Zambia (then Northern Rhodesia). The KM network with two and three teams is illustrated in Figure 9.

[Figure 9: Dividing the Kapferer Mine network into (A) two and (B) three teams.]

The Taro Exchange network represents the relation of gift-giving among 22 households in a Papuan village. Figure 10 depicts the TE network divided into two and three teams, respectively.

[Figure 10: Dividing the Taro Exchange network into (A) two and (B) three teams.]

The Western Electric Employees network captures relations between 14 Western Electric (Hawthorne Plant) employees from the bank wiring room participating in horseplay. Using LK-TFP, the network is divided into two and three teams, as shown in Figures 11(A) and 11(B), respectively.

[Figure 11: Dividing the Western Electric Employees network into (A) two and (B) three teams.]

The Thurman Office network outlines the interactions among 15 employees in the overseas office of a large international corporation, based on informal relationships. Figure 12 shows the results of optimal team assignment with two and three teams.

[Figure 12: Dividing the Thurman Office network into (A) two and (B) three teams.]

Table 5 summarizes the results of running LK-TFP and CPLEX across all networks mentioned above. Similarly, the results of both CPLEX and LK-TFP runs for the generated problems are presented in Table 6. The obtained results confirm that LK-TFP is effective and efficient in solving even large-scale TFP-SSS instances. As the results illustrate, IP techniques can handle only small problems (up to 16 individuals and five teams). For larger problems, more computational resources are required. For the instances with up to 30 individuals and four teams, CPLEX was not able to report any incumbent before running out of memory. However, LK-TFP quickly identified optimal solutions for small problems. For medium- and large-sized problems, the runtime of LK-TFP is still remarkable. With large N and small M, LK-TFP does take longer to identify good solutions; naturally, the individuals' local networks get larger in such instances.

Table 5: Simulation results of LK-TFP and CPLEX runs across the KM, TE, WEE, TO and BK networks.

Network  Teams  Optimum  LK-TFP Solution (Min/Ave/Max)  CPLEX Solution  LK-TFP Time (s) (Min/Ave/Max)  CPLEX Time (s)
KM       2      133      133 / 133 / 133                133             3.51 / 8.23 / 12.9             325
KM       3      72       65 / 71.6 / 72                 72              2.15 / 6.4 / 18.5              616
TE       2      309      297 / 306.72 / 309             309             2.3 / 17.6 / 45                6048
TE       3      259      247 / 253.4 / 259              259             38.45 / 93.9 / 256.1           19,308
WEE      2      548      548 / 548 / 548                548             1.1 / 3.5 / 8.1                109
WEE      3      247      247 / 247 / 247                247             0.9 / 11.12 / 23.5             352
TO       2      262      255 / 259.5 / 262              262             12.91 / 30.73 / 45.65          1244
TO       3      69       62 / 65.36 / 69                69              21.5 / 101.6 / 151.39          17,039
BK       2      NA       4035 / 4112.8 / 4150           4060            68 / 269.5 / 329.2             133,380
BK       3      NA       932 / 968.21 / 978             NA              214.5 / 516.3 / 1263.83        NA

Table 6: Simulation results of the heuristic algorithm and CPLEX runs across the generated problems.

Candidates  Teams  Optimum  LK-TFP Solution (Min/Ave/Max)  CPLEX Solution  LK-TFP Time (s) (Min/Ave/Max)  CPLEX Time (s)
6           2      24       24 / 24 / 24                   24              X / X / X                      1
10          2      108      108 / 108 / 108                108             0.001 / 0.07 / 0.11            43
16          2      317      317 / 317 / 317                317             0.001 / 0.26 / 0.89            628
16          4      111      111 / 111 / 111                111             0.02 / 4.1 / 13                28,348
16          5      78       78 / 78 / 78                   78              0.05 / 1.76 / 4                31,591
20          2      900      900 / 900 / 900                900             1 / 16.3 / 50                  37,405
20          4      NA       240 / 247 / 254                240             1.2 / 14.03 / 28               192,674
20          5      NA       158 / 163.2 / 169              149             0.9 / 8 / 18                   163,423
30          2      NA       2805 / 2841.9 / 2916           1523            61 / 578.65 / 1305             218,912
30          4      NA       451 / 474.7 / 555              NA              12 / 44 / 84                   NA
30          5      NA       441 / 460.7 / 495              NA              16 / 39.59 / 85                NA
30          10     NA       93 / 98.2 / 101                NA              9 / 12.8 / 26                  NA
40          2      NA       4723 / 4804.1 / 4943           NA              135 / 1244.74 / 2307           NA
40          10     NA       1253 / 1281.6 / 1323           NA              41 / 127.26 / 237              NA
50          5      NA       1722 / 1849.7 / 1931           NA              71 / 218.5 / 419               NA
50          10     NA       426 / 439.2 / 461              NA              15 / 39.48 / 64                NA

7.2. TFP-SS with Directed Social Networks

There exist cases where the relationships between individuals in a social network are not necessarily bidirectional.
Incorporating the direction of relations into TFP-SS requires the use of other types of NSMs, such as the full dyad, 2-out-star, 3-cycle and transitive triplet. The directed TFP-SSS instances were created with the Dickson Bank Wiring (DBW), Thurman Organizational Chart (TOC), Soccer Teams (ST) (available at http://vlado.fmf.uni-lj.si/pub/networks/data/) and Advogato networks.

The Dickson Bank Wiring network consists of 14 Western Electric employees helping each other with work. Figure 13 shows how this directed network can be split into two and three teams, respectively.

[Figure 13: Dividing the Dickson Bank Wiring network into (A) two and (B) three teams.]

The Thurman Organizational Chart (TOC) indicates the formal relationships (the organizational chart) of 15 employees of a large international corporation. Based on this directed network, employees are assigned to two and three teams, as depicted in Figure 14.

[Figure 14: Dividing the Thurman Organizational Chart network into (A) two and (B) three teams.]

The Soccer Teams network represents the relationships between 35 countries in terms of exporting soccer players: a link from country A to country B means that country A exports soccer players to country B. Figure 15 shows how these countries can be divided into two subgraphs (teams).

[Figure 15: Dividing the Soccer Teams network into two teams.]

Table 7 summarizes the results of CPLEX and LK-TFP runs for the directed networks described above.

Table 7: Simulation results of LK-TFP and CPLEX runs across the DBW, TOC and ST networks.

Network  Teams  Optimum  LK-TFP Solution (Min/Ave/Max)  CPLEX Solution  LK-TFP Time (s) (Min/Ave/Max)  CPLEX Time (s)
DBW      2      36       36 / 36 / 36                   36              0.05 / 0.9 / 3.1               363
DBW      3      30       30 / 30 / 30                   30              1.09 / 2.8 / 7.9               324
TOC      2      48       42 / 45.8 / 48                 48              4.6 / 12.8 / 38                468
TOC      3      21       21 / 21 / 21                   21              9 / 32.1 / 58.03               568
ST       2      761      749 / 754.9 / 761              761             239.8 / 580.15 / 789.4         42,757
ST       3      NA       345 / 350.1 / 368              NA              1208 / 1875.7 / 2349.8         NA
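The directed NSMs named above can be counted by brute force over vertex triples. A minimal sketch (helper names are illustrative, not from the paper) that counts 3-cycles and transitive triplets in a directed network:

```python
from itertools import permutations

def directed_nsm_counts(nodes, arcs):
    """Count two directed network structure measures (NSMs): 3-cycles and
    transitive triplets. `arcs` is a collection of (tail, head) pairs."""
    arcs = set(arcs)
    cycles = transitive = 0
    for u, v, w in permutations(nodes, 3):
        if (u, v) in arcs and (v, w) in arcs:
            if (w, u) in arcs:       # u -> v -> w -> u closes a cycle
                cycles += 1
            if (u, w) in arcs:       # u -> v -> w with shortcut u -> w
                transitive += 1
    # each 3-cycle is generated once per rotation of its vertices
    return cycles // 3, transitive

# a -> b -> c -> a is one 3-cycle and contains no transitive triplet
print(directed_nsm_counts("abc", {("a", "b"), ("b", "c"), ("c", "a")}))
```

In a team-quality function, such counts would be restricted to arcs whose endpoints both lie within the team being evaluated.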
Importantly, in settings where many people must be grouped into working teams, e.g., in emergency situations or large projects, the scalability of computational methods becomes an issue. This section explores this issue with TFP-SS instances with directed networks. To this end, a sample with 500 users of the trust network of Advogato, an online community platform for software developers constructed in 1999 (KONECT: http://konect.uni-koblenz.de/networks/advogato), is taken to formulate three TFP-SS instances. Table 8 summarizes the results.

Table 8: Simulation results of the heuristic algorithm and CPLEX runs across the Advogato network.

Candidates  Teams  Optimum  LK-TFP Solution (Min/Ave/Max)  CPLEX Solution  LK-TFP Time (s) (Min/Ave/Max)  CPLEX Time (s)
500         5      NA       1820 / 1991.8 / 2191           NA              787 / 7846 / 21,940            NA
500         10     NA       590 / 606.3 / 632              NA              896 / 11,918.5 / 32,101        NA
500         20     NA       458 / 520.1 / 584              NA              453 / 9976.8 / 28,356          NA

This trust network is an appropriate instance illustrating how TFP-SS can be adapted to find teams whose members strongly prefer reliable teammates. Such a formulation may be particularly useful when close cooperation is required within a large number of teams, e.g., in emergency situations involving participating NGOs and multiple volunteers.

8. Conclusion and Discussion

This paper presents a mathematical framework that explicitly incorporates social structure in treating the Team Formation Problem. The presented framework introduces models that quantitatively exploit the underlying network structure in team member communities. Importantly, this paper also opens broader research opportunities in the area of prescriptive SNA modeling. The presented framework sheds light on the relationship between social network theories and social structures, and discusses how to quantify social structure using information provided by the underlying graph. In order to assess team performance, network structure measures quantifying both social relations and individual attributes are given.
The paper explores TFP-SS instances with measures based on such network structures as edges, full dyads, triplets, k-stars, etc. For a proven NP-Hard instance of TFP-SS, called TFP-SSS, an integer programming formulation is presented for exact solution computation. In order to tackle problem instances based on TFP-SS, an efficient LK-TFP heuristic based on variable-depth neighborhood search is developed for small-, medium- and large-sized instances with both real and randomly generated networks. The idea of λ-opt sequential search, introduced and developed by Lin, Kernighan and Helsgaun for solving large TSP instances, is successfully applied to solve TFP-SS instances with undirected and directed networks. This paper describes the resulting LK-TFP heuristic as a tree search and explains the roots of its efficiency, confirmed by computational results.

Observe also that TFP-SS instances can be interpreted as certain clique relaxation problems, by identifying the NSMs that can be used in TFP-SS to match clique relaxation-based formulations. As such, consider a TFP-SS instance where only the number of edges is considered as the NSM and the objective is to maximize the minimum over all the teams' outcomes. This optimization problem can be interpreted as that of finding M network subsets such that each of them is a relaxed clique of a pre-defined minimal quality (e.g., an s-defective clique for a given fixed value of s). This observation signals that the TFP-SS framework, and hence LK-TFP, can be useful for current and future efforts in the clique relaxation domain, where clustering problems have been dominant so far.

While this paper demonstrates the advantages of the presented framework for prescriptively implementing SNA theories in TFP, some potential directions for further improvement exist. While the framework is able to generate a range of models for TFP based on social structures, the question of selecting the best model for a given application deserves more attention.
Also, this paper avoided an extended discussion of estimating the function relating NSMs to observed team outcomes; this issue can be addressed in future work. Furthermore, LK-TFP can be further tested and applied to other, similar problems, e.g., clique relaxation problems. Since TFP-SS presents computationally challenging problems, other optimization algorithms can be designed for treating TFP-SS models; exact methods are of particular interest. Finally, the presented framework's ideas can be extended to problems beyond the team formation problem. Network clustering, information influence, community detection, and scheduling (e.g., of work shifts) problems are especially promising.

References

Abbasi, A., Altmann, J., 2011. On the correlation between research performance and social network analysis measures applied to research collaboration networks. In: 44th Hawaii International Conference on Systems Science (HICSS-44), Jan. 4-7, Hawaii, USA.
Agustín-Blas, L. E., Salcedo-Sanz, S., Ortiz-García, E. G., Portilla-Figueras, A., Pérez-Bellido, Á. M., Jiménez-Fernández, S., 2011. Team formation based on group technology: A hybrid grouping genetic algorithm approach. Computers & Operations Research 38 (2), 484–495.
Albert, R., Barabási, A.-L., 2002. Statistical mechanics of complex networks. Reviews of Modern Physics 74 (1), 47.
Aral, S., Muchnik, L., Sundararajan, A., 2009. Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences 106 (51), 21544–21549.
Arulselvan, A., Commander, C. W., Elefteriadou, L., Pardalos, P. M., 2009. Detecting critical nodes in sparse graphs. Computers & Operations Research 36 (7), 2193–2200.
Balkundi, P., Barsness, Z., Michael, J. H., 2009. Unlocking the influence of leadership network structures on team conflict and viability. Small Group Research 40 (3), 301–322.
Bettinelli, A., Liberti, L., Raimondi, F., Savourey, D., 2013.
The anonymous subgraph problem. Computers & Operations Research 40 (4), 973–979.
Borgatti, S. P., 2005. Centrality and network flow. Social Networks 27 (1), 55–71.
Borgatti, S. P., Halgin, D. S., 2011. On network theory. Organization Science 22 (5), 1168–1181.
Ceravolo, D. J., Schwartz, D. G., Foltz-Ramos, K. M., Castner, J., 2012. Strengthening communication to overcome lateral violence. Journal of Nursing Management 20 (5), 599–606.
Chen, S.-J., Lin, L., 2004. Modeling team member characteristics for the formation of a multifunctional team in concurrent engineering. IEEE Transactions on Engineering Management 51 (2), 111–124.
Contractor, N. S., Wasserman, S., Faust, K., 2006. Testing multitheoretical, multilevel hypotheses about organizational networks: An analytic framework and empirical example. Academy of Management Review 31 (3), 681–703.
Dorn, C., Dustdar, S., 2010. Composing near-optimal expert teams: a trade-off between skills and connectivity. In: On the Move to Meaningful Internet Systems: OTM 2010. Springer, pp. 472–489.
Fitzpatrick, E. L., Askin, R. G., 2005. Forming effective worker teams with multifunctional skill requirements. Computers & Industrial Engineering 48 (3), 593–608.
Garey, M. R., Johnson, D. S., 1979. Computers and Intractability. Vol. 174. Freeman, New York.
Gaston, M., Simmons, J., DesJardins, M., 2004. Adapting network structures for efficient team formation. In: Proceedings of the AAAI 2004 Fall Symposium on Artificial Multi-agent Learning.
Girvan, M., Newman, M. E., 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99 (12), 7821–7826.
Goyal, A., Bonchi, F., Lakshmanan, L. V., 2010. Learning influence probabilities in social networks. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, pp. 241–250.
Hahn, J., Moon, J. Y., Zhang, C., 2008.
Emergence of new project teams from open source software developer networks: Impact of prior collaboration ties. Information Systems Research 19 (3), 369–391.
Helsgaun, K., 2000. An effective implementation of the Lin–Kernighan traveling salesman heuristic. European Journal of Operational Research 126 (1), 106–130.
Juang, M.-C., Huang, C.-C., Huang, J.-L., 2013. Efficient algorithms for team formation with a leader in social networks. The Journal of Supercomputing, 1–17.
Kargar, M., An, A., Zihayat, M., 2012. Efficient bi-objective team formation in social networks. In: Machine Learning and Knowledge Discovery in Databases. Springer, pp. 483–498.
Kempe, D., Kleinberg, J., Tardos, É., 2003. Maximizing the spread of influence through a social network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 137–146.
Kothari, R., Ghosh, D., 2013. Insertion based Lin–Kernighan heuristic for single row facility layout. Computers & Operations Research 40 (1), 129–136.
Lappas, T., Liu, K., Terzi, E., 2009. Finding a team of experts in social networks. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 467–476.
Leite, A. R., Borges, A. P., Carpes, L. M., Enembreck, F., 2011. Improving the distributed constraint optimization using social network analysis. In: Advances in Artificial Intelligence–SBIA 2010. Springer, pp. 243–252.
Lin, S., Kernighan, B. W., 1973. An effective heuristic algorithm for the traveling-salesman problem. Operations Research 21 (2), 498–516.
Manser, T., 2009. Teamwork and patient safety in dynamic domains of healthcare: a review of the literature. Acta Anaesthesiologica Scandinavica 53 (2), 143–151.
Nascimento, M. C., Pitsoulis, L., 2013. Community detection by modularity maximization using GRASP with path relinking. Computers & Operations Research 40 (12), 3121–3131.
Newman, M. E., 2006.
Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103 (23), 8577–8582.
Pirim, H., Ekşioğlu, B., Perkins, A. D., Yüceer, Ç., 2012. Clustering of high throughput gene expression data. Computers & Operations Research 39 (12), 3046–3061.
Robins, G., Pattison, P., Kalish, Y., Lusher, D., 2007. An introduction to exponential random graph (p*) models for social networks. Social Networks 29 (2), 173–191.
Burt, R. S., 1992. Structural Holes: The Social Structure of Competition. Harvard University Press, Cambridge, MA.
Ruef, M., Aldrich, H. E., Carter, N. M., 2003. The structure of founding teams: Homophily, strong ties, and isolation among US entrepreneurs. American Sociological Review, 195–222.
Salari, M., Naji-Azimi, Z., 2012. An integer programming-based local search for the covering salesman problem. Computers & Operations Research 39 (11), 2594–2602.
San Segundo, P., Rodríguez-Losada, D., Jiménez, A., 2011. An exact bit-parallel algorithm for the maximum clique problem. Computers & Operations Research 38 (2), 571–581.
San Segundo, P., Tapia, C., 2014. Relaxed approximate coloring in exact maximum clique search. Computers & Operations Research 44, 185–192.
Shi, Z., Hao, F., 2013. A strategy of multi-criteria decision-making task ranking in social-networks. The Journal of Supercomputing, 1–16.
Snijders, T. A., Van de Bunt, G. G., Steglich, C. E., 2010. Introduction to stochastic actor-based models for network dynamics. Social Networks 32 (1), 44–60.
Squillante, M., 2010. Decision making in social networks. International Journal of Intelligent Systems 25 (3), 225–225.
Wasserman, S., Faust, K., 1994. Social Network Analysis: Methods and Applications. Vol. 8. Cambridge University Press.
Wi, H., Oh, S., Mun, J., Jung, M., 2009. A team formation model based on knowledge and collaboration. Expert Systems with Applications 36 (5), 9121–9134.
Zhang, L., Zhang, X., 2013.
Multi-objective team formation optimization for new product development. Computers & Industrial Engineering 64 (3), 804–811.
Zhang, Z.-x., Luk, W., Arthur, D., Wong, T., 2001. Nursing competencies: personal characteristics contributing to effective nursing performance. Journal of Advanced Nursing 33 (4), 467–474.
Zhong, X., Huang, Q., Davison, R. M., Yang, X., Chen, H., 2012. Empowering teams through social network ties. International Journal of Information Management 32 (3), 209–220.
Zhu, M., Huang, Y., Contractor, N. S., 2013. Motivations for self-assembling into project teams. Social Networks 35 (2), 251–264.

9. Appendix

Theorem 1. Consider an instance of TFP-SSS, with M teams to be formed out of N individuals.
Instance: A graph G(V, E), |V| = N; n_j ∈ Z+ for 1 ≤ j ≤ M; a partition into disjoint sets X_1, X_2, ..., X_M, where X_j ⊆ V for 1 ≤ j ≤ M; and θ_l ∈ R for l ∈ L.
Question: Is there a partition of V into M disjoint subsets X_1 ∪ X_2 ∪ ... ∪ X_M, with |X_j| = n_j and Σ_{j=1}^{M} n_j = N, such that Σ_{j=1}^{M} P(X_j) is maximized, where P(X_j) = Σ_{i=1}^{n_j} Σ_{l∈L} θ_l F_l(N_{X_j}(v_i)) for 1 ≤ j ≤ M?
The presented problem is NP-Hard.

Proof: The proof proceeds by a polynomial-time reduction from Partition into Triangles. An arbitrary instance of Partition into Triangles is given.
Instance: A graph G(V, E), with |V| = 3q, for a given fixed integer q.
Question: Can the vertices of G be partitioned into q disjoint sets V_1, V_2, ..., V_q, each containing exactly 3 vertices, such that for each V_i = {u_i, v_i, w_i}, 1 ≤ i ≤ q, the three edges {u_i, v_i}, {u_i, w_i}, and {v_i, w_i} all belong to E (Garey and Johnson 1979)?
Consider a particular instance of TFP-SSS with N = 3q, M = q, and X_j = V_j for 1 ≤ j ≤ q. Set θ_l = 1 for l = triangle and θ_l = 0 otherwise (i.e., use the number of triangles as the only NSM in the objective function of TFP-SSS). Finally, set n_j = 3 for 1 ≤ j ≤ q.
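For illustration, the triangle-only objective used in this reduction can be sketched as follows (a minimal brute-force version; function names are ours, not from the paper). With two disjoint triangles (q = 2), the partition that matches them attains objective value q, while a mixed partition does not:

```python
from itertools import combinations

def triangles_in_team(team, edges):
    """Count triangles whose three vertices all lie in `team`."""
    return sum(
        1
        for u, v, w in combinations(sorted(team), 3)
        if {u, v} in edges and {u, w} in edges and {v, w} in edges
    )

def objective(partition, edge_list):
    """Triangle-only TFP-SSS objective from the reduction:
    theta_triangle = 1, all other NSM weights 0."""
    edges = [set(e) for e in edge_list]
    return sum(triangles_in_team(team, edges) for team in partition)

# Two disjoint triangles (q = 2): the matching partition attains value q.
edge_list = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
assert objective([{1, 2, 3}, {4, 5, 6}], edge_list) == 2
# Mixing vertices across teams breaks both triangles here.
assert objective([{1, 2, 4}, {3, 5, 6}], edge_list) == 0
```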
To demonstrate that there is a one-to-one correspondence between the described Partition into Triangles and TFP-SSS instances, suppose that X_j^*, j = 1, ..., q, is the optimal solution to TFP-SSS. Then |X_j^*| = 3, and Σ_{j=1}^{q} F_triangle(N_{X_j^*}(v_i)) = q is the maximal value of the objective function. Observe that V_j^*, with |V_j^*| = 3, which assigns three nodes to partition j, is equivalent to X_j^*; note that these three nodes form a triangle. Suppose that X_j^* ≠ V_i^* for some i and j, with 1 ≤ i, j ≤ q. Then there exists at least one partition with three nodes {u_i, v_i, w_i} ⊆ X_j^* such that not all of the edges {u_i, v_i}, {u_i, w_i}, {v_i, w_i} belong to E. Therefore, Σ_{j=1}^{q} F_triangle(N_{X_j^*}(v_i)) < q, which is a contradiction, since there are q partitions and at least one of them does not contain a triangle. Hence X_j^* = V_i^* is an optimal solution for both problems. This completes the proof.

Violent Conflict and Online Segregation: An analysis of social network communication across Ukraine's regions*

Dinissa Duvanova†, Department of International Relations, Lehigh University
Alexander Nikolaev, Department of Industrial and Systems Engineering, University at Buffalo, SUNY
Alex Nikolsko-Rzhevskyy, Department of Economics, Lehigh University
Alexander Semenov, Department of Computer Science and Information Systems, The University of Jyvaskyla, Finland

Abstract

Does the intensity of a social conflict affect political division? Traditionally, social cleavages are seen as the underlying cause of political conflicts. It is clear, however, that a violent conflict itself can shape partisan, social, and national identities. In this paper, we ask whether social conflicts unite or divide the society by studying the effects of Ukraine's military conflict with Russia on online social ties between Ukrainian provinces (oblasts).
In order to do that, we collected original data on the cross-regional structure of politically relevant online communication among users of the VKontakte social networking site. We analyze a panel of provinces spanning the most active phases of domestic protests and military conflict and isolate the effects of province-specific war casualties on the nature of inter-provincial online communication. The results show that war casualties elicit a strong emotional response in the corresponding provinces, but do not necessarily increase the level of social cohesion in inter-provincial online communication. We find that the intensity of military conflict stimulates online activism, but activates regional rather than nation-wide network connections. We also find that military conflict tends to polarize some regions of Ukraine, especially in the East. Our research brings attention to underexplored areas in the study of civil conflict and political identities by documenting the ways the former may affect the latter.

JEL Codes: K42, H56, O5, P2, Z1
Keywords: Ukraine, social media, war, terrorism

* The authors would like to thank Ruben Enikolopov, Konstantin Sonin, and participants of the Symposium for a Special Issue of the Journal of Comparative Economics for helpful comments. The research of A. Semenov is supported by the Academy of Finland Grant #268078 "Mining social media sites" (MineSocMed).
† Corresponding author. Assistant Professor, Department of International Relations, Lehigh University, email: [email protected].

Electronic copy available at: http://ssrn.com/abstract=2664949

1. Introduction

Does the intensity of social conflicts affect political divisions? On the one hand, the traumatic experience of violence may reinforce polarizing identities (Wilkinson 2004). On the other, violent conflict may help consolidate the society and promote social capital (Russett 1990, Voors et al. 2012, Blattman 2009). How exactly do violent conflicts reshape the society?
We attempt to answer this question by studying the effects of Ukraine's "Revolution of Dignity" and the military conflict with Russia on online social ties between Ukrainian provinces. Researchers have identified digital communication and online activism as increasingly consequential forms of behavior, as well as important mechanisms of political change. Social media has been shown to affect civic engagement, political participation, and economic choice.1 Digitally enabled forms of political communication have also become increasingly important sources of attitudinal and behavioral data. As the digital revolution gives rise to new electronic forms of mass communication and virtual association, it opens greater opportunities to study how people form attitudes, express their opinions, and engage in collective behavior. In this paper, we explore such opportunities by analyzing ways in which political information shapes online social engagement. We examine online political activism and engagement during the period of political contention spanning the anti-regime Euromaidan protests, the annexation of Crimea, and the armed insurgency and foreign intervention in Eastern Ukraine. We believe that Ukraine's case presents an advantageous setting not only for investigating online activism as an increasingly popular form of civic engagement, but also for evaluating long-standing questions about the role of violent conflict in promoting social change.

1 Researchers find empirical links between exposure to digital communication technology and political attitudes (Kerbel 2009), voting (Christakis and Fowler 2009, Vitak et al. 2011, Bond et al. 2012), civic engagement (Jennings and Zeitner 2003, Jensen et al. 2007, Bennett & Segerberg 2013), campaign contributions (Hamilton & Tolbert 2012), support for political parties (Norris 2003), and collective action (Earl 2011, Bennett and Segerberg 2013). Scholars studying authoritarian regimes link social media to oppositional attitudes and protest behavior (Tang, Jorba, and Jensen 2012; Howard & Hussain 2013; Lim 2012; Tufekci & Wilson 2012). Social media is also identified as an effective tool of governance that affects policymaking (Kerbel 2012; Baum 2012; Lawless 2012). Enikolopov et al. (2015) demonstrate that blogs affect the stock market and corporate governance. See Fox & Ramos (2012) and Jensen et al. (2012) for a review of the related literature.

Ukraine's turbulent politics provide a rich context for studying social conflict. In Ukraine, the relative weakness (Aslund and McFaul 2006, Birch 2000) and "fluidity" (Zielinski et al. 2008) of institutional mechanisms of routine public engagement (political parties, unions, advocacy groups) make virtual communication a particularly important venue of political engagement. The ability to freely exchange opinions and share relevant political information is the cornerstone of a democratic society (Huckfeldt and Sprague 1995). In new democracies that lack proper institutions, social networking websites such as Facebook may not only provide easily accessible venues for political expression, but also serve as substitutes for underdeveloped institutions of civil society.2 In order to analyze such increasingly accessible and important mechanisms of political expression, we collected original data on the cross-regional structure of politically relevant online communications. In particular, our dataset contains user-created posts and comments from public political groups on VKontakte (VK.com), the largest social networking site in Ukraine by the number of registered users and daily visits.
Each post and its related comments, in addition to the date and time stamp, contain user-specific information such as the user's name, self-reported home and current cities, education, the list of languages spoken, etc. Unfortunately, due to privacy concerns, the individual-level data, while present in the original dataset, had to be aggregated to the group level before we could use it in our analysis. Nevertheless, we are able to utilize information on contributors to specific discussions, including their regional composition.

2 See Beissinger (2012) for a related discussion.

Our analysis of virtual communication carried out in VKontakte discussion groups reveals that the intensity of military aggression, as captured by widely publicized information about army fatalities, unites some parts of the Ukrainian social media community while segregating others. We analyze volumes of cross-provincial communication and engagement in online discussion groups and find that information about civil protests (Maidan) and war casualties leads to greater online activism on the part of users from the affected provinces. Such engagement, however, remains mostly localized and does not affect other parts of the country. We also find considerable variation in the ways different parts of the country respond to the violent conflict as measured by the number of casualties.3 Political information eliciting a strong emotional response has a polarizing effect in Eastern oblasts (provinces), but not in the rest of the country. While in Western oblasts war casualties result in increased disengagement from the rest of the country, in the East they lead to greater polarization. This finding corroborates previous studies that link online communication to political fragmentation and polarization (Rozenblat and Mobius 2004; Duvanova et al.
2015; Bennett and Segerberg 2013; Webster 2007; Prior 2007).4 To our knowledge, this is the first study to systematically examine the causal effects of divisive news on digitally enabled forms of public engagement.

3 We divide Ukraine into Western, Central, Southern, and Eastern parts according to the established convention. The list of oblasts falling into each region is presented in Figure 1.

4 Gentzkow and Shapiro (2011), on the contrary, find that ideological fragmentation in online news consumption is higher than in offline media, but low in absolute terms and in comparison to face-to-face communication. While we do not make any claims about the absolute levels of social media fragmentation in Ukraine, our finding that information about the intensity of fighting heightens regional segregation of the virtual network does not contradict Gentzkow and Shapiro's conclusions and further extends this line of research.

The paper is organized as follows. The next section develops our argument, analytical model, and hypotheses. We then describe our data collection, methods, and aggregation procedures in Section 3. Section 4 presents our empirical analyses. Conclusions summarizing our results and contributions are presented in Section 5.

2. Insurgency, casualties, and social network communication

2.1 Theoretical Considerations

Traditionally, social identities are seen as the underlying causes of political contention (Fearon & Laitin 2000, Montalvo & Reynal-Querol 2005, Montalvo & Reynal-Querol 2010). Deep and persistent cleavages polarize society and may contribute to violent conflict (Fearon & Laitin 2003, Blattman & Miguel 2010, Jackson & Morelli 2011).5

5 Studies have shown that mass media can increase the salience of ethnic identities, heighten racial animosity (Della Vigna et al. 2014, Adena et al. 2015), and influence political behavior (Enikolopov et al. 2011; see Della Vigna & Gentzkow 2010 for a review). Similarly, social media that harness social networks of like-minded people are seen as an important source of influence in politics (Murphy and Shleifer 2004, Christakis and Fowler 2009, Bond et al. 2012).

At the same time, tribal, ethnic, and national identities are shaped by collective experiences of war and violence. Instead of asking whether social divisions affect political conflict, we investigate the reverse causal link: from the intensity of violent conflict (measured by the number of fatal casualties) to the networking of diverse groups of the online population. Scholars have long recognized war as a companion and catalyst of nation building (Anderson 1982, Mylonas 2012) and state-making (Tilly 1992, Thies 2005, Acemoglu and Robinson 2006). External threats (international conflict or out-group attacks) reduce the salience of internal divisions, suppress internal dissent, and help build a sense of group identity. In the face of external threats, violent conflicts in particular, citizens tend to re-define the relationship between their partisan, social, and national identities. Studies of the "rally 'round the flag" effect demonstrate that an external threat, if perceived as such by the majority of the country, tends to unite its citizens (Mueller 1970, Russett 1990). According to this theory, foreign threats activate non-divisive identities and have a unifying effect in politically diverse groups. When studying Ukraine, however, it is unclear whether the war in the Donbas would have such a unifying effect. An ethno-linguistically diverse nation, Ukraine may not yet have developed a strong sense of national identity. Since the introduction of competitive elections, the political preferences and voting patterns of Ukrainian citizens have largely overlapped with ethnolinguistic divisions (Clem and Craumer 2008). Anti-government protests were fueled by ethnic and cultural cleavages rather than by universalistic democratic principles (Beissinger 2013). Russian media have had an important influence on the political behavior of exposed Ukrainians, reinforcing a pro-Western vs. pro-Russian schism (Peisakhin and Rozenas 2015). With most Ukrainians being fluent in Russian, self-selected exposure to biased media explains why the very nature of Ukraine's armed conflict remains unclear. While nationalist, pro-Western outlets see the conflict in terms of foreign intervention, Russian state-controlled media cover it as an insurgency, separatism, and civil war. In more general terms, one might debate whether the "rally 'round the flag" effect would arise in a deeply divided nation in the midst of a war. The ethno-linguistic as well as regional dimensions of the conflict may further fragment the society along these fault lines. A cursory analysis of online communication might suggest that the Ukrainian virtual public is in fact becoming increasingly engaged in a way that bridges geographically defined linguistic and political barriers. Figure 2 plots the strength of online communication ties connecting users of the VKontakte social networking platform at two points in time.6 The first graph maps cross-provincial communication during the early phase of the Maidan protests (November 2013). The second graph does the same for the period of intense fighting between the Ukrainian army and insurgent forces (August 2014). It appears that online communication has a greater density and a more pronounced cross-regional character during the war, compared to the period before the start of the insurgency. On the other hand, the Southern part of Ukraine (especially the occupied Crimean peninsula) is generally excluded from interprovincial communication in 2014, although in 2013 it was closely connected to the Central and Western oblasts. Does that mean the Ukrainian public becomes more or less united in response to the violent conflict in the Donbas?
Visual analysis of online communication might be misleading. First, the overall importance of inter-regional ties cannot be adequately assessed without explicit reference to intra-regional communication. Second, communication does not imply agreement. Heated debates might entice participation but, at the same time, may further polarize virtual communication.6

6 We first aggregate individual-level data on online political communication to the level of the province. For each pair of provinces (dyads), we identify discussion topics with participants from each province. For every dyad, the intensity of shared communication is measured as the number of topics that both provinces contributed to. In order to account for the value of a topic in determining communication intensity, we use the percentages of messages contributed by the regions with respect to their total message volumes. In other words, for each topic we calculate its share in the total number of wall posts contributed by each of the provinces. Then we add the products of the provinces' contributions (as shares of their total posts) to each of the shared topics for the corresponding dyad. The resulting measure of inter-provincial ties ranges from 0 to 1, with higher numbers corresponding to stronger ties. Our measure has the advantage of being balanced with respect to the total number of users from a specific province and the number of topics they discuss.

Although this is a grossly simplified characterization, the political cleavage lines that have been observed since the early post-independence period closely follow the East-South vs. West-Center geographical divide. Previous research has identified ethnic, linguistic, socio-economic, and cultural-historical cleavages as mutually reinforcing factors accounting for regional variation in public attitudes and political behavior (Birch 2000; Aslund and McFaul 2006; Clem and Craumer 2008; Beissinger 2013).
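The dyadic tie-intensity measure described in footnote 6 can be sketched in a few lines (a minimal illustration under our reading of the measure; the function name and the dictionary-based input format are our own assumptions, not the authors' code):

```python
def tie_strength(posts_a, posts_b):
    """Dyadic tie intensity between two provinces.

    posts_a, posts_b: dicts mapping topic id -> number of wall posts
    the province contributed to that topic.  For each shared topic we
    multiply the two provinces' shares of their own total post
    volumes, then sum the products over all shared topics.
    """
    total_a = sum(posts_a.values())
    total_b = sum(posts_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    shared = posts_a.keys() & posts_b.keys()
    return sum((posts_a[t] / total_a) * (posts_b[t] / total_b) for t in shared)

# Toy example: each province devotes all its posts to one shared topic.
print(tie_strength({"maidan": 5}, {"maidan": 8}))  # 1.0
```

If the two provinces share no topics the measure is 0; because each province's shares sum to one, the sum of products never exceeds 1, matching the stated 0-to-1 range.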
While researchers disagree on the precise composition and relative theoretical merits of the underlying social factors, Ukrainian political preferences clearly and consistently vary across regional lines: Russian-speaking Eastern and Southern Ukraine, which is more industrialized and dominated by highly concentrated industry, differs significantly from the Ukrainian-speaking Western and Central provinces, which are characterized by small enterprises and tertiary-sector domination. Our identification strategy makes use of these persistent and clearly defined social divides: we use province-specific Ukrainian military fatalities as a measure of the conflict's intensity. Ukrainian media routinely report personal information on all military personnel and volunteers killed in Eastern Ukraine, including each person's place of origin. This makes it possible for the online groups to link casualties to specific provinces. Soldiers and officers from all parts of the country serve in the Ukrainian armed forces; soldiers from the South, East, West, and Center are equally likely to die fighting separatists. The key advantage of casualties is that they can, for the most part, be considered exogenous, and their specific geographical attributes allow us to derive province-specific expectations about unifying/polarizing effects. Conflict casualties were first reported on March 18, 2014, and peaked in August 2014 at over 500 dead during the Ilovaisk massacre. Figure 3 graphs the monthly casualties data over time. We identify two dimensions of cross-regional communication: the level of cross-regional fragmentation and the degree of its polarization. Fragmentation and polarization are two distinct aspects of network communication. We define fragmentation of online communication as the lack of connections (exchange of information) between different groups in the society.
For the purpose of our analysis, we concentrate on inter-provincial connections. But how do we know that information is being exchanged? Open communication platforms allow all network users to access information, but this does not guarantee that all users are exposed to it. In fact, prior studies have clearly documented "selective exposure" in pluralistic information environments (Sunstein 2001; Stroud 2011). We explore inter-regional fragmentation using three different empirical metrics capturing: 1) how active users from a given province are in contributing to online communication; 2) to what extent online communication (in a group or discussion forum) brings together users from different provinces; and 3) how "influential" the content of one province's posts and comments is in enticing a response from other members of the social network. In our specific case, the response is captured by the number of comments the original message (post) attracts. Of course, active communication among users from various provinces does not necessarily imply a constructive dialogue. Therefore, we also consider how polarized the sentiment is in each discussion. Because we analyze online discussions on a diverse set of political topics, we measure polarization as the difference in the intensity of positive and negative sentiment captured by the content analysis of the comments.

2.2. The Model and Hypotheses

We model all online communication as falling into two types. Type I communication is carried out between users residing in the same province. Type II communication takes place between users from any two different provinces. Intra-provincial (Type I) communication, in our view, is more likely to be carried out via friendship, familial, occupational, and offline association networks and, due to the geographical patterns of ethno-linguistic and political cleavages, to unite people of proximate political preferences and ethno-linguistic backgrounds.
Type II communication, on the contrary, is less likely to be anchored by physical connections and more likely to engage people of different backgrounds and political preferences. While we are unable to differentiate between in-group ties based on physical connections and impersonal out-group ties, inter-provincial communication is more likely to fall into the impersonal out-group type because of the greater physical and cultural distance. We assume that users incur time costs and derive social benefits from posting messages and comments. Furthermore, social benefits increase with the sense of community, solidarity, and social efficacy. Hence, a user is more likely to contribute a costly effort to a conversation that is more socially relevant to her.7 In our analysis, political events, such as the Maidan revolution and the escalation of the war in the Donbas, increase the social relevance of some online communication. We also assume that VKontakte users are generally aware of the ethno-linguistic backgrounds and political preferences of the authors of wall posts. They may deduce this information from the author's choice of language (Ukrainian or Russian), the content and tone of the message, the previous history of public communication, and, perhaps, from the home town and/or town of residence associated with users' profiles.8 We model communication as initiated by user i, who posts a message on the VKontakte wall. User j ≠ i may respond to i's posts with a number of comments xij. If both i and j reside in the same province, xij constitutes Type I communication. If i and j are from different provinces, Type II communication takes place. In our model, social network users use information on the intensity of social conflict to update their political beliefs and preferences. They may rely on various media sources to obtain such information.
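The Type I/Type II distinction above reduces to a simple residence check (a trivial sketch; the function name and province labels are illustrative, not from the paper):

```python
def communication_type(province_i, province_j):
    """Classify a post-comment exchange: Type I if the poster i and
    the commenter j reside in the same province, Type II otherwise."""
    return "Type I" if province_i == province_j else "Type II"

print(communication_type("Lviv", "Lviv"))     # Type I
print(communication_type("Lviv", "Donetsk"))  # Type II
```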
Because the media outlets may be selective in covering various aspects of the conflict, we chose to concentrate on a widely publicized objective measure of the intensity of the social conflict: war-related fatalities of pro-Ukrainian armed forces. This measure has the important advantage of being specific to our unit of aggregation. VKontakte users may obtain province-specific war casualties from official and unofficial (e.g., Wikipedia) sources. Besides, they may be exposed to province-specific casualties via physical interactions with people whose family members and acquaintances serve in the army.

7 Relevance may mean both agreement and disagreement.

8 Our analysis excludes users whose profiles do not identify the place of residence.

If the military intervention in Eastern Ukraine promotes solidarity and strengthens an all-Ukrainian national identity, one would expect Ukrainians of all ethno-linguistic backgrounds and political preferences to engage in online discussions (Type II communication). On the contrary, if the conflict activates divisive ethnic, regional, or political identities, one would expect VKontakte communication to be more localized, as people would resort to the safety of familial, friendship, occupational, and generally more localized networks (Type I communication) where they are more likely to be accepted. Moreover, solidarity with the defenders of Ukrainian independence should be compatible with an inclusive Ukrainian national identity. As such, if Ukrainians are becoming more united in their patriotism and opposition to Russia, they should be more likely to engage in Type II communication with provinces experiencing war casualties on the Ukrainian side, to show emotional support and express sympathy.
If, on the contrary, the war polarizes rather than unites the nation, province-specific war casualties should increase Type I communication: a provincial, rather than all-Ukrainian, identity should be activated in response to province-specific fatalities. The first set of hypotheses, therefore, addresses the general nature of inter-provincial communication carried over the digital social networking platform:

Hypothesis 1a: Intensification of violent conflict should lower fragmentation of online communication.

Hypothesis 1b: Intensification of violent conflict should lower the degree of polarization in online communication.

These hypotheses are consistent with the notion that the war in the Donbas unites Ukrainians and promotes an inclusive sense of patriotism. The opposite expectations are consistent with the notion that the war further fragments and polarizes the nation. To evaluate these hypotheses, we analyze the entire daily panel of Ukrainian provinces between January 1, 2013 and December 31, 2014. Given the regional clustering of Ukraine's socio-political and ethno-linguistic cleavages, we develop a set of complementary hypotheses to test whether the patterns of inter-provincial fragmentation and polarization differ in West-Central and East-Southern provinces:

Hypothesis 2a: Intensification of violent conflict should reduce fragmentation and polarization of online communication in the subset of provinces with a strong prior sense of Ukrainian identity, e.g., West-Central Ukraine.

Hypothesis 2b: Intensification of violent conflict should increase fragmentation and polarization of online communication in the subset of provinces lacking a very strong sense of Ukrainian identity, e.g., East-Southern Ukraine.

To evaluate these hypotheses, we split the sample into Western, Central, Eastern, and Southern provinces.
After describing our data in the next section, we evaluate these two sets of theoretical expectations against the observed patterns of online communication in the "Analysis" section.

3. Data

3.1 Data Collection Methods

In order to investigate how social conflict affects mass political communication, we collected data on online groups and users of VKontakte (vk.com, or simply "VK"; previously vkontakte.ru), which is the most visited social networking site in Ukraine.9 With its three official languages (Russian, Ukrainian, and English), VKontakte has over 200 million users who primarily reside in Russia, Ukraine, Kazakhstan, Moldova, Belarus, and Israel. In Ukraine, the site's share of total Internet searches is second only to those carried out using the Google search engine. According to Ukraine Business Online,10 out of 30 million Ukrainians who used social networks in 2012, 20 million had VKontakte accounts. VKontakte allows users to post public or private messages and share audio, video, and text content, as well as create groups, public pages, and events. In what follows, we analyze user-created groups that have explicitly identifiable political content in the body of their wall posts and comments.11

9 Alexa, the web information service, http://www.alexa.com/topsites/countries/UA, retrieved in Feb. 2015.

10 http://www.ukrainebusiness.com.ua/news/7110.html

11 It should be noted that following the 2014 politically motivated resignation of VKontakte founder P. Durov, the site was subject to increased control by the Russian secret services (http://www.ewdn.com/2014/04/22/durov-says-he-gave-up-vkontakte-share-because-of-anti-maidan-pressure-from-fsb/). Following Durov's resignation, pundits predicted a mass migration of pro-Ukrainian users to Facebook. Despite this, VKontakte remains the most visited networking site in Ukraine, far surpassing Facebook in the number of users. It should be taken into account, however, that with some anti-Russian users leaving VKontakte in protest, the subset of remaining users has become biased towards supporting Russia. This self-selection of users into different networking platforms should bias our analysis against finding fragmentation and polarization.

VKontakte offers its users the possibility to create a profile and fill it with various personal and non-personal information, such as name, gender, date and place of birth, pictures, etc. Each user has her own wall, where she can post various messages. These messages, along with the user's data, can be seen on the user's profile page, which can be accessed by a URL. Users can create explicit communities: groups and public pages. Communities have separate pages. These pages contain various community details, such as the name, description, logo, the community's wall, where community news is displayed, and community discussion boards. Users may join communities. When they do, their public information becomes accessible under the community's "members" field. Depending on privacy settings, communities may be public (available to anyone) or closed, in which case an invitation to join is required. Settings also regulate who can post on a community wall, e.g., all users or administrators only. VKontakte allows searching communities by different keywords. Each community has a distinct URL, by which it can be accessed. The data were collected from public communities using the social media monitoring software described in Semenov and Veijalainen (2013) and Semenov, Veijalainen, and Boukhanovsky (2011). Collection relied on the application programming interface (API) of VKontakte, which exports data in JSON format. The data were then placed in a repository based on a PostgreSQL database. Initially, the groups were identified by searching communities over all their posts using one or more of the following keywords: "Ukraine", "Украина", "Україна", "Майдан", "Євромайдан", with all their grammatical variations.12 In total, the search identified 14,777 public communities. Then, all public posts (message text and date) were downloaded from each community wall. Comments attached to the selected posts were gathered as well. Next, the city and country listed in users' public profiles were gathered for those users who were members of the identified communities, as well as for those who posted comments on the walls. No personal information was ever collected or stored. We gathered 19,430,445 wall posts, which jointly had 62,193,711 comments. During wall post and comment gathering, we downloaded 71 GB of text data.

12 Including keywords such as "war", "protest", and "conflict" produced redundant results because these overlapped with "Ukraine" and "Ukrainian."

To capture the temporal dimension of the development of online networks, we separate posts contributed to the discussion topics into daily segments based on the timestamps of the user-supplied posts.13 In effect, our analysis only includes discussion groups that had active contributors during the analyzed day. Figure 3 displays the changes in the number of wall posts and comments contributed during the analyzed period. Relying on the user-supplied information about the place of residence, we can identify the regional dimension of politically motivated Internet communication, which we expect to follow the existing ideological fault lines of present-day Ukrainian politics. Our analysis groups users' comments by oblast.14 For each discussion group, we compute the share of participants from each province.

13 Aggregating our data to weekly frequency does not affect our conclusions. Results can be found in Appendix B, Tables B1-B5.

14 VKontakte users do not identify their oblasts, but only use city and country names.
Cities were matched to their corresponding oblasts using the open-source dataset Geonames (http://download.geonames.org/export/dump/). For those cities present in the dataset more than once, the one having the maximal population was selected. Based on the user-provided place of residence, we group users by province. Some users kept their information private; in cases where we were not able to identify a user's home province, we assigned the user to a separate "unidentified" group. In order to investigate how regional patterns of online communication respond to the intensification of political tension and violent conflict, we supplement our online networks data with data tracing province-specific war casualties. There are several potential official government sources of the casualties data (e.g., the Ministry of Defense of Ukraine), as well as several unofficial sources collected by individual volunteer organizations (e.g., the Wings of Phoenix). While the former lack biographical information on the fallen (thus making it impossible to match them to their home cities) as well as data on volunteer fighters, the latter might be incomplete. For that reason, we decided to use a crowd-sourced Wikipedia webpage on Ukrainian casualties that appears to be more complete than other individual sources and, in addition to the name, rank, and date of death, contains a short biographical sketch.15 Using the supplied information, we were able to calculate the number of casualties each oblast suffered each day from January 1, 2013 through the last day of 2014. Among the 1,466 casualties, 560 and 398 are from the Western and Central oblasts (65% of the total), while 204 and 304 are from the Southern and Eastern oblasts (35% of the total).
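The duplicate-city resolution step described above (keeping the most populous entry when a city name appears more than once) can be sketched as follows; the sample rows and the simplified three-column layout are our own illustration, not the actual Geonames schema:

```python
import csv
from io import StringIO

# Hypothetical excerpt in a simplified layout we assume here:
# city name, admin region (oblast), population.
SAMPLE = """name,oblast,population
Pervomaisk,Mykolaiv,54818
Pervomaisk,Luhansk,38953
Kharkiv,Kharkiv,1430885
"""

def city_to_oblast(rows):
    """Resolve duplicate city names by keeping the most populous entry."""
    best = {}
    for row in rows:
        name, pop = row["name"], int(row["population"])
        if name not in best or pop > best[name][1]:
            best[name] = (row["oblast"], pop)
    return {name: oblast for name, (oblast, pop) in best.items()}

mapping = city_to_oblast(csv.DictReader(StringIO(SAMPLE)))
print(mapping["Pervomaisk"])  # Mykolaiv
```

The larger-population Pervomaisk wins, so ambiguous user-supplied city names resolve deterministically.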
Our identification strategy is based on the assumption that if war casualties affect the way people engage on the VKontakte platform, the number of comments originating in an oblast should respond positively to casualties from the same oblast. This is the first thing we check in our empirical analysis. This alone, however, will tell us little about whether different users are engaged in inter-provincial dialogue, and even less about the nature of that dialogue. To quantify these effects, we identify two dimensions of cross-regional communication: fragmentation, or the lack of connections between regions, and polarization of the sentiment of each discussion.

3.2 Measuring Fragmentation

To capture the extent to which social network communication unites users from different provinces, we adopted the reverse of the "group separation" measure by Rosenblat and Mobius (2004). We define interprovincial cohesion (the reverse of fragmentation) as the share of total comments falling in Type II communication. The larger the share, the less compartmentalized online communication is. To capture the same concept of cohesion in a regional dyads analysis (either West-Center vs. East-South or any pair of regions identified in Figure 1), which is more relevant for our purpose, we modify the above measure for the case of only two groups. The idea is that if the share of comments to a particular post coming from either of the two regions equals one or zero, there is little or no interaction between the regions.16 If, instead, the share is close to .5, neither region dominates the discussion, so users from different parts of the country exchange opinions on relevant political issues. To account for that, we construct the dyadic cohesion index as:

\text{Cohesion}_t = 1 - \frac{2}{N_t} \sum_{j=1}^{N_t} \left| \frac{c_{jt}^{r}}{c_{jt}^{r} + c_{jt}^{r'}} - \frac{1}{2} \right| \qquad (1)

where c_{jt}^{r} is the number of comments to wall post j on day t coming from region r, and N_t is the number of wall posts considered on day t.

15 Biographical data were missing or incomplete for several soldiers, which did not allow us to identify their home city. In those cases their oblast in the dataset was coded as "unknown."

16 To correctly calculate the shares, we had to adjust for comments coming from other countries, as well as comments from users whose location we were not able to identify. Also, we limited our consideration to posts with at least 10 comments to remove noise from the data.

Thus, in both measures, values closer to zero mean no interprovincial conversation, and values close to one mean full engagement. While reflective of the extent of "connectedness" with users residing outside a given province, this index does not take into account the overall volume of conversation carried by the province or the popularity of the original posts (often assessed by the number of comments they attract). Therefore, it ignores the informational and persuasive properties of inter-provincial dialogue. To capture the extent to which the content produced by users from a given province can attract an inter-provincial response, we use a recently introduced measure of engagement capacity to assess the ability of groups of online forum users to entice peer response from other oblasts (Godre et al., 2015). Online social network or forum communication directly depends on the content continuously created by the users and, in particular, on the synergy they gain from responding to each other: building relationships, finding common political interests, sharing and spreading attitudes, and establishing themselves as a cohesive self-identifying entity. The engagement capacity index quantifies the share of any user, or user group, in this synergistically created value, under the assumption that a forum post is only valuable if it is able to attract peer responses, which can be either in support of or in opposition to the original post.
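Under our reading of the dyadic cohesion index (a per-post comment share of .5 means full engagement, shares of 0 or 1 mean none), the computation can be sketched as follows; this is an illustrative sketch only, and the function name and tuple-list input are our own assumptions:

```python
def dyadic_cohesion(comment_pairs):
    """Dyadic cohesion index for one day.

    comment_pairs: list of (c_r, c_r2) tuples, one per wall post,
    giving comment counts from the two regions of the dyad.  A post
    where one region supplies all comments contributes 0 to cohesion;
    an even split contributes 1.
    """
    shares = [a / (a + b) for a, b in comment_pairs if a + b > 0]
    n = len(shares)
    return 1 - (2 / n) * sum(abs(s - 0.5) for s in shares)

print(dyadic_cohesion([(10, 0), (0, 7)]))   # 0.0 (fully fragmented)
print(dyadic_cohesion([(5, 5), (12, 12)]))  # 1.0 (full engagement)
```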
The engagement capacity measure relies on principles of cooperative game theory, which calculates "fair-share" allocations in settings where agents form coalitions to achieve a common goal. The first widely used measure of this sort is the Shapley value (Shapley, 1953). While the Shapley value is useful for evaluating fairly contributed values in unstructured coalitions, engagement capacity works with the directed trees of forum threads (Godre et al., 2015). The engagement capacity index takes into account an initiator/responder tradeoff: in a single exchange, it gives more value to the "originator," or, more precisely, to all the users who have posted in the thread up until that moment. We view VKontakte posts as discussion seeder/originator actions, and the ensuing comments as responder actions. We posit that VKontakte users form groups, by oblast, and calculate the groups' engagement index as a measure of their ability (as originators) to engage other groups (as responders) in political communication. We track the engagement values for all oblasts daily over time, and study whether information available through public channels (reported war casualties from a given province) serves as a statistically significant predictor of the ability of each particular oblast "to be heard." Figure 4 plots the engagement capacity index for West-Central and East-Southern oblasts. We can see that before the Revolution started, West-Central and South-Eastern oblasts had similar power in starting valuable discussions that were followed by residents of other oblasts. This clearly changed after November 2013. Now the majority of influential discussions that attract a wide array of participants originate in the West.

3.3 Measuring Polarization

Our second dependent variable, political polarization, is conceptualized as the distance in sentiment, or the intensity of comments' positive and negative connotations (Pang and Lee, 2008).
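The Shapley value that underlies this family of measures can be illustrated with a toy originator/responder game. This is our own illustration: the actual engagement capacity index operates on directed thread trees and is not reproduced here.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values by averaging each player's marginal
    contribution over all join orderings (feasible for small games).

    value: function mapping a frozenset coalition -> its worth.
    """
    phi = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            phi[p] += value(coalition | {p}) - value(coalition)
            coalition = coalition | {p}
    return {p: v / len(orderings) for p, v in phi.items()}

# Toy forum game: a post by originator "o" is worth nothing alone;
# each responder who joins a coalition containing "o" adds 1 unit.
def worth(c):
    return len(c) - 1 if "o" in c else 0

print(shapley_values(["o", "r1", "r2"], worth))
# {'o': 1.0, 'r1': 0.5, 'r2': 0.5}
```

The originator receives the largest share (1.0 of the total worth of 2.0), consistent with the initiator/responder tradeoff described above: responses have no value without the post that seeded them.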
We capture this by performing a content analysis of all wall posts using Ukrainian and Russian "bag of words" datasets that include 8,863 positive and 24,299 negative words in both Russian and Ukrainian.17

17 For a review of the "bag-of-words" method, as well as alternative approaches to sentiment analysis, see Maragoudakis et al. (2011) and Liu (2012).

In economic analyses, "bag of words" methods have previously been used to study media market segregation (Gentzkow and Shapiro 2010), consumer confidence and political opinion (O'Connor et al. 2010), and economic uncertainty (Baker et al. 2013). Our comments data jointly include about 417 million words. In an automated analysis using the "bag of words" databases, we identified the degrees of negative and positive sentiment by matching the content of the comments with the positive or negative words from our datasets. Of the total number of words used, 29 million (or 7 percent) were identified as positive or negative words. We use the results of the content analysis to construct two variables that capture the overall number of positive and negative words in the content supplied by each oblast. Both variables, in our opinion, are crucial for understanding changes in attitudes and political identification. If we see these variables responding similarly to casualties, this would indicate greater polarization of sentiment. If, on the other hand, these variables move in opposite directions, this would indicate greater convergence in the overall sentiment.

4. Analysis

4.1 Casualties and Online Activism

Before examining the relationship between our measures of fragmentation and polarization on the one hand and violent conflict on the other, we need to establish that online communication in fact responds to war casualties. Hence, we start by testing the mechanism. We define the time variable as the number of months past the start of the 2013 Maidan protests
(which, therefore, is negative before November 2013 and positive after that). In Figure 3, we can clearly see the exponential rise in user activity. In fact, the average number of comments on political posts grew from about a thousand per month in early 2013 to several million in 2014. We organize the data as a time-oblast panel in which, for each day and each oblast, Casualties_it lists the number of soldiers native to oblast i who died in the conflict on day t.18 We first want to see whether the news about lost lives increases the number of comments from residents of the affected oblasts relative to the rest of Ukraine. To test this, we estimate the following fixed-effects model19:

Comments_it = c_i + Σ(k=0..6) β_k · Casualties_i,t−k + γ · after_t + δ · day_t + θ · month_t + λ · TimeTrend_t + e_it    (2)

where Comments_it is the total number of comments left by residents of oblast i at time t, c_i are oblast fixed effects that take care of unobserved heterogeneity among oblasts, day_t and month_t are two sets of time dummies, after_t is a time dummy equal to one after March 18, 2014 (the date the first Ukrainian soldier died defending Crimea), TimeTrend_t is a quadratic time trend, and e_it is the error term.20

Our full-sample results are presented in Table 1, Columns 1-3. We can see that regardless of model specification, casualties heighten the online activity of the affected oblasts' residents: they significantly increase the number of comments. Moreover, the effect appears to last for at least a week, as indicated by the joint significance of the six lags of Casualties in addition to the contemporaneous variable. Depending on the set of controls, the point estimate for the contemporaneous effect ranges from 48 to 212.

One caveat of the regression specified above is that before Maidan, which roughly coincides with the middle of our sample, casualties were zero and the number of political discussions, and therefore comments, was very small. After Maidan, on the other hand, both increased substantially. Hence, despite the fact that we attempted to control for time trends in the data, there is a possibility of finding a spurious relationship. To make sure that our results are not driven by the zero values of the variable before November 2013, we run the same set of fixed-effects regressions while restricting the sample to the post-Maidan period (Columns 4-6).

18 Note that we use the actual dates of casualties rather than their announcement in official media. To allow the news to reach the public, we include time lags. Effectively, we assume that the public becomes aware of the casualties from a wide variety of sources, including unofficial sources, rumors, and word of mouth. We do not want to discount such alternative methods of obtaining information. Moreover, the Ministry of Defense has published daily updates of casualty data, making the lags of official reporting sufficiently uniform. Another potential issue is that many army battalions are formed on a territorial basis, making natives of particular oblasts overrepresented in daily death tolls. The model's province-level fixed effects, however, help account for the presence of territorial battalions. The rotation of forces in the regular army is slow and usually exceeds three months. Since we base our analysis on daily fluctuations, Casualties are, for the most part, exogenous in the model.

19 We used six lags to take into account the cyclicality of weekly work and news schedules (the contemporaneous value plus six lags produces one full week). Our results are robust to adding an extra seventh lag, as can be seen in Appendix A, Table A1.

20 We estimate all regressions in levels. Log-log regressions, although less sensitive to outliers, are not a good fit for our data because the dependent and independent variables in equation (2) equal zero for most of 2013. Table A3 demonstrates, however, that re-estimating the model in logs does not affect our conclusions.
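The oblast fixed effects c_i in equation (2) are removed by the standard within (demeaning) transformation. The sketch below illustrates that estimator on synthetic data with a single regressor; it is only an illustration of the mechanics, since the paper's actual specification adds six lags of Casualties, the after dummy, day and month dummies, and a quadratic trend, and clusters standard errors by oblast.

```python
from collections import defaultdict

def within_estimator(panel):
    """One-regressor fixed-effects (within) estimator.

    panel: list of (unit, x, y) observations. Demeaning x and y within
    each unit removes the unit intercept c_i, so the slope is identified
    only from within-unit variation.
    """
    xs, ys = defaultdict(list), defaultdict(list)
    for unit, x, y in panel:
        xs[unit].append(x)
        ys[unit].append(y)
    num = den = 0.0
    for unit in xs:
        mx = sum(xs[unit]) / len(xs[unit])
        my = sum(ys[unit]) / len(ys[unit])
        for x, y in zip(xs[unit], ys[unit]):
            num += (x - mx) * (y - my)
            den += (x - mx) ** 2
    return num / den

# Synthetic two-oblast panel: y = c_i + 50 * x with very different
# intercepts per oblast (100 vs. 900).
panel = [("A", x, 100 + 50 * x) for x in (0, 1, 2, 3)] + \
        [("B", x, 900 + 50 * x) for x in (0, 2, 4, 6)]
print(within_estimator(panel))  # recovers the common slope, 50.0
```

A pooled OLS on these data would be badly biased by the level difference between the two units; the within transformation discards that between-unit variation, which is exactly what the oblast fixed effects accomplish.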
There is no clear trend in the number of comments, nor is there a trend in war casualties in 2014. Nevertheless, it can be seen from Table 1 that while the point estimates are, as expected, smaller for this subsample, the coefficients remain statistically significant.

[Table 1 about here]

Table 2 re-estimates our model in the regionally divided subsamples. We see that the previous results are driven by users from Eastern-Southern provinces responding to their own casualties, while the results for West-Central provinces are not significant.

[Table 2 about here]

It appears that unlike VK users from Eastern-Southern provinces, residents of West-Central provinces either do not respond at all or are no more sensitive to their home province casualties than to the casualties suffered by other provinces. To examine whether VK users indeed react to casualties from other provinces, in Table 3 we estimate the effect of casualties attributable to all provinces other than province i and define the independent variable as

OtherCasualties_it = Σ_j Casualties_jt − Casualties_it.

Columns 1-3 in Table 3 present different model specifications for the entire sample. Columns 4 and 5 re-estimate the models for the regionally defined subsamples. These show that our expectations about the effect of casualties on online activism hold for the entire sample and for West-Central provinces, but not for the East-South. VK users from East-Southern oblasts do not respond to war casualties from outside their oblasts. While news about war casualties heightens online activism across the entire country, our results suggest that in West-Central provinces users show more solidarity with casualties from other provinces, while the online activism of their East-Southern counterparts is driven primarily by their home province losses.21

[Table 3 about here]

4.2. Assessing Fragmentation

We have already uncovered some non-trivial differences between the West-Central and South-Eastern parts of Ukraine.
Next we proceed to assess how the intensity of armed conflict affects online fragmentation. Prior research indicates that regional fragmentation of online communication was typical during times of electoral political mobilization (Duvanova et al. 2015). First, we would like to test whether Maidan and the war in Eastern Ukraine have changed this and opened up a dialogue between different parts of the country. We examine whether online discussions bring together users from parts of the country historically separated by linguistic and political divides. Did the majority of comments on individual political posts come from either South-Eastern or West-Central Ukraine, or is there a balanced mix of the two, indicating an inclusive dialogue? Figure 5 illustrates how our dyadic cohesion index (1), which measures fragmentation of communication between any two different groups of provinces, evolved over time in 2013 and 2014.22 One common element in the nature of political communication across different groups of provinces is its high volatility before Maidan, which reflects the fact that political discussions were relatively infrequent before the Revolution of Dignity. The volatility subsided significantly for most groups of provinces after the Maidan protests.

21 In a recent experimental study, Bauer et al. (2013) found that individual experiences of war strengthen in-group parochialism but reduce trust in strangers. One possible psychological mechanism accounting for our findings is that information about war casualties may heighten the sense of belonging to a narrowly defined group and alienate out-group members.

22 We take monthly averages to smooth out daily fluctuations in the index, particularly in the first half of the sample, which has generally fewer political posts and comments.
Although on average the level of communication between the West-Central and East-Southern parts of the country remained the same, after Maidan cohesion capacity slightly decreased for the West vs. East pair but improved or stayed the same for all other combinations of provinces. It appears that communication became more integrated and inclusive for all regional pairs that do not include the East. For those that do, it improved only between South and East, the regions that are culturally and historically closest to each other. To test whether these changes are statistically significant, we regress the cohesion index on a Maidan dummy, defined as zero before November 21, 2013 and one after that. The regression results are presented in Table 4. The results in Column 1 confirm that engagement between West-Central and East-Southern Ukraine did not increase after the Maidan protests. At the same time, we find that engagement went up for all other groups of provinces except the West vs. East pair, which experienced a negative but statistically insignificant change. The statistically significant improvements in communication are larger for pairs including Central provinces than for pairs including Southern Ukraine.

[Table 4 about here]

When it comes to Hypotheses 1a and 2a, both find only partial support. We can see that while there is no change in communication between the broad West and broad East, the intensification of the violent conflict clearly increased communication in all regional pairs except West vs. East, reducing intra-regional fragmentation. Our measure of inter-provincial cohesion relies on the geographical decomposition of VK users' comments, which constitute responses to the original messages posted by users residing in a given province, but it ignores the posts and their origins. This might bias our analysis, making it more likely to find null results.
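With a single 0/1 regressor, the Maidan-dummy coefficient in these cohesion regressions is simply the difference between the post- and pre-Maidan means of the index. The sketch below demonstrates this on hypothetical index values (not the paper's data); note that the estimates in Table 4 additionally use Newey-West standard errors for inference, which this point-estimate sketch omits.

```python
def dummy_ols(y, d):
    """OLS of y on a constant and a 0/1 dummy d.

    With a single dummy regressor, the intercept equals mean(y | d == 0)
    and the slope equals mean(y | d == 1) - mean(y | d == 0).
    """
    y0 = [yi for yi, di in zip(y, d) if di == 0]
    y1 = [yi for yi, di in zip(y, d) if di == 1]
    intercept = sum(y0) / len(y0)          # pre-Maidan mean of the index
    slope = sum(y1) / len(y1) - intercept  # post-Maidan shift
    return intercept, slope

# Hypothetical daily cohesion index for one pair of province groups;
# the dummy switches on after November 21, 2013.
cohesion = [0.40, 0.44, 0.42, 0.55, 0.53, 0.57]
maidan = [0, 0, 0, 1, 1, 1]
print(dummy_ols(cohesion, maidan))
```

Here the slope of roughly 0.13 would be read the same way as the Table 4 coefficients: a positive shift means communication between the two groups became more cohesive after Maidan.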
The engagement capacity index described in Section 3.2 takes into account the original posts and measures their capacity to entice responses from outside the originators' home oblasts. In line with Hypothesis 1, we test whether posts from provinces with casualties attract more attention (and perhaps sympathy) from the wider online community. Specification 1, and especially specification 2 with additional controls, in Table 5 clearly rule out the notion that news about oblast-specific losses entices users from other oblasts to engage more readily in cross-provincial communication. We see no, or only marginally significant, effects of casualties on the ability of posts from affected provinces to entice a strong response from other provinces; moreover, the point estimates are negative. Hence, Hypothesis 1a finds no clear support under either the communication cohesion or the engagement capacity measure. Columns 3-6 in Table 5 also rule out both Hypotheses 2a and 2b. We can see that while no effect can be detected in the East, South, and Center subsamples, the West's ability to engage users from other provinces actually diminishes with casualties (the opposite of Hypothesis 2a). This means that online activism on the part of users from the affected provinces is perhaps directed at, and influences, other users from the same province, rather than bridging provincial lines. Coupled with our finding that the West-Center responds more actively to casualties suffered by other provinces, this result suggests that province-specific casualties make VK communication more region-specific rather than fostering a broad, cross-regional dialogue that engages heterogeneous parts of Ukraine. Together, our results show that while war casualties tend to mobilize VK users from the affected provinces (as well as users from West-Central provinces regardless of whether their home province incurred losses), such mobilization does not necessarily lead to cross-provincial communication.
Not only does the analysis fail to detect evidence that casualties increase inter-regional engagement; VK users from the West engage in less dialogue with the rest of the country as casualties go up. As a result of such regional compartmentalization, the ability of users' posts to attract comments from outside their province (and hence "influence" the social network) diminishes.

[Table 5 about here]

4.3. Assessing the polarization hypotheses

Although the above analysis finds that users of VKontakte respond to war casualties with increasing fragmentation of inter-provincial communication, this does not necessarily mean that virtual conversation becomes polarized as well. In our data, some communication continues to be carried on between users from different provinces. It is possible that those users develop bridging cross-regional connections by expressing sympathy and solidarity, or by abstaining from divisive political debates. (The latter interpretation would be consistent with our findings of decreased engagement in the West.) Without knowing whether users agree or argue over the posted content, it is impossible to rule such an interpretation out. In order to differentiate between constructive and destructive conversations, we analyze their content with the help of the positive and negative "bags of words." If news about war casualties increases polarization in online discussions, we should see the rise in negative sentiment (negative-connotation words) go hand in hand with an increase in positive sentiment (and vice versa). If, as Hypothesis 1b postulates, news about war casualties decreases polarization, we should expect the positive and negative word counts to move in opposite directions in response to such news. Table 6 reveals that for Ukraine as a whole, discussions do become more polarized in response to war casualties; both coefficients in Columns 1 and 6 are positive and statistically significant.
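The word-matching step behind these sentiment counts can be sketched with toy lexicons; the actual analysis uses the 8,863 positive and 24,299 negative Ukrainian and Russian words described earlier, so the English placeholder words below are purely illustrative.

```python
# Toy lexicons standing in for the Ukrainian and Russian
# "bag of words" datasets.
POSITIVE = {"peace", "hope", "unity"}
NEGATIVE = {"war", "loss", "enemy"}

def sentiment_counts(comments):
    """Count positive and negative lexicon hits across a list of comments."""
    pos = neg = 0
    for comment in comments:
        for token in comment.lower().split():
            word = token.strip(".,!?;:\"'()")  # drop surrounding punctuation
            if word in POSITIVE:
                pos += 1
            elif word in NEGATIVE:
                neg += 1
    return pos, neg

comments = ["War brings loss.", "Hope for peace and unity!"]
print(sentiment_counts(comments))  # → (3, 2)
```

Aggregating these two counts by oblast and day yields the dependent variables used in the polarization tests: both counts rising together with casualties indicates polarization, while the counts moving in opposite directions indicates convergence.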
Interestingly, this conclusion does not hold for all parts of Ukraine. Only provinces located in the Eastern part of the country experience simultaneous increases in both positive and negative sentiment. Our analysis shows that war casualties have no discernible effect on the intensity of negative and positive sentiment in other parts of the country. These results are inconsistent with Hypothesis 2a. The fact that war casualties appear to be a divisive subject in Eastern provinces is consistent with Hypothesis 2b: as a province located in the East experiences more war fatalities, the overall sentiment of the comments responding to this province's posts becomes more emotionally charged and polarized.

Piecing together the results of our empirical tests, we can rule out the notion that Ukrainians as a whole are becoming more united in the face of the challenges to the country's territorial integrity. The most revealing results come from the analysis of regional subsamples, which show profound differences in the way West and East Ukrainians react to the conflict. East-Southern provinces, including Crimea, which is currently occupied by Russia, and Donetsk and Luhansk, which lie in the active war zone, tend to respond to their own, but not other oblasts', war casualties with greater participation in online discussions. Yet oblast-specific death tolls have no discernible impact on the fragmentation of online discussion in East-Southern Ukraine. This is in sharp contrast to West-Central provinces, which tend to increase their network participation in response to other oblasts' casualties. West-Central oblasts do not respond to their own casualties with greater online activism when compared to casualties from other oblasts. Our results also show that war casualties have a polarizing effect in the East of Ukraine, but not in West-South-Central Ukraine. As a result, Eastern oblasts appear to be increasingly polarized by the conflict.

5. Conclusion

Discussing the ideological battles waged in Russia and Ukraine, Sean Roberts and Robert Orttung (2015) summarized an increasingly popular notion: "In the context of the war, Ukrainians are becoming only more united in their patriotism and opposition to Russia." Using data on the cross-regional structure of politically relevant online communication among users of the VKontakte social networking site, we put this notion to the test. We examined how political online communication among Ukraine's residents responds to violent civil conflict and foreign intervention. Ukraine's geographically concentrated political and ethno-linguistic divisions presented us with a unique opportunity to identify the underlying structure of social cleavages without the use of individual-level data, which are often subject to a host of methodological problems and are rarely available as time series. The paper analyzed a panel of provinces spanning the most active phases of domestic protests and military conflict and evaluated how online activism, inter-provincial cohesion and peer influence (engagement), as well as the overall discourse sentiment, respond to province-specific war casualties.

Our analysis provides little support for the notion that Ukraine's public as a whole becomes more united in response to the violent conflict in the Donbas; for a number of provinces this does not appear to be the case. We found that, generally, war casualties led to increased levels of online activism (measured by the amount of user-contributed content). At the same time, we found no evidence that war casualties affect the level of network cohesion in inter-provincial online communication between West-Central and East-Southern Ukraine. While the intensity of military conflict entices online activism, it mainly activates regional rather than nation-wide connections.
Our analysis suggests that, at least in the sphere of social network communication among VK users, the war separates rather than unites Ukrainians living far apart. Our analysis also revealed nontrivial differences in the ways different parts of the country respond to the violent conflict. While users of social network platforms in East-Southern Ukraine react very little to war casualties from other parts of Ukraine, network users from West-Central provinces respond with increasing activism to war casualties outside their home provinces. Moreover, we found that while war intensity does not affect the degree of fragmentation of network communication in Eastern, Central, and Southern provinces, Western provinces' communication becomes more, rather than less, fragmented, as indicated by the engagement index. Together these results suggest that as the war in the Donbas progresses, Ukrainian virtual society, or at least the part of it that uses the VKontakte platform, becomes more politically galvanized but fragmented along political and ethno-linguistic divisions. We also found that military conflict tends to polarize network discourse in Eastern Ukraine, but not in other provinces. This shows that while Western oblasts' residents' online behavior contributes to the provincial fragmentation of online discussions, Eastern oblasts appear to be increasingly polarized by the conflict. In some sense, the "West" becomes more united as its network communication drifts away from the increasingly polarized "East."

This research makes several contributions. It engages the literature on mass communication and political conflict. Studies of persuasion and public opinion formation suggest that selective exposure and media bias may reinforce partisan preferences and further fragment and polarize the public (Stroud 2010; Durante and Knight 2012). At the same time, media messages may reinforce the sense of in-group (as in the case of Nazi propaganda in Germany [Adena et al.
2015]) and out-group (as in the case of Serbian radio in Croatia [DellaVigna et al. 2014]) identities. In Ukraine, exposure to Russian media has been linked to electoral support for pro-Russian parties (Peisakhin and Rozenas 2015), and pre-war national elections coincided with increased regional fragmentation of social media (Duvanova et al. 2015). Our research suggests that social media might further reinforce selective exposure and, as a result, contribute to political fragmentation and polarization.

Our research also contributes to the study of new, technology-enabled forms of political participation. A growing number of people around the world use social networking platforms not only to consume but also to produce political information. This paper furthers our understanding of digitally enabled forms of political participation and their relation to conventional forms of political conflict. Our research contributes to the emerging field studying the political roles of digital social media. We believe the Ukrainian case is of major significance because it helps highlight the importance of social networks in societies with a deficit of traditional channels of political expression, collective action, and organization.

We also make a methodological contribution. We propose practical methods for integrating big data into the study of mass attitudes and social behavior. As the ongoing digitization of social relations produces large repositories of data, the social sciences face an important task of developing tools and approaches for utilizing these resources. Our research contributes to this task.

Bibliography

Acemoglu, Daron, and James A. Robinson. 2006. Economic Origins of Dictatorship and Democracy. Cambridge and New York: Cambridge University Press.

Adena, Maja, Ruben Enikolopov, Maria Petrova, Veronica Santarosa, and Ekaterina Zhuravskaya. 2015. "Radio and the Rise of the Nazis in Prewar Germany," WZB Discussion Paper, No.
SP II 2013-310r.

Anderson, Benedict. 1982. Imagined Communities: Reflections on the Origin and Spread of Nationalism. London: Verso.

Aslund, Anders, and Michael McFaul. 2006. Revolution in Orange: The Origins of Ukraine's Democratic Breakthrough. Brookings Institution Press.

Baker, Scott R., Nicholas Bloom, and Steven J. Davis. 2013. "Measuring Economic Policy Uncertainty," Working paper, Chicago Booth Research Paper No. 13-02. Available at SSRN: http://ssrn.com/abstract=2198490 or http://dx.doi.org/10.2139/ssrn.2198490

Bauer, Michal, Alessandra Cassar, Julie Chytilová, and Joseph Henrich. 2013. "War's Enduring Effects on the Development of Egalitarian Motivations and In-Group Biases," Psychological Science 25(1): 47-57.

Baum, Matthew. 2012. "Preaching to the Choir or Converting the Flock: Presidential Communication Strategies in the Age of Three Medias." In iPolitics: Citizens, Elections, and Governing in the New Media Era, ed. Richard Fox and Jennifer Ramos. Cambridge University Press, pp. 183-205.

Beissinger, Mark R. 2012. "Russian Civil Societies, Conventional and Virtual," Taiwan Journal of Democracy 8(2): 91-104.

Beissinger, Mark R. 2013. "The Semblance of Democratic Revolution: Coalitions in Ukraine's Orange Revolution," American Political Science Review 107(3): 574-592.

Bennett, Lance W. and Alexandra Segerberg. 2013. The Logic of Connective Action: Digital Media and the Personalization of Contentious Politics. Cambridge University Press.

Birch, Sarah. 2000. Elections and Democratization in Ukraine. Palgrave Macmillan.

Blattman, Christopher and E. Miguel. 2010. "Civil War," Journal of Economic Literature 48(1): 3-57.

Blattman, Christopher. 2009. "From Violence to Voting: War and Political Participation in Uganda," American Political Science Review 103(2): 231-247.

Bond, Robert M., Christopher J. Fariss, Jason J. Jones, Adam D. I. Kramer, Cameron Marlow, Jaime E. Settle and James H. Fowler. 2012.
"A 61-million-person experiment in social influence and political mobilization," Nature 489: 295-298.

Christakis, N. A. and James H. Fowler. 2009. Connected: The Surprising Power of Our Social Networks and How They Shape Our Lives. Little, Brown, and Company.

Clem, Ralph S., and Peter R. Craumer. 2008. "Orange, Blue and White, and Blonde: The Electoral Geography of Ukraine's 2006 and 2007 Rada Elections," Eurasian Geography and Economics 49(2): 127-151.

Davenport, Christian. 2009. Media Bias, Perspective, and State Repression: The Black Panther Party. Cambridge: Cambridge University Press.

DellaVigna, Stefano and Matthew Gentzkow. 2010. "Persuasion: Empirical Evidence," Annual Review of Economics 2: 643-669.

DellaVigna, Stefano, Ruben Enikolopov, Vera Mironova, Maria Petrova and Ekaterina Zhuravskaya. 2014. "Cross-Border Media and Nationalism: Evidence from Serbian Radio in Croatia," American Economic Journal: Applied Economics 6(3): 103-132.

Driscoll, John C. and Aart C. Kraay. 1998. "Consistent Covariance Matrix Estimation with Spatially Dependent Panel Data," Review of Economics and Statistics 80(4): 549-560.

Durante, Ruben and Brian Knight. 2012. "Partisan Control, Media Bias, and Viewer Responses: Evidence from Berlusconi's Italy," Journal of the European Economic Association 10(3): 451-481.

Duvanova, Dinissa, Alexander Semenov, and Alexander Nikolaev. 2015. "Do Social Networks Bridge Political Divides? The Analysis of VKontakte Social Network Communication in Ukraine," Post-Soviet Affairs 31(3): 224-249.

Earl, Jennifer and Katrina Kimport. 2011. Digitally Enabled Social Change: Activism in the Internet Age. Cambridge, MA: MIT Press.

Enikolopov, Ruben, Maria Petrova, and Konstantin Sonin. 2015. "Social Media and Financial Markets: Evidence from Russia." Unpublished.

Enikolopov, Ruben, Maria Petrova and Ekaterina Zhuravskaya. 2011. "Media and Political Persuasion: Evidence from Russia," American Economic Review 101(7): 3253-3285.

Fearon, James D. and David D.
Laitin. 2000. "Violence and the Social Construction of Ethnic Identity," International Organization 54(4): 845-877.

Fearon, James D., and David D. Laitin. 2003. "Ethnicity, Insurgency, and Civil War," American Political Science Review 97(1): 75-90.

Fox, Richard and Jennifer Ramos, eds. 2012. iPolitics: Citizens, Elections, and Governing in the New Media Era. Cambridge University Press.

Gentzkow, Matthew and Jesse M. Shapiro. 2011. "Ideological Segregation Online and Offline," The Quarterly Journal of Economics 126: 1799-1839.

Gentzkow, Matthew and Jesse M. Shapiro. 2010. "What Drives Media Slant? Evidence from US Daily Newspapers," Econometrica 78(1): 35-71.

Godre, S., A.G. Nikolaev, S. Khopkar, and V. Govindaraju. 2015. "Engagement Capacity: A Measure of the Value Created by Users in Online Social Network or Forum Communication," UB technical report.

Hamilton, Allison and Caroline Tolbert. 2012. "Political Engagement and the Internet in the 2008 U.S. Presidential Elections: A Panel Survey." In Digital Media and Political Engagement Worldwide: A Comparative Study, ed. Eva Anduiza, Michael Jensen and Laia Jorba. Cambridge University Press, pp. 56-79.

Howard, Philip and Muzammil Hussain. 2013. Democracy's Fourth Wave? Digital Media and the Arab Spring. Oxford University Press.

Huckfeldt, Robert, and John T. Sprague. 1995. Citizens, Politics, and Social Communication: Influence in an Election Campaign. New York: Cambridge University Press.

Jackson, Matthew O. and Massimo Morelli. 2007. "Political Bias and War," American Economic Review 97(4): 1353-1373.

Jennings, Kent and Vicki Zeitner. 2003. "Internet Use and Civic Engagement: A Longitudinal Analysis," Public Opinion Quarterly 67: 311-334.

Jensen, Michael, James Danziger and Alladi Venkatesh. 2007. "Civil Society and Cyber Society: The Role of the Internet in Community Associations and Democratic Politics," Information Society 23(1): 39-50.
Jensen, Michael, Laia Jorba and Eva Anduiza. 2012. "Introduction." In Digital Media and Political Engagement Worldwide: A Comparative Study, ed. Eva Anduiza, Michael Jensen and Laia Jorba. Cambridge University Press, pp. 1-15.

Kerbel, Matthew. 2009. Netroots: Online Progressives and the Transformation of American Politics. New York: Paradigm Publishers.

Kerbel, Matthew. 2012. "The Dog That Didn't Bark: Obama, Netroots Progressives, and Health Care Reform." In iPolitics: Citizens, Elections, and Governing in the New Media Era, ed. Richard Fox and Jennifer Ramos. Cambridge University Press, pp. 233-258.

Lawless, Jennifer. 2012. "Twitter and Facebook: New Ways for Members of Congress to Send the Same Old Message?" In iPolitics: Citizens, Elections, and Governing in the New Media Era, ed. Richard Fox and Jennifer Ramos. Cambridge University Press, pp. 206-232.

Lim, Merlyna. 2012. "Clicks, Cabs, and Coffee Houses: Social Media and Opposition Movements in Egypt, 2004-2011," Journal of Communication 62(2): 231-248.

Montalvo, Jose G. and Marta Reynal-Querol. 2010. "Ethnic Polarization and the Duration of Civil Wars," Economics of Governance 11(2): 123-143.

Montalvo, Jose G., and Marta Reynal-Querol. 2005. "Ethnic Polarization, Potential Conflict, and Civil Wars," American Economic Review 95(3): 796-816.

Mueller, John. 1970. "Presidential Popularity from Truman to Johnson," American Political Science Review 64(1): 18-34.

Murphy, Kevin M. and Andrei Shleifer. 2004. "Persuasion in Politics," American Economic Review 94(2): 435-439.

Mylonas, Harris. 2012. The Politics of Nation-Building: Making Co-Nationals, Refugees, and Minorities. New York: Cambridge University Press.

Norris, Pippa. 2003. "Preaching to the Converted: Pluralism, Participation, and Party Websites," Party Politics 9(1): 21-45.

O'Connor, B., R. Balasubramanyan, B.R. Routledge, and N.A. Smith. 2010. "From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series." In Proceedings of ICWSM.

Pang, B.
and Lillian Lee. 2008. "Opinion Mining and Sentiment Analysis," Foundations and Trends in Information Retrieval 2(1-2): 1-135.

Peisakhin, Leonid and Arturas Rozenas. 2015. "Persuasion and Dissuasion with Biased Media: Russian Television in Ukraine." Unpublished.

Prior, Markus. 2007. Post-Broadcast Democracy. New York: Cambridge University Press.

Roberts, Sean and Robert Orttung. 2015. "How to Understand the Post-Soviet 'War of Lapels'," The Washington Post, May 8.

Rosenblat, Tanya S. and Markus M. Mobius. 2004. "Getting Closer or Drifting Apart?" Quarterly Journal of Economics 119(3): 971-1009.

Russett, Bruce. 1990. Controlling the Sword: The Democratic Governance of National Security. Cambridge, MA: Harvard University Press, pp. 20-51.

Semenov, A. and J. Veijalainen. 2013. "A Modeling Framework for Social Media Monitoring," International Journal of Web Engineering and Technology 8(3): 217-249.

Semenov, A., J. Veijalainen, and A. Boukhanovsky. 2011. "A Generic Architecture for a Social Network Monitoring and Analysis System." In The 14th International Conference on Network-Based Information Systems, Los Alamitos, CA, USA, pp. 178-185.

Shapley, Lloyd S. 1953. "A Value for n-Person Games." In Contributions to the Theory of Games, Volume II, ed. H.W. Kuhn and A.W. Tucker. Annals of Mathematical Studies v. 28, pp. 307-317. Princeton University Press.

Stroud, Natalie. 2011. Niche News: The Politics of News Choice. Oxford University Press.

Stroud, Natalie Jomini. 2010. "Polarization and Partisan Selective Exposure," Journal of Communication 60(3): 556-576.

Sunstein, Cass. 2001. Republic.com. Princeton: Princeton University Press.

Tang, Min, Laia Jorba and Michael Jensen. 2012. "Digital Media and Political Attitudes in China." In Digital Media and Political Engagement Worldwide: A Comparative Study, ed. Eva Anduiza, Michael Jensen and Laia Jorba. Cambridge University Press, pp. 221-239.

Thies, C. G. 2005.
"War, Rivalry, and State Building in Latin America," American Journal of Political Science 49(3): 451-465.

Tilly, Charles. 1992. Coercion, Capital, and European States, AD 990-1992. Malden, Mass. and Oxford: Blackwell.

Tufekci, Zeynep and Christopher Wilson. 2012. "Social Media and the Decision to Participate in Political Protest: Observations from Tahrir Square," Journal of Communication 62(2): 363-379.

Vitak, Jessica, Paul Zube, Andrew Smock, Caleb T. Carr, Nicole Ellison and Cliff Lampe. 2011. "It's Complicated: Facebook Users' Political Participation in the 2008 Election," Cyberpsychology, Behavior, and Social Networking 14(3): 107-114.

Voors, Maarten J., Eleonora E. M. Nillesen, Philip Verwimp, Erwin H. Bulte, Robert Lensink, and Daan P. Van Soest. 2012. "Violent Conflict and Behavior: A Field Experiment in Burundi," American Economic Review 102(2): 941-964.

Webster, James. 2007. "Diversity of Exposure." In Media Diversity and Localism: Meaning and Metrics, ed. Philip Napoli. Mahwah, NJ: Erlbaum, pp. 309-325.

Wilkinson, Steven I. 2004. Votes and Violence: Ethnic Competition and Ethnic Violence in India. New York: Cambridge University Press.

Zielinski, Jacob, K. M. Slomczynski, and Goldie Shabad. 2008. "Fluid Party Systems, Electoral Rules and Accountability of Legislators in Emerging Democracies: The Case of Ukraine," Party Politics (January): 91-112.

Figure 1. The geographical breakdown of Ukrainian oblasts. W - Western region; C - Central region; S - Southern region; E - Eastern region. The cities of Kyiv and Sevastopol are included in Kyivs'ka oblast and the Autonomous Republic of Krym, respectively. The cities of Luhansk and Donetsk are excluded from the empirical analysis because both are war-ridden cities, currently controlled by the Russian Army and pro-Russian rebels.

Figure 2. Interprovincial Network Ties: VKontakte discussion groups with active contribution during November 2013 (top) and August 2014 (bottom).
Each line represents the intensity of shared communication as the number of discussion groups that both provinces contributed to. For each discussion group, we calculated its share in the total number of messages contributed by each of the provinces. We then added the products of the provinces' contributions (as shares of total messages) to each of the shared topics for the corresponding dyad. The width and color of the lines reflect the strength of inter-provincial ties; thicker, darker lines correspond to stronger ties.

Figure 3. Monthly casualties suffered by Ukraine and the total number of political posts and comments on VK.com.

Figure 4. Engagement Index for the West&Center and South&East oblasts of Ukraine.

Figure 5. Over-time changes in cohesion capacity indices for various groups of Ukrainian provinces. Notes: The vertical axis shows the cohesion index (0; 1). The horizontal axis shows time to Maidan (November 2013), in months. The confidence bands are computed using a non-parametric bootstrap.

Table 1. The Effect of Casualties on the Absolute Number of Comments from the Same Oblast. Oblast-Level Fixed Effects Regressions.

Region: Whole Ukraine in all columns (1)-(6).
Casualties_it:             (1) 212.310*** (74.288); (2) 120.610*** (41.251); (3) 48.148* (26.899); (4) 18.788** (7.019); (5) 16.742** (6.981); (6) 17.286** (7.225)
Sum of lagged casualties:  (2) 755.027*** (252.503); (3) 264.878* (155.727); (4) 57.579*** (19.390); (5) 64.084** (27.702); (6) 48.043* (25.606)
Max lag of casualties:     (2) 114.316*** (40.045); (3) 43.018 (26.283); (4) 12.467** (5.743); (5) 12.902* (6.673); (6) 12.580* (6.684)
Year 2014 dummy_t:         964.557*** (237.415)
2014-only sample:          No in (1)-(3); Yes in (4)-(6)
M/D/W dummies:             Yes in (3), (5), and (6); No otherwise
Daily quadratic time trend: Yes in (6) only
R-squared:                 0.017; 0.059; 0.353; 0.002; 0.130; 0.154
Sample size:               16790; 16652; 16652; 8395; 8395; 8395

Notes: DV: absolute number of comments from an oblast suffering casualties. Daily oblast-level panel data spanning 2013 and 2014 are used for estimation.
M/D/W stands for month, calendar day, and weekday. All specifications include a constant term. All specifications, except specification 1, include six lags of the independent variable in addition to its contemporaneous value. Robust oblast-clustered standard errors are in parentheses.[25] * p<0.10, ** p<0.05, *** p<0.01.

[25] We also estimated our regressions using Driscoll and Kraay (1998) standard errors (Table A2 in Appendix A) and found results to be very similar to those obtained using clustering.

Table 2. The Effect of Casualties on the Number of Comments from the Same Oblast. Regional Analysis. Oblast-Level Fixed Effects Regressions.

Region                      West&Center   South&East
                            (1)           (2)
Casualties_it               11.689        28.424**
                            (9.104)       (8.779)
Sum of lagged casualties    13.690        110.490
                            (16.041)      (50.009)
Max lag of casualties       8.885         24.566**
                            (8.717)       (9.724)
R-square                    0.092         0.405
Sample size                 5840          2555

Notes: DV: Absolute number of comments from an oblast suffering casualties. Daily oblast-level panel data for 2014 only. West, Center, South, and East are the regions of Ukraine, defined according to Figure 1. All specifications include a constant term; daily quadratic time trend; and month, calendar day, and weekday dummies. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table 3. The Effect of Casualties from outside the Oblast on the Number of Comments from a Given Oblast. Oblast-Level Fixed Effects Regressions.
Region                         Whole      Whole      Whole      West&      South&
                               Ukraine    Ukraine    Ukraine    Center     East
                               (1)        (2)        (3)        (4)        (5)
OtherCasualties_it             9.691***   1.523**    1.091***   1.105**    1.060
                               (2.556)    (0.618)    (0.386)    (0.455)    (0.801)
Sum of lagged OtherCasualties  57.332***  3.202      2.376      4.168**    -1.740
                               (15.098)   (2.862)    (1.783)    (1.595)    (4.593)
Max lag of OtherCasualties     8.577***   0.945**    1.240***   1.159***   1.459*
                               (2.251)    (0.364)    (0.321)    (0.383)    (0.639)
2014 only sample               No         Yes        No         No         No
M/D/W dummies                  No         No         Yes        Yes        Yes
Daily quadratic time trend     No         Yes        Yes        Yes        Yes
R-square                       0.075      0.001      0.347      0.238      0.656
Sample size                    16652      8395       16652      11584      5068

Notes: DV: Absolute number of comments from a given oblast. OtherCasualties_it is defined as Σ_i Casualties_it – Casualties_it. Daily oblast-level panel data for 2013 and 2014. M/D/W stands for month, calendar day, and weekday. All specifications include a constant term. All specifications, except specification 1, include six lags of the independent variable in addition to its contemporaneous value. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table 4. Dyadic Communication Cohesion among Various Regions of Ukraine

              West&Center     South     West vs.  West vs.  West vs.  Center vs.  Center vs.
              vs. East&South  vs. East  Center    South     East      South       East
              (1)             (2)       (3)       (4)       (5)       (6)         (7)
Maidan Dummy  0.001           0.067***  0.116***  0.066***  -0.010    0.084***    0.026**
              (0.012)         (0.012)   (0.013)   (0.012)   (0.014)   (0.010)     (0.011)
R-square      0.000           0.062     0.146     0.062     0.001     0.118       0.011
Sample size   730             713       718       705       716       717         721

Notes: Daily time-series data are used for estimation. All regressions include a constant term. Maidan Dummy = 1 after Nov. 2013. West, Center, South, and East are the regions of Ukraine, defined according to Figure 1. Newey-West standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table 5. The Effect of Casualties on the Engagement Index of a Given Oblast. Oblast-Level Fixed Effects Regressions.
Region                      Whole      Whole      West      Center    South     East
                            Ukraine    Ukraine
                            (1)        (2)        (3)       (4)       (5)       (6)
Casualties_it               -1.008     -0.851     -0.060*   -0.047    0.290     0.022
                            (0.596)    (0.547)    (0.033)   (0.049)   (0.313)   (0.050)
Sum of lagged casualties    -6.991*    -5.671     -0.585    -0.510    2.377     0.371
                            (3.902)    (3.480)    (0.401)   (0.348)   (2.523)   (0.127)
Max lag of casualties       -0.958*    -0.721     -0.041    -0.033    0.445     0.082
                            (0.514)    (0.458)    (0.050)   (0.063)   (0.363)   (0.024)
2014 only sample            No         No         Yes       Yes       Yes       No
M/D/W dummies               No         Yes        Yes       Yes       Yes       Yes
Daily quadratic time trend  No         Yes        Yes       Yes       Yes       Yes
R-square                    0.004      0.011      0.403     0.258     0.226     0.410
Sample size                 16629      16629      3650      2190      1825      730

Notes: Daily oblast-level panel data for 2013 and 2014. M/D/W stands for month, calendar day, and weekday. West, Center, South, and East are the regions of Ukraine, defined according to Figure A1. All specifications include a constant term. All specifications, except specification 1, include six lags of the independent variable in addition to its contemporaneous value. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table 6. The Relationship Between Casualties and the Number of Positive/Negative Words Used in Comments. Oblast-Level Fixed Effects Regressions.
                 Negative Words                                          Positive Words
Region           Ukraine   West      Center    South      East          Ukraine   West      Center    South      East
                 (1)       (2)       (3)       (4)        (5)           (6)       (7)       (8)       (9)        (10)
Casualties_t     5.795*    -1.223    10.466    25.155     10.289**      13.872**  -0.996    25.695    45.180     20.991**
                 (3.219)   (1.126)   (9.478)   (18.666)   (3.443)       (6.472)   (1.661)   (21.647)  (27.241)   (6.178)
Sum of lagged    13.013    -2.226    2.696     108.175    45.050**      34.256*   4.370*    33.366    177.465    76.675**
casualties       (13.343)  (2.738)   (8.711)   (110.914)  (14.471)      (19.717)  (2.281)   (34.959)  (157.331)  (26.095)
Max lag of       4.825     1.595     5.659     19.031     13.645***     9.932*    4.125     13.422    29.108     20.676***
casualties       (3.067)   (2.022)   (5.958)   (25.486)   (1.138)       (4.845)   (4.259)   (14.977)  (36.945)   (3.021)
R-square         0.170     0.177     0.139     0.393      0.467         0.166     0.189     0.132     0.381      0.469
Sample size      8395      3650      2190      1825       730           8395      3650      2190      1825       730

Notes: Daily oblast-level panel data for 2014 only. West, Center, South, and East are the regions of Ukraine, defined according to Figure A1. All specifications include a constant term; daily quadratic time trend; and month, calendar day, and weekday dummies. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Appendix A. Robustness Checks for Results Reported in Table 1.

Table A1. The Effect of Casualties on the Absolute Number of Comments from the Same Oblast. Seven-Lags Specification. Oblast-Level Fixed Effects Regressions.
Region                      Whole       Whole       Whole      Whole      Whole     Whole
                            Ukraine     Ukraine     Ukraine    Ukraine    Ukraine   Ukraine
                            (1)         (2)         (3)        (4)        (5)       (6)
Casualties_it               212.310***  108.748***  43.925*    17.424**   15.371**  16.216**
                            (74.288)    (37.669)    (24.754)   (6.606)    (6.655)   (6.923)
Sum of lagged casualties                799.759***  286.719*   64.216***  75.158**  56.730*
                                        (267.633)   (167.543)  (21.359)   (30.659)  (28.023)
Max lag of casualties                   102.926**   38.878     11.163*    11.485*   11.478*
                                        (36.734)    (24.131)   (5.411)    (6.281)   (6.358)
Year 2014 dummy_t                                   961.455***
                                                    (236.023)
2014 only sample            No          No          No         Yes        Yes       Yes
M/D/W dummies               No          No          Yes        No         Yes       Yes
Daily quadratic time trend  No          No          No         No         No        Yes
R-square                    0.017       0.062       0.354      0.002      0.131     0.154
Sample size                 16790       16629       16629      8395       8395      8395

Notes: See notes to Table 1. * p<0.10, ** p<0.05, *** p<0.01.

Table A2. The Effect of Casualties on the Absolute Number of Comments from the Same Oblast. Oblast-Level Fixed Effects Regressions with Driscoll and Kraay (1998) Errors.

Region                      Whole       Whole       Whole        Whole     Whole      Whole
                            Ukraine     Ukraine     Ukraine      Ukraine   Ukraine    Ukraine
                            (1)         (2)         (3)          (4)       (5)        (6)
Casualties_it               212.310***  120.610***  48.148***    18.788**  16.742***  17.286***
                            (33.395)    (19.337)    (8.347)      (8.664)   (5.690)    (6.280)
Sum of lagged casualties                755.027***  264.878***   57.579    64.084***  48.043**
                                        (128.899)   (44.349)     (48.813)  (20.189)   (21.658)
Max lag of casualties                   114.316***  43.018***    12.467    12.902***  12.580**
                                        (25.328)    (9.183)      (8.136)   (4.799)    (5.315)
Year 2014 dummy_t                                   -964.557***
                                                    (35.855)
2014 only sample            No          No          No           Yes       Yes        Yes
M/D/W dummies               No          No          Yes          No        Yes        Yes
Daily quadratic time trend  No          No          No           No        No         Yes
Sample size                 16790       16652       16652        8395      8395       8395

Notes: See notes to Table 1. * p<0.10, ** p<0.05, *** p<0.01.

Table A3. The Effect of Casualties on the Logarithm of the Absolute Number of Comments from the Same Oblast. Log-Log Specification. Oblast-Level Fixed Effects Regressions.
Region                        Whole     Whole     Whole     Whole     Whole     Whole
                              Ukraine   Ukraine   Ukraine   Ukraine   Ukraine   Ukraine
                              (1)       (2)       (3)       (4)       (5)       (6)
log Casualties_it             1.588***  0.800***  0.213*    0.090***  0.051*    0.054*
                              (0.307)   (0.156)   (0.108)   (0.020)   (0.026)   (0.028)
Sum of lagged log casualties            5.283***  1.271*    0.387***  0.251**   0.149
                                        (1.003)   (0.666)   (0.071)   (0.112)   (0.109)
Max lag of log casualties               0.783***  0.195*    0.066***  0.057***  0.036
                                        (0.151)   (0.103)   (0.015)   (0.020)   (0.021)
Year 2014 dummy_t                                 0.553***
                                                  (0.081)
2014 only sample              No        No        No        Yes       Yes       Yes
M/D/W dummies                 No        No        Yes       No        Yes       Yes
Daily quadratic time trend    No        No        No        No        No        Yes
R-square                      0.031     0.102     0.619     0.005     0.231     0.285
Sample size                   16790     16652     16652     8395      8395      8395

Notes: See notes to Table 1. Both dependent and independent variables are in logs. To deal with the zero values of the number of comments and casualties, we added 1000 to the number of comments, and added 10 to the number of casualties, before taking logs. * p<0.10, ** p<0.05, *** p<0.01.

Appendix B. Weekly Data Analysis

Table B1. The Effect of Casualties on the Absolute Number of Comments from the Same Oblast. Weekly Oblast-Level Fixed Effects Regressions.

Region                      Whole       Whole        Whole     Whole     Whole
                            Ukraine     Ukraine      Ukraine   Ukraine   Ukraine
                            (1)         (2)          (3)       (4)       (5)
Casualties_it               745.045***  274.633      69.596**  63.555*   57.967*
                            (258.912)   (164.483)    (27.227)  (31.479)  (30.733)
Year 2014 dummy_t                       6663.461***
                                        (1649.563)
2014 only sample            No          No           Yes       Yes       Yes
Month dummies               No          Yes          No        Yes       Yes
Daily quadratic time trend  No          No           No        No        Yes
R-square                    0.062       0.365        0.003     0.165     0.191
Sample size                 2415        2415         1196      1196      1196

Notes: DV: Absolute number of comments from an oblast suffering casualties. Weekly oblast-level panel data spanning 2013 and 2014 are used for estimation. All specifications include a constant term. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table B2. The Effect of Casualties on the Number of Comments from the Same Oblast. Regional Analysis.
Weekly Oblast-Level Fixed Effects Regressions.

Region          West&Center   South&East
                (1)           (2)
Casualties_it   30.831        101.463*
                (30.324)      (51.918)
R-square        0.109         0.505
Sample size     832           364

Notes: DV: Absolute number of comments from an oblast suffering casualties. Weekly oblast-level panel data for 2014 only. West, Center, South, and East are the regions of Ukraine, defined according to Figure 1. All specifications include a constant term; quadratic time trend; and month dummies. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table B3. The Effect of Casualties from outside the Oblast on the Number of Comments from a Given Oblast. Weekly Oblast-Level Fixed Effects Regressions.

Region               Whole      Whole     Whole        West&       South&
                     Ukraine    Ukraine   Ukraine      Center      East
                     (1)        (2)       (3)          (4)         (5)
OtherCasualties_it   56.655***  4.189     3.762*       5.084**     0.733
                     (14.976)   (2.919)   (1.858)      (1.795)     (4.739)
Year 2014 dummy_t                         6873.703***  5490.030**  1.0e+04***
                                          (1768.482)   (2336.605)  (2047.761)
2014 only sample     No         Yes       No           No          No
Month dummy          No         No        Yes          Yes         Yes
R-square             0.079      0.002     0.358        0.244       0.681
Sample size          2415       1196      2415         1680        735

Notes: DV: Absolute number of comments from a given oblast. OtherCasualties_it is defined as Σ_i Casualties_it – Casualties_it. Weekly oblast-level panel data for 2013 and 2014. All specifications include a constant term. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table B4. The Effect of Casualties on the Engagement Index of a Given Oblast. Weekly Oblast-Level Fixed Effects Regressions.
Region                Whole     Whole     West      Center    South     East
                      Ukraine   Ukraine
                      (1)       (2)       (3)       (4)       (5)       (6)
Casualties_it         -0.970*   -0.777    -0.086    -0.105    0.356     0.024
                      (0.537)   (0.471)   (0.049)   (0.066)   (0.386)   (0.054)
2014 only sample      No        No        Yes       Yes       Yes       Yes
Month dummies         No        Yes       Yes       Yes       Yes       Yes
Quadratic time trend  No        Yes       Yes       Yes       Yes       Yes
R-square              0.013     0.027     0.476     0.292     0.285     0.533
Sample size           2415      2415      520       312       260       104

Notes: Weekly oblast-level panel data for 2013 and 2014. West, Center, South, and East are the regions of Ukraine, defined according to Figure 1. All specifications include a constant term. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table B5. The Relationship Between Casualties and the Number of Positive/Negative Words Used in Comments. Weekly Oblast-Level Fixed Effects Regressions.

                Negative Words                                     Positive Words
Region          Ukraine   West      Center    South      East      Ukraine   West      Center    South      East
                (1)       (2)       (3)       (4)        (5)       (6)       (7)       (8)       (9)        (10)
Casualties_t    16.345    -3.721    19.768    108.554    39.394    38.835*   0.449     77.928    159.192    61.182***
                (14.057)  (4.559)   (20.088)  (123.056)  (32.874)  (22.559)  (5.299)   (59.922)  (169.949)  (23.046)
R-square        0.223     0.236     0.159     0.513      0.637     0.212     0.241     0.147     0.499      0.618
Sample size     1196      520       312       260        104       1196      520       312       260        104

Notes: Weekly oblast-level panel data for 2014 only. West, Center, South, and East are the regions of Ukraine, defined according to Figure 1. All specifications include a constant term, quadratic time trend, and month dummies. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.
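The dyadic tie-strength measure described in the Figure 2 caption (each province's share of its messages going to each discussion group, summed as products over shared groups for every province pair) can be sketched as follows. This is a minimal reading of that description, not the authors' code; all function names and the toy province/group labels are hypothetical.

```python
from collections import defaultdict

def tie_strength(messages):
    """messages: dict mapping (province, group) -> message count.
    Returns dict mapping frozenset({a, b}) -> dyadic tie strength,
    computed as sum over groups of the product of the two provinces'
    message shares in that group."""
    totals = defaultdict(int)                     # total messages per province
    for (prov, _), n in messages.items():
        totals[prov] += n
    # share of each province's own messages that went to each group
    share = {(p, g): n / totals[p] for (p, g), n in messages.items()}
    provinces = sorted(totals)
    groups = {g for (_, g) in messages}
    ties = {}
    for i, a in enumerate(provinces):
        for b in provinces[i + 1:]:
            ties[frozenset({a, b})] = sum(
                share.get((a, g), 0.0) * share.get((b, g), 0.0) for g in groups
            )
    return ties

# toy example with two provinces and two discussion groups
msgs = {("Lviv", "g1"): 30, ("Lviv", "g2"): 70,
        ("Kharkiv", "g1"): 50, ("Kharkiv", "g2"): 50}
print(tie_strength(msgs))   # tie = 0.3*0.5 + 0.7*0.5 = 0.5
```

Under this reading the measure is symmetric in the dyad and bounded by 1 when both provinces concentrate all messages in the same group.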
Omega

A subjective evidence model for influence maximization in social networks

Mohammadreza Samadi a, Alexander Nikolaev a,*, Rakesh Nagi b
a Department of Industrial and Systems Engineering, University at Buffalo (SUNY), Buffalo, NY 14260, United States
b Department of Industrial and Enterprise Systems Engineering, The University of Illinois at Urbana-Champaign, IL 61801, United States

Article history: Received 6 October 2014; accepted 30 June 2015.

Abstract. This paper introduces the notion of subjective evidence, which fuels a new parallel cascade influence propagation model. The model sheds light on the phenomena of belief reinforcement and the viral spread of innovations, rumors, opinions, etc., in social networks. Network actors are assumed to be testing a Bayesian hypothesis, e.g., for making judgments about the superiority of some product(s) or service(s) over others, or the (dis)utility of a given program/policy. The model-based influence maximization solutions inform strategies for market niche selection and protection, and for the identification of susceptible groups in political campaigning. The NP-hard problem of influential seed selection is first solved as a mixed-integer program. Second, an efficient Lagrangian Relaxation heuristic with guaranteed bounds is presented.
In small, medium, and large-scale computational investigations, we analyze: (1) how the success of an influence cascade triggered in a (sub)community, long exposed to an opposite belief, depends on the structural properties of the underlying social network; (2) to what extent growing (increasing the density of) a consumer network within a market niche helps a company protect the niche; (3) given a competitor's strength, when a company should counter the competitor on "their turf", and when and how it should look for limited-time opportunities to maximally profit before eventually surrendering the market. © 2015 Elsevier Ltd. All rights reserved.

Keywords: Influence maximization; Social networks; Bayesian inference; Evidence; Seed selection

☆ This manuscript was processed by Associate Editor Prokopyev.
* Corresponding author. E-mail addresses: [email protected] (M. Samadi), [email protected] (A. Nikolaev), [email protected] (R. Nagi).

1. Introduction and motivation

People tend to view product recommendations received from friends or through friends more favorably than advertisements offered by commercial mass media channels [15,41]. Social connections enable the propagation of ideas, judgments, and opinions; the phenomenon where knowledge transfer between individuals significantly affects their decisions about purchasing a product is known as social influence/contagion [71,63,14]. Social influence and the diffusion of innovations in social networks have mainly been explored in managerial and sociological studies [75,1,2]. However, the need for simulating information diffusion/peer influence in social networks and solving optimization problems to algorithmically find potent cascade initiation strategies led to the introduction of the Influence Maximization (IM) problem. The objective of the IM problem is to find such early starters, termed seeds, for influence spread in a social network that will direct information transfer so as to achieve a desired
impact on the expected product adoption, or people's decisions/judgments/opinions with respect to a query of interest [45,16]. Early mathematical formulations of the IM problem in social networks view social ties as indicators of dyadic dependence, where a random graph or Markov random field-based approach is a natural choice for model design [24,59]. More recent literature on the algorithmic analysis of influence spread has been dominated by diffusion-based models [45], in which ties are viewed as information flow channels. The Independent Cascade (IC) and Linear Threshold (LT) models are the most notable ones, both allowing for elegant discrete optimization problem statements; these models also provided the basis for a streak of subsequent studies [46,36,72,23]. Application-wise, diffusion models have been found suitable for research studies in marketing [5,15] and health care [60]. However, algorithmic investigations to date have failed to culminate in significant managerial insights and strategies. This is in part due to the fact that existing models do not specify the medium and nature of influence flow through a network, i.e., they fail to explain the diffusion of what leads to social influence, and how it does so. This paper takes a previously unexplored approach to modeling the spread of competitive influence in social networks, rooted in Bayesian inference theory and focused on the propagation of evidence.

Please cite this article as: Samadi M, et al. A subjective evidence model for influence maximization in social networks. Omega (2015), http://dx.doi.org/10.1016/j.omega.2015.06.014

Bayesian inference logic helps quantify social influence under the premise that people treat new information as evidence and update their beliefs in support of or against the null hypothesis.
In this approach, network nodes represent intelligent agents (actors) who seek to form judgments about a product/query by testing a relevant hypothesis (e.g., that a particular claim is true), based on their prior beliefs as well as the knowledge acquired through friends. A node's decision to significantly favor the null hypothesis signals the node's "positive activation"; significantly favoring the alternative implies "negative activation"; finally, whenever the collected evidence is inconclusive, the node is labeled "inactive". This paper presents a Parallel Cascade (PC) diffusion framework for modeling evidence spread through social networks. The flow of information in this PC model is classified as parallel duplication in the typology of flow processes on social networks introduced by Borgatti [11], which supports the idea of belief reinforcement through subjective evidence duplication in social communication. The paper reports insightful observations, e.g., pertinent to the identification of penetrable market niches and convenient points of initial influence for conquering new market segments, obtained from solving basic instances of the PC model-based IM (PCIM) problem. The paper develops problem-specific optimization schemes for handling medium and large-scale instances of the PCIM problem, formulated as a mixed-integer program. The rest of this paper is organized as follows. Section 2 reviews the literature on diffusion models for IM. Section 3 formally introduces the PC diffusion model, formulates the PCIM problem, and discusses its application to two empirical case studies. Offering a more computationally efficient approach to the problem, Section 4 presents a Lagrangian Relaxation heuristic tool suite, with solution quality guarantees achieved via two problem-specific heuristics for finding lower bounds for the PCIM problem optima. Section 5 reports on the conducted experimental studies.
Section 6 summarizes the findings, discusses the potential applications of the proposed methods, and outlines future research directions. The paper contains two appendices: Appendix A presents the NP-hardness proof for the PCIM problem; Appendix B details the Subgradient Search algorithm for finding an upper bound for the PCIM problem optima.

2. The landscape of the social influence research domain

The concept of word-of-mouth received attention as early as the 1940s as an effective way to diffuse information (e.g., about new products) and soon became a coined term in experimental marketing research [53,76,44]. Models of information diffusion over networks, also first introduced in the marketing field, were developed more recently and found use in health care [55], sociology [50,69], and politics [20]. From an experimental point of view, the phenomenon of social contagion is known to be a significant factor affecting the strength of diffusion processes in social networks [52,1,58,3,62,4]. The investigations into the impact of influential people, or opinion leaders, on cascade formation comprise a large part of the literature. Opinion leaders are defined as the individuals that have the ability to strongly affect the opinions or decisions of their network peers [79]. While some studies downplay the value/power of opinion leaders for social cascade progression [7,74], most authors see opinion leader presence as a critical facilitator for cascade emergence [49,68,30,42,70]. Hinz et al. [41] experimentally showed that a wisely selected group of opinion leaders can increase the influence spread rate in a cascade up to eight times. Yet, two questions remain unanswered: How can one select the appropriate opinion leaders for maximizing the spread of influence in a social network? And how does this selection depend on the social network structure?
While the literature reviewed above is more concerned with exploring the mechanisms of successful cascade propagation, it does not provide a readily available method/solution/strategy (for a company or a political party) to artificially create a cascade in support of a product or opinion by recruiting the "best" opinion leaders. The latter objective, however, may be highly sought-after by research-aware practitioners. The first organized efforts to identify influential nodes in social networks relied on centrality-based heuristics [12]. The degree centrality heuristic assumes that any node with a large number of direct connections (called a hub) must be highly influential in a social network. The distance centrality heuristic, on the other hand, considers a node influential if it has short paths to other nodes in the network (called a bridge) [73,41]. The centrality-based heuristics, however, provide no quality guarantee for the solution of IM problems with multiple required seeds. To formulate an algorithmic approach to finding influential node sets, the term "influence maximization" was coined by Domingos and Richardson [24]. While the first attempts to address the IM problem employed a Markov random field approach [24,59], Kempe et al. [45] were the first to re-frame it as a discrete optimization problem. The Independent Cascade (IC) model and the Linear Threshold (LT) model, proposed by Kempe et al. [45], are the most well-known diffusion models for IM; the optimization problems based on these models are NP-hard [72,17]. Kempe et al. [45] discovered a submodularity property of the IM objective function and presented a greedy seed selection algorithm with guaranteed, albeit loose, optimality bounds.
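The greedy scheme referenced above can be illustrated in a few lines: repeatedly add the node with the largest estimated marginal gain in expected spread, where the spread is estimated by Monte Carlo simulation of the Independent Cascade model. This is a minimal sketch of the general technique, not the implementation of Kempe et al.; the uniform propagation probability `p`, the trial count, and the function names are our assumptions.

```python
import random

def ic_spread(graph, seeds, p=0.1, trials=200, rng=random):
    """Estimate the expected number of activated nodes under the
    Independent Cascade model; graph: dict node -> list of out-neighbors."""
    total = 0
    for _ in range(trials):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in graph.get(u, []):
                    # each newly active node gets one chance to activate v
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / trials

def greedy_seeds(graph, k, p=0.1, trials=200, seed=0):
    """Greedy seed selection: k times, add the node whose inclusion
    yields the largest estimated marginal gain in expected spread."""
    rng = random.Random(seed)
    seeds = []
    for _ in range(k):
        best, best_gain = None, -1.0
        base = ic_spread(graph, seeds, p, trials, rng)
        for v in graph:
            if v in seeds:
                continue
            gain = ic_spread(graph, seeds + [v], p, trials, rng) - base
            if gain > best_gain:
                best, best_gain = v, gain
        seeds.append(best)
    return seeds

# on a star graph, the hub is the natural first seed
g = {0: [1, 2, 3], 1: [], 2: [], 3: []}
print(greedy_seeds(g, 1))   # [0]
```

The submodularity of the expected-spread objective is exactly what gives this greedy procedure its (1 − 1/e)-type guarantee in the cited work.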
The problem with assuming submodularity lies in the fact that under it, the marginal gain of adding new seeds must be decreasing, which does not support the idea of fixed threshold effects and reveals a manifest shortcoming in the respective diffusion models [37,50]. Furthermore, the original greedy algorithm and even its extensions were found overly demanding computationally [47,36,16,15,72]. A separate branch of literature has explored social influence from the empirical data mining perspective. As discovered by Aral et al. [2], a financially viable cascade initiation requires the selection (buy-in) of no more than 0.2% of the nodes in a network. This finding underlines the value of precise seed selection algorithms that can ensure a desirable cost/returns ratio in cascade seeding. However, one observes a gap between the literature based on data-driven studies and algorithmic research. The latter efforts, unfortunately, have often focused on computational investigations in impractical settings and failed to produce managerial insights. The present paper serves as a bridge between these two research thrusts. It designs a realistic diffusion model, strongly supported by mathematical sociology findings, and solves the seed selection problem optimally over real social network datasets, and hence paves the way to rigorously explore strategic decision-making in social networks. Note that the original IM problem formulation was concerned with maximizing the expected number of activated nodes at the end of the diffusion process, when the activation status of all the nodes becomes fixed, irrespective of the sequence and timing of node activation. However, in many practical IM applications, an influence campaign has a predefined time window over which it has to achieve the maximum possible effect. Only recently, Goyal et al.
[35] addressed the issue of the unconstrained time horizon for IM and introduced the MINTIME problem, where the objective is to minimize the time until the activation of a predefined number of nodes. Note also that most IC and LT model-based seed selection problems have ignored the aspect of activation timing; furthermore, they have assumed no competition. Meanwhile, the existing literature confirms the co-existence of (competing) opinions in real-world decision-making settings [8,9]. Chen et al. [13] were the first to recognize this issue by introducing an IC model that allows negative influence to impede the spread of positive influence. In summary, diffusion-based IM problems have received attention from the research community for finding influential nodes in social networks and have been confirmed to be useful for treating real-world problems. However, the current diffusion models for IM problems have not been able to produce many managerial insights, in part due to the underspecification of influence flow mechanisms. The present work proposes a mathematical framework for finding exact solutions to seed selection problems, which allows one to more rigorously explore the structure of such optimal solutions.

3. Bayesian inference logic in influence maximization

This section formalizes the Parallel Cascade (PC) model for the diffusion of subjective evidence through social networks, provides the mathematical model for solving the problem of identifying the best positive seeds, and lastly, illustrates the use of the presented PC model with two case studies. The PC model views a node's adoption of a product or opinion, called activation, as a decision-making process based on continuously collected evidence.
It has been established that human decision-making can be modeled as Bayesian inference with high precision whenever the alternatives are given as mutually exclusive hypotheses [66]. These human subject experiments confirm that the collection rate and the impact of evidence on the human decision-making process can be studied empirically, and hence, the models presented in this paper can be specified based on real-world data. Naturally, the current paper posits that people use their initial preferences and beliefs, as well as the incoming information they receive from peers, to decide whether a given null hypothesis (H0) or the opposite alternative hypothesis (H1) is more likely to be true. It is assumed that, based on the evidence accumulated and processed through Bayesian updates, a decision-maker may turn from an observer into a supporter of the hypothesis that convincingly appears more likely to them at a particular point in time; the described transition is defined by thresholds (on the evidence scale), as done in a great deal of sociology literature [50,69,80]. Bayesian inference logic uses Bayes' rule to update beliefs in such hypothesis testing (i.e., to update the probability that a particular hypothesis is true) as new evidence is received and processed; here, evidence is an objective quantity that evaluates the new information regarding a hypothesis, e.g., as a result of observing a new fact. However, in reality, beliefs are not necessarily updated based on such facts. When the source of incoming information is not given (not traceable or forgotten), people still treat the information (supposedly new to them) as evidence, which we term subjective, and update their beliefs [18,33]. Fig. 1 demonstrates the effect of subjective evidence spread on updating beliefs in social networks. The PC model views positive activation as the event when an actor begins to significantly favor one hypothesis over the other.
Once a network node becomes active, it begins to deliver messages in support of its favored hypothesis to its connected peers. The evidence accumulation can be mathematically expressed by using the "Odds" function (O), defined as the probability that "H0 is true" divided by the probability that "H0 is false". Taking the logarithm of the Odds leads to an additive evidence function. The evidence function for H0 is given as

  e(H_0 \mid Rd) = 10 \log_{10} O(H_0 \mid Rd) = e(H_0 \mid R) + 10 \log_{10} \frac{P(d \mid H_0 R)}{P(d \mid H_1 R)},   (1)

where R is the prior knowledge of the null hypothesis (before the evidence diffusion begins) and d is one signal (a piece of new information) that supports the null hypothesis [43]. Thus, the evidence function combines the prior evidence and the observed evidence. With no prior information (data) available, equal probabilities are typically assigned to the null and alternative hypotheses.

Fig. 1. Belief reinforcement through subjective evidence spread in a social network. Consider a triplet of actors who are testing the same hypothesis, e.g., that a new phone service is reliable. Suppose node 1 observes a fact supporting the hypothesis, e.g., using the new phone service for a month and experiencing few dropped calls, and presents their impression to nodes 2 and 3. Both nodes 2 and 3 update their beliefs about the hypothesis, and then node 2 shares the absorbed information with node 3 without providing the source of the information. Node 3 captures the information (in fact, the rumor) from node 2, treats it as if it provides new evidence supporting the hypothesis, and updates its belief again. This process shows how a person's belief about a hypothesis can be reinforced multiple times as a result of a single external test/fact. In social networks, edges serve as channels that permit evidence duplication, and hence can enable (unfounded) belief reinforcement.
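A small numerical sketch of the decibel-scale evidence function in Eq. (1), together with its additive accumulation over a sequence of signals; the likelihood values here are made up purely for illustration, and the function names are ours.

```python
import math

def evidence_db(p_d_given_h0, p_d_given_h1):
    """Evidence (in decibels) contributed by one signal d, per Eq. (1):
    10 * log10 of the likelihood ratio P(d|H0)/P(d|H1)."""
    return 10 * math.log10(p_d_given_h0 / p_d_given_h1)

def accumulate(prior_db, signals):
    """Add up the decibel contributions of a sequence of signals
    (pairs of likelihoods under H0 and H1), starting from the
    prior evidence e(H0|R); log-odds make the update additive."""
    return prior_db + sum(evidence_db(p0, p1) for p0, p1 in signals)

# no prior information: equal prior probabilities => 0 dB of prior evidence
prior = 10 * math.log10(0.5 / 0.5)                 # 0.0
# three hypothetical signals, each twice as likely under H0 as under H1
e = accumulate(prior, [(0.8, 0.4)] * 3)
print(round(e, 3))   # 3 * 10*log10(2) ≈ 9.031 dB in favor of H0
```

Because the evidence is additive on the log-odds scale, identical positive and negative signals simply contribute fixed increments, which is what makes the constant per-signal increments of the PC model possible.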
When a sequence of multiple signals (data) D is received and processed, the updated evidence is given as

  e(H_0 \mid RD) = 10 \log_{10} O(H_0 \mid RD) = e(H_0 \mid R) + 10 \sum_i \log_{10} \frac{P(d_i \mid H_0 R)}{P(d_i \mid H_1 R)}.   (2)

The increment of the positive evidence (e^+) resulting from a single observation supporting the null hypothesis (d), and the increment of the negative evidence (e^-) resulting from a single observation supporting the alternative hypothesis (d'), are respectively given by

  e^+ = e(H_0 \mid d) = 10 \left[ \log_{10} P(d \mid H_0 R) - \log_{10} P(d \mid H_1 R) \right],   (3)

  e^- = e(H_1 \mid d') = 10 \left[ \log_{10} P(d' \mid H_1 R) - \log_{10} P(d' \mid H_0 R) \right].   (4)

Therefore, upon collecting and processing multiple observations D with n^+ positive and n^- negative signals, the evidence supporting H0 becomes

  e(H_0 \mid DR) = e(H_0 \mid R) + e^+ n^+ - e^- n^-.   (5)

The evidence increment values (e^+ and e^-) are used as parameters in the PC model.

3.1. The parallel cascade diffusion model

Define an influence graph as a directed graph G = (N, A), with a set of nodes N and a set of arcs A. Let the sets of positive and negative seeds, i.e., the initial sets of evidence propagators, be denoted by S^+ and S^-, respectively. Note that the notion of "positivity" of evidence is arbitrary: the hypothesis that postulates a claim preferred by the grand policy-maker will hereafter be viewed as positive, hence the distinction between positive and negative evidence. For each node i ∈ N, let θ_i^+ ≥ 0 (θ_i^- ≥ 0) denote a positive (negative) threshold for the evidence that a node must accumulate, in support of (against) the null hypothesis, to become positively (negatively) activated. In a given problem, θ^+ and θ^- can be set using Bayesian logic: these values should reflect the desired levels of assurance for a node not to make a mistake (about the product/query) when it gets positively or negatively activated [43].
Omega (2015), http://dx.doi.org/10.1016/j.omega.2015.06.014

At each discrete time period $t = 0, 1, 2, \ldots, T$, let $L_{it} \ge 0$ and $K_{it} \ge 0$ denote the cumulative levels of positive and negative evidence for node $i$, respectively. Node $i$ is said to be positively (negatively) activated at period $t$ if and only if $L_{it} - K_{it} \ge \theta_i^+$ ($K_{it} - L_{it} \ge \theta_i^-$); otherwise, it maintains the inactive status. Let $L_{i0} = \theta_i^+$ and $K_{i0} = 0$ denote the initial evidence levels of node $i \in S^+$ at time $t = 0$. Equivalently, $L_{j0} = 0$ and $K_{j0} = \theta_j^-$ denote the initial levels of evidence for node $j \in S^-$ at time period $t = 0$. Each node is assumed to accumulate evidence incoming from its activated neighbors, regardless of its own activation status. For each node $i \in N$, the nodes that have arcs toward (coming from) $i$ are termed in-neighbors (out-neighbors) of $i$; $N_{in}(i)$ and $N_{out}(i)$ are the sets of in-neighbors and out-neighbors of $i$, respectively. At time period $t = 0, 1, \ldots, T$, let $N_t^+(i) \subseteq N_{in}(i)$ ($N_t^-(i) \subseteq N_{in}(i)$) denote the set of positively (negatively) activated in-neighbors of $i$. A node $p \in N_t^+(i)$ sends positive feedback (positive evidence) toward $i$, and a node $n \in N_t^-(i)$ provides negative feedback (negative evidence) for $i$. The numerical values of the positive and negative evidence provided by node $i \in N$ at time $t > 0$ to its out-neighbors are denoted by $E_{it}^+ \ge 0$ and $E_{it}^- \ge 0$, respectively. If node $i$ is positively activated at time $t$, $E_{it}^-$ is zero; if it is negatively activated at time $t$, $E_{it}^+$ is zero; finally, when $i$ is inactive at time $t$, both $E_{it}^-$ and $E_{it}^+$ are zero. The evidence value provided by a node to its out-neighbors in the time period immediately following positive (negative) activation is given by $e^+$ ($e^-$). Note that $e^-$ is defined as the absolute value of the negative evidence calculated using Bayesian logic (i.e., $e^- > 0$).
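For concreteness, the deciban bookkeeping of Eqs. (1)-(5) can be sketched in a few lines of Python. This is an illustrative sketch: the function names and the example likelihoods are ours, not the paper's.

```python
import math

def evidence_increment(p_h0: float, p_h1: float) -> float:
    """Evidence increment, in decibans, contributed by one signal that has
    likelihood p_h0 under H0 and p_h1 under H1, as in Eqs. (3)-(4)."""
    return 10.0 * (math.log10(p_h0) - math.log10(p_h1))

def cumulative_evidence(e_prior, e_pos, n_pos, e_neg, n_neg):
    """Total evidence for H0 after n_pos positive and n_neg negative
    signals, as in Eq. (5)."""
    return e_prior + e_pos * n_pos - e_neg * n_neg

# A signal four times as likely under H0 as under H1 adds about +6 decibans:
e_pos = evidence_increment(0.8, 0.2)

# With a flat prior (e(H0|R) = 0), three positive signals of strength 6 and
# one negative signal of strength 2 leave 16 decibans in favor of H0:
total = cumulative_evidence(0.0, 6.0, 3, 2.0, 1)
```

The additivity in log-odds space is what makes repeated rumors dangerous: each re-told copy of the same underlying fact enters the sum as if it were fresh evidence.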
At the end of each time period, each node updates its cumulative evidence levels (positive and negative) by adding the newly received evidence to the current levels and, possibly, updates its activation status (to be used in the next time period). Once an activated node loses its activation (becomes inactive), its ability to propagate evidence is immediately revoked. Note that node activation does not have to be followed by an action (e.g., a product purchase): the specific application of the model dictates the appropriate assumption in this regard [9]. In order to realistically capture the effects of information transfer and evidence accumulation in social networks, two decay factors are incorporated into the presented evidence propagation model, one pertaining to evidence provision and the other to evidence collection and processing. The value of positive (negative) evidence provided by activated nodes decreases by the factor $\alpha^+$ ($\alpha^-$) as time passes from the last positive (negative) activation. As a result, the effect of the transferred information in updating the evidence levels of out-neighbors diminishes over time. Furthermore, a "forgetfulness" rate $\beta^+$ ($\beta^-$) is introduced into the PC model to allow nodes to forget (discount as old) a part of the positive (negative) evidence they previously collected. The forgetfulness rate, which has been well studied in the marketing literature [51,10], causes recently observed evidence to make a greater contribution to the decision-making process. Also, with time, the nodes become indifferent to the query, as often occurs in practice. Fig. 2 illustrates the dynamics of PC model-driven evidence propagation over a small network. Sets $S^+$ and $S^-$ include the influence graph nodes that are positively and negatively activated, respectively, at $t = 0$; the set $S^-$ is given, while the nodes in $S^+$ are to be selected by the decision-maker solving the IM problem.
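The period-by-period dynamics just described can be sketched as one synchronous update. This is an illustrative Python sketch under our own naming; it mirrors the accumulation, threshold, decay and forgetfulness rules described above, not the paper's exact implementation.

```python
def pc_step(G_in, L, K, E_pos, E_neg, th_pos, th_neg,
            e_pos, e_neg, a_pos, a_neg, b_pos, b_neg):
    """One synchronous period of the PC diffusion model (sketch).
    G_in[i] lists the in-neighbors of node i; L/K hold cumulative
    positive/negative evidence; E_pos/E_neg hold the evidence each node
    currently offers its out-neighbors."""
    n = len(G_in)
    # 1. Accumulate incoming evidence, discounting the old evidence by the
    #    forgetfulness rates beta+/beta-.
    newL = [b_pos * L[i] + sum(E_pos[j] for j in G_in[i]) for i in range(n)]
    newK = [b_neg * K[i] + sum(E_neg[j] for j in G_in[i]) for i in range(n)]
    # 2. Activation status follows from the net evidence and the thresholds.
    pos = [newL[i] - newK[i] >= th_pos[i] for i in range(n)]
    neg = [newK[i] - newL[i] >= th_neg[i] for i in range(n)]
    # 3. Evidence offered in the next period: full e+/e- right after
    #    activation, then decayed by alpha+/alpha-; zero when inactive.
    newEp = [0.0] * n
    newEn = [0.0] * n
    for i in range(n):
        if pos[i]:
            newEp[i] = a_pos * E_pos[i] if E_pos[i] > 0 else e_pos
        elif neg[i]:
            newEn[i] = a_neg * E_neg[i] if E_neg[i] > 0 else e_neg
    return newL, newK, newEp, newEn, pos, neg
```

Iterating `pc_step` for $t = 1, \ldots, T$ from the seed initialization yields the activation histories that the IM objective below evaluates.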
The diffusion process is terminated after a pre-set (practically relevant) number of time periods $T$. Following the traditional setup, the PC model-based IM (PCIM) problem is concerned with populating $S^{+*}$ so as to maximize some measure of the evidence spread in the network. The measure taken in this paper accounts for both the earliness and the sustainment of node activation: PCIM amounts to maximizing the count of time periods with positive activation ($\Gamma_G(S^+, S^-)$) while minimizing the count of time periods with negative activation ($\Delta_G(S^+, S^-)$) over all the nodes,
\[
S^{+*} \in \arg\max_{S^+ \subseteq N,\; S^+ \cap S^- = \emptyset} \left[ \Gamma_G(S^+, S^-) - \Delta_G(S^+, S^-) \right].
\]
This makes the model applicable to marketing, political and military problems where the timing and duration of activation matter. For example, when activation stands for subscription to a service, each node generates profit in each time period that it is positively activated. As a result, the total duration of a node's positive activation determines its contribution to the objective function. Note that a positively activated node still observes both positive and negative evidence: as such, a positively activated node can become negatively activated after receiving enough negative evidence, and vice versa. Note also that in the absence of negative seeds, when communication can only reinforce the nodes' beliefs, the PC model with the decay factors set to $\alpha^+ = 1$ and $\beta^+ = 0$ reduces to a special case equivalent to the original LT model introduced by Kempe et al. [45], with fixed threshold values. By accommodating conflicting evidence, and thanks to its objective function, the PCIM problem can inform decisions even in situations where the decision-maker stands to eventually lose its market position(s).
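Given the activation histories produced by any run of the diffusion model, the objective value $\Gamma_G - \Delta_G$ reduces to simple counting. A minimal sketch (the variable names are ours):

```python
def pcim_objective(pos_hist, neg_hist):
    """PCIM objective for one diffusion run: node-periods with positive
    activation (Gamma_G) minus node-periods with negative activation
    (Delta_G). pos_hist[t][i] / neg_hist[t][i] are booleans giving the
    status of node i at period t."""
    gamma = sum(sum(period) for period in pos_hist)
    delta = sum(sum(period) for period in neg_hist)
    return gamma - delta

# Two periods, two nodes: node 0 is positive in both periods, node 1 is
# negative once and positive once, so the objective is 3 - 1 = 2.
value = pcim_objective([[True, False], [True, True]],
                       [[False, True], [False, False]])
```

Counting node-periods, rather than only end-of-horizon states, is what rewards early and sustained activation.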
Via the threshold values and forgetfulness rates, the PC model also easily accommodates the asymmetry of positive and negative influence effects in social networks, i.e., the phenomenon known as the "negativity bias", which, e.g., reflects the fact that only a few negative product feedback comments can turn a potential buyer away [54,6,65].

3.2. Optimization model specification and solution methodology

In this section, a mixed-integer program is constructed for finding exact optimal solutions to the PCIM problem. It is first noted that the PCIM problem is NP-hard.

Theorem 1. The PCIM problem is NP-hard.

Proof. By a polynomial Turing reduction from the Maximum Coverage Problem (see Appendix A).

The PCIM problem is now formally stated, with the notation summarized in Table 1. As stated earlier in the paper, at every time period, each node is either positively activated, negatively activated or inactive. At the end of each time period, every node collects all the incoming evidence and updates its cumulative evidence level to determine its activation status for the next time period. The mixed-integer programming model (P) for the PCIM problem is given as
\[
(P) \quad \max Z = \sum_{i=1}^{|N|} \sum_{t=0}^{T} (X_{it} - Y_{it}) \qquad (6)
\]
subject to:
\[
Y_{it} \ge \frac{(K_{it} - L_{it}) - \theta_i^-}{M}, \quad i = 1, \ldots, |N|, \; t = 0, 1, \ldots, T, \qquad (7)
\]
\[
1 - X_{it} \ge \frac{\theta_i^+ - (L_{it} - K_{it})}{M}, \quad i = 1, \ldots, |N|, \; t = 0, 1, \ldots, T, \qquad (8)
\]
\[
X_{it} + Y_{it} \le 1, \quad i = 1, \ldots, |N|, \; t = 0, 1, \ldots, T, \qquad (9)
\]
\[
L_{it} = \beta^+ L_{i,t-1} + \sum_{(j,i) \in A} E^+_{j,t-1}, \quad i = 1, \ldots, |N|, \; t = 1, \ldots, T, \qquad (10)
\]
\[
K_{it} = \beta^- K_{i,t-1} + \sum_{(j,i) \in A} E^-_{j,t-1}, \quad i = 1, \ldots, |N|, \; t = 1, \ldots, T, \qquad (11)
\]
\[
L_{i0} = X_{i0} \, \theta_i^+, \quad i = 1, \ldots, |N|, \qquad (12)
\]
\[
K_{i0} = Y_{i0} \, (\theta_i^- + \epsilon), \quad i = 1, \ldots, |N|, \qquad (13)
\]
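The big-M linearization of the activation rule in constraints (7)-(9) can be sanity-checked by enumeration: for fixed evidence levels, the feasible $(X, Y)$ pairs combined with the maximizing objective reproduce the threshold rule. An illustrative sketch with made-up numbers:

```python
def best_activation(L, K, th_pos, th_neg, M):
    """Enumerate (X, Y) in {0,1}^2 satisfying constraints (7)-(9) for given
    evidence levels L, K, and return the pair maximizing X - Y, as the
    objective in (6) would."""
    feasible = [
        (X, Y)
        for X in (0, 1)
        for Y in (0, 1)
        if Y >= ((K - L) - th_neg) / M        # constraint (7)
        and 1 - X >= (th_pos - (L - K)) / M   # constraint (8)
        and X + Y <= 1                        # constraint (9)
    ]
    return max(feasible, key=lambda p: p[0] - p[1])

# Net evidence above the positive threshold -> positively activated:
assert best_activation(5.0, 0.0, 2.0, 2.0, 100.0) == (1, 0)
# Net evidence below the negative threshold -> negatively activated:
assert best_activation(0.0, 5.0, 2.0, 2.0, 100.0) == (0, 1)
# Net evidence strictly between the thresholds -> inactive:
assert best_activation(1.0, 0.0, 2.0, 2.0, 100.0) == (0, 0)
```

Note that (8) only forbids $X_{it} = 1$ when the net evidence is below $\theta_i^+$; it is the objective's preference for $X_{it} - Y_{it}$ that pushes eligible nodes to the activated corner.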
Fig. 2. The spread of positive and negative evidence through a network with $|N| = 15$, $T = 5$, $e^+ = 7.1$, $e^- = 2.5$, $\alpha^+ = \alpha^- = 1.0$, $\beta^+ = \beta^- = 1.0$, $\theta^+ = 2$ and $\theta^- = 2$. The net activation value ($L_{it} - K_{it}$) is shown beside each node. Each panel reports the activation status of every node at a single time period $t = 0, 1, \ldots, 5$; nodes are marked as inactive, positively activated, or negatively activated.

Table 1: Definition of indices, input parameters and decision variables in the mathematical problem.

Indices:
  i, j — node indices
  t — time period index

Inputs:
  G(N, A) — the influence graph: a directed graph with a set of nodes N and a set of arcs A
  |N| — total number of nodes in the network
  T — total number of time periods in the time horizon considered in the problem
  |S+| — total number of positive seeds
  θi+ — positive threshold of node i
  θi- — negative threshold of node i
  e+ — maximum value of positive evidence a node can send in a single time period
  e- — maximum value of negative evidence a node can send in a single time period
  α+ — discount rate for the value of positive evidence sent by a positively activated node
  α- — discount rate for the value of negative evidence sent by a negatively activated node
  β+ — rate at which each node forgets previously received positive evidence
  β- — rate at which each node forgets previously received negative evidence
  Si — 1 if node i is a negative seed; 0 otherwise

Decision variables:
  Xit — 1 if node i is positively activated at time t; 0 otherwise
  Yit — 1 if node i is negatively activated at time t; 0 otherwise
  Lit — cumulative level of positive evidence of node i at time t
  Kit — cumulative level of negative evidence of node i at time t
  Eit+ — value of positive evidence that node i provides to its out-neighbors at time t
  Eit- — value of negative evidence that node i provides to its out-neighbors at time t

\[
E^+_{it} \le \alpha^+ E^+_{i,t-1} + (1 - X_{i,t-1}) e^+, \quad i = 1, \ldots, |N|, \; t = 1, \ldots, T, \qquad (14)
\]
\[
E^+_{it} \le e^+ X_{it}, \quad i = 1, \ldots, |N|, \; t = 1, \ldots, T, \qquad (15)
\]
\[
E^-_{it} \ge \alpha^- E^-_{i,t-1} + (Y_{it} - Y_{i,t-1}) e^-, \quad i = 1, \ldots, |N|, \; t = 1, \ldots, T, \qquad (16)
\]
\[
E^-_{it} \le e^- Y_{it}, \quad i = 1, \ldots, |N|, \; t = 1, \ldots, T, \qquad (17)
\]
\[
Y_{i0} = S_i, \quad i = 1, \ldots, |N|, \qquad (18)
\]
\[
E^+_{i0} = X_{i0} e^+, \quad i = 1, \ldots, |N|, \qquad (19)
\]
\[
E^-_{i0} = Y_{i0} e^-, \quad i = 1, \ldots, |N|, \qquad (20)
\]
\[
\sum_{i=1}^{|N|} X_{i0} \le |S^+|, \qquad (21)
\]
\[
0 \le L_{it}, K_{it}, E^+_{it}, E^-_{it}, \quad i = 1, \ldots, |N|, \; t = 0, 1, \ldots, T, \qquad (22)
\]
\[
Y_{it}, X_{it} \in \{0, 1\}, \quad i = 1, \ldots, |N|, \; t = 0, 1, \ldots, T. \qquad (23)
\]

The objective function in (6) takes into account the timing of node activation through the counts of positively and negatively activated nodes in each time period. Note that removing the timing of activation from the objective function in (6) would yield the problem of maximizing the number of positively activated nodes and minimizing the number of negatively activated nodes at the end of the diffusion process, i.e., in period $T$, which can be solved as a special case of (P).

Constraints (7) and (8) ensure that each node gets positively activated when its net evidence level (the difference between its cumulative positive and cumulative negative evidence) is greater than or equal to the positive threshold ($\theta^+$), and gets negatively activated when the cumulative negative evidence exceeds the cumulative positive evidence by at least the negative threshold ($\theta^-$). In constraints (7) and (8), $M$ is a large positive number greater than or equal to $[\max_{i \in N}(\theta_i^+ + \theta_i^-) + \epsilon] + (|N| - 1)(T + 1) e^-$. Constraint (9) guarantees that, at each time period, each node is either positively activated, negatively activated, or inactive. Constraints (10) and (11) ensure the correct updates of the cumulative evidence levels of each node at each time period. The diffusion process starts with the cumulative levels of positive and negative evidence set to zero for all nodes except the seeds. Constraints (12) and (13) ensure that the cumulative level of positive evidence of each positive seed reaches the positive threshold ($\theta^+$), and that the cumulative level of negative evidence of each negative seed exceeds the negative threshold ($\theta^-$). This is required so that the seeds do not lose their ability to propagate influence immediately after the initial time period. Since the objective function favors reducing the number of negative activations (it would deactivate a negatively activated node whose negative evidence level exactly equals its negative threshold), a very small positive parameter $\epsilon$, as small as 0.0001, is needed to force the model to keep the negative seeds negatively activated at the end of the initial time period. Such a parameter is not needed in constraint (12), because the objective function favors keeping the positive seeds positively activated when the level of positive evidence and the positive threshold are equal. Note that assigning a large value to $\epsilon$ and adding it to both (12) and (13) would increase the time over which the positive and negative seeds sustain their respective activations.

Constraints (14) and (15) set the value of the positive evidence that any node can propagate in a single time period $t \ge 0$ ($E^+_{it}$); they guarantee that: (a) $E^+_{it}$ is zero when node $i$ is not positively activated at time $t$ ($X_{it} = 0$); (b) $E^+_{it}$ equals $e^+$ when $i$ has just become positively activated at time $t$ ($X_{it} - X_{i,t-1} = 1$); and (c) $E^+_{it}$ equals $\alpha^+ E^+_{i,t-1}$ otherwise. Constraints (16) and (17) analogously set the value of the negative evidence that any node can propagate in a single time period $t \ge 0$ ($E^-_{it}$). Constraint (18) ensures the initial activation of the negative seeds. Constraints (19) and (20) set the initial values of the evidence propagated by each node in the network. Constraint (21) ensures that the total number of positive seeds at time $t = 0$ does not exceed the pre-defined number of seeds in the problem. The nonnegativity and binary restrictions on the decision variables are given in (22) and (23).

To the best of our knowledge, this paper presents the first mixed-integer program for solving IM problems. The experimental results with (P) are reported in Section 5.

3.3. Case studies with the PC model

The most meaningful and valuable IM modeling efforts reported in the literature allow for the characterization of the properties of optimal solutions, derived from the analyses of distinct small- and medium-sized problem instances; such findings can then be extrapolated to more general, large problem instances. This section presents two examples, using data from real-world networks, that illustrate the power of the PC model in explaining the consequences of seed selection decisions when positive and negative influences clash in social networks with specific structure.
The PC model reveals how and why the optimal strategy for positive influence spread depends on the selected seeds' positions, on the length of the decision-maker's window of opportunity, on the network structure, on the locations of the opponent's seeds, and on the specifications of the evidence accumulation mechanism.

Case study 1: This example studies the flow of information over Zachary's karate club network, a well-known network in the social network analysis literature. The dataset contains 34 members of a karate club who were observed for two years; the friendship links were extracted based on the interactions among members outside of club-related activities [81]. During the data collection, a disagreement grew between the club's administrator and its instructor, which eventually led to the club's break-up into two clubs. Fig. 3 shows Zachary's karate club social network, named Network 1, where node 1 denotes the instructor, the central node in the first cluster (C1), and node 34 denotes the administrator, the central node of the second cluster (C2). The clusters depict the eventual student memberships in the two separated clubs [31]. In order to define a PCIM problem on Network 1, consider it as a new market for a vitamin supplement product. Through personal connections, the students can share information with each other and observe each other using the product; consequently, they can process such observations as evidence in support of the hypothesis that the new product is good. Firm F1, the producer of a particular model (variation) of the new product, plans to offer the product at a discounted price to two people in the network, as seeds, to encourage other people to adopt the product, which they were reluctant to adopt otherwise.
Meanwhile, a competing producer F2, which produces an alternative model of the product, has also identified Network 1 as a niche and has already incentivized node 34, the administrator, who has the highest degree in Network 1, to serve as its seed. It is assumed that each person, when exposed to both F1's and F2's products, tests the null hypothesis that "F1's model is better than F2's" against the alternative hypothesis that "F2's model is better than F1's". The acceptance (rejection) of the null hypothesis by any node corresponds to the adoption of F1's (F2's) model, while staying undecided signals that the node has not yet adopted the product. The set position of the negative seed serves as a constraint in the problem that F1 formulates, with the objective of locating its own seeds more efficiently. In competing against each other, each company (F1 and F2) not only tries to maximize its own profit, but also tries to minimize the competitor's profit. Without any further assumptions, if F2 has a significantly stronger brand image than F1, the best intuitive strategy for F1 is to locate its seeds far away from the negative seed, so as to influence a group of people and reap some profit in the limited time window before everybody adopts F2's product. A more challenging problem arises when F1's brand image is as strong as F2's. In this situation, F1 can assign both seeds to cluster C1 to influence all the people in this cluster, while a more reasonable strategy may be to assign one seed to cluster C2, in the neighborhood of the negative seed, to cancel it out, and assign the other seed to cluster C1. Turning the described intuition into exact PCIM solutions, however, is not trivial. To this end, one can use program (P); see the results in Table 2.
In the solved PCIM instances, all the nodes process evidence in the same manner (i.e., the community is homogeneous), and the evidence threshold values are set to $\theta_i^+ = 2$, $\theta_i^- = 2$ for every node $i \in N$. The positive seeds are gradually strengthened over several problem instances by increasing the value of the positive evidence increment ($e^+$), which leads to changing optimal seed locations. The time horizon in every problem instance is set to $T = 5$ (note that, since the network is small, with a diameter of five links, any two positive nodes can potentially reach the whole network within five time periods). In Table 2, the first column shows the experiment index, the second and third columns show the evidence increment values, reflective of the relative quality levels of F1's ($e^+$) and F2's ($e^-$) products, and the last column reports the optimal seed set for each instance. The analysis of the optimal seed sets showcases the transition in the optimal seed allocations as the problem parameters are varied. When the positive evidence increment value is too small, the optimal positive seeds find themselves in the locations most distant from the negative seed. As the positive evidence strength grows in the subsequent instances, the optimal positive seed locations first gradually move toward the negative seed and then spread out evenly over the network. These results are well in line with the intuition.

Fig. 3. The IM problem on Network 1 with $|N| = 34$, $T = 5$, $\alpha^+ = \alpha^- = 0.8$, $\beta^+ = \beta^- = 1.0$, $\theta^+ = 2$ and $\theta^- = 2$. Square nodes represent members of the first cluster (C1) and circle nodes represent members of the second cluster (C2). Node 34 is the club administrator, who serves as the negative seed.
In order to study the effect of different time horizon settings on the optimal solution for F1, the experiments are repeated with various time horizons; the results are reported in Table 3. Tracking the changes in the optimal positive seed locations with varied $T$ reveals that the decision-maker should become more conservative as the time horizon of the problem increases. In order to further study the patterns in the optimal solution formation with growing $T$, assume that the decision-maker (F1) earns (loses) one dollar per positive (negative) activation per time period. Then, the PCIM objective can be interpreted as the amount of money that the decision-maker earns by the end of the marketing campaign. Taking any action other than the optimal one leads to a regret, relative to the objective value that would be obtained under the optimal seed selection. For instance, a decision-maker that relies on centrality-based heuristics will always select nodes (1, 33) as the positive seeds, as they have the highest degree and betweenness centrality values (except for node 34, which cannot be selected), irrespective of the evidence values and $T$. Table 3 shows the regret of the heuristic solution; the regret increases with $T$, which in part explains why the decision-maker becomes more conservative as $T$ increases. Note that the regret values should be standardized to allow for proper comparison across problem instances with different time horizons. As the maximum amount of money that F1 can theoretically make in each instance is $N(T + 1)$, termed the maximum theoretical revenue (MTR), the heuristic regret of each problem is divided by the MTR, and the standardized regrets are plotted in Fig. 4. When the positive seeds are weak, the negative evidence conquers the whole network, and vice versa. The peak in Fig. 4 corresponds to the case where the group of positive seeds and the negative seed are almost equally strong; this is when calculated seed selection can have a big impact. The calculations of the area under the standardized regret curve in Fig. 4 reveal that the regret of heuristic-based seed selection increases with $T$. Overall, these results emphasize the importance of optimal seed selection in (a) problems with a large time horizon, and (b) problems where neither positive nor negative evidence is overly dominant. Note that when the positive evidence increment ($e^+$) becomes much larger than the negative evidence increment ($e^-$), such that any strategy eventually leads to full positive activation of the network, the optimal strategy is indifferent to both the location of the negative seed and the time horizon, and places the positive seeds so as to minimize the time to reach all the nodes. Interestingly, this observation brings up the idea of minimizing the maximum (or average) distance of the nodes to the positive seed(s) as a heuristic method for locating positive seeds in social networks when the positive evidence strongly dominates the negative evidence. This finding connects the problem of locating positive seed(s) for maximizing the spread of evidence in a noncompetitive social network to the p-center and p-median problems in the facility location literature, which locate facilities so as to minimize the maximum or average distance from the facilities to the points of demand [38,64].

Table 2: Optimal positive seeds competing with the negative seed in node 34 for T = 5 (Network 1).

Exp. index | e+  | e-  | Opt. seeds | Remarks
1          | 0.5 | 3.5 | (6, 7)     | Both seeds are as far away from the neg. seed as possible
2          | 0.8 | 3.5 | (5, 6)     | Both seeds are far away from the neg. seed
3          | 1.1 | 3.5 | (1, 2)     | Both seeds inch closer to the neg. seed, still in C1
4          | 1.7 | 3.5 | (1, 33)    | One seed stays in C1 and the other one moves to counter the neg. seed in C2
5          | 2.1 | 3.5 | (3, 33)    | One seed moves to the bridge of C1 and C2 and the other one is still in C2
6          | 2.6 | 3.5 | (32, 33)   | Both seeds move to C2 to block the neg. seed in its own cluster
7          | 3.5 | 3.5 | (3, 33)    | One seed stays close to the neg. seed and the other one begins to move away
8          | 4.4 | 3.5 | (1, 33)    | The seeds spread out over the network, without regard to the neg. seed
9          | 6   | 3.5 | (1, 33)    | The seeds spread out over the network, without regard to the neg. seed

Table 3: The effect of the time horizon on the optimal position of the positive seeds (Network 1). For each T, the columns report the optimal seeds and the heuristic regret.

Exp. index | e+  | e-  | T=2 seeds | reg. | T=4 seeds | reg. | T=7 seeds | reg. | T=9 seeds | reg. | T=15 seeds | reg.
1          | 0.5 | 3.5 | (7, 14)   | 3    | (7, 11)   | 6    | (6, 7)    | 6    | (6, 7)    | 6    | (6, 7)     | 7
2          | 0.8 | 3.5 | (1, 2)    | 5    | (5, 6)    | 10   | (5, 6)    | 13   | (7, 11)   | 12   | (7, 11)    | 13
3          | 1.1 | 3.5 | (1, 2)    | 10   | (1, 2)    | 35   | (1, 2)    | 48   | (1, 2)    | 48   | (5, 6)     | 55
4          | 1.7 | 3.5 | (1, 33)   | 0    | (1, 33)   | 0    | (1, 33)   | 0    | (1, 3)    | 2    | (1, 3)     | 9
5          | 2.1 | 3.5 | (1, 33)   | 0    | (3, 33)   | 9    | (3, 33)   | 24   | (3, 33)   | 21   | (3, 33)    | 104
6          | 2.6 | 3.5 | (1, 33)   | 0    | (32, 33)  | 31   | (32, 33)  | 116  | (32, 33)  | 184  | (32, 33)   | 393
7          | 3.5 | 3.5 | (1, 33)   | 0    | (3, 33)   | 20   | (3, 33)   | 29   | (3, 33)   | 31   | (3, 33)    | 31
8          | 4.4 | 3.5 | (1, 33)   | 0    | (1, 33)   | 0    | (1, 33)   | 0    | (1, 33)   | 0    | (1, 33)    | 0
9          | 6   | 3.5 | (1, 33)   | 0    | (1, 33)   | 0    | (1, 33)   | 0    | (1, 33)   | 0    | (1, 33)    | 0

Fig. 4. The standardized regret value of the centrality-based heuristics in Table 3, plotted against the positive evidence value for T = 2, 4, 5, 7, 9 and 15.

Case study 2: This example studies the stability of judgments in a network in the presence of external influence.
To this end, a new concept reflecting the consequences of subjective evidence reinforcement is introduced, and its utility is illustrated in an application to the Florentine families' marriage network [56]. Define a network cluster's "defendability" as the number of its nodes that withstand the pressure of an external judgment, i.e., do not change their opinions/decisions (e.g., related to product purchasing, political party support, etc.). This case study showcases how the defendability of a cluster depends on its interconnectedness and on the timing of an external "attack". The Florentine families' marriage network, named Network 2, contains 16 elite families of Florence, with links representing the inter-family marriages in the period 1394-1434. Padgett and Ansell [56] illustrated how the Medici family took power by creating strategic marriage links in this network. It is of interest to explore how a growing number of within-cluster marriage links would help the Medici remain in power if a new family were to emerge from the outside and attempt to impose its own influence on the cluster (see Fig. 5(b)-(d)). Without loss of generality, the Medici family is taken as the positive seed in Fig. 5: it is assumed to begin a political campaign at time period $t = 0$. After $d$ time periods, a new family, Bruno, taken as the negative seed, creates a marriage link to the Lambertes family, a peripheral node in the original network, hoping to initiate an oppositional campaign. The negative influence is assumed to be stronger than the positive ($e^+ = 1$ and $e^- = 3.5$), ensuring that the negative influencer has the potential to penetrate the cluster. Intuitively, one expects a network cluster to reinforce a particular view the longer it is exposed to it. Also, the number of within-cluster connections should accelerate the information exchange and thereby make the cluster more defendable.
With both the positive and negative seed locations given, the PC diffusion model is employed to evaluate the spread of evidence through the network over time (up to $T = 50$), until the optimal solution no longer changes with growing $T$. In order to gauge the impact of the cluster density on defendability, two marriage links are first added to the original network: Acciaioul to Pazzi, and Albizzi to Bischeri (Fig. 5(b)); then, two additional links are added: Ridolfi to Albizzi, and Strozzi to Guadagni (Fig. 5(c)); and finally, two more links: Medici to Guadagni, and Ridolfi to Bischeri (Fig. 5(d)). Table 4 summarizes the results for the four clusters (2(a)-2(d), shown in Fig. 5(a)-(d)): the first row gives the cluster labels, and the second row reports the clusters' densities. The first column of Table 4 reports the delay (in the number of time periods) after which the cluster gets exposed to the negative influence. For each cluster in Table 4, the first (second) column reports the total number of nodes (families) that adopt the positive (negative) political opinion by the end of the diffusion process. The results reported in Table 4 quantify how cluster defendability increases with growing density, confirming the claim of Easley and Kleinberg [25] about the association between pluralistic ignorance and the number of direct contacts in the network. The clusters are also observed to become more defendable after a certain delay period, termed the critical delay threshold, which depends on the evidence strengths, the cluster connectivity, and the proximity of the point of attack to the positive seed in the cluster. In a real-world scenario, the Bruno family would be unlikely to be able to marry into any family in the core of Network 2, while a link to Pucci, an isolated node, would hardly be useful. The results of the diffusion process with the same settings as in Table 4, but with Bruno targeting the low-degree families Acciaioul and Pazzi, are reported in Table 5.
The results showcase the fact that attacking a network through a point far from the positive seed provides a better opportunity for the external evidence to succeed. In order to see how the distance of the point of attack from the positive seed affects the success of the external influence in spreading through the cluster, the same experiments are repeated with a link added to the network to connect the point of attack (Lambertes) to the positive seed (Medici); see Table 6. The comparison of Tables 4 and 6 reveals that decreasing the distance between the positive seed and the point of attack hurts the prospects of the negative seed. More generally, reinforcing a network with more links makes it more defendable against an opposing influence.

Fig. 5. The IM problem on Network 2 with $|N| = 16$, $T = 50$, $\alpha^+ = \alpha^- = 0.7$, $\beta^+ = \beta^- = 1.0$, $\theta^+ = 2$ and $\theta^- = 2$: (a) the initial setup; (b) the red edges are added to the network; (c) the green edges are added to the network; (d) the blue edges are added to the network. Adding edges to the cluster (increasing its density) increases its defendability and makes it more difficult for the Bruno family (the negative seed) to penetrate the network cluster.

Table 4: The counts of positively (+) and negatively (-) activated nodes in Network 2 at time T = 50.

Delay | 2(a) (0.167) + / - | 2(b) (0.183) + / - | 2(c) (0.2) + / - | 2(d) (0.217) + / -
0     | -  / 15 | -  / 15 | -  / 15 | -  / 15
1     | -  / 15 | -  / 15 | -  / 15 | 14 / 1
2     | -  / 15 | -  / 15 | -  / 15 | 14 / 1
3     | -  / 15 | 3  / 9  | 14 / 1  | 14 / 1
4     | -  / 15 | 8  / 4  | 14 / 1  | 14 / 1
5     | 1  / 11 | 9  / 4  | 14 / 1  | 14 / 1
6     | 6  / 9  | 9  / 3  | 14 / 1  | 14 / 1
7     | 13 / 1  | 13 / 1  | 14 / 1  | 14 / 1

Table 5: Attacking low-degree families.

Delay | Attacking Pazzi (0.167) + / - | Attacking Acciaioul (0.167) + / -
0     | 13 / 2 | -  / 15
1     | 13 / 2 | 14 / 1
2     | 13 / 2 | 14 / 1
3     | 13 / 2 | 14 / 1
4     | 13 / 2 | 14 / 1
5     | 13 / 2 | 14 / 1
6     | 13 / 2 | 14 / 1
7     | 13 / 2 | 14 / 1

Table 6: Attacking the Lambertes family.

Delay | Attacking Lambertes (0.175) + / -
0     | -  / 15
1     | -  / 15
2     | 13 / 1
3     | 13 / 1
4     | 13 / 1
5     | 13 / 1
6     | 13 / 1
7     | 13 / 1

Case study 2 highlights the fact that investments into influencing a well-connected community must be carefully calculated. Both the community structure and the intervention timing are important to such ventures. Note that in a marketing problem of occupying and protecting a market niche, the delay considered in this section can be viewed as that of introducing a competing product. In this context, the PC diffusion model can help evaluate long-term marketing strategies, i.e., assess the trade-off between an earlier yet more expensive and a delayed but less expensive product introduction.

In summary, Section 3.3 showcases the value of the PC diffusion model for expressing the spread of evidence in practical IM problems. Furthermore, the provided case studies exemplify how sensitive the optimal solutions of IM problems may be to the numerical values of the problem parameters. Notably, the present section connects the seed positioning problem in social networks to the facility location problem, a well-studied problem in the Operations Research literature, which opens a door to applying
location theory models for IM problems in social networks. Section 4.2.1 explains how the methods from the location theory literature can inform new IM heuristics.

Table 7. Computational results for small- and medium-sized PCIM problem instances.

Dataset MPE (|N| = 35, T = 10):
  |S+| = 4 (BP): opt. sol. (2, 10, 12, 20)
  |S+| = 5: opt. sol. (2, 10, 12, 18, 20); includes BP solution (100%)
  |S+| = 6: opt. sol. (2, 10, 12, 14, 20, 31); includes BP solution (100%)
  |S+| = 7: opt. sol. (2, 10, 12, 14, 20, 29, 31); includes BP solution (100%)
Dataset FB1 (|N| = 40, T = 30):
  |S+| = 4 (BP): opt. sol. (6, 8, 17, 19)
  |S+| = 5: opt. sol. (6, 8, 17, 19, 25); includes BP solution (100%)
  |S+| = 6: opt. sol. (6, 8, 17, 19, 25, 34); includes BP solution (100%)
  |S+| = 7: opt. sol. (5, 6, 8, 17, 19, 25, 34); includes BP solution (100%)
Dataset FB2 (|N| = 50, T = 25):
  |S+| = 4 (BP): opt. sol. (10, 19, 25, 43)
  |S+| = 5: opt. sol. (10, 16, 19, 25, 43); includes BP solution (100%)
  |S+| = 6: opt. sol. (10, 16, 19, 25, 43, 48); includes BP solution (100%)
  |S+| = 7: opt. sol. (10, 16, 19, 25, 43, 47, 48); includes BP solution (100%)

4. A set of Lagrangian heuristic tools for PCIM

As mentioned in Section 3 (and proved in Appendix A), the PCIM problem is NP-hard. The sources of complexity of the PCIM problem include the number of nodes in the Influence Graph, the total number of time periods for spreading the evidence, and the number of positive seeds. An efficient approach is therefore required for solving PCIM problem instances with large Influence Graphs. This section presents a guaranteed-performance heuristic method for PCIM based on Lagrangian Relaxation. It works by relaxing a preselected subset of constraints in (P) and adding weighted penalty terms for violating the relaxed constraints to the objective function. Lagrangian Relaxation has been applied to optimization problems in various areas, including supply chain network design [57], scheduling [21], network planning [61] and data clustering [22]. A Lagrangian Relaxation heuristic for PCIM is designed in Section 4.1: it identifies good feasible solutions in reasonable time, while returning a tight upper bound for the optima.
To achieve the latter, a Subgradient algorithm is presented in Appendix B as a method for finding the lowest upper bound for PCIM. Finally, two heuristic methods are presented in Section 4.2 for finding near-optimal feasible PCIM solutions and obtaining tight lower bounds for the optima.

4.1. Lagrangian relaxation for finding an upper bound for PCIM solutions

By definition, incorporating the removed PCIM constraints into its objective function results in a valid "relaxation" of the original formulation [39,28]. Relaxations that keep X_it (i = 1, 2, ..., |N|; t = 0, 1, ..., T) binary and keep the negative seeds fixed are computationally convenient, because adding (dropping) positive seeds to (from) their optimal solutions can produce valid feasible solutions for (P). With this idea in mind, constraint set (21) is relaxed with dual multiplier u, and the Lagrangian Relaxation problem (LR_u) for (P) is given by

(LR_u)  max Z^{LR_u}(u) = Σ_{i=1}^{|N|} Σ_{t=0}^{T} (X_it − Y_it) + u (|S+| − Σ_{i=1}^{|N|} X_i0),   (24)

subject to:

u ≥ 0,   (25)
(7)–(20), (22)–(23).

Each feasible solution of (P) is feasible for the corresponding (LR_u), since (LR_u) is at most as constrained as (P). The non-negativity condition (25) on the Lagrangian multiplier u is required for (LR_u) to be a valid relaxation of (P). Although the relaxed problem removes the constraint on the exact number of positive seeds in (P), the number of positive seeds in (LR_u) is still limited by constraint sets (9) and (18), which do not allow a node to be a negative seed and a positive seed at the same time. Since (LR_u) is a relaxation of (P), the optimal objective value of (LR_u) provides an upper bound for (P). In an effort to obtain tight upper bounds for PCIM, a Lagrangian dual problem (LD_u) is formulated,

(LD_u)  Z^{LD_u} = min_{u' ≥ 0} Z*^{LR_u}(u'),   (26)

where Z*^{LR_u}(u') is the optimal objective value of (LR_u) for a given u'. The Lagrangian dual problem (LD_u) can be solved iteratively to find the dual multiplier that minimizes the optimal value of (LR_u) and thereby obtain the best (lowest) upper bound for (P). To make the iterative search procedure for solving (LD_u) more efficient, a loose relaxation of (LR_u) is preferable. Tighter relaxations, however, are expected to provide better bounds for (P); thus, a trade-off arises between executing fewer iterations of the search procedure for solving (LD_u) with a tighter relaxation and executing more iterations with a looser one.

In this paper, a Subgradient search procedure, a well-known hill-climbing algorithm [29], is applied to solve the Lagrangian dual problem (see Appendix B). Other methods, including simplex-based methods and multiplier adjustment methods, have been proposed in the literature for solving Lagrangian dual problems, but Subgradient-based procedures, in general, achieve better computational performance [28,29]. Two heuristic methods are proposed next for finding near-optimal feasible solutions for (P); these provide the lower bound used for calculating the heuristic gap and for updating the step size in the Subgradient algorithm.

4.2. Obtaining the lower bounds for optimal PCIM solutions

Each feasible solution for (P) provides a valid lower bound for the optimal solution of (P). The presented PCIM problem always has at least one feasible solution if the total number of positive and negative seeds does not exceed the total number of nodes in the Influence Graph. The simplest method for finding a feasible solution for (P) is to trivially select any |S+| nodes that are not negative seeds as positive seeds. Although such solutions satisfy the stopping criterion in the Subgradient algorithm, the resulting lower bound is not necessarily tight. In this section, two heuristic methods are presented for finding near-optimal feasible solutions.

4.2.1. The iterative seed removal (ISR) algorithm

The PCIM problem has three properties, discovered through experimental studies with the mathematical model (P) over Network 1, Network 2 and real Facebook datasets from the SNAP collection [48], and presented in this section as observations. The ISR algorithm utilizes these properties to efficiently find near-optimal feasible solutions for the PCIM problem.

Observation 1. In the PCIM problem, when the positive seed locations are given at time t = 0, the calculation of the resulting objective function value in (P) takes O(T|N|²) time.

Observation 2. For PCIM problems with a varying number of positive seeds to be selected (|S+|), the solution time is a concave function of |S+|.

Observation 3. The intersection of the sets of optimal seeds for two instances of the PCIM problem with a different number of positive seeds is generally non-empty; moreover, in a vast majority of cases, the optimal seed set for a PCIM problem instance is a subset of the optimal seed set for the PCIM instance with the same parameter specification but more positive seeds.

The analysis of PCIM run-times in Section 5.2 experimentally confirms Observations 1 and 2. As evidence for Observation 3, Table 7 gives the results for three PCIM problems with a growing number of positive seeds. The first problem uses the Mexican Political Elite (MPE) network, which contains the significant friendship, kinship, political and business connections within a powerful political group in Mexico [19]. The next two problems use Facebook data subsets, FB1 and FB2, found in the SNAP collection [48].
In each case, the number of positive seeds in the original PCIM problem, BP (Base Problem), is equal to four, and new PCIM problems are generated by iteratively incrementing the number of positive seeds by one; each is then solved to see whether the optimal solution of a problem with more positive seeds includes the seeds from the optimal solution of BP. The results confirm that the optimal solutions of all the problems with more than four positive seeds contain the solution of BP. Note that Observation 3 experimentally supports the utility of the Greedy algorithm of Kempe et al. [45] and the facility location Stingy algorithm (also known as the Greedy-Drop algorithm) of Feldman et al. [27] for solving PCIM problems.

The ISR algorithm employs the PCIM problem properties captured in the three presented observations to efficiently obtain good, tight solutions to practical problem instances. Consider an instance of the PCIM problem with |S+| positive seeds to be identified (henceforth referred to as the original problem). The ISR algorithm increases the number of positive seeds and defines a Dummy problem with |S_d+| > |S+| positive seeds. The Dummy problem is exactly the same as the original PCIM problem in its Influence Graph and input parameters, but it seeks a greater number of positive seeds. An optimal solution for the Dummy problem is necessarily infeasible for the original PCIM; the ISR algorithm works to iteratively find the best combination of seeds to remove from the optimal solution of the Dummy problem and thereby obtain a good feasible solution for the original problem. The number of positive seeds in the Dummy problem is chosen to be large, but not too large, so that the Dummy problem can be solved fast and the seed removal procedure remains efficient. According to Observation 2, it is always possible to find an easy Dummy problem with |S_d+| + |S−| ≤ |N|.
According to Observation 3, an optimal solution for the Dummy problem is expected to include the positive seeds present in the optimal solution for the original problem. Hence, the ISR algorithm effectively executes the greedy algorithm of Kempe et al. [45] backwards. At each iteration of the ISR algorithm, the problem with more positive seeds is called a superior problem, because its objective function value is necessarily greater than or equal to that of the subproblem obtained by removing one seed, which is hence called an inferior problem. To begin, let the first superior problem have |S_d+| positive seeds and define an inferior problem as a maximization problem for finding the best set of |S_d+| − 1 positive seeds. Instead of solving the inferior problem, the ISR algorithm traverses all the distinct combinations of |S_d+| − 1 positive seeds in the solution of the superior problem and selects the combination that maximizes the objective function of the inferior problem. According to Observation 1, computing the objective function value of the inferior problem for each possible combination of |S_d+| − 1 positive seeds in the optimal solution of the superior problem takes O(T|N|²) time. The ISR algorithm keeps removing positive seeds until it obtains a set of |S+| positive seeds, which it reports as a feasible solution for the original PCIM problem and a lower bound for the optimal solution. The total number of inferior problems that the ISR algorithm evaluates to obtain the feasible solution for the original PCIM problem with |S+| positive seeds, using a Dummy problem with |S_d+| positive seeds, is

C(|S_d+|, |S_d+| − 1) + C(|S_d+| − 1, |S_d+| − 2) + ⋯ + C(|S+| + 1, |S+|)
  = |S_d+| + (|S_d+| − 1) + ⋯ + (|S+| + 1)
  = (|S_d+|² + |S_d+| − |S+| − |S+|²) / 2.   (27)

The ISR algorithm elegantly employs the PCIM problem properties.
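The greedy-drop loop and the evaluation count in (27) can be sketched together in a few lines of Python. This is a hypothetical illustration, not the paper's implementation: `evaluate` stands in for the O(T|N|²) PCIM objective evaluation of Observation 1, and a toy additive objective is used for the usage example.

```python
def closed_form(s_d, s):
    # Right-hand side of (27): (|S_d+|^2 + |S_d+| - |S+| - |S+|^2) / 2
    return (s_d * s_d + s_d - s - s * s) // 2

def isr(dummy_solution, target_size, evaluate):
    """Core loop of iterative seed removal: shrink an optimal Dummy-problem
    seed set down to `target_size` seeds, dropping at each round the seed
    whose removal hurts the objective least."""
    seeds, evals = set(dummy_solution), 0
    while len(seeds) > target_size:
        candidates = [seeds - {s} for s in seeds]   # all one-seed deletions
        evals += len(candidates)                    # one evaluation per candidate
        seeds = max(candidates, key=lambda c: evaluate(frozenset(c)))
    return seeds, evals

# Usage with a toy additive objective (each seed contributes its own value):
values = {"a": 5, "b": 1, "c": 3, "d": 2}
obj = lambda s: sum(values[v] for v in s)
best, evals = isr({"a", "b", "c", "d"}, 2, obj)
print(sorted(best))                  # ['a', 'c'] -- the two most valuable seeds
print(evals == closed_form(4, 2))    # True: 4 + 3 = 7 evaluations, matching (27)
```

The counting works because choosing |S_d+| − 1 seeds out of |S_d+| can be done in exactly |S_d+| ways, so the binomial sum in (27) collapses to an arithmetic series.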
However, its main drawback is its independence from the Subgradient algorithm: the upper bound for the PCIM problem, found by the Subgradient algorithm, does not feed into the ISR algorithm. Furthermore, the ISR algorithm may not work well if all the available Dummy problems are hard to solve.

Algorithm 1. The ISR Algorithm for PCIM.

Initialize |S+| and |S_d+| in a Dummy problem, with |S_d+| > |S+|;
Initialize bestSolutionValue with −M and currentSolutionValue with 0; /* M is a large positive number */
Solve the Dummy problem with |S_d+| positive seeds;
Store the solution in S*_sup;
for t ← 0 to (|S_d+| − |S+| − 1) do
  for i ← 1 to (|S_d+| − t) do
    createNewSolution(i); /* removes the ith seed from S*_sup to obtain a new solution for the inferior problem */
    evaluateNewSolution(i); /* evaluates the objective function for the new solution */
    storeCurrentSolution(i); /* stores the objective value of the new solution in currentSolutionValue */
    if currentSolutionValue ≥ bestSolutionValue then
      recordBestSolutionIndex(); /* records i as the index of the best solution for PCIM */
      updateBestSolutionValue(); /* updates the best solution for PCIM */
    end if
  end for
  removeOneSeed(); /* removes the seed with the best solution index from the seed set and updates S*_sup */
  resetBestSolutionValue(); /* re-initializes bestSolutionValue with −M */
end for

4.2.2. The adaptive subgradient-based (ASB) algorithm

The ASB algorithm is designed to utilize the information obtained while executing the Subgradient algorithm to iteratively improve the lower bound for PCIM. The optimal solution of (LR_u) does not necessarily provide a feasible solution for (P), since (LR_u) is not constrained by the number of positive seeds. Let S_L+ be the set of positive seeds in the optimal solution of (LR_u).
At each iteration of the Subgradient algorithm, if |S_L+| ≥ |S+|, the ASB algorithm selects the first |S+| positive seeds (with respect to a fixed random ordering) from S_L+ to obtain a feasible solution for (P) with |S+| positive seeds. On the other hand, when |S_L+| < |S+|, the ASB algorithm selects all the positive seeds in S_L+ and randomly selects |S+| − |S_L+| additional positive seeds from the nodes in the network that are neither in S_L+ nor in S−. In the early iterations of the Subgradient algorithm, the ASB algorithm selects the |S+| positive seeds almost blindly; however, the selection process becomes more precise as the Subgradient algorithm runs further.

Algorithm 2. The ASB Algorithm for PCIM.

Initialize |S+|;
Initialize bestSolutionValue with the solution of the ISR algorithm and currentSolutionValue with 0;
while gapValue ≥ acceptableGap do
  storeLagrangianSolution(); /* stores the solution of (LR_u) to be used for finding the lower bound */
  createFeasibleSolution(); /* creates a feasible solution for (P) using S_L+ */
  evaluateNewSolution(); /* calculates the objective value of PCIM for the stored solution */
  storeCurrentSolution(); /* stores the objective value of the new solution in currentSolutionValue */
  if currentSolutionValue ≥ bestSolutionValue then
    recordBestSolution(); /* updates the best feasible solution for PCIM */
  end if
end while

The ISR and ASB algorithms are very efficient when used together in practice. The ASB algorithm first stores the best feasible solution obtained by the ISR algorithm as the best feasible solution of (P) found so far. At each iteration of the Subgradient algorithm, the ASB algorithm extracts a feasible solution for (P) using Algorithm 2 and quickly computes the corresponding objective value (see Observation 1). The feasible solution for (P) obtained by the ASB algorithm is accepted only if it provides a greater objective function value than the current best feasible solution. To summarize, the presented algorithms form a Lagrangian heuristic toolbox for obtaining near-optimal PCIM problem solutions with rigorously evaluated bounds; the utility of, and relationships between, the algorithms are illustrated in Fig. 6.

Fig. 6. The Lagrangian heuristic toolbox: an overview of the components. [The diagram relates the original problem (P), its Lagrangian Relaxation (LR(u)), the Subgradient search algorithm (which finds the u that minimizes LR(u) and provides the upper bound), and the ISR and ASB algorithms (which provide and improve the lower bound); the upper and lower bounds define the heuristic gap, which guides the stopping criterion.]

5. Computational results

This section presents computational results for PCIM instances on real social networks. Section 5.1 studies the performance of the Lagrangian Relaxation toolbox for PCIM. Section 5.2 focuses on run-times and discusses the sources of complexity in the PCIM problem.

5.1. Lagrangian relaxation performance

In order to analyze the performance of the presented Lagrangian Relaxation heuristic, this section solves PCIM problem instances formulated on four Facebook networks found in the SNAP collection [48]. The size- and structure-dependent statistics of these undirected datasets, indexed F1, F2, F3 and F4, are reported in Table 8. The nodes in these networks are labeled; in order to evaluate the performance of the heuristic method, each experiment takes a sub-network of the main dataset with |N| nodes.

Table 8. Dataset statistics.

Dataset   Nodes   Edges    Directed   Density
F1        150     1693     No         0.151
F2        747     30,025   No         0.108
F3        534     4813     No         0.034
F4        1034    26,749   No         0.05

In this work, the mixed-integer program and the Lagrangian Relaxation heuristic are implemented using Concert Technology in Java with the commercial solver CPLEX 12.5. All the experiments have been performed on a desktop with an Intel(R) Core(TM) i3 3.3 GHz processor, 8 GB RAM and a 64-bit operating system.

Table 9 shows the computational results for small- and medium-sized problems, all solved to optimality using CPLEX. The availability of the optimal solutions for these problems permits calculating both the optimality gap and the heuristic gap. For the small problems, CPLEX outperforms the Lagrangian Relaxation heuristic in terms of solution time. As the problem size increases, the PCIM solution time with CPLEX grows rapidly (see Section 5.2 for the sources of the PCIM problem complexity), while the Lagrangian Relaxation heuristic remains fast. Note that in the majority of the PCIM problem instances reported in Table 9, the ISR and ASB algorithms find the optimal solution (the optimality gap equals zero).

The results of the computational study with large-sized problems are given in Table 10. For these problems, CPLEX runs out of computer memory and fails to return optimal solutions. In such cases, the Lagrangian Relaxation heuristic runs in reasonable computational time and provides an acceptable heuristic gap. For large problems, the optimality gap is unknown, because the optimal solution is unknown, and the heuristic gap remains the only criterion for evaluating the heuristic's performance.

Table 9. Computational results with small- and medium-sized PCIM problem instances.
Dataset   |N|   T     |S+|   LR time (s)   LR LB   LR UB   Cplex time (s)   Cplex sol. (opt.)   Opt. gap (%)   Heu. gap (%)   Iter. #
F1        30    40    6      11.53         1167    1186    0.69             1167                0              1.6            20
F1        40    100   9      72.01         4020    4021    7.89             4020                0              0.02           20
F1        60    50    7      81.32         2849    2931    74.32            2853                0.1            2.7            20
F2        45    64    7      26.59         2795    2870    7.71             2795                0              2.6            20
F2        60    50    9      32.06         2887    2929    45.84            2887                0              1.4            20
F2        85    75    14     49.66         6233    6289    3425.23          6233                0              0.8            30

Table 10. Computational results with large-sized PCIM problem instances.

Dataset   |N|   T    |S+|   LR time (s)   LR LB   LR UB   Cplex time (h)   Cplex gap (%)   Heu. gap (%)   Iter. #
F1        80    50   9      70.91         3839    3923    >4               >195            2.1            60
F1        100   60   10     112.99        5764    5981    >4               >190            2.4            60
F2        120   70   10     259.73        8021    8280    >4               >198            3.1            60
F2        120   70   10     259.73        8021    8280    >4               >198            3.1            60
F3        80    50   11     78.65         3691    3836    >4               >192            3.7            60
F3        120   70   10     259.73        8021    8280    >4               >198            3.1            60

The run-time of the Lagrangian Relaxation heuristic increases smoothly with the dimensions of the PCIM problem instances, which illustrates the principal contribution of the heuristic approach. In order to assess the scalability of the Lagrangian Relaxation heuristic for solving practical PCIM problems, it is executed on large Facebook networks, for which CPLEX cannot even construct a feasible solution in computer memory. It is observed that the Lagrangian Relaxation heuristic still provides acceptable bounds for the optimal PCIM solutions. Table 11 reports the results of a computational study with five large problems, where the only criteria of interest are the heuristic gap and the solution time of the Lagrangian Relaxation heuristic. The results in Table 11 show that the proposed heuristic method delivers encouraging results for large-scale problems, establishing its practical value.

Table 11. Computational results with large-sized PCIM problem instances.

Dataset   |N|    T    |S+|   LR time (s)   LR LB    LR UB    Heu. gap (%)   Iter. #
F2        550    50   30     594.96        26,940   27,494   2.0            60
F2        720    40   84     831.72        24,467   25,070   2.4            60
F3        480    35   30     522.73        13,977   14,208   1.6            60
F4        1034   60   100    1589.34       28,991   29,822   2.8            60

5.2. A sensitivity analysis of the PCIM problem run-time dynamics

Three elements affect the solution time of program (P) for PCIM: the number of nodes in the Influence Graph, the number of time periods for evidence spread, and the number of positive seeds to be selected. In this section, the PCIM input parameters are varied selectively, and three sets of experiments are performed with the F3 dataset; in each case, only one of the three aforementioned factors is changed to see how it affects the solution time. The results in Fig. 7(a) show that the solution time increases rapidly with the growing number of nodes in the Influence Graph; e.g., solving a problem with 90 nodes takes about 50 times longer than solving a problem with 80 nodes. Fig. 7(b) shows the effect of the total number of time periods, T, on the run-time of (P), revealing a linear trend. Problems with a small number of nodes appear to remain tractable even for large T. The results of the third set of experiments (see Fig. 7(c)) show how the solution time of (P) is affected by the number of positive seeds in the PCIM problem. These results confirm the second observation given in Section 4.2.1. As shown in Fig. 7(c), the solution time resembles a concave function of the number of positive seeds, which motivates the ISR Algorithm: the number of positive seeds in a hard PCIM problem instance needs to be only slightly increased to find a Dummy problem with a significantly lower solution time.

6. Discussion

This section discusses the limitations of the presented models and methods, and concludes the paper.

6.1. Study limitations

This paper provides insightful findings and develops a framework for modeling the spread of influence in a social network. However, this study has limitations worth mentioning.
First, while the PC model relies on the theory and findings established in the sociology literature on human decision-making [66], it treats stochasticity only implicitly (through Bayesian updates) and does not emphasize the differences between network actors or the uncertainty in capturing such differences. The deterministic diffusion process makes the PCIM problem mathematically tractable, i.e., it allows one to solve it as a mixed-integer program, design efficient heuristics exploiting known and fixed network characteristics, and make insightful observations (after all, linear programs are often found useful in practice even though real-world problems are rarely truly deterministic). Future work, however, can involve stochastic optimization for PCIM.

Second, data-focused studies are needed to uncover and address potential challenges in specifying the model parameters, i.e., in learning how people really process subjective, as opposed to objective, evidence. Investigations of the latter have been conducted previously [77,34], which gives promise to the expansion of the presented research in this direction, too; such studies, however, lie in the consumer psychology domain. On a positive note, from the modeling and algorithmic perspectives, the PC model can be used with user-defined parameters, and its ability to produce practical insights is confirmed through the reported case studies.

6.2. Concluding remarks

This paper models social influence as a consequence of subjective evidence transfer, and quantitatively derives general insights about cascading behavior and belief reinforcement in social networks. The presented Parallel Cascade (PC) diffusion model defines the rules of exchange and accumulation of subjective evidence, which feeds into node-level hypothesis testing, en route to making decisions, forming judgments, etc. The preference of a null hypothesis over an alternative hypothesis determines a node's participation status with regard to further evidence propagation. The value of evidence collected and accumulated, in support of or against the null hypothesis, is calculated using Bayesian update logic. The optimization problem of finding the set of influential nodes to initiate the evidence spread in support of the null hypothesis under the PC diffusion model (PCIM) is formulated as a mathematical program and solved using CPLEX. The PCIM problem is shown to be NP-hard, and an efficient, guaranteed-performance heuristic tool set exploiting Lagrangian Relaxation is presented. The studies of the spread of evidence in social networks using the PC diffusion model showcase that the ability of the decision-maker to trigger a successful cascade, or to keep a cascade alive, is sensitive to the density of network connections and the presence of opposing opinions in a target cluster.

Fig. 7. Sensitivity analyses of (P) with α+ = α− = 0.7, β+ = β− = 0.9, e+ = 1.2, e− = 1.25: (a) run-time dynamics as a function of the total number of nodes in the network (T = 10, |S+| = 4, |S−| = 3), (b) run-time dynamics as a function of the number of time periods (|N| = 50, |S+| = 4, |S−| = 3), and (c) run-time dynamics as a function of the number of positive seeds (|N| = 50, T = 10, |S−| = 3).

Fig. 8. The heuristic and optimality gaps achieved with the Subgradient algorithm. [The plot shows, per iteration, the upper- and lower-bound sequences closing the heuristic gap around the (possibly unknown) optimal value.]
This paper focuses on node-level IM solutions, utilizing the fine-grained features of the network structure; however, it also opens a door to studying the problem at the network level, e.g., describing the general properties of the seeds' optimal locations based on metrics such as density and clusterization. Based on the presented PC diffusion model, one can potentially develop new centrality metrics for evaluating a network's ability to reinforce or preserve beliefs. Future research can also explore how PCIM instances with extremely large Influence Graphs can be reduced, e.g., via clustering, to become manageable.

The PC model quantifies belief reinforcement through social connections, and informs the changes in optimal seed allocation for creating successful cascades. As noted in Section 3, the model can incorporate actors' actions: based on the collected evidence, the actors may not only be active in spreading their opinions and judgments, but may also choose to buy a product, vote for a party, etc. Such actions will result in the acquisition of first-hand objective evidence by the actors, which can be processed differently from subjective evidence. The addition of actions to the model can lead to more insightful analyses, e.g., of low-quality but actively advertised goods, where customers may get excited about a product only until they buy one. Also, this paper opens up a new area for modeling the defendability of cohesive clusters in social networks against strong external opinions and for the identification of "vulnerable" nodes in network clusters. Furthermore, the paper establishes a connection between PCIM and location theory models. Further efforts will pursue the construction of a theoretical method for solving stochastic PCIM instances.
Moreover, future studies can apply the proposed optimization scheme to model the spread of evidence in growing social networks, and in situations where neither the structure of a social network nor the locations of the opponent's opinion leaders are precisely specified.

Acknowledgements

This work was supported in part by the National Science Foundation Grant ICES-1216082, and a Multidisciplinary University Research Initiative (MURI) Grant W911NF-09-1-0392. This support is gratefully acknowledged.

Appendix A. Proof of Theorem 1

The PCIM problem is shown to be NP-hard by a polynomial Turing reduction from the Maximum Coverage Problem (MCP), also referred to as the max k-cover or set k-cover problem in the literature [26]. The objective of MCP is to select a group of sets, where some sets may share elements, such that the total number of selected sets does not exceed a predefined limit and the total number of covered elements is maximized. MCP is first formally stated, and then the reduction from MCP to PCIM is presented.

Maximum coverage problem.
Instance: A number k > 0 and a collection of sets J = {J_1, J_2, ..., J_m}.
Objective: Find a subset J' ⊆ J such that |J'| ≤ k and the number of covered elements |⋃_{J_i ∈ J'} J_i| is maximized.

Given an arbitrary instance of MCP, define a particular instance of PCIM as follows. Assume the Influence Graph G(N, A) is given and let T = 1, |N| = m, |S+| = k and |S−| = 0. Let e+ > max θ_i+, i = 1, 2, ..., |N|, e− = 0, and, lastly, set α+ = α− = β+ = β− = 1. Set J_i, for i = 1, 2, ..., m, is defined such that j ∈ J_i iff j = i or (i, j) ∈ A, j = 1, 2, ..., |N| (i.e., J_i contains node i and all the nodes in the first hop of node i).
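The set construction in the reduction is mechanical: each node i generates the set J_i = {i} ∪ {j : (i, j) ∈ A}. The short Python sketch below (hypothetical helper names, not from the paper) builds these sets from an edge list and evaluates the MCP coverage objective, illustrating that choosing k positive seeds in the constructed PCIM instance is exactly choosing k sets in MCP.

```python
# Hypothetical sketch: building the MCP sets J_i used in the reduction.
# Each J_i holds node i together with its one-hop neighbors.

def build_cover_sets(nodes, edges):
    """nodes: iterable of node ids; edges: iterable of undirected pairs (i, j)."""
    J = {i: {i} for i in nodes}
    for i, j in edges:
        J[i].add(j)
        J[j].add(i)
    return J

def coverage(J, chosen):
    """MCP objective: the number of elements covered by the chosen sets."""
    return len(set().union(*(J[i] for i in chosen)))

# Usage: a 4-node path 1-2-3-4; with k = 1, picking node 2 covers {1, 2, 3}.
J = build_cover_sets([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
print(coverage(J, [2]))  # 3
```

With T = 1 in the constructed PCIM instance, a seed at node i activates exactly the nodes of J_i by time T, so maximizing the PCIM objective amounts to maximizing coverage.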
This transformation can be performed in time polynomial in the size of the arbitrary MCP instance. In order to show that an optimal solution to the PCIM problem maps to an optimal solution of MCP, let X*_i0, i = 1, 2, ..., |N| (X ∈ {0, 1}), be an optimal solution to the PCIM problem. Then Σ_{i=1}^{|N|} X*_i0 ≤ |S+|; X_jT ≤ Σ_{(i,j) ∈ A} X_i0 + X_j0 for j = 1, 2, ..., |N|; Y_it = 0 for i = 1, 2, ..., |N|, t = 0, 1, ..., T; and Σ_{i=1}^{|N|} Σ_{t=0}^{T} (X_it − Y_it) is maximized. The claim is that X*_i0 is an optimal solution for MCP. Note that X*_i0, i = 1, 2, ..., |N|, is a feasible solution for MCP because Σ_{i=1}^{|N|} X*_i0 ≤ |S+| = k. Suppose there exists a solution to MCP, X_i0 for i = 1, 2, ..., |N|, selecting sets J', such that |⋃_{J_i ∈ J'} J_i| > |⋃_{J_i ∈ J'*} J_i|, where J'* is the collection of sets selected by X*_i0. Solution X_i0, i = 1, 2, ..., |N|, is a feasible solution for PCIM: Σ_{i=1}^{|N|} X_i0 ≤ |S+| = k. Therefore, the PCIM objective function for this solution is Σ_{i=1}^{|N|} Σ_{t=0}^{T} (X_it − Y_it) = |⋃_{J_i ∈ J'} J_i| + k > |⋃_{J_i ∈ J'*} J_i| + k = Σ_{i=1}^{|N|} Σ_{t=0}^{T} (X*_it − Y*_it), which is a contradiction. Thus, X*_i0, i = 1, 2, ..., |N|, is an optimal solution for MCP. □

Appendix B. Subgradient search algorithm for the Lagrangian dual problem

The Lagrangian dual problem (LD_u) is presented in Section 4.1 for finding the best (lowest) upper bound for the optimal solution of (P). Since Z^{LD_u}(u) is non-differentiable, its subgradient is employed in the implementation of a search algorithm for finding improved multipliers.

Definition 1. Vector s is a subgradient of Z^{LD_u}(u) at point u_0 if

Z^{LD_u}(u) ≥ Z^{LD_u}(u_0) + s(u − u_0),  ∀u.   (28)

A multiplier u* is optimal for (LD_u) iff zero is a subgradient of Z^{LD_u} at u*. At iteration k of the Subgradient search algorithm, the subgradient can be expressed as

s^k = |S+| − Σ_i X^k_i0,   (29)

where X^k_i0, i = 1, 2, ..., |N|, are the optimizers of (LR_{u^k}).
According to Fisher [29], the iterative subgradient search algorithm for generating the sequence of Lagrangian multipliers u_k, given an initial value u_0, is defined as

u_{k+1} = u_k − l_k s_k,  (30)

where l_k denotes the step size, and Z_LDu(u_k) → Z_LDu(u*) if l_k → 0 with Σ_{i=0}^{k} l_i → ∞ [32]. As l_k approaches zero, it is guaranteed that the subgradient algorithm does not overstep u*; since the sum of the step sizes approaches positive infinity, convergence to u* is theoretically guaranteed. At the end of each iteration of the subgradient algorithm, the step size can be updated using the quality of the solution obtained for (LDu) at that iteration:

l_k = λ_k (Z_LDu(u_k) − Z_0) / ‖s_k‖²,  (31)

where λ_k is a positive scalar and Z_0 is a lower bound on Z_LDu. The appropriate range of values for λ_k can be determined experimentally; the range 0 < λ_k ≤ 2 has been found to work well in practice [29]. The maximum value in the selected range is assigned as the initial value λ_0, and it is halved when Z_LDu fails to decrease for a given number of consecutive iterations of the subgradient algorithm [40]. There is no mathematical proof of optimality for the subgradient algorithm with this step-size rule. As (P) is a maximization problem, the subgradient algorithm stops when the gap between the lower bound, obtained by the ISR and ASB algorithms presented in Section 4.2, and the upper bound, obtained by the subgradient algorithm, becomes less than a preselected threshold value, which guarantees the quality of the solutions of the Lagrangian relaxation heuristic. Alternatively, the subgradient algorithm can be terminated after a predetermined number of iterations or a predefined run-time limit has been reached [67,78]. In this paper, the quality of the gradually improved solutions drives the stopping criterion for the subgradient algorithm, yielding a guaranteed-performance heuristic method for solving (P).
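The loop in Eqs. (29)–(31) can be sketched compactly. In the sketch below, `solve_lr` is a placeholder oracle that, given a multiplier u, returns the relaxation value Z_LR(u) and the seed decisions X_i0 attaining it; the parameter names (`patience`, `tol`) and the equality-style treatment of the relaxed budget constraint are illustrative assumptions, not details taken from the paper.

```python
# A sketch of the subgradient search in Eqs. (29)-(31). `solve_lr` is a
# hypothetical oracle for solving (LR_u); `patience` and `tol` are
# illustrative parameters, not values from the paper.
def subgradient_search(solve_lr, budget, z_lower,
                       u0=0.0, lam0=2.0, patience=3, tol=1e-3, max_iter=200):
    u, lam = u0, lam0
    best_upper = float("inf")   # best (lowest) upper bound found so far
    stall = 0
    for _ in range(max_iter):
        z_lr, x_seed = solve_lr(u)
        s = sum(x_seed) - budget                  # subgradient s_k, Eq. (29)
        if z_lr < best_upper - 1e-12:
            best_upper, stall = z_lr, 0
        else:                                     # halve lambda after stalls
            stall += 1
            if stall >= patience:
                lam, stall = lam / 2.0, 0
        if s == 0 or best_upper - z_lower < tol:
            break                                 # gap-based stopping rule
        step = lam * (z_lr - z_lower) / (s * s)   # step size l_k, Eq. (31)
        u -= step * s                             # multiplier update, Eq. (30)
    return best_upper, u
```

On a two-variable toy relaxation the loop closes the gap in a handful of iterations; in the PCIM setting the oracle would solve the Lagrangian relaxation of (P) instead.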
Fig. 8 shows how the heuristic and optimality gaps in the Lagrangian relaxation method change as the bounds close in around the optimal solution.

References

[1] Angst CM, Agarwal R, Sambamurthy V, Kelley K. Social contagion and information technology diffusion: the adoption of electronic medical records in US hospitals. Management Science 2010;56(8):1219–41.
[2] Aral S, Muchnik L, Sundararajan A. Engineering social contagions: optimal network seeding and incentive strategies. In: Winter conference on business intelligence; 2011.
[3] Aral S, Walker D. Identifying social influence in networks using randomized experiments. IEEE Intelligent Systems 2011;26(5):91–6.
[4] Aral S, Walker D. Tie strength, embeddedness, and social influence: a large-scale networked experiment. Management Science 2014;60(6):1352–70.
[5] Arthur D, Motwani R, Sharma A, Xu Y. Pricing strategies for viral marketing on social networks. In: Internet and network economics. Springer; 2009. p. 101–12.
[6] Baumeister RF, Bratslavsky E, Finkenauer C, Vohs KD. Bad is stronger than good. Review of General Psychology 2001;5(4):323.
[7] Becker MH. Sociometric location and innovativeness: reformulation and extension of the diffusion model. American Sociological Review 1970:267–82.
[8] Berger J, Sorensen AT, Rasmussen SJ. Positive effects of negative publicity: when negative reviews increase sales. Marketing Science 2010;29(5):815–27.
[9] Bhagat S, Goyal A, Lakshmanan LV. Maximizing product adoption in social networks. In: Proceedings of the fifth ACM international conference on Web search and data mining. ACM; 2012. p. 603–12.
[10] Bimpikis K, Ozdaglar A, Yildiz E. Competing over networks; 2015, submitted for publication. Available online at: 〈http://web.mit.edu/asuman/www/publications.htm〉.
[11] Borgatti SP. Centrality and network flow. Social Networks 2005;27(1):55–71.
[12] Borgatti SP. Identifying sets of key players in a social network. Computational and Mathematical Organization Theory 2006;12(1):21–34.
[13] Chen W, Collins A, Cummings R, Ke T, Liu Z, Rincon D, et al. Influence maximization in social networks when negative opinions may emerge and propagate. In: Proceedings of the SIAM international conference on data mining; 2011. p. 379–90.
[14] Chen W, Lakshmanan LV, Castillo C. Information and influence propagation in social networks. Synthesis Lectures on Data Management 2013;5(4):1–177.
[15] Chen W, Wang C, Wang Y. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2010. p. 1029–38.
[16] Chen W, Wang Y, Yang S. Efficient influence maximization in social networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2009. p. 199–208.
[17] Chen W, Yuan Y, Zhang L. Scalable influence maximization in social networks under the linear threshold model. In: 2010 IEEE 10th international conference on data mining (ICDM). IEEE; 2010. p. 88–97.
[18] Choi S, Gale D, Kariv S. Learning in networks: an experimental study. Unpublished manuscript; 2005.
[19] De Nooy W, Mrvar A, Batagelj V. Exploratory social network analysis with Pajek, vol. 27. Cambridge University Press; 2011.
[20] Deroïan F. Formation of social networks and diffusion of innovations. Research Policy 2002;31(5):835–46.
[21] Diaby M, Bahl HC, Karwan MH, Zionts S. A Lagrangian relaxation approach for very-large-scale capacitated lot-sizing. Management Science 1992;38(9):1329–40.
[22] Ding C, He X, Simon HD. Nonnegative Lagrangian relaxation of k-means and spectral clustering. In: Machine Learning: ECML 2005, Lecture Notes in Computer Science. Springer; 2005. p. 530–8.
[23] Dinh TN, Zhang H, Nguyen DT, Thai MT. Cost-effective viral marketing for time-critical campaigns in large-scale social networks. IEEE/ACM Transactions on Networking (TON) 2014;22(6):2001–11.
[24] Domingos P, Richardson M.
Mining the network value of customers. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2001. p. 57–66.
[25] Easley D, Kleinberg J. Networks, crowds, and markets: reasoning about a highly connected world. Cambridge University Press; 2010.
[26] Feige U. A threshold of ln n for approximating set cover. Journal of the ACM (JACM) 1998;45(4):634–52.
[27] Feldman E, Lehrer F, Ray T. Warehouse location under continuous economies of scale. Management Science 1966;12(9):670–84.
[28] Fisher ML. The Lagrangian relaxation method for solving integer programming problems. Management Science 1981;27(1):1–18.
[29] Fisher ML. The Lagrangian relaxation method for solving integer programming problems. Management Science 2004;50(Suppl. 12):1861–71.
[30] Ghose A, Ipeirotis PG. Estimating the helpfulness and economic impact of product reviews: mining text and reviewer characteristics. IEEE Transactions on Knowledge and Data Engineering 2011;23(10):1498–512.
[31] Girvan M, Newman ME. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 2002;99(12):7821–6.
[32] Goffin J. On convergence rates of subgradient optimization methods. Mathematical Programming 1977;13(1):329–47.
[33] Golub B, Jackson MO. Naive learning in social networks and the wisdom of crowds. American Economic Journal: Microeconomics 2010:112–49.
[34] Goodman ND, Ullman TD, Tenenbaum JB. Learning a theory of causality. Psychological Review 2011;118(1):110.
[35] Goyal A, Bonchi F, Lakshmanan LV, Venkatasubramanian S. On minimizing budget and time in influence propagation over social networks. Social Network Analysis and Mining 2012:1–14.
[36] Goyal A, Lu W, Lakshmanan LV.
CELF++: optimizing the greedy algorithm for influence maximization in social networks. In: Proceedings of the 20th international conference companion on World Wide Web. ACM; 2011. p. 47–8.
[37] Granovetter M. Threshold models of collective behavior. American Journal of Sociology 1978;83(6):1420.
[38] Hakimi SL. Optimum locations of switching centers and the absolute centers and medians of a graph. Operations Research 1964;12(3):450–9.
[39] Held M, Karp RM. The traveling-salesman problem and minimum spanning trees. Operations Research 1970;18(6):1138–62.
[40] Held M, Wolfe P, Crowder HP. Validation of subgradient optimization. Mathematical Programming 1974;6(1):62–88.
[41] Hinz O, Skiera B, Barrot C, Becker JU. Seeding strategies for viral marketing: an empirical comparison. Journal of Marketing 2011;75(6):55–71.
[42] Iyengar R, Van den Bulte C, Valente TW. Opinion leadership and social contagion in new product diffusion. Marketing Science 2011;30(2):195–212.
[43] Jaynes ET. Probability theory: the logic of science. Cambridge University Press; 2003.
[44] Katz E. The two-step flow of communication: an up-to-date report on an hypothesis. Public Opinion Quarterly 1957;21(1):61–78.
[45] Kempe D, Kleinberg J, Tardos É. Maximizing the spread of influence through a social network. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2003. p. 137–46.
[46] Kempe D, Kleinberg J, Tardos É. Influential nodes in a diffusion model for social networks. In: Proceedings of the 32nd International Conference on Automata, Languages and Programming. LNCS, ICALP'05, vol. 3580; 2005. p. 1127–38.
[47] Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance N. Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2007. p. 420–9.
[48] Leskovec J, Krevl A. SNAP Datasets: Stanford large network dataset collection.
〈http://snap.stanford.edu/data〉; June 2014.
[49] Lu Y, Jerath K, Singh PV. The emergence of opinion leaders in a networked online community: a dyadic model with time dynamics and a heuristic for fast estimation. Management Science 2013;59(8):1783–99.
[50] Macy MW. Chains of cooperation: threshold effects in collective action. American Sociological Review 1991:730–47.
[51] Mahajan V, Muller E, Sharma S. An empirical comparison of awareness forecasting models of new product introduction. Marketing Science 1984;3(3):179–97.
[52] Manchanda P, Xie Y, Youn N. The role of targeted communication and contagion in product adoption. Marketing Science 2008;27(6):961–76.
[53] Merton RK. Selected problems of field work in the planned community. American Sociological Review 1947:304–12.
[54] Nam S, Manchanda P, Chintagunta PK. The effect of signal quality and contiguous word of mouth on customer acquisition for a video-on-demand service. Marketing Science 2010;29(4):690–700.
[55] Newman ME. Spread of epidemic disease on networks. Physical Review E 2002;66(1):016128.
[56] Padgett JF, Ansell CK. Robust action and the rise of the Medici. American Journal of Sociology 1993:1400–34.
[57] Pan F, Nagi R. Multi-echelon supply chain network design in agile manufacturing. Omega 2013;41(6):969–83.
[58] Peres R, Muller E, Mahajan V. Innovation diffusion and new product growth models: a critical review and research directions. International Journal of Research in Marketing 2010;27(2):91–106.
[59] Richardson M, Domingos P. Mining knowledge-sharing sites for viral marketing. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2002. p. 61–70.
[60] Sangachin MG, Samadi M, Cavuoto LA. Modeling the spread of an obesity intervention through a social network. Journal of Healthcare Engineering 2014;5(3):293–312.
[61] Siomina I, Värbrand P, Yuan D. Pilot power optimization and coverage control in WCDMA mobile networks. Omega 2007;35(6):683–96.
[62] Susarla A, Oh J-H, Tan Y. Social networks and the diffusion of user-generated content: evidence from YouTube. Information Systems Research 2012;23(1):23–41.
[63] Tang J, Sun J, Wang C, Yang Z. Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2009. p. 807–16.
[64] Tansel BC, Francis RL, Lowe TJ. State of the art – location on networks: a survey. Part I. The p-center and p-median problems. Management Science 1983;29(4):482–97.
[65] Taylor SE. Asymmetrical effects of positive and negative events: the mobilization-minimization hypothesis. Psychological Bulletin 1991;110(1):67.
[66] Tenenbaum JB, Griffiths TL, Kemp C. Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences 2006;10(7):309–18.
[67] Trigeiro WW, Thomas LJ, McClain JO. Capacitated lot sizing with setup times. Management Science 1989;35(3):353–66.
[68] Tucker C. Identifying formal and informal influence in technology adoption with network externalities. Management Science 2008;54(12):2024–38.
[69] Valente TW, Frautschi S, Lee R, O'Keefe C, Schultz L, Steketee R, et al. Network models of the diffusion of innovations. Nursing Times 1994;90(35):52–3.
[70] Van den Bulte C, Joshi YV. New product diffusion with influentials and imitators. Marketing Science 2007;26(3):400–21.
[71] Van den Bulte C, Lilien GL. Medical innovation revisited: social contagion versus marketing effort. American Journal of Sociology 2001;106(5):1409–35.
[72] Wang C, Chen W, Wang Y. Scalable influence maximization for independent cascade model in large-scale social networks. Data Mining and Knowledge Discovery 2012;25(3):545–76.
[73] Wasserman S. Social network analysis: methods and applications, vol. 8. Cambridge University Press; 1994.
[74] Watts DJ, Dodds PS. Influentials, networks, and public opinion formation. Journal of Consumer Research 2007;34(4):441–58.
[75] Wejnert B.
Integrating models of diffusion of innovations: a conceptual framework. Sociology 2002;28(1):297.
[76] Whyte Jr. WH. The web of word of mouth. Fortune 1954;50(1954):140–3.
[77] Xu F, Tenenbaum JB. Word learning as Bayesian inference. Psychological Review 2007;114(2):245.
[78] Xu J, Nagi R. Solving assembly scheduling problems with tree-structure precedence constraints: a Lagrangian relaxation approach. IEEE Transactions on Automation Science and Engineering 2013;10(3):757–71.
[79] Yoganarasimhan H. Impact of social network structure on content propagation: a study using YouTube data. Quantitative Marketing and Economics 2012;10(1):111–50.
[80] Young HP. Individual strategy and social structure: an evolutionary theory of institutions. Princeton University Press; 2001.
[81] Zachary WW. An information flow model for conflict and fission in small groups. Journal of Anthropological Research 1977:452–73.

Probabilistic Graphical Models in Modern Social Network Analysis

Alireza Farasat · Alexander Nikolaev · Sargur N. Srihari · Rachael Hageman Blair

Rachael Hageman Blair, Department of Biostatistics, State University of New York at Buffalo. E-mail: [email protected]

Abstract The advent and availability of technology has brought us closer than ever through social networks. Consequently, there is a growing emphasis on mining social networks to extract information for knowledge and discovery. However, methods for Social Network Analysis (SNA) have not kept pace with the data explosion. In this review, we describe directed and undirected Probabilistic Graphical Models (PGMs), and describe recent applications to social networks.
Modern SNA is flooded with challenges that arise from the inherent size, scope, and heterogeneity of both the data and underlying population. As a flexible modeling paradigm, PGMs can be adapted to address some SNA challenges. Such challenges are common themes in Big Data applications, but must be carefully considered for reliable inference and modeling. For this reason, we begin with a thorough description of data collection and sampling methods, which are often necessary in social networks, and underlie any downstream modeling efforts. PGMs in SNA have been used to tackle current and relevant challenges, including the estimation and quantification of importance, propagation of influence, trust (and distrust), link and profile prediction, privacy protection, and news spread through micro-blogging. We highlight these applications, and others, to showcase the flexibility and predictive capabilities of PGMs in SNA. Finally, we conclude with a discussion of challenges and opportunities for PGMs in social networks.

Keywords Probabilistic Graphical Modeling · Social Network Analysis · Bayesian Networks · Markov Networks · Exponential Random Graph Models · Markov Logic Networks · Social Influence · Network Sampling

1 Introduction

Over forty years ago, social scientist Allen Barton stated that “If our aim is to understand people’s behavior rather than simply to record it, we want to know about primary groups, neighborhoods, organizations, social circles, and communities; about interaction, communication, role expectations, and social control.” (Barton, 1968 as reported in Freeman, 2004). This sentiment is fundamental to the concept of modularity. The importance of structural relationships in defining communities and predicting future behaviors has long been recognized, and is not restricted to the social sciences [48].
Social Network Analysis (SNA) has a rich history that is based on the defining principle that links between actors are informative. The advent and availability of Internet technology has created an explosion in online social networks and a transformation in SNA. The analysis of today’s social networks is a difficult Big Data problem, which requires the integration of statistics and computer science to leverage networks for knowledge mining and discovery [99]. SNA scientists once had to rely on tractable records of social interactions and experiments (e.g., Milgram’s small world experiment); now they have the luxury of accessing huge digital databases of relational social data. However, this gain in information comes at a price: many of the statistical tools for analyzing such databases break down due to the enormity of social networks and the complex interdependencies within the data. False discovery rates are not easily controlled, which makes the identification of meaningful signals and relationships difficult [42]. Moreover, sampling networks is typically required, which can propagate selection bias through any downstream inference procedures. SNA relies on diverse data representations and relational information, which may include (among others) tracked relationships among actors, events, and other covariate information [130]. Modeling social networks is especially challenging due to the heterogeneity of the populations represented, and the broad spectrum of information represented in the data itself. In this review, we focus on Probabilistic Graphical Models (PGMs), a flexible modeling paradigm, which has been shown to be an effective approach to modeling social networks [81, 91]. Modern applications, including the estimation of influence, privacy protection, trust (and distrust), microblogging, and web-browsing, are presented to highlight the flexibility and utility of PGMs in addressing current and relevant problems in modern SNA.
PGMs provide a compact representation of a high-dimensional joint probability distribution of variables by exploiting conditional independencies in the network of these variables; such a network, with local (in)dependency specifications, is called a model. PGM modeling is rooted in probabilistic reasoning and querying, and can also be used for generative purposes (sampling) [81]. In this review, we outline the basic theory, model parameter learning, and structural learning, but emphasize the practical application and implementation of these models to solve modern problems in SNA. We describe some of the unique statistical challenges that arise in using PGMs in SNA. These challenges are not isolated to PGMs. Rather, they propagate from the very foundation of the model (the data), through the local statistical models of the links and nodes, and finally to the graphical model. This review is organized from the bottom up: from data sampling, to directed and undirected graphical models. This paper is structured as follows. Section 2 provides an overview of data collection methods for SNA, reviews the challenges that arise in network sampling, and cites some network data repositories. In Section 3, directed probabilistic graphical models, static and dynamic, are discussed, accompanied by application examples in SNA. Section 4 turns to undirected graphical model types and their applications. Section 5 concludes the paper and outlines future directions and challenges for PGM-based research in SNA.

2 Data collection and sampling

Data collection from social networks is a fundamental challenge that inherently affects downstream analysis through sampling bias [11, 19]. The reproducibility and generalization of any statistical analysis performed depends critically on the sample population, and on how representative it is of the true population.
In traditional observational and clinical studies, randomization and large sample sizes are important aspects of experimental design [28]. The object of a study may be driven by attributes such as the presence of a disease, or a covariate such as profession, age, preferences, etc. In contrast, SNA focuses primarily on the relations among actors, not the actors themselves and their individual attributes. For this reason, the population is not usually comprised of actors sampled independently; rather, the sampling scheme is driven by the ties among the actors. Snowball sampling begins with an actor, or a set of actors, and moves through the network by sampling ties [13]. Snowball methods are useful for identifying modules within a population, e.g., leaders, sub-cultures, and communities. The inability to include isolated actors, which are not directly tied in but may be informative to the analysis, is a major limitation. Other disadvantages include the overestimation of connectivity, and the sensitivity of the sample to the initialization setting(s) of the snowball(s). Improvements on snowball sampling have been proposed to address some of these limitations [8, 44, 66, 133]. An alternative approach is to target actors in an ego-centric manner. There are two main sampling designs, with and without alter peer connections [63]. In this setting, a set of focal actors is selected, and their first-level ties are identified. In ego-centric networks with alter connections, those first-level ties are examined to determine the connections between them. Ego-centric networks without alter connections simply rely on focal actors and first-level ties; with this approach, extrapolation and generalization to the whole network is not possible. Online Social Networks (OSNs) present unique challenges due to their massive size and the nature of their heterogeneous attributes. A number of factors complicate the data collection process.
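Snowball sampling can be sketched in a few lines. The adjacency-dict representation and the `waves` parameter below are illustrative assumptions; the sketch also makes the stated limitation visible, since actors with no ties into the sampled region are never reached.

```python
# A minimal sketch of snowball sampling; `graph` is assumed to be an
# adjacency dict {node: set_of_neighbors}, and `waves` bounds how many
# times the snowball is expanded.
def snowball_sample(graph, seeds, waves):
    sampled = set(seeds)
    frontier = set(seeds)
    for _ in range(waves):
        next_frontier = set()
        for node in frontier:          # follow ties out of the current wave
            next_frontier |= graph[node] - sampled
        sampled |= next_frontier
        frontier = next_frontier
    return sampled
```

On a small graph, one wave from a single seed returns the seed and its neighbors, two waves add the neighbors' neighbors, and an isolated node is never included regardless of the number of waves.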
Individuals can customize personal privacy settings, limiting crawlers from obtaining information and ultimately creating a missing data problem for the analyst. The diversity and dynamic nature of the data itself makes pages difficult to parse for collection purposes. Furthermore, sampling is critical for tractable inference and analyses of large-scale OSNs. In most OSNs, we are faced with hidden populations, i.e., with unknown population size or unknown underlying distributions of the variables (edges or actors). In these cases, access to the network is facilitated through neighbors only. Crawling (through neighbors), either by random walks or graph traversals, is one of the most widely used network exploration techniques for OSNs.

– Random walk: The Metropolis-Hastings algorithm is a widely used Markov Chain Monte Carlo (MCMC) method for sampling social networks [26]. The random walk starts at a random (or targeted) node and proceeds iteratively, moving from node i to node j according to a transition probability. As n → ∞, the sampling distribution approaches the stationary distribution of actor characteristics, as if each sampled individual were drawn uniformly from the underlying population. In practice, heuristic diagnostics are performed to assess convergence; the success of the method can also depend on the starting point of the chain. Even with multiple chains, mixing can be slow and the chain can get stuck in regions of the graph. Note that these features are common to applications of MCMC methods, and are not restricted to OSNs [52].

– Graph traversals: Several graph traversal methods have been applied to OSNs. These techniques differ only slightly in the order in which they systematically visit the nodes of the network. Breadth-first search (BFS) and snowball sampling visit the graph through neighbor nodes [57]. Depth-first search (DFS) explores the graph from the seed node through the children nodes, and backtracks at dead-ends.
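A minimal sketch of a Metropolis-Hastings random walk for node sampling follows. The uniform-neighbor proposal with acceptance probability min(1, deg(i)/deg(j)) is a standard construction that yields a uniform stationary distribution over nodes (removing degree bias); it is used here for illustration and is not a detail taken from the text.

```python
import random

# Sketch of a Metropolis-Hastings random walk (MHRW) over a graph given as
# an adjacency dict {node: list_of_neighbors}.
def mhrw_sample(graph, start, n_steps, seed=0):
    rng = random.Random(seed)
    samples, current = [], start
    for _ in range(n_steps):
        proposal = rng.choice(graph[current])
        # accept with probability min(1, deg(current)/deg(proposal))
        if rng.random() < min(1.0, len(graph[current]) / len(graph[proposal])):
            current = proposal
        samples.append(current)     # on rejection the walk self-loops
    return samples
```

On a star graph, a plain random walk would visit the hub half the time, whereas the MHRW chain visits each node with roughly equal frequency once it has mixed.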
Factors such as sample size, as well as seed and algorithm choice, can introduce bias into the statistical analysis of a network. Several authors have performed detailed investigations of the efficiency and bias associated with sampling algorithms using different OSNs [18, 92]. Breadth-first search (BFS) is the most widely used method for OSN sampling and has been shown to be biased toward high-degree nodes [87, 160]. Variants of the M-H algorithm have been proposed: Metropolized Random Walk with Backtracking (MRWB), M-H Random Walk (MHRW), Re-Weighted Random Walk (RWRW) and Unbiased Sampling to Reduce Self Loops (USRS), which aim to reduce or correct sample bias [54, 125, 139, 152].

Publicly Available Data: Several data resources have been created to house a wealth of diverse social network data. These resources are usually open source, requiring, at a minimum, a user agreement. Leveraging these resources is ideal for the development and testing of methodologies related to SNA. Max Planck researchers have released OSN data used in publications, which includes crawled data from Flickr, YouTube, Wikipedia and Facebook [20, 21, 101, 149]. Several directed OSNs have been released in the Stanford Network Analysis Package (SNAP), e.g., from Epinions, Amazon, LiveJournal, Slashdot and Wikipedia voting [138]. Recently, a Facebook dataset collected with MHRW was released, which exhibited convergence properties and was shown to be representative of the underlying population [54]. However, the MHRW and UNI data sets contain only link information, thereby prohibiting attribute-based analyses. Document classification datasets have also been released [53]. A sample from the CiteSeer database contains 3,312 publications from one of six classes, and 4,732 links. The Cora dataset consists of 2,708 publications classified into seven categories, and the citation network has 5,429 links.
Each publication is described by a binary word vector indicating the presence of certain words from a collection of 1,433. WebKB consists of 877 scientific publications from five classes, contains 1,601 links, and includes binary word attributes similar to Cora. Terrorism databases are also publicly available [38, 141, 142]. The most extensive is the RAND Database of Worldwide Terrorism Incidents, which details terrorist attacks in nine distinct regions of the world across the time-span 1968–2009 (dates vary slightly depending on region) [38]. Several well-known challenges may arise in the analysis and representation of terrorist network data, including incomplete information, latent variables influencing node dynamics, and fuzzy boundaries between terrorists, supporters of terrorists, and the innocent [85, 136]. An alternative option to access data is to enroll in data challenges, which are often posed by corporations and operators of the networks themselves. For example, the Nokia mobile data challenge data was released in 2012 [90]. The data follows 200 users throughout the course of a year, and includes: usage (full call and message logs), status (GPS readings, operation mode), environment (accelerometer samples, wi-fi access points, bluetooth devices), personal (full contact list, calendar), and user profile. Formal requests are required to use this data, ensuring its use for research and development, and prohibiting commercial use. Twitter has just posed TREC 2013, a collection of 240 million tweets (statuses) collected over a two-month period [100]. This is the third year of TREC releases. The use of this data requires registration for a competition; the 2013 competition centers on real-time ad hoc search tasks.
3 Directed Probabilistic Graphical Models

Bayesian Networks (BNs) are a special class of PGMs that capture directed dependencies between variables, which may represent cause-and-effect relationships. We describe two different branches of BNs, static and dynamic, which may be used to model social networks at a single time point or across a series of time points, respectively. Both rely on the Markov assumptions, which enable the compact representation of the high-dimensional joint probability distribution of the variables in the model. Arguably, the use of directed graphs in SNA has been somewhat limited, although the applications themselves are diverse. We describe the basic principles of these directed PGMs and motivate them with applications in the literature, which showcase their utility in SNA. Static Bayesian Networks utilize data from a single snapshot of a social community at a given time point, described by a Directed Acyclic Graph (DAG). A DAG conveys precise information regarding the conditional independencies between the modeled variables (nodes). The resulting graph, G, can be translated directly into a factored representation of the joint distribution [67, 91]. BNs obey the Markov condition, which states that each variable, X_i, is independent of its non-descendants, given its parents in G. Under these assumptions, a BN for a set of variables {X_1, X_2, …, X_n} is a network with a structure that encodes the conditional independence relationships:

P(X_1, X_2, …, X_n) = P(G) ∏_{i=1}^{n} P(X_i | pa(X_i), Θ_i),

where P(G) is the prior distribution over the graph G, pa(X_i) are the parent nodes of child X_i, and Θ_i denotes the parameters of the local probability distribution. Depending on the data and modeling objectives, BN learning may require up to two layers of inference: structural and parameter learning.
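The factorization above is easy to make concrete. The three-node network (rain → wet ← sprinkler) and all CPT values below are invented purely for illustration; the point is that the joint probability of a full assignment is just the product of the local conditionals P(X_i | pa(X_i)).

```python
# A concrete instance of the BN factorization; network and CPT values
# are invented for illustration only.
parents = {"rain": [], "sprinkler": [], "wet": ["rain", "sprinkler"]}
cpt = {
    "rain":      {(): 0.2},                   # P(rain = 1)
    "sprinkler": {(): 0.5},                   # P(sprinkler = 1)
    "wet":       {(0, 0): 0.05, (0, 1): 0.8,  # P(wet = 1 | rain, sprinkler)
                  (1, 0): 0.9,  (1, 1): 0.99},
}

def joint(assign):
    """P(X_1,...,X_n) as the product of local conditionals P(X_i | pa(X_i))."""
    p = 1.0
    for var, pa in parents.items():
        p1 = cpt[var][tuple(assign[q] for q in pa)]
        p *= p1 if assign[var] == 1 else 1.0 - p1
    return p
```

Because the factors are proper conditional distributions, summing `joint` over all eight 0/1 assignments returns 1, which is a quick sanity check on any hand-built CPT.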
Identifying the DAG that best explains the data is an NP-hard problem [27]. Structural inference can be conducted by sampling the posterior distribution to obtain an ensemble of feasible graphs, or through the implementation of a greedy hill-climbing algorithm, to identify a single graph structure that best approximates the Maximum a Posteriori (MAP) probabilities [68]. In many applications of SNA, the structure is often assumed, at least to some degree. In this case, the statistical inference problem is local parameter inference conditional on the assumed structure of the network. The directionality and causal structure of the inferred model make the BN an attractive modeling paradigm for social networks that captures and conveys cause-and-effect relationships in a problem setting. Such examples may manifest in decision making (influence). Screen-Based Bayes Net Structure (SBNS) was developed as a search strategy for large-scale data, which relies on the adopted assumption of sparsity in the overall network structure [55]. Sparsity in BNs is a popular assumption that can safeguard against over-fitting [68]. SBNS enforces the sparsity through a two-stage process, which frames the structural learning problem as a Market Basket Analysis task [12]. The algorithm relies on the theory of frequent sets and support to first screen for local modules of nodes, and then connect them through a global structure search. The Market Basket framework lends itself to transaction-style data, which is by nature large, sparse and binary. In this case, actors are assumed to be linked to each other indirectly through items or events (Figure 1A). The learning problem is to identify an influence graph based on derived features of the binary transaction data. The method was shown to be effective for modeling a variety of SNs, including citation networks, collaboration data, and movie appearance records [12]. Koelle et al.
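Greedy hill-climbing over structures, as mentioned above, can be sketched schematically: at each step, apply the best single edge addition or deletion that keeps the graph acyclic. The `score` callable below is a placeholder for a decomposable model score (e.g., a MAP or BIC score); this is a generic sketch, not the SBNS algorithm.

```python
from itertools import permutations

# A schematic score-based greedy hill-climb over DAG structures; `score`
# is a hypothetical callable mapping an edge set to a model score.
def creates_cycle(edges, new_edge):
    # DFS from the head of new_edge; reaching its tail closes a cycle
    u, v = new_edge
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(w for (p, w) in edges | {new_edge} if p == node)
    return False

def hill_climb(nodes, score, max_steps=100):
    edges, best = set(), score(set())
    for _ in range(max_steps):
        moves = [edges | {e} for e in permutations(nodes, 2)
                 if e not in edges and not creates_cycle(edges, e)]
        moves += [edges - {e} for e in edges]       # single-edge deletions
        cand_score, cand = max(((score(m), m) for m in moves),
                               key=lambda t: t[0])
        if cand_score <= best:
            break                                   # local optimum reached
        best, edges = cand_score, cand
    return edges, best
```

The acyclicity check is what keeps the search inside DAG space; as the text notes, such greedy search only approximates the MAP structure and can stop at a local optimum.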
proposed applications of BNs to SNA for the prediction of novel links and pre-specified node features (e.g., leadership potential) [80]. The authors emphasize the ability of BNs to account for uncertainty, noise, and incompleteness in the network. For example, topology-based network measures such as degree centrality, which are often used as surrogates for importance, are subject to summarization over incomplete and sometimes erroneous data. Comparatively, a BN affords more flexibility, enabling measures such as importance to be estimated in a more data-dependent manner. Koelle et al. provide an example of combining topology-based network measures with covariate information (Figure 1B). Directed inference of this type leverages small local models, which can be naturally translated to regression or classification problems, depending on the child node (response variable). In this setting, the local BN can be evaluated at the node level, ranked probability estimates can be used for predictive purposes, and the output serves as a surrogate for model fit on a given structure.

Privacy protection is a major concern amongst users in online social networks [65]. Generally, people prefer that their personal information is shared in small circles of friends and family, and shielded from strangers [24]. Despite this common desire, relatively simple BNs have been shown to be successful in the invasion of privacy through the inference of personal attributes that have been shielded through privacy settings [65]. These BNs operate under the often accurate assumption that friends in social circles are likely to share common attributes. In 2006, the recommendation by He et al. to improve privacy was to hide friend lists through privacy settings, and to request that friends hide their personal attributes. Practically speaking, choosing the optimal privacy settings is complex, and can be tedious and difficult for an average user [96].
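The attribute-inference attack just described can be sketched with a tiny Naive Bayes classifier that guesses a user's hidden attribute from their friends' visible attributes, under the homophily assumption that friends share attributes. This is only an illustration of the idea, not the cited systems; the attribute values, priors, and likelihoods below are all invented.

```python
# Naive Bayes guess of a hidden attribute from friends' visible attributes.
# Assumes friends' attributes are conditionally independent given the
# user's own (hidden) attribute -- the standard Naive Bayes simplification.

def naive_bayes_attribute(friend_attrs, prior, likelihood):
    """Return the MAP attribute value given friends' visible attributes.

    prior[v]          -- P(user attribute = v)
    likelihood[v][f]  -- P(a friend shows attribute f | user attribute = v)
    """
    scores = {}
    for v in prior:
        score = prior[v]
        for f in friend_attrs:
            score *= likelihood[v].get(f, 1e-6)  # tiny floor for unseen values
        scores[v] = score
    return max(scores, key=scores.get)

# Invented example: a user with four friends, three of whom display "A".
prior = {"A": 0.5, "B": 0.5}
likelihood = {"A": {"A": 0.8, "B": 0.2},
              "B": {"A": 0.3, "B": 0.7}}
guess = naive_bayes_attribute(["A", "A", "B", "A"], prior, likelihood)
```

Even with uniform priors, a handful of friends' visible attributes is enough to tip the posterior, which is why hiding one's own attribute provides limited protection when friends' attributes remain visible.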
In 2010, a privacy wizard template was proposed, which automates a person's privacy settings based on an implicit set of rules derived using Naive Bayes (the simplest BN) or Decision Tree methods [43].

On the other side of the application spectrum, BNs are useful for recommending products and services to users, taking into account their interests, needs, and communication patterns. Belief propagation has been used to summarize belief about a product and propagate that belief through a BN [9, 159]. Belief propagation is the process in which node marginal distributions (beliefs) are updated in light of new evidence [82]. In the case of a BN, evidence (e.g., opinions or ratings) is absorbed and propagated through a computational object known as a junction tree, resulting in updated marginal distributions. Comparing the network marginals before and after evidence is entered and propagated conveys a system-wide effect of influence(s), and yields insights into how perception or ratings change when recommendations are passed through a network. Despite its simplicity, the BN approach has been shown to be competitive with the more classical Collaborative Filtering (CF)-based recommendation [158]. Trust (and distrust) can be highly variable, dynamic processes, which depend not only on the distance from a recommender, but also on the characteristics of the network users [88, 153]. Accounting for trust in recommendation systems is an open area of research.

Microblogging networks represent another effective venue for rapidly disseminating information and influence throughout a community. Twitter is the most well-known microblogging network, in which posts (tweets) are short and time-sensitive with respect to the reference of current topics [89]. Users within microblogging networks of this type participate through the act of following and being followed, which naturally gives rise to directed associations [75].
With over 50 million tweets submitted daily, ranking and querying microblogs has become an important and active area of open research [25, 97, 105, 110, 114]. Jabeur et al. proposed a retrieval model for tweet searches, which takes into account a number of factors, including hashtags, the influence of the microbloggers, and time [72, 73]. A query relevance function was developed based on a BN that leverages the PageRank algorithm to estimate parameters, such as influence, in the model (Figure 1C). The retrieval model was shown to outperform traditional methods for information retrieval on Twitter data from the TREC Tweets 2011 corpus [111].

[Figure 1 panels: (A) individuals linked through events and the inferred social-influence graph; (B) a local model combining attributes (sex, education, religion), centrality measures (degree centrality, link certainty), and derived metrics (individual importance, future leadership potential); (C) a layered Twitter search over microbloggers, tweets, and terms, with edge types (following, mentioning, tagging, publishing, re-tweeting, sharing) and node types (microblogger, tweet, reply, retweet, hashtag, web resource).]

Fig. 1 Simplified schematics of select examples of Bayesian Networks in social networks. (A) Inferring sparse Bayesian influence based on transaction-style data, which links actors to events. (B) Local models can be used to predict local metrics, such as individual importance or leadership potential, from attributes and centrality measures on the network itself. (C) Twitter is a microblogging community, which can be queried using a retrieval model described by a Bayesian Network.

Thus far, the BNs discussed summarize information at a single time point.
This represents an oversimplification of the true nature of the networks described, which are inherently dynamic [137]. In the described SN applications, the dynamic aspects are simplified by extracting data from a snapshot (or a series of snapshots) of the SN across a time period. The granularity of the discretization, whether coarse or fine, can bias the results of the analysis. Discretization can also give rise to many of the issues related to data collection discussed in Section 2. Modeling the dynamics of a network over its time course can be achieved in the BN framework with additional modeling assumptions.

Dynamic Bayesian Networks

Dynamic Bayesian Networks (DBNs) provide compact representations for encoding structured probability distributions over arbitrarily long time courses [103]. State-space models, such as Hidden Markov Models (HMMs) and Kalman Filter Models (KFMs), can be viewed as special classes of the more general DBN. Specifically, KFMs require unimodal linear Gaussian assumptions on the state-space variables. HMMs do not allow for factorizations within the state space, but can be extended to hierarchical HMMs for this purpose. DBNs enable a more general representation of sequential or time-course data. DBN modeling is achieved through the use of template models, which are instantiated, i.e., duplicated, over multiple time points. The relationships between the variables within a template are fixed, and represent the inherent dependencies between ground variables in the model. The objective is to model a template variable over a discretized time course, X^0, ..., X^T, and represent P(X^{0:T}) as a function of the templates over the range of time points. Reducing the temporal problem to conditional template models makes the problem computationally tractable, but requires the specification of a fixed structure across the entire time trajectory.
In a DBN, the probability for a random variable X spanning the time course can be given in factored form,

P(X^{0:T}) = P(X^0) ∏_{t=0}^{T-1} P(X^{t+1} | X^t),

where X^0 represents the initial state, and the conditional probability terms of the form P(X^{t+1} | X^t) convey the conditional independence assumptions. The conditional representation of the likelihood is similar in spirit to the static BN representation, but conveys conditional independence with respect to time. The Markov assumption enables this factorization, and has different, yet analogous, meanings in static and dynamic BNs. In a DBN, the Markov assumption expresses the memorylessness property, i.e., that the current state depends on the previous state and is conditionally independent of the more distant past: X^{t+1} ⊥ X^{0:t-1} | X^t. Comparatively, in static BNs, the Markov assumption only captures nodes' independence of their non-descendants, given the states of their parents. Both DBNs and static BNs represent joint distributions of random variables.

Similar to static BNs, DBNs may also require up to two layers of inference, structural and parameter learning, and the learning paradigms are rather similar. Structural learning is typically achieved by the same scoring strategies, but with the added constraint that the structure must repeat over time [49]. Such a constraint alleviates the computational burden for search strategies. Additionally, the best initial structure can be searched for independently from the remainder of the time course. The search is performed either through greedy hill climbing or sampling. Several options exist for parameter learning, including junction trees, belief propagation, and the EM algorithm [33, 78, 132]. Despite the fact that social networks are typically inherently dynamic, the applications of DBNs in SNA have been limited.
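The factored form above can be sketched for the simplest possible template model, a two-state chain in which the same transition template is instantiated at every time step. The initial distribution and transition probabilities below are invented for illustration.

```python
# DBN factorization P(X^{0:T}) = P(X^0) * prod_t P(X^{t+1} | X^t)
# for a two-state chain. The transition "template" is reused at each step.

initial = {0: 0.6, 1: 0.4}                       # P(X^0)
transition = {(0, 0): 0.7, (0, 1): 0.3,          # keyed as (x_t, x_{t+1})
              (1, 0): 0.4, (1, 1): 0.6}

def trajectory_probability(states):
    """Probability of one trajectory x^0, ..., x^T under the template model."""
    p = initial[states[0]]
    for t in range(len(states) - 1):
        p *= transition[(states[t], states[t + 1])]
    return p

p_traj = trajectory_probability([0, 0, 1, 1])
```

Because of the Markov assumption, the joint over an arbitrarily long trajectory is specified by just the initial distribution and one transition table, rather than a table exponential in T.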
Importantly, there have been many attempts to model social networks probabilistically over time, but not in the strict PGM context, which is the focus of this review; many of these advances are discussed in Section 5. Chapelle et al. used DBNs to model web users' browsing history [22]. The DBN extends the traditional and widely-used cascade model for browsing behavior to a more general model [77]. The dynamic studied here is that of click sequences, which is illustrated in Figure 2 for a single click (one time instance). The model takes into account the information at the query and session levels, differentiating perceived/actual attraction (a_u and A_i, respectively) and perceived/actual satisfaction (s_u and S_i, respectively) with links. At each click (time step), the hidden binary variables for examination (E_i) and satisfaction (S_i) track the time progression to predict future clicks. The DBN approach was shown to outperform traditional methods, and highlighted the sensitivity of click modeling to measures of relevance and popularity at the query level.

DBNs and HMMs are very popular in the area of speech recognition [115, 162]. Meetings are social events, in which valuable information is exchanged mainly through speech. Effectively processing, capturing, and organizing this information can be costly, but is critical in order to maximize the impact and information flow for participants. Dielmann et al. cast the problem of meeting structuring as a DBN, which partitions meetings into sequences of actions or phases based on audio [35]. The data included speaker order, location detected from a microphone array, talk rate, pitch, and overall energy (enthusiasm). DBNs outperformed baseline HMMs in detecting meeting actions in a smart room, such as dialogue, notes at the board, computer presentations, and presentations at the board. Twitter, and microblogs in general, have become a major resource for the media to obtain breaking news or learn of the occurrence of a critical event.
Recently, Sakaki et al. modeled Twitter activity using KFMs in an effort to identify events and event locations [124]. Each Twitter user is assumed to represent a sensor that monitors tweet features such as keywords, the locations of tweets, their length, and their content. Support Vector Machines (SVMs) are first used for event classification, followed by a Kalman filter to identify the location and the path itself. Location information of the quake is estimated through parameter learning at each time point. Through tweet modeling, the authors were able to predict 96% of Japan's earthquakes of a certain magnitude. Furthermore, they developed a reporting system, Toretter, which is quicker than the existing government reporting system in warning registered individuals through email of an impending quake [74]. This important and highly cited work can be generalized in this paradigm to model and predict other events.

Fig. 2 An example of a time instance in a DBN used for click modeling in a browser. The temporal dimension is the click sequence, which is progressed through binary latent variables depicting satisfaction (S_i) and examination (E_i). Attraction (A_i) and satisfaction (S_i) are modeled at the session level, as well as at the query level (a_u and s_u), which is assumed to be time-invariant.

4 Undirected Probabilistic Graph Models

Markov Networks (MNs), also known as Markov Random Fields (MRFs), are PGMs with undirected edges. Similar to directed BNs, an MN graph is a representation of the joint distribution between variables (nodes), where the absence of an edge between two nodes implies conditional independence between those nodes, given the other nodes in the network.
In this review, we restrict our focus to MNs, Markov Logic Networks (MLNs), and Exponential Random Graph Models (ERGMs), which can be viewed as generalizations of random graphs [47] and are widely used in SNA [109]. The basic formulation of these models and their utility in SNA will be highlighted.

Markov Networks can be decomposed into smaller complete sub-graphs known as cliques. A clique is a maximal clique if it cannot be extended to include additional adjacent nodes. Clique representation enables a compact factorization of the probability density function (pdf). Specifically, the pdf captured by a graph G can be represented in the form:

P(X) = (1/Z) ∏_{C∈Ω} ψ_C(X_C),   (1)

where C is a maximal clique in the set of maximal cliques Ω, and ψ_C(X_C) is the clique potential. The clique potentials are positive functions that capture the variable dependence within the cliques [82]. The normalizing constant, also known as the partition function, is given as:

Z = ∑_{X∈χ} ∏_{C∈Ω} ψ_C(X_C).

Each clique potential in an MN is specified by a factor, which can be viewed as a table of weights for each combination of values of the variables in the potential. In some special cases of MNs, such as log-linear models [104], clique potentials are represented by a set of functions, termed features, with associated weights (i.e., φ_C(X_C) = log(ψ_C(X_C)), where φ_C(X_C) is a feature derived from the values of the variables in the set X_C). The Hammersley-Clifford theorem specifies the conditions under which a positive probability distribution can be represented as an MN. Specifically, the given representation (Equation 1) implies conditional independencies between the maximal cliques and is, by definition, a Gibbs measure [61]. MN specification problems, including parameter estimation and structure learning from data, can be quite challenging.
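Before turning to those estimation difficulties, the factorization in Equation 1 can be verified by brute force on a toy network. The three-variable chain A-B-C below, with two maximal cliques {A, B} and {B, C}, and all potential values are invented for this sketch; brute-force enumeration of Z is of course only feasible for tiny models.

```python
from itertools import product

# Toy Markov network over three binary variables with two invented clique
# potentials psi_AB and psi_BC, each favoring agreement between its pair:
#   P(a, b, c) = psi_AB(a, b) * psi_BC(b, c) / Z

psi_ab = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
psi_bc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

# Partition function: sum of the unnormalized product over all assignments.
Z = sum(psi_ab[(a, b)] * psi_bc[(b, c)]
        for a, b, c in product((0, 1), repeat=3))

def prob(a, b, c):
    """Normalized joint probability from the clique factorization."""
    return psi_ab[(a, b)] * psi_bc[(b, c)] / Z
```

The difficulty alluded to in the text is already visible here: Z couples every potential in the model, so for n binary variables the naive sum runs over 2^n assignments.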
The main difficulty in MN parameter estimation is that the maximum likelihood problem formulated with Equation 1 has no analytical solution, due to the complex expression for Z [93]. The problem of finding the optimal structure of G using available data [76] is, as for BNs, even more challenging [16]. Existing approaches to structure learning are either constraint-based or score-based (see [37, 81, 106, 123, 129, 161] for more details).

MNs have found use in SNA with the emergence of online social networks (OSNs) and digital social media (see [14] for a review of key problems in SNA). The need to capture non-causal dependencies within and between data instances (e.g., profile information) and observed relationships (e.g., hyperlinks) in these applications is exacerbated by the presence of missing or hidden data in OSNs [156]. A popular problem instance in this domain, that of (missing) user profile prediction, has been attacked using MNs [107, 117, 140]. Along with the problem of predicting missing profiles, link prediction is among the most prominent problems in Big Data SNA. Multiple variations of MNs that have been used to estimate the probability that an (unobserved) link exists between nodes include Markov Logic Networks, Relational Markov Networks, Relational Bayesian Networks, and Relational Dependency Networks [5, 23, 143, 145]. Detection of community substructures is another area of MN application [41, 108]. Social network clustering is especially challenging in a dynamic context, e.g., in Mobile Social Networks [70]. Wan et al. employed undirected graphical models (i.e., Conditional Random Fields) constructed from mobile user logs that include both communication records and user movement information [151].
Communities can then be discovered by examining and subsetting (cutting) network relationships according to labels of interest, and through the use of weighted community detection algorithms. Relational Markov Networks can be used for labeling relationships in a social network with given content and link structure [150]. Several generative models have been proposed, which are motivated by MNs, and explain the effects of selection and influence (e.g., see [2]). Modeling the channeled spread of opinions and rumors, known more generally as diffusion modeling, is an active area of research in SNA [10, 94, 119]. Several applications of diffusion models have been proposed for social networks, including, but not limited to, the spread of information [30], viral marketing [77], the spread of diseases [7], and the spread of cooperation [127]. Given a social network, for each node a corresponding random variable indicates the state of the node (e.g., product or technology adoption), and links in the network represent dependency [155].

Markov Logic Networks employ a probabilistic framework that integrates MNs with first-order logic, such that the MN weights are positive for only a small subset of meaningful features viewed as templates [117]. Formally, let F_i denote a first-order logic formula, i.e., a logical expression comprising constants, variables, functions, and predicates, and let w_i ∈ R denote a scalar weight. An MLN is then defined as a set of pairs (F_i, w_i). From the MLN, the ground Markov network, M_{L,C}, is constructed [117] with the probability distribution [145]

P(X = x) = (1/Z) exp( ∑_i w_i n_i(x) ),   (2)

where n_i(x) is the number of true groundings of F_i in x, i.e., the number of instantiated formulae that hold. Figure 3 gives an example of a ground MLN represented as a pairwise MN for two individuals [104].
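A toy rendering of Equation 2 illustrates how formula weights shape world probabilities. The example below uses a single invented formula F1 over two ground atoms, Smokes(A) and Cancer(A), and enumerates all possible worlds by brute force; it is a sketch of the probability semantics only, not an implementation of a full MLN system.

```python
from itertools import product
import math

# One weighted formula F1: Smokes(A) => Cancer(A), with an invented weight.
w1 = 1.5

def n1(smokes, cancer):
    """Number of true groundings of F1 in a world (0 or 1 here, since
    there is a single grounding over the single constant A)."""
    return int((not smokes) or cancer)

# Enumerate all worlds x = (smokes, cancer) and apply Equation 2.
worlds = list(product((0, 1), repeat=2))
unnorm = {x: math.exp(w1 * n1(*x)) for x in worlds}
Z = sum(unnorm.values())
p = {x: u / Z for x, u in unnorm.items()}
```

The only world that violates F1, (smokes=1, cancer=0), is not ruled out as it would be in pure logic; it is merely down-weighted by a factor of exp(w1), which is the key softening that MLNs add to first-order rules.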
Many problems in statistical relational learning, such as link prediction [39], social network modeling, collective classification, link-based clustering, and object identification, can be formulated using instances of MLNs [117].

Fig. 3 An example of an MLN with two entities (individuals), A and B, the unary relations "smokes" and "cancer", and the binary relation "friend". The ground predicates are denoted by eight elliptical nodes. Two formulas, F1 ("someone who smokes has cancer") and F2 ("friends either both smoke or both do not smoke"), are captured. There exist two groundings of F1 (illustrated by the edges between the "smokes" and "cancer" nodes) and four groundings of F2, captured by the rest of the edges [145].

Dierkes et al. used MLNs to investigate the influence of Mobile Social Networks on consumer decision-making behavior. With the call detail records represented by a weighted graph, MLNs were employed in conjunction with logit models as the learning technique, based on lagged neighborhood variables. The resulting MLNs were used as predictive models for the analysis of the impact of word of mouth on churn (the decision to abandon a communication service provider) and purchase decisions [36]. As mentioned above, link mining and link prediction problems can also be addressed using MLNs, since MLNs combine logical and probabilistic reasoning in a single framework [40, 131]. Furthermore, the ability of MLNs to represent complex rules by exploiting relational information makes them an appropriate alternative for collective classification (e.g., classification of publications in a citation network, or of hyperlinked webpages) [31, 34].

The Ising model and its variations form a subclass of MNs with foundations in theoretical physics [6]. The Ising model is a discrete and pairwise MN, and is popular in applications in part due to its simplicity [82]. The variables in the model, X_1, ...,
X_p, are assumed to be binary, and their joint probability is given as:

p(X; Θ) = exp( ∑_{(j,k)∈E} θ_{jk} X_j X_k − Φ(Θ) )   for all X ∈ χ,

where χ = {0, 1}^p, and Φ(Θ) is the log of the partition function,

Φ(Θ) = log ∑_{x∈χ} exp( ∑_{(j,k)∈E} θ_{jk} x_j x_k ).

Special, efficient methods exist for learning the Ising model parameters from data [116]. While the model was originally found useful for understanding magnetism and phase transitions, its utility has since expanded to image processing, neural modeling, and studies of tipping points in economic and social domains [1]. In SNA, the Ising model can be employed to analyze factors such as network sub-structures and nodal features affecting the opinion formation process. A classical example within this area is a study of medical innovation spread, namely the adoption of the drug tetracycline by 125 physicians in four small cities in Illinois [17]. Figure 4 depicts the physicians' advisory network from a data set prepared by Ron Burt from the 1966 data collected by Coleman, Katz, and Menzel [29] about the spread of medical innovation. The figure illustrates the physicians' network at two different time points and shows how physicians changed their opinions and adopted the new medication over time.

Fig. 4 The spread of new drug adoption through an advisory network of physicians (nodes marked as adopted or not adopted): two snapshots at different time points, about two years apart (from left to right). The growth dynamics in the number of adopters can be analyzed with an Ising model.

Recently, the Ising model has been used to examine social behaviors [148], including collective decision making, opinion formation, and the adoption of new technologies or products [50, 60, 84]. For example, Fellows et al. proposed a random model of the full network by modeling nodal attributes as random variates.
They utilized the new model formulation to analyze a peer social network from the National Longitudinal Study of Adolescent Health [45]. Agliari et al. proposed a model to extract the underlying dynamics of social systems based on diffusive effects and people's strategic choices to convince others [3]. Through the adaptation of an Ising-model-based cost function for social interactions between individuals, they showed by numerical simulation that a steady state is reached through the natural dynamics of social systems.

Exponential Random Graph Models (ERGMs) [154], also known as the p*-class models, are among the most widely-used network approaches to modeling social networks in recent years [47, 113, 120, 121, 134]. A social network of individuals is denoted by a graph Gs with N nodes and M edges, M ≤ N(N − 1)/2. The corresponding adjacency matrix of Gs is denoted by Y = [y_ij]_{N×N}, where y_ij is a random variable defined as follows: y_ij = 1 if there exists a link between nodes i and j (for all i, j with i ≠ j), and y_ij = 0 otherwise. Based on an ERGM, the probability of an observed network, y, is:

P(Y = y; Θ) = (1/Z) exp( ∑_{i=1}^{K} θ_i f_i(y) ),   (3)

where f_i(y), i = 1, ..., K, are called sufficient statistics [98, 102], based on configurations of the observed graph, and Θ = {θ_1, ..., θ_K} is a K-vector of parameters (K is the number of statistics used in the model). Network configurations, including but not limited to the network edge count (a tie between two actors), as well as counts of 2-stars (two ties sharing an actor) and triads of various types, are related to communication patterns among actors in a social network (see [98] for more details about network configurations). The parameters of ERGMs reflect a wide variety of possible configurations in social networks [119]. In addition, Z is the normalization constant.
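To make Equation 3 concrete, the sufficient statistics f_i(y) can be computed directly from a small adjacency matrix. The four-node graph, the choice of statistics (edges, 2-stars, triangles), and the parameter values θ below are all invented for illustration; computing Z itself requires summing over every possible graph and is omitted, which is exactly why ERGM estimation relies on MCMC methods in practice.

```python
from itertools import combinations
import math

# Invented symmetric adjacency matrix for a 4-node undirected graph.
y = [[0, 1, 1, 0],
     [1, 0, 1, 1],
     [1, 1, 0, 0],
     [0, 1, 0, 0]]
n = len(y)

# Sufficient statistics f_i(y): edge, 2-star, and triangle counts.
edges = sum(y[i][j] for i, j in combinations(range(n), 2))
two_stars = sum(y[i][j] * y[i][k]                 # pairs of ties sharing node i
                for i in range(n)
                for j, k in combinations(range(n), 2)
                if j != i and k != i)
triangles = sum(y[i][j] * y[j][k] * y[i][k]
                for i, j, k in combinations(range(n), 3))

# Unnormalized ERGM probability exp(sum_i theta_i * f_i(y)) with made-up theta.
theta = {"edges": -1.0, "two_stars": 0.2, "triangles": 0.8}
unnormalized = math.exp(theta["edges"] * edges
                        + theta["two_stars"] * two_stars
                        + theta["triangles"] * triangles)
```

The signs chosen here mimic a common qualitative pattern: a negative edge parameter penalizes density overall, while positive 2-star and triangle parameters reward local clustering.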
Some of the first proposed models, e.g., random graphs and p1 models [47], used Bernoulli and dyadic dependence structures, which are generally overly simplistic [120]. In contrast, ERGMs are based on the Markov dependence assumption [47], supposing that two possible ties are conditionally dependent when they share an actor (node). Moreover, the Markov dependence assumption can be extended to attributed networks, in which each node has a set of attributes influencing its possible incoming and outgoing ties [120] (e.g., more experienced actors in an advisory network attract more incoming ties). When nodal attributes are taken into account as random variables, ERGMs and MNs can be integrated to model the social network, due to the similarities that they share (see the Appendix and [45, 98, 144]).

ERGMs have been widely employed to study network and friendship formation [135], and to study global network structure using the local structure of the observed network [146]. The observed network is considered as one realization from the many possible networks with similar important characteristics [120]. For example, Broekel et al. used ERGMs to identify factors determining the structure of inter-organizational networks based on a single observation [15]. Schaefer et al. used SNA to study the relation between weight status and friend selection, and ERGMs to measure the effects of body mass index on friend selection [128]. Moreover, Goodreau et al. used ERGMs to examine the generative processes that give rise to widespread patterns in friendship networks [59]. Cranmer and Desmarais used ERGMs to model co-sponsorship networks in the U.S. Congress and conflict networks in the international system. They found that several previously unexplored network parameters are acceptable predictors of the U.S. House of Representatives legislative co-sponsorship network [32].
ERGMs have also been utilized to model changing communication network structure, to classify networks based on the occurrence of their local features [146], and to identify the effects of micro-level structural properties of a physician collaboration network on hospitalization cost and readmission rate [147]. Finally, an ERGM-based model for clustering nodes according to their role in the network has been reported [126].

5 Discussion

Mining social networks for knowledge and discovery has proven to be a very challenging and active research area [79]. This review focused on PGMs, and motivated their use in social networks through a variety of diverse applications. An important consideration is the issue of scalability, which is a major challenge not only for PGMs, but for SNA in general. Structural and parameter learning in high dimensions can be prohibitive. In practice, several different network structures may be plausible and nearly equally likely. Moreover, both greedy- and sampling-based search strategies can get stuck at local minima. These numerical caveats can give rise to misleading networks, generating models, and subsequent predictions. ERGMs can exhibit degeneracy [64], which occurs when the generated networks show little resemblance to the generating model. Modifications to the concept of goodness of fit have been proposed to safeguard against the problems of degeneracy [58, 71].

Social networks continuously evolve over time. The methods we have discussed either utilize a static snapshot of the social network at a given time, or a fixed template structure that captures the dynamics. Template-based dynamics have proven their utility in a few social network applications. However, they are overly simplistic in their assumptions. More realistically, social networks can give rise to several interrelated streams that contain complex overlapping relational data [83].
Moreover, communities drift as new members join, old members leave or become inactive, and activities change. PGMs are not equipped to model temporal dynamics of this type. Data stream mining is an active area of research that aims to analyze web data as a stream, upon arrival [86]. There are considerable challenges related not only to the sheer volume and speed at which data must be processed, but also to changes in the features or targets being processed. Another major challenge, which has been extensively studied, is concept drift [51]. This phenomenon occurs when the probabilities of features and targets change in time; in other words, the probability distributions change in the stream. Estimation of posterior probabilities in DBNs is similar in spirit to drift estimation, but is much more severely constrained due to the Markov assumption.

Alternative methods for modeling the dynamics of a network have been proposed, including latent modeling approaches and the adoption of smooth transition assumptions. Sarkar et al. proposed a latent space model that assumes smooth transitions between time steps, i.e., networks that change drastically from one time step to the next are assigned a lower probability. They also adopt a standard Markov assumption, which states that time step t+1 is conditionally independent of all previous time steps given t; this is the assumption adopted in our discussion of DBNs. Hoff et al. describe a latent space approach that relies on mapping actors into a social space by leveraging assumed transitive tendencies in relationships in order to estimate proximity in the latent space [69]. The iterative FacetNet algorithm frames the dynamic problem in terms of a nonnegative matrix factorization, and uses the Kullback-Leibler divergence measure to enforce temporal smoothness [95]. TESLA extends the well-known graphical LASSO method for sparse regression, and penalizes changes between time steps using l1-regularization [4].
The TESLA algorithm was tested on both biological and social networks.

In this review, we have surveyed directed and undirected PGMs and highlighted their applications in modern social networks. Despite limitations related to scalability and inference, it is our opinion that the utility of PGMs has been somewhat under-realized in the social network arena. It is indisputable that methods for understanding social networks have not kept pace with the data explosion. There are several relevant topics and opportunities in social networks, e.g., link prediction, collective classification, modeling information diffusion, entity resolution, and viral marketing, where conditional independencies can be leveraged to improve performance. PGMs implicitly convey conditional independence and provide flexible modeling paradigms, which hold tremendous promise and untapped opportunity for SNA.

6 Acknowledgements

AF and AN are supported by a Multidisciplinary University Research Initiative (MURI) grant (Number W911NF-09-1-0392) for Unified Research on Network-based Hard/Soft Information Fusion, issued by the US Army Research Office (ARO) under the program management of Dr. John Lavery. RHB is supported through NSF DMS 1312250.

References

1. Afrasiabi, M.H., Guérin, R., Venkatesh, S.: Opinion formation in Ising networks. In: Information Theory and Applications Workshop (ITA), 2013, pp. 1–10. IEEE (2013)
2. Aggarwal, C.C.: An introduction to social network data analytics. Springer (2011)
3. Agliari, E., Burioni, R., Contucci, P.: A diffusive strategic dynamics for social systems. Journal of Statistical Physics 139(3), 478–491 (2010)
4. Ahmed, A., Xing, E.P.: Recovering time-varying networks of dependencies in social and biological studies. PNAS 106(29) (2009)
5. Al Hasan, M., Zaki, M.J.: A survey of link prediction in social networks. In: Social network data analytics, pp. 243–275. Springer (2011)
6.
Anderson, C.J., Wasserman, S., Crouch, B.: A p* primer: logit models for social networks. Social Networks 21(1), 37–66 (1999)
7. Anderson, R.M., May, R.M., et al.: Population biology of infectious diseases: Part I. Nature 280(5721), 361–367 (1979)
8. Atkinson, R., Flint, J.: Accessing hidden and hard-to-reach populations: Snowball research strategies. Social Research Update 33(1), 1–4 (2001)
9. Ayday, E., Fekri, F.: A belief propagation based recommender system for online services. In: Proceedings of the fourth ACM conference on Recommender systems, pp. 217–220. ACM (2010)
10. Bach, S.H., Broecheler, M., Getoor, L., O'Leary, D.P.: Scaling MPE inference for constrained continuous Markov random fields with consensus optimization. In: NIPS, pp. 2663–2671 (2012)
11. Berk, R.A.: An introduction to sample selection bias in sociological data. American Sociological Review, pp. 386–398 (1983)
12. Berry, M.J., Linoff, G.: Data mining techniques: For marketing, sales, and customer support. John Wiley & Sons, Inc. (1997)
13. Biernacki, P., Waldorf, D.: Snowball sampling: Problems and techniques of chain referral sampling. Sociological Methods & Research 10(2), 141–163 (1981)
14. Bonchi, F., Castillo, C., Gionis, A., Jaimes, A.: Social network analysis and mining for business applications. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3), 22 (2011)
15. Broekel, T., Hartog, M.: Explaining the structure of inter-organizational networks using exponential random graph models. Industry and Innovation 20(3), 277–295 (2013)
16. Bromberg, F., Margaritis, D., Honavar, V., et al.: Efficient Markov network structure discovery using independence tests. Journal of Artificial Intelligence Research 35(2), 449 (2009)
17. Van den Bulte, C., Lilien, G.L.: Medical innovation revisited: Social contagion versus marketing effort. American Journal of Sociology 106(5), 1409–1435 (2001)
18.
Callaway, D.S., Newman, M.E.J., Strogatz, S.H., Watts, D.J.: Network robustness and fragility: Percolation on random graphs. Physical Review Letters 85, 5468 (2000)
19. Canali, C., Colajanni, M., Lancellotti, R.: Data acquisition in social networks: Issues and proposals. In: Proc. of the International Workshop on Services and Open Sources (SOS11) (2011)
20. Cha, M., Mislove, A., Adams, B., Gummadi, K.P.: Characterizing social cascades in Flickr. In: Proceedings of the 1st Workshop on Online Social Networks (WOSN'08). Seattle, WA (2008)
21. Cha, M., Mislove, A., Gummadi, K.P.: A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th Annual World Wide Web Conference (WWW'09). Madrid, Spain (2009)
22. Chapelle, O., Zhang, Y.: A dynamic Bayesian network click model for web search ranking. In: Proceedings of the 18th international conference on World Wide Web, pp. 1–10. ACM (2009)
23. Chen, H., Ku, W.S., Wang, H., Tang, L., Sun, M.T.: LinkProbe: Probabilistic inference on large-scale social networks. In: Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pp. 290–301. IEEE (2013)
24. Chen, X., Michael, K.: Privacy issues and solutions in social network sites. Technology and Society Magazine, IEEE 31(4), 43–53 (2012)
25. Cheong, M., Lee, V.: Integrating web-based intelligence retrieval and decision-making from the Twitter trends knowledge base. In: Proceedings of the 2nd ACM workshop on Social web search and mining, pp. 1–8. ACM (2009)
26. Chib, S., Greenberg, E.: Understanding the Metropolis-Hastings algorithm. The American Statistician 49(4), 327–335 (1995)
27. Chickering, D., Heckerman, D., Meek, C.: Large-sample learning of Bayesian networks is NP-hard. Computing Science and Statistics 33 (2001)
28. Cochran, W.G.: Sampling Techniques, 3rd edn. Wiley (1977)
29. Coleman, J.S., Katz, E., Menzel, H., et al.: Medical innovation: A diffusion study.
Bobbs-Merrill Company, Indianapolis (1966)
30. Cowan, R., Jonard, N.: Network structure and the diffusion of knowledge. Journal of Economic Dynamics and Control 28(8), 1557–1575 (2004)
31. Crane, R., McDowell, L.K.: Evaluating Markov logic networks for collective classification. In: Proceedings of the 9th MLG Workshop at the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2011)
32. Cranmer, S.J., Desmarais, B.A.: Inferential network analysis with exponential random graph models. Political Analysis 19(1), 66–86 (2011)
33. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38 (1977)
34. Dhurandhar, A., Dobra, A.: Collective vs independent classification in statistical relational learning. Submitted for publication (2010)
35. Dielmann, A., Renals, S.: Dynamic Bayesian networks for meeting structuring. In: Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP'04). IEEE International Conference on, vol. 5, pp. V–629. IEEE (2004)
36. Dierkes, T., Bichler, M., Krishnan, R.: Estimating the effect of word of mouth on churn and cross-buying in the mobile phone market with Markov logic networks. Decision Support Systems 51(3), 361–371 (2011)
37. Ding, S.: Learning undirected graphical models with structure penalty. J. CoRR (2011)
38. National Security Research Division, RAND: RAND database of worldwide terrorism incidents (1948). URL http://www.rand.org/nsrd/projects/terrorism-incidents.html
39. Domingos, P., Kok, S., Lowd, D., Poon, H., Richardson, M., Singla, P.: Markov logic. In: Probabilistic inductive logic programming, pp. 92–117. Springer (2008)
40. Domingos, P., Lowd, D., Kok, S., Nath, A., Poon, H., Richardson, M., Singla, P.: Markov logic: A language and algorithms for link mining. In: Link Mining: Models, Algorithms, and Applications, pp. 135–161. Springer (2010)
41.
Du, N., Wu, B., Pei, X., Wang, B., Xu, L.: Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pp. 16–25. ACM (2007)
42. Efron, B.: Size, power and false discovery rates. The Annals of Statistics 35(4), 1351–1377 (2007)
43. Fang, L., LeFevre, K.: Privacy wizards for social networking sites. In: Proceedings of the 19th international conference on World Wide Web, pp. 351–360. ACM (2010)
44. Faugier, J., Sargeant, M.: Sampling hard to reach populations. Journal of Advanced Nursing 26(4), 790–797 (1997)
45. Fellows, I., Handcock, M.S.: Exponential-family random network models. arXiv preprint arXiv:1208.0121 (2012)
46. Fienberg, S.E.: A brief history of statistical models for network analysis and open challenges. Journal of Computational and Graphical Statistics 21(4), 825–839 (2012)
47. Frank, O., Strauss, D.: Markov graphs. Journal of the American Statistical Association 81(395), 832–842 (1986)
48. Freeman, L.: The development of social network analysis. Empirical Press (2004)
49. Friedman, N., Murphy, K., Russell, S.: Learning the structure of dynamic probabilistic networks. In: Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pp. 139–147. Morgan Kaufmann Publishers Inc. (1998)
50. Galam, S.: Rational group decision making: A random field Ising model at t = 0. Physica A: Statistical Mechanics and its Applications 238(1), 66–80 (1997)
51. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Computing Surveys (CSUR) 46(4), 44 (2014)
52. Gelman, A., Shirley, K.: Inference from simulations and monitoring convergence. In: S. Brooks, A. Gelman, G.I. Jones, X.L. Meng (eds.) Handbook of Markov Chain Monte Carlo, pp. 163–174. Chapman & Hall: CRC Handbooks of Modern Statistical Methods (2011)
53.
Getoor, L.: Social network datasets (2012). URL http://www.cs.umd.edu/~sen/lbcproj/LBC.html
54. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Walking in Facebook: A case study of unbiased sampling of OSNs. In: INFOCOM, 2010 Proceedings IEEE, pp. 1–9 (2010)
55. Goldenberg, A., Moore, A.: Tractable learning of large Bayes net structures from sparse data. In: Proceedings of the twenty-first international conference on Machine learning, p. 44. ACM (2004)
56. Goldenberg, A., Zheng, A.X., Fienberg, S.E., Airoldi, E.M.: A survey of statistical network models. Foundations and Trends in Machine Learning 2(2), 129–233 (2010)
57. Goodman, L.A.: Snowball sampling. Annals of Mathematical Statistics 32(1), 148–170 (1961)
58. Goodreau, S.M.: Advances in exponential random graph (p*) models applied to a large social network. Social Networks 29(2), 231–248 (2007)
59. Goodreau, S.M., Kitts, J.A., Morris, M.: Birds of a feather, or friend of a friend? Using exponential random graph models to investigate adolescent social networks. Demography 46(1), 103–125 (2009)
60. Grabowski, A., Kosiński, R.: Ising-based model of opinion formation in a complex network of interpersonal interactions. Physica A: Statistical Mechanics and its Applications 361(2), 651–664 (2006)
61. Hammersley, J.M., Clifford, P.: Markov fields on finite graphs and lattices (1971)
62. Handcock, M., Hunter, D., Butts, C., Goodreau, S., Morris, M.: statnet: An R package for the statistical analysis and simulation of social networks. Manual, University of Washington (2006)
63. Handcock, M.S., Gile, K.J., et al.: Modeling social networks from sampled data. The Annals of Applied Statistics 4(1), 5–25 (2010)
64. Handcock, M.S., Robins, G., Snijders, T.A., Moody, J., Besag, J.: Assessing degeneracy in statistical models of social networks. Tech. rep., Working paper (2003)
65. He, J., Chu, W.W., Liu, Z.V.: Inferring privacy information from social networks.
In: Intelligence and Security Informatics, pp. 154–165. Springer (2006)
66. Heckathorn, D.D.: Respondent-driven sampling: a new approach to the study of hidden populations. Social Problems, pp. 174–199 (1997)
67. Heckerman, D.: Bayesian networks for data mining. Data Mining and Knowledge Discovery 1, 79–119 (1997)
68. Heckerman, D.: A tutorial on learning with Bayesian networks. Springer (2008)
69. Hoff, P.D., Raftery, A.E., Handcock, M.S.: Latent space approaches to social network analysis. Journal of the American Statistical Association 97(460), 1090–1098 (2002)
70. Humphreys, L.: Mobile social networks and social practice: A case study of Dodgeball. Journal of Computer-Mediated Communication 13(1), 341–360 (2007)
71. Hunter, D.R., Goodreau, S.M., Handcock, M.S.: Goodness of fit of social network models. Journal of the American Statistical Association 103(481) (2008)
72. Jabeur, L.B., Tamine, L., Boughanem, M.: Featured tweet search: Modeling time and social influence for microblog retrieval. In: Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences on, vol. 1, pp. 166–173. IEEE (2012)
73. Jabeur, L.B., Tamine, L., Boughanem, M.: Uprising microblogs: A Bayesian network retrieval model for tweet search. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing, pp. 943–948. ACM (2012)
74. Jansen, B.J., Zhang, M., Sobel, K., Chowdury, A.: Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology 60(11), 2169–2188 (2009)
75. Java, A., Song, X., Finin, T., Tseng, B.: Why we twitter: understanding microblogging usage and communities. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pp. 56–65. ACM (2007)
76. Karger, D., Srebro, N.: Learning Markov networks: maximum bounded tree-width graphs. In: Proc. SIAM-ACM Symposium on Discrete Algorithms, pp.
392–401 (2001)
77. Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a social network. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 137–146. ACM (2003)
78. Kjaerulff, U.: A computational scheme for reasoning in dynamic probabilistic networks. In: Proceedings of the Eighth international conference on Uncertainty in artificial intelligence, pp. 121–129. Morgan Kaufmann Publishers Inc. (1992)
79. Kleinberg, J.M.: Challenges in mining social network data: processes, privacy, and paradoxes. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 4–5. ACM (2007)
80. Koelle, D., Pfautz, J., Farry, M., Cox, Z., Catto, G., Campolongo, J.: Applications of Bayesian belief networks in social network analysis. In: Proc. of the 4th Bayesian Modeling Applications Workshop, UAI Conference (2006)
81. Koller, D., Friedman, N.: Probabilistic graphical models: principles and techniques. Massachusetts Institute of Technology (2009)
82. Koller, D., Friedman, N.: Probabilistic graphical models: principles and techniques. MIT Press (2009)
83. Koren, Y.: Collaborative filtering with temporal dynamics. Communications of the ACM 53(4), 447–455 (2009)
84. Krause, S.M., Böttcher, P., Bornholdt, S.: Mean-field-like behavior of the generalized voter-model-class kinetic Ising model. Physical Review E 85(3), 031126 (2012)
85. Krebs, V.E.: Mapping networks of terrorist cells. Connections 24(3), 43–52 (2002)
86. Krempl, G., Žliobaite, I., Brzeziński, D., Hüllermeier, E., Last, M., Lemaire, V., Noack, T., Shaker, A., Sievi, S., Spiliopoulou, M., et al.: Open challenges for data stream mining research. ACM SIGKDD Explorations Newsletter 16(1), 1–10 (2014)
87. Kurant, M., Markopoulou, A., Thiran, P.: On the bias of BFS (breadth first search), pp. 1–8 (2010)
88.
Kuter, U., Golbeck, J.: SUNNY: A new algorithm for trust inference in social networks using probabilistic confidence models. In: AAAI, vol. 7, pp. 1377–1382 (2007)
89. Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a social network or a news media? In: Proceedings of the 19th international conference on World Wide Web, pp. 591–600. ACM (2010)
90. Laurila, J.K., Gatica-Perez, D., Aad, I., Blom, J., Bornet, O., Do, T.M.T., Dousse, O., Eberle, J., Miettinen, M.: The mobile data challenge: Big data for mobile computing research. In: Proceedings of the Workshop on the Nokia Mobile Data Challenge, in Conjunction with the 10th International Conference on Pervasive Computing, pp. 1–8 (2012)
91. Lauritzen, S.L.: Graphical models. Oxford University Press (1996)
92. Lee, S.H., Kim, P.J., Jeong, H.: Statistical properties of sampled networks. Phys. Rev. E 73, 016102 (2006). DOI 10.1103/PhysRevE.73.016102
93. Lee, S.I., Ganapathi, V., Koller, D.: Efficient structure learning of Markov networks using l1-regularization. In: Advances in Neural Information Processing Systems, pp. 817–824 (2006)
94. Leenders, R.: Longitudinal behavior of network structure and actor attributes: modeling interdependence of contagion and selection. Evolution of Social Networks 1 (1997)
95. Lin, Y.R., Chi, Y., Zhu, S., Sundaram, H., Tseng, B.L.: FacetNet: a framework for analyzing communities and their evolutions in dynamic networks. In: Proceedings of the 17th international conference on World Wide Web, WWW '08, pp. 685–694. ACM (2008)
96. Lipford, H.R., Besmer, A., Watson, J.: Understanding privacy settings in Facebook with an audience view. UPSEC 8, 1–8 (2008)
97. Luo, Z., Tang, J., Wang, T.: Propagated opinion retrieval in Twitter. In: Web Information Systems Engineering–WISE 2013, pp. 16–28. Springer (2013)
98. Lusher, D., Koskinen, J., Robins, G.: Exponential Random Graph Models for Social Networks: Theory, Methods, and Applications.
Cambridge University Press (2012)
99. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Byers, A.H.: Big data: The next frontier for innovation, competition, and productivity (2011)
100. McCreadie, R., Soboroff, I., Lin, J., Macdonald, C., Ounis, I., McCullough, D.: On building a reusable Twitter corpus. In: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pp. 1113–1114. ACM (2012)
101. Mislove, A., Marcon, M., Gummadi, K.P., Druschel, P., Bhattacharjee, B.: Measurement and analysis of online social networks. In: Proceedings of the 5th ACM/USENIX Internet Measurement Conference (IMC'07). San Diego, CA (2007)
102. Morris, M., Handcock, M.S., Hunter, D.R.: Specification of exponential-family random graph models: terms and computational aspects. Journal of Statistical Software 24(4), 1548 (2008)
103. Murphy, K.P.: Dynamic Bayesian networks: representation, inference and learning. Ph.D. thesis, University of California (2002)
104. Murphy, K.P.: Machine learning: a probabilistic perspective. The MIT Press (2012)
105. Nagmoti, R., Teredesai, A., De Cock, M.: Ranking approaches for microblog search. In: Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, vol. 1, pp. 153–157. IEEE (2010)
106. Netrapalli, P., Banerjee, S., Sanghavi, S., Shakkottai, S.: Greedy learning of Markov network structure. In: Proc. of Allerton Conf. on Communication, Control and Computing, Monticello, USA (2010)
107. Neville, J., Jensen, D.: Relational dependency networks. The Journal of Machine Learning Research 8, 653–692 (2007)
108. Newman, M.E.: Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103(23), 8577–8582 (2006)
109. Newman, M.E., Watts, D.J., Strogatz, S.H.: Random graph models of social networks.
Proceedings of the National Academy of Sciences 99(suppl 1), 2566–2572 (2002)
110. O'Connor, B., Krieger, M., Ahn, D.: TweetMotif: Exploratory search and topic summarization for Twitter. In: ICWSM (2010)
111. Ounis, I., Macdonald, C., Lin, J., Soboroff, I.: Overview of the TREC-2011 microblog track. In: Proceedings of the 20th Text REtrieval Conference (TREC 2011) (2011)
112. Park, J., Newman, M.E.: Statistical mechanics of networks. Physical Review E 70(6), 066117 (2004)
113. Pattison, P., Wasserman, S.: Logit models and logistic regressions for social networks: II. Multivariate relations. British Journal of Mathematical and Statistical Psychology 52(2), 169–193 (1999)
114. Pochampally, R., Varma, V.: User context as a source of topic retrieval in Twitter. In: Workshop on Enriching Information Retrieval (with ACM SIGIR) (2011)
115. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
116. Ravikumar, P., Wainwright, M.J., Lafferty, J.D., et al.: High-dimensional Ising model selection using l1-regularized logistic regression. The Annals of Statistics 38(3), 1287–1319 (2010)
117. Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62(1-2), 107–136 (2006)
118. Rinaldo, A., Fienberg, S.E., Zhou, Y., et al.: On the geometry of discrete exponential families with application to exponential random graph models. Electronic Journal of Statistics 3, 446–484 (2009)
119. Robins, G., Pattison, P., Elliott, P.: Network models for social influence processes. Psychometrika 66(2), 161–189 (2001)
120. Robins, G., Pattison, P., Kalish, Y., Lusher, D.: An introduction to exponential random graph (p*) models for social networks. Social Networks 29(2), 173–191 (2007)
121. Robins, G., Pattison, P., Wasserman, S.: Logit models and logistic regressions for social networks: III. Valued relations. Psychometrika 64(3), 371–394 (1999)
122.
Robins, G., Snijders, T., Wang, P., Handcock, M., Pattison, P.: Recent developments in exponential random graph (p*) models for social networks. Social Networks 29(2), 192–215 (2007)
123. Roy, S., Lane, T., Werner-Washburne, M.: Learning structurally consistent undirected probabilistic graphical models. In: Proc. ICML (2009)
124. Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes Twitter users: real-time event detection by social sensors. In: Proceedings of the 19th international conference on World Wide Web, pp. 851–860. ACM (2010)
125. Salganik, M.J., Heckathorn, D.D.: Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology 34(1), 193–240 (2004)
126. Salter-Townshend, M., Murphy, T.B.: Role analysis in networks using mixtures of exponential random graph models. Journal of Computational and Graphical Statistics (just-accepted), 00–00 (2014)
127. Santos, F.C., Pacheco, J.M., Lenaerts, T.: Evolutionary dynamics of social dilemmas in structured heterogeneous populations. Proceedings of the National Academy of Sciences of the United States of America 103(9), 3490–3494 (2006)
128. Schaefer, D.R., Simpkins, S.D.: Using social network analysis to clarify the role of obesity in selection of adolescent friends. American Journal of Public Health (0), e1–e7 (2014)
129. Schmidt, M., Murphy, K., Fung, G., Rosales, R.: Structure learning in random fields for heart motion abnormality detection. In: CVPR (2010)
130. Scott, J., Carrington, P.J.: The SAGE handbook of social network analysis. SAGE Publications (2011)
131. Singla, P., Domingos, P.: Lifted first-order belief propagation. In: AAAI, vol. 8, pp. 1094–1099 (2008)
132. Smyth, P., Heckerman, D., Jordan, M.I.: Probabilistic independence networks for hidden Markov probability models. Neural Computation 9(2), 227–269 (1997)
133. Snijders, T.A.: Estimation on the basis of snowball samples: how to weight?
Bulletin de Méthodologie Sociologique 36(1), 59–70 (1992)
134. Snijders, T.A., Pattison, P.E., Robins, G.L., Handcock, M.S.: New specifications for exponential random graph models. Sociological Methodology 36(1), 99–153 (2006)
135. Song, X., Jiang, S., Yan, X., Chen, H.: Collaborative friendship networks in online healthcare communities: An exponential random graph model analysis. In: Smart Health, pp. 75–87. Springer (2014)
136. Sparrow, M.K.: The application of network analysis to criminal intelligence: An assessment of the prospects. Social Networks 13(3), 251–274 (1991)
137. Spiliopoulou, M.: Evolution in social networks: A survey. In: Social network data analytics, pp. 149–175. Springer (2011)
138. Stanford: Stanford Network Analysis Package (SNAP) (2011). URL http://snap.stanford.edu
139. Stutzbach, D., Rejaie, R., Duffield, N., Sen, S., Willinger, W.: On unbiased sampling for unstructured peer-to-peer networks. Networking, IEEE/ACM Transactions on 17(2), 377–390 (2009)
140. Taskar, B., Abbeel, P., Koller, D.: Discriminative probabilistic models for relational data. In: Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence, pp. 485–492. Morgan Kaufmann Publishers Inc. (2002)
141. National Consortium for the Study of Terrorism and Responses to Terrorism (START): International Center for Political Violence and Terrorism Research (ICPVTR) (2012). URL http://www.pvtr.org/ICPVTR/
142. National Consortium for the Study of Terrorism and Responses to Terrorism (START): GTD Global Terrorism Database (2013). URL http://www.start.umd.edu/gtd/
143. Thi, D.B., Hoang, T.A.N.: Features extraction for link prediction in social networks. In: Computational Science and Its Applications (ICCSA), 2013 13th International Conference on, pp. 192–195. IEEE (2013)
144. Thiemichen, S., Friel, N., Caimo, A., Kauermann, G.: Bayesian exponential random graph models with nodal random effects. arXiv preprint arXiv:1407.6895 (2014)
145. Tresp, V., Nickel, M.: Relational models.
Encyclopedia of Social Network Analysis and Mining. Ed. by J. Rokne and R. Alhajj. Heidelberg: Springer (2013)
146. Uddin, S., Hamra, J., Hossain, L.: Exploring communication networks to understand organizational crisis using exponential random graph models. Computational and Mathematical Organization Theory 19(1), 25–41 (2013)
147. Uddin, S., Hossain, L., Hamra, J., Alam, A.: A study of physician collaborations through social network and exponential random graph. BMC Health Services Research 13(1), 234 (2013)
148. Vega-Redondo, F.: Complex social networks, vol. 44. Cambridge University Press (2007)
149. Viswanath, B., Mislove, A., Cha, M., Gummadi, K.P.: On the evolution of user interaction in Facebook. In: Proceedings of the 2nd ACM SIGCOMM Workshop on Social Networks (WOSN'09). Barcelona, Spain (2009)
150. Wan, H., Lin, Y., Wu, Z., Huang, H.: A community-based pseudolikelihood approach for relationship labeling in social networks. In: Machine Learning and Knowledge Discovery in Databases, pp. 491–505. Springer (2011)
151. Wan, H.Y., Lin, Y.F., Wu, Z.H., Huang, H.K.: Discovering typed communities in mobile social networks. Journal of Computer Science and Technology 27(3), 480–491 (2012)
152. Wang, D., Li, Z., Xie, G.: Towards unbiased sampling of online social networks. In: Communications (ICC), 2011 IEEE International Conference on, pp. 1–5 (2011)
153. Wang, Y., Vassileva, J.: Bayesian network-based trust model. In: Web Intelligence, 2003. WI 2003. Proceedings. IEEE/WIC International Conference on, pp. 372–378. IEEE (2003)
154. Wasserman, S., Pattison, P.: Logit models and logistic regressions for social networks: I. An introduction to Markov graphs and p*. Psychometrika 61(3), 401–425 (1996)
155. Wortman, J.: Viral marketing and the diffusion of trends on social networks (2008)
156. Xiang, R., Neville, J.: Collective inference for network data with copula latent Markov networks.
In: Proceedings of the sixth ACM international conference on Web search and data mining, pp. 647–656. ACM (2013)
157. Yang, T., Chi, Y., Zhu, S., Gong, Y., Jin, R.: Detecting communities and their evolutions in dynamic social networks: a Bayesian approach. Machine Learning 82, 157–189 (2011)
158. Yang, X., Guo, Y., Liu, Y.: Bayesian-inference based recommendation in online social networks. In: INFOCOM, 2011 Proceedings IEEE, pp. 551–555. IEEE (2011)
159. Yang, X., Guo, Y., Liu, Y.: Bayesian-inference-based recommendation in online social networks. Parallel and Distributed Systems, IEEE Transactions on 24(4), 642–651 (2013)
160. Ye, S., Lang, J., Wu, S.F.: Crawling online social graphs. In: Proceedings of the 12th International Asia-Pacific Web Conference, pp. 236–242 (2010)
161. Zhu, J., Lao, N., Xing, E.P.: Grafting-Light: fast, incremental feature selection and structure learning of Markov random fields. In: Proc. 16th ACM SIGKDD (2010)
162. Zweig, G., Russell, S.: Speech recognition with dynamic Bayesian networks (1998)

7 Appendix

Similarity between MNs and ERGMs. While MNs and ERGMs have been developed in different scientific domains, they both specify exponential family distributions. MN models treat social network nodes as random variables, and hence, their utility is most obvious in modeling processes on networks; ERGMs, on the other hand, have been conceptualized to model network formation, where it is the edge presence indicators that are treated as random variables (these random variables are dependent if their corresponding edges share a node). But in fact, this application-related difference in what to treat as random is not fundamental. This Appendix more rigorously exposes the similarity between MNs and ERGMs by re-defining an ERGM as a PGM. We begin, however, by reviewing the branch of literature devoted exclusively to ERGMs.
Similar to MNs, a well-discussed problem of ERGMs for analyzing social networks is the challenge of parameter estimation [122] due to the lack of sufficient observed data. Robins et al. outline this and some other problems associated with ERGMs, e.g., degeneracy in model selection and bimodal distribution shapes [122] (see also [62, 64, 118, 134]). The roots of ERGMs in the Principle of Maximum Entropy [112] and the Hammersley-Clifford theorem have been previously pointed out [56, 119]. Here, we illustrate how MNs and ERGMs are similar in form and structure using the most popular sufficient statistics in ERGMs. Under the assumption of Markov dependence, for a given social network, one can build a corresponding Markov network via the following conversion: 1) each node in the Markov network corresponds to an edge in the social network (Fienberg called this construct a "usual graphical model" for ERGMs [46]); 2) when two edges share a node in the social network, a link is built between the two corresponding nodes in the Markov network. Corresponding to each possible edge in a social network, a node in the MN is introduced; note that the original social network and the MN are not the same network!

Consider an ERGM whose sufficient statistics include the number of edges, $f_1(y)$, the numbers of $k$-stars, $f_i(y)$, $i = 2, \ldots, N-1$, and the number of triangles, $f_N(y)$. In an MN, a maximum entropy (maxent) model proposes the following form for the internal energy of the system: $E_c(x) = -\sum_i \alpha_{ci} g_{ci}$, where $g_{ci}$ is the $i$th feature of clique $c \in \Omega$ and $\alpha_{ci}$ is its corresponding weight in $G$. Thus, $\psi_c(x) = \exp\{\beta_c \sum_{i=1}^{N} \alpha_{ci} g_{ci}\}$. Since there are too many parameters in the MN, their number can be reduced by imposing homogeneity constraints similar to those of ERGMs [120]. Before imposing such constraints, the following facts are needed. It is straightforward to demonstrate that $G$ encompasses cliques of size $\{3, \ldots, N-1\}$. In addition, all substructures in $G_s$ can be redefined by features in $G$. Considering these points, we can rewrite the joint probability of all variables represented by the MN, $P(X)$, as follows:

$$P(X) = \frac{1}{Z(\alpha)} \prod_{c=1}^{C} \exp\left(\beta_c \sum_{i=1}^{N} \alpha_{ci} g_{ci}\right) = \frac{1}{Z(\alpha)} \exp\left(\sum_{c=1}^{C} \beta_c \sum_{i=1}^{N} \alpha_{ci} g_{ci}\right). \quad (4)$$

In (4), $Z(\alpha)$ is the partition function, which is a function of the parameters. The homogeneity assumption here means $\alpha_{ci} = \theta'_i$ for all $c = 1, \ldots, C$; then $P(X)$ is:

$$P(X) = \frac{1}{Z(\theta')} \exp\left(\sum_{i=1}^{N} \theta'_i \sum_{c=1}^{C} \beta_c g_{ci}\right). \quad (5)$$

In (5), let $Z' = Z(\theta')$. In addition, denote $\sum_{c=1}^{C} \beta_c g_{ci}$ by $f'_i$, meaning that the counts of substructure $i$ in all cliques $c$ are added up with weights $\beta_c$.

Fig. 5 A social network with five actors (left) and its corresponding Markov network (right).

Finally, replacing $f'_i$ in (5):

$$P(X) = \frac{1}{Z'} \exp\left(\sum_{i=1}^{N} \theta'_i f'_i\right). \quad (6)$$

Comparing the ERGM distribution $P(Y = y)$ with (6) confirms that ERGMs and MNs are similar, and under the following conditions they are identical: 1) $\theta_i = \theta'_i$; 2) $f_i = f'_i = \sum_{c=1}^{C} \beta_c g_{ci}$.

The following numerical example depicts the similarities between ERGMs and MNs. A social network with five actors, $N = 5$, is assumed (Figure 5, left). Under the Markov dependency assumption, there exists a unique corresponding Markov network, shown in Figure 5 (right), with 10 nodes. There are 15 cliques (so-called factors) of size three or four, $\Phi = \{\phi_1(y_{12}, y_{13}, y_{14}, y_{15}), \ldots, \phi_{15}(y_{24}, y_{45}, y_{25})\}$. As already mentioned, the factor associated with each clique is an exponential function of its internal energy. For instance: $\phi_1(x) = \frac{1}{\lambda} \exp\{-\beta_1 E_1(y_{12}, y_{13}, y_{14}, y_{15})\}$, where $E_1(x) = -\sum_i \alpha_{1i} g_{1i}$ and $\lambda$ is the distribution parameter.
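To make the ERGM side of the correspondence concrete, the sketch below computes the edge, 2-star, and triangle counts for a small undirected graph and evaluates the unnormalized exponential-family weight exp(sum_i theta_i f_i(y)). The toy edge set, parameter values, and function names are illustrative assumptions of ours, not taken from the paper, and the partition function is deliberately omitted.

```python
import itertools
import math

def ergm_statistics(n, edges):
    """Edge, 2-star, and triangle counts for an undirected graph on n nodes."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n_edges = len(edges)
    # A 2-star is an unordered pair of edges sharing a node: sum of C(deg, 2).
    n_two_stars = sum(math.comb(len(adj[i]), 2) for i in range(n))
    # A triangle is a set of three mutually connected nodes.
    n_triangles = sum(1 for a, b, c in itertools.combinations(range(n), 3)
                      if b in adj[a] and c in adj[a] and c in adj[b])
    return n_edges, n_two_stars, n_triangles

def unnormalized_weight(theta, stats):
    """exp(sum_i theta_i * f_i(y)); dividing by Z would give a probability."""
    return math.exp(sum(t * f for t, f in zip(theta, stats)))

# An illustrative five-actor network (the edge set is ours, not Figure 5's).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
stats = ergm_statistics(5, edges)
print(stats)                                        # (edges, 2-stars, triangles)
print(unnormalized_weight([-1.0, 0.2, 0.5], stats)) # illustrative parameters
```

Evaluating this weight for every possible graph on the same node set and normalizing would yield the full ERGM distribution; the MN view organizes exactly the same computation clique by clique.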
This simple example shows how ERGMs and MNs are the same in terms of the underlying concept and the expressed probability distribution.

Social Networks 40 (2015) 154–162

On efficient use of entropy centrality for social network analysis and community detection

Alexander G. Nikolaev ∗, Raihan Razib, Ashwin Kucheriya
Department of Industrial and Systems Engineering, 438 Bell Hall, State University of New York at Buffalo, Buffalo, NY 14260, United States

Keywords: Social network modeling; Centrality; Entropy; Community detection; Clustering

Abstract: This paper motivates and interprets entropy centrality, the measure understood as the entropy of flow destination in a network. The paper defines a variation of this measure based on a discrete, random Markovian transfer process and showcases its increased utility over the originally introduced path-based network entropy centrality. The re-defined entropy centrality allows for varying locality in centrality analyses, thereby distinguishing locally central and globally central network nodes. It also leads to a flexible and efficient iterative community detection method. Computational experiments for clustering problems with known ground truth showcase the effectiveness of the presented approach. © 2014 Elsevier B.V. All rights reserved.

1. Introduction

Despite the abundance of existing methods for measuring centrality in social networks, new research challenges and opportunities continue to emerge. In application to large network datasets, computational efficiency of evaluation becomes a major indicator of the utility of centrality measures.
Even more importantly, the typically reliable path-based measures lose sensitivity when the number of paths contributing to their formulae grows too large, making the evaluation of node centrality with respect to nearby neighbors (as opposed to the whole network) particularly difficult. In searching for answers to these new challenges, it is desirable to design centrality measures with solid grounding in theory, while not compromising the interpretability sought by social science practitioners. This paper develops a centrality measure whose computation for a given node does not require dyad-based path enumeration. Instead, the presented measure relies on an absorbing Markovian process evolving over finite time, which allows for matrix multiplication-based computation of centrality. Depending on the absorption rate and evolution time, the presented measure enables centrality analysis at varying localities around a node of interest, thereby distinguishing locally central and globally central network nodes. The measure offers an information theory-based approach to measuring centrality, and takes a particular, previously unoccupied spot in the typology of flow-based centrality metrics.

∗ Corresponding author. Tel.: +1 716 645 4710. E-mail address: [email protected] (A.G. Nikolaev). http://dx.doi.org/10.1016/j.socnet.2014.10.002

Different measures of centrality capture different aspects of what it means for a node to be “central” to the network. In his seminal paper, Freeman (1979) argued that node degree centrality, the number of direct links incident to a node, indexes the node’s activity; node betweenness centrality, based on the position of a node with respect to the all-pair shortest paths in a network, exhibits the node’s potential for network control; and closeness centrality, the sum of geodesic distances from a node to all the other nodes, reflects its communication independence or efficiency.
Borgatti (2005) conceptualized a typology of centrality measures based on the ways that traffic flows through the network. Two characteristics, the route the traffic follows (geodesics, paths, trails, or walks) and the method of propagation (parallel duplication, serial duplication, or transfer), define the two-dimensional typology. Each measure of centrality makes assumptions about the importance of the various types of traffic flow, and hence, each measure of centrality can be assessed by where it falls in the typology. For example, betweenness centrality is perfect for networks featuring flows along geodesics. A node with high betweenness centrality is essentially a traffic checkpoint that can shut down the flow. At the same time, betweenness is an inappropriate measure in networks where flow is not constrained to follow geodesics. Non-geodesic paths avoid the checkpoints altogether, making an alternative measure essential. Over the years, researchers have proposed a number of different centrality measures, including eigenvector centrality (Bonacich, 1972), information centrality (Stephenson and Zelen, 1989), subgraph centrality (Estrada and Rodriguez-Velazquez, 2005), alpha centrality (Bonacich and Lloyd, 2001), etc. However, their meaning with respect to Borgatti’s typology has not always been clearly defined or analyzed. Tutzauer (2007) began to address this issue and proposed a centrality measure for networks characterized by path-based transfer flows. The path-based transfer model assumes that an object travels from a particular node (the one whose centrality is being evaluated) to a destination (the node itself or one of its neighbors) along a random path.
More specifically, a path is sequentially built: if the flow-originating node is randomly selected to be the next in the sequence, then the flow is over before it begins; otherwise, the object is randomly passed to one of the original node’s immediate neighbors. Given that the object has arrived at the new node, the next transfer step destination is then randomly chosen from among its neighbors (including the current node, but not including any of the previously visited nodes), and again the flow either stops (if the current node in the sequence is selected) or continues on in the same fashion (if a different node is selected). For the described transfer model, the centrality of a given node can be defined as the entropy of the transfer’s final destination. In other words, it can be expressed via the probabilities of transfer paths from the node to each of the other nodes. Although the motivation for this entropy-based measure is intuitively and technically clear, the research community has been slow to adopt it for application purposes, largely due to the need for exhaustive path enumeration in evaluating the defined centrality. This paper develops the idea of Tutzauer (2007), and presents a new, high-utility entropy centrality measure based on a discrete Markovian transfer process. In the presented model, a transferred object randomly walks through a network; then, the resulting measure, the walk destination entropy, can be efficiently computed, which opens new ways for insightful, computationally efficient analyses of networks. The structure of the paper is as follows. Section 2 introduces essential notation and the fundamentals of the path-transfer flow process, builds a Markov model for the study of this process, presents an expression for the entropy centrality measure, and offers an illustrative computational example.
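The path-transfer process just described can be simulated directly. The sketch below is an illustrative Python rendition (the function name and the toy star network are assumptions, not from the paper): at each step the current holder picks uniformly among itself and its unvisited neighbors, and picking itself ends the flow.

```python
import random

def path_transfer_destination(adj, origin, rng=random):
    """Simulate one Tutzauer-style path-transfer flow: at each step the
    current holder picks uniformly among itself and its not-yet-visited
    neighbors; picking itself absorbs (ends) the flow."""
    visited = {origin}
    current = origin
    while True:
        candidates = [v for v in adj[current] if v not in visited]
        choice = rng.choice([current] + candidates)
        if choice == current:        # self-selection: the flow stops here
            return current
        visited.add(choice)
        current = choice

# Assumed toy network: a star with center 0 linked to nodes 1, 2, 3.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
dest = path_transfer_destination(adj, origin=0)
```

Repeating the simulation many times and taking the entropy of the empirical destination distribution approximates the path-based entropy centrality of the origin node, which is exactly what exhaustive path enumeration computes analytically.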
Section 3 uses entropy centrality to design an algorithm for community detection in networks, and reports computational results with the algorithm applied to clustering problems with known ground truth. Section 4 offers discussion and concluding remarks.

2. Model description

2.1. Mathematical preliminaries

The mathematical representation of a network is a directed or undirected graph G = (V, E), where V = {1, 2, . . ., N} is a finite, nonempty set of nodes (vertices), and E is a relation (a tie configuration) on V. The elements of E are called edges. The edge (i, j) ∈ E is incident with the vertices i and j, and i and j are incident with the edge (i, j) ∈ E. Moreover, (i, j) ∈ E is a link if i ≠ j and a loop if i = j. The adjacency matrix of G has elements b_ij, i = 1, 2, . . ., N, j = 1, 2, . . ., N, such that b_ij = 1 if nodes i and j in the network are connected with an edge and b_ij = 0 otherwise.

2.2. Centrality and entropy connection

To motivate the connection between the centrality of a given node and the concept of entropy, consider a network of friends transferring an object among themselves. The more central the originating node is, the more difficult it is to predict the object’s final destination. If the node is central, the object has a greater probability of traveling far in multiple potential directions. In contrast, a less central node has a more limited choice of immediate transfer options, and the process is more likely to stop (be absorbed) before the number of transfer options increases, which makes its destination more predictable. This idea can be more easily understood if one considers an extreme example of a network of one extrovert person and many introverts. An introvert is a node in the network with no or very few incident links, while an extrovert is a node adjacent to many nodes in the network.
Assume that, according to a random rule, an object transfer process can terminate after the object is passed from one node to another, i.e., the object will eventually be absorbed by some node, termed the destination node. In the case of high absorption probabilities, if the object transfer process originates from the extrovert (following the transfer process described above), the probability that it ends up at any given node is close to 1/N. In contrast, if the transfer process originates from the introvert, then the flow first needs to reach the hub to go beyond it, limiting the likelihood that “far-away” nodes are reached at all. The level of uncertainty of the object’s destination, as a function of its origin, can be captured as destination entropy. The concept of entropy was first introduced in physics and later developed in the information and communication sciences; entropy enjoys distinct and intuitive interpretations in multiple applied domains. In adopting it for use in social network analysis, one avoids having to assess a node’s position with respect to paths connecting all node pairs, and instead focuses on the node’s potential to diversify flow propagation.

2.3. Path transfer and random walk flows as foundations for entropy centrality computation

In assessing the value of node position using network flow, researchers have historically focused on paths as channels that flow may follow. Entropy centrality does not explicitly measure the ability of a node to interfere with path-based exchanges between other nodes; instead, it views a node of interest as a flow originator. The treatment of paths and flow types, relevant to the concept of entropy centrality, deserves a more in-depth discussion. This paper’s contribution to centrality theory is akin to that of Newman (2005), who first proposed to use walks, instead of only shortest paths, for betweenness measurement.
In entropy centrality calculation, the idea of analyzing random walks is developed further by allowing walks to be randomly interrupted; the longer a given planned object route, i.e., the more exchanges (transfers) it requires, the less likely it is to be completed. To further illustrate this point, a review and discussion of path-transfer flows is in order. Examples of path-transfer flows abound in trading and smuggling networks (Tutzauer, 2007), especially when the traded or smuggled commodity is discrete, as in the case of exotic animals, nuclear weapons material and parts, fossils, artworks and antiquities, and even trafficked humans. For a more peaceful example, consider a group of people linked by friendship ties, with one of them having a specific object. To model a path-transfer process, think of the object being passed from one person to another. The flow (i.e., object transfer) originates at a particular person in the group (i.e., a node in the graph). If that person does not pass the object to any one of their immediate friends, the flow is over before it begins; otherwise, the object flows (i.e., is transferred) to a randomly selected person. The next person then chooses whether to pass the object to their immediate friends, and again the flow either stops or continues. The object thus traverses a path in the network, traveling along the links, stopping when the process is absorbed at some node or if the object’s trajectory completes a loop. According to the original model formulation, each of the eligible neighbors is assumed to be selected with equal likelihood, although this assumption can be relaxed without loss of generality. The main restriction in the path-transfer process is that the object cannot be passed to the nodes it has already visited. This paper relaxes the restriction that flow must follow paths in the entropy centrality definition.
Instead, it develops a model based on a special case of random walk, where each node has a positive probability of absorbing the flow for good (Newman, 2005; Noh and Rieger, 2004). The motivation for this alternative definition of the entropy centrality computation mechanism is two-fold. First, the path-based centrality is extremely difficult to use in practice. The necessity of complete path enumeration in its computation makes the original measure (Tutzauer, 2007) unsuitable for the analysis of well-connected networks containing over ten nodes. In contrast, the relevant transfer and absorption probabilities for random walks can be easily calculated using matrix-analytic methods. Second, note an important nuance in the entropy centrality concept that can be utilized with its definition based on random walks. Entropy centrality is calculated using a serial transfer model; however, because multiple transfer destination probabilities enter the entropy expression simultaneously, the measure may be better suited to analyzing serial duplication processes. An iterative, step-by-step analysis of a random walk originating from a given node would inform one of the temporal (periodic) dynamics of the flow destination entropy, i.e., indicate how fast (in how many periods) the network can be informed/conquered if the spread of influence is initiated from the given node. Consider modeling a community becoming engaged in discussing a pertinent topic/issue picked up by one of its members from a news outlet. All the sequences in which people can converse come into play, and some conversations can occur simultaneously, as long as multiple community members are informed of the topic/issue by their neighbor(s).
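The claim that transfer and absorption probabilities for such walks "can be easily calculated using matrix-analytic methods" refers to standard absorbing-chain algebra: with Q the transient-to-transient block and R the transient-to-absorbing block of the transition matrix, the absorption probabilities are B = (I − Q)⁻¹R. A minimal numpy sketch, with an assumed toy two-node chain (not from the paper):

```python
import numpy as np

# Toy absorbing chain (assumed for illustration): two transient states
# that hold the object or pass it to each other, each with absorption
# probability a = 0.2 into its own absorbing state.
a = 0.2
Q = np.array([[0.4, 0.4],      # transient -> transient block
              [0.4, 0.4]])
R = np.array([[a, 0.0],        # transient -> absorbing block
              [0.0, a]])

# Fundamental matrix N = (I - Q)^{-1}; then B[i, j] is the probability
# that a walk starting in transient state i is absorbed in state j.
N = np.linalg.inv(np.eye(2) - Q)
B = N @ R
```

Here each row of B sums to one (absorption is certain), and no path enumeration is involved; for a network, Q and R are built directly from the adjacency structure.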
Thoughtful, as opposed to gossip-generating, conversations between people are rarely broadcast; they take place sequentially, and the same news can be discussed by the same two individuals multiple times (think of mulling over a political situation rather than sharing news of a rock star appearing at a night club). In summary, walk-based entropy centrality can be most useful for identifying influential community members with respect to a serial duplication process. This observation defines the measure’s place in Borgatti’s typology, reaffirming the motivation for introducing it.

2.4. Markov model and entropy centrality

Consider a connected network represented by graph G = (V, E), with V being a set of N nodes indexed 1 through N, and with E being a relation on V. Refer to Fig. 1 for an illustrative example of a small network with N = 6 nodes and |E| = 8 edges. In a random-walk-based flow process, the immediate destination of an object transferred from an object-holder depends only on the current object position, and not on the sequence of nodes that the object visited prior to the current state; therefore, its position over time (in time periods) can be modeled as a Markovian process, or a Markov chain. For example, an object being transferred over the network in Fig. 1 could move from node 1 to node 4, and then in the next period, back from node 4 to node 1. It is also assumed that each node has the option to hold on to the object in any given period even though it is connected to other nodes, thus taking a pause in communication. Additionally, each node can stop the flow for good, with the probability of such an event referred to as the absorption probability a_v, v ∈ V. Fig. 2 depicts the node absorption probabilities fixed at a = 0.2, which implies that in a single period node 1 can transfer the object to three nodes (i.e., self, node 2 or node 4), each with probability (1 − 0.2)/3 ≈ 0.267. Fig.
2 also adds auxiliary nodes to the original network: labeled with apostrophes (primes), these nodes represent the absorbing states of the Markov chain. Note that, to avoid clutter in Fig. 2, the loop transitions are not depicted. Consider a stochastic process with the state diagram given by Figs. 1 and 2 combined (including both the loops and the absorbing states). This process is a Markov chain with transition probability matrix denoted by P, with elements p_ij, i, j ∈ {1, 2, . . ., N, 1′, 2′, . . ., N′}, as given in Table 1. The measure of centrality for node i = 1, 2, . . ., N, quantified by the entropy of the object destination, given that the transferred object originates from node i and experiences t transitions, is defined as

$$H_i^t = -\sum_{j=1}^{N} \big(p_{ij}^{(t)} + p_{ij'}^{(t)}\big) \log\big(p_{ij}^{(t)} + p_{ij'}^{(t)}\big). \quad (1)$$

If the base of the logarithm in formula (1) is chosen to be 2, then the entropy centrality is measured in bits; the results in the subsequent sections of this paper are reported using the more conventional natural logarithm. The expression in (1) involves terms of the form $p_{ij}^{(t)} + p_{ij'}^{(t)}$; one such term gives the probability that an object originating at node i finds itself, after t time periods elapse, in the possession of node j. The closer these probabilities are to each other across nodes j ∈ {1, 2, . . ., N}, the more difficult it is to predict/guess the object’s position (at time t), and the larger the entropy.

Fig. 2. An expanded state diagram for a Markovian transfer process, with auxiliary nodes for absorbing states.

Table 1
Transition probability matrix for the Markovian transfer process.

From\To   1      2      3      4      5      6      1′    2′    3′    4′    5′    6′
1         0.267  0.267  0      0.267  0      0      0.2   0     0     0     0     0
2         0.2    0.2    0.2    0.2    0      0      0     0.2   0     0     0     0
3         0      0.2    0.2    0.2    0      0.2    0     0     0.2   0     0     0
4         0.16   0.16   0.16   0.16   0      0.16   0     0     0     0.2   0     0
5         0      0      0      0      0.4    0.4    0     0     0     0     0.2   0
6         0      0      0.2    0.2    0.2    0.2    0     0     0     0     0     0.2
1′        0      0      0      0      0      0      1     0     0     0     0     0
2′        0      0      0      0      0      0      0     1     0     0     0     0
3′        0      0      0      0      0      0      0     0     1     0     0     0
4′        0      0      0      0      0      0      0     0     0     1     0     0
5′        0      0      0      0      0      0      0     0     0     0     1     0
6′        0      0      0      0      0      0      0     0     0     0     0     1

Note that the number of transitions t, which can be fixed at any integer, defines the desired locality of the centrality analysis: it is thus termed the transfer locality. In particular, a node in a network may have high relative centrality for small t but low relative centrality for large t. Also, as t increases, the centrality measure for any node approaches a constant, at a rate depending on how fast the process is expected to be absorbed.

2.5. The effect of transfer locality adjustment

Given a locality value t, one evaluates a node’s centrality with respect to the part of the network that is likely to be reached from the node by a transfer process in a limited time, i.e., in t steps of a random walk. In other words, by adjusting the transfer locality, one “magnifies” the local neighborhood surrounding the node, thus reducing the impact of “far-away”, hard-to-reach nodes on the resulting entropy centrality value. When t is large, entropy centrality describes nodes’ network positions on a global (whole-network) scale. To understand the implications of varying transfer locality in centrality analyses, consider the social network of Zachary’s karate club. In a classical study, 34 members of a karate club were observed over a 2-year period. A network of friendships between the club members was constructed using a variety of measures to estimate the strength of ties. An unweighted version of the club network is given in Fig.
3; the following analysis focuses on the six nodes labeled 1, 5, 12, 29, 33 and 34; these appear in bold circles in the figure. Fig. 4 reports entropy centrality values for the six nodes, at varied levels of t. As the transfer locality increases, each node’s centrality value monotonically increases, implying that, given more time, the node can reach more peers (remember, the node for which the centrality is computed is viewed as the flow originator). Importantly, observe that the rates of entropy increase as a function of t differ across nodes. For example, nodes 5 and 29 see their centrality values dramatically increase with growing t, indicating that such nodes can become influential only if the transfer process they originate does not die early. Meanwhile, nodes 1, 33 and 34, located at “the heart” of well-connected clusters (small or large), do not see their centralities grow by much. Interestingly, node 29 has low centrality in its local neighborhood and high centrality with respect to the whole network, surpassing the locally central nodes 5 and 33. The sensitivity of entropy centrality to a node’s position with respect to network clusters leads to the idea of fixing the transfer locality value in such a way that clusters can be identified in any network.

3. Community structure detection

This section describes how entropy centrality can be used to reveal community structure in networks. The presented community detection algorithm is inspired by the algorithm proposed by Girvan and Newman (2002), which iteratively removes high-betweenness edges in a hierarchical clustering procedure. The algorithm proposed in this paper also removes one edge at a time and re-computes the corresponding transition and absorption probabilities for each node.

Fig. 3. Zachary’s karate club social network.

Fig. 4. Entropy centrality vs transfer locality plot for the karate club problem.
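The entropy centrality computation of Section 2 reduces to a matrix power. The sketch below builds the 2N × 2N transition matrix for an N-node network with uniform absorption probability a (as in Fig. 2) and evaluates formula (1) for every node; the function name is an assumption, and the example edge list is read off the entries of Table 1, so treat it as an assumed reconstruction of the Fig. 1 network.

```python
import numpy as np

def entropy_centrality(adj, a=0.2, t=5):
    """Walk-destination entropy H_i^t of formula (1) for every node.
    adj: 0/1 symmetric adjacency matrix; a: absorption probability;
    t: transfer locality (number of transitions)."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    P = np.zeros((2 * n, 2 * n))
    for i in range(n):
        opts = 1 + int(adj[i].sum())           # self plus neighbors
        P[i, i] = (1 - a) / opts               # hold the object
        P[i, :n][adj[i] > 0] = (1 - a) / opts  # pass to a neighbor
        P[i, n + i] = a                        # absorbed for good
        P[n + i, n + i] = 1.0                  # absorbing state loops
    Pt = np.linalg.matrix_power(P, t)
    q = Pt[:n, :n] + Pt[:n, n:]                # p_ij^(t) + p_ij'^(t)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(q > 0, q * np.log(q), 0.0)  # 0 log 0 := 0
    return -terms.sum(axis=1)

# Assumed example (edges read off Table 1): 6 nodes, 8 edges.
edges = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 4), (3, 6), (4, 6), (5, 6)]
A = np.zeros((6, 6), int)
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1
H = entropy_centrality(A, a=0.2, t=5)
```

Since each row of the q matrix sums to one over the N destinations, each H value lies between 0 and log N, and varying t directly implements the transfer locality adjustment discussed above.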
The algorithm pseudocode is as follows:

INPUT: Number of nodes in the network N; transition probability matrix P for the Markov chain with auxiliary nodes; transfer locality t; number of algorithm iterations K.
For k = 1 to K
  For i = 1 to N and j = 1 to N, i ≠ j
    Remove the link between nodes i and j, if it exists
    Revise the probabilities for transitions from nodes i and j
    Compute the average entropy over all the nodes using (1)
    Remember/update the link for which the entropy decrease is maximum
  End
  Remove the link for which the entropy decrease is maximum
End
Sort to identify the obtained clusters
OUTPUT: The clusters.

Given an undirected graph and a fixed value of the absorption probability for all the nodes, the transition probability matrix P for a Markov chain with auxiliary states is created first, as explained in Section 2. The desired centrality locality t is chosen next; during experimentation, it was empirically discovered that locality values close to the diameter of a given network, and absorption probability values in the range [0.1, 0.2], are convenient choices for successful global community detection. The algorithm proceeds by identifying and removing network edges such that the average entropy centrality over all the nodes is reduced the most (see the pseudocode above). Empirical investigations with the designed community detection algorithm, run to discover clusters in networks with known ground truth, are reported next.

3.1. The Zachary karate club network

Returning to the Zachary karate club experiment, recall the part of the club’s story that made it famous in social network analysis circles: during the 2-year observational study, a split occurred between the club members. A disagreement, which developed between the administrator of the club and the club’s instructor, ultimately resulted in the instructor leaving and starting a new club, taking about half of the original club’s members with him. The node colors in Fig.
3 indicate how exactly the two factions ended up splitting. Fig. 5 presents the results of the entropy centrality-based community detection algorithm, executed on the karate club network. The algorithm discovers the two main club communities, offering a strict refinement of the community structure reported in Girvan and Newman (2002) and agreeing with the findings of Medus et al. (2005). For finding the two-community division, 25 iterations of the algorithm were executed with locality t = 5. This partition corresponds almost exactly to the actual factions in the club, with the only exception being some “outliers”, the nodes with the lowest degree values. The outliers, nodes 5, 10, 11, 12 and 29, were detected first, which is a desirable property for a community detection algorithm that looks to find closely connected groups. Note also that increasing the number of algorithm iterations produces more granular clusters (perhaps smaller friendship groups or families); however, any further refinements could not be validated due to the lack of data.

Fig. 5. Entropy centrality algorithm results for the karate club problem, depicting the sequence of community formation.

Table 2
Clustering algorithms comparison – karate club data (34 nodes, 78 edges).

                      Girvan–Newman algorithm       Entropy-based algorithm
Number of iterations  Clusters  Outliers  Time (s)  Clusters  Outliers  Time (s)
10                    1         0         5.6       2         2         0.5
15                    2         1         7.4       3         4         0.6
25                    4         2         9.5       4         5         0.77

Table 3
Clustering algorithms comparison – football network data (115 nodes, 613 edges).

                      Girvan–Newman algorithm       Entropy-based algorithm
Number of iterations  Clusters  Outliers  Time (s)  Clusters  Outliers  Time (s)
50                    1         0         577.5     1         0         205.3
100                   3         0         904.4     3         0         388.2
150                   6         0         1046.1    7         0         553.1
200                   12        3         1094.0    12        3         673.6
250                   14        4         1134.7    12        13        738.9

Table 2 reports the number of clusters identified by the Girvan–Newman algorithm (Girvan and Newman, 2002) and the presented entropy-based algorithm as a function of the number of iterations, together with the respective computational times. In each algorithm, an iteration constitutes the removal of a single edge from the network: in identifying an edge to be removed, the former algorithm computes betweenness centralities for all the edges, while the latter computes entropy centralities for all the nodes and the changes in these centralities that would result from the removal of every edge. Thus, the number of centrality evaluations required in each iteration of the presented algorithm is O(N) times greater than that in the Girvan–Newman algorithm. Yet, the observed algorithm runtimes differ by an order of magnitude in favor of the entropy-based algorithm, due to the high efficiency of entropy centrality computation. Note that both algorithms are hierarchical, in that they will continue breaking communities apart until the last edge has been removed from the network at hand; an analyst is free to stop this process at any point. In Table 2, and in the tables corresponding to the subsequent experiments, cluster-based metrics are reported for multiple breakpoints in the algorithms’ execution, for illustrative purposes. All the experimental results presented in this paper have been obtained using MATLAB version R2012b on a desktop with an Intel i3-2120 processor (3.3 GHz, 2 cores) and 8 GB RAM.

Fig. 6. Clusters for NCAA Division I-A football teams.

Fig. 7. Clusters for the Dolphin Network.

3.2. The US Division I football network

Another example is based on the structure of a US college football league (football here is American football, not soccer). The network under study is a representation of the schedule of Division I games in the year 2000, with nodes representing teams (identified by their college names) and edges representing regular-season games between the teams they connect.
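The Section 3 pseudocode can be rendered compactly as follows. This is an illustrative sketch, not the authors' MATLAB implementation: the helper `avg_entropy` is restated inline so the snippet stands alone, the function names are assumptions, and the toy two-triangles-plus-bridge network is invented for the example.

```python
import numpy as np

def avg_entropy(A, a=0.2, t=5):
    """Average walk-destination entropy over all nodes (formula (1))."""
    n = A.shape[0]
    P = np.zeros((2 * n, 2 * n))
    for i in range(n):
        opts = 1 + int(A[i].sum())
        P[i, i] = (1 - a) / opts
        P[i, :n][A[i] > 0] = (1 - a) / opts
        P[i, n + i] = a
        P[n + i, n + i] = 1.0
    Pt = np.linalg.matrix_power(P, t)
    q = Pt[:n, :n] + Pt[:n, n:]
    with np.errstate(divide="ignore", invalid="ignore"):
        H = -np.where(q > 0, q * np.log(q), 0.0).sum(axis=1)
    return H.mean()

def detect_communities(A, iters, a=0.2, t=5):
    """Greedy edge removal: per iteration, tentatively delete each edge,
    measure the drop in average entropy, and permanently remove the
    edge whose deletion lowers the average entropy the most."""
    A = A.copy()
    for _ in range(iters):
        base, best, best_edge = avg_entropy(A, a, t), None, None
        for i, j in zip(*np.triu_indices_from(A, k=1)):
            if A[i, j]:
                A[i, j] = A[j, i] = 0              # tentative removal
                drop = base - avg_entropy(A, a, t)
                A[i, j] = A[j, i] = 1              # restore
                if best is None or drop > best:
                    best, best_edge = drop, (i, j)
        if best_edge is None:
            break                                   # no edges left
        i, j = best_edge
        A[i, j] = A[j, i] = 0                       # permanent removal
    return A  # connected components of A are the clusters

# Assumed toy network: two triangles joined by a single bridge edge.
A = np.zeros((6, 6), int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
A2 = detect_communities(A, iters=1)
```

Each outer iteration removes exactly one edge, mirroring the pseudocode; the communities are then read off as the connected components of the pruned adjacency matrix.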
What makes this network interesting is that the true community structure is also available. The teams are divided into conferences containing around 8–12 teams each. Games are more frequent between members of the same conference than between members of different conferences, with teams playing an average of about seven intra-conference games and four inter-conference games in the season. Inter-conference play is not uniformly distributed; teams that are geographically close to one another but belong to different conferences are more likely to play one another than teams separated by large geographic distances (Girvan and Newman, 2002). The entropy centrality-based community detection algorithm was applied to this network to identify the conference structure. The algorithm was executed for 200 iterations with transfer locality t = 5. The results are presented in Fig. 6. Almost all teams were correctly grouped; a few independent teams that did not belong to any conference were also successfully identified. Overall, only four teams were misclassified: Boise State in actuality belongs in the Western Athletic Conference, Western Michigan and Marshall in the Mid-American Conference, while Utah State is a conference-independent team. Table 3 offers a comparison of the computational times of the Girvan–Newman and entropy-based algorithms executed on the football network dataset. Naturally, as the number of iterations increases, the network becomes more sparse.

Table 4
Clustering algorithms comparison – dolphin network data (62 nodes, 159 edges).

                      Girvan–Newman algorithm       Entropy-based algorithm
Number of iterations  Clusters  Outliers  Time (s)  Clusters  Outliers  Time (s)
45                    8         0         39.2      4         16        8.7
75                    11        12        48.0      6         22        11.5
100                   12        20        51.0      6         31        13.8

Fig. 8. Clusters for the “Les Miserables” weighted network (100 iterations, 31 outliers).
It is worth noting that the number of shortest paths remaining in the network decreases sharply, which explains why the iteration cost of the Girvan–Newman algorithm drops faster than that of the entropy-based algorithm.

3.3. The Dolphin Network

Finally, the algorithm was run on a classic network called “The Dolphin Network” (Lusseau, 2003). The network represents a community of 62 bottlenose dolphins in Doubtful Sound, New Zealand. The algorithm was executed for 45 iterations with locality t = 5. The results are presented in Fig. 7 and Table 4: the dolphin community is split into six main clusters identified by the entropy centrality community detection algorithm. These results match well with those reported by previously existing clustering algorithms. The singled-out outliers, nodes 5, 12, 13, 32 and 36, were detected first, while nodes 23, 49 and 61 were not separated into an isolated cluster. In its current design, the entropy centrality algorithm cannot be used to distinguish overlapping communities, which is another community detection task typically explored with this dataset. Note that the interpretation of entropy centrality emphasizes diversity in multi-way information exchanges between nodes, as opposed to emphasizing connectivity. Therefore, the presented clustering algorithm first and foremost seeks to achieve high cohesion within each discovered group, and this is why outliers tend to be removed early in all the attempted experiments.

4. Discussion and conclusion

This paper introduces a measure of node centrality defined as the entropy of flow destination in a walk-based transfer process with the Markovian property. Entropy centrality can be particularly useful in large social network analysis, where the multitude of paths between node pairs makes the differences in the typically used betweenness centrality values almost negligible. Entropy centrality is well-interpretable and easy to compute exactly using matrix multiplication.
By design, entropy centrality can be interpreted as a measure of a node’s potential for information spread: the more diverse the set of destinations a node can engage, the higher the centrality it boasts. Moreover, by adjusting the settings of the Markovian transfer process, one is able to measure entropy centrality at different localities, establishing the value of every node’s position with respect to its local neighbors or, globally, with respect to the whole network. The notion of locality in the entropy centrality definition is akin to that of reach, used to define reach centrality; however, in the stochastic transfer process context, the two are not quite the same. Entropy centrality is conducive to quantifying the properties of a serial duplication network flow process, thus taking a particular spot in Borgatti’s typology of social network processes/metrics. This observation motivates further investigations into the utility of entropy centrality for viral marketing studies, where the spread of ideas or products takes place simultaneously over multiple network paths or walks. A profit-sharing product distribution strategy, where distributors are constantly recruited by the existing distributors directly from the consumer population, is a good example of such a potential analysis application. Entropy centrality can also be useful for network visualization, with globally central nodes placed onto a canvas first, uniformly spaced, and with the other, surrounding nodes becoming more distant as their local centrality drops. Such a visualization would emphasize the information exchange capabilities of nodes at multiple levels, instead of relying exclusively on the local network structure captured by its edges.

Table 5
Entropy-based clustering algorithm – Les Miserables data (77 nodes, 254 edges, weighted).

Number of iterations  Clusters  Outliers  Time (s)
25                    2         20        14.6
50                    4         25        27.2
100                   5         31        47.9
150                   4         48        58.8
This paper also explores how entropy centrality can be utilized by an iterative algorithm to effectively detect communities. Computational experiments on networks with known cluster ground truth showcase the effectiveness of the presented method. Additional insights, drawn from the experiments with the entropy centrality-based clustering algorithm, are notable. First, entropy centrality appears to remain informative on weighted networks (the matrix-based way of computing entropy centrality values does not require any significant revision). Fig. 8 and Table 5 give the results and computational times for the presented clustering algorithm applied to a weighted dataset of character co-appearances in the text of "Les Miserables": the discovered communities are consistent with the results previously reported in the literature. Second, it is observed that, in its current form, entropy centrality may not be as useful for analyzing directed networks. When applied to prisoner relationship data (67 nodes, 182 edges), the entropy centrality-based clustering algorithm failed to discover well-interpretable clusters. This is due to the fact that in a directed network, many actors have very limited options for information spread, and are isolated as "outliers" early in the algorithm's execution. Also, experiments with larger networks revealed that, runtime-wise, the clustering algorithm's applicability has limitations similar to those of the Girvan–Newman algorithm: more specifically, the former can work with datasets of up to 500 nodes, whereas the latter becomes slow when clustering just 100 nodes. Meanwhile, the computational efficiency of a one-time evaluation of entropy centrality values over all network nodes remains very high, as expected. On a final note, future research on the use of entropy centrality for social network analysis can focus on evaluating strategic network positions of groups of nodes.
Having directly computed all the absorption probabilities for the Markovian transfer process (i.e., from state i ∈ N into state j ∈ N), one can search for subsets of strategically positioned nodes at various localities. Other research directions include increasing the computational efficiency of the presented methods, and devising methods for detecting overlapping network communities.

Acknowledgment

This research has been supported in part by the National Science Foundation (Award #62288) and by a Multidisciplinary University Research Initiative (MURI) grant (#W911NF-09-1-0392) for "Unified Research on Network-based Hard/Soft Information Fusion", issued by the US Army Research Office (ARO) under the program management of Dr. John Lavery.

References

Bonacich, P., 1972. Factoring and weighting approaches to status scores and clique identification. J. Math. Sociol. 2 (1), 113–120.
Bonacich, P., Lloyd, P., 2001. Eigenvector-like measures of centrality for asymmetric relations. Soc. Netw. 23 (3), 191–201.
Borgatti, S.P., 2005. Centrality and network flow. Soc. Netw. 27 (1), 55–71.
Estrada, E., Rodriguez-Velazquez, J.A., 2005. Subgraph centrality in complex networks. Phys. Rev. E 71 (5), 056103.
Freeman, L.C., 1979. Centrality in social networks: conceptual clarification. Soc. Netw. 1 (3), 215–239.
Girvan, M., Newman, M.E.J., 2002. Community structure in social and biological networks. Proc. Natl. Acad. Sci. U. S. A. 99 (12), 7821–7826.
Lusseau, D., 2003. The emergent properties of a dolphin social network. Proc. Biol. Sci. 270, S186–S188.
Medus, A., Acuna, G., Dorso, C., 2005. Detection of community structures in networks via global optimization. Phys. A 358, 593–604.
Newman, M.E.J., 2005. A measure of betweenness centrality based on random walks. Soc. Netw. 27 (1), 39–54.
Noh, J.D., Rieger, H., 2004. Random walks on complex networks. Phys. Rev. Lett. 92, 118701.
Stephenson, K., Zelen, M., 1989.
Rethinking centrality: methods and examples. Soc. Netw. 11 (1), 1–37.
Tutzauer, F., 2007. Entropy as a measure of centrality in networks characterized by path-transfer flow. Soc. Netw. 29 (2), 249–265.

Engagement Capacity and Engaging Team Formation for Reach Maximization of Online Social Media Platforms∗

Alexander Nikolaev, University at Buffalo, 312 Bell Hall, Buffalo, New York 14260, [email protected]
Shounak Gore, University at Buffalo, 113 Davis Hall, Buffalo, New York 14260, [email protected]
Venu Govindaraju, University at Buffalo, 113 Davis Hall, Buffalo, New York 14260

ABSTRACT

The need to assess the "health" of online social media platforms and strategically grow them guides the efforts of researchers and practitioners. For those platforms that primarily rely on user-generated content, the reach – the degree of participation referring to the percentage and involvement of users – is a key indicator of success. This paper lays a theoretical foundation for measuring engagement as a driver of reach that achieves growth via positive externality effects. The paper takes a game-theoretic approach to quantifying engagement, viewing a platform's social support capital as a cooperatively created value and finding a fair distribution of this value among the contributors. It introduces engagement capacity, a measure of the ability of users and user groups to engage peers, and formulates the Engaging Team Formation Problem (EngTFP) to identify sets of users that "make the platform go". We distinguish our analyses, which underlie the reach maximization efforts, from the pre-existing influence maximization work and compare the engagement capacity with network-based metrics. Computational investigations with MedHelp and Twitter data reveal the properties of engagement capacity and the utility of EngTFP.
Categories and Subject Descriptors

A3 [Crowdsourcing Systems and Social Media]; A4 [Economics and Markets]; A8 [Social Networks and Graph Analysis]

Keywords

social networks, engagement, reach, team formation problem, influence maximization

1. INTRODUCTION

Measuring social influence online is becoming a more sophisticated, refined and granular process. A variety of professional services, such as Klout and PeerIndex, have come up lately that aim at measuring influence with as much granularity as possible. Social media influence can be defined as an individual's ability to affect the thinking or actions of peers; in this setting, it is of interest to identify the most influential persons with the objective of strategically changing political preferences in a community, or advertising products/services to increase their sales. However, influence measurement loses meaning when an online platform does not seek to exploit its userbase, but instead simply works to maintain user activity. The whole purpose of activities such as sharing pictures or posts, contributing comments or "likes" in response to peers' posts, or exchanging opinions in social forum threads is to share experiences and maintain (friendly) communication between fellow users of the network. This form of communication keeps the users engaged, and ultimately defines the influential power of the platform that they use. Take as an example the downfall of MySpace and Orkut alongside the success of Facebook. As more and more users migrated from Orkut to Facebook, the utilization of Orkut declined, leading to its eventual abandonment. The ability to detect such turns and strategically grow online platforms is of interest to many practitioners, putting the question of assessing a platform's "health" up on the research agenda.

∗Corresponding author. [email protected]
Taking a page from RE-AIM, a widely used systematic program evaluation framework, we use the term "reach" when referring to an online platform's ability to attract new users [39, 13], and define "engagement" as any existing user's degree of involvement in the platform's activities. The interplay between engagement and reach can be understood by leveraging the research that has for decades sought to understand the structure, stagnation, and growth of consumer markets. Studies of cascade emergence in demand-side economics have found that customers' willingness to choose (adopt) a product grows with the number of people that have already adopted it; this increase in value, otherwise known as positive network externality, occurs with each sale of a product unit [43, 18, 20]. On an online platform, positive network externalities occur in two instances: (1) when a new user joins the network and authors a new post, and (2) when an existing user contributes new user-generated content [43]. Thus, the internal growth of a platform (achieved through added user-generated content) leads to its external growth (the increase in the number of newly registered users): i.e., higher engagement leads to higher reach.

Figure 1: Examples of communication threads of different structure.

We posit that engagement occurs when a user contributes content in response to someone else's contribution(s). A network of directed relationships reflecting "which post attracts which" captures the sequence and structure of engagement (see Figure 1). An online forum grows through its users: every post (even "weather talk") fosters user "bonding" and creates a positive externality effect, even more so when it is read and responded to. The propagator – the creator of seed content – is said to engage the responder, while the responder is said to engage in the forum's activity.
Filling the methodological gap in the development of theory-supported methods for quantifying engagement, we introduce a new term, "engagement capacity", interpreted as a user's ability to engage peers and measured as their share in the platform's/forum's engagement power. We view the value generated by a massive online social network community as a direct product of communication between its members, attributed to the submitted contributions' content and structure (order), and only in part to their volume (note that irrelevant interactions are promptly removed from pro-health sites, a majority of which are moderated). In order to assess any individual user's engagement capacity, we employ cooperative game theory, a branch of science that calculates "fair-share" equilibria in settings with agents that form coalitions to achieve a common goal. Its first fundamental advance is due to Shapley [37], who introduced a method for calculating fair contributed values for "players" forming unstructured coalitions, where any contributor can interact or team up with any other contributor with the objective of generating synergistic value. This is a perfect setup for measuring engagement ability: the un-normalized and normalized (e.g., by the number of contributions or the amount of time spent online) engagement capacity can allow for fair comparison between users as they contribute to the growth of a platform. It is important to see the potential in studying "engagement" as opposed to studying "influence". The latter requires one to build a fixed-time snapshot of a social network that will define how users can affect each other's decisions with respect to a particular query; the former is query-independent. Yet, influence maximization research and centrality analyses are the closest in spirit to the presented work, so we briefly review those literatures in Section 2.
Section 3 then discusses the details of using cooperative game theory for measuring engagement capacity. Section 4 covers data collection and reports some computational results. Section 5 gives concluding remarks and discusses future work.

2. BACKGROUND AND RELATED WORK

Studies of social influence mechanisms have aimed to explain the patterns behind the spread of ideas, technologies, and viral product adoption in social networks. The earliest efforts made "word-of-mouth" an established term in social influence research [23, 45, 17] and developed diffusion-based models to describe innovation adoption [7], disease spread [27], and other phenomena in sociology [22, 41] and politics [9]. Metrics for quantifying people's ability to influence peers' decision-making, based on their structural network properties, have also been developed; they are known as centrality metrics [6]. It should be noted that most of such methods were proposed and used in the pre-big-data era. Later, with the advent of the Internet and the spread and growth of online social networks, various ranking-based algorithms gained popularity [32]. In studies of Question-and-Answer forums, the question of measuring the motivation of a user to contribute to a forum first came up [3, 33]. Simple "ad-hoc" measures, e.g., the number of upvotes on websites like StackOverflow, or the number of "likes" on Facebook and Twitter, were used to inform forum utilization and signal user interest. Natural language processing (NLP) works distill the specifics of a post to understand its success in attracting responses [15, 4, 21, 44]. Note that NLP methods that focus on post content, rather than on the dynamics of users creating posts in interaction with each other, require much human supervision and tend to be slow in processing big data. Moreover, as intuition suggests, none of the above-mentioned measures quantifies how much merit a user deserves for engaging others through interaction.
Influence maximization is a branch of social influence research worth a separate discussion. Works in this area formulate and solve the optimization problem of selecting a group of initiators (also called seeds or opinion leaders) to generate the largest influence cascade, i.e., the largest number of product or idea adopters [10, 19]. Identifying the most influential subset of the user base of an online resource turns out to be valuable for viral marketing [34, 14], delivering personalized recommendations [38], microblogging effectiveness [5], and health forum analysis [40]. The conventional influence maximization formulations, however, are not suitable for determining how to best keep an online platform active, i.e., answering the question "What set of users is the most important for keeping the whole user base engaged?" To this end, new methods for measuring engagement are in order.

Figure 2: Engagement capacity and network-based metric values for the users of a forum with three threads as depicted in Figure 1 (best viewed in color).

Little research has been done on measuring engagement thus far. The terms "engagingness" and "responsiveness" were introduced in modeling email message chains [30, 29]. In a study of Twitter, the count of re-tweets was taken as a measure of engagingness [1]. However, all these works were based on only direct responses from one user to another, with indirect communication (engaging one user through another) not accounted for.

3. MEASURING THE ENGAGEMENT CAPACITY OF A USER

This section presents a method to quantify users' ability to engage peers, taking into account the flow of communication between them. It uses cooperative game theory, a branch of science that calculates "fair-share" equilibria in settings with agents that form coalitions to achieve a common goal. The first fundamental advance in this theory is due to Shapley [37], who introduced a method for calculating fair contributed values for agents forming unstructured coalitions, where any contributing agent can interact or team up with any other contributor, with the objective of generating synergistic value. Indeed, the value generated by a coalition (team) typically exceeds the total value that could be generated by the coalition members individually (in isolation from each other) – such synergy is characteristic of most collective human efforts. Shapley's work has been extended by Myerson [25, 26], and later by other researchers, to problems with constraints on cooperation structure and frequency [31, 28, 46, 8]. The most relevant to our proposed work are the extensions that recognize that in certain situations, the value generated by a coalition may depend on the order in which players interact [28, 35]. We extend and adapt this line of work to render it useful in the thread-based communication context, by allowing for multiple interactions between the users (as players) in forum threads, accounting for contribution (forum post) order.

Consider a setting where forum users cooperatively generate engagement value: if they contribute posts that attract more posts, then they are promoting the forum's growth and increasing the forum's "engagement power" achieved through positive externality effects. We define the engagement capacity of a user as their share in the overall forum's ability to attract posts, based on all threads the user has contributed to. It is assumed that the engagement capacity of a user can be positive if and only if the user has been active in at least one thread; that is, passive readers (e.g., forum lurkers) cannot add engagement value to the forum. Moreover, the users who contribute but are never responded to also have zero engagement capacity; indeed, if a forum consisted of only such members, it would generate no communication at all.
Based on a single thread, the engagement capacity of a user depends on the position(s) of the user's contribution(s) in the thread's flow. As such, a thread starter is (in part) responsible for engaging all the users that have replied to the thread: a user whose post generates further posts gets partial engagement value credit from all such contributions. On the other hand, the posts immediately preceding a newly contributed post should get more credit for engaging it than the posts appearing earlier in the thread: otherwise, the new contribution would be expected to have occurred earlier or to be directly responding to those older posts. Let us look closely at the examples in Figure 1: assume that these three forum threads have been created by the same five users. User A originates each thread. User B replies to user A's original messages in all three threads. In the first thread (Figure 1a), users C, D and E reply to B, instead of directly replying to user A. In the second thread (Figure 1b), users C and D also reply to A's question directly. The third thread (Figure 1c) consists of a single "line" thread branch. Based on the directed networks in Figure 1, the averages of some standard centrality metrics are reported in Figure 2 (the detailed explanation of how the engagement capacity metric in the figure is computed, and the discussion of how the metrics relate to each other, will be given in the subsequent sections). A methodological approach to computing engagement capacity is presented next, followed by a discussion of some computational aspects of engagement analysis.

3.1 Cooperative Games on k-Coalitions

Nowak and Radzik [28] and Sánchez et al. [35] explain that for a transferable utility (cooperative) game on a directed graph, the worth of a coalition can depend not only on the individual properties of coalition members but also on the order in which they interact in a coalition.
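For contrast with such ordered settings, the classic (order-free) Shapley value [37] can be illustrated with a minimal sketch, computing each player's average marginal contribution over all join orders; the three-player characteristic function below is a toy assumption of ours, not an example from the paper.

```python
from itertools import permutations

# Minimal sketch of the classic Shapley value: average each player's
# marginal contribution v(S + {p}) - v(S) over all join orders.
def shapley(players, v):
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = set()
        for p in order:
            phi[p] += v(coalition | {p}) - v(coalition)
            coalition.add(p)
    return {p: phi[p] / len(orders) for p in phi}

# Toy game (our assumption): one unit of value is created as soon as
# players A and B are both present; C contributes nothing.
v = lambda S: 1.0 if {"A", "B"} <= S else 0.0
values = shapley(["A", "B", "C"], v)
# A and B each receive 0.5; C receives 0.
```

Because the characteristic function here ignores ordering, A and B split the value evenly; the extensions discussed next drop exactly this symmetry.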
Nowak and Radzik [28] define the value Ψ^{NR} that generalizes the Shapley value for transferable utility games. For a game with player set N and value function v, where player subsets S ⊂ N form ordered coalitions T ≡ (i_1, ..., i_{|T|}) from the set π(S) of all possible orderings of S, this value for player i amounts to

\Psi^{NR}_i(N, v) = \sum_{S \subseteq N \setminus \{i\}} \sum_{T \in \pi(S)} \frac{(|N| - |T| - 1)!}{|N|!} \left( v(T \cup i) - v(T) \right),

which, with Ω(N) as the set of all possible ordered subsets of N, H(T) as the set of players in coalition T, and i(T) as the position of player i in coalition T, can be concisely re-written as

\Psi^{NR}_i(N, v) = \sum_{T \in \Omega(N),\, i \in H(T),\, i(T) = |T|} \frac{\Delta^*_v(T)}{|T|!},

where \Delta^*_v(T), T ∈ Ω(N), T ≠ ∅, are termed the generalized coefficients of v, also known as the coordinates of v in the generalized unanimity basis [16]. Sánchez and Bergantiños [35] define another Shapley value extension, distinguishing coalitions by the players' positions within them:

\Psi^{SB}_i(N, v) = \sum_{T \in \Omega(N),\, i \in H(T)} \frac{\Delta^*_v(T)}{|T| \, (|T|!)}.

More recently, del Pozo et al. [8] defined a generalization of Ψ^{NR} and Ψ^{SB} as a parametric family of functions {Ψ^α}_{α∈[0,1]}, where the value generated by an ordered coalition is shared proportionally to the positions of its members:

\Psi^{\alpha}_i(N, v) = \sum_{T \in \Omega(N),\, i \in H(T)} \frac{\Delta^*_v(T)}{|T|!} \cdot \frac{\alpha^{|T| - i(T)}}{\sum_{j=0}^{|T|-1} \alpha^j}.

We extend and adapt the line of work on ordered transferable utility games [28, 35, 8] to render it useful in the thread-based communication context, where interactions take place between online platform users (as players) as they contribute posts to forum threads in response to each other, in sequence, and possibly multiple times to the same thread. To this end, define a k-coalition as a connected ordered sequence of player appearances. As with the coalitions used to define {Ψ^α}_{α∈[0,1]}, a k-coalition is distinguished not only by its membership, but also by its ordering.
Similarly to i(T), we let i(T, k), k = 1, 2, ..., K, denote the position of the k-th appearance of player i ∈ H(T) in k-coalition T ∈ Ω_K(N), with Ω_K(N) denoting the set of all k-coalitions in which any given player can appear at most K times. The value generated by a k-coalition is shared proportionally to the positions of the appearances of the coalition members:

\Psi^{K\text{-}\alpha}_i(N, v) = \sum_{T \in \Omega_K(N),\, i \in H(T),\, k = 1, \ldots, K} \frac{\Delta^*_v(T)}{|T|!} \cdot \frac{\alpha^{|T| - i(T,k)}}{\sum_{j=0}^{|T|-1} \alpha^j}. \quad (1)

The family of parametric functions {Ψ^{K-α}}_{K∈I^+, α∈[0,1]} encompasses, as special cases, the conventional Shapley value as well as the functions Ψ^{NR}, Ψ^{SB}, and {Ψ^α}_{α∈[0,1]}. The concept of engagement capacity is now introduced, in conjunction with the term engaging subthread in forum communication. Define a subthread as an uninterrupted chain (sequence) of posts of a thread that contains (and thus begins with) the first post of the thread; in the directed tree graph representing a forum thread, every path that begins at the root is a subthread. A subthread is called engaging if it is succeeded by at least one post in its respective thread. Therefore, the number of forum posts contributed in response to, or following up on, another post (or a sequence of posts) constitutes the total engagement value generated by the forum users. The share of each user in this value is called engagement capacity: it measures each user's ability to engage peers, computed retrospectively, i.e., based on their past activity records. Given an online forum, let N denote the set of all the users in the forum's userbase. Define (N, v, P) as a game on the set P of all the forum's subthreads. A subthread-restricted game (N, v_U, p) can be interpreted as the game where users U ⊂ N contribute posts to form p ∈ P; this setup is similar, but not identical, to a game in a communication "situation" described in [8].
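The subthread definitions above can be sketched in a few lines; the dictionary-based thread encoding below is our illustrative assumption, applied to thread (a) of Figure 1.

```python
# Sketch of the subthread definitions: a subthread is a root-anchored path
# in the reply tree; it is engaging if its last post has at least one reply.
# The `children` encoding of a thread is our assumption, not the paper's.

def subthreads(children, root):
    """Yield every path that begins at the root post."""
    stack = [[root]]
    while stack:
        path = stack.pop()
        yield path
        for reply in children.get(path[-1], []):
            stack.append(path + [reply])

def is_engaging(children, path):
    """Engaging: succeeded by at least one post in the thread."""
    return bool(children.get(path[-1]))

# Thread (a) of Figure 1: A starts; B replies to A; C, D, E reply to B.
children_a = {"A": ["B"], "B": ["C", "D", "E"]}
paths = ["".join(p) for p in subthreads(children_a, "A")]
engaging = ["".join(p) for p in subthreads(children_a, "A")
            if is_engaging(children_a, p)]
# Subthreads: A, AB, ABC, ABD, ABE; the engaging ones are A and AB.
```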
Given a forum's snapshot (historical data), set K to be the largest number of posts contributed by the same user to any subthread, and let Δ^*_v(T) return the total number of posts immediately succeeding the engaging subthreads p ∈ P that have the same membership, size and structure as k-coalition T ∈ Ω_K(N). The engagement capacity of forum user i ∈ N is the value that (1) returns for this user as a solution of the game (N, v, P); it is hereafter denoted by η_i, or by η_{i,F} to specify that the computation is based on a particular set F of forum threads. Note that the coefficient α ∈ [0, 1] in {Ψ^{K-α}}_{K∈I^+, α∈[0,1]} captures the engagement share tradeoff between thread contributors. As such, if a new post is submitted in response to multiple (preceding) consecutive posts, then its immediate predecessor gets more credit for attracting it, with the credit to the earlier predecessors discounted by factors of α, α², α³, etc., respectively.

3.2 Calculating Engagement Capacity

Consider the example in Figure 1a, and denote the thread depicted in it by "(a)". Using (1), the engagement capacity of user A is found to be

η_{A,(a)} = 1 + 3 · α/(α + 1),

where user A gets the engagement value of 1 by contributing to engaging subthread A and 3α/(α + 1) by contributing to engaging subthread AB; meanwhile, subthreads ABC, ABD and ABE are not engaging. Similarly, the engagement capacity of user B amounts to

η_{B,(a)} = 3 · 1/(1 + α).

Note that users C, D and E have zero engagement capacity in this example; also, the total engagement value generated and shared is equal to four, i.e., the number of posts contributed by users in response to their peers. Table 1 gives the engagement capacity values for each of the threads in Figure 1, with α = 1, i.e., giving all the subthread propagators equal credit for engaging a responder.
Table 1: Engagement capacity values computed for the threads in Figure 1.

User    A       B       C       D       E
(a)     2.5     1.5     0       0       0
(b)     4.5     0.5     0.5     0.5     0
(c)     2.083   1.083   0.583   0.25    0

Figure 3: Distribution of engagement capacity values for MedHelp and Twitter users (best viewed in color).

In these examples, the early contributors to a thread have higher engagement capacities than the late ones. However, in general, this is not necessarily the case; for example, with α = 0, η_{A,(a)} = 1 and η_{B,(a)} = 3. The engagement capacity values can be interpreted directly, or upon normalization. One sensible way is to normalize by the number of contributed posts, to identify the users whose contribution content is engaging irrespective of the volume. The case where a forum post is immediately followed (in the same thread) by another post of the same user is worth a special discussion. In general, the treatment of such occurrences is up to the researcher: e.g., one may choose to "merge" such posts and treat them as one. However, in a moderated forum, one may assume that every contribution is distinct, i.e., makes a new point; since every post grows the network and creates positive externalities, rewarding the user (with a share in the total engagement value) for both posts makes sense. Another issue that becomes apparent in the considered example is that a user may gain a high engagement capacity by engaging themselves (as opposed to others): this issue will be addressed in Section 4, with the introduction of "targeted engagement capacity." Engagement capacities of users can be, rather conveniently, dynamically computed and updated in real time, as new content is added, without the need to redo the analysis for the whole history of the forum/platform every time it changes. Recall that each new user post submitted in response to another post, or sequence of posts, brings in one unit of engagement value to the communication thread and to the platform as a whole.
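The per-post credit split behind these numbers can be sketched as an incremental update, in the spirit of the dynamic computation just described; this is our reading of the sharing rule, and the function and variable names are ours.

```python
# Sketch of the incremental update: each new reply distributes one unit of
# engagement value over the subthread it responds to; the appearance at
# position p of a length-L subthread receives alpha**(L - p) / sum_j alpha**j.
# Our reading of the sharing rule; names are illustrative.

def add_post(eta, subthread, alpha):
    """Credit the authors of `subthread` for attracting one new reply."""
    L = len(subthread)
    denom = sum(alpha**j for j in range(L))
    for p, user in enumerate(subthread, start=1):
        eta[user] = eta.get(user, 0.0) + alpha**(L - p) / denom

def thread_a(alpha):
    """Thread (a) of Figure 1: B replies to [A]; C, D, E reply to [A, B]."""
    eta = {}
    add_post(eta, ["A"], alpha)           # B's reply engages A
    for _ in "CDE":
        add_post(eta, ["A", "B"], alpha)  # each reply engages A, then B
    return eta

# With alpha = 1: eta_A = 2.5 and eta_B = 1.5, matching column (a) of
# Table 1; with alpha = 0: eta_A = 1 and eta_B = 3, as noted in the text.
```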
A newly added post increases the engagement capacity values of all the users contributing to the subthread leading to the new post (but not including it). Consider a k-coalition T′ ∈ Ω_K(N) with the same membership, size and structure as this subthread. The increase in the engagement capacity of user i ∈ H(T′), resulting from the addition of the new post, is given by

\Delta_i = \sum_{k=1,2,\ldots,K} \frac{\alpha^{|T'| - i(T',k)}}{\sum_{j=0}^{|T'|-1} \alpha^j}. \quad (2)

Equation (2) specifies how every new contribution changes the engagement capacity values of forum users. This equation can be used in real time to efficiently track the contributed engagement dynamics. Note that the expression in (2) can be evaluated in O(n) time: its denominator requires finding the subthread succeeded by the new post, and its numerator requires the information about the positions of user appearances in this subthread.

4. TARGETED ENGAGEMENT CAPACITY AND ENGAGING TEAM FORMATION

Equation (2) of Section 3 specifies how to fairly distribute the unit engagement value, brought to the platform with any new post submitted into an existing thread, among the thread contributors. Naturally, some users may be successful in engaging certain peers and not so successful in engaging others. This realization highlights the value of solving the game of engaging one given user j: the peers of j would split the engagement value generated by j's responses to their posts or post sequences. In this case, only those subthreads that are followed up on by the posts of j would be considered engaging. More generally, one can address the question of evaluating the ability of a given set of users, V ⊂ N, to engage another set of users, W ⊂ N, in forum communication. To this end, we introduce the term targeted engagement capacity: denoted by η_{V→W}, it is defined as the sum of the shares allocated to the members of V in the game of engaging the members of W.
In the setting of the game on k-coalitions described in Section 3.1, one has

\eta_{V \to W} = \sum_{i \in V} \; \sum_{T \in \Omega_K(N),\, i \in H(T),\, k = 1, \ldots, K} \frac{\Delta^*_v(T)}{|T|!} \cdot \frac{\alpha^{|T| - i(T,k)}}{\sum_{j=0}^{|T|-1} \alpha^j}, \quad (3)

where \Delta^*_v(T) returns the total number of posts by the members of W immediately succeeding the engaging subthreads p ∈ P that have the same membership, size and structure as k-coalition T ∈ Ω_K(N). As an interesting special case, note that η_i ≠ η_{i→N\{i}}: the difference between these quantities is η_{i→{i}}; it indicates how much user i tends to engage in back-and-forth conversations, as opposed to occasionally contributing to multiple forum threads. Note that targeted engagement capacity can be updated dynamically in the same manner as the originally defined engagement capacity, by a trivial extension of (2).

Figure 4: Engagement capacity per user contribution, for the users ordered by increasing contribution volume.

The ability to measure how successful a particular user is in engaging other particular users allows us to attack the following question: "What group (team) of active users is most engaging?" This question is of special interest to any growing online platform, since such teams of users can be rewarded, encouraged, and assisted in further increasing peer engagement and retention. To help this cause, the Engaging Team Formation Problem (EngTFP) is introduced. The EngTFP objective is to select a subset of users with a maximal targeted engagement capacity towards all the other users: max_U η_{U→N\U}. The problem can be formulated with additional constraints, e.g., those specifying which historical engagement data to take into account and/or the maximum size of the subset to be selected (e.g., |U| ≤ b). To set up an instance of EngTFP, first use (3) to compute all the pairwise targeted engagement capacity values, {η_{i→j}}_{i∈N, j∈N, i≠j}. Note that, in general, η_{i→j} ≠ η_{j→i} for i ≠ j.
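A small end-to-end sketch of this setup, on the toy thread (a) of Figure 1 with α = 1: compute the pairwise targeted values η_{i→j} and then brute-force the EngTFP objective max_U η_{U→N\U} for teams of bounded size. The encoding of replies, our reading of Eq. (3), and all names are illustrative assumptions, not the paper's code.

```python
from itertools import combinations

# Pairwise targeted engagement values eta[i][j] (our reading of Eq. (3)):
# a reply by `author` distributes one unit over the subthread it follows,
# credited within the game of engaging that author; self-credit is excluded.

def pairwise_eta(users, replies, alpha=1.0):
    eta = {i: {j: 0.0 for j in users if j != i} for i in users}
    for author, subthread in replies:
        L = len(subthread)
        denom = sum(alpha**j for j in range(L))
        for p, user in enumerate(subthread, start=1):
            if user != author:
                eta[user][author] += alpha**(L - p) / denom
    return eta

def eng_tfp(users, eta, b):
    """Brute force: best team U, |U| <= b, maximizing eta over pairs
    (i in U, j outside U); feasible only for small user sets."""
    best_val, best_team = float("-inf"), set()
    for r in range(1, b + 1):
        for U in combinations(users, r):
            val = sum(eta[i][j] for i in U
                      for j in users if j not in U and j != i)
            if val > best_val:
                best_val, best_team = val, set(U)
    return best_team, best_val

# Thread (a) of Figure 1: B replies to [A]; C, D, E each reply to [A, B].
users = ["A", "B", "C", "D", "E"]
replies = [("B", ["A"]), ("C", ["A", "B"]),
           ("D", ["A", "B"]), ("E", ["A", "B"])]
eta = pairwise_eta(users, replies, alpha=1.0)
team, value = eng_tfp(users, eta, b=2)
```

On this toy instance the best two-member team is {A, B}: together they engage C, D and E, but earn no credit for engaging each other once both are on the team, which is exactly the trade-off that makes EngTFP combinatorial.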
Let X_i, i ∈ N, be binary decision variables such that, for any i ∈ N, X_i = 1 if i is selected into U, and zero otherwise. Let Y_{ij}, i ∈ N, j ∈ N, i ≠ j, be auxiliary binary variables that are set to 1 if and only if X_i = 1 and X_j = 0. Then, EngTFP is given by

max \sum_{i \in N} \sum_{j \in N, j \neq i} \eta_{i \to j} Y_{ij}
s.t. Y_{ij} \leq X_i, \; \forall i, j,
     Y_{ij} \leq 1 - X_j, \; \forall i, j,
     X_i, Y_{ij} \in \{0, 1\}, \; \forall i, j.

A non-linear formulation of EngTFP would not require any auxiliary variables and would maximize \sum_{i \in N, j \in N, j \neq i} \eta_{i \to j} (X_i^2 - X_i X_j); this formulation is quadratic but not convex, and does not allow for a dominant convex decomposition [24], confirming that EngTFP is combinatorially challenging. EngTFP is a special case of the MAX2SAT problem, which is known to be NP-Hard [12]; for a review of the algorithmic work on MAX2SAT, see [11]. The complexity of EngTFP lies in the fact that, once user i is selected into a team, the team members can no longer be rewarded for engaging i. Indeed, if a platform decides to pay some users for helping grow it, the users to be paid should be good at attracting other users, but not each other. Again, it is important to underline the difference between EngTFP and the influence maximization problem. Influence maximization strives to enhance a certain effect (e.g., change in political opinion, product adoption, etc.) throughout an existing and known user network. On the other hand, EngTFP solutions aim at helping a platform maintain or build its network or non-network userbase. EngTFP informs a decision-maker which users should be rewarded, virtually (e.g., via badges, titles, points) or physically (e.g., via gift cards, discount codes, cash), for igniting communication, which the users achieve through content generation, question asking, social support provision, information exchange, etc.

5. COMPUTATIONAL INVESTIGATIONS

This section reports the engagement capacity analyses conducted with data from two active online communication platforms differing in purpose.
We begin with an account of the data collection, and then present the numerical findings.

5.1 Data Collection

We collected forum contribution records from the online healthcare platform MedHelp, one of the biggest active and freely accessible online sites for pro-health social networking. It has about 200 social support forums and about as many “ask an expert” forums. The website has close to 3 million active and inactive threads and attracts about 8 million visitors every month. The MedHelp users interact through discussion boards, contribute personal journal entries exploiting weight and mood tracker features, and post notes on their friends' home pages. The data most relevant to the present study are those of the users' interactions on discussion forums: such forums allow the users to give each other social support. A web crawler was developed and used to collect the question-answer type data from the “cholesterol control” and “weight-loss and dieting” forums. The weight-loss and dieting forum has about 7,000 threads dating back to early 1999. The cholesterol control forum has about 280 threads. Each thread consists of a single question followed by answers and/or relevant comments. Note that not all the threads have replies; the unanswered threads do not affect the engagement capacity computation. After cleaning up the threads with no replies, we end up with a database of 4,296 unique users for the MedHelp data. Note that the users are allowed to contribute to any forum, and hence some users have been active on both of the considered forums while others have contributed to only one of them. A data tree is built for each analyzed thread. In each such tree, every user contribution is represented by a node; a directed link is drawn from node A to node B if B replies to A. This tree captures the relations between the users who interact in a particular thread discussion.
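The data-tree construction described above can be sketched as follows. This is a minimal pure-Python version; the field names `post_id`, `parent_id`, and `user` are illustrative, not the crawler's actual schema.

```python
def build_thread_tree(posts):
    """Build a reply tree for one thread.
    `posts` is a list of (post_id, parent_id, user) tuples, where
    parent_id is None for the root question. Returns a dict mapping
    each post_id to its list of direct replies (the directed links
    A -> B meaning B replies to A)."""
    children = {post_id: [] for post_id, _, _ in posts}
    for post_id, parent_id, _ in posts:
        if parent_id is not None:
            children[parent_id].append(post_id)
    return children

def branch_depth(children, root):
    """Length of the deepest reply chain below `root` (root counts as 1)."""
    replies = children[root]
    if not replies:
        return 1
    return 1 + max(branch_depth(children, r) for r in replies)

# Hypothetical thread: a question by user "A" with two reply branches.
thread = [(1, None, "A"), (2, 1, "B"), (3, 1, "C"), (4, 2, "D")]
tree = build_thread_tree(thread)
print(tree[1])               # direct replies to the question: [2, 3]
print(branch_depth(tree, 1)) # deepest chain 1 -> 2 -> 4: depth 3
```

The same per-thread trees underlie both the MedHelp and the Twitter analyses; only the interpretation of a link (reply vs. re-tweet) differs.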
The tree structure also tells us who initiates a particular discussion by posting a question, who engages the maximum number of people by furthering a discussion with good comments, and who interacts frequently in a particular conversation. The engagement capacities computed and presented hereafter are based on both forums (all their threads) together. Another dataset was created using Twitter, based on about 20,000 tweets, with their re-tweets, related to the 2014 FIFA World Cup. A tweet thread consists of a tweet and all of its re-tweets. A directed link is drawn from node A to node B if user B re-tweets user A's tweet. Thus, we have a number of communication threads where every root is a particular original tweet and the other nodes are re-tweets. This tree structure, like the MedHelp tree, allows us to find the initiator, the most engaging user, and the person who interacts frequently in a particular conversation. A total of 31,467 unique Twitter users contributed to these threads.

5.2 Numerical Results

Using the dynamic approach to engagement capacity measurement with the collected data, we assess how the different influence-related measures behave as compared to the proposed scheme of measuring user engagement. Figure 2 reports the values of the different metrics for the example in Figure 1, considering that these three threads constitute the whole forum. In Figure 2, the users are arranged over the horizontal axis in the descending order of their engagement capacity values. The other metrics' values were calculated using the standard networkx [36] package in Python. All the experiments were conducted on a MacBook Pro machine with an Intel i5 2.3GHz processor. As expected, the engagement capacity values are observed to depend on the number of users engaged, the length of the communication branches, and the frequency of communication. The deeper a branch goes, the more value each of the upstream users gains.
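The last property (upstream users gaining value from deep branches) can be illustrated with a simplified discounted-credit scheme. This is only an intuition-level sketch with an assumed geometric discount α per reply level, not the game-theoretic computation of the earlier sections.

```python
def discounted_upstream_credit(children, authors, root, alpha=0.99):
    """Toy illustration: each post at depth d below a user's post adds
    alpha**d to that user's credit, so deeper reply chains reward all
    upstream users. `children` maps post -> direct replies; `authors`
    maps post -> user. A simplified stand-in for engagement capacity."""
    credit = {}
    def walk(post, ancestors):
        # every ancestor is "upstream" of this post and earns
        # discounted credit for it
        for depth, anc in enumerate(reversed(ancestors), start=1):
            user = authors[anc]
            credit[user] = credit.get(user, 0.0) + alpha ** depth
        for reply in children.get(post, []):
            walk(reply, ancestors + [post])
    walk(root, [])
    return credit

# Hypothetical chain: C replies to B, B replies to A's question.
children = {1: [2], 2: [3], 3: []}
authors = {1: "A", 2: "B", 3: "C"}
print(discounted_upstream_credit(children, authors, 1, alpha=0.5))
```

With α = 0.5, A earns 0.5 for B's reply plus 0.25 for C's deeper reply, B earns 0.5, and C, who never elicits a response, earns nothing; this mirrors the behavior of users A, B, and E discussed next.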
User A, involved in both engaging a lot of users (Figure 1a) and seeding deeper communication branches (Figure 1c), has the highest engagement value of all the users. User B also exhibits a high engagement value because it engages a fair share of users directly (Figure 1b) and also gets partial credit whenever the depth of the branch increases (Figure 1c). User E has zero engagement value due to failing to engage any peers – this user always stops the communication.

Table 2: Targeted engagement values for top 8 engaging MedHelp users

      A    B    C    D    E    F    G    H
A   3.9  2.9  1.9  2.5  1.4  1.2  0    2
B   2.3  2.8  3.8  4.4  4.4  1    2.4  0
C   4.3  3.5  4.6  5    1.1  2.2  3    1.5
D   4.5  1.9  0    0    1.7  1    4.6  0.8
E   1.2  0.9  0    2.4  1.5  4    4.7  2.7
F   3.8  1.4  1.7  2.6  4.7  5    2.2  4.5
G   1.3  2.5  3.4  0.6  2.1  0.6  2.4  3.5
H   1.3  3.4  0    2.3  0.2  0.5  0.9  1.6

Table 3: Targeted engagement values for top 8 engaging Twitter users

      A    B    C    D    E    F    G    H
A   8.1  7.8  4.7  2    5.3  6.3  0.1  0.3
B   4.1  8.9  7.8  8.1  6.1  4.1  1.4  3.5
C   2.7  5.8  8.8  2.5  7.3  4.5  1.3  3.2
D   2.7  4.7  6.8  0.9  1.3  1.9  1.3  6.3
E   6.7  2.7  3.1  8.6  2.1  6.1  2.4  1.7
F   2.4  7.9  3.9  1.8  7.4  5    2.6  1.2
G   6.7  4.9  5.1  7.6  6.5  1.1  7.6  9.2
H   4.6  6.4  7.8  2.6  8.4  4.4  3.8  8.7

The results in Figure 2 suggest that the engagement capacity calculation works as expected, in line with intuition. It can be seen that engagement capacity positively correlates with out-degree: this makes sense, since high engagement capacity signals a high ability of a user to spread information, and thus, highly engaging users are expected to be connected to more people. Engagement capacity also correlates with betweenness centrality, since the latter indicates how capable users are of transferring information between otherwise disconnected subgroups. The PageRank and in-degree behave very differently from engagement capacity, since they focus on describing the information flows into a user, not the other way around. We now turn to the collected MedHelp and Twitter data.
In all the subsequent analyses, the (targeted) engagement capacities were computed with α = 0.99. Figure 3(a) shows that the distribution of engagement capacity is Gaussian for both the considered MedHelp user base and the Twitter user base. The engagement values achieved by Twitter users are generally higher than those of the MedHelp users because of the higher overall level of communication (i.e., re-tweets contributed vs. posts submitted). Figure 3(b) shows that the distribution of the number of contributions per user on each platform is a power law (the probability density functions look like straight lines on the log-log scale, with the imperfections due to small sample sizes): this is a common observation in social media analyses, which has also been recently found for pro-health forums [42]. Figure 3 reveals something very important about the nature of the engagement capacity metric: it informs us more of the personality of a user (indeed, personal characteristics/abilities/talents, e.g., IQ, are typically normally distributed in humans) as opposed to the measures characterizing behavioral patterns or activity levels. It appears that engagement capacity truly meets the objective of measuring the innate ability of a user to ignite cascades and attract peers to respond.

[Figure 5: Comparison of different metrics, panels (a)–(e), for the Twitter users arranged according to the increasing engagement capacity.]

Figure 4 depicts the engagement capacity values, normalized by the number of contributions, for the users in the MedHelp forum (Figure 4(a)) and on Twitter (Figure 4(b)); a horizontal shift is applied to some points in the plots for better visualization.
The plots verify that the engagement value is not entirely dependent on the number of contributions made by the corresponding user; moreover, the correlation between these two quantities is positive on Twitter but negative on MedHelp, signaling that social support provision is difficult to maintain just by increasing activity. Also, there is a larger variance in the engagement per contribution in the Twitter data, which can be attributed to the shorter and broader communication trees that get formed on Twitter as compared to MedHelp. The plots also show that some users manage to be very engaging even though they contribute to relatively few communication trees. Next, the various metrics used for measuring influence are compared against the proposed engagement capacity metric in Figure 5, based on the Twitter data. The Twitter users are first arranged in the increasing order of their engagement capacity values. Then, they are partitioned into bins of equal width separated by the percentile points (forming a total of 100 bins). The horizontal axis in Figure 5 contains the percentile values. The highest value in each bin (for each particular metric) was used to plot the graphs in Figure 5. The vertical axis shows the metric values. Figures 5(a)–5(e) present the engagement capacity, PageRank, eigenvector centrality, betweenness centrality, and degree centrality values, respectively, for the Twitter users. Note that the MedHelp data revealed similar results, which are omitted for brevity. As can be seen from Figure 5, none of these metrics, taken individually, seems to have a significant correlation with the proposed engagement value. In order to find any possible relation between the engagement value and the other metrics taken together, we perform a linear regression of the engagement value on them. The regression returned a very high R² value – around 87% for each of the data sets.
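A regression of this kind can be sketched as follows: fit the engagement values on the full set of network metrics by ordinary least squares and report the coefficient of determination R². The feature values below are synthetic stand-ins, not the paper's measurements.

```python
import numpy as np

def r_squared(features, target):
    """Fit target ~ features (with intercept) by ordinary least squares
    and return the coefficient of determination R^2."""
    X = np.column_stack([np.ones(len(target)), features])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    predictions = X @ beta
    ss_res = np.sum((target - predictions) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-user PageRank, betweenness, and degree:
metrics = rng.random((200, 3))
# Engagement constructed as a noisy combination of the metrics, so no
# single column explains it well but jointly they explain most of it:
engagement = metrics @ np.array([0.5, 1.5, -0.8]) \
    + 0.05 * rng.standard_normal(200)
print(round(r_squared(metrics, engagement), 3))
```

On such jointly-generated data R² is close to 1; a value of about 87%, as reported above, likewise indicates that the metrics taken together explain most of the variance in the engagement values.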
This shows that engagement capacity captures multiple aspects of what it means for a node to be important/central in a directed communication tree, as judged by the metrics traditionally used to measure influence. Now let us take a look at the top 10 engaging users (users with the highest engagement value) for both MedHelp and Twitter. A MedHelp user can up-vote a particular answer in a given thread if she feels the answer is good. This up-vote is denoted by a “star” in MedHelp. Depending on the number of such stars, users are then recognized as top contributors for various forums. Four of the top 10 engaging users turn out to be among the top contributors on MedHelp. This shows that knowledgeable and widely accepted users can end up having a high engagement value, although this is not guaranteed. We also find the top 10 tweet creators in the Twitter dataset. Two of the ten such users were accounts associated with news-related agencies. For an event like the FIFA World Cup, a news agency seems like the right place where other users would go for information. Three of the remaining eight users have a large number of followers and re-tweets. Engagement capacity can thus reveal engaging users both in terms of the content they serve as well as the followers they attract.

Table 4: EngTFP results for the best team of size k

k                      2           3            4           5           unconstrained
# of users in top 10   2           2            2           3           5
Time                   ∼8 hours    ∼9.5 hours   ∼10 hours   ∼11 hours   ∼25 hours

Next, we use the top eight of the above users to show what targeted engagement capacity can capture. Using equation (3), the results for the MedHelp data are presented in Table 2, while those for the Twitter data are presented in Table 3. Users A through H are the top eight most engaging users in each forum, with A being the most engaging, B the second most engaging, and so on. The columns in these tables are for the users whose engagement is calculated, and the rows are for the targeted users.
It can be observed that, indeed, η_{i→j} ≠ η_{j→i} for i ≠ j, i.e., the targeted engagement value from one user i to another user j is not the same as that from user j to user i. Some of the targeted engagement values are zero, signaling no interaction, in the stated direction, between the user whose engagement capacity was evaluated and the targeted user. For example, in Table 2, the value for row C and column D is zero. This means there was no communication in the MedHelp forums where D replied to C directly or indirectly. Last but not least, we present the study of EngTFP applied to the MedHelp data. A total of 1,000 random users were selected from the MedHelp dataset, with the objective of finding the most engaging subset of size k among them. The results for several EngTFP instances, solved as integer programs in the SCIP optimization suite [2], are summarized in Table 4. In the instances, the value of k was varied between two and five, and in one instance, k remained unrestricted. Comparing the best teams' members against the list of the top 10 most engaging MedHelp users, it can be seen that EngTFP does select some highly engaging users, but not all of them. This is because by selecting a pair of users who manage to engage each other, one loses much targeted engagement. For the same reason, the unconstrained problem optimizes at k as low as 14. Out of the 14 users selected, five are among the top ten most engaging users. Four of the remaining nine have quite low engagement values, which indicates that EngTFP selects some obscure users in order to maximize the reach. The last five of the 14 users have mid-ranged engagement values. Table 4 also shows the time taken to solve each EngTFP instance. It should be noted that the high run time is attributed to the fact that each run included computing the pairwise targeted engagement capacity values for every pair of users and then solving the EngTFP. 6.
CONCLUSION

This paper introduces engagement capacity – a metric that serves the well-defined purpose of measuring the ability of online platform users to engage each other in communication on the platform, creating more user-generated content and increasing the platform's reach through positive externality effects. We present two methods for computing engagement capacity: the basic method, rooted in cooperative game theory, and the dynamic method, which performs the same computations incrementally, eliminating the need to re-calculate the engagement value from scratch every time the communication structure changes, e.g., with the addition of new threads and posts to a forum. The reported experimental results show how the new metric compares against the previously existing network metrics typically used to assess the influential power of nodes. The regression results show that the proposed engagement value captures different aspects of those pre-existing network metrics in a single value. The engagement capacity reveals the different dynamics of communication and engagement in two social media platforms differing in purpose, MedHelp and Twitter. The experimental results show how the targeted engagement capacity works, and how it can be used to evaluate the ability of one user to engage another user. Future research into the expansion and utility of this concept will allow one to analyze how and why certain users manage to engage others, and will facilitate research into the mechanisms of engagement. Finally, we show through one sample study how one can formulate and solve an Engaging Team Formation Problem (EngTFP) to identify the users who are critical to a platform's success. This extension is expected to be practically valuable, as many services and organizations can then reward such users in calculated ways.

7. REFERENCES

[1] P. Achananuparp, E.-P. Lim, J. Jiang, and T.-A. Hoang. Who is retweeting the tweeters?
Modeling, originating, and promoting behaviors in the Twitter network. ACM Transactions on Management Information Systems (TMIS), 3(3):13, 2012.
[2] T. Achterberg. SCIP: Solving constraint integer programs. Mathematical Programming Computation, 1(1):1–41, 2009. http://mpc.zib.de/index.php/MPC/article/view/4.
[3] L. A. Adamic, J. Zhang, E. Bakshy, and M. S. Ackerman. Knowledge sharing and Yahoo Answers: Everyone knows something. In Proceedings of the 17th International Conference on World Wide Web, pages 665–674. ACM, 2008.
[4] E. Agichtein, Y. Liu, and J. Bian. Modeling information-seeker satisfaction in community question answering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(2):10, 2009.
[5] F. Bonchi, C. Castillo, and D. Ienco. The meme ranking problem: Maximizing microblogging virality. Journal of Intelligent Information Systems, page 29, 2013.
[6] S. P. Borgatti. Identifying sets of key players in a social network. Computational & Mathematical Organization Theory, 12(1):21–34, 2006.
[7] J. S. Coleman, E. Katz, H. Menzel, et al. Medical Innovation: A Diffusion Study. Bobbs-Merrill, Indianapolis, 1966.
[8] M. del Pozo, C. Manuel, E. González-Arangüena, and G. Owen. Centrality in directed social networks. A game theoretic approach. Social Networks, 33(3):191–200, 2011.
[9] F. Deroïan. Formation of social networks and diffusion of innovations. Research Policy, 31(5):835–846, 2002.
[10] P. Domingos and M. Richardson. Mining the network value of customers. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 57–66. ACM, 2001.
[11] M. Fürer and S. P. Kasiviswanathan. Exact Max 2-Sat: Easier and faster. Springer, 2007.
[12] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to NP-Completeness. 1979.
[13] R. E. Glasgow, T. M. Vogt, and S. M. Boles. Evaluating the public health impact of health promotion interventions: The RE-AIM framework. American Journal of Public Health, 89(9):1322–1327, 1999.
[14] A. Goyal, F. Bonchi, L. V. Lakshmanan, and S. Venkatasubramanian. On minimizing budget and time in influence propagation over social networks. Social Network Analysis and Mining, pages 1–14, 2012.
[15] F. M. Harper, D. Raban, S. Rafaeli, and J. A. Konstan. Predictors of answer quality in online Q&A sites. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 865–874. ACM, 2008.
[16] J. C. Harsanyi. A simplified bargaining model for the n-person cooperative game. International Economic Review, 4(2):194–220, 1963.
[17] E. Katz. The two-step flow of communication: An up-to-date report on an hypothesis. Public Opinion Quarterly, 21(1):61–78, 1957.
[18] M. L. Katz and C. Shapiro. Network externalities, competition, and compatibility. The American Economic Review, pages 424–440, 1985.
[19] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137–146. ACM, 2003.
[20] S. J. Liebowitz and S. E. Margolis. Network externality: An uncommon tragedy. The Journal of Economic Perspectives, pages 133–150, 1994.
[21] Q. Liu, E. Agichtein, G. Dror, E. Gabrilovich, Y. Maarek, D. Pelleg, and I. Szpektor. Predicting web searcher satisfaction with existing community-based answers. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 415–424. ACM, 2011.
[22] M. W. Macy. Chains of cooperation: Threshold effects in collective action. American Sociological Review, pages 730–747, 1991.
[23] R. K. Merton. Selected problems of field work in the planned community. American Sociological Review, pages 304–312, 1947.
[24] C. C. Moallemi and B. Van Roy. Convergence of min-sum message-passing for convex optimization. IEEE Transactions on Information Theory, 56(4):2041–2050, 2010.
[25] R. B. Myerson. Graphs and cooperation in games. Mathematics of Operations Research, 2(3):225–229, 1977.
[26] R. B. Myerson. Conference structures and fair allocation rules. International Journal of Game Theory, 9(3):169–182, 1980.
[27] M. E. Newman. Spread of epidemic disease on networks. Physical Review E, 66(1):016128, 2002.
[28] A. S. Nowak and T. Radzik. The Shapley value for n-person games in generalized characteristic function form. Games and Economic Behavior, 6(1):150–161, 1994.
[29] B.-W. On, E.-P. Lim, J. Jiang, A. Purandare, and L.-N. Teow. Mining interaction behaviors for email reply order prediction. In Advances in Social Networks Analysis and Mining (ASONAM), 2010 International Conference on, pages 306–310. IEEE, 2010.
[30] B.-W. On, E.-P. Lim, J. Jiang, and L.-N. Teow. Engagingness and Responsiveness Behavior Models on the Enron Email Network and Its Application to Email Reply Order Prediction. Springer, 2013.
[31] G. Owen. Values of games with a priori unions. In Mathematical Economics and Game Theory, pages 76–88. Springer, 1977.
[32] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1999.
[33] J. Preece, B. Nonnecke, and D. Andrews. The top five reasons for lurking: Improving community experiences for everyone. Computers in Human Behavior, 20(2):201–223, 2004.
[34] M. Samadi, A. Nikolaev, and R. Nagi. A subjective evidence model for influence maximization in social networks. Omega, 2015.
[35] E. Sanchez and G. Bergantiños. On values for generalized characteristic functions. Operations-Research-Spektrum, 19(3):229–234, 1997.
[36] D. A. Schult and P. Swart. Exploring network structure, dynamics, and function using networkx. In Proceedings of the 7th Python in Science Conference (SciPy 2008), pages 11–16, 2008.
[37] L. S. Shapley. A value for n-person games. Technical report, DTIC Document, 1952.
[38] X. Song, Y. Chi, K. Hino, and B. L. Tseng. Information flow modeling based on diffusion rate for prediction and ranking. In Proceedings of the 16th International Conference on World Wide Web, pages 191–200. ACM, 2007.
[39] M. Stearns, S. Nambiar, A. Nikolaev, A. Semenov, and S. McIntosh. Towards evaluating and enhancing the reach of online health forums for smoking cessation. Network Modeling Analysis in Health Informatics and Bioinformatics, 3(1):1–11, 2014.
[40] X. Tang and C. C. Yang. Identifying influential users in an online healthcare social network. In Intelligence and Security Informatics (ISI), 2010 IEEE International Conference on, pages 43–48. IEEE, 2010.
[41] T. W. Valente, S. Frautschi, R. Lee, C. O'Keefe, L. Schultz, R. Steketee, L. Chitsulo, A. Macheso, Y. Nyasulu, M. Ettling, et al. Network models of the diffusion of innovations. Nursing Times, 90(35):52–53, 1994.
[42] T. van Mierlo. The 1% rule in four digital health social networks: An observational study. Journal of Medical Internet Research, 16(2), 2014.
[43] T. van Mierlo, D. Hyatt, and A. T. Ching. Mapping power law distributions in digital health social networks: Methods, interpretations, and practical implications. Journal of Medical Internet Research, 17(6), 2015.
[44] X. Wang, K. Zhao, and W. Street. Predicting user engagement in online health communities based on social support activities. In Ninth INFORMS Workshop on Data Mining and Analytics, San Francisco, CA, 2014.
[45] W. H. Whyte Jr. The web of word of mouth. Fortune, 50:140–143, 1954.
[46] E. Winter. The Shapley value. Handbook of Game Theory with Economic Applications, 3:2025–2054, 2002.