Operations Research
Publication details, including instructions for authors and subscription information: http://pubsonline.informs.org

Technical Note—Stochastic Sequential Decision-Making with a Random Number of Jobs
Alexander G. Nikolaev, Sheldon H. Jacobson

To cite this article: Alexander G. Nikolaev, Sheldon H. Jacobson (2010) Technical Note—Stochastic Sequential Decision-Making with a Random Number of Jobs. Operations Research 58(4, Part 1 of 2):1023–1027. http://dx.doi.org/10.1287/opre.1090.0778

Copyright © 2010, INFORMS
TECHNICAL NOTE
Stochastic Sequential Decision-Making with a Random Number of Jobs

Alexander G. Nikolaev
Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois 60201, [email protected]

Sheldon H. Jacobson
Department of Computer Science, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, [email protected]

This paper addresses a class of problems in which available resources must be optimally allocated to a random number of jobs with stochastic parameters. Optimal policies are presented for variations of the sequential stochastic assignment problem and the dynamic stochastic knapsack problem in which the number of arriving jobs is unknown until after the final arrival, and the job parameters are assumed to be independent but not identically distributed random variables.

Subject classifications: sequential assignment; dynamic stochastic knapsack; policy.
Area of review: Stochastic Models.
History: Received June 2008; revision received April 2009; accepted August 2009. Published online in Articles in Advance February 5, 2010.

1. Introduction

Sequential resource allocation problems with uncertainty have received much attention in the literature. This paper focuses on two problems in this field: the sequential stochastic assignment problem (SSAP) and the dynamic stochastic knapsack problem (DSKP).

Derman et al. (1972) introduced the SSAP: Given a known finite number of jobs with independent and identically distributed (i.i.d.) reward values that arrive sequentially, one at a time, how should these jobs be assigned to workers with known finite success rates, where the assignment of each job must be determined nonanticipatively at the time the job arrives? Derman et al. (1972) establish an optimal policy that maximizes the total expected reward, where the reward is the sum of products of job values and worker success rates over all assignments. Theoretical extensions to the investigation by Derman et al. (1972) include scenarios in which various continuous distributions of job arrival times are considered (Albright 1974, Sakaguchi 1972, Righter 1987). Other variations and applications of the SSAP have been addressed by Derman et al. (1975), Nakai (1986), Su and Zenios (2005), Nikolaev et al. (2007), and McLay et al. (2009). Kennedy (1986) established the most general result for the SSAP by removing the assumption of independence and proving that threshold policies are optimal for any problem of this type, although the thresholds that define such policies may be random variables and difficult to compute.

The DSKP was first defined by Papastavrou et al. (1996): Given a limited fixed resource capacity, and jobs with i.i.d. weights and reward values that arrive sequentially, one at a time, how should the available resource be allocated by nonanticipatively accepting or rejecting jobs? Papastavrou et al. (1996) analyze the DSKP formulated over a time horizon of a given number of discrete periods, with a fixed constant probability of a job arrival in each period, for different forms of the joint probability distribution function of job weights and values. Kleywegt and Papastavrou (1998, 2001) consider Poisson arrivals in the DSKP. Other variations and applications of the DSKP have been discussed by Prastakos (1983), Lu et al. (1999), and Van Slyke and Young (2000).

This paper uses conditioning arguments and the results of Kennedy (1986) to consider extensions of the SSAP and the DSKP in which the number of jobs is unknown until after the final arrival but follows a given discrete distribution with either finite or infinite support. The arriving jobs are assumed to be independent but not necessarily identically distributed. Note that an optimal policy for a more restricted version of the SSAP was presented by Sakaguchi (1983) in a different form and without a formal proof; this paper formally proves and extends those results. Also, the original finite-horizon formulation of the DSKP (Papastavrou et al. 1996) considers discrete arrivals with a random number of jobs that follows a binomial distribution; this paper generalizes these results to include other discrete distributions.

The paper is organized as follows. Section 2 shows how the results of Kennedy (1986) can be used to address the SSAP extension. Section 3 presents a dynamic programming (DP) algorithm to solve the DSKP extension. Section 4 offers concluding comments.

2. SSAP with a Random Number of Jobs

This section addresses the SSAP with a random number of jobs. First, the case in which the distribution of the number of jobs has finite support is considered. This result is then extended to the infinite-support case.

2.1. Finite Case

The base problem (BP) is formally stated.

Given. M ∈ ℤ₊ workers available to perform N jobs; a fixed success rate p_w associated with worker w = 1, 2, ..., M; a probability mass function (pmf) {P_n} for the number of jobs, whose values are independent and arrive sequentially, one at a time; for each job j = 1, 2, ..., N, a job-value cumulative distribution function (cdf) F_j(x_j).

Objective. Find a policy π* that determines the assignment of jobs to workers, A_{wj} ∈ {0, 1}, w = 1, 2, ..., M, j = 1, 2, ..., N, such that Σ_{j=1}^{N} A_{wj} ≤ 1 for w = 1, 2, ..., M, Σ_{w=1}^{M} A_{wj} ≤ 1 for j = 1, 2, ..., N, and E_{P_n, {F_j}}[ Σ_{w=1}^{M} Σ_{j=1}^{N} p_w A_{wj} X_j ] is maximized.

The main challenge presented by BP is the randomness in the number of arriving jobs. To address this challenge, an auxiliary problem (AP) can be created in which the number of jobs is fixed but the job values are dependent. Using BP, the AP is constructed as follows. Fix the number of workers at N_max, the largest value that N can take on (i.e., Σ_{n=0}^{N_max} P_n = 1). If N_max ≤ M, set p'_i = p_i for i = 1, 2, ..., N_max. If N_max > M, set p'_i = p_i for i = 1, 2, ..., M and p'_{M+1} = p'_{M+2} = ··· = p'_{N_max} = 0. Also, let X'_1 = X_1, and for any j = 2, ..., N_max, set

$$
X'_j = \begin{cases}
X_j & \text{with probability } \sum_{i=j}^{N_{\max}} P_i \Big/ \sum_{i=j-1}^{N_{\max}} P_i, & \text{if } X'_{j-1} > 0,\\
0 & \text{with probability } P_{j-1} \Big/ \sum_{i=j-1}^{N_{\max}} P_i, & \text{if } X'_{j-1} > 0,\\
0 & & \text{if } X'_{j-1} = 0.
\end{cases} \tag{1}
$$

The AP is now formally stated.

Given. N_max workers available to perform N_max jobs; a fixed success rate p'_w associated with worker w = 1, 2, ..., N_max; N_max jobs with values X'_j arriving sequentially, one at a time; for each job j = 1, 2, ..., N_max, the job-value cdf F'_j(x_j) induced by (1).

Objective. Find a policy π'* that determines the assignment of jobs to workers, A'_{wj} ∈ {0, 1}, w = 1, 2, ..., N_max, j = 1, 2, ..., N_max, such that Σ_{j=1}^{N_max} A'_{wj} ≤ 1 for w = 1, 2, ..., N_max, Σ_{w=1}^{N_max} A'_{wj} ≤ 1 for j = 1, 2, ..., N_max, and E_{{F'_j}}[ Σ_{w=1}^{N_max} Σ_{j=1}^{N_max} p'_w A'_{wj} X'_j ] is maximized.

By design, BP and AP are closely related. Because X_j > 0 for any j = 1, 2, ..., N, by (1) the first N jobs in BP and AP have the same values. Also, the values of the subsequent jobs j = N+1, N+2, ..., N_max in AP are equal to zero, and no additional reward can be earned (i.e., the event {job j does not arrive} in BP corresponds to {X'_j = 0} in AP). Theorem 1 establishes that if an optimal policy for AP is available, then an optimal policy for BP can be obtained.

Theorem 1. Let π'* be an optimal policy for AP. Then an optimal policy π* for BP is obtained using two rules: (1) whenever a job arrives and π'* assigns a worker with success rate zero, discard the job; (2) whenever a job arrives and π'* assigns a worker with success rate p > 0, assign a worker with the same success rate.

Proof. See e-companion.
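To make the finite-support construction concrete, the following sketch (a rough illustration of ours, not the authors' code) computes the fixed policy thresholds for the AP under the additional assumption of i.i.d. Uniform[0, 1] job values, using the fixed-breakpoint recursion stated in Theorem 3 below; with P_1 = P_2 = P_3 = P_4 = 1/4 it reproduces the breakpoints of the illustrative example in §2.3. Function names and the dictionary layout are assumptions of this sketch.

```python
from fractions import Fraction as F

def clamp_mean_uniform(lo, hi):
    """E[median(lo, X, hi)] for X ~ Uniform[0, 1], with 0 <= lo <= hi <= 1."""
    return hi - hi * hi / 2 + lo * lo / 2

def ap_breakpoints(P):
    """Fixed breakpoints of the optimal AP policy, assuming i.i.d. Uniform[0,1]
    job values.  P[i] = P(N = i + 1).  Returns b with b[n] = [b_{1,n+1}, b_{2,n+1}, ...],
    the thresholds in force when job n arrives, highest first."""
    n_max = len(P)
    S = [sum(P[i:], F(0)) for i in range(n_max)]          # S[n-1] = P(N >= n)
    b = {n_max: []}               # the last possible job takes the best remaining worker
    for n in range(n_max - 1, 0, -1):
        r = S[n] / S[n - 1]                               # P(N >= n+1) / P(N >= n)
        nxt = b[n + 1]
        cur = []
        for m in range(1, n_max - n + 1):
            hi = nxt[m - 2] if m >= 2 else F(1)           # b_{m-1,n+2}, capped at 1
            lo = nxt[m - 1] if m - 1 < len(nxt) else F(0)  # b_{m,n+2}, floored at 0
            cur.append(r * clamp_mean_uniform(lo, hi))
        b[n] = cur
    return b
```

For the example of §2.3, `ap_breakpoints([F(1, 4)] * 4)` yields 1/4 for job 3, then 17/48 and 7/48 for job 2, and approximately 0.422, 0.2266, and 0.1014 for job 1.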
(An electronic companion to this paper is available as part of the online version at http://or.journal.informs.org/.)

To determine an optimal policy for AP, the result of Kennedy (1986) can be applied. Using the notation introduced for AP, let the job values X'_j, j = 1, 2, ..., N_max, be any (not necessarily i.i.d.) random variables. For any n = 1, 2, ..., N_max and m = 0, 1, ..., N_max, define random variables Z^{N_max}_{m,n} such that

(i) Z^{N_max}_{0,n} ≡ +∞ for 1 ≤ n ≤ N_max;
(ii) Z^{N_max}_{m,n} ≡ −∞ for m > N_max − n + 1;
(iii) Z^{N_max}_{1,N_max} = X'_{N_max};
(iv) for 1 ≤ m ≤ N_max − n + 1 and n ≤ N_max − 1,

$$
Z^{N_{\max}}_{m,n} = \bigl(X'_n \vee E[Z^{N_{\max}}_{m,n+1} \mid \mathcal{F}_n]\bigr) \wedge E[Z^{N_{\max}}_{m-1,n+1} \mid \mathcal{F}_n],
$$

where 𝓕_n, n = 1, 2, ..., N_max − 1, is the sigma-field over all possible realizations of the vector (X'_1, ..., X'_n), ∨ denotes the maximum, and ∧ denotes the minimum.

For any n = 1, 2, ..., N_max and m = 1, ..., N_max, the random variable Z^{N_max}_{m,n} represents the expected value of the job to which the mth most skilled (mth best) worker is expected to be assigned upon the arrival and assignment of job n. At the time when job n with value x_n arrives, the following hold:
• If job n is assigned to the mth best worker, then Z^{N_max}_{m,n} is equal to x_n.
• If job n is assigned to a more skilled worker than the mth best, then the mth best worker becomes the (m−1)th best, and Z^{N_max}_{m,n} is equal to E[Z^{N_max}_{m-1,n+1} | 𝓕_n].
• If job n is assigned to a less skilled worker than the mth best, then the mth best worker remains the mth best, and Z^{N_max}_{m,n} is equal to E[Z^{N_max}_{m,n+1} | 𝓕_n].

Theorem 2 shows that it makes sense to assign job n to a more skilled worker than the mth best only if x_n is greater than E[Z^{N_max}_{m-1,n+1} | 𝓕_n], and to assign job n to a less skilled worker than the mth best only if x_n is less than E[Z^{N_max}_{m,n+1} | 𝓕_n].

Theorem 2 (Kennedy 1986). Whenever job n = 1, 2, ..., N_max − 1 arrives, the line segment (−∞, +∞) ⊂ ℝ is partitioned into N_max − n + 1 random intervals defined by the breakpoints +∞, E[Z^{N_max}_{1,n+1} | 𝓕_n], E[Z^{N_max}_{2,n+1} | 𝓕_n], ..., E[Z^{N_max}_{N_max−n,n+1} | 𝓕_n], −∞. Then, the optimal assignment policy is to assign the nth job to the worker with the mth highest success rate (available at the time of the assignment) if x_n lies in the mth highest of these N_max − n + 1 intervals or, equivalently, if Z^{N_max}_{m,n} = x_n.

Theorem 2 establishes the form of an optimal policy for any problem whose objective function is given as the expectation of a summation of products. However, this result has seen limited use because finding the conditional expectations of the recursively defined random variables Z^{N_max}_{m,n}, 1 ≤ m ≤ N_max − n + 1, n ≤ N_max − 1, is computationally intractable in many cases, especially when the X'_j, j = 1, 2, ..., N_max, are dependent. For any n = 1, 2, ..., N_max − 1, conditioning on the sigma-field 𝓕_n implies that the interval breakpoints depend on the sequence of values of jobs 1 through n, and hence, for any such sequence, the breakpoint values may be different. However, if the nature of the dependency is as defined in (1), then a closed-form optimal assignment policy for AP can be obtained.

Theorem 3. Whenever job n = 1, 2, ..., N_max − 1 arrives in AP, the optimal assignment policy is to assign the nth job to the worker with the mth highest success rate (available at the time the assignment decision has to be made) if x_n lies in the mth highest of the intervals defined by the fixed breakpoints

+∞, E[Z^{N_max}_{1,n+1} | X'_n > 0], E[Z^{N_max}_{2,n+1} | X'_n > 0], ..., E[Z^{N_max}_{N_max−n,n+1} | X'_n > 0], −∞.

Writing b_{m,n+1} = E[Z^{N_max}_{m,n+1} | X'_n > 0] (with b_{0,·} = +∞ and b_{m,·} = −∞ for m > N_max − n − 1), these breakpoints are computed recursively:

$$
b_{m,n+1} = \frac{\sum_{i=n+1}^{N_{\max}} P_i}{\sum_{i=n}^{N_{\max}} P_i}
\Bigl( F_{n+1}(b_{m,n+2})\, b_{m,n+2}
+ \int_{b_{m,n+2}}^{b_{m-1,n+2}} x \, dF_{n+1}(x)
+ \bigl(1 - F_{n+1}(b_{m-1,n+2})\bigr)\, b_{m-1,n+2} \Bigr). \tag{2}
$$

Proof. See e-companion.

The backward recursion (2) begins with the last, N_max-th job, for which the breakpoints are 0 and +∞ (therefore, the job is assigned to the best remaining available worker). Next, the breakpoints for job N_max − 1 are 0, (P_{N_max} / (P_{N_max−1} + P_{N_max})) E[X_{N_max}], and +∞. To compute the breakpoints for all N_max jobs, proceed in the same manner, down to job 1.

2.2. Infinite Case

The results of Theorems 1 and 3 can be extended to the case in which the pmf for the number of jobs in BP has infinite support. In this case, the proof of Theorem 1 is unchanged. The rewards earned in BP and AP by making assignments for any pair of sequences s and s' (see Theorem 1), respectively, remain the same, because every such sequence has only a finite number of jobs. Therefore, solving AP, where the pmf of the number of jobs has infinite support, solves BP. Kennedy (1986) establishes the form of an optimal policy for such problems, as summarized in Theorem 4.

Theorem 4 (Kennedy 1986). Assume that E[sup_n X'_n] < +∞ and lim_{n→+∞} X'_n = 0. Then the infinite sequence {Z^{N_max}_{m,n}}_{N_max=1}^{+∞} converges to a finite limit Z_{m,n} ≡ lim_{N_max→+∞} Z^{N_max}_{m,n}, and Theorem 2 holds with the breakpoints expressed as +∞, E[Z_{1,n+1} | 𝓕_n], E[Z_{2,n+1} | 𝓕_n], ..., −∞.

Theorem 4 establishes that finding an optimal policy for AP (and, using Theorem 1, BP), where the pmf for the number of jobs has infinite support, can be approached by considering a sequence of APs with fixed (bounded) N_max and letting N_max → +∞. Note that the distributions of job values in such APs (see (1)) depend on the pmfs of the number of jobs, and hence it is necessary to define the pmf P^{N_max} of the number of jobs for each AP with N_max = 1, 2, .... To match the setup described in Kennedy (1986), the distribution of the value of job j = 1, 2, ... has to be the same in each of those APs with N_max = 1, 2, .... To satisfy this requirement, set P^{N_max}_i = P_i / Σ_{k=1}^{N_max} P_k for any i = 1, 2, ..., N_max and N_max = 1, 2, ....

2.3. Illustrative Example

Theorems 3 and 4 describe the necessary computations involved in deriving optimal policies for the SSAP with a random number of jobs. An example is provided to illustrate how these computations are performed.

Example. Given: M = 4; P_1 = P_2 = P_3 = P_4 = 1/4; F_j(x) = x, 0 ≤ x ≤ 1, for j = 1, 2, 3, 4.

For this example, N_max = 4. Define b_{m,n+1} = E[Z^{N_max}_{m,n+1} | X'_n > 0] for 1 ≤ m ≤ N_max − n and 1 ≤ n ≤ N_max − 1. By (2),

n = 3: b_{1,4} = (Σ_{i=4} P_i / Σ_{i=3} P_i) ∫₀¹ x dF_4(x) = ((1/4)/(2/4)) · (1/2) = 1/4;

n = 2: b_{2,3} = (Σ_{i=3} P_i / Σ_{i=2} P_i) [ ∫₀^{b_{1,4}} x dF_3(x) + (1 − F_3(b_{1,4})) b_{1,4} ] = (2/3)(1/32 + 3/16) = 7/48,
b_{1,3} = (Σ_{i=3} P_i / Σ_{i=2} P_i) [ F_3(b_{1,4}) b_{1,4} + ∫_{b_{1,4}}^1 x dF_3(x) ] = (2/3)(17/32) = 17/48;

n = 1: b_{3,2} ≈ 0.1014, b_{2,2} ≈ 0.2266, b_{1,2} ≈ 0.4220.

The derived optimal policy can be compared with the policy that would be optimal if the number of jobs were not random. According to an optimal policy for the SSAP (Derman et al. 1972) with N = 4, with job values distributed as in the example, the interval breakpoints that determine the assignments for the third, second, and first arriving jobs (respectively) would be

n = 3: a_{1,4} = 0.5;
n = 2: a_{2,3} = 18/48, a_{1,3} = 30/48;
n = 1: a_{3,2} ≈ 0.3047, a_{2,2} = 0.5, a_{1,2} ≈ 0.6953.

The interval breakpoint values obtained in the example's solution are smaller, which means that workers with higher success rates are used earlier, at all assignment stages, than in the solution to the corresponding instance of the SSAP with a known number of arrivals.

3. DSKP with a Random Number of Jobs

This section analyzes the DSKP with a random number of jobs and presents a dynamic program that leads to the derivation of an optimal assignment policy. The DSKP is formally stated.

Given.
Resource of capacity C available for allocation to N jobs; pmf {P_n} for the number of jobs, with independent weights and values, arriving sequentially, one at a time; for each job j = 1, 2, ..., N, a joint cdf F_j(w, x) for the job weight and value.

Objective. Find a policy π* that determines the assignments A_j ∈ {0, 1}, j = 1, 2, ..., N, such that Σ_{j=1}^{N} A_j W_j ≤ C and E_{P_n, {F_j}}[ Σ_{j=1}^{N} A_j X_j ] is maximized.

For any j = 1, 2, ... and c ∈ [0, C], let V_j(c) denote the optimal accumulated reward from the allocation of resource capacity c to jobs j, j+1, ..., N, and let E[V_j(c)] denote the optimal conditional expected accumulated reward from the allocation of resource capacity c to jobs j, j+1, ..., N, given that job j−1 has arrived. By definition, E[V_1(C)] = E_{P_n, {F_j}}[ Σ_{j=1}^{N} A_j X_j ] under π*. Theorem 5 establishes an assignment policy that guarantees the optimal expected resource allocation.

Theorem 5. Suppose that the remaining resource capacity is c, and job j with weight w_j and value x_j arrives. Then, it is optimal to set

$$
A_j = \begin{cases}
1 & \text{if } x_j + E[V_{j+1}(c - w_j)] \ge E[V_{j+1}(c)] \text{ and } w_j \le c,\\
0 & \text{if } x_j + E[V_{j+1}(c - w_j)] < E[V_{j+1}(c)] \text{ or } w_j > c.
\end{cases} \tag{3}
$$

Note that the quantity x_j + E[V_{j+1}(c − w_j)] depends on the parameters (weight and value) of job j. These parameters are known at the time the assignment decision for job j is to be made. Therefore, each optimal assignment decision described by (3) is determined by E[V_{j+1}(c)] and E[V_{j+1}(c − w_j)]. Theorem 5 follows from the fundamental argument of DP: each assignment must maximize the sum of an immediate reward and the expected future reward. Note that rule (3) is of the same form as in Papastavrou et al. (1996), except that the E[V_j(c)], j = 1, 2, ..., c ∈ [0, C], are conditional expectations. This allows one to include the pmf of N in the DP formulation and hence determine the optimal allocation policy for the case with a random number of jobs.

The expected values E[V_j(c)], j = 1, 2, ..., c ∈ [0, C], can be computed using a DP recursion.
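Rule (3) is straightforward to apply once the conditional expectations E[V_{j+1}(·)] are available. A minimal sketch of ours (the function name and the stand-in for E[V_{j+1}] are assumptions, not the authors' code):

```python
def accept_job(x_j, w_j, c, ev_next):
    """Decision rule (3): accept job j iff it fits (w_j <= c) and its value plus
    the expected future reward at the reduced capacity is at least the expected
    future reward from keeping capacity c intact.  ev_next maps a remaining
    capacity to E[V_{j+1}(capacity)]."""
    return w_j <= c and x_j + ev_next(c - w_j) >= ev_next(c)
```

For instance, with a hypothetical linear continuation value ev_next(c) = 0.5c, a job of weight 1 is accepted exactly when its value is at least 0.5.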
However, the recursion and its boundary conditions depend on the number of arriving jobs, which is random. First, the case where the pmf of N has finite support is considered. Then the result is extended to the case where the pmf of N has infinite support.

3.1. Finite Case

Theorem 6. The optimal expected accumulated reward E[V_1(C)] can be computed using the recursion

$$
E[V_j(c)] = \frac{\sum_{i=j}^{N_{\max}} P_i}{\sum_{i=j-1}^{N_{\max}} P_i}
\Bigl( P\bigl(W_j \le c,\; R_j + E[V_{j+1}(c - W_j)] \ge E[V_{j+1}(c)]\bigr)
\, E\bigl[ R_j + E[V_{j+1}(c - W_j)] \;\big|\; W_j \le c,\; R_j + E[V_{j+1}(c - W_j)] \ge E[V_{j+1}(c)] \bigr] \\
+ P\bigl(W_j \le c,\; R_j + E[V_{j+1}(c - W_j)] < E[V_{j+1}(c)]\bigr)\, E[V_{j+1}(c)]
+ P(W_j > c)\, E[V_{j+1}(c)] \Bigr) \tag{4}
$$

with boundary conditions E[V_j(c)] = 0 for any c and j > N_max, where W_j and R_j denote the weight and value of job j.

Proof. See e-companion.

3.2. Infinite Case

The result of Theorem 5 can be extended to the case where the pmf for the number of jobs in the DSKP has infinite support. For any j = 1, 2, ..., c ∈ [0, C], and N_max = 1, 2, ..., let E[V_j^{N_max}(c)] denote the optimal conditional expected accumulated reward from the allocation of resource capacity c to jobs j, j+1, ..., N_max, given that job j−1 has arrived, in the DSKP with the pmf of the number of jobs P^{N_max}_i = P_i / Σ_{k=1}^{N_max} P_k for any i = 1, 2, ..., N_max.

Theorem 7. Assume that B ≡ E[sup_j X_j / W_j] < +∞ and P(N < +∞) = 1. Then for any j = 1, 2, ... and c ∈ [0, C], the infinite sequence {E[V_j^{N_max}(c)]}_{N_max=1}^{+∞} converges to the finite limit E[V_j(c)] ≡ lim_{N_max→+∞} E[V_j^{N_max}(c)], and Theorem 5 establishes an optimal policy for the DSKP, where the pmf of the number of jobs has infinite support, with E[V_j(c)] given by these limits, j = 1, 2, ..., c ∈ [0, C].

Proof. See e-companion.

Theorem 7 establishes that an optimal policy for the DSKP, where the pmf of the number of jobs has infinite support, can be obtained by sequentially solving a sequence of DSKPs with finite support. First, consider only two jobs, then three, and so on. Then evaluate the limits lim_{N_max→+∞} E[V_j^{N_max}(c)], j = 1, 2, ..., c ∈ [0, C]. Finally, apply Theorem 5 to establish an optimal resource allocation policy.
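For integer capacities and finite discrete per-job (weight, value) distributions, recursion (4) can be sketched as follows. This is an illustration under our own simplifying assumptions, not the authors' implementation; the function name and data layout (a list of (weight, value, probability) triples per job) are assumptions of this sketch.

```python
def dskp_expected_values(P, jobs, C):
    """Backward recursion (4).  P[j] = P(N = j) for j = 0..n_max; jobs[j-1] is
    the discrete (weight, value, prob) distribution of job j; capacities are
    integers 0..C.  Returns EV with EV[j][c] = conditional expected optimal
    reward from jobs j, j+1, ..., given that job j-1 has arrived, and the
    boundary EV[j][c] = 0 for j > n_max."""
    n_max = len(P) - 1
    S = [sum(P[j:]) for j in range(n_max + 1)]        # S[j] = P(N >= j)
    EV = {j: [0.0] * (C + 1) for j in range(1, n_max + 2)}
    for j in range(n_max, 0, -1):
        r = S[j] / S[j - 1]                           # P(N >= j) / P(N >= j - 1)
        for c in range(C + 1):
            acc = 0.0
            for w, x, p in jobs[j - 1]:
                if w <= c:                            # job fits: take the better option
                    acc += p * max(x + EV[j + 1][c - w], EV[j + 1][c])
                else:                                 # job does not fit: reject
                    acc += p * EV[j + 1][c]
            EV[j][c] = r * acc
    return EV
```

As a hand-checkable instance: with P(N=1) = P(N=2) = 1/2, capacity C = 1, and each job of weight 1 and value 1 or 2 with probability 1/2 each, the recursion gives E[V_2(1)] = 0.75 (job 2 arrives with probability 1/2) and E[V_1(1)] = 1.5.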
4. Conclusion

This paper analyzes the SSAP and the DSKP under the assumption that the number of arriving jobs is random and follows a given discrete distribution. Optimal assignment policies are provided with proofs. Conditioning arguments are key to the solutions of both problems. Note that the complexity of the proposed algorithms is the same as the complexity of the original algorithms introduced by Derman et al. (1972) and Papastavrou et al. (1996). Note also that the DSKP, where the pmf of the number of jobs has infinite support, can be solved by alternative methods, such as a total-reward Markov decision process; further research is required to assess and compare the performance of these methods. Other challenges include discrete sequential assignment problems in which job values depend on each other and/or on the workers to whom the jobs are assigned. Also, the proposed models assume that the sequences in which the jobs, with their respective value cdfs, arrive are fixed and known. Identifying optimal resource allocation policies for the cases in which such sequences could be random is another hard yet important problem.

5. Electronic Companion

An electronic companion to this paper is available as part of the online version that can be found at http://or.journal.informs.org/.

Acknowledgments

This research has been supported by the U.S. Air Force Office of Scientific Research under grant FA9550-07-1-0232 and the National Science Foundation under grant CMMI-0900226. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the United States Government, the U.S. Air Force Office of Scientific Research, or the National Science Foundation.

References

Albright, S. C. 1974. Optimal sequential assignments with random arrival times.
Management Sci. 21(1) 60–67.
Derman, C., G. J. Lieberman, S. M. Ross. 1972. A sequential stochastic assignment problem. Management Sci. 18(7) 349–355.
Derman, C., G. J. Lieberman, S. M. Ross. 1975. A stochastic sequential allocation model. Oper. Res. 23(6) 1120–1130.
Kennedy, D. P. 1986. Optimal sequential assignment. Math. Oper. Res. 11(4) 619–626.
Kleywegt, A. J., J. D. Papastavrou. 1998. The dynamic and stochastic knapsack problem. Oper. Res. 46(1) 17–35.
Kleywegt, A. J., J. D. Papastavrou. 2001. The dynamic and stochastic knapsack problem with random sized items. Oper. Res. 49(1) 26–41.
Lu, L. L., S. Y. Chiu, L. A. Cox Jr. 1999. Optimal project selection: Stochastic knapsack with finite time horizon. J. Oper. Res. Soc. 50(6) 645–650.
McLay, L. A., S. H. Jacobson, A. G. Nikolaev. 2009. A sequential stochastic passenger screening problem for aviation security. IIE Trans. 41(6) 575–591.
Nakai, T. 1986. A sequential stochastic assignment problem in a partially observable Markov chain. Math. Oper. Res. 11(2) 230–240.
Nikolaev, A. G., S. H. Jacobson, L. A. McLay. 2007. A sequential stochastic security system design problem for aviation security. Transportation Sci. 41(2) 182–194.
Papastavrou, J. D., S. Rajagopalan, A. J. Kleywegt. 1996. The dynamic and stochastic knapsack problem with deadlines. Management Sci. 42(12) 1706–1718.
Prastakos, G. P. 1983. Optimal sequential investment decisions under conditions of uncertainty. Management Sci. 29(1) 118–134.
Righter, R. L. 1987. The stochastic sequential assignment problem with random deadlines. Probab. Engrg. Inform. Sci. 1(2) 189–202.
Sakaguchi, M. 1972. A sequential assignment problem for randomly arriving jobs. Rep. Statist. Appl. Res. 19 99–109.
Sakaguchi, M. 1983. A sequential stochastic assignment problem with an unknown number of jobs. Mathematica Japonica 29(2) 141–152.
Su, X., S. A. Zenios. 2005. Patient choice in kidney allocation: A sequential stochastic assignment model. Oper. Res. 53(3) 443–455.
Van Slyke, R., Y. Young. 2000. Finite horizon stochastic knapsacks with applications to yield management. Oper. Res. 48(1) 155–172.

Transportation Research Part A 44 (2010) 182–193

Evaluating the impact of legislation prohibiting hand-held cell phone use while driving

Alexander G. Nikolaev (a), Matthew J. Robbins (b), Sheldon H. Jacobson (c,*)
(a) Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, United States
(b) Department of Industrial and Enterprise Systems Engineering, University of Illinois, Urbana, IL, United States
(c) Department of Computer Science, University of Illinois, Urbana, IL, United States

Article history: Received 9 April 2009; received in revised form 21 December 2009; accepted 16 January 2010.
Keywords: Cell phone; Driving safety; Legislation; Automobile accident.

Abstract. As of November 2008, the number of cell phone subscribers in the US exceeded 267 million, nearly three times the 97 million subscribers in June 2000. This rapid growth in cell phone use has led to concerns regarding its impact on driver performance and road safety. Numerous legislative efforts are under way to restrict hand-held cell phone use while driving. Since 1999, every state has considered such legislation, but few have passed primary enforcement laws. As of 2008, six states, the District of Columbia (DC), and the Virgin Islands have laws banning the use of hand-held cell phones while driving. A review of the literature suggests that in laboratory settings, hand-held cell phone use impairs driver performance by increasing tension, delaying reaction time, and decreasing awareness. However, there exists insufficient evidence to prove that hand-held cell phone use increases automobile-accident-risk.
In contrast to other research in this area, which uses questionnaires, tests, and simulators, this study analyzes the impact of hand-held cell phone use on driving safety based on historical automobile-accident-risk-related data and statistics, which would be of interest to transportation policy-makers. To this end, a pre-law and post-law comparison of automobile accident rate measures provides one way to assess the effect of hand-held cell phone bans on driving safety; this paper provides such an analysis using public domain data sources. A discussion of what additional data are required to build convincing arguments in support of or against such legislation is also provided.

© 2010 Elsevier Ltd. All rights reserved.

* Corresponding author. E-mail address: [email protected] (S.H. Jacobson). doi:10.1016/j.tra.2010.01.006

1. Introduction

As of 2008, the Cellular Telecommunications and Internet Association (CTIA) reported that the number of cell phone subscribers in the US exceeded 267 million. The latest data available from the National Highway Traffic Safety Administration (NHTSA) estimated that in 2007, about 11% of the population used a phone while driving at some point during the day, as reported in USA Today (O'Donnell, 2009). Earlier studies revealed that approximately one-half of interviewed drivers reported using cell phones while driving, either to make outgoing calls or to take incoming calls, spending an average of 4.5 min per call (Royal, 2003). Hand-held cell phones are believed to be an important factor in driver distraction (Williams, 2007). Driver distraction is thought to be the cause of nearly 80% of automobile accidents and 65% of near-accidents (Klauer et al., 2006), resulting in approximately 2,600 deaths, 330,000 moderate to critical injuries, and 1.5 million instances of property damage annually in the US (Cohen and Graham, 2003). Nonetheless, car cell phones have been marketed for nearly half a century and continue to be viewed by many as a high-profile product, as evidenced by a recent article in the New York Times (Richtel, 2009). Indeed, these facts are drawing a significant amount of public attention to the issue of hand-held cell phone use while driving.

Hand-held cell phone use while driving imposes no fewer than three tasks upon drivers: locating/glancing at the phone, which draws the eyes away from the road; reaching for the phone and dialing, which impairs control of the vehicle; and conversing via the phone, which distracts attention from driving (Klauer et al., 2006). Dialing a hand-held cell phone is a particularly dangerous task that forces a driver to take their eyes off the road and thereby increases the risk of accidents and near-accidents. The CTIA safe driving tips include never dialing a telephone or taking notes while driving (CTIA, 2008a). Cell phone use while driving has been considered and studied as a primary factor in automobile accidents, due to the high frequency of this activity (NHTSA, 1997). Numerous investigations have been undertaken to determine whether hand-held cell phone use impairs driver performance. Such efforts are typically based on simulators, tests, questionnaires, telephone surveys, and observations. Redelmeier and Tibshirani (1997) associate hand-held cell phone use with automobile accidents by analyzing questionnaire responses of 699 drivers as well as phone and police records. They suggest that the resulting automobile-accident-risk is equivalent to the impairment resulting from legal intoxication. Caird et al. (2008) and Horrey and Wickens (2006) show that the costs associated with cell phone use while driving are seen in reaction time tasks, with smaller costs in performance on lane keeping and tracking tasks.
Strayer and Drews (2004) report that hand-held cell phone use while driving increases braking times by 18%, increases following distances by 12%, and increases the time for speed resumption after braking by 17%. The NHTSA used a driving simulator to investigate the effects of hand-held cell phone use while performing four tasks: car following, lead-vehicle braking, lead-vehicle cut-in, and merging. They observed that hand-held cell phone use while driving impairs driver performance, slows the response to lead-vehicle speed changes during car following, and degrades automobile control (Ranney, 2005). The growing use of cell phones and the associated research on how they impact driver performance have led many, including some state legislators, to question their safety while driving. Royal (2003) reports that 71% of drivers support restrictions on hand-held cell phones and 57% approve a ban on hand-held cell phone use while driving, although most drivers who do use cell phones oppose such outright bans or traffic fines on hands-free cell phones. Acknowledging a potential negative impact of hand-held cell phone use while driving, a number of legislative initiatives have passed that ban hand-held cell phone use while driving. In fact, since 1999, every state has considered such legislation (Sundeen, 2004). In 2001, New York became the first state to enact such a law. Since that time, similar bans have taken effect in New Jersey, DC, Connecticut, Utah, California, Washington, and the Virgin Islands, all of them primary enforcement laws (except in Utah, where the law is primary only with regard to text messaging); primary enforcement allows law enforcement officers to ticket drivers for using a hand-held cell phone while driving without any other traffic violation (Governors Highway Safety Association, 2008).
A number of states (e.g., Illinois) restrict hand-held cell phone use by requiring that sound travel unimpaired to at least one ear or that drivers keep at least one hand on the steering wheel at all times (Sundeen, 2001). In addition to state statutes, local ordinances have been passed that prohibit hand-held cell phone use while driving in certain counties, cities, towns, and municipalities. For example, Chicago, Illinois, implemented such a policy in 2005. A total of 28 jurisdictions enforce such local ordinances in Florida, Illinois, Massachusetts, Michigan, New Jersey, New Mexico, New York, Ohio, Pennsylvania, and Utah (Cellular News, 2008). However, no state or local ordinance completely bans all types of cell phones (hand-held and hands-free) while driving, though many prohibit cell phone use by certain segments of the population (Glassbrenner and Ye, 2007). For example, California enforces an all-type cell phone ban for school bus drivers and drivers under 18 years of age (AAA Auto Insurance, 2008). While proponents believe that laws banning hand-held cell phone use while driving may reduce driver distraction and improve driver performance, opponents of such laws believe that it is premature to act. Although research suggests that multi-tasking impairs driver performance, there is still insufficient evidence to definitively prove that hand-held cell phone use increases automobile-accident-risk (McCartt et al., 2006; Williams, 2007; Olson, 2003). Note that in this domain, definitive proofs are practically impossible to obtain, given the inability of researchers to conduct controlled experiments in which the dependent variables are accidents, property damage, personal injuries, and even death.
A study on distracted driving, released by the NHTSA and the Virginia Tech Transportation Institute (Dingus et al., 2006; Klauer et al., 2006), suggests that drivers talking or listening to a wireless device are no more likely to be involved in an accident or near-accident than those not involved in such activities. Of course, the safety and highway travel benefits provided by cell phones, especially for public health and safety considerations, cannot be overlooked (Lissy et al., 2000). For example, cell phones can reduce emergency response time to automobile accidents (Savage et al., 2000). Moreover, given that legislation narrowly aimed at cell phone use does not adequately address the larger issue of driver distraction, the CTIA believes that education is a more effective approach to enhance drivers' awareness and responsibility (CTIA, 2008b). A number of safety and elected officials agree with this sentiment, including the Chairman of the Governors Highway Safety Administration (CTIA, 2008b). In support of this position, in 2008, the CTIA, along with Sprint Nextel, Cingular Wireless, Dobson Cellular Systems, and other wireless companies, developed programs and sponsored public service announcement campaigns designed to educate drivers on distraction while operating vehicles. In addition to education, the cell phone industry has focused on enhancing driving safety beyond the issue of hands-free operation, by eliminating in-hand manipulation and reducing distractions while driving (Goodman et al., 1997). Recent research and technological advances in this area are providing innovative solutions to the problem of distracted drivers, such as hands-free car kits and the "Polite Phone" prototype, which uses Bluetooth technology to provide a voice-command interface between a car and a cell phone and enable hands-free voice dialing, answering, and hanging up (Auto News, 2006; Funponsel Network, 2005).
However, early reports failed to observe a significant risk reduction due to the use of this new technology (Strayer et al., 2003; McEvoy et al., 2005). An important question to ask is: are bans on hand-held cell phone use while driving effective for reducing automobile-accident-risk, and do such laws make the roads safer? Although a significant amount of research has investigated the effect of hand-held cell phone use on automobile-accident-risk, there are no definitive conclusions on the issue. This paper focuses specifically on traffic safety both before and after hand-held cell phone bans, to explore whether such laws have any meaningful effect. Note that the issue of compliance is very important for such a study. In this paper, it is assumed, just as lawmakers assume, that the bans do make many drivers refrain from using hand-held cell phones while driving. The main contribution of this paper is to provide statistical measures in support of or against laws banning hand-held cell phone use while driving, based on their historical (statistical) impact on road safety, and to suggest what additional data are necessary to establish such connections. The paper is organized as follows: Section 2 describes the available data and the statistical methods that can be used to conduct comparative studies of automobile accident rates in selected territorial units between pre-law and post-law time periods. Section 3 presents the observed results on the effects of law enforcement on improving driving safety. Section 4 summarizes the findings, discusses the limitations of the presented analysis, and points out possible directions for further research on this issue.

2. Methods

This section describes the data and the tools that can be used to compare automobile accident rates in selected territorial units for the time periods before and after hand-held cell phone ban laws were enacted. There is a dearth of systematically collected data on automobile accident rates in the United States that can be used to study the consequences of hand-held cell phone ban laws. Most territorial units have passed such laws only recently, and hence cannot be used as reliable testbeds for drawing any significant, long-lasting conclusions. In other cases, the ban laws have been passed individually by only a limited number of minor territorial units (e.g., isolated single counties), which makes it difficult to put the observed corresponding accident rate changes in a meaningful perspective. This paper seeks to conduct a statistically sound, comprehensive analysis of pre-law and post-law periods, and focuses on the data for New York State, where a state-wide ban on hand-held cell phone use while driving began in November of 2001 (the first in the US) and has been in effect for over 8 years. For these reasons, New York data represent the only reliable source for evaluating the effect of hand-held cell phone ban laws in the United States. Due to a change in the definition of property damage automobile accidents in New York State regulations in 1997 and again in 2003, the number of property damage automobile accidents, and hence the total number of automobile accidents, cannot be used as a measure for evaluating the effectiveness of the ban. Therefore, for all 62 counties in New York State, the measures of traffic safety adopted in this study are the number of fatal automobile accidents per 100,000 licensed drivers per year and the number of personal injury accidents per 1000 licensed drivers per year.
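The two measures can be computed directly from raw counts. The sketch below uses hypothetical figures for a single county and year, for illustration only, not actual New York State data:

```python
# Hypothetical one-year counts for a single county (illustrative values only).
fatal_accidents = 45
personal_injury_accidents = 5200
licensed_drivers = 410_000

# The two traffic-safety measures adopted in the study:
# fatal accidents per 100,000 licensed drivers per year, and
# personal injury accidents per 1000 licensed drivers per year.
fatal_rate = fatal_accidents / licensed_drivers * 100_000
injury_rate = personal_injury_accidents / licensed_drivers * 1_000
```

Computing the measures per licensed driver, rather than using raw counts, makes counties of very different sizes directly comparable.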
To allow for a proper comparison between time periods, 1997–2001 is treated as the pre-law time period and 2002–2007 is treated as the post-law time period. Note that these two accident rate measures are positively correlated, yet differ by the severity of the tallied accidents' consequences. Note also that some counties passed local ban laws prior to the enactment of the state-wide law. If anything, this consideration makes the results even stronger for any such county where the accident rates are found to have dropped. The main portion of the analysis is conducted by testing the hypothesis that the New York state-wide hand-held cell phone ban had no impact on the described measures. A one-tailed t-test is applied to determine whether the expected values of these measures show a statistically significant decrease after the law was enacted. First, to verify that the data used are normally distributed, the Shapiro–Wilk test is conducted. Second, in order to determine the proper statistical test to apply, it must be established whether the variances of the compared populations (the data collected over the two time periods) are the same for each of the two measures. To assess this, a two-sided F-test is used. Third, for those localities where the null hypothesis of equal variances is not rejected at a 5% significance level, a one-sided t-test for samples with equal variances is used to determine whether the measures described above have the same means in the two time periods versus having larger means before hand-held cell phone ban laws were enacted. For those localities where the null hypothesis of equal variances is rejected at a 5% significance level, a one-sided t-test for samples with unequal variances is used instead.

3. Results

This section reports the results of the comparisons of two automobile accident measures in all New York State counties for the time periods before and after hand-held cell phone ban laws were enacted.
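The three-step testing procedure described in Section 2 can be sketched as follows. This is a minimal illustration using SciPy with hypothetical accident-rate samples; the function name and return format are assumptions for the sketch, not from the study:

```python
import numpy as np
from scipy import stats

def compare_periods(pre, post, alpha=0.05):
    """Three-step comparison of pre-law vs. post-law accident-rate samples:
    1) Shapiro-Wilk normality check on each sample,
    2) two-sided F-test for equality of variances,
    3) one-sided two-sample t-test (pooled if equal variances were not
       rejected at level alpha, Welch otherwise), H1: mean(pre) > mean(post).
    """
    pre, post = np.asarray(pre, float), np.asarray(post, float)

    # Step 1: normality (p-values reported alongside the main test).
    sw_pre, sw_post = stats.shapiro(pre)[1], stats.shapiro(post)[1]

    # Step 2: two-sided F-test on the ratio of sample variances.
    f = np.var(pre, ddof=1) / np.var(post, ddof=1)
    d1, d2 = len(pre) - 1, len(post) - 1
    tail = stats.f.sf(f, d1, d2) if f > 1 else stats.f.cdf(f, d1, d2)
    p_f = float(min(1.0, 2 * tail))
    pooled = bool(p_f >= alpha)  # equal variances not rejected -> pooled test

    # Step 3: one-sided t-test; halve the two-sided p-value when t > 0.
    t, p_two = stats.ttest_ind(pre, post, equal_var=pooled)
    p_one = p_two / 2 if t > 0 else 1 - p_two / 2
    return {"pooled": pooled, "t": float(t), "p": float(p_one),
            "shapiro_p": (sw_pre, sw_post), "f_test_p": p_f}
```

For a given county, `pre` would hold the five annual rates for 1997–2001 and `post` the six annual rates for 2002–2007; a `p` below 0.05 would reject the no-effect hypothesis for that county.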
The automobile accident rates data as well as the number of licensed drivers by county are all published by the New York State Department of Motor Vehicles (2008a–c). The relevant data for each individual county of New York State are summarized in Tables 1 and 2. In particular, the two measures of interest, fatal accidents per 100,000 licensed drivers and personal injury accidents per 1000 licensed drivers, are reported for years 1997 through 2007. The counties are arranged in decreasing order by licensed driver density, computed as the number of licensed drivers per square mile (averaged over the 11 years comprising the pre-law and post-law periods). The last columns in Tables 1 and 2 give the p-values for the hypothesis test of equal variances in the pre-law and post-law accident rate measures, for each county.

Table 1. Fatal accidents per 100,000 licensed drivers for New York counties. [Full table not reproduced: for each of the 62 counties, ordered by decreasing licensed driver density, the table lists the annual rates for 1997–2007 and the p-value of the equal-variance test.]

Tables 3 and 4 present the results of the t-tests for each individual county, reporting the test type, the standardized t-statistic values, and the p-values. A drop in the number of fatal accidents per 100,000 licensed drivers per year has been observed from the selected pre-law period to the post-law period in 46 counties. A drop in the number of personal injury

Table 2. Personal injury accidents per 1000 licensed drivers for New York counties.
[Full table not reproduced: for each of the 62 counties, the table lists the annual personal injury accident rates for 1997–2007 and the p-value of the equal-variance test.]

accidents per 1000 licensed drivers per year has been observed in all 62 counties. According to Table 3, which looks at the number of fatal automobile accidents per year per 100,000 licensed drivers, a total of 10 out of 62 counties have p-values lower than 0.05 in the t-tests, providing sufficient evidence for the rejection of the "no effect" hypotheses at the 5% level of significance. According to Table 4, which looks at the number of personal injury automobile accidents per year per 1000 licensed drivers, a total of 46 out of 62 counties have p-values lower than 0.05 in the t-tests. Fig. 1 presents the personal injury accident rate standardized t-statistic values for the hypothesis tests for all counties, plotted against licensed driver density.

Table 3. Post-law and pre-law comparison – fatal injury accidents per 100,000 licensed drivers. [Full table not reproduced: for each county, the table lists the test type (pooled or not pooled), the standardized t-statistic, and the p-value.]

Table 4. Post-law and pre-law comparison – personal injury accidents per 1000 licensed drivers.
[Full table not reproduced: for each county, the table lists the test type (pooled or not pooled), the standardized t-statistic, and the p-value.]

A condensed version of further results is given in Tables 5 and 6, where a summary of the t-test results is presented for three different cases of pooled groups of counties. In the first case, the measures of all the counties in New York are pooled in order to obtain a statewide result. In the second case, the measures of the counties are pooled according to geopolitical designation in order to examine results for New York City and upstate New York. In the third case, the measures of the counties are pooled according to licensed driver density values. In particular, a k-means clustering algorithm (Seber, 1984) is used to form 10 groups of counties with similar licensed driver density values. The algorithm selects group membership in order to minimize the total intra-group Euclidean distance between a county's density value and its group's mean density value. Each table reports the difference in its respective measure from the selected pre-law period to the post-law period, the pooled sample standard deviation (when appropriate), the number of data points in the samples, the values of the test statistic (T is distributed as Student's t with n_pre + n_post − 2 degrees of freedom), and the p-values. For most of the pooled groups, the hypothesis of equal variances of accident rate measures between the pre-law and post-law periods was not rejected. Those groups that rejected the hypothesis of equal variances have a dash in the Sp column.

Fig. 1. Personal injury accident t-test statistic by county, pre-law to post-law. [Figure not reproduced.]

Table 5. Post-law and pre-law comparison – fatal injury accidents per 100,000 licensed drivers. [The mean-difference and Sp columns are garbled in the source and omitted.]

Group                        (n_pre, n_post)    T        p-value
NY state (1–62)              (310, 372)         2.1362   0.0165
NY city (1–5)                (25, 30)           3.3787   0.0008
Upstate (6–62)               (285, 342)         1.8869   0.0298
NY county (1)                (5, 6)             2.4282   0.0190
Kings (2)                    (5, 6)             4.6581   0.0006
Bronx–Queens (3–4)           (10, 12)           2.8979   0.0044
Richmond (5)                 (5, 6)             0.5757   0.2895
Nassau (6)                   (5, 6)             1.0108   0.1693
Westchester–Suffolk (7–9)    (15, 18)           0.5182   0.3040
Monroe–Schenectady (10–12)   (15, 18)           1.4086   0.0845
Onondaga–Dutchess (13–18)    (30, 36)           1.1473   0.1278
Broome–Wayne (19–27)         (45, 54)           0.4273   0.3351
Seneca–Hamilton (28–62)      (175, 210)         1.7512   0.0404

Table 6. Post-law and pre-law comparison – personal injury accidents per 1000 licensed drivers. [Same layout as Table 5; mean-difference and Sp columns omitted.]

Group                        (n_pre, n_post)    T        p-value
NY state (1–62)              (310, 372)         6.1656   0.0000
NY city (1–5)                (25, 30)           4.4459   0.0000
Upstate (6–62)               (285, 342)         8.7473   0.0000
NY county (1)                (5, 6)             6.4023   0.0001
Kings (2)                    (5, 6)             5.4407   0.0002
Bronx–Queens (3–4)           (10, 12)           3.6979   0.0007
Richmond (5)                 (5, 6)             4.2763   0.0010
Nassau (6)                   (5, 6)             3.6932   0.0052
Westchester–Suffolk (7–9)    (15, 18)           5.1424   0.0000
Monroe–Schenectady (10–12)   (15, 18)           2.3196   0.0136
Onondaga–Dutchess (13–18)    (30, 36)           4.5398   0.0000
Broome–Wayne (19–27)         (45, 54)           5.4646   0.0000
Seneca–Hamilton (28–62)      (175, 210)         7.3070   0.0000

4. Discussion

As the number of drivers that use cell phones while driving grows, the interest in linking hand-held cell phone use while driving and road safety increases.
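The density-based grouping used to form the pooled comparisons in Tables 5 and 6 is a one-dimensional k-means clustering, as described in Section 3. A minimal sketch of Lloyd's algorithm in NumPy follows; the data in the usage note are hypothetical, not the actual county densities:

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100, seed=0):
    """Lloyd's algorithm for one-dimensional data: alternately assign each
    value to its nearest centroid and move each centroid to the mean of its
    group, reducing the total intra-group distance to the group mean."""
    x = np.asarray(values, float)
    rng = np.random.default_rng(seed)
    centroids = rng.choice(np.unique(x), size=k, replace=False)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for each value.
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        # Update step: move each centroid to its group mean (keep empty
        # clusters where they are).
        new = np.array([x[labels == j].mean() if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids
```

Applied to the 62 licensed-driver density values with k = 10, this kind of procedure yields groups of counties with similar densities, whose pooled measures can then be compared as in Tables 5 and 6.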
As more technologies, including cameras, music, text messaging, and internet browsing, become available on mobile devices, they may become an even greater source of driver distraction. As of 2009, more than 250 bills prohibiting or restricting cell phone use while driving are pending in 42 state legislatures, despite disagreement over the risks cell phones pose and the effectiveness of enforcement (O'Donnell, 2009). This paper conducts a comparative analysis of two automobile accident rate measures in the counties of New York State for the periods before and after the state-wide hand-held cell phone ban law was enacted. Section 4.1 summarizes the findings, Section 4.2 discusses the limitations of the presented analysis of the effects of law enforcement on improving driving safety, and Section 4.3 points out possible directions for further research on this subject.

4.1. Summary

The results presented in Section 3 indicate that after banning hand-held cell phone use while driving, 46 counties in New York experienced lower fatal automobile accident rates, 10 of which did so at a statistically significant level, and all 62 counties experienced lower personal injury automobile accident rates, 46 of which did so at a statistically significant level. The analysis strongly suggests that the mean fatal accident rate measure decreased at a significant level for New York State (p-value of 0.0165, see Table 5), for New York City and upstate New York (p-values of 0.0008 and 0.0298, respectively, see Table 5), and for four of the 10 groups partitioned by similar licensed driver density. Three of these four groups contained high density New York City counties (New York County, Bronx, and Queens with p-values of 0.0190, 0.0006, and 0.0044, respectively, see Table 5). The fourth group contained the lowest density subset of upstate New York counties (Seneca–Hamilton, with a p-value of 0.0404, see Table 5).
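The pooled two-sample t-test behind the p-values cited above can be sketched in a few lines. The rates below are made-up illustration values, not the paper's county data; this is a minimal sketch of the standard pooled-variance test, not the authors' code.

```python
import math

def pooled_t(pre, post):
    """Two-sample t-test with pooled variance.

    Returns (T, degrees_of_freedom, pooled_sd); under the null hypothesis
    of equal means, T follows a t distribution with n_pre + n_post - 2
    degrees of freedom, as in Tables 5 and 6.
    """
    n1, n2 = len(pre), len(post)
    m1, m2 = sum(pre) / n1, sum(post) / n2
    v1 = sum((x - m1) ** 2 for x in pre) / (n1 - 1)   # sample variances
    v2 = sum((x - m2) ** 2 for x in post) / (n2 - 1)
    sp = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2, sp

# Hypothetical pre-law and post-law accident rates for one pooled group:
pre = [2.1, 1.8, 2.4, 2.0, 2.2]
post = [1.6, 1.5, 1.9, 1.4, 1.7]
t, df, sp = pooled_t(pre, post)
```

A positive T indicates a drop in the mean rate from the pre-law to the post-law period; the p-value would then be read from the t distribution with `df` degrees of freedom.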
The mean personal injury accident rate measure decreased at a significant level for all groups in each of the three cases examined (see Table 6). Moreover, it has been observed that, in general, the personal injury accident rate decrease is more substantial in counties with a high density of licensed drivers (see Fig. 1). Overall, the personal injury accident rate proved to be a more appropriate measure than the fatal accident rate for the analysis.

4.2. Limitations

Several issues limit the statistical validity of the presented analysis and hamper one's ability to definitively establish the effect of laws banning hand-held cell phone use while driving on automobile accident rates using publicly available, historical data as the basis for analysis. First, one should take care not to project the results of this analysis, based only on New York data, to the national level, given that each state, county, city, and town has its own unique highway and roadway transportation network whose particular design must be considered. Second, hand-held cell phone ban legislation may not be the only factor affecting automobile accident rates. This observation makes it difficult to judge whether the changes in automobile accident rates in counties with hand-held cell phone bans are primarily attributable to the bans, or to other factors, including but not limited to road construction, safety education, introduction of new automobile safety features, and/or changes in alcohol and illegal substance control policies. Considerations of such confounding factors should ideally be included in the analysis, but unfortunately, the relevant data are unavailable, both because they are absent from public records and because of the proprietary concerns of the companies that use such data for their business interests.
Also, the impact of traffic safety improvement initiatives, such as the "Safe Streets NYC" program in New York City (Bloomberg and Sadik-Khan, 2007), should be taken into account. Third, proper enforcement of hand-held cell phone ban laws, and hence driver compliance, is an important issue. McCartt and Geary (2003, 2004) reported that the rate of hand-held cell phone use while driving in New York dropped from 2.3% (before the law was enacted) to 1.1% in the first few months immediately following the enactment of the law. However, this rate rebounded to 2.1% about a year later. Since the initial drop in hand-held cell phone use while driving was not sustained, it is possible that the reduction in automobile accident rates in New York may be due to other factors. Fourth, data linking the number of cell phone subscribers to automobile accident rates suggest that increased cell phone use does not translate into increased automobile accident rates. In particular, there has been exponential growth in the number of cell phone subscribers since the late 1980s, while automobile accident rates in the US during this same time period have remained at a fairly constant level (see Fig. 2). Driving statistics from the National Center for Statistics and Analysis of the NHTSA reveal that from 1994 to 2004, the number of cell phone subscribers increased 655%, with their average monthly minutes-of-use increasing 3900%, while reported annual automobile accident rates decreased by approximately 5% over the same time period (Information Please Database, 2007; NHTSA, 2008). These facts should not go unnoticed, even though it is likely that changes in transportation policy and advances in automotive safety between 1988 and 2006 have influenced the accident rates. As of February 2007, sixteen states had published data on the number of automobile accidents that cited hand-held cell phones or radios as a causal factor.
These data indicate that hand-held cell phone use is reported as a factor in less than 1% of automobile accidents (Sundeen, 2007).

Fig. 2. Automobile accident rates and the number of cell phone subscribers in the US, 1988–2006.

Although such data are controversial and potentially unreliable, due to the challenge in knowing the precise cause of accidents and how such information is reported, they do suggest that hand-held cell phone use may account for a negligible percentage of automobile accidents, which means that if such accidents could be completely eliminated by hand-held cell phone ban laws, there would be only a slight reduction in the total number of automobile accidents.

4.3. Future research directions

A large body of literature suggests that hand-held cell phone use while driving impairs driver performance (Ranney, 2005; Strayer et al., 2006; Sundeen, 2001; Redelmeier and Tibshirani, 1997). Drivers using hand-held cell phones have slower reaction times, longer following distances, and take longer to recover speed compared to drivers who do not use hand-held cell phones (Strayer and Drews, 2004). Although studies using driving simulators and test tracks indicate that hand-held cell phone use negatively impacts driver performance, the results drawn from experiments in such controlled environments cannot directly measure the impact of hand-held cell phone use on accident rates (Hedlund, 2006). Indeed, there is insufficient evidence to broadly assert that hand-held cell phone use results in higher accident rates or that hand-held cell phone bans decrease accident rates (Williams, 2007).
Several organizations, including CTIA and AAA Auto Insurance, believe it is premature to ban hand-held cell phone use while driving. They argue that road safety can be improved more effectively through education and ease-of-use cell phone designs, rather than legislation. Studies conducted in actual driving conditions, not only in laboratory environments, are needed to provide convincing evidence that hand-held cell phone use while driving impairs driver performance, and hence, increases automobile accident rates. However, staging a set of potentially dangerous situations on the road just to evaluate the driver’s ability to avoid a collision is unthinkable, and hence, the statistical approach taken in this paper may be the only one where data from actual accidents can be used to answer questions regarding cell phone use while driving. Although at this point one should be cautious about drawing conclusions from the current analysis (for reasons described in Section 4.2), the approach taken in this paper looks very promising for providing useful information on the need for hand-held cell phone ban laws. In order to conduct a more substantive and conclusive analysis, the data that would allow for blocking the confounding factors are required. Also, the property damage automobile accident rate should be considered as another, more appropriate measure of safety than fatal or personal injury accident rates. A measure that ideally would replace the density of licensed drivers in the analysis is the daily vehicle throughput per square mile of a county’s land. Moreover, in order to investigate the effects of restricting hand-held cell phone use while driving, wider-coverage data related to cell phone usage and road safety are needed to support additional research on this important problem. 
Such data could include the fraction of drivers actually using hand-held cell phones while driving, the total amount of time that hand-held cell phones are used while driving, and the fraction of automobile accidents that are directly attributable to hand-held cell phone use. Note that a wealth of the described data lies in the hands of insurance companies, which have a clear interest in the correct evaluation of the impact of cell phone ban laws on driving safety, if only for the sake of gaining a competitive edge over their rivals. Moreover, national and state transportation policy-makers would welcome a fair and unbiased analysis of such data, to put to rest the growing debate on this issue and allow for appropriate national and state legislation policies and decisions to be made.

Given more data, a logical next step from the statistical point of view is to conduct a time series cross-sectional multivariate regression analysis and employ analysis of variance techniques to establish whether laws prohibiting hand-held cell phone use while driving have a significant effect on the driving environment. The authors do not intend to stop short of finding the truth and actively seek potential collaborations with interested parties.

Acknowledgements

The computational work was done in the Simulation and Optimization Laboratory housed within the Department of Computer Science at the University of Illinois. The views expressed in this paper are those of the authors and do not reflect the official policy or position of the United States Air Force, Department of Defense, or the United States Government. The authors wish to thank the editor and two anonymous referees for their helpful comments and suggestions, which have significantly improved the manuscript. The authors would also like to thank Qianyi C. Zhao for her contributions during the initial stages of this research.
References

AAA Auto Insurance, Public Affairs, 2008. State Distracted Driving Laws Chart. <http://www.aaapublicaffairs.com/Assets/Files/200891214340.DistractedDrivingLaws8.08.doc> (accessed 28.01.09).
Auto News, 2006. Hands-free Phone Kits Help Reduce Cell Phone Distraction. <http://www.motortrend.com/auto_news/112_news060906_hands_free_cell_phone_use/index.html> (accessed 18.02.09).
Bloomberg, M.R., Sadik-Khan, J., 2007. Safe Streets NYC. Traffic Safety Improvements in New York City. <http://www.SafeNY.com> (accessed 18.10.09).
Caird, J.K., Willness, C.R., Steel, P., Scialfa, C., 2008. A meta-analysis of the effects of cell phones on driver performance. Accident Analysis and Prevention 40 (4), 1282–1293.
Cellular News, 2008. Countries that Ban Cell Phones while Driving. <http://www.cellular-news.com/car_bans/> (accessed 28.01.09).
Cohen, J.T., Graham, J.D., 2003. A revised economic analysis of restrictions on the use of cell phones while driving. Risk Analysis 23 (1), 5–17.
CTIA, 2008a. Safe Driving Tips. <http://www.ctia.org/consumer_info/safety/index.cfm/AID/10369> (accessed 28.01.09).
CTIA, 2008b. Safe Driving: CTIA Position. <http://www.ctia.org/advocacy/policy_topics/topic.cfm/TID/17> (accessed 28.01.09).
Dingus, T.A., Klauer, S.G., Neale, V.L., Petersen, A., Lee, S.E., Sudweeks, J., Perez, M.A., Hankey, J., Ramsey, D., Gupta, S., Bucher, C., Doerzaph, Z.R., Jermeland, J., Knipling, R.R., 2006. The 100-car Naturalistic Driving Study, Phase II – Results of the 100-car Field Experiment. NHTSA, DOT HS 810 593. <http://www.nhtsa.gov/portal/nhtsa_static_file_downloader.jsp?file=/staticfiles/DOT/NHTSA/NRD/Multimedia/PDFs/Crash%20Avoidance/2006/100CarMain.pdf> (accessed 28.01.09).
Funponsel Network, 2005. Motorola Develops 'Polite' Phone. <http://www.funponsel.com/blog/news/motorola-develops-polite-phone.html> (accessed 18.02.09).
Glassbrenner, D., Ye, J.Q., 2007. Driver Cell Phone Use in 2006 – Overall Results. NHTSA, DOT HS 810 790 (July).
Goodman, M.J., Benel, D., Lerner, N., Wierwille, W., Tijerina, L., Bents, F., 1997. An Investigation of the Safety Implications of Wireless Communications in Vehicles. US Dept. of Transportation, NHTSA. <http://www.nhtsa.dot.gov/people/injury/research/wireless/> (accessed 28.01.09).
Governors Highway Safety Association, 2008. Cell Phone Driving Laws. <http://www.ghsa.org/html/stateinfo/laws/cellphone_laws.html#4> (accessed 28.01.09).
Hedlund, J.H., 2006. Countermeasures that Work: A Highway Safety Countermeasures Guide for State Highway Safety Offices. NHTSA, Washington, DC, DOT HS 809 980 (January). <http://www.nhtsa.dot.gov/people/injury/airbags/Countermeasures/pages/0Introduction.htm> (accessed 28.01.09).
Horrey, W.J., Wickens, C.D., 2006. Examining the impact of cell phone conversations on driving using meta-analytic techniques. Journal of Experimental Psychology: Applied 12 (2), 67–78.
Information Please Database, 2007. Cell Phone Subscribers in the US, 1985–2005. <http://www.infoplease.com/ipa/A0933563.html> (accessed 28.01.09).
Klauer, S.G., Dingus, T.A., Neale, V.L., Sudweeks, J.D., Ramsey, D.J., 2006. The Impact of Driver Inattention on Near-crash/Crash Risk: An Analysis Using the 100-car Naturalistic Driving Study Data. NHTSA, DOT HS 810 5940 (April). <http://www.noys.org/Driver%20Inattention%20Report.pdf> (accessed 28.01.09).
Lissy, S.K., Cohen, J.T., Park, M.Y., Graham, J.D., 2000. Cellular Phone Use while Driving: Risks and Benefits. Harvard Center for Risk Analysis, Harvard School of Public Health. <http://www-nrd.nhtsa.dot.gov/departments/nrd-13/driver-distraction/PDF/Harvard.PDF> (accessed 28.01.09).
McCartt, A.T., Geary, L.L., 2003. Drivers' use of handheld cell phones before and after New York State's cell phone law. Preventive Medicine 36, 629–635.
McCartt, A.T., Geary, L.L., 2004. Long term effects of New York State's law on drivers' handheld cell phone use. Injury Prevention 10, 11–15.
McCartt, A.T., Hellinga, L.A., Braitman, K.A., 2006. Cell phones and driving: review of research. Traffic Injury Prevention 7 (2), 89–106.
McEvoy, S.P., Stevenson, M.R., McCartt, A.T., Woodward, M., Haworth, C., Palamara, P., Cercarelli, R., 2005. Role of mobile phones in motor vehicle crashes resulting in hospital attendance: a case-crossover study. British Medical Journal 331 (7514), 428–430.
NHTSA, 1997. Traffic Safety Facts 1996: Young Drivers. NHTSA, Washington, DC.
NHTSA, 2008. Table 2-17: Motor Vehicle Safety Data, National Transportation Statistics. <http://www.bts.gov/publications/national_transportation_statistics/html/table_02_17.html> (accessed 28.01.09).
New York State Department of Motor Vehicles, 2008a. Ticket and Crash Data Reports by County in 2001–2006. <http://www.nysgtsc.state.ny.us/hsdata.htm> (accessed 28.01.09).
New York State Department of Motor Vehicles, 2008b. The Number of All New York State Vehicle Registrations Considered Active in 1997–2006. <http://www.nydmv.state.ny.us/stats-arc.htm> (accessed 28.01.09).
New York State Department of Motor Vehicles, 2008c. Driver Licenses and Vehicle Registrations by County in 1997–2006. <http://www.nydmv.state.ny.us/stats-arc.htm> (accessed 28.01.09).
O'Donnell, J., 2009. Efforts to Limit Cellphone Use while Driving Grow. USA Today, March 30, 2009.
Olson, R.K., 2003. Cell Phone Bans for Drivers: Wise Legislation? International Risk Management Institute. <http://www.irmi.com/Expert/Articles/2003/Olson05-a.aspx> (accessed 28.01.09).
Ranney, T.A., 2005. Examination of the Distraction Effects of Wireless Phone Interfaces Using the National Advanced Driving Simulator – Final Report on the Freeway Study. NHTSA, DOT HS 809 787.
Redelmeier, D.A., Tibshirani, R.J., 1997. Association between cellular-telephone calls and motor vehicle collisions. The New England Journal of Medicine 336, 453–458.
Richtel, M., 2009. Promoting the Car Phone, Despite Risks. New York Times, December 7, 2009.
Royal, D., 2003. Volume I: Findings. National Survey of Distracted and Drowsy Driving Attitudes and Behavior: 2002. NHTSA, DOT HS 809 566.
Savage, M.A., Goehring, J.B., Mejeur, J., Reed, J.B., Sundeen, M., 2000. State Traffic Safety Legislative Summary 2000. Transportation Series (National Conference of State Legislatures). <http://www.ncsl.org/programs/transportation/trafsafe00.htm> (accessed 28.01.09).
Seber, G.A.F., 1984. Multivariate Observations. John Wiley and Sons, Hoboken, NJ.
Strayer, D.L., Drews, F.A., 2004. Profiles in driver distraction: effects of cell phone conversations on younger and older drivers. Human Factors 46 (4), 640–649.
Strayer, D.L., Drews, F.A., Johnston, W.A., 2003. Cell phone-induced failures of visual attention during simulated driving. Journal of Experimental Psychology: Applied 9 (1), 23–32.
Strayer, D.L., Drews, F.A., Crouch, D.J., 2006. A comparison of the cell phone driver and the drunk driver. Human Factors 48 (2), 381–391.
Sundeen, M., 2001. Driving while calling – what's the legal limit? State Legislatures 27 (9), 24–26.
Sundeen, M., 2004. Cell Phones and Highway Safety: 2003 State Legislative Update. National Conference of State Legislatures. <http://www.ncsl.org/print/transportation/cellphoneupdate12-03.pdf> (accessed 18.02.09).
Sundeen, M., 2007. Cell Phones and Highway Safety: 2006 State Legislative Update. National Conference of State Legislatures. <http://www.ncsl.org/print/transportation/2006cellphone.pdf> (accessed 28.01.09).
Williams, A.F., 2007. Contribution of the components of graduated licensing to crash reductions. Journal of Safety Research 38 (2), 177–184.
Operations Research
Balance Optimization Subset Selection (BOSS): An Alternative Approach for Causal Inference with Observational Data
Alexander G. Nikolaev, Sheldon H. Jacobson, Wendy K. Tam Cho, Jason J. Sauppe, Edward C. Sewell
To cite this article: Alexander G. Nikolaev, Sheldon H. Jacobson, Wendy K. Tam Cho, Jason J. Sauppe, Edward C. Sewell, (2013) Balance Optimization Subset Selection (BOSS): An Alternative Approach for Causal Inference with Observational Data. Operations Research 61(2):398–412. http://dx.doi.org/10.1287/opre.1120.1118
OPERATIONS RESEARCH
Vol. 61, No. 2, March–April 2013, pp. 398–412
ISSN 0030-364X (print), ISSN 1526-5463 (online)
http://dx.doi.org/10.1287/opre.1120.1118
© 2013 INFORMS

Balance Optimization Subset Selection (BOSS): An Alternative Approach for Causal Inference with Observational Data

Alexander G. Nikolaev, Department of Industrial and Systems Engineering, University at Buffalo (SUNY), Buffalo, New York 14260, [email protected]
Sheldon H. Jacobson, Department of Computer Science, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, [email protected]
Wendy K. Tam Cho, Departments of Political Science and Statistics and the National Center for Supercomputing Applications, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, [email protected]
Jason J. Sauppe, Department of Computer Science, University of Illinois at Urbana–Champaign, Urbana, Illinois 61801, [email protected]
Edward C. Sewell, Department of Mathematics and Statistics, Southern Illinois University Edwardsville, Edwardsville, Illinois 62026, [email protected]

Scientists in all disciplines attempt to identify and document causal relationships. Those not fortunate enough to be able to design and implement randomized control trials must resort to observational studies. To make causal inferences outside the experimental realm, researchers attempt to control for bias sources by postprocessing observational data. Finding the subset of data most conducive to unbiased or least biased treatment effect estimation is a challenging, complex problem. However, the rise in computational power and algorithmic sophistication leads to an operations research solution that circumvents many of the challenges presented by methods employed over the past 30 years.
Subject classifications: causal inference; balance optimization; subset selection.
Area of review: Optimization.
History: Received September 2011; revisions received February 2012, July 2012; accepted September 2012. Published online in Articles in Advance March 19, 2013.

1. Problem Description

Randomized experiments have been used by a diverse swath of researchers to isolate treatment effects and establish causal relationships. Such experiments have informed our understanding of medicine (e.g., the effect of drugs, the causes of cancer, the benefit of vitamins), and have been instrumental in the implementation of public policy (e.g., shedding insight on the effect of racial campaign appeals, testing the effect of get-out-the-vote appeals, determining the impact of new voting technologies). The randomized experimental framework is best suited for exploring causal inferences. In an experiment, a study population is chosen (ideally) at random, or otherwise by a careful selection of a convenient sample. Another random process determines whether or not each unit will receive a treatment. Because randomization ensures that the treatment and control units are identical in distribution, save that the treatment units have received a treatment, the treatment effect can then be defined as the difference in response (measurable outcome) between the units in the treatment group and those in the control group. In addition to offering tools for measuring estimation accuracy (e.g., calculating p-values, confidence intervals), randomization is powerful because it allows the effect of treatment to be isolated from that of confounding factors.

There are numerous situations where conducting a randomized experiment is impractical or not even possible (due to ethical dilemmas). For example, to determine whether smoking causes lung cancer, it would not be possible to randomly select people to smoke. Similarly, although it would be beneficial to understand the perils of radiation exposure, randomly choosing people and exposing them to high levels of radiation is unethical. Although experiments cannot be conducted for these pressing and important research queries, one can often collect observational data. So, although we would not expose people to situations that might put their health in peril, because these situations do occur, we can observe people who choose to smoke or find people who have been inadvertently exposed to radiation. This type of data is called observational data because it is observed (rather than created via experiments). Observational data are both more prevalent than experimental data and available for a larger set of important queries. Indeed, there are already many instances of research attempting to make causal inferences using observational data. In the health field, for example, studies have examined the impact of generic substitution of presumptively chemically equivalent drugs (Rubin 1991), the consequences of in utero exposure to phenobarbital on intelligence deficits (Reinisch et al. 1995), and the effect of maternal smoking on birthweight (da Veiga and Wilder 2008).
Public policy applications of causal analysis have included the impact of different voting technologies for counting votes (Herron and Wand 2007), the varying role of information on voters in mature versus new democracies (Sekhon 2004), and the effect of electoral rules on the presence of the elderly in national legislatures (Terrie 2008). At the same time, there is no consensus on how best to proceed if one wishes to make causal inferences with observational data. The critical difference between experiments and observational studies is that in experiments, because units are randomly assigned to a treatment, the distributions of their covariates (attributes) in the treatment and control groups are identical, isolating the effect of treatment and permitting its determination in expectation. Although various mechanisms have been proposed for random assignment in the statistical literature to handle such issues (Morris 1985), working with observational data sets requires a different set of tools. It is well recognized that confounding effects in a data set may exist due to both observed (those reflected in the data set) and unobserved covariates. Dealing with unobservable covariates is a fundamental challenge for causal inference and requires additional information to supplement the available data, whereas the effects of observed covariates can be isolated by data postprocessing, which has received significant interest from practitioners as reported above. A large body of literature has been sparked by the works of Rubin and Rosenbaum, the first to present definitions, assumptions, and discussions to arrive at a technically sound formulation of the causal inference problem with observational data (see individual references in the text below). This paper makes a contribution to this already rich literature, offering an alternative approach to causal analysis. 
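The contrast between the two settings can be made concrete with a minimal simulation (hypothetical numbers, not from the paper): when assignment is a coin flip independent of a unit's characteristics, a simple difference in mean responses recovers the treatment effect.

```python
import random

random.seed(7)

N = 20000
TRUE_EFFECT = 2.0  # fixed effect added to a unit's response when treated

# Each unit's baseline response, drawn independently of assignment:
baseline = [random.gauss(10.0, 1.0) for _ in range(N)]

# Random assignment: a fair coin flip per unit, independent of baseline.
treated = [random.random() < 0.5 for _ in range(N)]
response = [b + TRUE_EFFECT * t for b, t in zip(baseline, treated)]

n_t = sum(treated)
mean_t = sum(r for r, t in zip(response, treated) if t) / n_t
mean_c = sum(r for r, t in zip(response, treated) if not t) / (N - n_t)
estimate = mean_t - mean_c  # difference in means, close to TRUE_EFFECT
```

If assignment instead depended on `baseline` (as it effectively does in observational data, where units self-select into treatment), the same difference in means would absorb that dependence as bias.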
In order to analyze observational data, where treatment assignment has already been made (a priori nonrandomly), one must postprocess the data with respect to the observed covariates so as to remove confounding effects by creating treatment and control groups with statistically indistinguishable distributions of their covariates. How to best postprocess observational data and assess the success of this venture is an open question. To transition from a randomized experimental setting to an observational setting, the nuances and similarities of each must be examined. For unit u, let Y_u^1 (Y_u^0) denote a treated (untreated) response; T_u, a treatment indicator (1 means treated, 0 means not treated); and X_u = {X_1u, X_2u, ..., X_Ku}, a vector of values for K covariates. In both experimental and observational settings, a population of units is under consideration. For a particular unit u, the causal effect of the treatment (relative to the control) is defined as the difference in response that results from receiving and not receiving the treatment, Y_u^1 − Y_u^0. The fundamental problem of causal inference is that it is impossible to observe both values Y_u^1 and Y_u^0 on the same unit u (Holland 1986) (e.g., a person either smokes or does not smoke). The outcome of an observation of a unit is termed the observed response, T_u Y_u^1 + (1 − T_u) Y_u^0. The Rubin causal model (Rubin 1974, 1978) reconceptualizes this causal inference framework so that the response under either treatment or control, but not both, needs to be observed for each unit. That is, one statistical solution to the fundamental problem of causal inference is to shift to an examination of an average causal effect over all units in the population, E[Y_u^1 − Y_u^0] = E[Y_u^1] − E[Y_u^0], where E[Y_u^1] is computed from the treatment group and E[Y_u^0] is computed from the control group. An important consideration is how one determines which units will inform the values of Y_u^1 and Y_u^0.
In an observational study, one observes some pool of units who have received a treatment, giving E[Y_u^1 | T = 1], and some pool of units who have not received a treatment, giving E[Y_u^0 | T = 0]. In general, E[Y_u^1] ≠ E[Y_u^1 | T = 1] and E[Y_u^0] ≠ E[Y_u^0 | T = 0]. Moreover, the average treatment effect (ATE), E[Y_u^1 − Y_u^0], is not the same as the average treatment effect for the treated (ATT), E[Y_u^1 | T = 1] − E[Y_u^0 | T = 1]. By design, ATE and ATT are interchangeable if the independence assumption holds. That is, if exposure to treatment (T = 1) or control (T = 0) is statistically independent of response and covariate values, then the units have been properly randomized into treatment and control pools, rendering ATE and ATT to be the same. This situation is not typically the case in observational studies because units are not randomly placed into treatment and control pools. Instead, ATT = E[Y_u^1 | T = 1] − E[Y_u^0 | T = 1] = E[Y_u^1 | T = 1] − E[Y_u^0 | T = 0] + B, where selection bias is present, defined as B ≡ E[Y_u^0 | T = 0] − E[Y_u^0 | T = 1]. One approach for estimating treatment effects outside the experimental realm relies on multivariate statistical techniques, which fall under the broad rubric of matching methods (Rubin 2006). The core of these methods is to employ tools to match units based on their covariate similarity. This results in each treatment unit being matched with a control unit. If the matching venture is successful, then treatment and control groups are obtained such that the two groups are similar in their covariates, differing only on the treatment indicator value, thereby reducing the bias in the estimation of treatment effects. Although this set of techniques has been widely used, there remains a lack of consensus on how best to achieve matching or how to assess the success of a matching process.
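The ATT decomposition just described can be verified on a toy population with fully known (hypothetical) potential outcomes; the identity ATT = (naive observed difference) + B holds exactly, and the example shows the naive difference can even have the wrong sign.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Each unit: (Y1, Y0, T). Treatment take-up is nonrandom here:
# units with higher untreated response Y0 were less likely to be treated.
units = [
    (7.0, 5.0, 1),
    (6.5, 4.5, 1),
    (6.0, 4.0, 1),
    (9.0, 8.0, 0),
    (8.5, 7.5, 0),
    (8.0, 7.0, 0),
]

ate = mean([y1 - y0 for y1, y0, _ in units])                 # E[Y1 - Y0]
att = mean([y1 - y0 for y1, y0, t in units if t == 1])       # effect on treated
# Naive observed difference E[Y1 | T=1] - E[Y0 | T=0]:
naive = mean([y1 for y1, _, t in units if t == 1]) \
    - mean([y0 for _, y0, t in units if t == 0])
# Selection bias B = E[Y0 | T=0] - E[Y0 | T=1]:
bias = mean([y0 for _, y0, t in units if t == 0]) \
    - mean([y0 for _, y0, t in units if t == 1])
```

Here the true ATT is 2.0, yet the naive comparison of observed groups gives −1.0; the gap is exactly the selection bias B = 3.0.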
Nikolaev et al.: BOSS: An Alternative Approach for Causal Inference. Operations Research 61(2), pp. 398–412, © 2013 INFORMS.

However, a generally accepted principle is that balance on the covariates leads to minimal bias in the estimated treatment effect (Rosenbaum and Rubin 1985). Here, balance has been loosely understood as similarity between the distributions of covariates in the treatment and control groups. Therefore, whereas most researchers agree that a reasonable goal of matching procedures is to obtain balance, there remains disagreement on how to measure balance, making it difficult to assess how a particular matched group compares to other possible matched groups that achieve varying levels of balance. The resulting lack of guidance is a critical omission, because different matched sets can lead to conflicting conclusions. Interestingly, few of the existing matching methods directly attempt to obtain optimal covariate balance, despite claiming that covariate balance is the measure by which to judge the success of the matching procedure. Instead, researchers perform some type of matching (e.g., propensity score matching, Mahalanobis matching), check to see whether the groups appear to be roughly similar, and, if unsatisfied, modify parameters of the matching procedure (e.g., distance metric weights or regression model specification) and repeat (see Figure 1). The point at which to end this iterative procedure is at the discretion of the researcher. By design, researchers are unable to objectively assess the quality of their final matched groups, because the benchmark, the matched groups with optimal balance, is unknown.
Recognizing this issue, recent work by Diamond and Sekhon (2010) attempts to streamline the "match, check balance, adjust, and repeat as needed" process by using a genetic algorithm to adjust the parameters and weights used in the matching algorithm so as to obtain matched samples with the best possible balance measure. Other researchers have also begun to move toward the idea of directly optimizing balance within a matched-samples framework. In particular, Rosenbaum et al. (2007) introduce the notion of fine balance, which "refers to exactly balancing a nominal variable, often one with many categories, without trying to match individuals on this variable" (Rosenbaum et al. 2007, p. 75). This relaxation from exact individual matches on a covariate to equal proportions of individuals in the treatment and control groups for each value of the covariate is central to the approach proposed in this paper. Whereas Rosenbaum et al. (2007) consider fine balance for one (nominal) covariate, with matches required on the rest, this paper extends the concept to all covariates. Another recent effort introduced entropy balancing (Hainmueller 2012), which uses a maximum entropy reweighting scheme to adjust weights for each of the control individuals in order to meet user-specified balance constraints placed on the moments of the covariate distributions. For more background on the idea of weighting observations in a data set, see Hellerstein and Imbens (1999). Matching treatment and control units on an individual level is one method to achieve covariate balance; however, it is not a guarantee. We argue that although the focus in the causal inference literature has been on matching, the matching itself of treatment units to control units is not necessary. Notable publications that support the idea of conducting causal analysis on an aggregate, group level include Abadie and Gardeazabal (2003) and Abadie et al. (2010).
Matching is not the only way to reduce selection bias, and arguably not even the best way, because one is not interested in unit matches per se, but in creating control and treatment groups that are statistically indistinguishable in the covariates (i.e., featuring covariate balance). This observation suggests that a shift is possible in how treatment and control groups are created. To realize such a shift, §2 motivates and presents the Balance Optimization Subset Selection (BOSS) approach to the problem of causal inference based on observational data. Section 3 reports computational results from one BOSS algorithm for the estimation of the treatment effect in a simulated problem. Section 4 offers concluding remarks, discusses the potential of the BOSS approach, raises some theoretical and practical challenges, and outlines several topics for future investigation within the operations research community. Note that the main contribution of this paper is conceptual and theoretical. The goal of §2 is to present the problem of causal inference in a new light, opening up a field where optimization tools developed within the operations research community can make an impact. By motivating and formalizing an alternative approach to a problem of great importance to multiple domains of modern science, this paper is intended as a seed for more applied, computationally oriented literature. Section 3 is not meant to be comprehensive; rather, it illustrates that the proposed theory shifts the problem at hand into the computational realm, and it supports the call for more intense, goal-driven computational research on BOSS. The electronic companion to this paper is available as supplemental material at http://dx.doi.org/10.1287/opre.1120.1118.

Figure 1. Matching methods logic.
[Figure 1 flowchart: choose/adjust regression or matching model parameters; run a matching algorithm to find a solution; check whether the covariates in the treatment and control groups are balanced; if not, adjust and repeat; if yes, report a treatment effect estimate and bootstrap for its variance.]

2. BOSS Approach

The presented approach offers an alternative perspective on causal inference using observational data. It exploits the idea that covariate balance leads to minimized bias in the estimated treatment effect by directly optimizing a balance measure, without requiring matched samples. As noted in §1, although the success of matching methods is assessed by the degree of balance achieved, very few of the current matching methods directly optimize balance, resorting instead to different types of optimization problems (e.g., optimal parameter estimation for regression models, optimal assignment for unit matching with calipers). Traditional matching methods simply report balance statistics, without a guide to assessing whether the reported balance could be improved upon, is good, or is even sufficient. There may be no standard metric for the degree of balance achieved; nevertheless, a discussion of balance is always presented and perceived as a final verdict validating a conducted analysis. This simple observation highlights that the problem at hand is a balance optimization problem, not a matching problem. Matching is one method to obtain balance, but it unnecessarily restricts the solution space and lacks a measure of balance optimality. Indeed, the end goal is balance, not matching, and hence, optimizing balance measures directly is reasonable and preferred. The BOSS approach to causal inference with observational data reformulates the problem as one of balance optimization (Cho et al. 2011).
In so doing, the problem is transformed from one of matching individual units to a subset selection problem, and it exploits operations research methodologies (in particular, discrete optimization) that are ideally suited to model and address the balance optimization problem. In essence, BOSS inverts the direction of the solution methodology and redefines the problem structure to directly pursue the goal of covariate balance (see Figure 2). Note that this subset selection approach comes at the cost of losing the qualitative information carried by individual matches, which may be useful in some practical situations; however, group-based average quantities can be estimated more precisely.

Figure 2. BOSS logic. [Flowchart: choose a balance measure (a statistic for testing distribution fit); run a BOSS algorithm to find multiple solutions minimizing the balance measure; report the balance and compute the mean and variance of the treatment effect.]

To motivate the subset selection problem and explain why balance on covariates is required for unbiased estimation of the treatment effect, a formal problem formulation is presented.

2.1. The Value of Covariate Balance

Let $S_N \equiv \{u_i\}_{i=1}^N$ denote a set of $N$ observed units. Define the average treated response $\bar{Y}_{S_N}^1 = (1/N) \sum_{u \in S_N} Y_u^1$ and the average untreated response $\bar{Y}_{S_N}^0 = (1/N) \sum_{u \in S_N} Y_u^0$. Given a set of units that have received treatment, treatment pool $T$; a set of units that have not received treatment, control pool $C$; and a set of $K$ covariates, a pair of subsets for comparison is identified: treatment group $S_N^T \subset T$ and control group $S_N^C \subset C$. To understand the value of covariate balance in causal inference, the following assumption is required (Rosenbaum and Rubin 1983).

Assumption 1 (Strong Ignorability for Groups). Consider a population of all groups of size $N$, where $S_N \equiv \{u_i\}_{i=1}^N$ denotes any such group of $N$ observed units, which are either entirely treated (i.e., $\{T_u = 1\}_{u \in S_N}$) or entirely untreated (i.e., $\{T_u = 0\}_{u \in S_N}$). For any set of covariates $\{X_u\}_{u \in S_N}$, assume

$$(\bar{Y}_{S_N}^1, \bar{Y}_{S_N}^0) \perp\!\!\!\perp \{T_u\}_{u \in S_N} \mid \{X_u\}_{u \in S_N}, \quad (1)$$

$$0 < P(\{T_u = 1\}_{u \in S_N} \mid \{X_u\}_{u \in S_N}) < 1. \quad (2)$$

Expression (1) means that for any group of units, its average responses are independent of treatment, given the units' covariate values. The symbol $\perp\!\!\!\perp$ signifies conditional independence (Dawid 1979). This implies that the $K$ observed covariates include all the covariates, dependent on the treatment assignment $T_u$, that have causal effects on the responses $Y_u^1$ and $Y_u^0$, for every unit $u$. Additionally, by expression (2), each group with a given set of its units' covariate values is assumed to have a positive probability of appearing in either the treatment pool or the control pool. These assumptions are made throughout the statistical literature, albeit for individual units (Rosenbaum and Rubin 1983); Assumption 1 is equivalent to the original assumption of Rosenbaum and Rubin (1983) when $N = 1$. The following proposition captures the objective of any method of postprocessing observational data for causal inference.

Proposition 1. Assume that Assumption 1 holds. From the treatment pool, randomly select treatment group $S_N^T$. Next, randomly select groups of size $N$ from the control pool until a control group $S_N^C$ is identified such that $\{X_u\}_{u \in S_N^C}$ and $\{X_u\}_{u \in S_N^T}$ are identically distributed. Then,

$$E(\bar{Y}_{S_N^T}^1 - \bar{Y}_{S_N^C}^0) = \mathrm{ATT}. \quad (3)$$

Proof. The described mechanism for the selection of $S_N^T$, and subsequently $S_N^C$, ensures that

$$E(\bar{Y}_{S_N^T}^1) = E_x\big[E(\bar{Y}_{S_N}^1 \mid \{T_u = 1\}_{u \in S_N} \cap \{X_u\}_{u \in S_N} = x) \,\big|\, \{T_u = 1\}_{u \in S_N}\big]$$

and

$$E(\bar{Y}_{S_N^C}^0) = E_x\big[E(\bar{Y}_{S_N}^0 \mid \{T_u = 0\}_{u \in S_N} \cap \{X_u\}_{u \in S_N} = x) \,\big|\, \{T_u = 1\}_{u \in S_N}\big].$$

By definition,

$$\mathrm{ATT} = E(\bar{Y}_{S_N}^1 - \bar{Y}_{S_N}^0 \mid \{T_u = 1\}_{u \in S_N}).$$

By conditioning,

$$\mathrm{ATT} = E_x\big[E(\bar{Y}_{S_N}^1 - \bar{Y}_{S_N}^0 \mid \{T_u = 1\}_{u \in S_N} \cap \{X_u\}_{u \in S_N} = x) \,\big|\, \{T_u = 1\}_{u \in S_N}\big],$$

and under Assumption 1,

$$\mathrm{ATT} = E_x\big[E(\bar{Y}_{S_N}^1 \mid \{T_u = 1\}_{u \in S_N} \cap \{X_u\}_{u \in S_N} = x) \,\big|\, \{T_u = 1\}_{u \in S_N}\big] - E_x\big[E(\bar{Y}_{S_N}^0 \mid \{T_u = 0\}_{u \in S_N} \cap \{X_u\}_{u \in S_N} = x) \,\big|\, \{T_u = 1\}_{u \in S_N}\big],$$

which completes the proof. □

From Proposition 1, the key to causal inference research is the ability to identify control groups whose joint distribution of covariates is identical to that of a treatment group. This translates into the property that the probability that (as a group) the units in $S_N^C$ could be treated is the same as the probability that the units in $S_N^T$ are treated. Note that for individual units (i.e., for $N = 1$), this probability is known as the propensity score. If the distributions of covariates in groups $S_N^T$ and $S_N^C$ are the same, then such groups are said to be optimally balanced on the set of $K$ covariates, rendering $P(\{T_u = 1\}_{u \in S_N^C}) = P(\{T_u = 1\}_{u \in S_N^T})$. The result of Proposition 1 is for groups of units, not individual units. If groups $S_N^T$ and $S_N^C$ have one unit each ($N = 1$), and these units are perfectly matched ($X_{u \in S_N^C} = X_{u \in S_N^T}$), then (3) holds. Similarly, in propensity score-based methods (Rosenbaum and Rubin 1983), regression is used to match units with the same estimated probabilities of being treated, again so that $P(\{T_u = 1\}_{u \in S_N^C}) = P(\{T_u = 1\}_{u \in S_N^T})$ for groups of such units. In all these methods, however, a value assessing covariate balance is judged only after the data have been postprocessed, with covariate balance not serving as a direct guide for optimal group selection. Although more rigorously designed propensity score models might mitigate this problem to some degree, such potential advances will require deeper statistical design research in the future.

2.2. Modeling and Optimization for Causal Inference

BOSS reframes the causal inference problem as a subset selection problem.
The goal is to randomly generate $S^T$, a subset of $T$, and to find $S^C$, a subset of $C$, such that a measure of balance, $M(S^T, S^C)$, is optimized. This discrete optimization problem can be addressed using operations research algorithms and heuristics. Moreover, this formulation lays the foundation for the development of a new analytical model that exploits the power of ever-increasing computational resources to assess, inform, and improve data analytic techniques. The BOSS conceptualization is flexible and falls within a general discrete optimization framework, and various measures of balance can be adapted into it. This paper provides a detailed statement of one instance of a balance optimization problem, using a balance measure for a binning model. An intuitive way of comparing distributions is a visual study of histograms based on their probability mass functions (Imai 2005); using goodness-of-fit test statistics based on histograms is a more precise and rigorous way of quantifying the difference between the covariate distributions of $S^T$ and $S^C$. More formally, for each covariate $k = 1, 2, \ldots, K$, its range $[L_k, U_k]$, with $L_k = \min_{u \in T \cup C} X_{ku}$ and $U_k = \max_{u \in T \cup C} X_{ku}$, can be broken up by thresholds $L_k = t_0^k < t_1^k < t_2^k < \cdots < t_{R(k)}^k = U_k$. The total number of thresholds $R(k)$ used for covariate $k = 1, 2, \ldots, K$ is typically the number of categories for discrete (categorical) variables and some positive integer for continuous variables. This is similar to the coarsening procedure proposed by Iacus et al. (2012) for coarsened exact matching. Let covariate cluster $D$ denote a subset of the set of covariates, $D \subseteq \{1, 2, \ldots, K\}$. For any covariate cluster $D = \{k_1, k_2, \ldots, k_m\}$ consisting of $m$ covariates, with $1 \le k_1 < k_2 < \cdots < k_m \le K$, define the set of bins $B^D$ as the set of intervals of the form $[t_{r_1-1}^{k_1}, t_{r_1}^{k_1}] \times [t_{r_2-1}^{k_2}, t_{r_2}^{k_2}] \times \cdots \times [t_{r_m-1}^{k_m}, t_{r_m}^{k_m}]$ that spans the entire joint range of values of the covariates in $D$.
Assuming a given fixed ordering of the elements in $B^D$, the individual bins are indexed $\{B_1^D, B_2^D, \ldots, B_{R_m}^D\}$, with $R_m \equiv \prod_{j=1}^m R(k_j)$. These bins are used to quantify the difference between the joint distributions of values of the covariates in $D$ for groups $S^T$ and $S^C$. Let $N(S, B_b^D)$ denote the number of units in group $S$ with the values of the covariates in $D$ contained in bin $B_b^D$, i.e., the number of units falling into bin $b$. The objective of the BOSS optimization problem is to minimize the difference between $N(S^C, B_b^D)$ and $N(S^T, B_b^D)$ over all of the bins for all covariate clusters of interest, where any objective function that simultaneously minimizes these differences can be used to evaluate the distribution fit. The Balance Optimization Subset Selection with Bins (BOSS-B) problem is now formally stated:

Given: $K$ covariates; a fixed integer $N$; set $S^T$ of size $N$, randomly selected from the set $T$ of units represented by vectors $\{X_{1u}, X_{2u}, \ldots, X_{Ku}\}$, $u \in T$; set $C$ of units represented by vectors $\{X_{1u}, X_{2u}, \ldots, X_{Ku}\}$, $u \in C$, with $|C| > N$; a set of covariate clusters $\mathcal{D}$; and bins $B^D$ for each $D \in \mathcal{D}$.

Objective: find a subset $S^C \subset C$ of size $N$ such that

$$\sum_{D \in \mathcal{D}} \; \sum_{b = 1, 2, \ldots, |B^D|} \frac{\big(N(S^C, B_b^D) - N(S^T, B_b^D)\big)^2}{\max\big(N(S^T, B_b^D), 1\big)} \quad (4)$$

is minimized.

BOSS-B is a balance optimization problem. It exemplifies how the BOSS approach can be used for causal inference, with one measure of balance $M(S^T, S^C)$ expressed by (4).
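The binning scheme and objective (4) can be sketched in a few lines; the helper names below are illustrative, and uniform bin thresholds over the pooled range are assumed:

```python
import numpy as np

def bin_indices(X, R, lo, hi):
    """Convert covariate values into per-covariate bin numbers 0..R-1.

    X is an (n_units, K) array; lo and hi are the pooled ranges
    [L_k, U_k] over T ∪ C, and thresholds are R uniform cut points.
    """
    span = np.where(hi > lo, hi - lo, 1.0)
    return np.minimum(((X - lo) / span * R).astype(int), R - 1)

def diff_sqr(bins_T, bins_C, clusters, R):
    """Objective (4): sum over clusters D and bins b of
    (N(S^C, b) - N(S^T, b))^2 / max(N(S^T, b), 1)."""
    total = 0.0
    for D in clusters:                        # D is a tuple of covariate indices
        idx_T = np.zeros(len(bins_T), dtype=int)
        idx_C = np.zeros(len(bins_C), dtype=int)
        for k in D:                           # flatten the m-dimensional bin address
            idx_T = idx_T * R + bins_T[:, k]
            idx_C = idx_C * R + bins_C[:, k]
        n_T = np.bincount(idx_T, minlength=R ** len(D))
        n_C = np.bincount(idx_C, minlength=R ** len(D))
        total += float(np.sum((n_C - n_T) ** 2 / np.maximum(n_T, 1)))
    return total
```

Passing `clusters=[(0,), (1,), (2,)]` balances the three marginal distributions, while `clusters=[(0, 1)]` balances one joint distribution; both units' bins should be computed from the same pooled ranges before splitting into $S^T$ and $S^C$.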
In BOSS-B, assignments of treatment and control units into groups are determined such that a finite number of preselected marginal and/or joint distributions of covariates are optimally balanced, thereby isolating the effect of treatment from the marginal and/or joint effects of these covariates and reducing bias in the estimated expected difference between the treatment and control responses. The objective function (4) is similar in form to the chi-square test statistic, which lends additional meaning to the formulation. As more distributions become simultaneously balanced, which occurs as the number of bins increases, more accurate estimates of the treatment effect can be obtained. However, as more bins are used and the histogram resolution increases, optimizing (4) becomes more difficult, because fewer and fewer control groups can be identified as similar to the treatment group. Additionally, the number of required bins for a covariate cluster grows exponentially with the number of covariates in that cluster. Fortunately, this exponential growth is mitigated by the fact that the number of occupied bins for any covariate cluster is at most $|T| + |C|$. The decision version of BOSS-B is NP-complete through a polynomial many-one reduction from the Exact Cover by 3-Sets problem, which is known to be NP-complete (Garey and Johnson 1979); hence, the optimization version of BOSS-B is NP-hard (see the online supplement for a formal proof). However, for small problem instances, algorithms such as simulated annealing suffice to deliver good results in reasonable computing time. Note also that algorithms solving an instance of BOSS often encounter a large number of optimal or nearly optimal solutions, depending on the binning scheme that is used. As one might intuitively guess, there exist multiple subsets of the treatment and control pools (i.e., solutions to a balance optimization problem) that yield optimal or nearly optimal balance.
Swapping out a single unit for another often produces only small changes in the balance function, and even fairly large differences in subsets can result in similar balance values. Accordingly, in addition to finding the optimal balance, it is helpful to also examine the subsets that produce similarly balanced covariates and to estimate the spread of the distribution of the treatment effect. For a given instance of BOSS-B, refer to solutions with an objective function value of zero in (4) as perfectly optimized. A perfectly balanced solution (i.e., one that has exactly the same joint distribution of covariates in the control group as in the treatment group) is perfectly optimized in any measure of balance, though the reverse is not necessarily true; for example, balance on all of the marginal distributions does not generally imply balance on the joint distribution.

Three sources of error are inherent in the application of BOSS-B: error due to noise in the response functions for $Y^1$ and $Y^0$; error due to bin size (or the number of bins used); and error due to a nonzero objective function value (when a perfectly optimized solution is not found or does not exist). The first source of error is present in all problems, resulting from the uncertainty inherent in all processes in nature, and hence cannot be eliminated. However, given Assumption 1, the noise in the response has zero mean and averages to zero for sufficiently large treatment and control groups. The other two sources of error are not so well behaved; however, under certain assumptions, their impact can be limited. Ideally, one would like to obtain $S_N^C \subset C$ featuring perfect balance on the joint distribution of all covariates, $D = \{1, 2, \ldots, K\}$. Note that this condition is equivalent to perfect individual matching, which, if possible, one could find in polynomial time (in the sizes of $T$ and $C$, and $N$) using an assignment algorithm. In practice, however, this is rarely achievable for large $N$; therefore, suboptimal solutions may need to be considered, which is why working with observational data is a challenge. Fortunately, perfect balance on the joint distribution of all covariates may not be necessary for accurate inference: most real-world causal inference problems can be addressed using groups that offer good, albeit not perfect, balance, or using groups that are perfectly balanced on a more limited set of marginal and/or joint distributions of covariates. Theorem 1 illustrates the latter point.

2.3. Theoretical Aspects of BOSS-B

This section discusses how solutions to a balance optimization problem can be used to obtain estimates of ATT, and how the estimation bias is reduced as a function of the covariate clusters in BOSS-B (more specifically, the number of bins) and the quality of the solutions achieved for a given measure of balance. Without loss of generality, assume that $S^T = T$: in most real-world observational studies, treated units are rare, and hence all available treated units are included in the treatment group. Therefore, a solution to BOSS-B is a control group that is selected out of a larger control pool of units.

Theorem 1. Suppose that for any unit $u$, the response $Y_u^{1(0)}$ can be expressed as a sum of functions of the individual covariates,

$$Y_u^{1(0)} = \sum_{k = 1, 2, \ldots, K} h_k^{1(0)}(X_{ku}) + \epsilon^{1(0)}, \quad (5)$$

where random variable $\epsilon^{1(0)}$ represents noise, with $E(\epsilon^{1(0)}) = 0$. Suppose also that each function $h_k^{1(0)}(X_{ku})$ is locally Lipschitz continuous, such that for each $k = 1, 2, \ldots, K$,

$$\big|h_k^{1(0)}(x_1) - h_k^{1(0)}(x_2)\big| \le L_k^{1(0)} |x_1 - x_2|, \quad (6)$$

where $L_k^{1(0)}$ is a positive Lipschitz constant for the function $h_k^{1(0)}$. Consider an instance of BOSS-B with $S_N^T = T$, $N = |T|$, and $\mathcal{D} = \{\{1\}, \{2\}, \ldots, \{K\}\}$.
The bias that arises in the estimation of ATT using the estimator $\bar{Y}_{S_N^T}^1 - \bar{Y}_{S_N^C}^0$, obtained from a perfectly optimized solution $S_N^C \subset C$, then converges to zero as the number of bins in the sets $B^D$, $D \in \mathcal{D}$, approaches infinity telescopically (i.e., as the number of bins is increased by uniform sequential subpartitioning).

Proof. Consider the control group $S_N^{C(1)}$, a perfectly optimized solution to an instance of BOSS-B with fixed sets of bins $B^D$, $D \in \mathcal{D}$. Also, consider control group $S_N^{C(2)}$, a perfectly optimized solution to the same instance of the BOSS-B problem in which bin $B_r^D \in B^D$, for some $D = \{k\} \in \mathcal{D}$, $k \in \{1, 2, \ldots, K\}$, and $r \in \{1, 2, \ldots, R(k)\}$, is partitioned to form bins $B_{r1}^D$ and $B_{r2}^D$ such that $B_{r1}^D \cap B_{r2}^D = \emptyset$ and $B_{r1}^D \cup B_{r2}^D = B_r^D$. Define the sets $I_r = \{i\colon i \in S_N^T,\ X_{ki} \in B_r^D\}$, $J_r^{(1)} = \{j\colon j \in S_N^{C(1)},\ X_{kj} \in B_r^D\}$, and $J_r^{(2)} = \{j\colon j \in S_N^{C(2)},\ X_{kj} \in B_r^D\}$. Let $\Delta_1$, $\Delta_2$, and $\Delta$ denote the volumes of bins $B_{r1}^D$, $B_{r2}^D$, and $B_r^D$, respectively. Also, let $Z$ denote the number of control units in $S_N^{C(1)}$ falling into bin $B_r^D$, and let $Z_1$, $Z_2$ denote the numbers of control units in $S_N^{C(2)}$ falling into bins $B_{r1}^D$ and $B_{r2}^D$, respectively. By design, $\Delta = \Delta_1 + \Delta_2$ and $Z = Z_1 + Z_2$, and $|J_{r1}^{(2)}| = Z_1$, $|J_{r2}^{(2)}| = Z_2$, and $|I_r| = |J_r^{(1)}| = Z$.

Proposition 1 describes an approach for selecting treatment and control groups that ensures that $\bar{Y}_{S_N^T}^1 - \bar{Y}_{S_N^C}^0$ is an unbiased estimator of ATT. Using this notation, observe that $(1/|I_r|) \sum_{i \in I_r} Y_i^1$ is an unbiased estimator of $E(\bar{Y}_{S_N}^1 \mid \{T_i = 1\}_{i \in I_r})$, by (5). However, in general, $(1/|J_r^{(1)}|) \sum_{j \in J_r^{(1)}} Y_j^0$ is not an unbiased estimator of $E(\bar{Y}_{S_N}^0 \mid \{T_i = 1\}_{i \in I_r})$, because the exact values of covariate $k$ for the control units falling into a single bin may differ from the values for the treatment units in the same bin. As such, an imbalance is created within bin $B_r^D$, because the treatment and control values are not identically distributed within the bin. This imbalance results in a contribution $B(B_r^D)$ to the bias in the estimation of $E(\bar{Y}_N^0 \mid \{T_i = 1\}_{i \in I_r})$ using $S_N^{C(1)}$,

$$B(B_r^D) \equiv \frac{1}{|J_r^{(1)}|} \sum_{j \in J_r^{(1)}} E(Y_j^0) - \frac{1}{|I_r|} \sum_{i \in I_r} E(Y_i^0).$$

From (5) and (6), pairing the $Z$ units of $I_r$ with the $Z$ units of $J_r^{(1)}$,

$$\big|B(B_r^D)\big| = \frac{1}{Z}\,\bigg|E\bigg(\sum_{j \in J_r^{(1)}} h_k^0(X_{kj}) - \sum_{i \in I_r} h_k^0(X_{ki})\bigg)\bigg| \le \frac{1}{Z} \sum_{i \in I_r,\, j \in J_r^{(1)}} \big|h_k^0(X_{kj}) - h_k^0(X_{ki})\big| \le L_k^0 \Delta \equiv U^{(1)},$$

where the middle sum runs over the paired units and $U^{(1)}$ is an upper bound on the bias $B(B_r^D)$. Similarly, by (5), an imbalance within bins $B_{r1}^D$ and $B_{r2}^D$ results in contributions $B(B_{r1}^D)$ and $B(B_{r2}^D)$, respectively, to the bias in the estimation of $E(\bar{Y}_N^0 \mid \{T_i = 1\}_{i \in I_r})$ using $S_N^{C(2)}$, with

$$\big|B(B_{r1}^D)\big| + \big|B(B_{r2}^D)\big| \le \frac{1}{Z}\bigg(\sum_{i \in I_{r1},\, j \in J_{r1}^{(2)}} \big|h_k^0(X_{kj}) - h_k^0(X_{ki})\big| + \sum_{i \in I_{r2},\, j \in J_{r2}^{(2)}} \big|h_k^0(X_{kj}) - h_k^0(X_{ki})\big|\bigg).$$

Therefore, by (6),

$$\big|B(B_{r1}^D)\big| + \big|B(B_{r2}^D)\big| \le L_k^0\, \frac{Z_1 \Delta_1 + Z_2 \Delta_2}{Z_1 + Z_2} \equiv U^{(2)},$$

which is an upper bound on the bias $B(B_{r1}^D) + B(B_{r2}^D)$. Observe that for $Z_1 > 0$, $Z_2 > 0$, $\Delta_1 > 0$, and $\Delta_2 > 0$, $Z_1 \Delta_1 + Z_2 \Delta_2 < Z \Delta$, and hence $U^{(2)} < U^{(1)}$. Moreover, if bin $B_r^D$ is subpartitioned uniformly, which implies $\Delta_1 = \Delta_2$, then $U^{(2)} = U^{(1)}/2$.

Generalizing this argument to a telescopically increasing number of subpartitioned bins, let $U$ denote the bias in the estimation of $E(\bar{Y}_{S_N}^0 \mid \{T_u = 1\}_{u \in S_N})$ when no optimization is conducted and $S_N^C \equiv C$. Because $U$ is finite, for a perfectly optimized solution $S_N^C$ to the instance of BOSS-B with the set of all bins $\mathcal{B} = \bigcup_{D \in \mathcal{D}} B^D$, the total bias can be bounded, and it converges to zero as the number of bins $|\mathcal{B}|$ approaches infinity:

$$B \equiv \frac{1}{N} \sum_{u \in S_N^C} Y_u^0 - E(\bar{Y}_N^0 \mid \{T_u = 1\}_{u \in S_N}) \le \sum_{b \in \mathcal{B}} B(b) \le \frac{U}{|\mathcal{B}|} \to 0. \qquad \square$$

Theorem 1 assumes that the response function (5) is separable, meaning that it can be represented as a sum of functions of individual covariates. Although such an assumption may appear restrictive, this class of functions subsumes the class of extensively studied separable models given by

$$Y_u = \beta_0 + \beta_1 \Phi(X_{1u}) + \beta_2 \Phi(X_{2u}) + \cdots + \beta_K \Phi(X_{Ku}) + \epsilon.$$

Furthermore, in the linear modeling literature, if the response function includes a term that is a function of two or more covariates, say $X_{k_1 u} \cdot X_{k_2 u}$, then the response function can be converted to a linear model by introducing a new covariate that is the product of covariates $k_1$ and $k_2$. More generally, if the response function depends on several covariates jointly, say on $(X_{k_1 u}, X_{k_2 u}, \ldots, X_{k_d u})$, with $1 \le k_1 < k_2 < \cdots < k_d \le K$, then the response function can be transformed to satisfy the assumptions of Theorem 1 by introducing a new covariate that is the joint of $X_{k_1 u}, X_{k_2 u}, \ldots, X_{k_d u}$. Theorem 1 shows that under (5) and (6), as the number of bins in the BOSS-B problem grows and perfectly optimized solutions are identified, $\bar{Y}_{S_N^T}^1 - \bar{Y}_{S_N^C}^0$ monotonically converges to $E(\bar{Y}_N^1 - \bar{Y}_N^0 \mid \{T_u = 1\}_{u \in S_N})$, and hence gives the minimally biased estimator of ATT that can be obtained using the available observed data.

3. Computational Analysis

This section illustrates the theory of §2 with a simple numerical example; its contribution is more illustrative than fundamental. By setting up a computational model for a limited problem and using a generic optimization algorithm to attack it, the reader can visually inspect the dynamics of the proposed balance optimization and the convergence of the proposed estimator to the treatment effect.
It also provides grounds to discuss future computational challenges for BOSS. The simulated experiments presented illustrate that as a balance measure approaches its optimal value, the bias in the estimate of the treatment effect decreases; additionally, as the number of bins increases, (4) allows for more accurate estimation of the treatment effect.

3.1. Experimental Setup

To illustrate the BOSS-B approach, two data sets were created, designated data3c10k and data10c10k. Each data set consists of a treatment group of 500 units and a control pool of 10,000 units, using 3 and 10 covariates, respectively. The data sets were created by first randomly generating a pool of 5,000 potential treatment individuals and a pool of 10,000 control individuals, with the covariate values for each unit drawn from a normal distribution. Once the units were generated, each unit $i$ was assigned a response value using the expression

$$Y_i^{1(0)} = 10 + 7X_{1i} + 6X_{2i} + 5X_{3i} - 3X_{4i} + 3X_{5i} + 2X_{6i} + X_{7i} - X_{8i} + 0.5X_{9i} + 0.1X_{10i} + \epsilon_i, \quad (7)$$

where $\epsilon_i \sim N(0, 2)$. (The extra covariate terms are omitted for data3c10k.) Under this formulation, there is no treatment effect (i.e., exposure to treatment has no effect on the response): ATT = 0. Once the individuals were created, a treatment group of 500 units was drawn randomly but nonuniformly from the pool of potential treatment individuals. Individuals with covariate values in the tails of the covariate distributions were drawn with higher probability than those with values in the center, ensuring that the resulting treatment and control groups had different covariate distributions. Figure 3 shows the initial distributions in the treatment group and control pool for covariates 1, 2, and 3 of data3c10k. In these histograms, covariate values are separated into 32 uniformly sized bins.
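The experimental setup can be sketched as follows. The exponential tail-weighting used to draw the treatment group is an illustrative stand-in for the paper's unspecified nonuniform sampling rule, and $N(0, 2)$ is read as variance 2; all helper names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 10
pool_T = rng.normal(size=(5_000, K))    # potential treatment individuals
pool_C = rng.normal(size=(10_000, K))   # control pool

def response(X, rng):
    """Response model (7); there is no treatment effect, so ATT = 0."""
    beta = np.array([7, 6, 5, -3, 3, 2, 1, -1, 0.5, 0.1])
    # Noise variance 2 (assuming N(0, 2) in the text denotes the variance).
    return 10 + X @ beta + rng.normal(scale=np.sqrt(2), size=len(X))

y_T = response(pool_T, rng)
y_C = response(pool_C, rng)

# Draw a 500-unit treatment group, favoring units in the covariate tails
# (illustrative weighting: probability grows with distance from the mean).
w = np.exp(np.abs(pool_T).sum(axis=1))
w /= w.sum()
treat_idx = rng.choice(len(pool_T), size=500, replace=False, p=w)
X_T = pool_T[treat_idx]
```

Drawing with these weights leaves the treatment group's covariate distributions visibly heavier-tailed than the control pool's, which is the imbalance the optimization must then remove.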
The number of control units in a bin was normalized by a factor of 1/20 to account for the difference in size between the treatment group and the control pool. The histograms indicate that the covariate distributions of the treatment group differ from those of the control pool, particularly for the first two covariates.

Optimization was performed using a simulated annealing algorithm (Kirkpatrick et al. 1983). In the experiments, the preselected treatment group was used, and the desired control group size was 500 units. The first step in the algorithm is to bin the data: each unit is converted from a vector of covariate values $\{X_{1i}, X_{2i}, \ldots, X_{Ki}\}$ into a vector of bin numbers $\{X'_{1i}, X'_{2i}, \ldots, X'_{Ki}\}$, where $X'_{ki} = j$ if and only if $t_{j-1}^k \le X_{ki} \le t_j^k$ (i.e., unit $i$ falls into bin $j$ for covariate $k$). In the experiments, the bin thresholds were uniformly spaced across the covariate distributions, with $R(k)$ set to a given value (an input parameter) for all covariates $k = 1, 2, \ldots, K$. Moreover, a unique covariate cluster was created for each individual covariate; by Theorem 1, these covariate clusters are sufficient for generating an accurate estimate of ATT because of the separability of the response function (7). After binning the data, the simulated annealing algorithm begins with an initial control group consisting of a random subset of 500 units from the control pool. At each iteration, the algorithm attempts a 1-exchange, replacing one unit in the control group with an unselected unit from the control pool. If the exchange improves (4), it is accepted unconditionally; otherwise, it is accepted with some probability according to the input parameters. A random restart is applied when little progress has been made in (4) for some number of iterations, or after the algorithm identifies a perfectly optimized control group. The algorithm terminates after performing a preset number of iterations.
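The 1-exchange search just described can be sketched as a generic simulated annealing routine (this is not the paper's Algorithm 1; the Metropolis acceptance rule, geometric cooling schedule, and all parameter names are illustrative assumptions, and random restarts are omitted for brevity):

```python
import numpy as np

def anneal_control_group(n_pool, objective, n_select, iters=20_000,
                         temp=1.0, cooling=0.999, seed=0):
    """1-exchange simulated annealing over size-n_select subsets of a
    control pool of n_pool units. objective(mask) returns the balance
    measure, e.g. (4), for the control group given by the boolean mask;
    lower is better, and zero means perfectly optimized."""
    rng = np.random.default_rng(seed)
    mask = np.zeros(n_pool, dtype=bool)
    mask[rng.choice(n_pool, size=n_select, replace=False)] = True
    cur_val = objective(mask)
    best_mask, best_val = mask.copy(), cur_val
    for _ in range(iters):
        if best_val == 0.0:                        # perfectly optimized: stop
            break
        out_u = rng.choice(np.flatnonzero(mask))   # unit leaving the group
        in_u = rng.choice(np.flatnonzero(~mask))   # unit entering the group
        mask[out_u], mask[in_u] = False, True
        val = objective(mask)
        # Accept improving exchanges always, worsening ones with a
        # temperature-dependent probability (Metropolis rule).
        if val <= cur_val or rng.uniform() < np.exp((cur_val - val) / temp):
            cur_val = val
            if val < best_val:
                best_mask, best_val = mask.copy(), val
        else:
            mask[out_u], mask[in_u] = True, False  # undo the exchange
        temp *= cooling
    return best_mask, best_val
```

With `objective` set to the binned measure (4), each iteration evaluates one candidate 1-exchange, exactly as in the procedure above.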
For more details, see Algorithm 1 in the paper's online supplement.

3.2. Experimental Results

Several experiments were conducted on the two data sets (data3c10k and data10c10k) using uniformly spaced bins with R(k) = 4, 8, 16, and 32 for all k = 1, 2, …, K. This sequence was chosen because it forms a bin scheme where each successive set of bins simply subdivides the previous set of bins in half, creating a telescopic increase in the number of bins. For each data set and bin scheme, 25 runs of the simulated annealing algorithm were performed, with a different random seed used for each run. Throughout a run, every 50th identified control group or perfectly optimized control group was processed and stored, along with Kolmogorov–Smirnov (KS) two-sample goodness-of-fit test statistics for the treatment and control covariate distributions. For data sets with multiple covariates, the KS test statistic values were averaged over all the covariates. Upon completion of the experiments, any duplicated control groups were removed. This was implemented by assigning a hash number to each control group based on its units. Note that because the search process moves by 1-exchange, each successive control group that is reported by the algorithm will have a high degree of overlap with the previously reported control group. To prevent overlap among the perfectly optimized solutions, random restarts were performed after each perfectly optimized solution was identified. This facilitates the generation of perfectly optimized control groups with minimal overlap between them. Table 1 summarizes the features of optimal solutions obtained in solving the data3c10k instance. In the table, the objective function in (4) is referred to as Difference Squared (DiffSqr).
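The de-duplication step described above (assigning a hash number to each control group based on its units) can be implemented with any order-independent hash; `frozenset` hashing is one simple way to realize the idea, sketched here as an assumption rather than the authors' implementation:

```python
def group_hash(members):
    # Order-independent hash of a control group's unit ids, so that the
    # same set of units always maps to the same key for de-duplication.
    return hash(frozenset(members))

# Two orderings of the same group collide, as desired; distinct groups
# (almost surely) do not.
seen = set()
for group in ([5, 2, 9], [9, 5, 2], [1, 2, 3]):
    seen.add(group_hash(group))
```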
Nikolaev et al.: BOSS: An Alternative Approach for Causal Inference. Operations Research 61(2), pp. 398–412, © 2013 INFORMS. Downloaded from informs.org by [192.17.144.156] on 30 June 2014, at 06:17. For personal use only, all rights reserved.

Column Bins specifies the number of bins used (per covariate), and the column Observations reports the number of perfectly optimized solutions that were identified. The remaining two columns list the treatment effect and the KS two-sample test statistic (averaged over the covariates), respectively. No results are presented for data10c10k because perfectly optimized solutions were not obtained for this data set when more than four bins per covariate were used.

Figure 3. Initial covariate distributions of treatment group and control pool (normalized) for data3c10k. [Histograms of covariates 1, 2, and 3; covariate values are separated into 32 uniformly sized bins, and the vertical axis reports the number of individuals with covariate values in each bin range for the treatment group and the normalized control pool.]

Table 1. Optimal solutions for data3c10k with respect to DiffSqr objective.

                           Treatment effect        Kolmogorov–Smirnov
Bins    Observations       Mean        SD          Mean        SD
4          25,214          2.2904      0.2684      0.1155      0.0090
8          17,404          1.0434      0.1605      0.0825      0.0072
16          7,689          0.2380      0.1098      0.0369      0.0038
32            833          0.0122      0.0900      0.0274      0.0027
64              0          N/A         N/A         N/A         N/A

Table 1 shows that as the number of bins for each covariate increases, the estimator mean tends toward the true ATT value of zero. The KS test statistic values also indicate an increasingly higher level of balance in the covariate distributions of the treatment and control groups.
Table 2 shows the difference in covariate means for the treatment group and control pool, as well as the difference in covariate means for the treatment group and an optimized control group obtained by solving BOSS-B with R(k) = 32 for all k = 1, 2, …, K. Observe that the bias due to covariate imbalance in the treatment group and control pool is largely removed by the optimization. Next, for a given data set and number of bins, all recorded control groups were sorted by their scores in (4). Then, control groups in a fixed range of scores were aggregated and their estimated treatment effects and other relevant statistic values were averaged. Tables 3 and 4 display these average values obtained with R(k) = 32 for all k = 1, 2, …, K. Figures 4 and 5 show the trends for the treatment effect and its standard deviation as the objective function value decreases. In general, as the score for (4) approaches zero, the estimated treatment effect tends toward 0, the true ATT value. Despite the inability to obtain perfectly optimized solutions for data10c10k, accurate ATT estimates are still obtained when the objective function is close to 0.

Table 2. Difference of covariate means for covariates before and after optimization with R(k) = 32.

Data set      Covariate   Before optimization   After optimization
data3c10k         1             0.869                 0.009
                  2             0.862                 0.001
                  3             0.160                 0.007
data10c10k        1             0.539                 0.007
                  2             0.553                 0.014
                  3             0.420                 0.001
                  4            −0.355                 0.002
                  5             0.446                 0.028
                  6             0.346                 0.007
                  7             0.407                 0.010
                  8            −0.180                 0.005
                  9             0.208                 0.002
                 10             0.152                 0.009

Table 3. Solutions for data3c10k ranked by DiffSqr objective using 32 bins.

                               Treatment effect        Kolmogorov–Smirnov
OF range        Observations   Mean        SD          Mean        SD
≤1e−07               833       0.0122      0.0900      0.0274      0.0027
1e−07–1.0          4,377       0.0679      0.0950      0.0282      0.0028
1.0–2.0            4,675       0.1478      0.1111      0.0294      0.0029
2.0–3.0            3,747       0.2291      0.1173      0.0312      0.0032
3.0–4.0            3,098       0.2948      0.1183      0.0328      0.0034
4.0–5.0            2,751       0.3596      0.1233      0.0344      0.0035
5.0–6.0            2,308       0.4085      0.1304      0.0356      0.0035
6.0–7.0            2,022       0.4666      0.1303      0.0370      0.0036
7.0–8.0            1,873       0.5173      0.1306      0.0381      0.0037
8.0–9.0            1,670       0.5584      0.1315      0.0394      0.0037
9.0–10.0           1,544       0.5881      0.1355      0.0402      0.0038
10.0–20.0         10,937       0.7889      0.1790      0.0449      0.0047
20.0–30.0          8,313       1.1213      0.1828      0.0528      0.0044
30.0–40.0          7,009       1.4045      0.1974      0.0597      0.0046
40.0–50.0          6,148       1.6617      0.1956      0.0659      0.0045
50.0–60.0          5,416       1.8779      0.2050      0.0713      0.0047
60.0–70.0          4,910       2.0778      0.2125      0.0762      0.0048
70.0–80.0          4,437       2.2490      0.2160      0.0808      0.0049
80.0–90.0          3,920       2.4258      0.2159      0.0854      0.0049
90.0–100.0         3,745       2.5803      0.2250      0.0892      0.0052

Table 4. Solutions for data10c10k ranked by DiffSqr objective using 32 bins.

                               Treatment effect        Kolmogorov–Smirnov
OF range        Observations   Mean        SD          Mean        SD
≤2.0                   0       N/A         N/A         N/A         N/A
2.0–3.0                1       0.2168      0.0000      0.0260      0.0000
3.0–4.0               25       0.2409      0.1056      0.0251      0.0014
4.0–5.0              116       0.2809      0.1113      0.0251      0.0016
5.0–6.0              229       0.3567      0.1065      0.0255      0.0014
6.0–7.0              332       0.4024      0.1198      0.0259      0.0013
7.0–8.0              327       0.4467      0.1189      0.0262      0.0016
8.0–9.0              377       0.4914      0.1200      0.0267      0.0016
9.0–10.0             350       0.5159      0.1225      0.0271      0.0015
10.0–20.0          3,305       0.7416      0.1719      0.0295      0.0021
20.0–30.0          3,105       1.0607      0.1679      0.0328      0.0021
30.0–40.0          2,737       1.3523      0.1748      0.0359      0.0021
40.0–50.0          2,677       1.6002      0.1855      0.0384      0.0022
50.0–60.0          2,608       1.8155      0.1970      0.0409      0.0022
60.0–70.0          2,649       2.0576      0.1899      0.0434      0.0023
70.0–80.0          2,499       2.2616      0.1956      0.0456      0.0024
80.0–90.0          2,527       2.4404      0.2036      0.0477      0.0024
90.0–100.0         2,221       2.6453      0.2113      0.0499      0.0024

Figure 4. data3c10k with 32 bins: Average treatment effect (TE, with TE ± SD) for varying objective function ranges.

Figure 5. data3c10k with 32 bins: Average treatment effect (TE, with TE ± SD) for varying objective function ranges.

Note that in Figures 4 and 5, there is a break where the objective function range changes from increments of 1 to increments of 10 between 9–10 and 10–20. This break is shown with bars in the plot and on the axis. Also, results from control groups with scores for (4) that were greater than 100 are available in the online supplement.

3.3. Comparison with an Alternate Balance Measure

The BOSS framework is not limited to just the BOSS-B formulation presented in §2. Indeed, the goal of the BOSS framework is to handle any proposed measure of balance M(S_T, S_C). For example, one can use a difference of means as an optimization objective. Let μ(S, k) = (1/|S|) Σ_{s∈S} X_{ks} be the mean value of covariate k across the individuals in S.
Then, a BOSS objective is to find a control group S_C ⊂ C with |S_C| = |T| that minimizes

Σ_{k=1}^{K} |μ(S_C, k) − μ(T, k)|.   (8)

Note that such analysis was done by Rubin (1973) for one covariate, where it was referred to as mean matching. With BOSS objective (8), no preprocessing of the data is necessary, because no binning is performed (compared to BOSS-B). Table 5 shows the performance of objective (8), referred to as DOM for difference of means, in determining the treatment effect across a wide range of solutions obtained during the simulated annealing algorithm execution. As the score for (8) approaches 0, the estimated treatment effect tends toward the true treatment effect of 0, which is as expected given the linear nature of the response function (7). Results for control groups with scores for (8) greater than 1.00 are available in the online supplement. Observe that using (8) as a BOSS objective compared to (4) results in more accurate ATT estimation. This observation might lead one to assume that (8) is better than (4) at capturing balance. However, the KS scores are worse with (8), indicating that although the covariate means are close, the covariate distributions are not as balanced as those for the solutions obtained with (4). An additional set of experiments was performed to illustrate the importance of balancing the distributions.
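The difference-of-means objective (8) is inexpensive to evaluate, which is one reason no preprocessing is needed. A minimal sketch (array names are hypothetical; rows are units, columns are covariates):

```python
import numpy as np

def dom(treat_X, ctrl_X):
    # Objective (8): sum over covariates of the absolute difference in
    # covariate means between the treatment group and a candidate
    # control group.
    return float(np.abs(treat_X.mean(axis=0) - ctrl_X.mean(axis=0)).sum())
```

Identical groups score 0, and shifting every covariate of the control group by a constant raises the score by K times that constant.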
These experiments used a new data set, data3c10kn, created by taking the same individuals from data3c10k and using the response function

Y_i = 10 + e^{X_{1i}} + X_{2i}^2 + 0.1X_{3i}^3 + ε_i.   (9)

Table 5. Solutions for data10c10k ranked by DOM objective.

                               Treatment effect        Kolmogorov–Smirnov
OF range        Observations   Mean        SD          Mean        SD
≤0.001                 0       N/A         N/A         N/A         N/A
0.001–0.01        12,004       0.0596      0.0857      0.4101      0.0258
0.01–0.02         66,859       0.0789      0.0913      0.4167      0.0276
0.02–0.03         94,364       0.1115      0.0916      0.4201      0.0272
0.03–0.04         94,269       0.1548      0.0920      0.4200      0.0264
0.04–0.05         83,005       0.2015      0.0938      0.4199      0.0265
0.05–0.10        286,406       0.3434      0.1323      0.4236      0.0266
0.10–0.20        374,035       0.7421      0.2066      0.4419      0.0276
0.20–0.30        290,608       1.2774      0.2244      0.4721      0.0291
0.30–0.40        255,131       1.7747      0.2439      0.5027      0.0289
0.40–0.50        238,708       2.2529      0.2560      0.5347      0.0306
0.50–0.60        244,812       2.7030      0.2688      0.5667      0.0301
0.60–0.70        241,576       3.1296      0.2770      0.5999      0.0315
0.70–0.80        226,956       3.5528      0.2829      0.6350      0.0313
0.80–0.90        229,046       3.9600      0.2831      0.6688      0.0312
0.90–1.00        235,354       4.3380      0.2934      0.7032      0.0313

Five runs of the simulated annealing algorithm were performed with data3c10kn, using both (4) with R(k) = 32 for all k = 1, 2, …, K and (8). The best solutions obtained from these runs are reported in the first two rows of Table 6. In this case, the best solutions obtained with (4) lead to better estimates of ATT than those obtained with (8). Optimizing (4) results in more accurate estimation because Theorem 1 still holds for (9) due to the separability of the covariate terms. Moreover, the KS scores are better, indicating better balance for the covariate distributions. The function (8) can be improved by incorporating higher moments of the distributions, such as the variance. Let s²(S, k) = (1/(|S| − 1)) Σ_{s∈S} (X_{ks} − μ(S, k))² be the unbiased sample variance of covariate k across the individuals in S.
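The per-covariate moment computations just defined can be sketched as follows; `dom_dov` is an illustrative combination of first- and second-moment imbalance (absolute or squared mean differences plus absolute variance differences), not necessarily the authors' exact formulation:

```python
import numpy as np

def moments(X):
    # Per-covariate sample mean mu(S, k) and unbiased sample
    # variance s^2(S, k), with rows as units and columns as covariates.
    return X.mean(axis=0), X.var(axis=0, ddof=1)

def dom_dov(treat_X, ctrl_X, square_means=False):
    # Illustrative variance-augmented balance score: mean-difference term
    # (absolute, or squared when square_means=True) plus the summed
    # absolute differences of unbiased sample variances.
    mt, vt = moments(treat_X)
    mc, vc = moments(ctrl_X)
    d = mt - mc
    mean_term = float((d ** 2).sum()) if square_means else float(np.abs(d).sum())
    return mean_term + float(np.abs(vt - vc).sum())
```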
Then two additional BOSS objectives can be defined as

min Σ_{k=1}^{K} |μ(S_C, k) − μ(T, k)| + Σ_{k=1}^{K} |s²(S_C, k) − s²(T, k)|   (10)

and

min Σ_{k=1}^{K} (μ(S_C, k) − μ(T, k))² + Σ_{k=1}^{K} |s²(S_C, k) − s²(T, k)|.   (11)

These two objectives aim at finding control groups with the first and second moments of the covariate distribution as close as possible to those of the treatment group. Objectives (10) and (11) differ in the weight they place on the difference of means, with (11) squaring this difference for each covariate. For data3c10kn, the results of optimizing these two objectives (referred to as DOM + DOV and DOM2 + DOV) are much better than those obtained for (8), as shown in Table 6. In a similar manner, higher moments can be included in the objective being optimized. Including higher moments ensures that the two distributions are closer and closer together, which is exactly what the BOSS-B formulation aims to achieve, albeit in a more direct manner.

Table 6. Best solutions for data3c10kn for various objectives.

                                             Treatment effect        Kolmogorov–Smirnov
Objective      OF range      Observations    Mean        SD          Mean        SD
DiffSqr(32)    ≤1e−07             156        −0.0170     0.0875      0.0804      0.0078
DOM            ≤0.001           7,086        −1.3889     0.3395      0.2770      0.0226
DOM + DOV      ≤0.001             357         0.0392     0.0959      0.1669      0.0179
DOM2 + DOV     ≤0.001             403         0.0986     0.1057      0.1435      0.0121

3.4. Comparison with Matching Methods

To demonstrate the performance of BOSS with respect to existing matching methods, the Matching package (Sekhon 2011) was used. The package allows for matching based on propensity score, matching directly on the values of the covariates, or some combination of the two. For the purposes of testing, a standard logistic regression model was used to estimate the propensity score. Table 7 compares the best solutions (as defined by the objective function value, with ties broken arbitrarily) obtained by the BOSS procedure for objectives (4) with R(k) = 32 for all k = 1, 2, …, K, (8), (10), and (11) with the solutions returned by both propensity score matching and matching on the covariates for the data3c10kn data set (with the nonlinear response function (9)). Column Objective lists the method used to obtain the solution, column OF Score lists the function value of the best solution for the BOSS methods (no objective score is provided by the Matching package), column Treatment Effect lists the estimate of the treatment effect computed from the best solution, and columns Kolmogorov–Smirnov Mean and Max list the average and maximum values of the KS test statistic for the covariate distributions in the treatment group and the best control group.

Table 7. Comparison of single best solutions for BOSS and matching for data3c10kn.

                                                  Kolmogorov–Smirnov
Objective       OF score      Treatment effect    Mean      Max
DiffSqr(32)     0.0               −0.1142         0.025     0.026
DOM             1.50e−5           −0.9877         0.093     0.118
DOM + DOV       3.77e−4            0.0271         0.062     0.088
DOM2 + DOV      2.69e−4            0.1154         0.045     0.060
Prop. score     N/A               −1.3434         0.125     0.158
Cov. matching   N/A                0.0943         0.025     0.034

The propensity score model fares the worst in producing accurate estimates of the treatment effect, whereas direct matching and BOSS with objective functions (4), (10), and (11) all produce good results. The reason for the poor performance of the propensity score approach is the use of a linear model for estimating the propensity score, whereas the actual response function is nonlinear. A better model for estimating the propensity score would potentially improve these results. It should also be noted that the propensity score approach produces the worst balance as measured by the KS statistic, whereas BOSS with objective function (8) also produces unsatisfactory levels of balance, with BOSS with objective function (4) and covariate matching performing the best. A difficulty of matching on the covariates is that close matches become difficult to find as the number of covariates increases. To demonstrate this, the matching procedures were also run on the data10c10k data set. Table 8 shows the best solutions obtained by the BOSS approaches and the matching approaches. Because data10c10k uses a linear response function (7), both propensity score matching and BOSS with (8) perform better than they did in the previous case. This improvement occurs because balancing covariate means for a linear response function produces accurate ATT estimates. Estimating the propensity score with a linear model will accomplish this indirectly, whereas optimizing (8) will accomplish this directly. On the other hand, the effectiveness of covariate matching is greatly reduced due to the difficulty of finding close matches on 10 different covariates. Finally, BOSS with (4) is seen to produce the best covariate balance as measured by the KS test statistic, whereas the matching approaches produce the worst covariate balance.

Table 8. Comparison of single best solutions for BOSS and matching for data10c10k.

                                                  Kolmogorov–Smirnov
Objective       OF score      Treatment effect    Mean      Max
DiffSqr(32)     2.9502             0.2168         0.026     0.036
DOM             0.0029             0.1294         0.039     0.056
DOM + DOV       0.0157             0.1857         0.037     0.048
DOM2 + DOV      0.0158             0.1947         0.045     0.052
Prop. score     N/A               −0.1148         0.066     0.114
Cov. matching   N/A                2.818          0.067     0.088

3.5. Discussion of Results

Inspection of the reported results, with the goal of evaluating the potential effectiveness of the BOSS approach, shows that the conducted experiments illustrate the theory of §2 well.
The simulated annealing algorithm was able to perform well for BOSS-B and several other objectives, which suggests that specialized algorithms could be much more effective and efficient in finding optimal balance. Additionally, the BOSS approach performed favorably when compared with some of the existing matching methods proposed in the literature. The accurate estimates of ATT produced by BOSS in these experiments suggest that BOSS may be a viable approach to successfully determine whether or not a treatment effect exists in problems that approximate real-world scenarios for which observational data exist. For the BOSS-B formulation in particular, as R(k) increases, (4) provides a better measure of covariate balance, and hence a better estimate of the treatment effect. However, as R(k) increases, it also becomes more difficult to identify control groups that are perfectly optimized with respect to (4). Certainly there are improvements that can be made in terms of the optimization process, but determining the appropriate value for R(k), and even the appropriate bin thresholds, will be a major factor as well. For the former, Cochran (1968) states that for one covariate, subclassification with five categories is sufficient to remove about 90% of the existing bias under certain conditions. Rosenbaum and Rubin (1983) present similar results when subclassifying on the propensity score. Determining the appropriate locations for bin thresholds will depend on the nature of the data. See Iacus et al. (2012) for further discussion of these issues. Another issue is determining which covariate clusters to use. In the experiments presented here, the covariate clusters were chosen based on knowing the separability of the response function. In a real-world problem, the response function will almost certainly be unknown, and therefore some guesswork will be involved in appropriately picking the covariate clusters.
For the general BOSS problem, there remains significant work to be done in determining appropriate balance measures for optimization. In the simulated example problems considered here, the difference-of-means objective (8) was sufficient for a separable linear response function, but not for a separable nonlinear one. Although incorporating the variance into the objective (10) yielded more accurate results for the nonlinear response function, this may not always be the case. Determining exactly what balance measures should be optimized remains an open problem.

4. Research Directions

BOSS introduces a new paradigm for developing an analytical toolbox based on techniques from operations research, creating a solution methodology in which human bias, associated, for example, with defining distance measures for matching or guessing the form of a regression model, is eliminated, and the accuracy of treatment effect estimation is limited solely by the complexity of an optimization problem (NP-hard) and the available computational power. To make a connection between the balanced marginal distributions and the balanced joint distributions of covariates, the concept of copulas (Nelsen 1999) may be useful if a copula family can be designed to incorporate continuous and categorical covariate values simultaneously with a sizable number of parameters. In many cases, however, preserving the same covariance structure over the covariate values in the control and treatment groups might suffice. For example, if a treatment group consists only of pairs AA and BB, it would have the same marginal distributions as a control group with pairs AB and BA, because both A and B appear twice; the joint distributions, however, would not align. Examining covariance structures would identify and help alleviate this issue.
One approach would be to minimize the covariance matrix difference directly, incorporating it into BOSS as part of the objective function or as a constraint. Note that some widely used matching approaches (e.g., propensity score matching) operate under the Stable Unit Treatment Value Assumption (SUTVA), which is violated when observations on one unit are affected by the particular assignment of treatment to other units. The BOSS approach also relies on this strong assumption, even though it may not hold in real observational studies and randomized experiments. The issue of space traversal, or how well BOSS explores the space of available control groups, is also a rich area for future exploration. For algorithms that generate a large number of optimal or near-optimal solutions, ensuring that these solutions are sufficiently diverse will allow for better estimates of the distribution of the treatment effect. One way in which this can be accomplished is by iteratively running the BOSS algorithm, finding an optimal control group, removing the members of the control group from the control pool, and then rerunning the BOSS algorithm using the smaller control pool. Alternatively, control individuals can be prevented from being used in a control group after appearing in some number of other identified control groups. In problems with a large number of covariates and/or covariate clusters to balance, it is unlikely that perfectly optimized control groups exist when using even a moderate number of bins for each covariate. Therefore, further research on binning-based measures of balance is required, and bounds are needed on the quality of a control group when it is not perfectly optimized. In the simulated experiments reported in §3, it was observed that many control groups that were near-optimal led to the correct decision with regard to the effectiveness of treatment, although the exact dynamics of this phenomenon are not completely clear.
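The pool-shrinking diversification scheme described above can be sketched generically; `find_group` stands in for any BOSS search routine and is a hypothetical callable that returns a control group from the remaining pool, or None when no group can be formed:

```python
def diverse_groups(pool_ids, find_group, n_groups):
    # Iteratively search for a control group, then remove its members from
    # the pool before the next search, so successive groups cannot overlap.
    remaining = set(pool_ids)
    groups = []
    for _ in range(n_groups):
        group = find_group(sorted(remaining))
        if group is None:
            break
        groups.append(group)
        remaining -= set(group)
    return groups
```

With a toy search routine that simply takes the first two remaining units, a pool of six units yields three pairwise-disjoint groups before the pool is exhausted.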
Alternate ways to assess the quality of a control group, in addition to the objectives presented here, should also be considered. Additionally, developing algorithms to optimize directly on covariate balance measures such as the Kolmogorov–Smirnov two-sample test statistic, instead of using approximation techniques such as binning, is a promising direction. In the current implementation, using the KS score instead of objective (4) caused the search process to stall and fail to make significant progress. This suggests that a 1-exchange neighborhood is insufficient when used in conjunction with the KS score. For BOSS to be useful in practice, computational tools need to be developed that can analyze the distribution(s) of the designed estimator(s). Besides point estimation, social scientists often resort to hypothesis testing as well as building confidence intervals, tasks for which estimating the standard error becomes important. Although our computational investigations indicate that the distribution of the BOSS estimators presented in this paper appears to be Gaussian, more research is required to establish this result theoretically for the subset-selection-based approach. The challenges presented should be addressed simultaneously by research communities across various domains of science. Statisticians might be interested in developing a copula approach for the balancing of joint distributions, whereas operations researchers and computer scientists might work on more efficient optimization algorithms. Opportunities for interdisciplinary collaboration may prove to be fruitful as this research direction continues to expand and evolve.
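For reference, the two-sample KS statistic used as the balance diagnostic throughout §3 (and proposed above as a direct optimization target) is the largest vertical gap between the two empirical CDFs; a minimal sketch, computed directly from the samples without binning:

```python
import numpy as np

def ks_stat(a, b):
    # Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)| over
    # all observed values x, where F is the empirical CDF.
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())

def avg_ks(treat_X, ctrl_X):
    # Average the statistic over covariate columns, as in Section 3's tables.
    return float(np.mean([ks_stat(treat_X[:, k], ctrl_X[:, k])
                          for k in range(treat_X.shape[1])]))
```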
Supplemental Material

Supplemental material to this paper is available at http://dx.doi.org/10.1287/opre.1120.1118.

Acknowledgments

The authors thank Alexander Shapiro, the associate editor, and two anonymous referees for their helpful comments, which greatly improved the presentation of this paper and led to more substantial computational results. This research has been supported in part by the National Science Foundation [SES-0849223 and SES-0849170]. The second author was also supported in part by the Air Force Office of Scientific Research [FA9550-10-1-0387]. The fourth author was supported by the Department of Defense (DoD) through the National Defense Science and Engineering Graduate Fellowship (NDSEG) Program (32 CFR 168a). This material is based upon work supported in part by (while serving at) the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, the United States Air Force, or the United States Government. The computational work was conducted with support from the Simulation and Optimization Laboratory at the University of Illinois.

References

Abadie A, Gardeazabal J (2003) The economic costs of conflict: A case study of the Basque country. Amer. Econom. Rev. 93(1):112–132.
Abadie A, Diamond A, Hainmueller J (2010) Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. J. Amer. Statist. Assoc. 105(490):493–505.
Cho WKT, Sauppe JJ, Nikolaev AG, Jacobson SH, Sewell EC (2011) An optimization approach to matching and causal inference. Technical report, University of Illinois at Urbana–Champaign, Urbana, IL.
Cochran WG (1968) Effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics 24(2):295–313.
da Veiga PV, Wilder RP (2008) Maternal smoking during pregnancy and birthweight: A propensity score matching approach. Maternal and Child Health J. 12(2):194–203.
Dawid AP (1979) Conditional independence in statistical theory. J. Roy. Statist. Soc. Ser. B 41(1):1–31.
Diamond A, Sekhon JS (2010) Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Technical report, Department of Political Science, University of California, Berkeley, Berkeley, CA. Accessed July 2011, http://sekhon.berkeley.edu/papers/GenMatch.pdf.
Garey MR, Johnson DS (1979) Computers and Intractability: A Guide to the Theory of NP-Completeness (Freeman and Company, San Francisco).
Hainmueller J (2012) Entropy balancing: A multivariate reweighting method to produce balanced samples in observational studies. Political Anal. 20(1):25–46.
Hellerstein J, Imbens G (1999) Imposing moment restrictions from auxiliary data by weighting. Rev. Econom. Statist. 81(1):1–14.
Herron MC, Wand J (2007) Assessing partisan bias in voting technology: The case of the 2004 New Hampshire recount. Electoral Stud. 26(2):247–261.
Holland PW (1986) Statistics and causal inference. J. Amer. Statist. Assoc. 81(396):945–960.
Iacus SM, King G, Porro G (2012) Causal inference without balance checking: Coarsened exact matching. Political Anal. 20(1):1–24.
Imai K (2005) Do get-out-the-vote calls reduce turnout? The importance of statistical methods for field experiments. Amer. Political Sci. Rev. 99(2):283–300.
Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671–680.
Morris C (1985) A finite selection model for experimental design of the health insurance study. J. Econometrics 11(1):43–61.
Nelsen RB (1999) An Introduction to Copulas (Springer, New York).
Reinisch LM, Sanders SA, Mortensen EL, Rubin DB (1995) In utero exposure to phenobarbital and intelligence deficits in adult men. J. Amer. Medical Assoc. 274(19):1518–1525.
Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70(1):41–55.
Rosenbaum PR, Rubin DB (1985) Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Amer. Statist. 39(1):33–38.
Rosenbaum PR, Ross RN, Silber JH (2007) Minimum distance matched sampling with fine balance in an observational study of treatment for ovarian cancer. J. Amer. Statist. Assoc. 102(477):75–83.
Rubin DB (1973) Matching to remove bias in observational studies. Biometrics 29(1):159–183.
Rubin DB (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psych. 66(5):688–701.
Rubin DB (1978) Bayesian inference for causal effects: The role of randomization. Ann. Statist. 6(1):34–58.
Rubin DB (1991) Practical implications of modes of statistical inference for causal effects and the critical role of the assignment mechanism. Biometrics 47(4):1213–1234.
Rubin DB (2006) Matched Sampling for Causal Effects (Cambridge University Press, New York).
Sekhon JS (2004) The varying role of voter information across democratic societies. Working paper, Department of Political Science, University of California, Berkeley, Berkeley, CA. Accessed January 2012, http://sekhon.berkeley.edu/papers/SekhonInformation.pdf.
Sekhon JS (2011) Multivariate and propensity score matching software with automated balance optimization: The Matching package for R. J. Statist. Software 42(7):1–52. Accessed January 2012, http://www.jstatsoft.org/v42/i07.
Terrie L (2008) Using matching to assess the effect of electoral rules on the presence of the elderly in national legislatures. Poster presented at 2008 Political Methodology Meetings, Society for Political Methodology, American Political Science Association, Washington, DC.

Alexander G. Nikolaev is an assistant professor in the Department of Industrial and Systems Engineering at the University at Buffalo. His research interests include stochastic optimization, statistical inference, and social network modeling.

Sheldon H. Jacobson is a professor and director of the Simulation and Optimization Laboratory in the Department of Computer Science at the University of Illinois. He has a diverse set of basic and applied research interests, including problems related to optimal decision making under uncertainty, discrete optimization, causal inference with observational data, aviation security, public health policy (immunization, transportation and obesity, cell phone ban effectiveness), March Madness bracketology, and forecasting the outcome of the United States presidential election.

Wendy K. Tam Cho is a professor in the Department of Political Science and Department of Statistics, and Senior Research Scientist at the National Center for Supercomputing Applications, all at the University of Illinois at Urbana–Champaign.

Jason J. Sauppe is a Ph.D. candidate in the Department of Computer Science at the University of Illinois. His current research interests include mathematical programming, discrete optimization, and approximation.

Edward C. Sewell is a Distinguished Research Professor of Mathematics and Statistics at Southern Illinois University at Edwardsville. His current research interests are combinatorial optimization and health applications.
Netw Model Anal Health Inform Bioinforma (2014) 3:69 DOI 10.1007/s13721-014-0069-7

ORIGINAL ARTICLE

Towards evaluating and enhancing the reach of online health forums for smoking cessation

Michael Stearns · Siddhartha Nambiar · Alexander Nikolaev · Alexander Semenov · Scott McIntosh

Received: 10 December 2013 / Revised: 29 August 2014 / Accepted: 6 September 2014
© Springer-Verlag Wien 2014

Abstract Online pro-health social networks facilitating smoking cessation through web-assisted interventions have flourished in the past decade. In order to properly evaluate and increase the impact of this form of treatment on society, one needs to understand and be able to quantify its reach, as defined within the widely adopted RE-AIM framework. In the online communication context, user engagement is an integral component of reach. This paper quantitatively studies the effect of engagement on the users of the Alt.Support.Stop-Smoking forum that served the needs of an online smoking cessation community for more than 10 years. The paper then demonstrates how online service evaluation and planning by social network analysts can be applied towards strategic interventions targeting increased user engagement in online health forums. To this end, the challenges and opportunities are identified in the development of thread recommendation systems for effective and efficient spread of healthy behaviors, in particular smoking cessation.

Author affiliations: M. Stearns, S. Nambiar, and A. Nikolaev (corresponding author, [email protected]), Department of Industrial and Systems Engineering, University at Buffalo (SUNY), Buffalo, NY, USA ([email protected]; [email protected]). A. Semenov, Department of Mathematical Information Technology, University of Jyväskylä, Jyväskylä, Finland ([email protected]). S. McIntosh, Department of Public Health Sciences, University of Rochester, Rochester, NY, USA ([email protected]).
Keywords Social network analysis · Smoking cessation · Online forum communication · RE-AIM framework · Reach · Engagement · Intervention modeling

1 Introduction

Tobacco use is one of several individual modifiable health behaviors, including poor diet, alcohol misuse, and physical inactivity, identified by the World Health Organization as leading risk factors for global disease burden (Lim et al. 2012; Narayan et al. 2010; Scarborough et al. 2011). The development of cost-effective public health initiatives, capable of reducing the rate of tobacco use at the population level, is of great importance. Web-assisted tobacco interventions (WATIs) represent one potential solution to this challenge, with the expansion of Internet access globally leading an increasing number of individuals to turn to them in place of, or as an adjuvant to, traditional forms of treatment (Selby et al. 2010). They provide a cost-effective medium for delivering targeted social support to a wide audience (Norman et al. 2008). Models and methods capable of describing the dynamics of social interactions and influence within such communities are primed to become part of the health policy-maker's toolbox.

Intentionally created online networks for smoking cessation have existed for over two decades, with newsgroups such as Alt.Support.Stop-Smoking and websites such as Quitnet, StopSmokingCenter, and WebCoach, among others (Cobb et al. 2011). These are dynamic, supervised systems allowing for various modes of communication (e.g., chat rooms, forums, private messaging), self-representation (e.g., personal profiles, blogs, journals), and affiliations (e.g., friend lists, private groups), ensuring that users can seek social support from distant friends "like them" in real time (Norman et al. 2008).
Recent research has established that modern online health communities for smoking cessation function as a form of treatment for their participants, significantly increasing abstinence rates and exhibiting levels of effectiveness similar to intensive face-to-face counseling (Shahab and McEwen 2009). Web- and computer-based smoking cessation programs for adult smokers were found effective: "in a random-effects meta-analysis of 22 eligible trials (9 web-based, and 13 offline computer-based interventions), the intervention group had a significant effect on cessation (relative risk (RR), 1.44; 95 % confidence interval (CI), 1.27–1.64)" (Myung et al. 2009). Similar successes were reported exclusively with web-based interventions for adolescents (RR, 1.40; 95 % CI, 1.13–1.72), where "the intervention group had a significantly larger cessation rate than that of the control group" (Crutzen et al. 2008).

Social Web (Web 2.0) technology alone does not guarantee a successful online community where members participate actively and develop lasting relationships (Iriberri and Leroy 2009). Adapting the well-established RE-AIM framework for sustainable interventions to online forums, the following five criteria can be distinguished (Glasgow et al. 2006). Reach is an individual-level measure of participation, referring to the percentage and risk characteristics of forum participants. Effectiveness is the degree to which the intervention achieves the intended outcomes, e.g., progress towards smoking cessation as a function of social interaction. Adoption refers to the proportion and representativeness of the settings that adopt an intervention, e.g., incorporating online community forums as a smoking cessation strategy. Implementation describes the extent to which an intervention is delivered as intended, e.g., online social forums have measurable interactions. Maintenance is the extent to which an intervention becomes routine, e.g., ongoing utilization and evolution of online forums.
The literature focusing on individual-level measures has paid much attention to evaluating the effectiveness of online forums as a treatment for smoking, finding that both intra-treatment and extra-treatment social support are associated with increased rates of smoking cessation (Crutzen et al. 2008). However, little to no research has been reported on measuring reach, which is tantamount to user engagement in the context of WATIs.

The contribution of this paper lies in enhancing our understanding of user engagement as a key component of the reach of online treatments, in particular, social support environments and interventions (e.g., WATIs). The paper illustrates how and why the lack of prescriptive, as opposed to descriptive, models is growing into a serious challenge in social network analysis today. By distilling the factors that influence user engagement, the present discussion looks for insights that could be applied to adapt thread recommendation research to the context of smoking cessation, with the aim of enhancing the reach of online smoking cessation communities. In particular, the paper discusses how targeted thread recommendations can be employed to assist the less experienced health forum users in order to achieve higher levels of user engagement. The paper expands on the argument that social network formation models based on actors' decisions do not allow for incorporating exogenous interventions, and as a remedy, proposes a strategy to explicitly model weak, acquaintance-type ties, which with time can turn into strong, friendship ties.

In order to motivate this line of inquiry, posting records from the Alt.Support.Stop-Smoking newsgroup are studied. The members of this online community, which was particularly active in the early 2000s, discussed topics pertaining to smoking cessation in the forum's threads. This paper reports the following: Sect.
2 encompasses a study of the online health community Alt.Support.Stop-Smoking and identifies the metrics reflecting the implications of user engagement; Sect. 3 details the challenges and opportunities surrounding the use of prescriptive social network modeling methods within smoking cessation communities; and Sect. 4 concludes the paper and offers directions for future research. It should be noted that the analysis and models presented in this paper are smoking-cessation specific and may not be immediately generalizable to digital health social networks addressing other conditions.

2 Data analysis

The Internet-based Alt.Support.Stop-Smoking forum, used in this study to distill measures to enable the monitoring of engagement patterns, is a Usenet newsgroup. Its structure is similar to other World Wide Web forums in that users can both read and post messages, which are stored and available for viewing in a hierarchical tree. Usenet is a distributed system, accessible via the Network News Transfer Protocol (NNTP) or, alternatively, using WWW front-ends such as Google Groups. The data analyzed in this paper were downloaded from a Usenet archive via NNTP in September 2013 and inserted into a PostgreSQL database. Complex data analyses were then conducted using custom-developed Java code. The de-identified data analyzed in the present study were derived from retrospective publicly available data. Per IRB procedures at the University at Buffalo, submission of a human subjects research protocol for ethical board review of this type of investigation was not required.

The Alt.Support.Stop-Smoking forum activity examined in this section spans the ten-year period from 8/1/2003 to 9/15/2013. During this time, 438,136 posts were made by 8,236 unique users in 48,518 threads.
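The aggregates reported above can be computed directly from per-post records. The authors used Java and PostgreSQL; as those are not shown in the paper, the following Python sketch is an illustrative stand-in, with the `Post` record layout inferred from the description of the dataset (timestamp, username, thread) and all function names hypothetical.

```python
from collections import namedtuple
from datetime import datetime

# Assumed record layout: each entry carries the post's timestamp,
# the author's unique forum username, and the thread identifier.
Post = namedtuple("Post", ["timestamp", "user", "thread"])

def forum_totals(posts):
    """Headline aggregates: total posts, unique users, unique threads."""
    return {
        "posts": len(posts),
        "users": len({p.user for p in posts}),
        "threads": len({p.thread for p in posts}),
    }

def monthly_activity(posts):
    """Per-month post counts and active-user counts; a user is active
    in a month if they made at least one post in that month."""
    months = {}
    for p in posts:
        key = (p.timestamp.year, p.timestamp.month)
        bucket = months.setdefault(key, {"posts": 0, "active": set()})
        bucket["posts"] += 1
        bucket["active"].add(p.user)
    return {k: {"posts": v["posts"], "active_users": len(v["active"])}
            for k, v in sorted(months.items())}
```

Applied to the full dataset, `forum_totals` would be expected to reproduce the reported 438,136 posts, 8,236 users, and 48,518 threads.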
Each of the 438,136 entries in the dataset corresponds to an individual post made by a user on the forum and comprises the timestamp of the post, the author's unique forum username, and the thread to which it was submitted. Note that user data in the Alt.Support.Stop-Smoking dataset were limited to posting records, and therefore only the activity of registered users with at least one post was analyzed. Owing to the difficulty in quantifying the difference in benefits between active posters and passive users ("lurkers"), user records of the latter were not included in the analysis.

The first step in analyzing the Alt.Support.Stop-Smoking data involved extracting and analyzing the aggregate forum metrics as a function of time. Figure 1a, b showcases moving averages for post and thread counts, and for new and active user counts, over the observed life of the forum, respectively. Users were considered to be active during a period if they were observed to have made one or more posts during that period.

[Fig. 1 a Moving average of posts and threads. b Moving average of new and active users]

As observed in the trends, the rapid growth experienced during the initial time period is short-lived; January 2004 marks the relative peak of the forum's activity, with 12,100 posts, 1,419 new threads, 1,490 active threads, 266 new users, and 439 active users. All of these measures are significantly higher than the overall averages observed in the dataset, where a typical month revealed 3,591 posts, 397 new threads, 445 active threads, 67 new users, and 165 active users. Over the 9 years following the forum's popularity peak, a gradual decline is observed in each of the four main aggregate forum metrics. In the last period covered by the dataset, 9/1/2013 to 9/15/2013, only five posts were submitted to the forum, made by four active users, in three active threads. It is worthwhile to try to understand the factors that precipitated this decline. Moreover, there is a need to study whether the application of calculated external pressures could enable the forum to reach more users over a longer period of time, thus increasing its cumulative health benefit. Accordingly, Sect. 2.1 offers user-specific analysis, enabling a deeper assessment of a typical forum user's behavior. The remainder of Sect. 2 is structured into the following subsections: Sect. 2.1 reports on user-level statistics; Sect. 2.2 studies how gradually developed strong ties affect user behavior; and Sect. 2.3 classifies users by type and identifies user subgroups that could potentially benefit from engagement-enhancing interventions.

2.1 User-specific analysis

The consideration of historical forum aggregates alone does not fully capture the underlying user activity and engagement patterns. To provide a more complete picture, individual user data must be analyzed. The frequency graphs in Fig. 2a, b indicate that, based on the Alt.Support.Stop-Smoking data, the forum content is concentrated in a relatively small cadre of highly involved, "core" users rather than being distributed evenly throughout a largely homogeneous user base. In the analyzed data, an average user contributes 53.2 posts during their forum lifetime (defined as the time elapsed between the user's first post and their last post). The average user contribution is skewed by a small group of users who account for the majority of posts made to the forum. The top 1 % of users (n = 83) accounted for 194,498 of the 438,136 total posts (44.39 %), the next 9 % (n = 741) accounted for 193,498 posts (44.16 %), and the bottom 90 % (n = 7,412) accounted for 50,140 posts (11.44 %). The distribution of thread creators has a similar shape, with the top 1 % of thread creators accounting for 22,707 of the forum's 48,518 total threads (46.8 %), the next 9 % accounting for 19,014 threads (39.2 %), and the bottom 90 % accounting for 6,797 threads (14.0 %).
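The contribution shares reported above can be reproduced from a per-user post-count table. A minimal sketch follows; the function name is hypothetical, and the exact rounding of the percentile cut indices may differ from the authors' binning.

```python
def contribution_shares(post_counts, cuts=(0.01, 0.10)):
    """Share of total posts contributed by the top 1 %, next 9 %, and
    bottom 90 % of users, given a list of per-user post counts."""
    ranked = sorted(post_counts, reverse=True)
    total = sum(ranked)
    n = len(ranked)
    top = sum(ranked[: round(n * cuts[0])])          # top 1 % of users
    nxt = sum(ranked[round(n * cuts[0]): round(n * cuts[1])])  # next 9 %
    bottom = total - top - nxt                        # remaining 90 %
    return top / total, nxt / total, bottom / total
```

Run over the 8,236 per-user counts, this computation would be expected to yield the reported 44.39 %, 44.16 %, and 11.44 % shares.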
These measures indicate that the most active users are responsible for a disproportionate amount of the forum's overall content. Previous analyses of online communities have observed a similar phenomenon, referred to as the 1 % rule or the 90–9–1 principle, in which 90 % of actors observe and do not participate, 9 % participate sparingly, and 1 % create the vast majority of new content (van Mierlo 2014).

[Fig. 2 a Distribution of the number of posts per user across the lifetime. b Distribution of thread creators (>5 threads created). Fig. 3 User active lifetime]

Overall, the majority of the users have short active forum lifetimes, with 4,557 users (55.33 % of the user base) having a lifetime of 1 day and only 634 users (7.7 % of the user base) having an observed lifetime over 1 year. Amongst the 100 most active posters, the observed average lifetime is 936.45 days. These observations imply there is a largely transient user base that enters and exits before having any opportunity for engagement. It is worthwhile to note that some short-term users were likely "bots" (automatic programs posting commercial ads) that were presumably banned by the server's administration. Figure 3 indicates that, as the forum grew older and its active user base became more static, fewer new members joined and even fewer elected to remain engaged. A plausible explanation for this phenomenon lies in the increased difficulty faced by new users in trying to integrate themselves into an established community, with the majority of active members enjoying already established friendship relationships. Young (2013) indicated that when users start to think that they can no longer influence the community, they will disengage.
Failure to reverse such patterns of user disengagement and barriers to entry can lead to the death of the forum, as the established user base dwindles and fewer new users join to take their place. Thus, it is necessary to determine how these friendship networks that initially served as a barrier to new users could instead be leveraged to engage them. To do so, the concept of friendship between users (how it arises and the influence it exerts on user behavior) must be defined.

2.2 Engagement-related analysis

As the Alt.Support.Stop-Smoking forum does not explicitly report on friendship ties between its members, they must be inferred heuristically. Online friendships capture the emergence of mutual recognition between two persons, arising from their repeated interactions. In this vein, Rheingold (2000) describes online communities as "cultural aggregations that emerge when enough people bump into each other often enough in cyberspace", while Preece (2001) defines them as "any virtual social space where people come together to get and give information or support, to learn, or to find company". Interaction instances, termed "weak ties" hereafter, between users were derived by analyzing posting patterns within threads. If a certain number of interactions or weak ties are observed between a pair of users, it can be surmised that a strong tie is formed between them, i.e., that they have become friends. When User#1 submits a post to a thread within 2 days of User#2, it is (by assumption) interpreted as an instance of interaction between them. The gain in recognition arising from such interactions is divided into two sub-components: User#1 adds a weak in-tie from User#2 while User#2 simultaneously adds a weak out-tie to User#1.
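The tie-derivation rule just described can be sketched in code. The 2-day interaction window and the threshold of 10 ties in each direction are taken from the text; the data structures and function names are illustrative, and the paper may count post pairs within a thread differently than this sketch does.

```python
from collections import defaultdict
from datetime import timedelta

WINDOW = timedelta(days=2)   # interaction window assumed in the paper
THRESHOLD = 10               # in-ties/out-ties required for friendship

def derive_ties(posts):
    """Count directed weak ties: an interaction (u, v) is recorded when
    user u posts in a thread within 2 days after a post by user v.
    `posts` is an iterable of objects with .timestamp, .user, .thread."""
    by_thread = defaultdict(list)
    for p in posts:
        by_thread[p.thread].append(p)
    ties = defaultdict(int)
    for thread_posts in by_thread.values():
        thread_posts.sort(key=lambda p: p.timestamp)
        for i, later in enumerate(thread_posts):
            for earlier in thread_posts[:i]:
                if (later.user != earlier.user
                        and later.timestamp - earlier.timestamp <= WINDOW):
                    ties[(later.user, earlier.user)] += 1
    return ties

def friends(ties, threshold=THRESHOLD):
    """Pairs with at least `threshold` weak ties in each direction."""
    return {frozenset((u, v)) for (u, v), n in ties.items()
            if n >= threshold and ties.get((v, u), 0) >= threshold}
```

The requirement of reciprocal ties in `friends` mirrors the paper's restriction of friendship to pairs with equitable, balanced interaction patterns.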
The reasoning for this division is based on an interpretation of how friendship germinates, being restricted to those pairs of users that demonstrate equitable and balanced interaction patterns. When the number of recorded in-ties/out-ties between a pair of users exceeds a specified threshold (10 of each in the present analysis), those users are assumed to have become friends, in the sense that they can distinguish each other from the general user body, and such recognition prompts them to communicate more. Following this logic, the analysis of the dataset's posting patterns reveals the distribution of friendship ties between forum users (see Fig. 4). As evidenced by user-based metrics, the forum's user base is highly segmented. The user with the largest friendship network has 395 friends, with only four other users exceeding a friend count of 300 and only 29 exceeding a count of 100. Unconnected, i.e., friendless, users comprise the largest segment of the forum's user base (n = 7,206 users). These users were not reached, and therefore were not affected by the forum to the extent where their experience/thoughts/social support could be helpful to others.

Having defined strong (friendship) ties, the assessment of the influence that such ties might exert on user behavior can proceed. To this end, two research questions were explored: (1) Do friendship ties (or lack thereof) influence a user's propensity to abandon the forum? and (2) Do friendship ties influence a user's posting behavior, in that users are more likely to post in threads created by their friends as compared to those created by non-friends? In order to answer the first question, active users during each time period (month) were divided into two groups: those who elected to leave during that period and those who elected to remain active. Analysis revealed 12,064 instances of user "survival" and 8,236 instances of user "death" (forum abandonment).
The average number of active friends for each user in these two groups was then determined. When the entire duration of the dataset was examined, it was discovered that, on average, surviving users had 8.244 active friends while outgoing users had 1.165 active friends. These results indicate that the presence of an active friendship network is highly correlated with a user's decision to stay or leave, with users having comparatively larger active networks being more likely to remain.

In order to answer the second question, data consisting of active threads and active users, for whom there existed at least one active thread created by a friend, were collected for each time period (month). The number of friend- and non-friend-threads to which users responded (among the active threads), and the number of posts made in each, were then obtained for each user. Of 998,884 opportunities to post in a friend's thread, 135,640 were used, with contributions submitted to 87,900 distinct threads. Conversely, of 3,487,854 opportunities to post in a non-friend's thread, 239,150 were used, with 160,652 distinct threads receiving contributions. This corresponds to an 8.8 % probability of a user posting in a friend-created thread and a 4.6 % probability of posting in a non-friend-created thread. When the gross number of posts made within these threads is considered, the observed counts correspond to an average of 0.1358 posts made by an average user per friend-thread and 0.0685 posts per non-friend-thread. This effect is not only statistically significant (which is not surprising, given the sample sizes) but, more importantly, practically significant.

2.3 Analysis of user engagement needs

In order to summarize and simultaneously provide a more in-depth analysis of user behavior patterns and engagement needs, the forum's user base is provisionally divided into four distinct groups.

[Fig. 4 Distribution of friendship network sizes (>0 friends). Fig. 5 Representative examples of different user types]

The drivers of the division are users' forum lifetimes and their observed levels of activity. Users may be generally divided into short-term users and long-term users, with each having two distinct subgroups. The four distinguished user types, along with representative examples of engagement-specific activity patterns, are shown in Fig. 5.

2.3.1 Short-term users

Quadrants II and III in Fig. 5 comprise those users having relatively short lifespans, frequently a week or less. Quadrant II and III users are differentiated from each other by their respective activity levels. Quadrant III users are those who join the forum, make a small number of initial contributions, and then leave for good. As shown in Sect. 2.1, such users make up a significant proportion of the forum's user base. Conversely, Quadrant II users post heavily immediately upon joining the forum, only to leave soon after. Although it is impossible to definitively identify the primary motivators of short-term users, research has suggested that they are composed largely of recent quitters seeking support while struggling with their quit attempt. In an analysis of an online smoking cessation community, Selby et al. (2010) found that seeking support and advice was the most common theme identified in first posts among both recent and longer-term quitters. In their analysis of 2,562 first posts to an online smoking cessation support group, approximately 54.7 % were made by individuals who had quit smoking within the past month, 8.9 % by those who had quit more than 1 month prior, and 24.9 % by those who had not yet quit smoking (Selby et al. 2010).
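The quadrant typology, driven by forum lifetime and activity level, can be operationalized as a simple classifier. The numeric cutoffs below are hypothetical placeholders, since the paper does not specify quadrant boundaries.

```python
def classify_user(lifetime_days, post_count,
                  lifetime_cut=365, activity_cut=100):
    """Map a user to a Fig. 5 quadrant by forum lifetime and activity.
    Cutoff values are illustrative assumptions, not taken from the paper."""
    long_term = lifetime_days >= lifetime_cut
    high_activity = post_count >= activity_cut
    if long_term and high_activity:
        return "I"    # long-term, high activity: core-users
    if long_term:
        return "IV"   # long-term, low activity: topic-driven users
    if high_activity:
        return "II"   # short-term, high activity
    return "III"      # short-term, low activity
```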
The analysis of posting patterns within the Alt.Support.Stop-Smoking community indicates that the typical user is narrowly focused, limiting their posting activity to a small number of threads, oftentimes their own. Of 8,236 unique users, 51.2 % (4,219/8,236) limited their posting activity to a single thread, and 73 % (6,009/8,236) to five or fewer threads. Considering the 4,219 users whose activity was confined to a single thread, 43.6 % (1,839/4,219) posted solely to the threads that they themselves had created, indicating that their primary motivator for participation is personal benefit. Although the short-term users may have received the benefit of social support during their time on the forum, the failure to retain them as contributing members can be considered an overall community loss. By leaving the forum soon after having joined, such users will not "return the favor" by providing social support to other users in the future. This behavioral pattern may not necessarily be considered a disservice to the exiting short-term user; the literature is split on the significance of continued and active participation during quit attempts (Preece et al. 2004; An et al. 2008). It could, however, be viewed as a disservice to other current and future members who will not benefit from the user's experience and insights.

2.3.2 Long-term users

Quadrants I and IV comprise users with long lifespans, often many years. Quadrant IV users demonstrate relatively low activity levels and small friendship networks, but nonetheless chose to remain active for an extended period of time. These users are likely heavily topic-driven, primarily posting to threads that serve their immediate needs or pertain to a personal interest. Quadrant I users demonstrate sustained high activity over long lifetimes. They are sometimes referred to in the literature as "core-users" or "super-users" (Young 2013).
Core-users are responsible for much of the forum content, as illustrated in Sect. 2.1, and form the backbone of the community, exerting a disproportionate level of influence relative to their overall numbers (O'Neill et al. 2014; van Mierlo et al. 2012). Participation of core-users is motivated more by community factors than by personal interest in specific topics. It has been found that many core-users are altruistic and truly serve the community: they are the first to greet newcomers and provide social support to other users. They may have benefited from the forum in the past and are motivated to "pay it forward". In a previous analysis of an online smoking cessation forum, it was found that the majority of responses (>50 %) to new users' first posts were made by members who had quit for a month or more, with only 1 % of first replies being made by members who had not yet quit (Selby et al. 2010).

Posting patterns observed in the Alt.Support.Stop-Smoking forum reflect the role played by core-users in the community. Lurkers are often hesitant to ask a question or seek support within an inactive community where they perceive the likelihood of a response to be low, with core-user activity giving lurkers the confidence to join in the conversation (Bishop 2007). As seen in Sect. 2.1, the rate at which new members posted to the community is positively correlated with the number of posts and active threads during that time, content for which core-users were largely responsible. Core-users' role as community ambassadors, typically being among the first to respond to newcomers, is another essential function. In the Alt.Support.Stop-Smoking dataset, 3,743 users started a new thread within 2 days of having joined the forum, with 45.0 % (1,686/3,743) receiving a prompt reply (within 3 h of their initial post).
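The prompt-reply statistic above can be derived from the post records. The 2-day joining window and 3-hour prompt-reply window come from the text; the record layout and function name are assumptions, and the sketch counts qualifying threads rather than users, which may differ slightly from the paper's accounting.

```python
from datetime import timedelta

JOIN_WINDOW = timedelta(days=2)     # new thread within 2 days of joining
PROMPT_WINDOW = timedelta(hours=3)  # first reply within 3 h is "prompt"

def prompt_reply_rate(posts):
    """Fraction of threads opened within 2 days of the opener's join
    (their first post) that received a reply within 3 hours.
    `posts` is a list of objects with .timestamp, .user, .thread."""
    posts = sorted(posts, key=lambda p: p.timestamp)
    join, thread_opener, prompt = {}, {}, {}
    for p in posts:
        join.setdefault(p.user, p.timestamp)      # join time = first post
        if p.thread not in thread_opener:
            thread_opener[p.thread] = p           # first post opens a thread
        else:
            opener = thread_opener[p.thread]
            if (p.user != opener.user
                    and p.timestamp - opener.timestamp <= PROMPT_WINDOW):
                prompt[p.thread] = True           # some reply arrived promptly
    qualifying = [t for t, opener in thread_opener.items()
                  if opener.timestamp - join[opener.user] <= JOIN_WINDOW]
    if not qualifying:
        return 0.0
    return sum(prompt.get(t, False) for t in qualifying) / len(qualifying)
```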
Initial responses to these threads were typically made by core-users, with the average post count and number of friends of first responders being equal to 1,681.18 and 71.39, respectively. Both of these values were significantly higher than the overall community averages of 53.2 posts and 2.505 friends, respectively (p < 0.001 in both cases). Following an initial post, the prompt engagement of newcomers by core-users was found to have a significant correlation with their future activity patterns. New users receiving a prompt reply to their first thread had an average lifespan of 114.52 days and an average post count over their lifespan of 93.36. Conversely, new users who did not receive a prompt reply to their first thread were found to have an average lifespan of 61.36 days and an average lifespan post count of 49.00. These differences are both statistically (p < 0.001) and practically significant, having 95 % confidence intervals of 22.4–66.3 days and 35.79–70.53 posts, respectively.

3 Directions and future considerations for increasing engagement in smoking cessation communities

This section builds upon the insights offered by Sect. 2, demonstrating how online service evaluation and planning by social network analysts can be applied towards strategic interventions targeting increased user engagement in online health forums. Calculated strategic management is essential for maintaining successful online communities where members actively participate and develop lasting relationships (Iriberri and Leroy 2009). Modeling the dynamics of interactions between core-users, regular users, and newcomers in online health forums would provide a technical foundation for modern pro-health engagement research. There is a gap in the literature regarding prescriptive models capable of monitoring, controlling, and improving user engagement in online health forums. One avenue towards accomplishing these goals is through targeted recommendations of threads to users.
Thread recommendation systems apply knowledge discovery techniques to match users to threads. Given the diverse interests and needs of forum users, coupled with the large amount of information that they must sift through on a typical forum, recommender systems present an essential tool for improving end-user retention and facilitating meaningful user interactions. Thread recommender systems serve to simultaneously satisfy users' information needs, by directing them to appropriate content, and their social needs, by connecting them to other users within the community.

There are a number of domain-specific considerations, not emphasized or even present in conventional thread recommendation tasks, that are essential for the development of effective health forum recommender systems. In contrast to conventional online forums, the participation of users in online health forums is primarily motivated by a desire to give and/or receive social support (White and Dorman 2001). Friendships between forum participants play an essential role in the provision of social support within such communities. Reading and participating in forum threads leads users to encounter other members like themselves with whom friendships can be built, thus enabling personalized support. Therefore, threads serve not solely as platforms for the dissemination of static content, but also as conduits for meaningful user interactions, with thread value being generated by, and representative of, its participants. Within this framework, each thread can be viewed as a resource for introducing new user ties and strengthening existing ones. The mechanisms by which friendships form between users, and the manner in which threads can be employed to facilitate the process, are essential components of the emerging methodology for health forum thread recommender systems.
The ensuing subsections discuss adjustments to current paradigms that can lead to models capable of informing and controlling online forum user engagement. They offer a more focused discussion of the domain-specific challenges confronting thread recommendation systems in online smoking cessation forums and of the use of social network structure as a means to motivate thread recommendation, and they present a new paradigm for modeling actor ties within a social network to better capture the manner in which friendships between users develop.

3.1 Thread recommendation within an online smoking cessation forum

Research on thread recommendation systems is just beginning to emerge (Tang et al. 2013). Traditional product recommendation systems have employed a combination of collaborative filtering and content-based approaches to match consumers with products, under the assumption that product appeal and consumer preference are static but initially unknown (Sarwar et al. 2001). Collaborative filtering methods identify and exploit consumer and product similarities to make predictions about user tastes or preferences. They may be reinforced via content-based approaches, which function by comparing consumer preferences to product features and thereby provide even more suitable recommendations. In addition to the traditional challenges confronting all recommender systems (e.g., cold starts and data sparsity) (Sarwar et al. 2001), user preferences/needs within online smoking cessation communities are dynamic, i.e., continually evolving, as users progress through health state changes (e.g., quitting stages). The process of changing smoking behavior has been subdivided by smoking cessation researchers into five distinct stages: precontemplation, contemplation, preparation, action, and maintenance (Prochaska and DiClemente 1984). Additionally, users may relapse, i.e., return to an earlier stage.
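As a concrete illustration of the collaborative-filtering idea described above, the following sketch scores threads for a target user from a binary user-thread participation map, using cosine similarity between users. It is a minimal textbook baseline, not the system envisioned here: per the discussion in the text, a health-forum deployment would additionally need to account for users' cessation stages and for thread evolution over time. All names are illustrative.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sets of thread ids (binary vectors)."""
    return len(a & b) / sqrt(len(a) * len(b)) if a and b else 0.0

def recommend(target, participation, k=5):
    """Rank threads the target user has not joined by the summed
    similarity of the users who did join them.
    `participation` maps user -> set of thread ids."""
    sims = {u: cosine(participation[target], ths)
            for u, ths in participation.items() if u != target}
    scores = {}
    for u, ths in participation.items():
        if u == target:
            continue
        for t in ths - participation[target]:
            scores[t] = scores.get(t, 0.0) + sims[u]
    return sorted(scores, key=scores.get, reverse=True)[:k]
```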
An effective health forum thread recommendation system should therefore tailor its recommendations to reflect users’ states of behavioral change (in this case, smoking cessation) in order to provide them with an appropriate level of support.

Netw Model Anal Health Inform Bioinforma (2014) 3:69

Forum threads are typically short-lived and quickly changing in their content/narrative, in contrast to the long lives and static characteristics of products being recommended in conventional settings. The content and narrative of a thread may evolve as contributions are made to it by users, introducing uncertainty into the very defining characteristic of a thread as a product. An effective thread recommendation system should capture and account for such uncertainty. Note that thread evolution affects not only its future contributors: the benefit/utility that a user derives from participating in a thread is not immediately realized upon their initial posting, resulting instead from the responses made by future contributors. Due to these dynamics, the benefit/utility of thread participation is unpredictable, being a function of time and depending on future, as yet unrealized, events.

3.2 Network structure considerations and complex behaviors

Threads in online smoking cessation forums facilitate user engagement, providing a platform for interactions between users and the provision of social support. A thread’s value lies not solely in its narrative, but in the opportunity that it provides users to directly communicate with one another. A user’s acceptance of a thread recommendation can be thought of as signifying that engagement is taking place. To reflect the importance of communication between users, and the role that threads serve in enabling it, an effective thread recommendation system must consider the social network structure both of the overall community and within the thread itself.
Social network analysis takes an expanded view of a social environment, allowing for inferences about how network structure both enables and drives behavior change (Cobb et al. 2011). Smoking cessation is an example of a complex adoptable behavior, which is differentiated from simpler behaviors in the social network literature (Centola and Macy 2007). The distinction between simple and complex behaviors is an essential consideration for an effective thread recommender system due to fundamental differences in how behaviors diffuse through a network. Simple behaviors, such as the adoption of a new technology or product, are spread farther and more quickly by networks having many long ties. Conversely, complex behaviors, such as smoking cessation, typically require a user to be in contact with multiple individuals capable of supporting them in their behavior change before it is adopted. Once adoption of a complex behavior has been realized, continued reinforcement is crucial to ensure that the newly adopted behavior persists and the user does not relapse back to their prior state. Research has shown that highly clustered networks are most effective in facilitating the adoption of complex behaviors within a community (Centola 2010). The consideration of network structure in thread recommendation tasks alters the manner in which a thread’s value is determined and the purpose which it ultimately serves. The relationship between a thread’s network structure and that of a user targeted by the recommender system directly influences the ability of the thread to provide the user with social support. A thread containing contributions from friends (recognizable peers) may be assumed to provide a greater level of social support to an individual. However, this is not to say that threads containing relatively few of a user’s friends are of no value to that user.
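The clustering property referenced above can be quantified with the standard local clustering coefficient. The sketch below is a generic computation, not a method from the cited studies; the adjacency-dictionary representation is an illustrative assumption.

```python
def local_clustering(adj, v):
    """Local clustering coefficient of node v: the fraction of pairs of
    v's neighbors that are themselves connected by a tie.

    `adj` maps each node to the set of its neighbors (undirected graph).
    """
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0  # coefficient is undefined for degree < 2; report 0 by convention
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2 * links / (k * (k - 1))
```

A recommender aiming to foster complex-behavior adoption could, for instance, monitor how a candidate recommendation would change such coefficients in a user's neighborhood.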
Rather than providing high levels of immediate social support, such threads provide a user with the opportunity to change their local network structure through the introduction of new ties and/or the strengthening of existing ones. In summary, threads possess the capacity to provide both immediate and future benefits to users. In order to reflect a thread’s capacity to influence a user’s local network structure, a prescriptive modeling framework capable of capturing the influence of outside interventions (in the form of thread recommendations) on network structure is required. Stochastic actor-based models are a popular methodology for modeling network evolution and predicting ties between actors. Within such networks, nodes represent social actors, e.g., forum users, while edges (ties) represent social relations between them such as friendship, trust, or cooperation. Ties between pairs of actors may be established, or existing ties dissolved, influenced by factors such as the actors’ structural positions within the network, actor characteristics (actor covariates), and their relationships with other nearby actors (dyadic covariates). However, existing stochastic actor-based models lack the means to analyze and quantify influence imposed on social networks from the outside: network ties are actor-initiated, i.e., they can only be changed myopically by the actors themselves. Formulation of an exogenous intervention strategy requires one to choose an aggregate, actor-based objective function and decision variables to optimize this function. The modeling challenge lies in the identification and application of external interventions (in the form of thread recommendations) that serve to modify a user’s local network in such a way as to benefit that user and/or those around them. When recommending a thread to a user for the purpose of altering their local network structure, the likelihood of such changes is an essential consideration.
The concept of link prediction may be applied to this task (Liben-Nowell and Kleinberg 2007). The link-prediction problem for a social network involves the identification of new links that are likely to appear in the future, given the network’s current structure and the characteristics of pairs of users (dyadic covariates). In order to modify the existing actor-oriented modeling paradigm to accommodate exogenous interventions, the manner in which actor ties are modeled must be revisited. Ties in traditional stochastic actor-based models are assumed to be binary, with relationships between actors either existing or not. To capture the dynamics of user interactions within a smoking cessation community, actor ties should instead be weighted, reflecting varying levels of friendship between actors and the build-up from weak ties to strong ones.

3.3 Weak and strong tie dynamics

While the first co-posting in the same thread by two users may only constitute a weak tie, repeated communication between them can lead to tie strengthening over time, eventually resulting in the establishment of a strong friendship tie. In this way, the altruistic behavior of core users can be employed to ‘‘push’’ the network towards a state characterized by higher levels of user engagement: users can be introduced to one another through co-referenced threads, thereby facilitating meaningful user interactions. A major deficiency of existing actor-oriented models lies in their inability to explicitly accommodate weak ties and their dynamics. One of the most significant premises upon which the actor-oriented models are built is that tie formation is a Boolean class of variables wherein a tie is either present or absent, and must be observable (Snijders et al. 2010). This paper posits that the problem of modeling exogenous interventions can be approached by considering two processes that together describe the formation of a social network.
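As a concrete illustration of link prediction over dyads, the sketch below scores non-adjacent user pairs by the Jaccard overlap of their neighborhoods, one of the standard predictors evaluated by Liben-Nowell and Kleinberg (2007). The adjacency representation and toy network are assumptions for illustration only.

```python
def jaccard_scores(adj):
    """Rank currently non-adjacent pairs by Jaccard overlap of neighbor sets.

    `adj` maps each node to the set of its neighbors (undirected network).
    Returns (pair, score) tuples sorted from most to least likely link.
    """
    nodes = sorted(adj)
    scores = {}
    for idx, a in enumerate(nodes):
        for b in nodes[idx + 1:]:
            if b in adj[a]:
                continue  # tie already exists; nothing to predict
            union = adj[a] | adj[b]
            if union:
                scores[(a, b)] = len(adj[a] & adj[b]) / len(union)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

In the recommender setting sketched in this section, such scores could feed the estimate of how likely a recommended thread is to convert a co-posting into a new weak tie.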
Process 1 expresses how an actor builds strong ties with other nearby actors, i.e., what drives their decisions about with whom to communicate more or less. However, such decisions are clearly made with respect to the actor’s acquaintances, with most other actors treated as strangers. Strangers’ attributes are unknown to an actor, and their influence on the actor’s decision-making mechanism, captured by Process 1, is minimal. This accentuates the importance of Process 2: building acquaintances, termed weak ties. This definition of a ‘‘weak tie’’ differs from that based on the structural holes theory (Walker et al. 1997). Therefore, models incorporating varying levels of ‘‘affinity’’ between actors are required, enabling more detailed analyses of transitions between weak and strong ties; these transitions may serve as a key underlying facilitator for the growth of health behavior online social networks. It is strong ties that people would report in a questionnaire, or that can be observed from time-stamped interaction records. Meanwhile, weak tie patterns are hidden unless they trivially span a whole (small) network. Weak ties can be traced online in certain situations: they are ‘‘follow’’-type ties as opposed to ‘‘friend’’-type ties. Therefore, it is crucial to explore approaches to learning weak tie formation dynamics in large networks simultaneously with strong tie dynamics. This will allow (1) the accurate expression of actor decision-making logic, i.e., estimation of Process 1 parameters by removing the bias of the tie patterns that actors are unaware of, and (2) the quantitative evaluation of social influence effects inside networks as well as effects of interventions imposed from the outside. While strong tie formation driven by actors themselves cannot be influenced from the outside, weak tie formation can. People cannot be expected to become friends just because a model-based tool says they should.
However, they can be introduced to each other, informed of congruent interests, invited to vote on or contribute to ‘‘hot’’ forum threads, etc. Such actions help build acquaintances, as they unobtrusively increase the probability that people will more quickly expand their friendship circles, begin communicating with newly found acquaintances, and eventually build stronger ties. A model incorporating weak ties can quantify weak influence effects and suggest feasible interventions to improve actor outcomes. It is only with time that a network actor (e.g., a smoking cessation forum user) expands the local neighborhood on which they will base decisions about building long-lasting relationships, getting engaged, or staying inactive and leaving for good. Thus, strong tie formation depends on weak ties. On the other hand, it is through communication with friends (i.e., people already trusted) that an actor will learn about other trustworthy actors, begin to distinguish those actors from strangers, and explore communication pathways to them. Thus, weak tie formation is facilitated by strong ties. A potential pathway to incorporating both strong and weak ties into a mathematical model lies in studying the behavior of an actor based on the actor’s local network structure, i.e., the actor’s acquaintances, under the assumption that part of the network is hidden, which is a critical omission in all the existing actor-oriented models the investigators are aware of. The exploration of a network, i.e., the discovery of its hidden parts that may contain useful information, then becomes an important task for an actor, where they may benefit from ‘‘outside’’ assistance.

4 Concluding remarks

Calls for the design and implementation of prescriptive social network analysis techniques for the growth and maintenance of online health communities continue to emerge.
The National Institutes of Health have called for research addressing ‘‘the emergence of collective behaviors that arise from individual elements or parts of a system working together’’ through an exploration of ‘‘complex and dynamic relationships among the parts of a system and between the system and its environment’’ (Marcus 2013). Recent papers, such as ‘‘The Spread of Behavior in an Online Social Network’’ (Centola 2010), have improved our understanding of how network structure influences the diffusion of complex behaviors. The present study contributes to this research direction by paving the way for the prescriptive modeling of behavior dynamics. This section touches upon some additional aspects of prescriptive social network modeling for reach enhancement of online pro-health communities, in particular, the treatment of lurkers and the recent trend towards using gamification for therapeutic purposes. A challenge facing the present and prior analyses of online health communities is that passive users (lurkers) are difficult to account for, although they have been found to make up a significant proportion of users in online health forums (Selby et al. 2010). Other research has indicated that lurkers enjoy many of the same benefits as active posters, with more than half of lurkers reporting that ‘‘just reading/browsing is enough’’ (Preece et al. 2004). User anonymity has been observed to play a significant role in web-assisted tobacco interventions (WATIs) and other online health communities. Although known contacts are potentially more influential than anonymous ones, typically having more detailed knowledge of a particular user’s needs and emotional state (Newman et al. 2011), many users are disinclined to discuss sensitive issues pertaining to habits and behavior on non-anonymous social networks such as Facebook (Ploderer et al. 2013; Morris et al. 2010).
While the empirical work presented in this paper relies on data from an online health forum with limited capabilities beyond posting, the more recent introduction of user-controlled features (profile creation, friendship assignment, thread tracking, etc.), the use of gamification as a modern treatment delivery mechanism (Primack et al. 2012), and mobile-based treatment delivery mechanisms (Whittaker et al. 2008; Stanton et al. 1999; Lawrance 2001) may require further effort to analyze how modern health portals, such as ‘‘medhelp.com’’, deliver treatment to their participants.

Acknowledgments

This work was supported in part by the Academy of Finland Grant #268078 ‘‘Mining social media sites’’ (MineSocMed) and the National Cancer Institute (R01CA152093-01 to S.M.). Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the Academy of Finland, the National Cancer Institute or the National Institutes of Health.

References

An L, Schillo BA, Saul JE, Wendling AH, Klatt CM, Berg CJ, Ahluwalia JS, Kavanaugh AM, Christenson M, Luxenberg MG (2008) Utilization of smoking cessation informational, interactive, and online community resources as predictors of abstinence: cohort study. J Med Internet Res 10(5):e55
Bishop J (2007) Increasing participation in online communities: a framework for human-computer interaction. Comput Hum Behav 23(4):1881–1893
Centola D (2010) The spread of behavior in an online social network experiment. Science 329(5996):1194–1197
Centola D, Macy M (2007) Complex contagions and the weakness of long ties. Am J Sociol 113(3):702–734
Cobb NK, Graham AL, Abrams DB (2011) Online social networks and smoking cessation: a scientific research agenda.
J Med Internet Res 13:e119
Crutzen R, De Nooijer J, Candel MJ, de Vries NK (2008) Adolescents who intend to change multiple health behaviours choose greater exposure to an internet-delivered intervention. J Health Psychol 13:906–911
Glasgow RE, Klesges LM, Dzewaltowski DA, Estabrooks PA, Vogt TM (2006) Evaluating the impact of health promotion programs: using the RE-AIM framework to form summary measures for decision making involving complex issues. Health Educ Res 21(5):688–694
Iriberri A, Leroy G (2009) A life-cycle perspective on online community success. ACM Comput Surv 41(2):1–29
Lawrance KG (2001) Adolescent smokers’ preferred smoking cessation methods. Can J Public Health 92(6):423–426
Liben-Nowell D, Kleinberg J (2007) The link-prediction problem for social networks. J Am Soc Inform Sci Technol 58(7):1019–1031
Lim SS, Vos T, Flaxman AD, Danaei G, Shibuya K, Adair-Rohani et al (2012) A comparative risk assessment of burden of disease and injury attributable to 67 risk factors and risk factor clusters in 21 regions, 1990–2010. Lancet 380(9859):2224–2260
Marcus S (2013, Nov 4) Modeling social behavior funding opportunity [Web log comment]. Retrieved from https://loop.nigms.nih.gov/2013/11/modeling-social-behavior-funding-opportunity
Morris MR, Teevan J, Panovich K (2010) What do people ask their social networks, and why? A survey study of status message Q&A behavior. In: Proceedings CHI, pp 1739–1748
Myung SK, McDonnell DD, Kazinets G, Seo HG, Moskowitz JM (2009) Effects of web- and computer-based smoking cessation programs: meta-analysis of randomized controlled trials. Arch Intern Med 169(10):929–937
Narayan KM, Ali MK, Koplan JP (2010) Global noncommunicable diseases—where worlds meet. N Engl J Med 363(13):1196–1198
Newman MW, Lauterbach D, Munson SA, Resnick P, Morris ME (2011) It’s not that I don’t have problems, I’m just not putting them on Facebook: challenges and opportunities in using online social networks for health.
In: Proceedings CSCW, pp 341–350
Norman CD, McIntosh S, Selby P, Eysenbach G (2008) Web-assisted tobacco interventions: empowering change in the global fight for the public’s (e)health. J Med Internet Res 10:e28
O’Neill B, Ziebland S, Valderas J, Lupiáñez-Villanueva F (2014) User-generated online health content: a survey of internet users in the United Kingdom. J Med Internet Res 16(4)
Ploderer B, Smith W, Howard S, Pearce J, Borland R (2013) Patterns of support in an online community for smoking cessation. In: Proceedings of the 6th international conference on communities and technologies, pp 26–35
Preece J (2001) Sociability and usability in online communities: determining and measuring success. Behav Inform Technol 20(5):347–356
Preece J, Nonnecke B, Andrews D (2004) The top 5 reasons for lurking: improving community experiences for everyone. Comput Hum Behav 20(2):201–223
Primack BA, Carroll MV, McNamara M et al (2012) Role of video games in improving health-related outcomes: a systematic review. Am J Prev Med 42:630–638
Prochaska JO, DiClemente CC (1984) The transtheoretical approach: towards a systematic eclectic framework. Dow Jones Irwin, Homewood
Rheingold H (2000) The virtual community: homesteading on the electronic frontier. MIT Press, MA
Sarwar B, Karypis G, Konstan J, Riedl J (2001) Item-based collaborative filtering recommendation algorithms. In: Proceedings of the 10th international conference on World Wide Web
Scarborough P, Bhatnagar P, Wickramasinghe KK, Allender S, Foster C, Rayner M (2011) The economic burden of ill health due to diet, physical inactivity, smoking, alcohol and obesity in the UK: an update to 2006–07 NHS costs. J Public Health 33(4):527–535
Selby P, van Mierlo T, Cunningham J (2010) Online social and professional support for smokers trying to quit: an exploration of first time posts from 2,562 members.
J Med Internet Res 12(3):e34
Shahab L, McEwen A (2009) Online support for smoking cessation: a systematic review of the literature. Addiction 104(11):1792–1804
Snijders T, van de Bunt G, Steglich C (2010) Introduction to stochastic actor-based models for network dynamics. Soc Netw 32:44–60
Stanton WR, Lowe JB, Fisher KJ, Gillespie AM, Rose JM (1999) Beliefs about smoking cessation among out-of-school youth. Drug Alcohol Depend 54(3):251–258
Tang X, Zhang M, Yang CC (2013) Leveraging user interest to improve thread recommendation in online forum. In: 2013 international conference on social intelligence and technology, pp 11–19
van Mierlo T (2014) The 1% rule in four digital health social networks: an observational study. J Med Internet Res 16(2)
van Mierlo T, Voci S, Lee S, Fournier R, Selby P (2012) Superusers in social networks for smoking cessation: analysis of demographic characteristics and posting behavior from the Canadian Cancer Society’s smokers’ helpline online and StopSmokingCenter.net. J Med Internet Res 14(3):e66
Walker G, Kogut B, Shan W (1997) Social capital, structural holes and the formation of an industry network. Organ Sci 8(2):109–125
White M, Dorman SM (2001) Receiving social support online: implications for health education. Health Educ Res 16(6):693–707
Whittaker R, Maddison R, Rodgers A (2008) A multimedia mobile phone-based youth smoking cessation intervention: findings from content development and piloting studies. J Med Internet Res 10(5):e49
Young C (2013) Community management that works: how to build and sustain a thriving online health community. J Med Internet Res 15(6):e119

Social Structure Optimization in Team Formation

Alireza Farasat, Alexander G. Nikolaev
Department of Industrial and Systems Engineering, University at Buffalo (SUNY), Buffalo, NY, U.S.A.

Abstract

This paper presents a mathematical framework for treating the Team Formation Problem explicitly incorporating Social Structure (TFP-SS), the formulation of which relies on modern social network analysis theories and metrics. While recent research qualitatively establishes the dependence of team performance on team social structure, the presented framework introduces models that quantitatively exploit such dependence. Given a pool of individuals, the TFP-SS objective is to assign them to teams so as to achieve an optimal structure of individual attributes and social relations within the teams. The paper explores TFP-SS instances with measures based on such network structures as edges, full dyads, triplets, k-stars, etc., in undirected and directed networks. For an NP-hard instance of TFP-SS, an integer program is presented, followed by a powerful LK-TFP heuristic that performs variable-depth neighborhood search. The idea of such λ-opt sequential search was first employed by Lin and Kernighan, and refined by Helsgaun, for successfully treating large Traveling Salesman Problem instances, but has seen limited use in other applications. This paper describes LK-TFP as a tree search procedure and discusses the reasons for its effectiveness. Computational results for small, medium and large TFP-SS instances are reported using LK-TFP. The insights generated by the presented framework and directions for future research are discussed.

Keywords: team formation problem, social network analysis, combinatorial optimization, discrete optimization applications, Lin-Kernighan heuristic

Email addresses: [email protected] (Alireza Farasat), [email protected] (Alexander G. Nikolaev)

Preprint submitted to Computers and Operations Research, April 23, 2015

1.
Introduction

The success of a project, as well as the productivity of a whole organization, often depends on the effectiveness and efficiency of the work of participating teams (Agustín-Blas et al. 2011). The challenge of assembling successful teams can be addressed by formulating a problem of grouping individuals or assigning them to (sub)sets so as to optimize some outcome-related objectives (Zhang and Zhang 2013). The Team Formation Problem (TFP) has received attention from the operations research community over the past years (Chen and Lin 2004, Gaston et al. 2004, Fitzpatrick and Askin 2005, Wi et al. 2009). However, despite the common understanding that the social structure among members of the same team plays an important role in the team’s output, such consideration has not been explicitly taken into account in mathematical modeling, primarily due to the lack of quantitative means to do so (Lappas et al. 2009, Zhong et al. 2012). This paper addresses the challenge of developing a mathematical framework for incorporating social structure measures into TFP. It identifies the means to quantify social structure by assessing the impact of each individual’s local network on their work-related outcome. For example, such an outcome can be the amount of goods produced, the number of errors committed (self-reported or observed), a job satisfaction indicator, the frequency of conflicts at the workplace, etc. Rooted in social science theories, the presented framework allows one to build models for TFP explicitly incorporating Social Structure (TFP-SS). The class of TFP-SS models sheds light on team building strategies and also advances the emerging quantitative research on social theories and team outcomes (Manser 2009, Ceravolo et al. 2012). The presented framework elucidates the connection between work environment, social network theories and measurable team outcomes; see Figure 1.
Social network theories motivate the use of graph-based constructs, called network structures, for representing social relations: such network structures include edges, full dyads, k-stars, and (un)directed triplets, among others. Theories of social exchange, structural holes, homophily, reciprocity, transitivity and network evolution support the design of interpretable network structure measures as functions of network structures in TFP-SS (e.g., the number of transitive triplets in a given graph). The resulting models are useful for both descriptive and prescriptive purposes. Given historical work-related outcome data for differently structured teams, the researcher can quantify the impact of each theory on team performance by estimating the weight of each respective network structure measure in a model of the outcome. Then, by adjusting team roster decision variables, the outcome can be driven in the desired direction.

[Figure 1: The proposed framework for the TFP, connecting social network theories, network structure measures (NSMs), performance measurement data, a mathematical model, and a tractable program.]

This optimization-permitting ability, together with the reliance of the TFP-SS models on social theories, distinguishes the presented framework from the existing network clustering, community detection and clique problems literature (San Segundo et al. 2011, Pirim et al. 2012, San Segundo and Tapia 2014). Importantly, the presented framework allows one to more closely control individual team members’ local networks, which play a big role in information transmission, according to the structural holes theory. This paper also presents an extremely efficient LK-TFP algorithm for solving TFP-SS, based on variable-depth neighborhood search. To summarize, this paper presents a framework for formulating and solving team formation problems that employ information provided by the local social network of individuals.
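The optimization-permitting ability described above can be illustrated with a toy sketch: a team outcome modeled as a weighted sum of two NSMs (edge and triangle counts), improved by a single pass of pairwise-swap local search. This is not the paper's LK-TFP heuristic or its estimated weights; the edge set, teams, and weight values below are assumed purely for illustration.

```python
from itertools import combinations

def tie(e, a, b):
    """Undirected tie indicator: True if an edge connects a and b in either order."""
    return (a, b) in e or (b, a) in e

def team_outcome(team, e, w_edge, w_tri):
    """Outcome of one team as a weighted sum of two NSMs:
    the edge count and the triangle count within the team."""
    edges = sum(tie(e, a, b) for a, b in combinations(team, 2))
    tris = sum(tie(e, a, b) and tie(e, b, c) and tie(e, a, c)
               for a, b, c in combinations(team, 3))
    return w_edge * edges + w_tri * tris

def greedy_swap(teams, e, w_edge, w_tri):
    """One pass of pairwise-swap local search over a two-team roster:
    accept a member swap whenever it strictly improves the total outcome."""
    t1, t2 = [list(t) for t in teams]
    best = team_outcome(t1, e, w_edge, w_tri) + team_outcome(t2, e, w_edge, w_tri)
    for i in range(len(t1)):
        for j in range(len(t2)):
            t1[i], t2[j] = t2[j], t1[i]  # tentative swap
            val = team_outcome(t1, e, w_edge, w_tri) + team_outcome(t2, e, w_edge, w_tri)
            if val > best:
                best = val  # keep the improving swap
            else:
                t1[i], t2[j] = t2[j], t1[i]  # undo
    return (t1, t2), best
```

Iterating such passes to a local optimum yields a basic 1-swap heuristic; an LK-style search instead chains swaps to variable depth before deciding which prefix of the chain to keep.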
This framework addresses problems at a meeting point of social science and operations research that have significant practical appeal. The efforts in building and treating models for TFP-SS lead to the creation of a methodological toolbox that quantifies social and behavioral aspects of working in teams, particularly in professional nursing, rescue, police operations, sport teams and academic research. The paper’s key contributions are four-fold. (1) It motivates and justifies the use of mathematical programming and optimization techniques in the area of social science, where most problems have previously been addressed qualitatively through observations, experiments and basic statistical methods. (2) It presents a prescriptive, quantitative approach to a real-world application of social network analysis, as opposed to the existing descriptive studies. It also introduces a framework for the operations research, computer science and social science communities to model more complex problems in the area of social science from an operations research perspective. (3) It identifies the relation between established social science findings and team outcomes. It presents explicit, rigorous functions of social structures to evaluate the outcomes. It also describes how social structures and individual attributes can be incorporated into mathematical models of the outcome regardless of the network type (e.g., directed, undirected, weighted or unweighted). (4) It designs and tests methods for solving the TFP where the optimal social structure is sought within the teams. The paper presents both an exact method and an efficient heuristic exploiting the Lin-Kernighan-inspired variable-depth neighborhood search. The rest of the paper is organized as follows. Section 2 offers a review of existing models based on Social Network Analysis (SNA) and motivates a call for prescriptive models in this field.
Section 3 provides an overview of social network theories and defines relevant network structure measures that quantify social structure. Section 4 discusses the relation between work-related outcomes and social structure measures. Section 5 gives a formal statement of a special-case non-trivial instance of TFP-SS and studies this instance in greater detail. Section 6 presents the LK-TFP algorithm. Section 7 reports experimental results of solving TFP-SS instances of varied sizes with undirected and directed networks. Section 8 concludes the paper and discusses future research directions.

2. Emerging Prescriptive Research in SNA

The science of SNA encompasses a set of techniques for building models of networks and models on networks (see Wasserman and Faust (1994) for SNA motivation and a position statement). These techniques range from studying centrality measures (Borgatti 2005) to building complex probabilistic models describing network structure and formation (Albert and Barabási 2002, Robins et al. 2007, Aral et al. 2009). More recently, the domain of SNA has attracted the attention of exact science professionals whose expertise allowed for advances in modeling interactions between agents (Contractor et al. 2006, Newman 2006, Snijders et al. 2010). The main deficiency of the existing SNA tools is that they mostly offer descriptive insights (Nascimento and Pitsoulis 2013), rather than prescriptive capabilities. The dearth of models that could allow a decision-maker to optimally change or influence a social network structure accentuates the difficulty of handling such tasks and, at the same time, calls for filling this gap. The existing works in the area of optimization and prediction are notable (Squillante 2010, Leite et al. 2011, Bettinelli et al. 2013); however, they have focused on small, highly constrained tasks as opposed to introducing broad classes of problems and general methodologies for addressing them.
Of such prescriptive efforts, the models for finding subsets of influential individuals in networks are the most well-studied (Kempe et al. 2003, Goyal et al. 2010, Arulselvan et al. 2009). There exist models that incorporate such graph-based measures as network diameter, density, and centrality into TFP. However, again, most such studies are descriptive and focus on the impacts of social relations, expressed by SNA measures, on team performance (Balkundi et al. 2009, Manser 2009, Abbasi and Altmann 2011, Ceravolo et al. 2012). Existing prescriptive models considering a team’s social network use little of the information captured in the social network structure. Basic SNA concepts such as closeness, diameter, and minimum spanning tree have been employed in identifying a team of experts so as to minimize intra-team communication costs (Lappas et al. 2009, Dorn and Dustdar 2010, Shi and Hao 2013), and in some cases, individual member costs (Kargar et al. 2012, Juang et al. 2013). In 2003, in a study of 816 organization founding teams, Ruef et al. showed that homophily and network constraints are the key factors defining team composition (Ruef et al. 2003). In a more recent study of 2349 open-source software (OSS) development teams, Hahn et al. reported positive correlations between the developers’ decisions to join project teams, their collaborative ties with project initiators, and the perceived status of other (non-initiator) members (Hahn et al. 2008). Zhu et al. investigated the impacts of personal and dyadic motives on team formation (Zhu et al. 2013). They used Exponential Random Graph modeling to find that individuals first get interested in a project due to personal motives such as self-interest, mutual interest, collective action and coordination cost. The typical secondary considerations include dyad-based considerations explained by the social theories of homophily, swift trust, social exchange and co-evolution.
Given the qualitative evidence of network effects on team success, there is much value in conducting rigorous quantitative research on team formation. This paper makes the first effort to introduce a comprehensive framework for TFP based on social network theories. The ability to quantify social network structure is the key to this effort.

3. Social Science Theories and Network Structure Measures

In order to formulate a TFP that explicitly incorporates team social structure (TFP-SS), such structure needs to be quantified. This section details how this can be accomplished, relying on existing social science theories. Network structure measures (NSMs) are the key tools used in the presented framework to construct rigorous, closed-form functions of social structures in TFP. Earlier studies of the behavior of connected individuals sought social theories that could explain network formation mechanisms (Contractor et al. 2006, Robins et al. 2007, Snijders et al. 2010). These efforts resulted in the use of network structures, i.e., simple geometric constructs corresponding to the underlying social theories, in mathematical modeling. Prior to establishing how these social network theories can be useful for explaining outcomes in TFP, some definitions and notation are in order.

3.1. Definitions and Notation Used to Quantify Social Structure

Let graph $G(V, E)$, with a set of nodes $V$, $|V| = N$, and a set of edges $E$, represent a social network of individuals. With the notation $v_i$ used for node $i$, $i = 1, \ldots, N$, set $e_{ij} = 1(0)$ if there exists (does not exist) an edge between nodes $i$ and $j$. Let $w_{ij}$ denote the weight of an edge, which indicates the strength of a social tie. Define $N_G(v_i)$, $i = 1, \ldots, N$, as the local network of node $v_i$, i.e., the set of all neighbors of $i$ in $G$. Define $M$ as the total number of teams and $X_j \subset G$, $j = 1, \ldots, M$, as the network of the members of team $j$, with $|X_j| = n_j$ as the number of members in team $j$. Let $N_{X_j}(v_i)$, $i = 1, \ldots, N$, $j = 1, \ldots, M$, denote the local network, also known as the ego network (Everett and Borgatti 2005), of node $i$ in team $j$. With a slight abuse of notation, an individual represented by node $i$ in $G$ is said to belong to team $j$, $v_i \in X_j$, if node $i$ is contained in $X_j$. In an attributed graph, set the binary variable $I_{ri} = 1(0)$, $r \in \mathcal{A}$, if node $i$ has (does not have) the $r$th attribute (e.g., certain expertise or ability), where set $\mathcal{A} = \{1, 2, \ldots, A\}$ contains the indices of the attributes relevant to the problem at hand.

According to earlier works in other SNA applications, the meaningful part of a social environment (climate) in a community is captured by the community's social network. Social scientists theoretically explore the connections between individuals in a social network, which, in turn, can be expressed in a graph via basic network structures (see Figure 2). Using the introduced notation, a full dyad (also known as a reciprocal tie) with nodes $i$ and $p$ is the structure where $e_{ip} = 1$ and $e_{pi} = 1$. Similarly, in a directed graph, a triplet of nodes $i$, $p$ and $q$ is a three-cycle whenever $e_{ip} = e_{pq} = e_{qi} = 1$; in an undirected graph, such a triplet is simply called a triangle.

Figure 2: Examples of basic network structures (full dyad, 2-star, 3-star, triangle, 2-in star, 2-out star, transitive triplet, three-cycle, and their attributed variants) in undirected, directed and attributed networks.

Network structure measures (NSMs) are functions of network structures capturing the tendencies highlighted by social theories; they can also be viewed as properties of social network graphs. For instance, the number of reciprocated ties measures the tendency for reciprocity in a community (Snijders et al. 2010).
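To make these structure definitions concrete, a full dyad and a three-cycle can be checked directly on an adjacency representation. The sketch below is illustrative only; the helper names and the dict-of-dicts encoding of the adjacency matrix $e$ are not from the paper.

```python
# Illustrative helpers for the structure definitions above, using a
# directed adjacency matrix e stored as a dict of dicts:
# e[i][p] == 1 iff the directed edge (i, p) exists.

def is_full_dyad(e, i, p):
    """True iff e_ip = 1 and e_pi = 1 (a reciprocated tie)."""
    return e.get(i, {}).get(p, 0) == 1 and e.get(p, {}).get(i, 0) == 1

def is_three_cycle(e, i, p, q):
    """True iff e_ip = e_pq = e_qi = 1 (a directed three-cycle)."""
    return (e.get(i, {}).get(p, 0) == 1 and
            e.get(p, {}).get(q, 0) == 1 and
            e.get(q, {}).get(i, 0) == 1)

# Toy directed network: 1 <-> 2 is a full dyad; 1 -> 2 -> 3 -> 1 cycles.
e = {1: {2: 1}, 2: {1: 1, 3: 1}, 3: {1: 1}}
print(is_full_dyad(e, 1, 2))       # True
print(is_three_cycle(e, 1, 2, 3))  # True
```

In an undirected graph the same triplet test, applied to a symmetric adjacency matrix, detects a triangle.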
Let $F_l(N_{X_j}(v_i))$, $l \in L$, denote the $l$th NSM in the local network of node $i$ in team $j$, where $L$ is the set of indices of the network structure types of interest to a researcher. For example, the number of edges in the local network of node $v_i$ is found using the respective NSM as
$$F_{edge}(N_{X_j}(v_i)) = \sum_{p \in X_j,\, p \neq i} e_{ip}.$$
Social science theories and their respective NSMs are explored next to explain how one can interpret the observed NSM quantities in a team or an individual's local network.

3.2. Social Science Theories for Interpreting Team Network Structure

There are several theories related to social networks that may explain relations between the social structure within a team and the team's work-related outcome. Network theories rooted in social science elucidate the creation, maintenance, dissolution, and reconstitution of organizational networks (Contractor et al. 2006), and also interpret social structures from the communication and individual-attribute perspectives. The theories relevant in the team formation context include the (1) social exchange, (2) homophily, (3) transitivity, (4) contagion, (5) network evolution, and (6) structural holes theories. Their connection with network structure quantifiable by NSMs can be established as follows.

The social exchange theory states that the inclination to have a communication tie from individual A to individual B is predicated on the presence of a communication tie from individual B to individual A (Contractor et al. 2006, Zhu et al. 2013). The main concern of exchange and dependence theories is the mutual relationship between pairs of network actors, called reciprocity. Since reciprocity facilitates information, knowledge and experience sharing between team members, it is an indicator of cooperation in the team. The number of full dyads in the local network of node $v_i$ in team $j$ can be expressed as
$$F_{fulldyad}(N_{X_j}(v_i)) = \sum_{p \in X_j} e_{ip} e_{pi}.$$
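The edge and full-dyad NSMs above are straightforward to evaluate over a member's team. The following sketch is illustrative (function names and the dict-of-dicts adjacency encoding are ours, not the authors'):

```python
# Illustrative NSM computations on a directed adjacency matrix adj,
# stored as a dict of dicts: adj[i][p] == 1 iff the edge (i, p) exists.

def f_edge(adj, i, team):
    """F_edge: number of edges from node i to its teammates."""
    return sum(adj.get(i, {}).get(p, 0) for p in team if p != i)

def f_full_dyad(adj, i, team):
    """F_fulldyad: number of reciprocated ties e_ip * e_pi at node i."""
    return sum(adj.get(i, {}).get(p, 0) * adj.get(p, {}).get(i, 0)
               for p in team if p != i)

# Toy directed team network: 1 <-> 2 is a full dyad; 1 -> 3 is not.
adj = {1: {2: 1, 3: 1}, 2: {1: 1}, 3: {}}
team = [1, 2, 3]
print(f_edge(adj, 1, team))       # 2 edges leave node 1
print(f_full_dyad(adj, 1, team))  # 1 reciprocated tie (with node 2)
```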
Homophily as a node-level theory suggests that individuals with similar attributes are more likely to properly communicate with one another. This theory explicitly takes into account individuals' attributes such as gender, age, education, organization type, etc.; attributes such as professional skills, knowledge, intelligence, leadership skills, job satisfaction, problem-solving skills, flexibility and motivation are vital factors in team success (Zhang et al. 2001, Zhu et al. 2013). Network structures pertaining to the homophily theory must incorporate individual attributes into models, with corresponding measures of the form
$$F_{ego\text{-}alter}(N_{X_j}(v_i)) = \sum_{p \in X_j} e_{ip} e_{pi} I_{ri} I_{rp}, \quad r \in \mathcal{A}.$$

Transitivity is an important factor impacting team outcome due to the role of information flow and communication between team members. This theory stresses an inclination toward consistency in relationships within a community, and hence expects better-functioning teams to exhibit higher levels of transitivity (Contractor et al. 2006). Different triplet-type-based NSMs inform a researcher of different variations of transitivity (Robins et al. 2007). As an example, the number of three-cycles in a weighted graph can be expressed as
$$F_{weighted\,triangle}(N_{X_j}(v_i)) = \sum_{p, q \in X_j,\, p \neq q \neq i} w_{ip} w_{qi} w_{pq}\, e_{ip} e_{pq} e_{qi}.$$

The contagion theory focuses on the tendency to "follow the crowd" in social networks. Detected by the prominence of k-star structures, this tendency indicates the popularity of certain individuals in a network. In directed networks, k-in star and k-out star structures illustrate the level of popularity. The presence of these social structures implies that individuals with higher indegree, or higher outdegree, are more attractive to individuals looking to form new ties (Snijders et al. 2010). In the team formation context, high contagion signals a strong team core and team cohesiveness, but may also indicate over-reliance of a team on a single individual in performing tasks.
Individuals with a high degree of popularity also help to maintain an effective advice network within a team and facilitate the spread of information (Borgatti and Halgin 2011). The number of k-stars can be computed as
$$F_{k\text{-}star}(N_{X_j}(v_i)) = \sum_{p_1, \ldots, p_k \in X_j,\; p_1 \neq \cdots \neq p_k \neq i} e_{ip_1} e_{ip_2} \cdots e_{ip_k}, \quad \forall\, k = 2, \ldots, K, \; K \leq n_j - 1.$$

The network evolution theory states that social networks are dynamic, which means that ties emerge and change over time (Snijders et al. 2010, Zhu et al. 2013). These relational changes may be viewed as a function of the existing social structure in a network. This idea implies that all nodes in the network act to increase their personal utility, or "happiness". The relevance of this theory for TFP is conceptual, since it motivates the consideration of local networks in team performance studies.

The theory of structural holes (Ronald 1992) argues that the shape of a local (ego-centered) network influences the amount of information that the ego node receives. As a result, the ego is supported by more non-redundant information at any given time, which provides the ego with the capability of performing better or being perceived as the source of new ideas (Borgatti and Halgin 2011). A network position where an ego benefits from the information flow within a team is called a structural hole (Ronald 1992): the abundance of structural holes in a network can be assessed by counting the k-star and triangle structures in it.
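A k-star count at a center node can also be obtained combinatorially: any choice of $k$ distinct within-team neighbors of $v_i$ yields one k-star. The sketch below uses this unordered-selection convention, which can differ from the ordered-tuple sum above by a factor of $k!$; the function name and encoding are ours.

```python
from math import comb

def f_k_star(adj, i, team, k):
    """Count k-stars centered at node i inside its team as C(deg_i, k),
    i.e., the number of ways to choose k distinct within-team neighbors
    (unordered convention)."""
    deg = sum(adj.get(i, {}).get(p, 0) for p in team if p != i)
    return comb(deg, k)

# Undirected toy star: node 0 tied to 1, 2, 3 (symmetric adjacency).
adj = {0: {1: 1, 2: 1, 3: 1}, 1: {0: 1}, 2: {0: 1}, 3: {0: 1}}
team = [0, 1, 2, 3]
print(f_k_star(adj, 0, team, 2))  # C(3, 2) = 3 two-stars
print(f_k_star(adj, 0, team, 3))  # C(3, 3) = 1 three-star
```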
Table 1: Social network theories, social structures and team outcome

Social network theory | Impact on team outcome      | Social structures
Social exchange       | cooperation                 | full dyad
Homophily             | individual's attributes     | ego-alter
Transitivity          | information sharing in team | triangle, k-star, edge
Contagion             | leadership                  | k-star
Network evolution     | individual's attributes     | k-star, dyadic covariate
Structural holes      | personal performance        | triangle

Models based on the theories summarized in Table 1 have been previously used in other SNA applications, particularly in network formation studies. A large branch of the SNA literature develops exponential random graph models (Robins et al. 2007). Stochastic actor-based models, introduced by Snijders et al. (2010), are also widely used; they, too, successfully utilize network structures based on individuals' attributes. Snijders et al. (2010) were the first to focus research attention on individuals' local networks. Following the same logic, this paper primarily considers TFP-SS instances where the aggregate, additive utility of team members is maximized.

4. Expressing Work-related Measurements Using NSMs

The objective of TFP is to optimize some aggregate measure of team performance. An integral component of a TFP-SS framework should relate team outcome with measurable network effects. Social network theories motivate and justify the use of NSMs in quantifying social structure. The next step is to establish an explicit relation between NSMs and team performance (e.g., taken from available data). This task can be accomplished similarly to the way that other social network properties, such as diameter and centrality measures, have been exploited in the earlier TFP literature (Abbasi and Altmann 2011, Lappas et al. 2009). According to the theory of structural holes, in many situations the work-related outcome of each team can be represented as a function of the NSMs computed over the team members' local networks.
Note that, in general, relying exclusively on local networks in TFP-SS performance computations may be incorrect (e.g., this approach is not valid for evaluating consulting teams). However, with nursing, rescue, and police teams, the consideration of local networks can certainly be justified. In most real-world applications, data on the prior individual performance of team members is recorded and can be accessed. Therefore, a performance function $P(X_j) = H(N_{X_j}(v_i),\, v_i \in X_j)$ can be approximated using general parametric models and parameter estimation techniques. There exists a variety of techniques that can be used to estimate $H$ from empirical data: regression, spline interpolation, neural networks, and machine learning, among others. Given the abundance of available literature on this topic (including the papers referenced above), this paper does not focus on such methods in detail. Note only that in certain situations the dependencies between team member outcomes must be taken into account, in which case simple methods such as linear regression may not work (Aral et al. 2009), and more complicated estimation techniques must be explored.

Assume that each individual's outcome is recorded, and also that the information on their local network structure is available. For illustrative purposes, assume that $H(X_j)$ can be expressed as a linear function of the NSMs computed over each team member's local network (in team $j$). Although the linearity assumption may reduce the accuracy of the model, as discussed, it offers a simple way to aggregate social structures and estimate the overall work-related outcome for team $j$,
$$P(X_j) = \sum_{i=1}^{n_j} \sum_{l \in L} \theta_l F_l(N_{X_j}(v_i)). \qquad (1)$$
Recall that the NSMs do allow one to focus on individual local networks. In (1), $\theta_l$ is the weight quantifying the contribution of the network structure, i.e., the strength of its corresponding theory, represented by the corresponding NSM; each such weight should be estimated using the available data.
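Under the linearity assumption, Eq. (1) can be sketched directly in code. Everything below is illustrative: the NSM callables and the weights $\theta_l$ are placeholders that a modeler would select and estimate from data.

```python
# Sketch of the linear outcome model of Eq. (1): P(X_j) is a weighted
# sum of NSMs over each member's local network within team j.

def team_outcome(team, nsm_funcs, theta, adj):
    """P(X_j) = sum_i sum_l theta_l * F_l(N_{X_j}(v_i))."""
    return sum(theta[l] * f(adj, i, team)
               for i in team
               for l, f in nsm_funcs.items())

def f_edge(adj, i, team):
    """Edge-count NSM: within-team edges incident to node i."""
    return sum(adj.get(i, {}).get(p, 0) for p in team if p != i)

# Undirected triangle on {1, 2, 3}: each member has 2 within-team edges.
adj = {1: {2: 1, 3: 1}, 2: {1: 1, 3: 1}, 3: {1: 1, 2: 1}}
theta = {"edge": 1.0}  # placeholder weight; estimated from data in practice
print(team_outcome([1, 2, 3], {"edge": f_edge}, theta, adj))  # 3 * 2 = 6.0
```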
Thereafter, a TFP-SS instance can be formally stated.

5. TFP-SS Formulation

Consider the problem of partitioning a group of $N$ individuals embedded in a social network, represented by graph $G(V, E)$, into $M$ teams so as to achieve the best outcomes across all the teams. The TFP-SS encompasses a variety of models; to specify a problem instance, a researcher must consider the following modeling components.

Individual or team network: the objective is to optimize a criterion function over all teams considering either individuals' local networks or the teams' networks. This distinction should be made based on the adopted outcome evaluation approach. As such, the team's network may be preferred for forming consultant teams; on the other hand, the theory of structural holes and the network evolution theory support the idea of using individuals' local networks in problems where the team outcome is the sum or the average of the individual member outcomes (Zhu et al. 2013).

Set of NSMs: as part of formulating TFP-SS, a researcher should select the set $L$ of network structure types, corresponding to the social network theories relevant in the problem setting. For example, individual attributes, as well as the corresponding NSMs, may be less important in certain applications.

Objective function: with the outcome of each team evaluated as in (1), a researcher should define a proper objective function. In measuring team outcome, one might be interested in the following objectives (among others): (1) optimizing the average outcome across all teams,
$$\max_{X_j} \left\{ \frac{\sum_{j=1}^{M} P(X_j)}{M} \right\};$$
(2) maximizing the outcome of the weakest team,
$$\max_{X_j} \left\{ \min_{j=1,\ldots,M} P(X_j) \right\};$$
(3) minimizing the absolute deviation from the average outcome,
$$\min_{X_j} \left| P(X_j) - \frac{\sum_{j=1}^{M} P(X_j)}{M} \right|;$$
(4) minimizing the squared deviation from the average outcome,
$$\min_{X_j} \left[ P(X_j) - \frac{\sum_{j=1}^{M} P(X_j)}{M} \right]^2.$$

Network type: networks with different types of edges, e.g., (un)directed and weighted, may lead to different models.
For instance, 2-out stars are not even defined on undirected networks.

The goal of this paper is to present a general methodology for treating TFP-SS problems. In order to illustrate the application of the resulting framework, a special-case instance of TFP-SS is considered in detail.

5.1. TFP-SS: A Special-Case Instance

The presented framework is designed to generate and treat TFP-SS models using NSMs. In order to investigate the tractability of the resulting models, this section first considers a non-trivial TFP-SS special-case instance (TFP-SSS). The TFP-SSS is defined on an undirected, unweighted graph representing a social network of individuals with identical skills. The choice of network structure types included for modeling TFP-SSS is limited to edge, 2-star, 3-star and triangle, $L = \{edge,\, 2\text{-}star,\, 3\text{-}star,\, triangle\}$; these social structures are the most common in network formation modeling. In the ensuing computational study, experiments with TFP-SS instances based on directed networks are also included, so as to demonstrate the comprehensiveness and flexibility of the presented framework: the instances with directed networks use the full dyad, 2-in star, 2-out star, three-cycle and transitive triplet NSMs. As stated, TFP-SSS is a realistic problem relevant for assembling professional teams (e.g., nursing, rescue, police, security, sports, etc.), where a minimum level of expertise is uniformly required of all team members. Such teams usually complete tasks under stressful conditions, and the effectiveness of working within a team structure is more important in this case than small differences in individual qualification. In TFP-SSS, the average outcome of the teams is maximized based on the individuals' local networks, which is mathematically equivalent to maximizing the sum of the outcomes over all the teams,
$$\max \sum_{j=1}^{M} P(X_j). \qquad (2)$$
It should be noted that the local network of any given node includes the node itself, its immediate neighbors and the links between them all. To visualize a particular instance of TFP-SSS, consider a social network of size 16 in Figure 3.

Figure 3: Four teams in the given social network.

Table 2: Network structure measure values for nodes in Team 1

Node | No. of edges | No. of 2-stars | No. of 3-stars | No. of triangles
1    | 3            | 3              | 0              | 1
2    | 4            | 3              | 1              | 1
3    | 1            | 0              | 0              | 0
4    | 3            | 3              | 0              | 1

Quantifying the outcomes of four teams of size four amounts to computing NSMs as illustrated in Table 2 for the nodes in Team 1. The described instance of TFP-SSS is formally stated:

Instance: A graph $G(V, E)$, $|G| = N$; $n_j \in \mathbb{Z}^+$ for $1 \leq j \leq M$; a partition into disjoint sets $X_1, X_2, \ldots, X_M$, where $X_j \subseteq G(V, E)$ for $1 \leq j \leq M$; and $\theta_l \in \mathbb{R}$ for $l \in L$.

Question: Is there a partition of $V$ into $M$ disjoint subsets $X_1 \cup X_2 \cup \ldots \cup X_M$, with $|X_j| = n_j$, such that $\sum_{j=1}^{M} P(X_j)$ is maximized, where $P(X_j) = \sum_{i=1}^{n_j} \sum_{l \in L} \theta_l F_l(N_{X_j}(v_i))$ for $1 \leq j \leq M$ and $\sum_{j=1}^{M} n_j = N$?

The following theorem states that TFP-SSS is NP-hard.

Theorem 1. TFP-SSS with $M$ teams to be formed out of $N$ individuals is NP-hard.

Proof. The presented TFP-SS instance is NP-hard by a polynomial-time reduction from Partition into Triangles (see the Appendix).

The number of all feasible solutions in TFP-SSS is
$$\Gamma(N, M) = \binom{N}{n_1} \binom{N - n_1}{n_2} \cdots \binom{N - \sum_{j=1}^{M-1} n_j}{n_M},$$
with $n_1 + n_2 + \ldots + n_M = N$. The quantity $\Gamma(N, M)$ shows how quickly the number of feasible solutions grows as $N$ and $M$ increase. Since TFP-SSS is NP-hard, one cannot expect to obtain an exact algorithm that solves it in polynomial time. However, for small problems, an exact method can be designed to search for an optimal solution. The next section presents an Integer Programming (IP) formulation of TFP-SSS.
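The product of binomial coefficients defining $\Gamma(N, M)$ telescopes to the multinomial coefficient $N!/(n_1! \, n_2! \cdots n_M!)$, which is easy to evaluate; a small sketch (the function name is ours):

```python
from math import factorial

def gamma(team_sizes):
    """Gamma(N, M): number of ways to split N labeled individuals into
    labeled teams of the given sizes, i.e., N! / (n_1! n_2! ... n_M!)."""
    n = sum(team_sizes)
    g = factorial(n)
    for s in team_sizes:
        g //= factorial(s)
    return g

# The 16-node example of Figure 3, split into four teams of four:
print(gamma([4, 4, 4, 4]))  # 63063000 feasible assignments
```

Even this small instance admits over 63 million feasible partitions, illustrating the combinatorial growth noted above.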
5.2. An IP Formulation of TFP-SSS

Define a set of decision variables for TFP-SSS:
$$y_{ij} = \begin{cases} 1 & \text{if node } i \text{ is assigned to team } j, \\ 0 & \text{otherwise}, \end{cases} \quad \text{for every } i \in I \text{ and } j \in J,$$
where $I = \{1, \ldots, N\}$ and $J = \{1, \ldots, M\}$. The objective function of TFP-SSS is nonlinear, since it includes products of the decision variables:
$$\max \sum_{j=1}^{M} \sum_{i=1}^{N} \left( \theta_1 \sum_{p=1}^{N} e_{ip} y_{ij} y_{pj} + \theta_2 \sum_{p=1}^{N} \sum_{q=1}^{N} e_{ip} e_{iq} y_{ij} y_{pj} y_{qj} + \theta_3 \sum_{o=1}^{N} \sum_{p=1}^{N} \sum_{q=1}^{N} e_{io} e_{ip} e_{iq} y_{ij} y_{oj} y_{pj} y_{qj} + \theta_4 \sum_{p=1}^{N} \sum_{q=1}^{N} e_{ip} e_{pq} e_{qi} y_{ij} y_{pj} y_{qj} \right). \qquad (3)$$
To linearize (3), variables $w_{iopqj}$, $z_{ipqj}$ and $x_{ipj}$ are introduced to replace the terms $y_{ij} y_{oj} y_{pj} y_{qj}$, $y_{ij} y_{pj} y_{qj}$, and $y_{ij} y_{pj}$, respectively. An integer programming formulation of TFP-SSS is then
$$\max \sum_{j=1}^{M} \sum_{i=1}^{N} \left( \theta_1 \sum_{p=1}^{N} e_{ip} x_{ipj} + \theta_2 \sum_{p=1}^{N} \sum_{q=1}^{N} e_{ip} e_{iq} z_{ipqj} + \theta_3 \sum_{o=1}^{N} \sum_{p=1}^{N} \sum_{q=1}^{N} e_{io} e_{ip} e_{iq} w_{iopqj} + \theta_4 \sum_{p=1}^{N} \sum_{q=1}^{N} e_{ip} e_{pq} e_{qi} z_{ipqj} \right) \qquad (4)$$
subject to:
$$\sum_{i=1}^{N} y_{ij} = n_j \quad \forall\, j, \qquad (5)$$
$$\sum_{j=1}^{M} y_{ij} = 1 \quad \forall\, i, \qquad (6)$$
$$w_{iopqj} \geq y_{ij} + y_{oj} + y_{pj} + y_{qj} - 3 \quad \forall\, i, o, p, q, j,\; o \neq p \neq q, \qquad (7)$$
$$w_{iopqj} \leq \frac{y_{ij} + y_{oj} + y_{pj} + y_{qj}}{4} \quad \forall\, i, o, p, q, j,\; o \neq p \neq q, \qquad (8)$$
$$z_{ipqj} \geq y_{ij} + y_{pj} + y_{qj} - 2 \quad \forall\, i, p, q, j,\; p \neq q, \qquad (9)$$
$$z_{ipqj} \leq \frac{y_{ij} + y_{pj} + y_{qj}}{3} \quad \forall\, i, p, q, j,\; p \neq q, \qquad (10)$$
$$x_{ipj} \geq y_{ij} + y_{pj} - 1 \quad \forall\, i, p, j, \qquad (11)$$
$$x_{ipj} \leq \frac{y_{ij} + y_{pj}}{2} \quad \forall\, i, p, j, \qquad (12)$$
$$y_{ij},\, x_{ipj},\, z_{ipqj},\, w_{iopqj} \in \{0, 1\},$$
where $\theta_l$, $l \in L$, and $e_{ip}$ are known parameters and $o, p, q \in I$. The constraints in (5) ensure that team $j$, $j = 1, \ldots, M$, has exactly $n_j$ members (alternatively, these equality constraints can be replaced by upper-bound constraints). The constraints in (6) ensure that no individual is assigned to more than one team. The constraints in (7)-(12) are needed to linearize the model. Finally, all the decision variables are required to be binary. Note that the TFP-SS instances with directed networks have IP formulations similar to that of TFP-SSS. Observe that while the present model is linear, the number of constraints in it quickly increases with problem size.
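Each pair of linearization constraints combines a lower and an upper bound that, for binary $y$, leave exactly one admissible value for the auxiliary variable, namely, the product it replaces. This can be verified exhaustively for the two-, three- and four-factor cases (a standalone check, not the authors' code):

```python
from itertools import product

def forced_value(ys):
    """Given binary y-values, return the only binary value the paired
    lower/upper linearization constraints admit for their product term:
    aux >= sum(ys) - (len(ys) - 1)  and  aux <= sum(ys) / len(ys)."""
    k, s = len(ys), sum(ys)
    candidates = [a for a in (0, 1) if a >= s - (k - 1) and a <= s / k]
    assert len(candidates) == 1, "constraints leave aux undetermined"
    return candidates[0]

# For every binary assignment, the admissible aux equals the product:
# k = 2, 3, 4 correspond to x_ipj, z_ipqj and w_iopqj, respectively.
for k in (2, 3, 4):
    for ys in product((0, 1), repeat=k):
        prod = 1
        for y in ys:
            prod *= y
        assert forced_value(list(ys)) == prod
print("linearization forces aux = product for k = 2, 3, 4")
```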
For $N$ individuals to be assigned to $M$ teams, the IP has $M(N^4 - 5N^3 + 9N^2 - 4N)$ variables and $2M(N^4 - 5N^3 + 9N^2 - 5N) + M + N$ constraints. For example, with $N = 30$ and $M = 5$ (a small instance at first glance), the IP has 3,414,900 variables and 6,829,535 constraints; one can still expect the IP to solve successfully for smaller problems, however. Note that in most real-world applications, the number of individuals to be managed exceeds 50 (e.g., think of a nursing department in a typical health care center). Hence, an efficient sub-optimal algorithm is in order for dealing with such instances, as well as with other versions of TFP-SS.

6. Variable Depth Neighborhood Search Algorithm for TFP-SS

This section presents an efficient algorithm for TFP-SS, called LK-TFP. The idea of variable depth neighborhood search is well recognized for its success in treating the Traveling Salesman Problem (TSP). Proposed by Lin and Kernighan (Lin and Kernighan 1973), the idea was revised and implemented in LKH, an exceptionally effective heuristic for symmetric TSP, by Helsgaun (2000). The LKH algorithm performs λ-opt sequential moves, where in each step λ links in the network representing the current solution are replaced by another λ links (Helsgaun 2000). The variable depth neighborhood idea based on λ-opt search has found little success in applications beyond TSP and vehicle routing problems (Kothari and Ghosh 2013, Salari and Naji-Azimi 2012). Being similar to TSP in that the team assignments can be visualized as a Hamiltonian tour, TFP-SS appears to be a suitable problem for implementing the sequential move idea.

Figure 4: λ-moves for λ = 2, 3, 4 and 5.

6.1. LK-TFP Algorithm for TFP-SS

In LK-TFP, one feasible solution is within a λ-move of another if such a move improves the objective function value by exchanging λ individuals from different teams (similar to replacing λ links in TSP); see the examples in Figure 4.
Updating the objective function of TFP-SSS is a relatively expensive computational operation. A naive implementation re-computes the objective function in $O(n^4)$ operations, driven by recomputing the most complex network structure measure (in TFP-SSS, the 3-star structure). The idea of a branching algorithm for TFP-SS is to build a tree of solutions to be traversed (see Figure 5). The solutions at the branches of the tree are obtained by executing λ-moves. LK-TFP traverses the tree to arrive at a solution that improves the objective function, signifying the completion of the current sequential move. LK-TFP identifies good branches of the tree and avoids visiting too many non-improving solutions by cutting off the search space. To describe the tree traversal in detail, some terminology is required. Level $i$, $i \leq N$, of the tree encompasses all the solutions reachable by a single $i$-move from the initial solution (i.e., the one at the root of the tree). In the tree, the root node (a single node at level 1) represents a feasible solution of TFP-SS, which can be a random initial solution or a current best known solution, necessarily feasible. The internal tree nodes (at level $i$, $i \leq N$) represent solutions resulting from $i$-moves performed on the root solution. Importantly, each internal node also stores an incomplete, i.e., infeasible, solution. Leaf nodes (at the lowest possible tree level) are those where the algorithm must stop the search because of the pre-set branching rules (e.g., any team can have only one member participating in the same move). A branch between two or more nodes represents a directed correspondence between two or more solutions.
Let $ego_{ij}$ denote the individual that has been removed from a team, perturbing the solution at node $j$, $j \in J_i$, at level $i$, $i \in I$, of the tree, where $J_i$ and $I$ are the number of nodes at level $i$ of the tree and the number of levels in the tree, respectively.

Figure 5: General search tree structure of LK-TFP. The algorithm sequentially traverses the infeasible solution space until one infeasible solution with an acceptable (improving) feasible counterpart is found.

An individual is called a friend of the ego if there exists a social (i.e., network) link between them. The set of $ego_{ij}$'s friends is denoted by $F_{ij}$, $j \in J_i$, $i \in I$. When $ego_{ij}$ joins another team, one of this team's current members who is not friends with $ego_{ij}$ must leave the team. These candidates for leaving the team form a set denoted by $L_{ij}$, $j \in J_i$, $i \in I$. The algorithm starts by branching from the root, selecting individuals one by one to be the ego; each ego attempts to join friends on other teams. If no such friends exist, two individuals within the minimum distance from the ego are selected. Then, the leaving individuals are determined. For each leaving node in $L_{ij}$, a branch is added, pointing at a new node at the next level of the tree. For instance, suppose that ego node 1 is on team A, with its friend nodes 4 and 8 on teams B and C, respectively.
As node 1 is added to team B or C, with an individual who is not friends with node 1 leaving that team, a branch is added to the tree. As the potential leaving nodes join other teams (in turn), branches are added to the lower levels of the tree. At every tree node, e.g., with individual $j$ as the ego at level $i$, the algorithm studies two solutions: one feasible and one infeasible. Infeasible solutions are generated by replacing the ego with one of the leaving nodes (thus creating a duplicate of the latter). Feasible solutions are obtained by completing an infeasible move, i.e., having the ego replace the leaving node. Whenever an improving feasible solution is found, the algorithm is restarted, with the new best solution at the root of a new tree. If the feasible solution is non-improving, then the gain of the corresponding infeasible solution is computed as $g_{ij} = f(v_p) - f(v_q)$, where $g_{ij}$ is the gain of the current $i$-move at tree node $j$, and $f(v_p)$ and $f(v_q)$ are the incremental changes in the objective function incurred by switching individuals $p$ and $q$. The algorithm continues to branch if $g_{ij}$ is positive; otherwise, the tree node and all its subsequent branches are fathomed. Whenever a solution tree is completely traversed with no improving feasible solution found, the algorithm stops. In the computational experiments reported in this paper, LK-TFP was implemented with the depth-first tree search strategy, thus benefiting from more efficient memory usage. The algorithm's key steps are outlined in the pseudocode of Algorithms 1 and 2.

Algorithm 1 LK-TFP Algorithm for TFP-SS
public void main()
1. Initialize $X_j$, $j = 1, \ldots, M$, such that $\bigcup_{j=1}^{M} X_j = V$ and $\bigcap_{j=1}^{M} X_j = \emptyset$;
2. Calculate the objective function $\sum_{j=1}^{M} P(X_j)$;
3. for (t = 1; t ≤ Tmax; t++)
4.   depthNeighborhoodSearch(); /* executing a tree-based search */
5.   if (bestSolutionValue ≤ currentSolutionValue)
6.     recordBestSolution(); /* recording the best solution */
7.   end if
8. end for
end main

In the presented version, LK-TFP was found to be very efficient in solving TFP-SS instances. A discussion of the possible reasons for such performance follows.

Algorithm 2 Depth Neighborhood Search
public type depthNeighborhoodSearch()
1. while (notVisitedNodes.hasNext())
2.   for (int i = 1; i ≤ I; i++) /* level i */
3.     for (int j = 1; j ≤ Ji; j++) /* node j in level i */
4.       do
5.         find $ego_{ij} = v_q$, $F_{ij}$ and $v_p \in L_{ij}$;
6.         exchange $v_q$ and $v_p$; /* exchanging the ego */
7.         replace $v_q$ with $v_p$; /* replacing the ego */
8.         $g_{ij} = f(v_p) - f(v_q)$;
9.         $G_i = \sum_i g_{ij}$; /* calculate gain */
10.        depthFirstSearch(); /* depth-first search */
11.      while ($G_i > 0$)
12.    end for
13.  end for
14.  updateVisitedNodes(); /* update the list of the visited nodes */
15. end while

6.2. Performance Analysis of LK-TFP

LK-TFP is remarkably efficient in solving TFP-SS, similarly to LKH for TSP, and this warrants a discussion of the reasons for such performance. Algorithms using λ-opt search usually face a serious drawback: in order to provably find an optimum, n-opt should be applied with large n, which is computationally infeasible in non-trivial problem instances. LK-TFP avoids this difficulty by employing an intelligent search strategy. At each step, the algorithm checks the necessity of increasing the value of λ in λ-opt moves, and it considers a growing set of potential exchanges. If these exchanges improve the objective function and the exploration yields a better feasible solution, they are always accepted and λ increases by one. In terms of tree search, increasing λ is equivalent to going deeper into the search tree.
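For intuition, a greatly simplified stand-in for LK-TFP is sketched below: a first-improvement 2-move (pairwise-swap) local search on a plain within-team edge-count objective. LK-TFP itself chains such exchanges into deeper sequential moves guided by cumulative gain; everything here (names, objective, data) is illustrative only, not the authors' implementation.

```python
# Simplified 2-move (pairwise swap) local search on the edge-count
# objective of TFP-SSS. Adjacency: adj[i][p] == 1 iff edge (i, p) exists.

def objective(teams, adj):
    """Sum over teams of the number of within-team edges."""
    return sum(adj.get(i, {}).get(p, 0)
               for t in teams for i in t for p in t if i < p)

def try_improve(teams, adj):
    """Perform the first objective-improving pairwise swap, if any."""
    base = objective(teams, adj)
    for a in range(len(teams)):
        for b in range(a + 1, len(teams)):
            for ia in range(len(teams[a])):
                for ib in range(len(teams[b])):
                    teams[a][ia], teams[b][ib] = teams[b][ib], teams[a][ia]
                    if objective(teams, adj) > base:
                        return True
                    # undo the non-improving swap
                    teams[a][ia], teams[b][ib] = teams[b][ib], teams[a][ia]
    return False

def two_move_search(teams, adj):
    while try_improve(teams, adj):
        pass
    return teams

# Two 4-cliques {1,2,3,4} and {5,6,7,8}, deliberately split across teams.
adj = {i: {} for i in range(1, 9)}
for grp in ([1, 2, 3, 4], [5, 6, 7, 8]):
    for i in grp:
        for p in grp:
            if i != p:
                adj[i][p] = 1
teams = two_move_search([[1, 2, 5, 6], [3, 4, 7, 8]], adj)
print(objective(teams, adj))  # 12: both cliques reunited (6 edges each)
```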
Starting with a feasible solution, the algorithm repeatedly executes the exchanges guided by the incremental gain $g_i$ until either the whole tree is traversed and the algorithm stops, or an improving feasible solution is reached, in which case the algorithm is restarted. Additionally, the sequential exchange criterion (Helsgaun 2000) in LK-TFP enables internal nodes, which would otherwise be missed due to a premature fathoming, to be visited. To fathom a node, the positive gain criterion plays an important role. At each level of the tree, $g_i = f(v_j) - f(v_k)$ is the gain resulting from the exchange, and $G_i = \sum_{k=1}^{i} g_k$ is the cumulative gain obtained. Using $G_i$ as a branch-cutting criterion has a significant impact on the algorithm's efficiency. The gain consideration prevents the algorithm from traversing very deeply in the tree and enables more rapid fathoming when there is no improvement in exchanging. At first glance, this stopping criterion may appear to be too restrictive. However, thanks to multiple permutations, i.e., sequences in which the same exchanges can be performed, whenever an improving exchange is aborted, it is still guaranteed to be discovered on a branch under another internal node in the tree. Lin and Kernighan proved this by showing that for a sequence of numbers with a positive sum, there exists a cyclic permutation of these numbers such that every partial sum is positive (Helsgaun 2000).

Figure 6: An example of an improving sequence of moves.

Consider Figure 6, which elicits the relation of the gain $g_{ij}$ to the partial sum $G_i$ for a five-opt sequential move. Suppose a currently explored subsequence has a negative gain (e.g., $G_2 = -2 < 0$) resulting from move 2; if it is part of an improving larger sequence, then this subsequence will be found later by traversing the tree from another node (starting from node 5). It means that an improving sequential move will be found.
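The cyclic-permutation fact used above is easy to demonstrate: if a gain sequence has a positive total, some rotation of it keeps every partial sum positive, so an improving sequential move starting from another node always exists. A sketch (the sample gains are illustrative, not read off Figure 6):

```python
# Demonstration of the Lin-Kernighan cyclic-permutation fact: a gain
# sequence with a positive sum has a rotation whose partial sums are
# all positive.

def positive_rotation(gains):
    """Return a rotation of `gains` with all partial sums > 0, or None."""
    n = len(gains)
    for r in range(n):
        rotated = gains[r:] + gains[:r]
        partial, ok = 0, True
        for g in rotated:
            partial += g
            if partial <= 0:
                ok = False
                break
        if ok:
            return rotated
    return None

# An illustrative five-move gain sequence summing to +1:
gains = [1, -2, 2, -3, 3]
print(positive_rotation(gains))  # [3, 1, -2, 2, -3]
```

The first three rotations fail (their partial sums dip to or below zero), but starting the same exchanges from the last move succeeds, mirroring how an aborted subsequence resurfaces under another tree node.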
Furthermore, there are some other rules which are useful to increase the search efficiency. Recently performed exchanges are stored in memory for a small number of iterations. This memory helps to escape from local optima. Moreover, LK-TFP uses the two best friend rule similar to LK which restricts the search to the five nearest neighbours. The idea of this rule is to place friends in the same teams. This heuristic rule is based on expectations of which individuals are 21 more probably teammates in the optimal solution. In other words, this rule treats the problem like a puzzle and attempts to complete it by placing right individuals in teams in each step. After each run, the best solution is placed in the root of tree and the search process continues. Results of LK-TFP for different size problems are reported in the next section. 7. Computational Experiences with TFP-SS This section explores the performance of LK-TFP on small-, medium- and large-sized TFP-SSS instances, including those incorporating directed networks. These problems are solved by both IP (using CPLEX) and LK-TFP (implemented in JAVA). The experiments were performed using a desktop with Intel(R) Core(TM)i5 3GHz processor, 8GB RAM and 64 bit operating system. Overall, LK-TFP was executed at least 30 times for each instance to explore the variability of the results over different runs, and thus, gauge the robustness of the algorithm. CPLEX is only able to solve small size problems (less than 20 nodes) but for larger problems more memory is required. 7.1. TFP-SSS with Undirected Social Networks A majority of existing social network datasets are undirected. TFP-SSS was first solved for several real-world undirected networks including Zachary Karate Club (ZKC), Florentine Families (FF), Kapferer Mine (KM), Taro Exchange (TE), Western Electric Employees (WEE), Thurman Office (TO) and Bernard & Killworth (BK) (retrieved from http://vlado.fmf.uni-lj. 
si/pub/networks/data/), as well as for synthetic random networks. Without loss of generality, the weights θ_l, l ∈ L in (1) are set to be equal.

In the first experimental setup, the ZKC data, capturing a friendship network of 34 karate club members observed over a two-year period at a US university in the 1970s, is exploited (Girvan and Newman 2002). Figure 7 depicts the network divided into two teams of 17: node colors indicate team assignments.

[Figure 7: Dividing the ZKC network into two teams of size 17.]

Table 3 shows the results of both CPLEX and LK-TFP runs with the ZKC network and varied team sizes.

Table 3: Simulation results of the heuristic algorithm and CPLEX runs across the ZKC network (NA: not available).

Network  Teams  Optimum  LK-TFP Solution (Min/Ave/Max)  CPLEX Solution  LK-TFP Time (s) (Min/Ave/Max)  CPLEX Time (s)
ZKC      2      NA       872 / 874.8 / 877              NA              12 / 156.2 / 360               NA
ZKC      4      NA       103 / 123.75 / 141             NA              1 / 17.3 / 36                  NA
ZKC      5      NA       218 / 230.7 / 239              NA              67 / 379.15 / 731              NA

The FF network represents social relations, including business ties and marriage alliances, among 16 Renaissance Florentine families. Figures 8(A) and 8(B) depict the division of the FF network into three and four teams, respectively: nodes of the same color are on the same team.

[Figure 8: Dividing the Florentine Families network into (A) three and (B) four teams.]

Table 4 reports the performance of CPLEX and LK-TFP with the FF network, with a varied number of teams.

Table 4: Simulation results of the heuristic algorithm and CPLEX runs across the FF network.

Network  Teams  Optimum  LK-TFP Solution (Min/Ave/Max)  CPLEX Solution  LK-TFP Time (s) (Min/Ave/Max)  CPLEX Time (s)
FF       2      137      137 / 137 / 137                137             0.08 / 0.9 / 4                 327
FF       3      96       96 / 96 / 96                   96              1.3 / 2.95 / 8.5               511
FF       4      59       51 / 56.5 / 59                 59              3.5 / 10.5 / 21                1630

The Kapferer Mine network connects 15 miners working on the surface in a mining operation in Zambia (then Northern Rhodesia). The KM network with two and three teams is illustrated in Figure 9.

[Figure 9: Dividing the Kapferer Mine network into (A) two and (B) three teams.]

The Taro Exchange network represents the relation of gift-giving among 22 households in a Papuan village. Figure 10 depicts the TE network divided into two and three teams, respectively.

[Figure 10: Dividing the Taro Exchange network into (A) two and (B) three teams.]

The Western Electric Employees network captures relations between 14 Western Electric (Hawthorne Plant) employees from the bank wiring room participating in horseplay. Using LK-TFP, the network is divided into two and three teams, as shown in Figures 11(A) and 11(B), respectively.

[Figure 11: Dividing the Western Electric Employees network into (A) two and (B) three teams.]

The Thurman Office network outlines the interactions among 15 employees in the overseas office of a large international corporation, based on informal relationships. Figure 12 shows the results of optimal team assignment with two and three teams.

[Figure 12: Dividing the Thurman Office network into (A) two and (B) three teams.]

Table 5 summarizes the results of running LK-TFP and CPLEX across all networks mentioned above. Similarly, the results of both CPLEX and LK-TFP runs for the generated problems are presented in Table 6. The obtained results confirm that LK-TFP is effective and efficient in solving even large-scale TFP-SSS instances. As the results illustrate, IP techniques can handle only small problems (up to 16 individuals and five teams). For larger problems, more computational resources are required. For the instances with up to 30 individuals and four teams, CPLEX was not able to report any incumbent before running out of memory. However, LK-TFP quickly identified optimal solutions for small problems. For medium- and large-sized problems, the runtime of LK-TFP is still remarkable. With large N and small M, LK-TFP does take longer to identify good solutions; naturally, the individuals' local networks get larger in such instances.

Table 5: Simulation results of LK-TFP and CPLEX runs across the KM, TE, WEE, TO and BK networks.

Network  Teams  Optimum  LK-TFP Solution (Min/Ave/Max)  CPLEX Solution  LK-TFP Time (s) (Min/Ave/Max)  CPLEX Time (s)
KM       2      133      133 / 133 / 133                133             3.51 / 8.23 / 12.9             325
KM       3      72       65 / 71.6 / 72                 72              2.15 / 6.4 / 18.5              616
TE       2      309      297 / 306.72 / 309             309             2.3 / 17.6 / 45                6048
TE       3      259      247 / 253.4 / 259              259             38.45 / 93.9 / 256.1           19,308
WEE      2      548      548 / 548 / 548                548             1.1 / 3.5 / 8.1                109
WEE      3      247      247 / 247 / 247                247             0.9 / 11.12 / 23.5             352
TO       2      262      255 / 259.5 / 262              262             12.91 / 30.73 / 45.65          1244
TO       3      69       62 / 65.36 / 69                69              21.5 / 101.6 / 151.39          17,039
BK       2      NA       4035 / 4112.8 / 4150           4060            68 / 269.5 / 329.2             133,380
BK       3      NA       932 / 968.21 / 978             NA              214.5 / 516.3 / 1263.83        NA

Table 6: Simulation results of the heuristic algorithm and CPLEX runs across the generated problems.

Candidates  Teams  Optimum  LK-TFP Solution (Min/Ave/Max)  CPLEX Solution  LK-TFP Time (s) (Min/Ave/Max)  CPLEX Time (s)
6           2      24       24 / 24 / 24                   24              X / X / X                      1
10          2      108      108 / 108 / 108                108             0.001 / 0.07 / 0.11            43
16          2      317      317 / 317 / 317                317             0.001 / 0.26 / 0.89            628
16          4      111      111 / 111 / 111                111             0.02 / 4.1 / 13                28,348
16          5      78       78 / 78 / 78                   78              0.05 / 1.76 / 4                31,591
20          2      900      900 / 900 / 900                900             1 / 16.3 / 50                  37,405
20          4      NA       240 / 247 / 254                240             1.2 / 14.03 / 28               192,674
20          5      NA       158 / 163.2 / 169              149             0.9 / 8 / 18                   163,423
30          2      NA       2805 / 2841.9 / 2916           1523            61 / 578.65 / 1305             218,912
30          4      NA       451 / 474.7 / 555              NA              12 / 44 / 84                   NA
30          5      NA       441 / 460.7 / 495              NA              16 / 39.59 / 85                NA
30          10     NA       93 / 98.2 / 101                NA              9 / 12.8 / 26                  NA
40          2      NA       4723 / 4804.1 / 4943           NA              135 / 1244.74 / 2307           NA
40          10     NA       1253 / 1281.6 / 1323           NA              41 / 127.26 / 237              NA
50          5      NA       1722 / 1849.7 / 1931           NA              71 / 218.5 / 419               NA
50          10     NA       426 / 439.2 / 461              NA              15 / 39.48 / 64                NA

7.2. TFP-SS with Directed Social Networks

There exist cases where the relationships between individuals in a social network are not necessarily bidirectional.
Incorporating the direction of relations into TFP-SS requires the use of other types of NSMs, such as the full dyad, 2-out-star, 3-cycle and transitive triplet. The directed TFP-SSS instances were created with the Dickson Bank Wiring (DBW), Thurman Organizational Chart (TOC), Soccer Teams (ST) (available at http://vlado.fmf.uni-lj.si/pub/networks/data/) and Advogato networks.

The Dickson Bank Wiring network consists of 14 Western Electric employees helping each other with work. Figure 13 shows how this directed network can be split into two and three teams, respectively.

[Figure 13: Dividing the Dickson Bank Wiring network into (A) two and (B) three teams.]

The Thurman Organizational Chart (TOC) indicates the formal relationships (the organizational chart) of 15 employees of a large international corporation. Based on this directed network, employees are assigned to two and three teams, as depicted in Figure 14.

[Figure 14: Dividing the Thurman Organizational Chart network into (A) two and (B) three teams.]

The Soccer Teams network represents the relationships between 35 countries in terms of exporting soccer players: a link from country A to country B means that country A exports soccer players to country B. Figure 15 shows how these countries can be divided into two subgraphs (teams).

[Figure 15: Dividing the Soccer Teams network into two teams.]

Table 7 summarizes the results of CPLEX and LK-TFP runs for the directed networks described above.

Table 7: Simulation results of LK-TFP and CPLEX runs across the DBW, TOC and ST networks.

Network  Teams  Optimum  LK-TFP Solution (Min/Ave/Max)  CPLEX Solution  LK-TFP Time (s) (Min/Ave/Max)  CPLEX Time (s)
DBW      2      36       36 / 36 / 36                   36              0.05 / 0.9 / 3.1               363
DBW      3      30       30 / 30 / 30                   30              1.09 / 2.8 / 7.9               324
TOC      2      48       42 / 45.8 / 48                 48              4.6 / 12.8 / 38                468
TOC      3      21       21 / 21 / 21                   21              9 / 32.1 / 58.03               568
ST       2      761      749 / 754.9 / 761              761             239.8 / 580.15 / 789.4         42,757
ST       3      NA       345 / 350.1 / 368              NA              1208 / 1875.7 / 2349.8         NA
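The directed NSMs named above can be counted by brute force over vertex triples. A minimal sketch (helper names are illustrative, not from the paper) that counts 3-cycles and transitive triplets in a directed network:

```python
from itertools import permutations

def directed_nsm_counts(nodes, arcs):
    """Count two directed network structure measures (NSMs): 3-cycles and
    transitive triplets. `arcs` is a collection of (tail, head) pairs."""
    arcs = set(arcs)
    cycles = transitive = 0
    for u, v, w in permutations(nodes, 3):
        if (u, v) in arcs and (v, w) in arcs:
            if (w, u) in arcs:       # u -> v -> w -> u closes a cycle
                cycles += 1
            if (u, w) in arcs:       # u -> v -> w with shortcut u -> w
                transitive += 1
    # each 3-cycle is generated once per rotation of its vertices
    return cycles // 3, transitive

# a -> b -> c -> a is one 3-cycle and contains no transitive triplet
print(directed_nsm_counts("abc", {("a", "b"), ("b", "c"), ("c", "a")}))
```

In a team-quality function, such counts would be restricted to arcs whose endpoints both lie within the team being evaluated.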
Importantly, in settings where many people must be grouped into working teams, e.g., in emergency situations or large projects, the scalability of computational methods becomes an issue. This section explores this issue with TFP-SS instances with directed networks. To this end, a sample with 500 users of the trust network of Advogato, an online community platform for software developers constructed in 1999 (KONECT: http://konect.uni-koblenz.de/networks/advogato), is taken to formulate three TFP-SS instances. Table 8 summarizes the results.

Table 8: Simulation results of the heuristic algorithm and CPLEX runs across the Advogato network.

Candidates  Teams  Optimum  LK-TFP Solution (Min/Ave/Max)  CPLEX Solution  LK-TFP Time (s) (Min/Ave/Max)  CPLEX Time (s)
500         5      NA       1820 / 1991.8 / 2191           NA              787 / 7846 / 21,940            NA
500         10     NA       590 / 606.3 / 632              NA              896 / 11,918.5 / 32,101        NA
500         20     NA       458 / 520.1 / 584              NA              453 / 9976.8 / 28,356          NA

This trust network is an appropriate instance illustrating how TFP-SS can be adapted to find teams whose members strongly prefer reliable teammates. Such a formulation may be particularly useful when close cooperation is required within a large number of teams, e.g., in emergency situations involving participating NGOs and multiple volunteers.

8. Conclusion and Discussion

This paper presents a mathematical framework that explicitly incorporates social structure in treating the Team Formation Problem. The presented framework introduces models that quantitatively exploit the underlying network structure in team member communities. Importantly, this paper also opens broader research opportunities in the area of prescriptive SNA modeling. The presented framework sheds light on the relationship between social network theories and social structures, and discusses how to quantify social structure using information provided by the underlying graph. In order to assess team performance, network structure measures quantifying both social relations and individual attributes are given.
The paper explores TFP-SS instances with measures based on such network structures as edges, full dyads, triplets, k-stars, etc. For a proven NP-Hard instance of TFP-SS, called TFP-SSS, an integer programming formulation is presented for exact solution computation. In order to tackle problem instances based on TFP-SS, an efficient LK-TFP heuristic based on variable-depth neighborhood search is developed for small-, medium- and large-sized instances with both real and randomly generated networks. The idea of λ-opt sequential search, introduced and developed by Lin, Kernighan and Helsgaun for solving large TSP instances, is successfully applied to solve TFP-SS instances with undirected and directed networks. This paper describes the resulting LK-TFP heuristic as a tree search and explains the roots of its efficiency, confirmed by computational results.

Observe also that TFP-SS instances can be interpreted as certain clique relaxation problems, by identifying the NSMs that can be used in TFP-SS to match clique relaxation-based formulations. As such, consider a TFP-SS instance where only the number of edges is considered as the NSM and the objective is to maximize the minimum over all the teams' outcomes. This optimization problem can be interpreted as that of finding M network subsets such that each of them is a relaxed clique of a pre-defined minimal quality (e.g., an s-defective clique for a given fixed value of s). This observation signals that the TFP-SS framework, and hence LK-TFP, can be useful for current and future efforts in the clique relaxation domain, where clustering problems have been dominant so far.

While this paper demonstrates the advantages of the presented framework for prescriptively implementing SNA theories in TFP, some potential directions for further improvement exist. While the framework is able to generate a range of models for TFP based on social structures, the question of selecting the best model for a given application deserves more attention.
Also, this paper avoided an extended discussion of estimating the function relating NSMs to observed team outcomes; this issue can be addressed in future work. Furthermore, LK-TFP can be further tested and applied to other, similar problems, e.g., clique relaxation problems. Since TFP-SS presents computationally challenging problems, other optimization algorithms can be designed for treating TFP-SS models; exact methods are of particular interest. Finally, the presented framework's ideas can be extended to problems beyond the team formation problem. Network clustering, information influence, community detection, and scheduling (e.g., of work shifts) problems are especially promising.

References

Abbasi, A., Altmann, J., 2011. On the correlation between research performance and social network analysis measures applied to research collaboration networks. In: 44th Hawaii International Conference on Systems Science (HICSS-44), Jan. 4-7, Hawaii, USA.
Agustín-Blas, L. E., Salcedo-Sanz, S., Ortiz-García, E. G., Portilla-Figueras, A., Pérez-Bellido, Á. M., Jiménez-Fernández, S., 2011. Team formation based on group technology: A hybrid grouping genetic algorithm approach. Computers & Operations Research 38 (2), 484–495.
Albert, R., Barabási, A.-L., 2002. Statistical mechanics of complex networks. Reviews of Modern Physics 74 (1), 47.
Aral, S., Muchnik, L., Sundararajan, A., 2009. Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences 106 (51), 21544–21549.
Arulselvan, A., Commander, C. W., Elefteriadou, L., Pardalos, P. M., 2009. Detecting critical nodes in sparse graphs. Computers & Operations Research 36 (7), 2193–2200.
Balkundi, P., Barsness, Z., Michael, J. H., 2009. Unlocking the influence of leadership network structures on team conflict and viability. Small Group Research 40 (3), 301–322.
Bettinelli, A., Liberti, L., Raimondi, F., Savourey, D., 2013.
The anonymous subgraph problem. Computers & Operations Research 40 (4), 973–979.
Borgatti, S. P., 2005. Centrality and network flow. Social Networks 27 (1), 55–71.
Borgatti, S. P., Halgin, D. S., 2011. On network theory. Organization Science 22 (5), 1168–1181.
Ceravolo, D. J., Schwartz, D. G., Foltz-Ramos, K. M., Castner, J., 2012. Strengthening communication to overcome lateral violence. Journal of Nursing Management 20 (5), 599–606.
Chen, S.-J., Lin, L., 2004. Modeling team member characteristics for the formation of a multifunctional team in concurrent engineering. IEEE Transactions on Engineering Management 51 (2), 111–124.
Contractor, N. S., Wasserman, S., Faust, K., 2006. Testing multitheoretical, multilevel hypotheses about organizational networks: An analytic framework and empirical example. Academy of Management Review 31 (3), 681–703.
Dorn, C., Dustdar, S., 2010. Composing near-optimal expert teams: a trade-off between skills and connectivity. In: On the Move to Meaningful Internet Systems: OTM 2010. Springer, pp. 472–489.
Fitzpatrick, E. L., Askin, R. G., 2005. Forming effective worker teams with multifunctional skill requirements. Computers & Industrial Engineering 48 (3), 593–608.
Garey, M. R., Johnson, D. S., 1979. Computers and Intractability. Vol. 174. Freeman, New York.
Gaston, M., Simmons, J., DesJardins, M., 2004. Adapting network structures for efficient team formation. In: Proceedings of the AAAI 2004 Fall Symposium on Artificial Multi-agent Learning.
Girvan, M., Newman, M. E., 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99 (12), 7821–7826.
Goyal, A., Bonchi, F., Lakshmanan, L. V., 2010. Learning influence probabilities in social networks. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, pp. 241–250.
Hahn, J., Moon, J. Y., Zhang, C., 2008.
Emergence of new project teams from open source software developer networks: Impact of prior collaboration ties. Information Systems Research 19 (3), 369–391.
Helsgaun, K., 2000. An effective implementation of the Lin–Kernighan traveling salesman heuristic. European Journal of Operational Research 126 (1), 106–130.
Juang, M.-C., Huang, C.-C., Huang, J.-L., 2013. Efficient algorithms for team formation with a leader in social networks. The Journal of Supercomputing, 1–17.
Kargar, M., An, A., Zihayat, M., 2012. Efficient bi-objective team formation in social networks. In: Machine Learning and Knowledge Discovery in Databases. Springer, pp. 483–498.
Kempe, D., Kleinberg, J., Tardos, É., 2003. Maximizing the spread of influence through a social network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 137–146.
Kothari, R., Ghosh, D., 2013. Insertion based Lin–Kernighan heuristic for single row facility layout. Computers & Operations Research 40 (1), 129–136.
Lappas, T., Liu, K., Terzi, E., 2009. Finding a team of experts in social networks. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, pp. 467–476.
Leite, A. R., Borges, A. P., Carpes, L. M., Enembreck, F., 2011. Improving the distributed constraint optimization using social network analysis. In: Advances in Artificial Intelligence–SBIA 2010. Springer, pp. 243–252.
Lin, S., Kernighan, B. W., 1973. An effective heuristic algorithm for the traveling-salesman problem. Operations Research 21 (2), 498–516.
Manser, T., 2009. Teamwork and patient safety in dynamic domains of healthcare: a review of the literature. Acta Anaesthesiologica Scandinavica 53 (2), 143–151.
Nascimento, M. C., Pitsoulis, L., 2013. Community detection by modularity maximization using GRASP with path relinking. Computers & Operations Research 40 (12), 3121–3131.
Newman, M. E., 2006.
Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103 (23), 8577–8582.
Pirim, H., Ekşioğlu, B., Perkins, A. D., Yüceer, Ç., 2012. Clustering of high throughput gene expression data. Computers & Operations Research 39 (12), 3046–3061.
Robins, G., Pattison, P., Kalish, Y., Lusher, D., 2007. An introduction to exponential random graph (p*) models for social networks. Social Networks 29 (2), 173–191.
Burt, R. S., 1992. Structural Holes: The Social Structure of Competition. Harvard University Press, Cambridge, MA.
Ruef, M., Aldrich, H. E., Carter, N. M., 2003. The structure of founding teams: Homophily, strong ties, and isolation among US entrepreneurs. American Sociological Review, 195–222.
Salari, M., Naji-Azimi, Z., 2012. An integer programming-based local search for the covering salesman problem. Computers & Operations Research 39 (11), 2594–2602.
San Segundo, P., Rodríguez-Losada, D., Jiménez, A., 2011. An exact bit-parallel algorithm for the maximum clique problem. Computers & Operations Research 38 (2), 571–581.
San Segundo, P., Tapia, C., 2014. Relaxed approximate coloring in exact maximum clique search. Computers & Operations Research 44, 185–192.
Shi, Z., Hao, F., 2013. A strategy of multi-criteria decision-making task ranking in social-networks. The Journal of Supercomputing, 1–16.
Snijders, T. A., Van de Bunt, G. G., Steglich, C. E., 2010. Introduction to stochastic actor-based models for network dynamics. Social Networks 32 (1), 44–60.
Squillante, M., 2010. Decision making in social networks. International Journal of Intelligent Systems 25 (3), 225–225.
Wasserman, S., Faust, K., 1994. Social Network Analysis: Methods and Applications. Vol. 8. Cambridge University Press.
Wi, H., Oh, S., Mun, J., Jung, M., 2009. A team formation model based on knowledge and collaboration. Expert Systems with Applications 36 (5), 9121–9134.
Zhang, L., Zhang, X., 2013.
Multi-objective team formation optimization for new product development. Computers & Industrial Engineering 64 (3), 804–811.
Zhang, Z.-x., Luk, W., Arthur, D., Wong, T., 2001. Nursing competencies: personal characteristics contributing to effective nursing performance. Journal of Advanced Nursing 33 (4), 467–474.
Zhong, X., Huang, Q., Davison, R. M., Yang, X., Chen, H., 2012. Empowering teams through social network ties. International Journal of Information Management 32 (3), 209–220.
Zhu, M., Huang, Y., Contractor, N. S., 2013. Motivations for self-assembling into project teams. Social Networks 35 (2), 251–264.

9. Appendix

Theorem 1. Consider an instance of TFP-SSS, with M teams to be formed out of N individuals.
Instance: A graph G(V, E), |V| = N; n_j ∈ Z+ for 1 ≤ j ≤ M; a partition into disjoint sets X_1, X_2, ..., X_M, where X_j ⊆ V for 1 ≤ j ≤ M; and θ_l ∈ R for l ∈ L.
Question: Is there a partition of V into M disjoint subsets X_1 ∪ X_2 ∪ ... ∪ X_M, with |X_j| = n_j and Σ_{j=1}^{M} n_j = N, such that Σ_{j=1}^{M} P(X_j) is maximized, where P(X_j) = Σ_{i=1}^{n_j} Σ_{l∈L} θ_l F_l(N_{X_j}(v_i)) for 1 ≤ j ≤ M?
The presented problem is NP-Hard.

Proof: The proof proceeds by a polynomial-time reduction from Partition into Triangles. An arbitrary instance of Partition into Triangles is given.
Instance: A graph G(V, E), with |V| = 3q, for a given fixed integer q.
Question: Can the vertices of G be partitioned into q disjoint sets V_1, V_2, ..., V_q, each containing exactly 3 vertices, such that for each V_i = {u_i, v_i, w_i}, 1 ≤ i ≤ q, the three edges {u_i, v_i}, {u_i, w_i}, and {v_i, w_i} all belong to E (Garey and Johnson 1979)?
Consider a particular instance of TFP-SSS with N = 3q, M = q, and X_j = V_j for 1 ≤ j ≤ q. Set θ_l = 1 for l = triangle and θ_l = 0 otherwise (i.e., use the number of triangles as the only NSM in the objective function of TFP-SSS). Finally, set n_j = 3 for 1 ≤ j ≤ q.
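For illustration, the triangle-only objective used in this reduction can be sketched as follows (a minimal brute-force version; function names are ours, not from the paper). With two disjoint triangles (q = 2), the partition that matches them attains objective value q, while a mixed partition does not:

```python
from itertools import combinations

def triangles_in_team(team, edges):
    """Count triangles whose three vertices all lie in `team`."""
    return sum(
        1
        for u, v, w in combinations(sorted(team), 3)
        if {u, v} in edges and {u, w} in edges and {v, w} in edges
    )

def objective(partition, edge_list):
    """Triangle-only TFP-SSS objective from the reduction:
    theta_triangle = 1, all other NSM weights 0."""
    edges = [set(e) for e in edge_list]
    return sum(triangles_in_team(team, edges) for team in partition)

# Two disjoint triangles (q = 2): the matching partition attains value q.
edge_list = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6)]
assert objective([{1, 2, 3}, {4, 5, 6}], edge_list) == 2
# Mixing vertices across teams breaks both triangles here.
assert objective([{1, 2, 4}, {3, 5, 6}], edge_list) == 0
```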
To demonstrate that there is a one-to-one correspondence between the described Partition into Triangles and TFP-SSS instances, suppose that X_j^*, j = 1, ..., q, is the optimal solution to TFP-SSS. Then |X_j^*| = 3, and Σ_{j=1}^{q} F_triangle(N_{X_j^*}(v_i)) = q is the maximal value of the objective function. Observe that V_j^*, with |V_j^*| = 3, which assigns three nodes to partition j, is equivalent to X_j^*; note that these three nodes form a triangle. Suppose that X_j^* ≠ V_i^* for some i and j, with 1 ≤ i, j ≤ q. Then there exists at least one partition with three nodes {u_i, v_i, w_i} ⊆ X_j^* such that not all of the edges {u_i, v_i}, {u_i, w_i}, {v_i, w_i} belong to E. Therefore, Σ_{j=1}^{q} F_triangle(N_{X_j^*}(v_i)) < q, which is a contradiction, since there are q partitions and at least one of them does not contain a triangle. Hence X_j^* = V_i^* is an optimal solution for both problems. This completes the proof.

Violent Conflict and Online Segregation: An analysis of social network communication across Ukraine's regions*

Dinissa Duvanova†, Department of International Relations, Lehigh University
Alexander Nikolaev, Department of Industrial and Systems Engineering, University at Buffalo, SUNY
Alex Nikolsko-Rzhevskyy, Department of Economics, Lehigh University
Alexander Semenov, Department of Computer Science and Information Systems, The University of Jyvaskyla, Finland

Abstract

Does the intensity of a social conflict affect political division? Traditionally, social cleavages are seen as the underlying cause of political conflicts. It is clear, however, that a violent conflict itself can shape partisan, social, and national identities. In this paper, we ask whether social conflicts unite or divide the society by studying the effects of Ukraine's military conflict with Russia on online social ties between Ukrainian provinces (oblasts).
In order to do that, we collected original data on the cross-regional structure of politically relevant online communication among users of the VKontakte social networking site. We analyze a panel of provinces spanning the most active phases of domestic protests and military conflict and isolate the effects of province-specific war casualties on the nature of inter-provincial online communication. The results show that war casualties elicit a strong emotional response in the corresponding provinces, but do not necessarily increase the level of social cohesion in inter-provincial online communication. We find that the intensity of military conflict stimulates online activism, but activates regional rather than nation-wide network connections. We also find that military conflict tends to polarize some regions of Ukraine, especially in the East. Our research brings attention to underexplored areas in the study of civil conflict and political identities by documenting the ways the former may affect the latter.

JEL Codes: K42, H56, O5, P2, Z1
Keywords: Ukraine, social media, war, terrorism

* The authors would like to thank Ruben Enikolopov, Konstantin Sonin, and participants of the Symposium for a Special Issue of the Journal of Comparative Economics for helpful comments. The research of A. Semenov is supported by the Academy of Finland Grant #268078 "Mining social media sites" (MineSocMed).
† Corresponding author. Assistant Professor, Department of International Relations, Lehigh University, email: [email protected].

Electronic copy available at: http://ssrn.com/abstract=2664949

1. Introduction

Does the intensity of social conflicts affect political divisions? On the one hand, the traumatic experience of violence may reinforce polarizing identities (Wilkinson 2004). On the other, violent conflict may help consolidate the society and promote social capital (Russett 1990, Voors et al. 2012, Blattman 2009). How exactly do violent conflicts reshape the society?
We attempt to answer this question by studying the effects of Ukraine's "Revolution of Dignity" and the military conflict with Russia on online social ties between Ukrainian provinces. Researchers have identified digital communication and online activism as increasingly consequential forms of behavior, as well as important mechanisms of political change. Social media has been shown to affect civic engagement, political participation, and economic choice.1 Digitally enabled forms of political communication have also become increasingly important sources of attitudinal and behavioral data. As the digital revolution gives rise to new electronic forms of mass communication and virtual association, it opens greater opportunities to study how people form attitudes, express their opinions, and engage in collective behavior. In this paper, we explore such opportunities by analyzing ways in which political information shapes online social engagement. We examine online political activism and engagement during the period of political contention spanning the anti-regime Euromaidan protests, the annexation of Crimea, and the armed insurgency and foreign intervention in Eastern Ukraine. We believe that Ukraine's case presents an advantageous setting not only for investigating online activism as an increasingly popular form of civic engagement, but also for evaluating long-standing questions about the role of violent conflict in promoting social change.

1 Researchers find empirical links between exposure to digital communication technology and political attitudes (Kerbel 2009), voting (Christakis and Fowler 2009, Vitak et al. 2011, Bond et al. 2012), civic engagement (Jennings and Zeitner 2003, Jensen et al. 2007, Bennett & Segerberg 2013), campaign contributions (Hamilton & Tolbert 2012), support for political parties (Norris 2003), and collective action (Earl 2011, Bennett and Segerberg 2013). Scholars studying authoritarian regimes link social media to oppositional attitudes and protest behavior (Tang, Jorba, and Jensen 2012; Howard & Hussain 2013; Lim 2012; Tufekci & Wilson 2012). Social media is also identified as an effective tool of governance that affects policymaking (Kerbel 2012; Baum 2012; Lawless 2012). Enikolopov et al. (2015) demonstrate that blogs affect the stock market and corporate governance. See Fox & Ramos (2012) and Jensen et al. (2012) for a review of the related literature.

Ukraine's turbulent politics provide a rich context for studying social conflict. In Ukraine, the relative weakness (Aslund and McFaul 2006, Birch 2000) and "fluidity" (Zielinski et al. 2008) of institutional mechanisms of routine public engagement (political parties, unions, advocacy groups) make virtual communication a particularly important venue of political engagement. The ability to freely exchange opinions and share relevant political information is the cornerstone of a democratic society (Huckfeldt and Sprague 1995). In new democracies that lack proper institutions, social networking websites such as Facebook may not only provide easily accessible venues for political expression, but also serve as substitutes for underdeveloped institutions of civil society.2 In order to analyze such increasingly accessible and important mechanisms of political expression, we collected original data on the cross-regional structure of politically relevant online communications. In particular, our dataset contains user-created posts and comments from public political groups on VKontakte (VK.com), the largest social networking site in Ukraine by the number of registered users and daily visits.
Each post and its related comments, in addition to the date and time stamp, contain user-specific information such as the user's name, self-reported home and current cities, education, the list of languages spoken, etc. Unfortunately, due to privacy concerns, the individual-level data, while present in the original dataset, had to be aggregated to the group level before we could use it in our analysis. Nevertheless, we are able to utilize information on contributors to specific discussions, including their regional composition.

2 See Beissinger (2012) for a related discussion.

Our analysis of virtual communication carried out in VKontakte discussion groups reveals that the intensity of military aggression, as captured by widely publicized information about army fatalities, unites some parts of the Ukrainian social media community while segregating others. We analyze volumes of cross-provincial communication and engagement in online discussion groups and find that information about civil protests (Maidan) and war casualties leads to greater online activism on the part of users from the affected provinces. Such engagement, however, remains mostly localized and does not affect other parts of the country. We also find considerable variation in the ways different parts of the country respond to the violent conflict as measured by the number of casualties.3 Political information eliciting a strong emotional response has a polarizing effect in Eastern oblasts (provinces), but not in the rest of the country. While in Western oblasts war casualties result in increased disengagement from the rest of the country, in the East they lead to greater polarization. This finding corroborates previous studies that link online communication to political fragmentation and polarization (Rozenblat and Mobius 2004; Duvanova et al.
2015; Bennett and Segerberg 2013; Webster 2007; Prior 2007).4 To our knowledge, this is the first study to systematically examine the causal effects of divisive news on digitally enabled forms of public engagement.

3 We divide Ukraine into Western, Central, Southern, and Eastern parts according to the established convention. The list of oblasts falling into each region is presented in Figure 1.

4 Gentzkow and Shapiro (2011), on the contrary, find that ideological fragmentation in online news consumption is higher than in offline media, but low in absolute terms and in comparison to face-to-face communication. While we do not make any claims about the absolute levels of social media fragmentation in Ukraine, our finding that information about the intensity of fighting heightens regional segregation of the virtual network does not contradict Gentzkow and Shapiro's conclusions and further extends this line of research.

The paper is organized as follows. The next section develops our argument, analytical model, and hypotheses. We then describe our data collection, methods, and aggregation procedures in Section 3. Section 4 presents our empirical analyses. Conclusions summarizing our results and contributions are presented in Section 5.

2. Insurgency, casualties, and social network communication

2.1 Theoretical Considerations

Traditionally, social identities are seen as the underlying causes of political contention (Fearon & Laitin 2000, Montalvo & Reynal-Querol 2005, Montalvo & Reynal-Querol 2010). Deep and persistent cleavages polarize society and may contribute to violent conflict (Fearon & Laitin 2003, Blattman & Miguel 2010, Jackson & Morelli 2011).5

5 Studies have shown that mass media can increase the salience of ethnic identities, heighten racial animosity (Della Vigna et al. 2014, Adena et al. 2015), and influence political behavior (Enikolopov et al. 2011; see Della Vigna & Gentzkow 2010 for a review). Similarly, social media that harness social networks of like-minded people are seen as an important source of influence in politics (Murphy and Shleifer 2004, Christakis and Fowler 2009, Bond et al. 2012).

At the same time, tribal, ethnic, and national identities are shaped by collective experiences of war and violence. Instead of asking whether social divisions affect political conflict, we investigate the reverse causal link: from the intensity of violent conflict (measured by the number of fatal casualties) to the networking of diverse groups of the online population. Scholars have long recognized war as a companion and catalyst of nation building (Anderson 1982, Mylonas 2012) and state-making (Tilly 1992, Thies 2005, Acemoglu and Robinson 2006). External threats (international conflict or out-group attacks) reduce the salience of internal divisions, suppress internal dissent, and help build a sense of group identity. In the face of external threats, violent conflicts in particular, citizens tend to re-define the relationship between their partisan, social, and national identities. Studies of the "rally 'round the flag" effect demonstrate that an external threat, if perceived as such by the majority of the country, tends to unite its citizens (Mueller 1970, Russett 1990). According to this theory, foreign threats activate non-divisive identities and have a unifying effect in politically diverse groups. When studying Ukraine, however, it is unclear whether the war in the Donbas would have such a unifying effect. An ethno-linguistically diverse nation, Ukraine may not yet have developed a strong sense of national identity. Since the introduction of competitive elections, the political preferences and voting patterns of Ukrainian citizens have largely overlapped with ethnolinguistic divisions (Clem and Craumer 2008). Anti-government protests were fueled by ethnic and cultural cleavages rather than by universalistic democratic principles (Beissinger 2013). Russian media have had an important influence on the political behavior of exposed Ukrainians, reinforcing a pro-Western vs. pro-Russian schism (Peisakhin and Rozenas 2015). With most Ukrainians being fluent in Russian, self-selected exposure to biased media explains why the very nature of Ukraine's armed conflict remains unclear. While nationalist, pro-Western outlets see the conflict in terms of foreign intervention, Russian state-controlled media cover it as an insurgency, separatism, and civil war. In more general terms, one might debate whether the "rally 'round the flag" effect would arise in a deeply divided nation in the midst of a war. The ethno-linguistic as well as regional dimensions of the conflict may further fragment the society along these fault lines. A cursory analysis of online communication might suggest that the Ukrainian virtual public is in fact becoming increasingly engaged in a way that bridges geographically defined linguistic and political barriers. Figure 2 plots the strength of online communication ties connecting users of the VKontakte social networking platform at two points in time.6 The first graph maps cross-provincial communication during the early phase of the Maidan protests (November 2013). The second graph does the same for the period of intense fighting between the Ukrainian army and insurgent forces (August 2014). It appears that online communication has a greater density and a more pronounced cross-regional character during the war, compared to the period before the start of the insurgency. On the other hand, the Southern part of Ukraine (especially the occupied Crimean peninsula) is generally excluded from interprovincial communication in 2014, although in 2013 it was closely connected to the Central and Western oblasts. Does that mean the Ukrainian public becomes more or less united in response to the violent conflict in the Donbas?
Visual analysis of online communication might be misleading. First, the overall importance of inter-regional ties cannot be adequately assessed without explicit reference to intra-regional communication. Second, communication does not imply agreement. Heated debates might entice participation but, at the same time, may further polarize virtual communication.6

6 We first aggregate individual-level data on online political communication to the level of the province. For each pair of provinces (dyads), we identify discussion topics with participants from each province. For every dyad, the intensity of shared communication is measured as the number of topics that both provinces contributed to. In order to account for the value of a topic in determining communication intensity, we use the percentages of messages contributed by the regions with respect to their total message volumes. In other words, for each topic we calculate its share in the total number of wall posts contributed by each of the provinces. Then we add the products of the provinces' contributions (as shares of their total posts) to each of the shared topics for the corresponding dyad. The resulting measure of inter-provincial ties ranges from 0 to 1, with higher numbers corresponding to stronger ties. Our measure has the advantage of being balanced with respect to the total number of users from a specific province and the number of topics they discuss.

Although this is a grossly simplified characterization, the political cleavage lines that have been observed since the early post-independence period closely follow the East-South vs. West-Center geographical divide. Previous research has identified ethnic, linguistic, socio-economic, and cultural-historical cleavages as mutually reinforcing factors accounting for regional variation in public attitudes and political behavior (Birch 2000; Aslund and McFaul 2006; Clem and Craumer 2008; Beissinger 2013).
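The dyadic tie-intensity measure described in footnote 6 can be sketched in a few lines (a minimal illustration under our reading of the measure; the function name and the dictionary-based input format are our own assumptions, not the authors' code):

```python
def tie_strength(posts_a, posts_b):
    """Dyadic tie intensity between two provinces.

    posts_a, posts_b: dicts mapping topic id -> number of wall posts
    the province contributed to that topic.  For each shared topic we
    multiply the two provinces' shares of their own total post
    volumes, then sum the products over all shared topics.
    """
    total_a = sum(posts_a.values())
    total_b = sum(posts_b.values())
    if total_a == 0 or total_b == 0:
        return 0.0
    shared = posts_a.keys() & posts_b.keys()
    return sum((posts_a[t] / total_a) * (posts_b[t] / total_b) for t in shared)

# Toy example: each province devotes all its posts to one shared topic.
print(tie_strength({"maidan": 5}, {"maidan": 8}))  # 1.0
```

If the two provinces share no topics the measure is 0; because each province's shares sum to one, the sum of products never exceeds 1, matching the stated 0-to-1 range.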
While researchers disagree on the precise composition and relative theoretical merits of the underlying social factors, Ukrainian political preferences clearly and consistently vary across regional lines: Russian-speaking Eastern and Southern Ukraine, which is more industrialized and dominated by highly concentrated industry, differs significantly from the Ukrainian-speaking Western and Central provinces, which are characterized by small enterprises and tertiary-sector domination. Our identification strategy makes use of these persistent and clearly defined social divides: we use province-specific Ukrainian military fatalities as a measure of the conflict's intensity. Ukrainian media routinely report personal information on all military personnel and volunteers killed in Eastern Ukraine, including each person's place of origin. This makes it possible for the online groups to link casualties to specific provinces. Soldiers and officers from all parts of the country serve in the Ukrainian armed forces; soldiers from the South, East, West, and Center are equally likely to die fighting separatists. The key advantage of casualties is that they can, for the most part, be considered exogenous, and their specific geographical attributes allow us to derive province-specific expectations about unifying/polarizing effects. Conflict casualties were first reported on March 18, 2014, and peaked in August 2014 at over 500 dead during the Ilovaisk massacre. Figure 3 graphs the monthly casualties data over time. We identify two dimensions of cross-regional communication: the level of cross-regional fragmentation and the degree of its polarization. Fragmentation and polarization are two distinct aspects of network communication. We define fragmentation of online communication as the lack of connections (exchange of information) between different groups in the society.
For the purpose of our analysis, we concentrate on inter-provincial connections. But how do we know that information is being exchanged? Open communication platforms allow all network users to access information, but this does not guarantee that all users are exposed to it. In fact, prior studies have clearly documented "selective exposure" in pluralistic information environments (Sunstein 2001; Stroud 2011). We explore inter-regional fragmentation using three different empirical metrics capturing: 1) how active users from a given province are in contributing to online communication; 2) to what extent online communication (in a group or discussion forum) brings together users from different provinces; and 3) how "influential" the content of one province's posts and comments is in enticing a response from other members of the social network. In our specific case, the response is captured by the number of comments the original message (post) attracts. Of course, active communication among users from various provinces does not necessarily imply a constructive dialogue. Therefore, we also consider how polarized the sentiment is in each discussion. Because we analyze online discussions on a diverse set of political topics, we measure polarization as the difference in the intensity of positive and negative sentiment captured by the content analysis of the comments.

2.2. The Model and Hypotheses

We model all online communication as falling into two types. Type I communication is carried out between users residing in the same province. Type II communication takes place between users from any two different provinces. Intra-provincial (Type I) communication, in our view, is more likely to be carried out via friendship, familial, occupational, and offline association networks and, due to the geographical patterns of ethno-linguistic and political cleavages, to unite people of proximate political preferences and ethno-linguistic backgrounds.
Type II communication, on the contrary, is less likely to be anchored by physical connections and more likely to engage people of different backgrounds and political preferences. While we are unable to differentiate between in-group ties based on physical connections and impersonal out-group ties, inter-provincial communication is more likely to fall into the impersonal out-group type because of the greater physical and cultural distance. We assume that users incur time costs and derive social benefits from posting messages and comments. Furthermore, social benefits increase with the sense of community, solidarity, and social efficacy. Hence, a user is more likely to contribute a costly effort to a conversation that is more socially relevant to her.7 In our analysis, political events, such as the Maidan revolution and the escalation of the war in the Donbas, increase the social relevance of some online communication. We also assume that VKontakte users are generally aware of the ethno-linguistic backgrounds and political preferences of the authors of wall posts. They may deduce this information from the author's choice of language (Ukrainian or Russian), the content and tone of the message, the previous history of public communication, and, perhaps, from the home town and/or town of residence associated with users' profiles.8 We model communication as initiated by user i, who posts a message on the VKontakte wall. User j ≠ i may respond to i's posts with a number of comments xij. If both i and j reside in the same province, xij constitutes Type I communication. If i and j are from different provinces, Type II communication takes place. In our model, social network users use information on the intensity of social conflict to update their political beliefs and preferences. They may rely on various media sources to obtain such information.
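The Type I/Type II distinction above reduces to a simple residence check (a trivial sketch; the function name and province labels are illustrative, not from the paper):

```python
def communication_type(province_i, province_j):
    """Classify a post-comment exchange: Type I if the poster i and
    the commenter j reside in the same province, Type II otherwise."""
    return "Type I" if province_i == province_j else "Type II"

print(communication_type("Lviv", "Lviv"))     # Type I
print(communication_type("Lviv", "Donetsk"))  # Type II
```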
Because the media outlets may be selective in covering various aspects of the conflict, we chose to concentrate on a widely publicized objective measure of the intensity of the social conflict: war-related fatalities of pro-Ukrainian armed forces. This measure has the important advantage of being specific to our unit of aggregation. VKontakte users may obtain province-specific war casualties from official and unofficial (e.g., Wikipedia) sources. Besides, they may be exposed to province-specific casualties via physical interactions with people whose family members and acquaintances serve in the army.

7 Relevance may mean both agreement and disagreement.

8 Our analysis excludes users whose profiles do not identify the place of residence.

If the military intervention in Eastern Ukraine promotes solidarity and strengthens an all-Ukrainian national identity, one would expect Ukrainians of all ethno-linguistic backgrounds and political preferences to engage in online discussions (Type II communication). On the contrary, if the conflict activates divisive ethnic, regional, or political identities, one would expect VKontakte communication to be more localized, as people would resort to the safety of familial, friendship, occupational, and generally more localized networks (Type I communication) where they are more likely to be accepted. Moreover, solidarity with the defenders of Ukrainian independence should be compatible with an inclusive Ukrainian national identity. As such, if Ukrainians are becoming more united in their patriotism and opposition to Russia, they should be more likely to engage in Type II communication with provinces experiencing war casualties on the Ukrainian side, to show emotional support and express sympathy.
If, on the contrary, the war polarizes rather than unites the nation, province-specific war casualties should increase Type I communication: a provincial, rather than all-Ukrainian, identity should be activated in response to province-specific fatalities. The first set of hypotheses, therefore, addresses the general nature of inter-provincial communication carried over the digital social networking platform:

Hypothesis 1a: Intensification of violent conflict should lower fragmentation of online communication.

Hypothesis 1b: Intensification of violent conflict should lower the degree of polarization in online communication.

These hypotheses are consistent with the notion that the war in the Donbas unites Ukrainians and promotes an inclusive sense of patriotism. The opposite expectations are consistent with the notion that the war further fragments and polarizes the nation. To evaluate these hypotheses, we analyze the entire daily panel of Ukrainian provinces between January 1, 2013 and December 31, 2014. Given the regional clustering of Ukraine's socio-political and ethno-linguistic cleavages, we develop a set of complementary hypotheses to test whether the patterns of inter-provincial fragmentation and polarization differ in West-Central and East-Southern provinces:

Hypothesis 2a: Intensification of violent conflict should reduce fragmentation and polarization of online communication in the subset of provinces with a strong prior sense of Ukrainian identity, e.g., West-Central Ukraine.

Hypothesis 2b: Intensification of violent conflict should increase fragmentation and polarization of online communication in the subset of provinces lacking a very strong sense of Ukrainian identity, e.g., East-Southern Ukraine.

To evaluate these hypotheses, we split the sample into Western, Central, Eastern, and Southern provinces.
After describing our data in the next section, we evaluate these two sets of theoretical expectations against the observed patterns of online communication in the "Analysis" section.

3. Data

3.1 Data Collection Methods

In order to investigate how social conflict affects mass political communication, we collected data on online groups and users of VKontakte (vk.com, or simply "VK"; previously vkontakte.ru), which is the most visited social networking site in Ukraine.9 With its three official languages (Russian, Ukrainian, and English), VKontakte has over 200 million users who primarily reside in Russia, Ukraine, Kazakhstan, Moldova, Belarus, and Israel. In Ukraine, the site's share of total Internet searches is second only to those carried out using the Google search engine. According to Ukraine Business Online,10 out of 30 million Ukrainians who used social networks in 2012, 20 million had VKontakte accounts. VKontakte allows users to post public or private messages and share audio, video, and text content, as well as create groups, public pages, and events. In what follows, we analyze user-created groups that have explicitly identifiable political content in the body of their wall posts and comments.11

9 Alexa, the web information service, http://www.alexa.com/topsites/countries/UA, retrieved in Feb. 2015.

10 http://www.ukrainebusiness.com.ua/news/7110.html

11 It should be noted that following the 2014 politically motivated resignation of VKontakte founder P. Durov, the site was subject to increased control by the Russian secret services (http://www.ewdn.com/2014/04/22/durov-says-he-gave-up-vkontakte-share-because-of-anti-maidan-pressure-from-fsb/). Following Durov's resignation, pundits predicted a mass migration of pro-Ukrainian users to Facebook. Despite this, VKontakte remains the most visited networking site in Ukraine, far surpassing Facebook in the number of users. It should be taken into account, however, that with some anti-Russian users leaving VKontakte in protest, the subset of remaining users has become biased towards supporting Russia. This self-selection of users into different networking platforms should bias our analysis against finding fragmentation and polarization.

VKontakte offers its users the possibility to create a profile and fill it with various personal and non-personal information, such as name, gender, date and place of birth, pictures, etc. Each user has her own wall, where she can post various messages. These messages, along with the user's data, can be seen on the user's profile page, which can be accessed by a URL. Users can create explicit communities: groups and public pages. Communities have separate pages. These pages contain various community details, such as the name, description, logo, the community's wall, where community news is displayed, and community discussion boards. Users may join communities. When they do, their public information becomes accessible under the community's "members" field. Depending on privacy settings, communities may be public (available to anyone) or closed, in which case an invitation to join is required. Settings also regulate who can post on a community wall, e.g., all users or administrators only. VKontakte allows searching communities by different keywords. Each community has a distinct URL, by which it can be accessed. The data were collected from public communities using the social media monitoring software described in Semenov and Veijalainen (2013) and Semenov, Veijalainen, and Boukhanovsky (2011). Collection relied on the application programming interface (API) of VKontakte, which exports data in JSON format. The data were then placed in a repository based on a PostgreSQL database. Initially, the groups were identified by searching communities over all their posts using one or more of the following keywords: "Ukraine", "Украина", "Україна", "Майдан", "Євромайдан", with all their grammatical variations.12 In total, the search identified 14,777 public communities. Then, all public posts (message text and date) were downloaded from each community wall. Comments attached to the selected posts were gathered as well. Next, the city and country listed in users' public profiles were gathered for those users who were members of the identified communities, as well as for those who posted comments on the walls. No personal information was ever collected or stored. We gathered 19,430,445 wall posts, which jointly had 62,193,711 comments. During wall post and comment gathering, we downloaded 71 GB of text data.

12 Including keywords such as "war", "protest", and "conflict" produced redundant results because these overlapped with "Ukraine" and "Ukrainian."

To capture the temporal dimension of the development of online networks, we separate posts contributed to the discussion topics into daily segments based on the timestamps of the user-supplied posts.13 In effect, our analysis only includes discussion groups that had active contributors during the analyzed day. Figure 3 displays the changes in the number of wall posts and comments contributed during the analyzed period. Relying on the user-supplied information about the place of residence, we can identify the regional dimension of politically motivated Internet communication, which we expect to follow the existing ideological fault lines of present-day Ukrainian politics. Our analysis groups users' comments by oblast.14 For each discussion group, we compute the share of participants from each province.

13 Aggregating our data to weekly frequency does not affect our conclusions. Results can be found in Appendix B, Tables B1-B5.

14 VKontakte users do not identify their oblasts, but only use city and country names.
Cities were matched to their corresponding oblasts using the open-source dataset Geonames (http://download.geonames.org/export/dump/). For those cities present in the dataset more than once, the one having the maximal population was selected. Based on the user-provided place of residence, we group users by province. Some users kept their information private; in cases where we were not able to identify a user's home province, we assigned the user to a separate "unidentified" group. In order to investigate how regional patterns of online communication respond to the intensification of political tension and violent conflict, we supplement our online networks data with data tracing province-specific war casualties. There are several potential official government sources of the casualties data (e.g., the Ministry of Defense of Ukraine), as well as several unofficial sources collected by individual volunteer organizations (e.g., the Wings of Phoenix). While the former lack biographical information on the fallen (thus making it impossible to match them to their home cities) as well as data on volunteer fighters, the latter might be incomplete. For that reason, we decided to use a crowd-sourced Wikipedia webpage on Ukrainian casualties that appears to be more complete than other individual sources and, in addition to the name, rank, and date of death, contains a short biographical sketch.15 Using the supplied information, we were able to calculate the number of casualties each oblast suffered each day from January 1, 2013 through the last day of 2014. Among the 1,466 casualties, 560 and 398 are from the Western and Central oblasts (65% of the total), while 204 and 304 are from the Southern and Eastern oblasts (35% of the total).
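The duplicate-city resolution step described above (keeping the most populous entry when a city name appears more than once) can be sketched as follows; the sample rows and the simplified three-column layout are our own illustration, not the actual Geonames schema:

```python
import csv
from io import StringIO

# Hypothetical excerpt in a simplified layout we assume here:
# city name, admin region (oblast), population.
SAMPLE = """name,oblast,population
Pervomaisk,Mykolaiv,54818
Pervomaisk,Luhansk,38953
Kharkiv,Kharkiv,1430885
"""

def city_to_oblast(rows):
    """Resolve duplicate city names by keeping the most populous entry."""
    best = {}
    for row in rows:
        name, pop = row["name"], int(row["population"])
        if name not in best or pop > best[name][1]:
            best[name] = (row["oblast"], pop)
    return {name: oblast for name, (oblast, pop) in best.items()}

mapping = city_to_oblast(csv.DictReader(StringIO(SAMPLE)))
print(mapping["Pervomaisk"])  # Mykolaiv
```

The larger-population Pervomaisk wins, so ambiguous user-supplied city names resolve deterministically.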
Our identification strategy is based on the assumption that if war casualties affect the way people engage on the VKontakte platform, the number of comments originating in an oblast should respond positively to casualties from the same oblast. This is the first thing we check in our empirical analysis. This alone, however, will tell us little about whether different users are engaged in inter-provincial dialogue, and even less about the nature of that dialogue. To quantify these effects, we identify two dimensions of cross-regional communication: fragmentation, or the lack of connections between regions, and polarization of the sentiment of each discussion.

3.2 Measuring Fragmentation

To capture the extent to which social network communication unites users from different provinces, we adopted the reverse of the "group separation" measure by Rosenblat and Mobius (2004). We define interprovincial cohesion (the reverse of fragmentation) as the share of total comments falling in Type II communication. The larger the share, the less compartmentalized online communication is. To capture the same concept of cohesion in a regional dyads analysis (either West-Center vs. East-South or any pair of regions identified in Figure 1), which is more relevant for our purpose, we modify the above measure for the case of only two groups. The idea is that if the share of comments to a particular post coming from either of the two regions equals one or zero, there is little or no interaction between the regions.16 If, instead, the share is close to .5, neither region dominates the discussion, so users from different parts of the country exchange opinions on relevant political issues. To account for that, we construct the dyadic cohesion index as:

\text{Cohesion}_t = 1 - \frac{2}{N_t} \sum_{j=1}^{N_t} \left| \frac{c_{jt}^{r}}{c_{jt}^{r} + c_{jt}^{r'}} - \frac{1}{2} \right| \qquad (1)

where c_{jt}^{r} is the number of comments to wall post j on day t coming from region r, and N_t is the number of wall posts considered on day t.

15 Biographical data were missing or incomplete for several soldiers, which did not allow us to identify their home city. In those cases their oblast in the dataset was coded as "unknown."

16 To correctly calculate the shares, we had to adjust for comments coming from other countries, as well as comments from users whose location we were not able to identify. Also, we limited our consideration to posts with at least 10 comments to remove noise from the data.

Thus, in both measures, values closer to zero mean no interprovincial conversation, and values close to one mean full engagement. While reflective of the extent of "connectedness" with users residing outside a given province, this index does not take into account the overall volume of conversation carried by the province or the popularity of the original posts (often assessed by the number of comments they attract). Therefore, it ignores the informational and persuasive properties of inter-provincial dialogue. To capture the extent to which the content produced by users from a given province can attract an inter-provincial response, we use a recently introduced measure of engagement capacity to assess the ability of groups of online forum users to entice peer response from other oblasts (Godre et al., 2015). Online social network or forum communication directly depends on the content continuously created by the users and, in particular, on the synergy they gain from responding to each other: building relationships, finding common political interests, sharing and spreading attitudes, and establishing themselves as a cohesive self-identifying entity. The engagement capacity index quantifies the share of any user, or user group, in this synergistically created value, under the assumption that a forum post is only valuable if it is able to attract peer responses, which can be either in support of or in opposition to the original post.
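Under our reading of the dyadic cohesion index (a per-post comment share of .5 means full engagement, shares of 0 or 1 mean none), the computation can be sketched as follows; this is an illustrative sketch only, and the function name and tuple-list input are our own assumptions:

```python
def dyadic_cohesion(comment_pairs):
    """Dyadic cohesion index for one day.

    comment_pairs: list of (c_r, c_r2) tuples, one per wall post,
    giving comment counts from the two regions of the dyad.  A post
    where one region supplies all comments contributes 0 to cohesion;
    an even split contributes 1.
    """
    shares = [a / (a + b) for a, b in comment_pairs if a + b > 0]
    n = len(shares)
    return 1 - (2 / n) * sum(abs(s - 0.5) for s in shares)

print(dyadic_cohesion([(10, 0), (0, 7)]))   # 0.0 (fully fragmented)
print(dyadic_cohesion([(5, 5), (12, 12)]))  # 1.0 (full engagement)
```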
The engagement capacity measure relies on principles of cooperative game theory, which calculates "fair-share" allocations in settings where agents form coalitions to achieve a common goal. The first widely used measure of this sort is the Shapley value (Shapley, 1953). While the Shapley value is useful for evaluating fairly contributed values in unstructured coalitions, engagement capacity works with the directed trees of forum threads (Godre et al., 2015). The engagement capacity index takes into account an initiator/responder tradeoff: in a single exchange, it gives more value to the "originator," or, more precisely, to all the users who have posted in the thread up until that moment. We view VKontakte posts as discussion seeder/originator actions, and the ensuing comments as responder actions. We posit that VKontakte users form groups, by oblast, and calculate the groups' engagement index as a measure of their ability (as originators) to engage other groups (as responders) in political communication. We track the engagement values for all oblasts daily over time, and study whether information available through public channels (reported war casualties from a given province) serves as a statistically significant predictor of the ability of each particular oblast "to be heard." Figure 4 plots the engagement capacity index for West-Central and East-Southern oblasts. We can see that before the Revolution started, West-Central and South-Eastern oblasts had similar power in starting valuable discussions that were followed by residents of other oblasts. This clearly changed after November 2013. Now the majority of influential discussions that attract a wide array of participants originate in the West.

3.3 Measuring Polarization

Our second dependent variable, political polarization, is conceptualized as the distance in sentiment, or the intensity of comments' positive and negative connotations (Pang and Lee, 2008).
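The Shapley value that underlies this family of measures can be illustrated with a toy originator/responder game. This is our own illustration: the actual engagement capacity index operates on directed thread trees and is not reproduced here.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values by averaging each player's marginal
    contribution over all join orderings (feasible for small games).

    value: function mapping a frozenset coalition -> its worth.
    """
    phi = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            phi[p] += value(coalition | {p}) - value(coalition)
            coalition = coalition | {p}
    return {p: v / len(orderings) for p, v in phi.items()}

# Toy forum game: a post by originator "o" is worth nothing alone;
# each responder who joins a coalition containing "o" adds 1 unit.
def worth(c):
    return len(c) - 1 if "o" in c else 0

print(shapley_values(["o", "r1", "r2"], worth))
# {'o': 1.0, 'r1': 0.5, 'r2': 0.5}
```

The originator receives the largest share (1.0 of the total worth of 2.0), consistent with the initiator/responder tradeoff described above: responses have no value without the post that seeded them.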
We capture this by performing a content analysis of all wall posts using Ukrainian and Russian "bag of words" datasets that include 8,863 positive and 24,299 negative words in both Russian and Ukrainian.17

17 For a review of the "bag-of-words" method, as well as alternative approaches to sentiment analysis, see Maragoudakis et al. (2011) and Liu (2012).

In economic analyses, "bag of words" methods have previously been used to study media market segregation (Gentzkow and Shapiro 2010), consumer confidence and political opinion (O'Connor et al. 2010), and economic uncertainty (Baker et al. 2013). Our comments data jointly include about 417 million words. In an automated analysis using the "bag of words" databases, we identified the degrees of negative and positive sentiment by matching the content of the comments with the positive or negative words from our datasets. Of the total number of words used, 29 million (or 7 percent) were identified as positive or negative words. We use the results of the content analysis to construct two variables that capture the overall number of positive and negative words in the content supplied by each oblast. Both variables, in our opinion, are crucial for understanding changes in attitudes and political identification. If we see these variables responding similarly to casualties, this would indicate greater polarization of sentiment. If, on the other hand, these variables move in opposite directions, this would indicate greater convergence in the overall sentiment.

4. Analysis

4.1 Casualties and Online Activism

Before examining the relationship between our measures of fragmentation and polarization on the one hand and violent conflict on the other, we need to establish that online communication in fact responds to war casualties. Hence, we start by testing the mechanism. We define the time variable as the number of months past the start of the 2013 Maidan protests
(which, therefore, is negative before November 2013 and positive after that). In Figure 3, we can clearly see the exponential rise in user activity. In fact, the average number of comments on political posts grew from about a thousand per month in early 2013 to several million in 2014. We organize the data as a time-oblast panel in which, for each day and each oblast, Casualties_it lists the number of soldiers native to oblast i who died in the conflict on day t.18 We first want to see whether the news about lost lives increases the number of comments from residents of the affected oblasts relative to the rest of Ukraine. To test this, we estimate the following fixed-effects model19:

Comments_it = c_i + Σ(k=0..6) β_k · Casualties_i,t−k + γ · after_t + δ · day_t + θ · month_t + λ · TimeTrend_t + e_it    (2)

where Comments_it is the total number of comments left by residents of oblast i at time t, c_i are oblast fixed effects that take care of unobserved heterogeneity among oblasts, day_t and month_t are two sets of time dummies, after_t is a time dummy equal to one after March 18, 2014 (the date the first Ukrainian soldier died defending Crimea), TimeTrend_t is a quadratic time trend, and e_it is the error term.20

Our full-sample results are presented in Table 1, Columns 1-3. We can see that regardless of model specification, casualties heighten the online activity of the affected oblasts' residents: they significantly increase the number of comments. Moreover, the effect appears to last for at least a week, as indicated by the joint significance of the six lags of Casualties in addition to the contemporaneous variable. Depending on the set of controls, the point estimate for the contemporaneous effect ranges from 48 to 212.

One caveat of the regression specified above is that before Maidan, which roughly coincides with the middle of our sample, casualties were zero and the number of political discussions, and therefore comments, was very small. After Maidan, on the other hand, both increased substantially. Hence, despite the fact that we attempted to control for time trends in the data, there is a possibility of finding a spurious relationship. To make sure that our results are not driven by the zero values of the variable before November 2013, we run the same set of fixed-effects regressions while restricting the sample to the post-Maidan period (Columns 4-6).

18 Note that we use the actual dates of casualties rather than their announcement in official media. To allow the news to reach the public, we include time lags. Effectively, we assume that the public becomes aware of the casualties from a wide variety of sources, including unofficial sources, rumors, and word of mouth. We do not want to discount such alternative methods of obtaining information. Moreover, the Ministry of Defense has published daily updates of casualty data, making the lags of official reporting sufficiently uniform. Another potential issue is that many army battalions are formed on a territorial basis, making natives of particular oblasts overrepresented in daily death tolls. The model's province-level fixed effects, however, help account for the presence of territorial battalions. The rotation of forces in the regular army is slow and usually exceeds three months. Since we base our analysis on daily fluctuations, Casualties are, for the most part, exogenous in the model.

19 We used six lags to take into account the cyclicality of weekly work and news schedules (the contemporaneous value plus six lags produces one full week). Our results are robust to adding an extra seventh lag, as can be seen in Appendix A, Table A1.

20 We estimate all regressions in levels. Log-log regressions, although less sensitive to outliers, are not a good fit for our data because the dependent and independent variables in equation (2) equal zero for most of 2013. Table A3 demonstrates, however, that re-estimating the model in logs does not affect our conclusions.
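The oblast fixed effects c_i in equation (2) are removed by the standard within (demeaning) transformation. The sketch below illustrates that estimator on synthetic data with a single regressor; it is only an illustration of the mechanics, since the paper's actual specification adds six lags of Casualties, the after dummy, day and month dummies, and a quadratic trend, and clusters standard errors by oblast.

```python
from collections import defaultdict

def within_estimator(panel):
    """One-regressor fixed-effects (within) estimator.

    panel: list of (unit, x, y) observations. Demeaning x and y within
    each unit removes the unit intercept c_i, so the slope is identified
    only from within-unit variation.
    """
    xs, ys = defaultdict(list), defaultdict(list)
    for unit, x, y in panel:
        xs[unit].append(x)
        ys[unit].append(y)
    num = den = 0.0
    for unit in xs:
        mx = sum(xs[unit]) / len(xs[unit])
        my = sum(ys[unit]) / len(ys[unit])
        for x, y in zip(xs[unit], ys[unit]):
            num += (x - mx) * (y - my)
            den += (x - mx) ** 2
    return num / den

# Synthetic two-oblast panel: y = c_i + 50 * x with very different
# intercepts per oblast (100 vs. 900).
panel = [("A", x, 100 + 50 * x) for x in (0, 1, 2, 3)] + \
        [("B", x, 900 + 50 * x) for x in (0, 2, 4, 6)]
print(within_estimator(panel))  # recovers the common slope, 50.0
```

A pooled OLS on these data would be badly biased by the level difference between the two units; the within transformation discards that between-unit variation, which is exactly what the oblast fixed effects accomplish.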
There is no clear trend in the number of comments, nor is there a trend in war casualties in 2014. Nevertheless, it can be seen from Table 1 that while the point estimates are, as expected, smaller for this subsample, the coefficients remain statistically significant.

[Table 1 about here]

Table 2 re-estimates our model in the regionally divided subsamples. We see that the previous results are driven by users from Eastern-Southern provinces responding to their own casualties, while the results for West-Central provinces are not significant.

[Table 2 about here]

It appears that unlike VK users from Eastern-Southern provinces, residents of West-Central provinces either do not respond at all or are no more sensitive to their home province casualties than to the casualties suffered by other provinces. To examine whether VK users indeed react to casualties from other provinces, in Table 3 we estimate the effect of casualties attributable to all provinces other than province i and define the independent variable as

OtherCasualties_it = Σ_j Casualties_jt − Casualties_it.

Columns 1-3 in Table 3 present different model specifications for the entire sample. Columns 4 and 5 re-estimate the models for the regionally defined subsamples. These show that our expectations about the effect of casualties on online activism hold for the entire sample and for West-Central provinces, but not for the East-South. VK users from East-Southern oblasts do not respond to war casualties from outside their oblasts. While news about war casualties heightens online activism across the entire country, our results suggest that in West-Central provinces users show more solidarity with casualties from other provinces, while the online activism of their East-Southern counterparts is driven primarily by their home province losses.21

[Table 3 about here]

4.2. Assessing Fragmentation

We have already uncovered some non-trivial differences between the West-Central and South-Eastern parts of Ukraine.
Next we proceed to assess how the intensity of armed conflict affects online fragmentation. Prior research indicates that regional fragmentation of online communication was typical during times of electoral political mobilization (Duvanova et al. 2015). First, we would like to test whether Maidan and the war in Eastern Ukraine have changed this and opened up a dialogue between different parts of the country. We examine whether online discussions bring together users from parts of the country historically separated by linguistic and political divides. Did the majority of comments on individual political posts come from either South-Eastern or West-Central Ukraine, or is there a balanced mix of the two, indicating an inclusive dialogue? Figure 5 illustrates how our dyadic cohesion index (1), which measures fragmentation of communication between any two different groups of provinces, evolved over time in 2013 and 2014.22 One common element in the nature of political communication across different groups of provinces is its high volatility before Maidan, which reflects the fact that political discussions were relatively infrequent before the Revolution of Dignity. The volatility subsided significantly for most groups of provinces after the Maidan protests.

21 In a recent experimental study, Bauer et al. (2013) found that individual experiences of war strengthen in-group parochialism but reduce trust in strangers. One possible psychological mechanism accounting for our findings is that information about war casualties may heighten the sense of belonging to a narrowly defined group and alienate out-group members.

22 We take monthly averages to smooth out daily fluctuations in the index, particularly in the first half of the sample, which has generally fewer political posts and comments.
Although on average the level of communication between the West-Central and East-Southern parts of the country remained the same, after Maidan cohesion capacity slightly decreased for the West vs. East pair but improved or stayed the same for all other combinations of provinces. It appears that communication became more integrated and inclusive for all regional pairs that do not include the East. For those that do, it improved only between South and East, the regions that are culturally and historically closest to each other. To test whether these changes are statistically significant, we regress the cohesion index on a Maidan dummy, defined as zero before November 21, 2013 and one after that. The regression results are presented in Table 4. The results in Column 1 confirm that engagement between West-Central and East-Southern Ukraine did not increase after the Maidan protests. At the same time, we find that engagement went up for all other groups of provinces except the West vs. East pair, which experienced a negative but statistically insignificant change. The statistically significant improvements in communication are larger for pairs including Central provinces than for pairs including Southern Ukraine.

[Table 4 about here]

When it comes to Hypotheses 1a and 2a, both find only partial support. We can see that while there is no change in communication between the broad West and broad East, the intensification of the violent conflict clearly increased communication in all regional pairs except West vs. East, reducing intra-regional fragmentation. Our measure of inter-provincial cohesion relies on the geographical decomposition of VK users' comments, which constitute responses to the original messages posted by users residing in a given province, but it ignores the posts and their origins. This might bias our analysis, making it more likely to find null results.
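With a single 0/1 regressor, the Maidan-dummy coefficient in these cohesion regressions is simply the difference between the post- and pre-Maidan means of the index. The sketch below demonstrates this on hypothetical index values (not the paper's data); note that the estimates in Table 4 additionally use Newey-West standard errors for inference, which this point-estimate sketch omits.

```python
def dummy_ols(y, d):
    """OLS of y on a constant and a 0/1 dummy d.

    With a single dummy regressor, the intercept equals mean(y | d == 0)
    and the slope equals mean(y | d == 1) - mean(y | d == 0).
    """
    y0 = [yi for yi, di in zip(y, d) if di == 0]
    y1 = [yi for yi, di in zip(y, d) if di == 1]
    intercept = sum(y0) / len(y0)          # pre-Maidan mean of the index
    slope = sum(y1) / len(y1) - intercept  # post-Maidan shift
    return intercept, slope

# Hypothetical daily cohesion index for one pair of province groups;
# the dummy switches on after November 21, 2013.
cohesion = [0.40, 0.44, 0.42, 0.55, 0.53, 0.57]
maidan = [0, 0, 0, 1, 1, 1]
print(dummy_ols(cohesion, maidan))
```

Here the slope of roughly 0.13 would be read the same way as the Table 4 coefficients: a positive shift means communication between the two groups became more cohesive after Maidan.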
The engagement capacity index described in Section 3.2 takes into account the original posts and measures their capacity to entice responses from outside the originators' home oblasts. In line with Hypothesis 1, we test whether posts from provinces with casualties attract more attention (and perhaps sympathy) from the wider online community. Specification 1, and especially specification 2 with additional controls, in Table 5 clearly rule out the notion that news about oblast-specific losses entices users from other oblasts to engage more readily in cross-provincial communication. We see no, or only marginally significant, effects of casualties on the ability of posts from affected provinces to entice a strong response from other provinces; moreover, the point estimates are negative. Hence, Hypothesis 1a finds no clear support under either the communication cohesion or the engagement capacity measure. Columns 3-6 in Table 5 also rule out both Hypotheses 2a and 2b. We can see that while no effect can be detected in the East, South, and Center subsamples, the West's ability to engage users from other provinces actually diminishes with casualties (the opposite of Hypothesis 2a). This means that online activism on the part of users from the affected provinces is perhaps directed at, and influences, other users from the same province, rather than bridging provincial lines. Coupled with our finding that the West-Center responds more actively to casualties suffered by other provinces, this result suggests that province-specific casualties make VK communication more region-specific rather than fostering a broad, cross-regional dialogue that engages heterogeneous parts of Ukraine. Together, our results show that while war casualties tend to mobilize VK users from the affected provinces (as well as users from West-Central provinces regardless of whether their home province incurred losses), such mobilization does not necessarily lead to cross-provincial communication.
Not only does the analysis fail to detect evidence that casualties increase inter-regional engagement; VK users from the West engage in less dialogue with the rest of the country as casualties go up. As a result of such regional compartmentalization, the ability of users' posts to attract comments from outside their province (and hence "influence" the social network) diminishes.

[Table 5 about here]

4.3. Assessing the polarization hypotheses

Although the above analysis finds that users of VKontakte respond to war casualties with increasing fragmentation of inter-provincial communication, this does not necessarily mean that virtual conversation becomes polarized as well. In our data, some communication continues to be carried on between users from different provinces. It is possible that those users develop bridging cross-regional connections by expressing sympathy and solidarity, or by abstaining from divisive political debates. (The latter interpretation would be consistent with our findings of decreased engagement in the West.) Without knowing whether users agree or argue over the posted content, it is impossible to rule such an interpretation out. In order to differentiate between constructive and destructive conversations, we analyze their content with the help of the positive and negative "bags of words." If news about war casualties increases polarization in online discussions, we should see the rise in negative sentiment (negative-connotation words) go hand in hand with an increase in positive sentiment (and vice versa). If, as Hypothesis 1b postulates, news about war casualties decreases polarization, we should expect the positive and negative word counts to move in opposite directions in response to such news. Table 6 reveals that for Ukraine as a whole, discussions do become more polarized in response to war casualties; both coefficients in Columns 1 and 6 are positive and statistically significant.
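The word-matching step behind these sentiment counts can be sketched with toy lexicons; the actual analysis uses the 8,863 positive and 24,299 negative Ukrainian and Russian words described earlier, so the English placeholder words below are purely illustrative.

```python
# Toy lexicons standing in for the Ukrainian and Russian
# "bag of words" datasets.
POSITIVE = {"peace", "hope", "unity"}
NEGATIVE = {"war", "loss", "enemy"}

def sentiment_counts(comments):
    """Count positive and negative lexicon hits across a list of comments."""
    pos = neg = 0
    for comment in comments:
        for token in comment.lower().split():
            word = token.strip(".,!?;:\"'()")  # drop surrounding punctuation
            if word in POSITIVE:
                pos += 1
            elif word in NEGATIVE:
                neg += 1
    return pos, neg

comments = ["War brings loss.", "Hope for peace and unity!"]
print(sentiment_counts(comments))  # → (3, 2)
```

Aggregating these two counts by oblast and day yields the dependent variables used in the polarization tests: both counts rising together with casualties indicates polarization, while the counts moving in opposite directions indicates convergence.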
Interestingly, this conclusion does not hold for all parts of Ukraine. Only provinces located in the Eastern part of the country experience simultaneous increases in both positive and negative sentiment. Our analysis shows that war casualties have no discernible effect on the intensity of negative and positive sentiment in other parts of the country. These results are inconsistent with Hypothesis 2a. The fact that war casualties appear to be a divisive subject in Eastern provinces is consistent with Hypothesis 2b: as a province located in the East experiences more war fatalities, the overall sentiment of the comments responding to this province's posts becomes more emotionally charged and polarized.

Piecing together the results of our empirical tests, we can rule out the notion that Ukrainians as a whole are becoming more united in the face of the challenges to the country's territorial integrity. The most revealing results come from the analysis of regional subsamples, which show profound differences in the way West and East Ukrainians react to the conflict. East-Southern provinces, including Crimea, which is currently occupied by Russia, and Donetsk and Luhansk, which lie in the active war zone, tend to respond to their own, but not other oblasts', war casualties with greater participation in online discussions. Yet oblast-specific death tolls have no discernible impact on the fragmentation of online discussion in East-Southern Ukraine. This is in sharp contrast to West-Central provinces, which tend to increase their network participation in response to other oblasts' casualties. West-Central oblasts do not respond to their own casualties with greater online activism when compared to casualties from other oblasts. Our results also show that war casualties have a polarizing effect in the East of Ukraine, but not in West-South-Central Ukraine. As a result, Eastern oblasts appear to be increasingly polarized by the conflict.

5. Conclusion

Discussing the ideological battles waged in Russia and Ukraine, Sean Roberts and Robert Orttung (2015) summarized an increasingly popular notion: "In the context of the war, Ukrainians are becoming only more united in their patriotism and opposition to Russia." Using data on the cross-regional structure of politically relevant online communication among users of the VKontakte social networking site, we put this notion to the test. We examined how political online communication among Ukraine's residents responds to violent civil conflict and foreign intervention. Ukraine's geographically concentrated political and ethno-linguistic divisions presented us with a unique opportunity to identify the underlying structure of social cleavages without the use of individual-level data, which are often subject to a host of methodological problems and are rarely available as time series. The paper analyzed a panel of provinces spanning the most active phases of domestic protests and military conflict and evaluated how online activism, inter-provincial cohesion and peer influence (engagement), as well as the overall discourse sentiment, respond to province-specific war casualties.

Our analysis provides little support for the notion that Ukraine's public as a whole becomes more united in response to the violent conflict in the Donbas; for a number of provinces this does not appear to be the case. We found that, generally, war casualties led to increased levels of online activism (measured by the amount of user-contributed content). At the same time, we found no evidence that war casualties affect the level of network cohesion in inter-provincial online communication between West-Central and East-Southern Ukraine. While the intensity of military conflict entices online activism, it mainly activates regional rather than nation-wide connections.
Our analysis suggests that, at least in the sphere of social network communication among VK users, the war separates rather than unites Ukrainians living far apart. Our analysis also revealed nontrivial differences in the ways different parts of the country respond to the violent conflict. While users of social network platforms in East-Southern Ukraine react very little to war casualties from other parts of Ukraine, network users from West-Central provinces respond with increasing activism to war casualties outside their home provinces. Moreover, we found that while war intensity does not affect the degree of fragmentation of network communication in Eastern, Central, and Southern provinces, Western provinces' communication becomes more, rather than less, fragmented, as indicated by the engagement index. Together these results suggest that as the war in the Donbas progresses, Ukrainian virtual society, or at least the part of it that uses the VKontakte platform, becomes more politically galvanized but fragmented along political and ethno-linguistic divisions. We also found that military conflict tends to polarize network discourse in Eastern Ukraine, but not in other provinces. This shows that while Western oblasts' residents' online behavior contributes to the provincial fragmentation of online discussions, Eastern oblasts appear to be increasingly polarized by the conflict. In some sense, the "West" becomes more united as its network communication drifts away from the increasingly polarized "East."

This research makes several contributions. It engages the literature on mass communication and political conflict. Studies of persuasion and public opinion formation suggest that selective exposure and media bias may reinforce partisan preferences and further fragment and polarize the public (Stroud 2010; Durante and Knight 2012). At the same time, media messages may reinforce the sense of in-group (as in the case of Nazi propaganda in Germany [Adena et al.
2015]) and out-group (as in the case of Serbian radio in Croatia [DellaVigna et al. 2014]) identities. In Ukraine, exposure to Russian media has been linked to electoral support for pro-Russian parties (Peisakhin and Rozenas 2015), and pre-war national elections coincided with increased regional fragmentation of social media (Duvanova et al. 2015). Our research suggests that social media might further reinforce selective exposure and, as a result, contribute to political fragmentation and polarization.

Our research also contributes to the study of new, technology-enabled forms of political participation. A growing number of people around the world use social networking platforms not only to consume but also to produce political information. This paper furthers our understanding of digitally enabled forms of political participation and their relation to conventional forms of political conflict. Our research contributes to the emerging field studying the political roles of digital social media. We believe the Ukrainian case is of major significance because it helps highlight the importance of social networks in societies with a deficit of traditional channels of political expression, collective action, and organization.

We also make a methodological contribution. We propose practical methods for integrating big data into the study of mass attitudes and social behavior. As the ongoing digitization of social relations produces large repositories of data, the social sciences face an important task of developing tools and approaches for utilizing these resources. Our research contributes to this task.

Bibliography

Acemoglu, Daron, and James A. Robinson. 2006. Economic Origins of Dictatorship and Democracy. Cambridge and New York: Cambridge University Press.

Adena, Maja, Ruben Enikolopov, Maria Petrova, Veronica Santarosa, and Ekaterina Zhuravskaya. 2015. "Radio and the Rise of the Nazis in Prewar Germany," WZB Discussion Paper, No.
SP II 2013-310r.

Anderson, Benedict. 1982. Imagined Communities: Reflections on the Origin and Spread of Nationalism. London: Verso.

Aslund, Anders, and Michael McFaul. 2006. Revolution in Orange: The Origins of Ukraine's Democratic Breakthrough. Brookings Institution Press.

Baker, Scott R., Nicholas Bloom, and Steven J. Davis. 2013. "Measuring Economic Policy Uncertainty," Working paper, Chicago Booth Research Paper No. 13-02. Available at SSRN: http://ssrn.com/abstract=2198490 or http://dx.doi.org/10.2139/ssrn.2198490

Bauer, Michal, Alessandra Cassar, Julie Chytilová, and Joseph Henrich. 2013. "War's Enduring Effects on the Development of Egalitarian Motivations and In-Group Biases," Psychological Science 25(1): 47-57.

Baum, Matthew. 2012. "Preaching to the Choir or Converting the Flock: Presidential Communication Strategies in the Age of Three Medias." In iPolitics: Citizens, Elections, and Governing in the New Media Era, ed. Richard Fox and Jennifer Ramos. Cambridge University Press, pp. 183-205.

Beissinger, Mark R. 2012. "Russian Civil Societies, Conventional and Virtual," Taiwan Journal of Democracy 8(2): 91-104.

Beissinger, Mark R. 2013. "The Semblance of Democratic Revolution: Coalitions in Ukraine's Orange Revolution," American Political Science Review 107(3): 574-592.

Bennett, Lance W. and Alexandra Segerberg. 2013. The Logic of Connective Action: Digital Media and the Personalization of Contentious Politics. Cambridge University Press.

Birch, Sarah. 2000. Elections and Democratization in Ukraine. Palgrave Macmillan.

Blattman, Christopher and E. Miguel. 2010. "Civil War," Journal of Economic Literature 48(1): 3-57.

Blattman, Christopher. 2009. "From Violence to Voting: War and Political Participation in Uganda," American Political Science Review 103(2): 231-247.

Bond, Robert M., Christopher J. Fariss, Jason J. Jones, Adam D. I. Kramer, Cameron Marlow, Jaime E. Settle and James H. Fowler. 2012.
"A 61-million-person experiment in social influence and political mobilization," Nature 489: 295-298.

Christakis, N. A. and James H. Fowler. 2009. Connected: The Surprising Power of Our Social Networks and How They Shape Our Lives. Little, Brown, and Company.

Clem, Ralph S., and Peter R. Craumer. 2008. "Orange, Blue and White, and Blonde: The Electoral Geography of Ukraine's 2006 and 2007 Rada Elections," Eurasian Geography and Economics 49(2): 127-151.

Davenport, Christian. 2009. Media Bias, Perspective, and State Repression: The Black Panther Party. Cambridge: Cambridge University Press.

DellaVigna, Stefano and Matthew Gentzkow. 2010. "Persuasion: Empirical Evidence," Annual Review of Economics 2: 643-669.

DellaVigna, Stefano, Ruben Enikolopov, Vera Mironova, Maria Petrova and Ekaterina Zhuravskaya. 2014. "Cross-Border Media and Nationalism: Evidence from Serbian Radio in Croatia," American Economic Journal: Applied Economics 6(3): 103-132.

Driscoll, John C. and Aart C. Kraay. 1998. "Consistent Covariance Matrix Estimation with Spatially Dependent Panel Data," Review of Economics and Statistics 80(4): 549-560.

Durante, Ruben and Brian Knight. 2012. "Partisan Control, Media Bias, and Viewer Responses: Evidence from Berlusconi's Italy," Journal of the European Economic Association 10(3): 451-481.

Duvanova, Dinissa, Alexander Semenov, and Alexander Nikolaev. 2015. "Do Social Networks Bridge Political Divides? The Analysis of VKontakte Social Network Communication in Ukraine," Post-Soviet Affairs 31(3): 224-249.

Earl, Jennifer and Katrina Kimport. 2011. Digitally Enabled Social Change: Activism in the Internet Age. Cambridge, MA: MIT Press.

Enikolopov, Ruben, Maria Petrova, and Konstantin Sonin. 2015. "Social Media and Financial Markets: Evidence from Russia." Unpublished.

Enikolopov, Ruben, Maria Petrova and Ekaterina Zhuravskaya. 2011. "Media and Political Persuasion: Evidence from Russia," American Economic Review 101(7): 3253-3285.

Fearon, James D. and David D.
Laitin. 2000. "Violence and the Social Construction of Ethnic Identity," International Organization 54(4): 845-877.

Fearon, James D., and David D. Laitin. 2003. "Ethnicity, Insurgency, and Civil War," American Political Science Review 97(1): 75-90.

Fox, Richard and Jennifer Ramos, eds. 2012. iPolitics: Citizens, Elections, and Governing in the New Media Era. Cambridge University Press.

Gentzkow, Matthew and Jesse M. Shapiro. 2011. "Ideological Segregation Online and Offline," The Quarterly Journal of Economics 126: 1799-1839.

Gentzkow, Matthew and Jesse M. Shapiro. 2010. "What Drives Media Slant? Evidence from US Daily Newspapers," Econometrica 78(1): 35-71.

Godre, S., A.G. Nikolaev, S. Khopkar, and V. Govindaraju. 2015. "Engagement Capacity: A Measure of the Value Created by Users in Online Social Network or Forum Communication," UB technical report.

Hamilton, Allison and Caroline Tolbert. 2012. "Political Engagement and the Internet in the 2008 U.S. Presidential Elections: A Panel Survey." In Digital Media and Political Engagement Worldwide: A Comparative Study, ed. Eva Anduiza, Michael Jensen and Laia Jorba. Cambridge University Press, pp. 56-79.

Howard, Philip and Muzammil Hussain. 2013. Democracy's Fourth Wave? Digital Media and the Arab Spring. Oxford University Press.

Huckfeldt, Robert, and John T. Sprague. 1995. Citizens, Politics, and Social Communication: Influence in an Election Campaign. New York: Cambridge University Press.

Jackson, Matthew O. and Massimo Morelli. 2007. "Political Bias and War," American Economic Review 97(4): 1353-1373.

Jennings, Kent and Vicki Zeitner. 2003. "Internet Use and Civic Engagement: A Longitudinal Analysis," Public Opinion Quarterly 67: 311-334.

Jensen, Michael, James Danziger and Alladi Venkatesh. 2007. "Civil Society and Cyber Society: The Role of the Internet in Community Associations and Democratic Politics," Information Society 23(1): 39-50.
Jensen, Michael, Laia Jorba and Eva Anduiza. 2012. "Introduction." In Digital Media and Political Engagement Worldwide: A Comparative Study, ed. Eva Anduiza, Michael Jensen and Laia Jorba. Cambridge University Press, pp. 1-15.

Kerbel, Matthew. 2009. Netroots: Online Progressives and the Transformation of American Politics. New York: Paradigm Publishers.

Kerbel, Matthew. 2012. "The Dog That Didn't Bark: Obama, Netroots Progressives, and Health Care Reform." In iPolitics: Citizens, Elections, and Governing in the New Media Era, ed. Richard Fox and Jennifer Ramos. Cambridge University Press, pp. 233-258.

Lawless, Jennifer. 2012. "Twitter and Facebook: New Ways for Members of Congress to Send the Same Old Message?" In iPolitics: Citizens, Elections, and Governing in the New Media Era, ed. Richard Fox and Jennifer Ramos. Cambridge University Press, pp. 206-232.

Lim, Merlyna. 2012. "Clicks, Cabs, and Coffee Houses: Social Media and Opposition Movements in Egypt, 2004-2011," Journal of Communication 62(2): 231-248.

Montalvo, Jose G. and Marta Reynal-Querol. 2010. "Ethnic Polarization and the Duration of Civil Wars," Economics of Governance 11(2): 123-143.

Montalvo, Jose G., and Marta Reynal-Querol. 2005. "Ethnic Polarization, Potential Conflict, and Civil Wars," American Economic Review 95(3): 796-816.

Mueller, John. 1970. "Presidential Popularity from Truman to Johnson," American Political Science Review 64(1): 18-34.

Murphy, Kevin M. and Andrei Shleifer. 2004. "Persuasion in Politics," American Economic Review 94(2): 435-439.

Mylonas, Harris. 2012. The Politics of Nation-Building: Making Co-Nationals, Refugees, and Minorities. New York: Cambridge University Press.

Norris, Pippa. 2003. "Preaching to the Converted: Pluralism, Participation, and Party Websites," Party Politics 9(1): 21-45.

O'Connor, B., R. Balasubramanyan, B.R. Routledge, and N.A. Smith. 2010. "From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series." In Proceedings of ICWSM.

Pang, B.
and Lillian Lee. 2008. "Opinion Mining and Sentiment Analysis," Foundations and Trends in Information Retrieval 2(1-2): 1-135.

Peisakhin, Leonid and Arturas Rozenas. 2015. "Persuasion and Dissuasion with Biased Media: Russian Television in Ukraine." Unpublished.

Prior, Markus. 2007. Post-Broadcast Democracy. New York: Cambridge University Press.

Roberts, Sean and Robert Orttung. 2015. "How to Understand the Post-Soviet 'War of Lapels'," The Washington Post, May 8.

Rosenblat, Tanya S. and Markus M. Mobius. 2004. "Getting Closer or Drifting Apart?" Quarterly Journal of Economics 119(3): 971-1009.

Russett, Bruce. 1990. Controlling the Sword: The Democratic Governance of National Security. Cambridge, MA: Harvard University Press, pp. 20-51.

Semenov, A. and J. Veijalainen. 2013. "A Modeling Framework for Social Media Monitoring," International Journal of Web Engineering and Technology 8(3): 217-249.

Semenov, A., J. Veijalainen, and A. Boukhanovsky. 2011. "A Generic Architecture for a Social Network Monitoring and Analysis System." In The 14th International Conference on Network-Based Information Systems, Los Alamitos, CA, USA, pp. 178-185.

Shapley, Lloyd S. 1953. "A Value for n-Person Games." In Contributions to the Theory of Games, Volume II, ed. H.W. Kuhn and A.W. Tucker. Annals of Mathematical Studies v. 28, pp. 307-317. Princeton University Press.

Stroud, Natalie. 2011. Niche News: The Politics of News Choice. Oxford University Press.

Stroud, Natalie Jomini. 2010. "Polarization and Partisan Selective Exposure," Journal of Communication 60(3): 556-576.

Sunstein, Cass. 2001. Republic.com. Princeton: Princeton University Press.

Tang, Min, Laia Jorba and Michael Jensen. 2012. "Digital Media and Political Attitudes in China." In Digital Media and Political Engagement Worldwide: A Comparative Study, ed. Eva Anduiza, Michael Jensen and Laia Jorba. Cambridge University Press, pp. 221-239.

Thies, C. G. 2005.
"War, Rivalry, and State Building in Latin America," American Journal of Political Science 49(3): 451-465.

Tilly, Charles. 1992. Coercion, Capital, and European States, AD 990-1992. Malden, Mass. and Oxford: Blackwell.

Tufekci, Zeynep and Christopher Wilson. 2012. "Social Media and the Decision to Participate in Political Protest: Observations from Tahrir Square," Journal of Communication 62(2): 363-379.

Vitak, Jessica, Paul Zube, Andrew Smock, Caleb T. Carr, Nicole Ellison and Cliff Lampe. 2011. "It's Complicated: Facebook Users' Political Participation in the 2008 Election," Cyberpsychology, Behavior, and Social Networking 14(3): 107-114.

Voors, Maarten J., Eleonora E. M. Nillesen, Philip Verwimp, Erwin H. Bulte, Robert Lensink, and Daan P. Van Soest. 2012. "Violent Conflict and Behavior: A Field Experiment in Burundi," American Economic Review 102(2): 941-964.

Webster, James. 2007. "Diversity of Exposure." In Media Diversity and Localism: Meaning and Metrics, ed. Philip Napoli. Mahwah, NJ: Erlbaum, pp. 309-325.

Wilkinson, Steven I. 2004. Votes and Violence: Ethnic Competition and Ethnic Violence in India. New York: Cambridge University Press.

Zielinski, Jacob, K. M. Slomczynski, and Goldie Shabad. 2008. "Fluid Party Systems, Electoral Rules and Accountability of Legislators in Emerging Democracies: The Case of Ukraine," Party Politics (January): 91-112.

Figure 1. The geographical breakdown of Ukrainian oblasts. W - Western region; C - Central region; S - Southern region; E - Eastern region. The cities of Kyiv and Sevastopol are included in Kyivs'ka oblast and the Autonomous Republic of Krym, respectively. The cities of Luhansk and Donetsk are excluded from the empirical analysis because both are war-ridden cities, currently controlled by the Russian Army and pro-Russian rebels.

Figure 2. Interprovincial Network Ties: VKontakte discussion groups with active contribution during November 2013 (top) and August 2014 (bottom).
Each line represents the intensity of shared communication as the number of discussion groups that both provinces contributed to. For each discussion group, we calculated its share in the total number of messages contributed by each of the provinces. We then added the products of the provinces' contributions (as shares of total messages) to each of the shared topics for the corresponding dyad. The width and color of the lines reflect the strength of inter-provincial ties; thicker, darker lines correspond to stronger ties.

Figure 3. Monthly casualties suffered by Ukraine and the total number of political posts and comments on VK.com.

Figure 4. Engagement Index for the West&Center and South&East oblasts of Ukraine.

Figure 5. Over-time changes in cohesion capacity indices for various groups of Ukrainian provinces. Notes: The vertical axis shows the cohesion index (0; 1). The horizontal axis shows time to Maidan (November 2013), in months. The confidence bands are computed using a non-parametric bootstrap.

Table 1. The Effect of Casualties on the Absolute Number of Comments from the Same Oblast. Oblast-Level Fixed Effects Regressions.

Region: Whole Ukraine in all columns (1)-(6).
Casualties_it:             (1) 212.310*** (74.288); (2) 120.610*** (41.251); (3) 48.148* (26.899); (4) 18.788** (7.019); (5) 16.742** (6.981); (6) 17.286** (7.225)
Sum of lagged casualties:  (2) 755.027*** (252.503); (3) 264.878* (155.727); (4) 57.579*** (19.390); (5) 64.084** (27.702); (6) 48.043* (25.606)
Max lag of casualties:     (2) 114.316*** (40.045); (3) 43.018 (26.283); (4) 12.467** (5.743); (5) 12.902* (6.673); (6) 12.580* (6.684)
Year 2014 dummy_t:         964.557*** (237.415)
2014-only sample:          No in (1)-(3); Yes in (4)-(6)
M/D/W dummies:             Yes in (3), (5), and (6); No otherwise
Daily quadratic time trend: Yes in (6) only
R-squared:                 0.017; 0.059; 0.353; 0.002; 0.130; 0.154
Sample size:               16790; 16652; 16652; 8395; 8395; 8395

Notes: DV: absolute number of comments from an oblast suffering casualties. Daily oblast-level panel data spanning 2013 and 2014 are used for estimation.
M/D/W stands for month, calendar day, and weekday. All specifications include a constant term. All specifications, except specification 1, include six lags of the independent variable in addition to its contemporaneous value. Robust oblast-clustered standard errors are in parentheses.[25] * p<0.10, ** p<0.05, *** p<0.01.

[25] We also estimated our regressions using Driscoll and Kraay (1998) standard errors (Table A2 in Appendix A) and found results to be very similar to those obtained using clustering.

Table 2. The Effect of Casualties on the Number of Comments from the Same Oblast. Regional Analysis. Oblast-Level Fixed Effects Regressions.

Region                      West&Center   South&East
                            (1)           (2)
Casualties_it               11.689        28.424**
                            (9.104)       (8.779)
Sum of lagged casualties    13.690        110.490
                            (16.041)      (50.009)
Max lag of casualties       8.885         24.566**
                            (8.717)       (9.724)
R-square                    0.092         0.405
Sample size                 5840          2555

Notes: DV: Absolute number of comments from an oblast suffering casualties. Daily oblast-level panel data for 2014 only. West, Center, South, and East are the regions of Ukraine, defined according to Figure 1. All specifications include a constant term; daily quadratic time trend; and month, calendar day, and weekday dummies. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table 3. The Effect of Casualties from outside the Oblast on the Number of Comments from a Given Oblast. Oblast-Level Fixed Effects Regressions.
Region                         Whole      Whole      Whole      West&      South&
                               Ukraine    Ukraine    Ukraine    Center     East
                               (1)        (2)        (3)        (4)        (5)
OtherCasualties_it             9.691***   1.523**    1.091***   1.105**    1.060
                               (2.556)    (0.618)    (0.386)    (0.455)    (0.801)
Sum of lagged OtherCasualties  57.332***  3.202      2.376      4.168**    -1.740
                               (15.098)   (2.862)    (1.783)    (1.595)    (4.593)
Max lag of OtherCasualties     8.577***   0.945**    1.240***   1.159***   1.459*
                               (2.251)    (0.364)    (0.321)    (0.383)    (0.639)
2014 only sample               No         Yes        No         No         No
M/D/W dummies                  No         No         Yes        Yes        Yes
Daily quadratic time trend     No         Yes        Yes        Yes        Yes
R-square                       0.075      0.001      0.347      0.238      0.656
Sample size                    16652      8395       16652      11584      5068

Notes: DV: Absolute number of comments from a given oblast. OtherCasualties_it is defined as Σ_i Casualties_it – Casualties_it. Daily oblast-level panel data for 2013 and 2014. M/D/W stands for month, calendar day, and weekday. All specifications include a constant term. All specifications, except specification 1, include six lags of the independent variable in addition to its contemporaneous value. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table 4. Dyadic Communication Cohesion among Various Regions of Ukraine

              West&Center     South     West vs.  West vs.  West vs.  Center vs.  Center vs.
              vs. East&South  vs. East  Center    South     East      South       East
              (1)             (2)       (3)       (4)       (5)       (6)         (7)
Maidan Dummy  0.001           0.067***  0.116***  0.066***  -0.010    0.084***    0.026**
              (0.012)         (0.012)   (0.013)   (0.012)   (0.014)   (0.010)     (0.011)
R-square      0.000           0.062     0.146     0.062     0.001     0.118       0.011
Sample size   730             713       718       705       716       717         721

Notes: Daily time-series data are used for estimation. All regressions include a constant term. Maidan Dummy = 1 after Nov. 2013. West, Center, South, and East are the regions of Ukraine, defined according to Figure 1. Newey-West standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table 5. The Effect of Casualties on the Engagement Index of a Given Oblast. Oblast-Level Fixed Effects Regressions.
Region                      Whole      Whole      West      Center    South     East
                            Ukraine    Ukraine
                            (1)        (2)        (3)       (4)       (5)       (6)
Casualties_it               -1.008     -0.851     -0.060*   -0.047    0.290     0.022
                            (0.596)    (0.547)    (0.033)   (0.049)   (0.313)   (0.050)
Sum of lagged casualties    -6.991*    -5.671     -0.585    -0.510    2.377     0.371
                            (3.902)    (3.480)    (0.401)   (0.348)   (2.523)   (0.127)
Max lag of casualties       -0.958*    -0.721     -0.041    -0.033    0.445     0.082
                            (0.514)    (0.458)    (0.050)   (0.063)   (0.363)   (0.024)
2014 only sample            No         No         Yes       Yes       Yes       No
M/D/W dummies               No         Yes        Yes       Yes       Yes       Yes
Daily quadratic time trend  No         Yes        Yes       Yes       Yes       Yes
R-square                    0.004      0.011      0.403     0.258     0.226     0.410
Sample size                 16629      16629      3650      2190      1825      730

Notes: Daily oblast-level panel data for 2013 and 2014. M/D/W stands for month, calendar day, and weekday. West, Center, South, and East are the regions of Ukraine, defined according to Figure A1. All specifications include a constant term. All specifications, except specification 1, include six lags of the independent variable in addition to its contemporaneous value. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table 6. The Relationship Between Casualties and the Number of Positive/Negative Words Used in Comments. Oblast-Level Fixed Effects Regressions.
                 Negative Words                                          Positive Words
Region           Ukraine   West      Center    South      East          Ukraine   West      Center    South      East
                 (1)       (2)       (3)       (4)        (5)           (6)       (7)       (8)       (9)        (10)
Casualties_t     5.795*    -1.223    10.466    25.155     10.289**      13.872**  -0.996    25.695    45.180     20.991**
                 (3.219)   (1.126)   (9.478)   (18.666)   (3.443)       (6.472)   (1.661)   (21.647)  (27.241)   (6.178)
Sum of lagged    13.013    -2.226    2.696     108.175    45.050**      34.256*   4.370*    33.366    177.465    76.675**
casualties       (13.343)  (2.738)   (8.711)   (110.914)  (14.471)      (19.717)  (2.281)   (34.959)  (157.331)  (26.095)
Max lag of       4.825     1.595     5.659     19.031     13.645***     9.932*    4.125     13.422    29.108     20.676***
casualties       (3.067)   (2.022)   (5.958)   (25.486)   (1.138)       (4.845)   (4.259)   (14.977)  (36.945)   (3.021)
R-square         0.170     0.177     0.139     0.393      0.467         0.166     0.189     0.132     0.381      0.469
Sample size      8395      3650      2190      1825       730           8395      3650      2190      1825       730

Notes: Daily oblast-level panel data for 2014 only. West, Center, South, and East are the regions of Ukraine, defined according to Figure A1. All specifications include a constant term; daily quadratic time trend; and month, calendar day, and weekday dummies. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Appendix A. Robustness Checks for Results Reported in Table 1.

Table A1. The Effect of Casualties on the Absolute Number of Comments from the Same Oblast. Seven-Lags Specification. Oblast-Level Fixed Effects Regressions.
Region                      Whole       Whole       Whole      Whole      Whole     Whole
                            Ukraine     Ukraine     Ukraine    Ukraine    Ukraine   Ukraine
                            (1)         (2)         (3)        (4)        (5)       (6)
Casualties_it               212.310***  108.748***  43.925*    17.424**   15.371**  16.216**
                            (74.288)    (37.669)    (24.754)   (6.606)    (6.655)   (6.923)
Sum of lagged casualties                799.759***  286.719*   64.216***  75.158**  56.730*
                                        (267.633)   (167.543)  (21.359)   (30.659)  (28.023)
Max lag of casualties                   102.926**   38.878     11.163*    11.485*   11.478*
                                        (36.734)    (24.131)   (5.411)    (6.281)   (6.358)
Year 2014 dummy_t                                   961.455***
                                                    (236.023)
2014 only sample            No          No          No         Yes        Yes       Yes
M/D/W dummies               No          No          Yes        No         Yes       Yes
Daily quadratic time trend  No          No          No         No         No        Yes
R-square                    0.017       0.062       0.354      0.002      0.131     0.154
Sample size                 16790       16629       16629      8395       8395      8395

Notes: See notes to Table 1. * p<0.10, ** p<0.05, *** p<0.01.

Table A2. The Effect of Casualties on the Absolute Number of Comments from the Same Oblast. Oblast-Level Fixed Effects Regressions with Driscoll and Kraay (1998) Errors.

Region                      Whole       Whole       Whole        Whole     Whole      Whole
                            Ukraine     Ukraine     Ukraine      Ukraine   Ukraine    Ukraine
                            (1)         (2)         (3)          (4)       (5)        (6)
Casualties_it               212.310***  120.610***  48.148***    18.788**  16.742***  17.286***
                            (33.395)    (19.337)    (8.347)      (8.664)   (5.690)    (6.280)
Sum of lagged casualties                755.027***  264.878***   57.579    64.084***  48.043**
                                        (128.899)   (44.349)     (48.813)  (20.189)   (21.658)
Max lag of casualties                   114.316***  43.018***    12.467    12.902***  12.580**
                                        (25.328)    (9.183)      (8.136)   (4.799)    (5.315)
Year 2014 dummy_t                                   -964.557***
                                                    (35.855)
2014 only sample            No          No          No           Yes       Yes        Yes
M/D/W dummies               No          No          Yes          No        Yes        Yes
Daily quadratic time trend  No          No          No           No        No         Yes
Sample size                 16790       16652       16652        8395      8395       8395

Notes: See notes to Table 1. * p<0.10, ** p<0.05, *** p<0.01.

Table A3. The Effect of Casualties on the Logarithm of the Absolute Number of Comments from the Same Oblast. Log-Log Specification. Oblast-Level Fixed Effects Regressions.
Region                        Whole     Whole     Whole     Whole     Whole     Whole
                              Ukraine   Ukraine   Ukraine   Ukraine   Ukraine   Ukraine
                              (1)       (2)       (3)       (4)       (5)       (6)
log Casualties_it             1.588***  0.800***  0.213*    0.090***  0.051*    0.054*
                              (0.307)   (0.156)   (0.108)   (0.020)   (0.026)   (0.028)
Sum of lagged log casualties            5.283***  1.271*    0.387***  0.251**   0.149
                                        (1.003)   (0.666)   (0.071)   (0.112)   (0.109)
Max lag of log casualties               0.783***  0.195*    0.066***  0.057***  0.036
                                        (0.151)   (0.103)   (0.015)   (0.020)   (0.021)
Year 2014 dummy_t                                 0.553***
                                                  (0.081)
2014 only sample              No        No        No        Yes       Yes       Yes
M/D/W dummies                 No        No        Yes       No        Yes       Yes
Daily quadratic time trend    No        No        No        No        No        Yes
R-square                      0.031     0.102     0.619     0.005     0.231     0.285
Sample size                   16790     16652     16652     8395      8395      8395

Notes: See notes to Table 1. Both dependent and independent variables are in logs. To deal with the zero values of the number of comments and casualties, we added 1000 to the number of comments, and added 10 to the number of casualties, before taking logs. * p<0.10, ** p<0.05, *** p<0.01.

Appendix B. Weekly Data Analysis

Table B1. The Effect of Casualties on the Absolute Number of Comments from the Same Oblast. Weekly Oblast-Level Fixed Effects Regressions.

Region                      Whole       Whole        Whole     Whole     Whole
                            Ukraine     Ukraine      Ukraine   Ukraine   Ukraine
                            (1)         (2)          (3)       (4)       (5)
Casualties_it               745.045***  274.633      69.596**  63.555*   57.967*
                            (258.912)   (164.483)    (27.227)  (31.479)  (30.733)
Year 2014 dummy_t                       6663.461***
                                        (1649.563)
2014 only sample            No          No           Yes       Yes       Yes
Month dummies               No          Yes          No        Yes       Yes
Daily quadratic time trend  No          No           No        No        Yes
R-square                    0.062       0.365        0.003     0.165     0.191
Sample size                 2415        2415         1196      1196      1196

Notes: DV: Absolute number of comments from an oblast suffering casualties. Weekly oblast-level panel data spanning 2013 and 2014 are used for estimation. All specifications include a constant term. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table B2. The Effect of Casualties on the Number of Comments from the Same Oblast. Regional Analysis.
Weekly Oblast-Level Fixed Effects Regressions.

Region          West&Center   South&East
                (1)           (2)
Casualties_it   30.831        101.463*
                (30.324)      (51.918)
R-square        0.109         0.505
Sample size     832           364

Notes: DV: Absolute number of comments from an oblast suffering casualties. Weekly oblast-level panel data for 2014 only. West, Center, South, and East are the regions of Ukraine, defined according to Figure 1. All specifications include a constant term; quadratic time trend; and month dummies. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table B3. The Effect of Casualties from outside the Oblast on the Number of Comments from a Given Oblast. Weekly Oblast-Level Fixed Effects Regressions.

Region               Whole      Whole     Whole        West&       South&
                     Ukraine    Ukraine   Ukraine      Center      East
                     (1)        (2)       (3)          (4)         (5)
OtherCasualties_it   56.655***  4.189     3.762*       5.084**     0.733
                     (14.976)   (2.919)   (1.858)      (1.795)     (4.739)
Year 2014 dummy_t                         6873.703***  5490.030**  1.0e+04***
                                          (1768.482)   (2336.605)  (2047.761)
2014 only sample     No         Yes       No           No          No
Month dummy          No         No        Yes          Yes         Yes
R-square             0.079      0.002     0.358        0.244       0.681
Sample size          2415       1196      2415         1680        735

Notes: DV: Absolute number of comments from a given oblast. OtherCasualties_it is defined as Σ_i Casualties_it – Casualties_it. Weekly oblast-level panel data for 2013 and 2014. All specifications include a constant term. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table B4. The Effect of Casualties on the Engagement Index of a Given Oblast. Weekly Oblast-Level Fixed Effects Regressions.
Region                Whole     Whole     West      Center    South     East
                      Ukraine   Ukraine
                      (1)       (2)       (3)       (4)       (5)       (6)
Casualties_it         -0.970*   -0.777    -0.086    -0.105    0.356     0.024
                      (0.537)   (0.471)   (0.049)   (0.066)   (0.386)   (0.054)
2014 only sample      No        No        Yes       Yes       Yes       Yes
Month dummies         No        Yes       Yes       Yes       Yes       Yes
Quadratic time trend  No        Yes       Yes       Yes       Yes       Yes
R-square              0.013     0.027     0.476     0.292     0.285     0.533
Sample size           2415      2415      520       312       260       104

Notes: Weekly oblast-level panel data for 2013 and 2014. West, Center, South, and East are the regions of Ukraine, defined according to Figure 1. All specifications include a constant term. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.

Table B5. The Relationship Between Casualties and the Number of Positive/Negative Words Used in Comments. Weekly Oblast-Level Fixed Effects Regressions.

                Negative Words                                     Positive Words
Region          Ukraine   West      Center    South      East      Ukraine   West      Center    South      East
                (1)       (2)       (3)       (4)        (5)       (6)       (7)       (8)       (9)        (10)
Casualties_t    16.345    -3.721    19.768    108.554    39.394    38.835*   0.449     77.928    159.192    61.182***
                (14.057)  (4.559)   (20.088)  (123.056)  (32.874)  (22.559)  (5.299)   (59.922)  (169.949)  (23.046)
R-square        0.223     0.236     0.159     0.513      0.637     0.212     0.241     0.147     0.499      0.618
Sample size     1196      520       312       260        104       1196      520       312       260        104

Notes: Weekly oblast-level panel data for 2014 only. West, Center, South, and East are the regions of Ukraine, defined according to Figure 1. All specifications include a constant term, quadratic time trend, and month dummies. Robust oblast-clustered standard errors are in parentheses. * p<0.10, ** p<0.05, *** p<0.01.
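The dyadic tie-strength measure described in the Figure 2 caption (each province's share of its messages going to each discussion group, summed as products over shared groups for every province pair) can be sketched as follows. This is a minimal reading of that description, not the authors' code; all function names and the toy province/group labels are hypothetical.

```python
from collections import defaultdict

def tie_strength(messages):
    """messages: dict mapping (province, group) -> message count.
    Returns dict mapping frozenset({a, b}) -> dyadic tie strength,
    computed as sum over groups of the product of the two provinces'
    message shares in that group."""
    totals = defaultdict(int)                     # total messages per province
    for (prov, _), n in messages.items():
        totals[prov] += n
    # share of each province's own messages that went to each group
    share = {(p, g): n / totals[p] for (p, g), n in messages.items()}
    provinces = sorted(totals)
    groups = {g for (_, g) in messages}
    ties = {}
    for i, a in enumerate(provinces):
        for b in provinces[i + 1:]:
            ties[frozenset({a, b})] = sum(
                share.get((a, g), 0.0) * share.get((b, g), 0.0) for g in groups
            )
    return ties

# toy example with two provinces and two discussion groups
msgs = {("Lviv", "g1"): 30, ("Lviv", "g2"): 70,
        ("Kharkiv", "g1"): 50, ("Kharkiv", "g2"): 50}
print(tie_strength(msgs))   # tie = 0.3*0.5 + 0.7*0.5 = 0.5
```

Under this reading the measure is symmetric in the dyad and bounded by 1 when both provinces concentrate all messages in the same group.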
Omega

A subjective evidence model for influence maximization in social networks

Mohammadreza Samadi a, Alexander Nikolaev a,*, Rakesh Nagi b
a Department of Industrial and Systems Engineering, University at Buffalo (SUNY), Buffalo, NY 14260, United States
b Department of Industrial and Enterprise Systems Engineering, The University of Illinois at Urbana-Champaign, IL 61801, United States

Article history: Received 6 October 2014; accepted 30 June 2015.

Abstract. This paper introduces the notion of subjective evidence, which fuels a new parallel cascade influence propagation model. The model sheds light on the phenomena of belief reinforcement and the viral spread of innovations, rumors, opinions, etc., in social networks. Network actors are assumed to be testing a Bayesian hypothesis, e.g., for making judgments about the superiority of some product(s) or service(s) over others, or the (dis)utility of a given program/policy. The model-based influence maximization solutions inform strategies for market niche selection and protection, and for the identification of susceptible groups in political campaigning. The NP-hard problem of influential seed selection is first solved as a mixed-integer program. Second, an efficient Lagrangian Relaxation heuristic with guaranteed bounds is presented.
In small, medium, and large-scale computational investigations, we analyze: (1) how the success of an influence cascade triggered in a (sub)community, long exposed to an opposite belief, depends on the structural properties of the underlying social network; (2) to what extent growing (increasing the density of) a consumer network within a market niche helps a company protect the niche; (3) given a competitor's strength, when a company should counter the competitor on "their turf", and when and how it should look for limited-time opportunities to maximally profit before eventually surrendering the market. © 2015 Elsevier Ltd. All rights reserved.

Keywords: Influence maximization; Social networks; Bayesian inference; Evidence; Seed selection

☆ This manuscript was processed by Associate Editor Prokopyev.
* Corresponding author. E-mail addresses: [email protected] (M. Samadi), [email protected] (A. Nikolaev), [email protected] (R. Nagi).

1. Introduction and motivation

People tend to view product recommendations received from friends or through friends more favorably than advertisements offered by commercial mass media channels [15,41]. Social connections enable the propagation of ideas, judgments, and opinions; the phenomenon where knowledge transfer between individuals significantly affects their decisions about purchasing a product is known as social influence/contagion [71,63,14]. Social influence and the diffusion of innovations in social networks have mainly been explored in managerial and sociological studies [75,1,2]. However, the need for simulating information diffusion/peer influence in social networks and solving optimization problems to algorithmically find potent cascade initiation strategies led to the introduction of the Influence Maximization (IM) problem. The objective of the IM problem is to find such early starters, termed seeds, for influence spread in a social network that will direct information transfer so as to achieve a desired
impact on the expected product adoption, or people's decisions/judgments/opinions with respect to a query of interest [45,16]. Early mathematical formulations of the IM problem in social networks view social ties as indicators of dyadic dependence, where a random graph or Markov random field-based approach is a natural choice for model design [24,59]. More recent literature on the algorithmic analysis of influence spread has been dominated by diffusion-based models [45], in which ties are viewed as information flow channels. The Independent Cascade (IC) and Linear Threshold (LT) models are the most notable ones, both allowing for elegant discrete optimization problem statements; these models also provided the basis for a streak of subsequent studies [46,36,72,23]. Application-wise, diffusion models have been found suitable for research studies in marketing [5,15] and health care [60]. However, algorithmic investigations to date have failed to culminate in significant managerial insights and strategies. This is in part due to the fact that existing models do not specify the medium and nature of influence flow through a network, i.e., they fail to explain the diffusion of what leads to social influence, and how it does so. This paper takes a previously unexplored approach to modeling the spread of competitive influence in social networks, rooted in Bayesian inference theory and focused on the propagation of evidence.

Please cite this article as: Samadi M, et al. A subjective evidence model for influence maximization in social networks. Omega (2015), http://dx.doi.org/10.1016/j.omega.2015.06.014

Bayesian inference logic helps quantify social influence under the premise that people treat new information as evidence and update their beliefs in support of or against the null hypothesis.
In this approach, network nodes represent intelligent agents (actors) who seek to form judgments about a product/query by testing a relevant hypothesis (e.g., that a particular claim is true), based on their prior beliefs as well as the knowledge acquired through friends. A node's decision to significantly favor the null hypothesis signals the node's "positive activation"; significantly favoring the alternative implies "negative activation"; finally, whenever the collected evidence is inconclusive, the node is labeled "inactive". This paper presents a Parallel Cascade (PC) diffusion framework for modeling evidence spread through social networks. The flow of information in this PC model is classified as parallel duplication in the typology of flow processes on social networks introduced by Borgatti [11], which supports the idea of belief reinforcement through subjective evidence duplication in social communication. The paper reports insightful observations, e.g., pertinent to the identification of penetrable market niches and convenient points of initial influence for conquering new market segments, obtained from solving basic instances of the PC model-based IM (PCIM) problem. The paper develops problem-specific optimization schemes for handling medium and large-scale instances of the PCIM problem, formulated as a mixed-integer program. The rest of this paper is organized as follows. Section 2 reviews the literature on diffusion models for IM. Section 3 formally introduces the PC diffusion model, formulates the PCIM problem, and discusses its application to two empirical case studies. Offering a more computationally efficient approach to the problem, Section 4 presents a Lagrangian Relaxation heuristic tool suite, with solution quality guarantees achieved via two problem-specific heuristics for finding lower bounds for the PCIM problem optima. Section 5 reports on the conducted experimental studies.
Section 6 summarizes the findings, discusses the potential applications of the proposed methods, and outlines future research directions. The paper contains two appendices: Appendix A presents the NP-hardness proof for the PCIM problem; Appendix B details the Subgradient Search algorithm for finding an upper bound for the PCIM problem optima.

2. The landscape of the social influence research domain

The concept of word-of-mouth received attention as early as the 1940s as an effective way to diffuse information (e.g., about new products) and soon became a coined term in experimental marketing research [53,76,44]. Models of information diffusion over networks, also first introduced in the marketing field, were developed more recently and found use in health care [55], sociology [50,69], and politics [20]. From an experimental point of view, the phenomenon of social contagion is known to be a significant factor affecting the strength of diffusion processes in social networks [52,1,58,3,62,4]. The investigations into the impact of influential people, or opinion leaders, on cascade formation comprise a large part of the literature. Opinion leaders are defined as the individuals that have the ability to strongly affect the opinions or decisions of their network peers [79]. While some studies downplay the value/power of opinion leaders for social cascade progression [7,74], most authors see opinion leader presence as a critical facilitator for cascade emergence [49,68,30,42,70]. Hinz et al. [41] experimentally showed that a wisely selected group of opinion leaders can increase the influence spread rate in a cascade up to eight times. Yet, two questions remain unanswered: How can one select the appropriate opinion leaders for maximizing the spread of influence in a social network? And how does this selection depend on the social network structure?
While the literature reviewed above is more concerned with exploring the mechanisms of successful cascade propagation, it does not provide a readily available method/solution/strategy (for a company or a political party) to artificially create a cascade in support of a product or opinion by recruiting the "best" opinion leaders. The latter objective, however, may be highly sought-after by research-aware practitioners. The first organized efforts to identify influential nodes in social networks relied on centrality-based heuristics [12]. The degree centrality heuristic assumes that any node with a large number of direct connections (called a hub) must be highly influential in a social network. The distance centrality heuristic, on the other hand, considers a node influential if it has short paths to other nodes in the network (called a bridge) [73,41]. The centrality-based heuristics, however, provide no quality guarantee for the solution of IM problems with multiple required seeds. To formulate an algorithmic approach to finding influential node sets, the term "influence maximization" was coined by Domingos and Richardson [24]. While the first attempts to address the IM problem employed a Markov random field approach [24,59], Kempe et al. [45] were the first to re-frame it as a discrete optimization problem. The Independent Cascade (IC) model and the Linear Threshold (LT) model, proposed by Kempe et al. [45], are the most well-known diffusion models for IM; the optimization problems based on these models are NP-hard [72,17]. Kempe et al. [45] discovered a submodularity property of the IM objective function and presented a greedy seed selection algorithm with guaranteed, albeit loose, optimality bounds.
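The greedy scheme referenced above can be illustrated in a few lines: repeatedly add the node with the largest estimated marginal gain in expected spread, where the spread is estimated by Monte Carlo simulation of the Independent Cascade model. This is a minimal sketch of the general technique, not the implementation of Kempe et al.; the uniform propagation probability `p`, the trial count, and the function names are our assumptions.

```python
import random

def ic_spread(graph, seeds, p=0.1, trials=200, rng=random):
    """Estimate the expected number of activated nodes under the
    Independent Cascade model; graph: dict node -> list of out-neighbors."""
    total = 0
    for _ in range(trials):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in graph.get(u, []):
                    # each newly active node gets one chance to activate v
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / trials

def greedy_seeds(graph, k, p=0.1, trials=200, seed=0):
    """Greedy seed selection: k times, add the node whose inclusion
    yields the largest estimated marginal gain in expected spread."""
    rng = random.Random(seed)
    seeds = []
    for _ in range(k):
        best, best_gain = None, -1.0
        base = ic_spread(graph, seeds, p, trials, rng)
        for v in graph:
            if v in seeds:
                continue
            gain = ic_spread(graph, seeds + [v], p, trials, rng) - base
            if gain > best_gain:
                best, best_gain = v, gain
        seeds.append(best)
    return seeds

# on a star graph, the hub is the natural first seed
g = {0: [1, 2, 3], 1: [], 2: [], 3: []}
print(greedy_seeds(g, 1))   # [0]
```

The submodularity of the expected-spread objective is exactly what gives this greedy procedure its (1 − 1/e)-type guarantee in the cited work.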
The problem with assuming submodularity lies in the fact that under it, the marginal gain of adding new seeds must be decreasing, which does not support the idea of fixed threshold effects and reveals a manifest shortcoming in the respective diffusion models [37,50]. Furthermore, the original greedy algorithm and even its extensions were found overly demanding computationally [47,36,16,15,72]. A separate branch of literature has explored social influence from the empirical data mining perspective. As discovered by Aral et al. [2], a financially viable cascade initiation requires the selection (buy-in) of no more than 0.2% of the nodes in a network. This finding underlines the value of precise seed selection algorithms that can ensure a desirable cost/returns ratio in cascade seeding. However, one observes a gap between the literature based on data-driven studies and algorithmic research. The latter efforts, unfortunately, have often focused on computational investigations in impractical settings and failed to produce managerial insights. The present paper serves as a bridge between these two research thrusts. It designs a realistic diffusion model, strongly supported by mathematical sociology findings, and solves the seed selection problem optimally over real social network datasets, and hence paves the way to rigorously explore strategic decision-making in social networks. Note that the original IM problem formulation was concerned with maximizing the expected number of activated nodes at the end of the diffusion process, when the activation status of all the nodes becomes fixed, irrespective of the sequence and timing of node activation. However, in many practical IM applications, an influence campaign has a predefined time window over which it has to achieve the maximum possible effect. Only recently, Goyal et al.
[35] addressed the issue of the unconstrained time horizon for IM and introduced the MINTIME problem, where the objective is to minimize the time until the activation of a predefined number of nodes. Note also that most IC and LT model-based seed selection problems have ignored the aspect of activation timing; furthermore, they have assumed no competition. Meanwhile, the existing literature confirms the co-existence of (competing) opinions in real-world decision-making settings [8,9]. Chen et al. [13] were the first to recognize this issue by introducing an IC model that allows negative influence to impede the spread of positive influence. In summary, diffusion-based IM problems have received attention from the research community for finding influential nodes in social networks and have been confirmed to be useful for treating real-world problems. However, the current diffusion models for IM problems have not been able to produce many managerial insights, in part due to the underspecification of influence flow mechanisms. The present work proposes a mathematical framework for finding exact solutions to seed selection problems, which allows one to more rigorously explore the structure of such optimal solutions.

3. Bayesian inference logic in influence maximization

This section formalizes the Parallel Cascade (PC) model for the diffusion of subjective evidence through social networks, provides the mathematical model for solving the problem of identifying the best positive seeds, and lastly, illustrates the use of the presented PC model with two case studies. The PC model views a node's adoption of a product or opinion, called activation, as a decision-making process based on continuously collected evidence.
It has been established that human decision-making can be modeled as Bayesian inference with high precision whenever the alternatives are given as mutually exclusive hypotheses [66]. These human subject experiments confirm that the collection rate and the impact of evidence on the human decision-making process can be studied empirically, and hence, the models presented in this paper can be specified based on real-world data. Naturally, the current paper posits that people use their initial preferences and beliefs, as well as the incoming information they receive from peers, to decide whether a given null hypothesis (H0) or the opposite alternative hypothesis (H1) is more likely to be true. It is assumed that, based on the evidence accumulated and processed through Bayesian updates, a decision-maker may turn from an observer into a supporter of the hypothesis that convincingly appears more likely to them at a particular point in time; the described transition is defined by thresholds (on the evidence scale), as done in a great deal of sociology literature [50,69,80]. Bayesian inference logic uses Bayes' rule to update beliefs in such hypothesis testing (i.e., to update the probability that a particular hypothesis is true) as new evidence is received and processed; here, evidence is an objective quantity that evaluates the new information regarding a hypothesis, e.g., as a result of observing a new fact. However, in reality, beliefs are not necessarily updated based on such facts. When the source of incoming information is not given (not traceable or forgotten), people still treat the information (supposedly new to them) as evidence, which we term subjective, and update their beliefs [18,33]. Fig. 1 demonstrates the effect of subjective evidence spread on updating beliefs in social networks. The PC model views positive activation as the event when an actor begins to significantly favor one hypothesis over the other.
Once a network node becomes active, it begins to deliver messages in support of its favored hypothesis to its connected peers. The evidence accumulation can be mathematically expressed by using the "Odds" function (O), defined as the probability that "H0 is true" divided by the probability that "H0 is false". Taking the logarithm of the Odds leads to an additive evidence function. The evidence function for H0 is given as

  e(H_0 \mid Rd) = 10 \log_{10} O(H_0 \mid Rd) = e(H_0 \mid R) + 10 \log_{10} \frac{P(d \mid H_0 R)}{P(d \mid H_1 R)},   (1)

where R is the prior knowledge of the null hypothesis (before the evidence diffusion begins) and d is one signal (a piece of new information) that supports the null hypothesis [43]. Thus, the evidence function combines the prior evidence and the observed evidence. With no prior information (data) available, equal probabilities are typically assigned to the null and alternative hypotheses.

Fig. 1. Belief reinforcement through subjective evidence spread in a social network. Consider a triplet of actors who are testing the same hypothesis, e.g., that a new phone service is reliable. Suppose node 1 observes a fact supporting the hypothesis, e.g., using the new phone service for a month and experiencing few dropped calls, and presents their impression to nodes 2 and 3. Both nodes 2 and 3 update their beliefs about the hypothesis, and then node 2 shares the absorbed information with node 3 without providing the source of the information. Node 3 captures the information (in fact, the rumor) from node 2, treats it as if it provides new evidence supporting the hypothesis, and updates its belief again. This process shows how a person's belief about a hypothesis can be reinforced multiple times as a result of a single external test/fact. In social networks, edges serve as channels that permit evidence duplication, and hence can enable (unfounded) belief reinforcement.
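A small numerical sketch of the decibel-scale evidence function in Eq. (1), together with its additive accumulation over a sequence of signals; the likelihood values here are made up purely for illustration, and the function names are ours.

```python
import math

def evidence_db(p_d_given_h0, p_d_given_h1):
    """Evidence (in decibels) contributed by one signal d, per Eq. (1):
    10 * log10 of the likelihood ratio P(d|H0)/P(d|H1)."""
    return 10 * math.log10(p_d_given_h0 / p_d_given_h1)

def accumulate(prior_db, signals):
    """Add up the decibel contributions of a sequence of signals
    (pairs of likelihoods under H0 and H1), starting from the
    prior evidence e(H0|R); log-odds make the update additive."""
    return prior_db + sum(evidence_db(p0, p1) for p0, p1 in signals)

# no prior information: equal prior probabilities => 0 dB of prior evidence
prior = 10 * math.log10(0.5 / 0.5)                 # 0.0
# three hypothetical signals, each twice as likely under H0 as under H1
e = accumulate(prior, [(0.8, 0.4)] * 3)
print(round(e, 3))   # 3 * 10*log10(2) ≈ 9.031 dB in favor of H0
```

Because the evidence is additive on the log-odds scale, identical positive and negative signals simply contribute fixed increments, which is what makes the constant per-signal increments of the PC model possible.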
When a sequence of multiple signals (data) D is received and processed, the updated evidence is given as

  e(H_0 \mid RD) = 10 \log_{10} O(H_0 \mid RD) = e(H_0 \mid R) + 10 \sum_i \log_{10} \frac{P(d_i \mid H_0 R)}{P(d_i \mid H_1 R)}.   (2)

The increment of the positive evidence (e^+) resulting from a single observation supporting the null hypothesis (d), and the increment of the negative evidence (e^-) resulting from a single observation supporting the alternative hypothesis (d'), are respectively given by

  e^+ = e(H_0 \mid d) = 10 \left[ \log_{10} P(d \mid H_0 R) - \log_{10} P(d \mid H_1 R) \right],   (3)

  e^- = e(H_1 \mid d') = 10 \left[ \log_{10} P(d' \mid H_1 R) - \log_{10} P(d' \mid H_0 R) \right].   (4)

Therefore, upon collecting and processing multiple observations D with n^+ positive and n^- negative signals, the evidence supporting H0 becomes

  e(H_0 \mid DR) = e(H_0 \mid R) + e^+ n^+ - e^- n^-.   (5)

The evidence increment values (e^+ and e^-) are used as parameters in the PC model.

3.1. The parallel cascade diffusion model

Define an influence graph as a directed graph G = (N, A), with a set of nodes N and a set of arcs A. Let the sets of positive and negative seeds, i.e., the initial sets of evidence propagators, be denoted by S^+ and S^-, respectively. Note that the notion of "positivity" of evidence is arbitrary: the hypothesis that postulates a claim preferred by the grand policy-maker will hereafter be viewed as positive, hence the distinction between positive and negative evidence. For each node i ∈ N, let θ_i^+ ≥ 0 (θ_i^- ≥ 0) denote a positive (negative) threshold for the evidence that a node must accumulate, in support of (against) the null hypothesis, to become positively (negatively) activated. In a given problem, θ^+ and θ^- can be set using Bayesian logic: these values should reflect the desired levels of assurance for a node not to make a mistake (about the product/query) when it gets positively or negatively activated [43].
Omega (2015), http://dx.doi.org/10.1016/j.omega.2015.06.014

At each discrete time period $t = 0, 1, 2, \ldots, T$, let $L_{it} \ge 0$ and $K_{it} \ge 0$ denote the cumulative levels of positive and negative evidence for node $i$, respectively. Node $i$ is said to be positively (negatively) activated at period $t$ if and only if $L_{it} - K_{it} \ge \theta_i^+$ ($K_{it} - L_{it} \ge \theta_i^-$); otherwise, it maintains the inactive status. Let $L_{i0} = \theta_i^+$ and $K_{i0} = 0$ denote the initial evidence levels of node $i \in S^+$ at time $t = 0$. Equivalently, $L_{j0} = 0$ and $K_{j0} = \theta_j^-$ denote the initial levels of evidence for node $j \in S^-$ at time period $t = 0$. Each node is assumed to accumulate evidence incoming from its activated neighbors, regardless of its own activation status. For each node $i \in N$, the nodes that have arcs toward (coming from) $i$ are termed in-neighbors (out-neighbors) of $i$; $N_{in}(i)$ and $N_{out}(i)$ are the sets of in-neighbors and out-neighbors of $i$, respectively. At time period $t = 0, 1, \ldots, T$, let $N_t^+(i) \subseteq N_{in}(i)$ ($N_t^-(i) \subseteq N_{in}(i)$) denote the set of positively (negatively) activated in-neighbors of $i$. A node $p \in N_t^+(i)$ sends positive feedback (positive evidence) toward $i$, and a node $n \in N_t^-(i)$ provides negative feedback (negative evidence) for $i$. The numerical values of the positive and negative evidence provided by node $i \in N$ at time $t > 0$ to its out-neighbors are denoted by $E_{it}^+ \ge 0$ and $E_{it}^- \ge 0$, respectively. If node $i$ is positively activated at time $t$, $E_{it}^-$ is zero; if it is negatively activated at time $t$, $E_{it}^+$ is zero; finally, when $i$ is inactive at time $t$, both $E_{it}^-$ and $E_{it}^+$ are zero. The evidence value provided by a node to its out-neighbors in the time period immediately following positive (negative) activation is given by $e^+$ ($e^-$). Note that $e^-$ is defined as the absolute value of the negative evidence calculated using Bayesian logic (i.e., $e^- > 0$).
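For concreteness, the deciban bookkeeping of Eqs. (1)-(5) can be sketched in a few lines of Python. This is an illustrative sketch: the function names and the example likelihoods are ours, not the paper's.

```python
import math

def evidence_increment(p_h0: float, p_h1: float) -> float:
    """Evidence increment, in decibans, contributed by one signal that has
    likelihood p_h0 under H0 and p_h1 under H1, as in Eqs. (3)-(4)."""
    return 10.0 * (math.log10(p_h0) - math.log10(p_h1))

def cumulative_evidence(e_prior, e_pos, n_pos, e_neg, n_neg):
    """Total evidence for H0 after n_pos positive and n_neg negative
    signals, as in Eq. (5)."""
    return e_prior + e_pos * n_pos - e_neg * n_neg

# A signal four times as likely under H0 as under H1 adds about +6 decibans:
e_pos = evidence_increment(0.8, 0.2)

# With a flat prior (e(H0|R) = 0), three positive signals of strength 6 and
# one negative signal of strength 2 leave 16 decibans in favor of H0:
total = cumulative_evidence(0.0, 6.0, 3, 2.0, 1)
```

The additivity in log-odds space is what makes repeated rumors dangerous: each re-told copy of the same underlying fact enters the sum as if it were fresh evidence.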
At the end of each time period, each node updates its cumulative evidence levels (positive and negative) by adding the newly received evidence to the current levels and, possibly, updates its activation status (to be used in the next time period). Once an activated node loses its activation (becomes inactive), its ability to propagate evidence is immediately revoked. Note that node activation does not have to be followed by an action (e.g., a product purchase): the specific application of the model dictates the appropriate assumption in this regard [9]. In order to realistically capture the effects of information transfer and evidence accumulation in social networks, two decay factors are incorporated into the presented evidence propagation model, one pertaining to evidence provision and the other to evidence collection and processing. The value of positive (negative) evidence provided by activated nodes decreases by the factor $\alpha^+$ ($\alpha^-$) as time passes from the last positive (negative) activation. As a result, the effect of the transferred information in updating the evidence levels of out-neighbors diminishes over time. Furthermore, a "forgetfulness" rate $\beta^+$ ($\beta^-$) is introduced into the PC model to allow nodes to forget (discount as old) a part of the positive (negative) evidence they previously collected. The forgetfulness rate, which has been well studied in the marketing literature [51,10], causes recently observed evidence to make a greater contribution to the decision-making process. Also, with time, the nodes become indifferent to the query, as often occurs in practice. Fig. 2 illustrates the dynamics of PC model-driven evidence propagation over a small network. Sets $S^+$ and $S^-$ include the influence graph nodes that are positively and negatively activated, respectively, at $t = 0$; the set $S^-$ is given, while the nodes in $S^+$ are to be selected by the decision-maker solving the IM problem.
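The period-by-period dynamics just described can be sketched as one synchronous update. This is an illustrative Python sketch under our own naming; it mirrors the accumulation, threshold, decay and forgetfulness rules described above, not the paper's exact implementation.

```python
def pc_step(G_in, L, K, E_pos, E_neg, th_pos, th_neg,
            e_pos, e_neg, a_pos, a_neg, b_pos, b_neg):
    """One synchronous period of the PC diffusion model (sketch).
    G_in[i] lists the in-neighbors of node i; L/K hold cumulative
    positive/negative evidence; E_pos/E_neg hold the evidence each node
    currently offers its out-neighbors."""
    n = len(G_in)
    # 1. Accumulate incoming evidence, discounting the old evidence by the
    #    forgetfulness rates beta+/beta-.
    newL = [b_pos * L[i] + sum(E_pos[j] for j in G_in[i]) for i in range(n)]
    newK = [b_neg * K[i] + sum(E_neg[j] for j in G_in[i]) for i in range(n)]
    # 2. Activation status follows from the net evidence and the thresholds.
    pos = [newL[i] - newK[i] >= th_pos[i] for i in range(n)]
    neg = [newK[i] - newL[i] >= th_neg[i] for i in range(n)]
    # 3. Evidence offered in the next period: full e+/e- right after
    #    activation, then decayed by alpha+/alpha-; zero when inactive.
    newEp = [0.0] * n
    newEn = [0.0] * n
    for i in range(n):
        if pos[i]:
            newEp[i] = a_pos * E_pos[i] if E_pos[i] > 0 else e_pos
        elif neg[i]:
            newEn[i] = a_neg * E_neg[i] if E_neg[i] > 0 else e_neg
    return newL, newK, newEp, newEn, pos, neg
```

Iterating `pc_step` for $t = 1, \ldots, T$ from the seed initialization yields the activation histories that the IM objective below evaluates.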
The diffusion process is terminated after a pre-set (practically relevant) number of time periods $T$. Following the traditional setup, the PC model-based IM (PCIM) problem is concerned with populating $S^{+*}$ so as to maximize some measure of the evidence spread in the network. The measure taken in this paper accounts for both the earliness and the sustainment of node activation: PCIM amounts to maximizing the count of time periods with positive activation ($\Gamma_G(S^+, S^-)$) while minimizing the count of time periods with negative activation ($\Delta_G(S^+, S^-)$) over all the nodes,
\[
S^{+*} \in \arg\max_{S^+ \subseteq N,\; S^+ \cap S^- = \emptyset} \left[ \Gamma_G(S^+, S^-) - \Delta_G(S^+, S^-) \right].
\]
This makes the model applicable to marketing, political and military problems where the timing and duration of activation matter. For example, when activation stands for subscription to a service, each node generates profit in each time period that it is positively activated. As a result, the total duration of a node's positive activation determines its contribution to the objective function. Note that a positively activated node still observes both positive and negative evidence: as such, a positively activated node can become negatively activated after receiving enough negative evidence, and vice versa. Note also that in the absence of negative seeds, when communication can only reinforce the nodes' beliefs, the PC model with the decay factors set to $\alpha^+ = 1$ and $\beta^+ = 0$ reduces to a special case equivalent to the original LT model introduced by Kempe et al. [45], with fixed threshold values. By accommodating conflicting evidence, and thanks to its objective function, the PCIM problem can inform decisions even in situations where the decision-maker stands to eventually lose its market position(s).
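Given the activation histories produced by any run of the diffusion model, the objective value $\Gamma_G - \Delta_G$ reduces to simple counting. A minimal sketch (the variable names are ours):

```python
def pcim_objective(pos_hist, neg_hist):
    """PCIM objective for one diffusion run: node-periods with positive
    activation (Gamma_G) minus node-periods with negative activation
    (Delta_G). pos_hist[t][i] / neg_hist[t][i] are booleans giving the
    status of node i at period t."""
    gamma = sum(sum(period) for period in pos_hist)
    delta = sum(sum(period) for period in neg_hist)
    return gamma - delta

# Two periods, two nodes: node 0 is positive in both periods, node 1 is
# negative once and positive once, so the objective is 3 - 1 = 2.
value = pcim_objective([[True, False], [True, True]],
                       [[False, True], [False, False]])
```

Counting node-periods, rather than only end-of-horizon states, is what rewards early and sustained activation.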
Via the threshold values and forgetfulness rates, the PC model also easily accommodates the asymmetry of positive and negative influence effects in social networks, i.e., the phenomenon known as the "negativity bias", which, e.g., reflects the fact that only a few negative product feedback comments can turn a potential buyer away [54,6,65].

3.2. Optimization model specification and solution methodology

In this section, a mixed-integer program is constructed for finding exact optimal solutions to the PCIM problem. It is first noted that the PCIM problem is NP-hard.

Theorem 1. The PCIM problem is NP-hard.

Proof. By a polynomial Turing reduction from the Maximum Coverage Problem (see Appendix A).

The PCIM problem is now formally stated, with the notation summarized in Table 1. As stated earlier in the paper, at every time period, each node is either positively activated, negatively activated or inactive. At the end of each time period, every node collects all the incoming evidence and updates its cumulative evidence level to determine its activation status for the next time period. The mixed-integer programming model (P) for the PCIM problem is given as
\[
(P) \quad \max Z = \sum_{i=1}^{|N|} \sum_{t=0}^{T} (X_{it} - Y_{it}) \qquad (6)
\]
subject to:
\[
Y_{it} \ge \frac{(K_{it} - L_{it}) - \theta_i^-}{M}, \quad i = 1, \ldots, |N|, \; t = 0, 1, \ldots, T, \qquad (7)
\]
\[
1 - X_{it} \ge \frac{\theta_i^+ - (L_{it} - K_{it})}{M}, \quad i = 1, \ldots, |N|, \; t = 0, 1, \ldots, T, \qquad (8)
\]
\[
X_{it} + Y_{it} \le 1, \quad i = 1, \ldots, |N|, \; t = 0, 1, \ldots, T, \qquad (9)
\]
\[
L_{it} = \beta^+ L_{i,t-1} + \sum_{(j,i) \in A} E^+_{j,t-1}, \quad i = 1, \ldots, |N|, \; t = 1, \ldots, T, \qquad (10)
\]
\[
K_{it} = \beta^- K_{i,t-1} + \sum_{(j,i) \in A} E^-_{j,t-1}, \quad i = 1, \ldots, |N|, \; t = 1, \ldots, T, \qquad (11)
\]
\[
L_{i0} = X_{i0} \, \theta_i^+, \quad i = 1, \ldots, |N|, \qquad (12)
\]
\[
K_{i0} = Y_{i0} \, (\theta_i^- + \epsilon), \quad i = 1, \ldots, |N|, \qquad (13)
\]
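The big-M linearization of the activation rule in constraints (7)-(9) can be sanity-checked by enumeration: for fixed evidence levels, the feasible $(X, Y)$ pairs combined with the maximizing objective reproduce the threshold rule. An illustrative sketch with made-up numbers:

```python
def best_activation(L, K, th_pos, th_neg, M):
    """Enumerate (X, Y) in {0,1}^2 satisfying constraints (7)-(9) for given
    evidence levels L, K, and return the pair maximizing X - Y, as the
    objective in (6) would."""
    feasible = [
        (X, Y)
        for X in (0, 1)
        for Y in (0, 1)
        if Y >= ((K - L) - th_neg) / M        # constraint (7)
        and 1 - X >= (th_pos - (L - K)) / M   # constraint (8)
        and X + Y <= 1                        # constraint (9)
    ]
    return max(feasible, key=lambda p: p[0] - p[1])

# Net evidence above the positive threshold -> positively activated:
assert best_activation(5.0, 0.0, 2.0, 2.0, 100.0) == (1, 0)
# Net evidence below the negative threshold -> negatively activated:
assert best_activation(0.0, 5.0, 2.0, 2.0, 100.0) == (0, 1)
# Net evidence strictly between the thresholds -> inactive:
assert best_activation(1.0, 0.0, 2.0, 2.0, 100.0) == (0, 0)
```

Note that (8) only forbids $X_{it} = 1$ when the net evidence is below $\theta_i^+$; it is the objective's preference for $X_{it} - Y_{it}$ that pushes eligible nodes to the activated corner.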
Fig. 2. The spread of positive and negative evidence through a network with $|N| = 15$, $T = 5$, $e^+ = 7.1$, $e^- = 2.5$, $\alpha^+ = \alpha^- = 1.0$, $\beta^+ = \beta^- = 1.0$, $\theta^+ = 2$ and $\theta^- = 2$. The net activation value ($L_{it} - K_{it}$) is shown beside each node. Each panel reports the activation status of every node at a single time period $t = 0, 1, \ldots, 5$; nodes are marked as inactive, positively activated, or negatively activated.

Table 1: Definition of indices, input parameters and decision variables in the mathematical problem.

Indices:
  i, j — node indices
  t — time period index

Inputs:
  G(N, A) — the influence graph: a directed graph with a set of nodes N and a set of arcs A
  |N| — total number of nodes in the network
  T — total number of time periods in the time horizon considered in the problem
  |S+| — total number of positive seeds
  θi+ — positive threshold of node i
  θi- — negative threshold of node i
  e+ — maximum value of positive evidence a node can send in a single time period
  e- — maximum value of negative evidence a node can send in a single time period
  α+ — discount rate for the value of positive evidence sent by a positively activated node
  α- — discount rate for the value of negative evidence sent by a negatively activated node
  β+ — rate at which each node forgets previously received positive evidence
  β- — rate at which each node forgets previously received negative evidence
  Si — 1 if node i is a negative seed; 0 otherwise

Decision variables:
  Xit — 1 if node i is positively activated at time t; 0 otherwise
  Yit — 1 if node i is negatively activated at time t; 0 otherwise
  Lit — cumulative level of positive evidence of node i at time t
  Kit — cumulative level of negative evidence of node i at time t
  Eit+ — value of positive evidence that node i provides to its out-neighbors at time t
  Eit- — value of negative evidence that node i provides to its out-neighbors at time t

\[
E^+_{it} \le \alpha^+ E^+_{i,t-1} + (1 - X_{i,t-1}) e^+, \quad i = 1, \ldots, |N|, \; t = 1, \ldots, T, \qquad (14)
\]
\[
E^+_{it} \le e^+ X_{it}, \quad i = 1, \ldots, |N|, \; t = 1, \ldots, T, \qquad (15)
\]
\[
E^-_{it} \ge \alpha^- E^-_{i,t-1} + (Y_{it} - Y_{i,t-1}) e^-, \quad i = 1, \ldots, |N|, \; t = 1, \ldots, T, \qquad (16)
\]
\[
E^-_{it} \le e^- Y_{it}, \quad i = 1, \ldots, |N|, \; t = 1, \ldots, T, \qquad (17)
\]
\[
Y_{i0} = S_i, \quad i = 1, \ldots, |N|, \qquad (18)
\]
\[
E^+_{i0} = X_{i0} e^+, \quad i = 1, \ldots, |N|, \qquad (19)
\]
\[
E^-_{i0} = Y_{i0} e^-, \quad i = 1, \ldots, |N|, \qquad (20)
\]
\[
\sum_{i=1}^{|N|} X_{i0} \le |S^+|, \qquad (21)
\]
\[
0 \le L_{it}, K_{it}, E^+_{it}, E^-_{it}, \quad i = 1, \ldots, |N|, \; t = 0, 1, \ldots, T, \qquad (22)
\]
\[
Y_{it}, X_{it} \in \{0, 1\}, \quad i = 1, \ldots, |N|, \; t = 0, 1, \ldots, T. \qquad (23)
\]

The objective function in (6) takes into account the timing of node activation through the counts of positively and negatively activated nodes in each time period. Note that removing the timing of activation from the objective function in (6) would yield the problem of maximizing the number of positively activated nodes and minimizing the number of negatively activated nodes at the end of the diffusion process, i.e., in period $T$, which can be solved as a special case of (P).

Constraints (7) and (8) ensure that each node gets positively activated when its net evidence level (the difference between its cumulative positive and cumulative negative evidence) is greater than or equal to the positive threshold ($\theta^+$), and gets negatively activated when the cumulative negative evidence exceeds the cumulative positive evidence by at least the negative threshold ($\theta^-$). In constraints (7) and (8), $M$ is a large positive number greater than or equal to $[\max_{i \in N}(\theta_i^+ + \theta_i^-) + \epsilon] + (|N| - 1)(T + 1) e^-$. Constraint (9) guarantees that, at each time period, each node is either positively activated, negatively activated, or inactive. Constraints (10) and (11) ensure the correct updates of the cumulative evidence levels of each node at each time period. The diffusion process starts with the cumulative levels of positive and negative evidence set to zero for all nodes except the seeds. Constraints (12) and (13) ensure that the cumulative level of positive evidence of each positive seed reaches the positive threshold ($\theta^+$), and that the cumulative level of negative evidence of each negative seed exceeds the negative threshold ($\theta^-$). This is required so that the seeds do not lose their ability to propagate influence immediately after the initial time period. Since the objective function favors reducing the number of negative activations (it would deactivate a negatively activated node whose negative evidence level exactly equals its negative threshold), a very small positive parameter $\epsilon$, as small as 0.0001, is needed to force the model to keep the negative seeds negatively activated at the end of the initial time period. Such a parameter is not needed in constraint (12), because the objective function favors keeping the positive seeds positively activated when the level of positive evidence and the positive threshold are equal. Note that assigning a large value to $\epsilon$ and adding it to both (12) and (13) would increase the time over which the positive and negative seeds sustain their respective activations.

Constraints (14) and (15) set the value of the positive evidence that any node can propagate in a single time period $t \ge 0$ ($E^+_{it}$); they guarantee that: (a) $E^+_{it}$ is zero when node $i$ is not positively activated at time $t$ ($X_{it} = 0$); (b) $E^+_{it}$ equals $e^+$ when $i$ has just become positively activated at time $t$ ($X_{it} - X_{i,t-1} = 1$); and (c) $E^+_{it}$ equals $\alpha^+ E^+_{i,t-1}$ otherwise. Constraints (16) and (17) analogously set the value of the negative evidence that any node can propagate in a single time period $t \ge 0$ ($E^-_{it}$). Constraint (18) ensures the initial activation of the negative seeds. Constraints (19) and (20) set the initial values of the evidence propagated by each node in the network. Constraint (21) ensures that the total number of positive seeds at time $t = 0$ does not exceed the pre-defined number of seeds in the problem. The nonnegativity and binary restrictions on the decision variables are given in (22) and (23).

To the best of our knowledge, this paper presents the first mixed-integer program for solving IM problems. The experimental results with (P) are reported in Section 5.

3.3. Case studies with the PC model

The most meaningful and valuable IM modeling efforts reported in the literature allow for the characterization of the properties of optimal solutions, derived from the analyses of distinct small- and medium-sized problem instances; such findings can then be extrapolated to more general, large problem instances. This section presents two examples, using data from real-world networks, that illustrate the power of the PC model in explaining the consequences of seed selection decisions when positive and negative influences clash in social networks with specific structure.
The PC model reveals how and why the optimal strategy for positive influence spread depends on the selected seeds' positions, on the length of the decision-maker's window of opportunity, on the network structure, on the locations of the opponent's seeds, and on the specifications of the evidence accumulation mechanism.

Case study 1: This example studies the flow of information over Zachary's karate club network, a well-known network in the social network analysis literature. The dataset contains 34 members of a karate club who were observed for two years; the friendship links were extracted based on the interactions among members outside of club-related activities [81]. During the data collection, a disagreement grew between the club's administrator and its instructor, which eventually led to the club's break-up into two clubs. Fig. 3 shows Zachary's karate club social network, named Network 1, where node 1 denotes the instructor, the central node in the first cluster (C1), and node 34 denotes the administrator, the central node of the second cluster (C2). The clusters depict the eventual student memberships in the two separated clubs [31]. In order to define a PCIM problem on Network 1, consider it as a new market for a vitamin supplement product. Through personal connections, the students can share information with each other and observe each other using the product; consequently, they can process such observations as evidence in support of the hypothesis that the new product is good. Firm F1, the producer of a particular model (variation) of the new product, plans to offer the product at a discounted price to two people in the network, as seeds, to encourage other people to adopt the product, which they were reluctant to adopt otherwise.
Meanwhile, a competing producer F2, which produces an alternative model of the product, has also identified Network 1 as a niche and has already incentivized node 34, the administrator, who has the highest degree in Network 1, to serve as its seed. It is assumed that each person, when exposed to both F1's and F2's products, tests the null hypothesis that "F1's model is better than F2's" against the alternative hypothesis that "F2's model is better than F1's". The acceptance (rejection) of the null hypothesis by any node corresponds to the adoption of F1's (F2's) model, while staying undecided signals that the node has not yet adopted the product. The set position of the negative seed serves as a constraint in the problem that F1 formulates, with the objective of locating its own seeds more efficiently. In competing against each other, each company (F1 and F2) not only tries to maximize its own profit, but also tries to minimize the competitor's profit. Without any further assumptions, if F2 has a significantly stronger brand image than F1, the best intuitive strategy for F1 is to locate its seeds far away from the negative seed, so as to influence a group of people and reap some profit in the limited time window before everybody adopts F2's product. A more challenging problem arises when F1's brand image is as strong as F2's. In this situation, F1 can assign both seeds to cluster C1 to influence all the people in this cluster, while a more reasonable strategy may be to assign one seed to cluster C2, in the neighborhood of the negative seed, to cancel it out, and assign the other seed to cluster C1. Turning the described intuition into exact PCIM solutions, however, is not trivial. To this end, one can use program (P); see the results in Table 2.
In the solved PCIM instances, all the nodes process evidence in the same manner (i.e., the community is homogeneous), and the evidence threshold values are set to $\theta_i^+ = 2$, $\theta_i^- = 2$ for every node $i \in N$. The positive seeds are gradually strengthened over several problem instances by increasing the value of the positive evidence increment ($e^+$), which leads to changing optimal seed locations. The time horizon in every problem instance is set to $T = 5$ (note that, since the network is small, with a diameter of five links, any two positive nodes can potentially reach the whole network within five time periods). In Table 2, the first column shows the experiment index, the second and third columns show the evidence increment values, reflective of the relative quality levels of F1's ($e^+$) and F2's ($e^-$) products, and the last column reports the optimal seed set for each instance. The analysis of the optimal seed sets showcases the transition in the optimal seed allocations as the problem parameters are varied. When the positive evidence increment value is too small, the optimal positive seeds find themselves in the locations most distant from the negative seed. As the positive evidence strength grows in the subsequent instances, the optimal positive seed locations first gradually move toward the negative seed and then spread out evenly over the network. These results are well in line with the intuition.

Fig. 3. The IM problem on Network 1 with $|N| = 34$, $T = 5$, $\alpha^+ = \alpha^- = 0.8$, $\beta^+ = \beta^- = 1.0$, $\theta^+ = 2$ and $\theta^- = 2$. Square nodes represent members of the first cluster (C1) and circle nodes represent members of the second cluster (C2). Node 34 is the club administrator, who serves as the negative seed.
In order to study the effect of different time horizon settings on the optimal solution for F1, the experiments are repeated with various time horizons; the results are reported in Table 3. Tracking the changes in the optimal positive seed locations with varied $T$ reveals that the decision-maker should become more conservative as the time horizon of the problem increases. In order to further study the patterns in the optimal solution formation with growing $T$, assume that the decision-maker (F1) earns (loses) one dollar per positive (negative) activation per time period. Then, the PCIM objective can be interpreted as the amount of money that the decision-maker earns by the end of the marketing campaign. Taking any action other than the optimal one leads to a regret, relative to the objective value that would be obtained under the optimal seed selection. For instance, a decision-maker that relies on centrality-based heuristics will always select nodes (1, 33) as the positive seeds, as they have the highest degree and betweenness centrality values (except for node 34, which cannot be selected), irrespective of the evidence values and $T$. Table 3 shows the regret of the heuristic solution; the regret increases with $T$, which in part explains why the decision-maker becomes more conservative as $T$ increases. Note that the regret values should be standardized to allow for proper comparison across problem instances with different time horizons. As the maximum amount of money that F1 can theoretically make in each instance is $N(T + 1)$, termed the maximum theoretical revenue (MTR), the heuristic regret of each problem is divided by the MTR, and the standardized regrets are plotted in Fig. 4. When the positive seeds are weak, the negative evidence conquers the whole network, and vice versa. The peak in Fig. 4 corresponds to the case where the group of positive seeds and the negative seed are almost equally strong; this is when calculated seed selection can have a big impact. The calculations of the area under the standardized regret curve in Fig. 4 reveal that the regret of heuristic-based seed selection increases with $T$. Overall, these results emphasize the importance of optimal seed selection in (a) problems with a large time horizon, and (b) problems where neither positive nor negative evidence is overly dominant. Note that when the positive evidence increment ($e^+$) becomes much larger than the negative evidence increment ($e^-$), such that any strategy eventually leads to full positive activation of the network, the optimal strategy is indifferent to both the location of the negative seed and the time horizon, and places the positive seeds so as to minimize the time to reach all the nodes. Interestingly, this observation brings up the idea of minimizing the maximum (or average) distance of the nodes to the positive seed(s) as a heuristic method for locating positive seeds in social networks when the positive evidence strongly dominates the negative evidence. This finding connects the problem of locating positive seed(s) for maximizing the spread of evidence in a noncompetitive social network to the p-center and p-median problems in the facility location literature, which locate facilities so as to minimize the maximum or average distance from the facilities to the points of demand [38,64].

Table 2: Optimal positive seeds competing with the negative seed in node 34 for T = 5 (Network 1).

Exp. index | e+  | e-  | Opt. seeds | Remarks
1          | 0.5 | 3.5 | (6, 7)     | Both seeds are as far away from the neg. seed as possible
2          | 0.8 | 3.5 | (5, 6)     | Both seeds are far away from the neg. seed
3          | 1.1 | 3.5 | (1, 2)     | Both seeds inch closer to the neg. seed, still in C1
4          | 1.7 | 3.5 | (1, 33)    | One seed stays in C1 and the other one moves to counter the neg. seed in C2
5          | 2.1 | 3.5 | (3, 33)    | One seed moves to the bridge of C1 and C2 and the other one is still in C2
6          | 2.6 | 3.5 | (32, 33)   | Both seeds move to C2 to block the neg. seed in its own cluster
7          | 3.5 | 3.5 | (3, 33)    | One seed stays close to the neg. seed and the other one begins to move away
8          | 4.4 | 3.5 | (1, 33)    | The seeds spread out over the network, without regard to the neg. seed
9          | 6   | 3.5 | (1, 33)    | The seeds spread out over the network, without regard to the neg. seed

Table 3: The effect of the time horizon on the optimal position of the positive seeds (Network 1). For each T, the columns report the optimal seeds and the heuristic regret.

Exp. index | e+  | e-  | T=2 seeds | reg. | T=4 seeds | reg. | T=7 seeds | reg. | T=9 seeds | reg. | T=15 seeds | reg.
1          | 0.5 | 3.5 | (7, 14)   | 3    | (7, 11)   | 6    | (6, 7)    | 6    | (6, 7)    | 6    | (6, 7)     | 7
2          | 0.8 | 3.5 | (1, 2)    | 5    | (5, 6)    | 10   | (5, 6)    | 13   | (7, 11)   | 12   | (7, 11)    | 13
3          | 1.1 | 3.5 | (1, 2)    | 10   | (1, 2)    | 35   | (1, 2)    | 48   | (1, 2)    | 48   | (5, 6)     | 55
4          | 1.7 | 3.5 | (1, 33)   | 0    | (1, 33)   | 0    | (1, 33)   | 0    | (1, 3)    | 2    | (1, 3)     | 9
5          | 2.1 | 3.5 | (1, 33)   | 0    | (3, 33)   | 9    | (3, 33)   | 24   | (3, 33)   | 21   | (3, 33)    | 104
6          | 2.6 | 3.5 | (1, 33)   | 0    | (32, 33)  | 31   | (32, 33)  | 116  | (32, 33)  | 184  | (32, 33)   | 393
7          | 3.5 | 3.5 | (1, 33)   | 0    | (3, 33)   | 20   | (3, 33)   | 29   | (3, 33)   | 31   | (3, 33)    | 31
8          | 4.4 | 3.5 | (1, 33)   | 0    | (1, 33)   | 0    | (1, 33)   | 0    | (1, 33)   | 0    | (1, 33)    | 0
9          | 6   | 3.5 | (1, 33)   | 0    | (1, 33)   | 0    | (1, 33)   | 0    | (1, 33)   | 0    | (1, 33)    | 0

Fig. 4. The standardized regret value of the centrality-based heuristics in Table 3, plotted against the positive evidence value for T = 2, 4, 5, 7, 9 and 15.

Case study 2: This example studies the stability of judgments in a network in the presence of external influence.
To this end, a new concept reflecting the consequences of subjective evidence reinforcement is introduced, and its utility is illustrated in an application to the Florentine families' marriage network [56]. Define a network cluster's "defendability" as the number of its nodes that withstand the pressure of an external judgment, i.e., do not change their opinions/decisions (e.g., related to product purchasing, political party support, etc.). This case study showcases how the defendability of a cluster depends on its interconnectedness and on the timing of an external "attack". The Florentine families' marriage network, named Network 2, contains 16 elite families of Florence, with links representing the inter-family marriages in the period 1394-1434. Padgett and Ansell [56] illustrated how the Medici family took power by creating strategic marriage links in this network. It is of interest to explore how a growing number of within-cluster marriage links would help the Medici remain in power if a new family were to emerge from the outside and attempt to impose its own influence on the cluster (see Fig. 5(b)-(d)). Without loss of generality, the Medici family is taken as the positive seed in Fig. 5: it is assumed to begin a political campaign at time period $t = 0$. After $d$ time periods, a new family, Bruno, taken as the negative seed, creates a marriage link to the Lambertes family, a peripheral node in the original network, hoping to initiate an oppositional campaign. The negative influence is assumed to be stronger than the positive ($e^+ = 1$ and $e^- = 3.5$), ensuring that the negative influencer has the potential to penetrate the cluster. Intuitively, one expects a network cluster to reinforce a particular view the longer it is exposed to it. Also, the number of within-cluster connections should accelerate the information exchange and thereby make the cluster more defendable.
With both the positive and negative seed locations given, the PC diffusion model is employed to evaluate the spread of evidence through the network over time (up to $T = 50$), until the optimal solution no longer changes with growing $T$. In order to gauge the impact of the cluster density on defendability, two marriage links are first added to the original network: Acciaioul to Pazzi, and Albizzi to Bischeri (Fig. 5(b)); then, two additional links are added: Ridolfi to Albizzi, and Strozzi to Guadagni (Fig. 5(c)); and finally, two more links: Medici to Guadagni, and Ridolfi to Bischeri (Fig. 5(d)). Table 4 summarizes the results for the four clusters (2(a)-2(d), shown in Fig. 5(a)-(d)): the first row gives the cluster labels, and the second row reports the clusters' densities. The first column of Table 4 reports the delay (in the number of time periods) after which the cluster gets exposed to the negative influence. For each cluster in Table 4, the first (second) column reports the total number of nodes (families) that adopt the positive (negative) political opinion by the end of the diffusion process. The results reported in Table 4 quantify how cluster defendability increases with growing density, confirming the claim of Easley and Kleinberg [25] about the association between pluralistic ignorance and the number of direct contacts in the network. The clusters are also observed to become more defendable after a certain delay period, termed the critical delay threshold, which depends on the evidence strengths, the cluster connectivity, and the proximity of the point of attack to the positive seed in the cluster. In a real-world scenario, the Bruno family would be unlikely to be able to marry into any family in the core of Network 2, while a link to Pucci, an isolated node, would hardly be useful. The results of the diffusion process with the same settings as in Table 4, but with Bruno targeting the low-degree families Acciaioul and Pazzi, are reported in Table 5.
The results showcase the fact that attacking a network through a point far from the positive seed provides a better opportunity for the external evidence to succeed. In order to see how the distance of the point of attack from the positive seed affects the success of the external influence in spreading through the cluster, the same experiments are repeated with a link added to the network to connect the point of attack (Lambertes) to the positive seed (Medici); see Table 6. The comparison of Tables 4 and 6 reveals that decreasing the distance between the positive seed and the point of attack hurts the prospects of the negative seed. More generally, reinforcing a network with more links makes it more defendable against an opposing influence.

Fig. 5. The IM problem on Network 2 with $|N| = 16$, $T = 50$, $\alpha^+ = \alpha^- = 0.7$, $\beta^+ = \beta^- = 1.0$, $\theta^+ = 2$ and $\theta^- = 2$: (a) the initial setup; (b) the red edges are added to the network; (c) the green edges are added to the network; (d) the blue edges are added to the network. Adding edges to the cluster (increasing its density) increases its defendability and makes it more difficult for the Bruno family (the negative seed) to penetrate the network cluster.

Table 4: The counts of positively (+) and negatively (-) activated nodes in Network 2 at time T = 50.

Delay | 2(a) (0.167) + / - | 2(b) (0.183) + / - | 2(c) (0.2) + / - | 2(d) (0.217) + / -
0     | -  / 15 | -  / 15 | -  / 15 | -  / 15
1     | -  / 15 | -  / 15 | -  / 15 | 14 / 1
2     | -  / 15 | -  / 15 | -  / 15 | 14 / 1
3     | -  / 15 | 3  / 9  | 14 / 1  | 14 / 1
4     | -  / 15 | 8  / 4  | 14 / 1  | 14 / 1
5     | 1  / 11 | 9  / 4  | 14 / 1  | 14 / 1
6     | 6  / 9  | 9  / 3  | 14 / 1  | 14 / 1
7     | 13 / 1  | 13 / 1  | 14 / 1  | 14 / 1

Table 5: Attacking low-degree families.

Delay | Attacking Pazzi (0.167) + / - | Attacking Acciaioul (0.167) + / -
0     | 13 / 2 | -  / 15
1     | 13 / 2 | 14 / 1
2     | 13 / 2 | 14 / 1
3     | 13 / 2 | 14 / 1
4     | 13 / 2 | 14 / 1
5     | 13 / 2 | 14 / 1
6     | 13 / 2 | 14 / 1
7     | 13 / 2 | 14 / 1

Table 6: Attacking the Lambertes family.

Delay | Attacking Lambertes (0.175) + / -
0     | -  / 15
1     | -  / 15
2     | 13 / 1
3     | 13 / 1
4     | 13 / 1
5     | 13 / 1
6     | 13 / 1
7     | 13 / 1

Case study 2 highlights the fact that investments into influencing a well-connected community must be carefully calculated. Both the community structure and the intervention timing are important to such ventures. Note that in a marketing problem of occupying and protecting a market niche, the delay considered in this section can be viewed as that of introducing a competing product. In this context, the PC diffusion model can help evaluate long-term marketing strategies, i.e., assess the trade-off between an earlier yet more expensive and a delayed but less expensive product introduction.

In summary, Section 3.3 showcases the value of the PC diffusion model for expressing the spread of evidence in practical IM problems. Furthermore, the provided case studies exemplify how sensitive the optimal solutions of IM problems may be to the numerical values of the problem parameters. Notably, the present section connects the seed positioning problem in social networks to the facility location problem, a well-studied problem in the Operations Research literature, which opens a door to applying
location theory models for IM problems in social networks. Section 4.2.1 explains how the methods from the location theory literature can inform new IM heuristics.

Table 7. Computational results for small- and medium-sized PCIM problem instances.

Dataset MPE (|N| = 35, T = 10):
  |S+| = 4 (BP): opt. sol. (2, 10, 12, 20)
  |S+| = 5: opt. sol. (2, 10, 12, 18, 20); includes BP solution (100%)
  |S+| = 6: opt. sol. (2, 10, 12, 14, 20, 31); includes BP solution (100%)
  |S+| = 7: opt. sol. (2, 10, 12, 14, 20, 29, 31); includes BP solution (100%)
Dataset FB1 (|N| = 40, T = 30):
  |S+| = 4 (BP): opt. sol. (6, 8, 17, 19)
  |S+| = 5: opt. sol. (6, 8, 17, 19, 25); includes BP solution (100%)
  |S+| = 6: opt. sol. (6, 8, 17, 19, 25, 34); includes BP solution (100%)
  |S+| = 7: opt. sol. (5, 6, 8, 17, 19, 25, 34); includes BP solution (100%)
Dataset FB2 (|N| = 50, T = 25):
  |S+| = 4 (BP): opt. sol. (10, 19, 25, 43)
  |S+| = 5: opt. sol. (10, 16, 19, 25, 43); includes BP solution (100%)
  |S+| = 6: opt. sol. (10, 16, 19, 25, 43, 48); includes BP solution (100%)
  |S+| = 7: opt. sol. (10, 16, 19, 25, 43, 47, 48); includes BP solution (100%)

4. A set of Lagrangian heuristic tools for PCIM

As mentioned in Section 3 (and proved in Appendix A), the PCIM problem is NP-hard. The sources of complexity of the PCIM problem include the number of nodes in the Influence Graph, the total number of time periods for spreading the evidence, and the number of positive seeds. An efficient approach is therefore required for solving PCIM problem instances with large Influence Graphs. This section presents a guaranteed-performance heuristic method for PCIM based on Lagrangian Relaxation. It works by relaxing a preselected subset of constraints in (P) and adding weighted penalty terms for violating the relaxed constraints to the objective function. Lagrangian Relaxation has been applied to optimization problems in various areas, including supply chain network design [57], scheduling [21], network planning [61] and data clustering [22]. A Lagrangian Relaxation heuristic for PCIM is designed in Section 4.1: it identifies good feasible solutions in reasonable time, while returning a tight upper bound for the optima.
To achieve the latter, a Subgradient algorithm is presented in Appendix B as a method for finding the lowest upper bound for PCIM. Finally, two heuristic methods are presented in Section 4.2 for finding near-optimal feasible PCIM solutions and obtaining tight lower bounds for the optima.

4.1. Lagrangian relaxation for finding an upper bound for PCIM solutions

By definition, incorporating the removed PCIM constraints into its objective function results in a valid "relaxation" of the original formulation [39,28]. Relaxations that keep X_it (i = 1, 2, ..., |N|; t = 0, 1, ..., T) binary and keep the negative seeds fixed are computationally convenient, because adding (dropping) positive seeds to (from) their optimal solutions can produce valid feasible solutions for (P). With this idea in mind, constraint set (21) is relaxed with dual multiplier u, and the Lagrangian Relaxation problem (LR_u) for (P) is given by

(LR_u)  max Z^{LR_u}(u) = Σ_{i=1}^{|N|} Σ_{t=0}^{T} (X_it − Y_it) + u (|S+| − Σ_{i=1}^{|N|} X_i0),   (24)

subject to:

u ≥ 0,   (25)
(7)–(20), (22)–(23).

Each feasible solution of (P) is feasible for the corresponding (LR_u), since (LR_u) is at most as constrained as (P). The non-negativity condition (25) on the Lagrangian multiplier u is required for (LR_u) to be a valid relaxation of (P). Although the relaxed problem removes the constraint on the exact number of positive seeds in (P), the number of positive seeds in (LR_u) is still limited by constraint sets (9) and (18), which do not allow a node to be a negative seed and a positive seed at the same time. Since (LR_u) is a relaxation of (P), the optimal objective value of (LR_u) provides an upper bound for (P). In an effort to obtain tight upper bounds for PCIM, a Lagrangian dual problem (LD_u) is formulated,

(LD_u)  Z^{LD_u} = min_{u' ≥ 0} Z*^{LR_u}(u'),   (26)

where Z*^{LR_u}(u') is the optimal objective value of (LR_u) for a given u'. The Lagrangian dual problem (LD_u) can be solved iteratively to find the dual multiplier that minimizes the optimal value of (LR_u) and thereby obtain the best (lowest) upper bound for (P). To make the iterative search procedure for solving (LD_u) more efficient, a loose relaxation of (LR_u) is preferable. Tighter relaxations, however, are expected to provide better bounds for (P); thus, a trade-off arises between executing fewer iterations of the search procedure for solving (LD_u) with a tighter relaxation and executing more iterations with a looser one.

In this paper, a Subgradient search procedure, a well-known hill-climbing algorithm [29], is applied to solve the Lagrangian dual problem (see Appendix B). Other methods, including simplex-based methods and multiplier adjustment methods, have been proposed in the literature for solving Lagrangian dual problems, but Subgradient-based procedures, in general, achieve better computational performance [28,29]. Two heuristic methods are proposed next for finding near-optimal feasible solutions for (P); these provide the lower bound used for calculating the heuristic gap and for updating the step size in the Subgradient algorithm.

4.2. Obtaining the lower bounds for optimal PCIM solutions

Each feasible solution for (P) provides a valid lower bound for the optimal solution of (P). The presented PCIM problem always has at least one feasible solution if the total number of positive and negative seeds does not exceed the total number of nodes in the Influence Graph. The simplest method for finding a feasible solution for (P) is to trivially select any |S+| nodes that are not negative seeds as positive seeds. Although such solutions satisfy the stopping criterion in the Subgradient algorithm, the resulting lower bound is not necessarily tight. In this section, two heuristic methods are presented for finding near-optimal feasible solutions.

4.2.1. The iterative seed removal (ISR) algorithm

The PCIM problem has three properties, discovered through experimental studies with the mathematical model (P) over Network 1, Network 2 and real Facebook datasets from the SNAP collection [48], and presented in this section as observations. The ISR algorithm utilizes these properties to efficiently find near-optimal feasible solutions for the PCIM problem.

Observation 1. In the PCIM problem, when the positive seed locations are given at time t = 0, the calculation of the resulting objective function value in (P) takes O(T|N|²) time.

Observation 2. For PCIM problems with a varying number of positive seeds to be selected (|S+|), the solution time is a concave function of |S+|.

Observation 3. The intersection of the sets of optimal seeds for two instances of the PCIM problem with a different number of positive seeds is generally non-empty; moreover, in a vast majority of cases, the optimal seed set for a PCIM problem instance is a subset of the optimal seed set for the PCIM instance with the same parameter specification but more positive seeds.

The analysis of PCIM run-times in Section 5.2 experimentally confirms Observations 1 and 2. As evidence for Observation 3, Table 7 gives the results for three PCIM problems with a growing number of positive seeds. The first problem uses the Mexican Political Elite (MPE) network, which contains the significant friendship, kinship, political and business connections within a powerful political group in Mexico [19]. The next two problems use Facebook data subsets, FB1 and FB2, found in the SNAP collection [48].
In each case, the number of positive seeds in the original PCIM problem, BP (Base Problem), is equal to four, and new PCIM problems are generated by iteratively incrementing the number of positive seeds by one; each is then solved to see whether the optimal solution of a problem with more positive seeds includes the seeds from the optimal solution of BP. The results confirm that the optimal solutions of all the problems with more than four positive seeds contain the solution of BP. Note that Observation 3 experimentally supports the utility of the Greedy algorithm of Kempe et al. [45] and the facility location Stingy algorithm (also known as the Greedy-Drop algorithm) of Feldman et al. [27] for solving PCIM problems.

The ISR algorithm employs the PCIM problem properties captured in the three presented observations to efficiently obtain good, tight solutions to practical problem instances. Consider an instance of the PCIM problem with |S+| positive seeds to be identified (henceforth referred to as the original problem). The ISR algorithm increases the number of positive seeds and defines a Dummy problem with |S_d+| > |S+| positive seeds. The Dummy problem is exactly the same as the original PCIM problem in its Influence Graph and input parameters, but it seeks a greater number of positive seeds. An optimal solution for the Dummy problem is necessarily infeasible for the original PCIM; the ISR algorithm works to iteratively find the best combination of seeds to remove from the optimal solution of the Dummy problem and thereby obtain a good feasible solution for the original problem. The number of positive seeds in the Dummy problem is chosen to be large, but not too large, so that the Dummy problem can be solved fast and the seed removal procedure remains efficient. According to Observation 2, it is always possible to find an easy Dummy problem with |S_d+| + |S−| ≤ |N|.
According to Observation 3, an optimal solution for the Dummy problem is expected to include the positive seeds present in the optimal solution for the original problem. Hence, the ISR algorithm effectively executes the greedy algorithm of Kempe et al. [45] backwards. At each iteration of the ISR algorithm, the problem with more positive seeds is called a superior problem, because its objective function value is necessarily greater than or equal to that of the subproblem obtained by removing one seed, which is hence called an inferior problem. To begin, let the first superior problem have |S_d+| positive seeds and define an inferior problem as a maximization problem for finding the best set of |S_d+| − 1 positive seeds. Instead of solving the inferior problem, the ISR algorithm traverses all the distinct combinations of |S_d+| − 1 positive seeds in the solution of the superior problem and selects the combination that maximizes the objective function of the inferior problem. According to Observation 1, computing the objective function value of the inferior problem for each possible combination of |S_d+| − 1 positive seeds in the optimal solution of the superior problem takes O(T|N|²) time. The ISR algorithm keeps removing positive seeds until it obtains a set of |S+| positive seeds, which it reports as a feasible solution for the original PCIM problem and a lower bound for the optimal solution. The total number of inferior problems that the ISR algorithm evaluates to obtain the feasible solution for the original PCIM problem with |S+| positive seeds, using a Dummy problem with |S_d+| positive seeds, is

C(|S_d+|, |S_d+| − 1) + C(|S_d+| − 1, |S_d+| − 2) + ⋯ + C(|S+| + 1, |S+|)
  = |S_d+| + (|S_d+| − 1) + ⋯ + (|S+| + 1)
  = (|S_d+|² + |S_d+| − |S+| − |S+|²) / 2.   (27)

The ISR algorithm elegantly employs the PCIM problem properties.
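The greedy-drop loop and the evaluation count in (27) can be sketched together in a few lines of Python. This is a hypothetical illustration, not the paper's implementation: `evaluate` stands in for the O(T|N|²) PCIM objective evaluation of Observation 1, and a toy additive objective is used for the usage example.

```python
def closed_form(s_d, s):
    # Right-hand side of (27): (|S_d+|^2 + |S_d+| - |S+| - |S+|^2) / 2
    return (s_d * s_d + s_d - s - s * s) // 2

def isr(dummy_solution, target_size, evaluate):
    """Core loop of iterative seed removal: shrink an optimal Dummy-problem
    seed set down to `target_size` seeds, dropping at each round the seed
    whose removal hurts the objective least."""
    seeds, evals = set(dummy_solution), 0
    while len(seeds) > target_size:
        candidates = [seeds - {s} for s in seeds]   # all one-seed deletions
        evals += len(candidates)                    # one evaluation per candidate
        seeds = max(candidates, key=lambda c: evaluate(frozenset(c)))
    return seeds, evals

# Usage with a toy additive objective (each seed contributes its own value):
values = {"a": 5, "b": 1, "c": 3, "d": 2}
obj = lambda s: sum(values[v] for v in s)
best, evals = isr({"a", "b", "c", "d"}, 2, obj)
print(sorted(best))                  # ['a', 'c'] -- the two most valuable seeds
print(evals == closed_form(4, 2))    # True: 4 + 3 = 7 evaluations, matching (27)
```

The counting works because choosing |S_d+| − 1 seeds out of |S_d+| can be done in exactly |S_d+| ways, so the binomial sum in (27) collapses to an arithmetic series.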
However, its main drawback is its independence from the Subgradient algorithm: the upper bound for the PCIM problem, found by the Subgradient algorithm, does not feed into the ISR algorithm. Furthermore, the ISR algorithm may not work well if all the available Dummy problems are hard to solve.

Algorithm 1. The ISR Algorithm for PCIM.

Initialize |S+| and |S_d+| in a Dummy problem, with |S_d+| > |S+|;
Initialize bestSolutionValue with −M and currentSolutionValue with 0; /* M is a large positive number */
Solve the Dummy problem with |S_d+| positive seeds;
Store the solution in S*_sup;
for t ← 0 to (|S_d+| − |S+| − 1) do
  for i ← 1 to (|S_d+| − t) do
    createNewSolution(i); /* removes the ith seed from S*_sup to obtain a new solution for the inferior problem */
    evaluateNewSolution(i); /* evaluates the objective function for the new solution */
    storeCurrentSolution(i); /* stores the objective value of the new solution in currentSolutionValue */
    if currentSolutionValue ≥ bestSolutionValue then
      recordBestSolutionIndex(); /* records i as the index of the best solution for PCIM */
      updateBestSolutionValue(); /* updates the best solution for PCIM */
    end if
  end for
  removeOneSeed(); /* removes the seed with the best solution index from the seed set and updates S*_sup */
  resetBestSolutionValue(); /* re-initializes bestSolutionValue with −M */
end for

4.2.2. The adaptive subgradient-based (ASB) algorithm

The ASB algorithm is designed to utilize the information obtained while executing the Subgradient algorithm to iteratively improve the lower bound for PCIM. The optimal solution of (LR_u) does not necessarily provide a feasible solution for (P), since (LR_u) is not constrained by the number of positive seeds. Let S_L+ be the set of positive seeds in the optimal solution of (LR_u).
At each iteration of the Subgradient algorithm, if |S_L+| ≥ |S+|, the ASB algorithm selects the first |S+| positive seeds (with respect to a fixed random ordering) from S_L+ to obtain a feasible solution for (P) with |S+| positive seeds. On the other hand, when |S_L+| < |S+|, the ASB algorithm selects all the positive seeds in S_L+ and randomly selects |S+| − |S_L+| additional positive seeds from the nodes in the network that are neither in S_L+ nor in S−. In the early iterations of the Subgradient algorithm, the ASB algorithm selects the |S+| positive seeds almost blindly; however, the selection process becomes more precise as the Subgradient algorithm runs further.

Algorithm 2. The ASB Algorithm for PCIM.

Initialize |S+|;
Initialize bestSolutionValue with the solution of the ISR algorithm and currentSolutionValue with 0;
while gapValue ≥ acceptableGap do
  storeLagrangianSolution(); /* stores the solution of (LR_u) to be used for finding the lower bound */
  createFeasibleSolution(); /* creates a feasible solution for (P) using S_L+ */
  evaluateNewSolution(); /* calculates the objective value of PCIM for the stored solution */
  storeCurrentSolution(); /* stores the objective value of the new solution in currentSolutionValue */
  if currentSolutionValue ≥ bestSolutionValue then
    recordBestSolution(); /* updates the best feasible solution for PCIM */
  end if
end while

The ISR and ASB algorithms are very efficient when used together in practice. The ASB algorithm first stores the best feasible solution obtained by the ISR algorithm as the best feasible solution of (P) found so far. At each iteration of the Subgradient algorithm, the ASB algorithm extracts a feasible solution for (P) using Algorithm 2 and quickly computes the corresponding objective value (see Observation 1). The feasible solution for (P) obtained by the ASB algorithm is accepted only if it provides a greater objective function value than the current best feasible solution. To summarize, the presented algorithms form a Lagrangian heuristic toolbox for obtaining near-optimal PCIM problem solutions with rigorously evaluated bounds; the utility of, and relationships between, the algorithms are illustrated in Fig. 6.

Fig. 6. The Lagrangian heuristic toolbox: an overview of the components. [The diagram relates the original problem (P), its Lagrangian Relaxation (LR(u)), the Subgradient search algorithm (which finds the u that minimizes LR(u) and provides the upper bound), and the ISR and ASB algorithms (which provide and improve the lower bound); the upper and lower bounds define the heuristic gap, which guides the stopping criterion.]

5. Computational results

This section presents computational results for PCIM instances on real social networks. Section 5.1 studies the performance of the Lagrangian Relaxation toolbox for PCIM. Section 5.2 focuses on run-times and discusses the sources of complexity in the PCIM problem.

5.1. Lagrangian relaxation performance

In order to analyze the performance of the presented Lagrangian Relaxation heuristic, this section solves PCIM problem instances formulated on four Facebook networks found in the SNAP collection [48]. The size- and structure-dependent statistics of these undirected datasets, indexed F1, F2, F3 and F4, are reported in Table 8. The nodes in these networks are labeled; in order to evaluate the performance of the heuristic method, each experiment takes a sub-network of the main dataset with |N| nodes.

Table 8. Dataset statistics.

Dataset   Nodes   Edges    Directed   Density
F1        150     1693     No         0.151
F2        747     30,025   No         0.108
F3        534     4813     No         0.034
F4        1034    26,749   No         0.05

In this work, the mixed-integer program and the Lagrangian Relaxation heuristic are implemented using Concert Technology in Java with the commercial solver CPLEX 12.5. All the experiments have been performed on a desktop with an Intel(R) Core(TM) i3 3.3 GHz processor, 8 GB RAM and a 64-bit operating system.

Table 9 shows the computational results for small- and medium-sized problems, all solved to optimality using CPLEX. The availability of the optimal solutions for these problems permits calculating both the optimality gap and the heuristic gap. For the small problems, CPLEX outperforms the Lagrangian Relaxation heuristic in terms of solution time. As the problem size increases, the PCIM solution time with CPLEX grows rapidly (see Section 5.2 for the sources of the PCIM problem complexity), while the Lagrangian Relaxation heuristic remains fast. Note that in the majority of the PCIM problem instances reported in Table 9, the ISR and ASB algorithms find the optimal solution (the optimality gap equals zero).

The results of the computational study with large-sized problems are given in Table 10. For these problems, CPLEX runs out of computer memory and fails to return optimal solutions. In such cases, the Lagrangian Relaxation heuristic runs in reasonable computational time and provides an acceptable heuristic gap. For large problems, the optimality gap is unknown, because the optimal solution is unknown, and the heuristic gap remains the only criterion for evaluating the heuristic's performance.

Table 9. Computational results with small- and medium-sized PCIM problem instances.
Dataset   |N|   T     |S+|   LR time (s)   LR LB   LR UB   Cplex time (s)   Cplex sol. (opt.)   Opt. gap (%)   Heu. gap (%)   Iter. #
F1        30    40    6      11.53         1167    1186    0.69             1167                0              1.6            20
F1        40    100   9      72.01         4020    4021    7.89             4020                0              0.02           20
F1        60    50    7      81.32         2849    2931    74.32            2853                0.1            2.7            20
F2        45    64    7      26.59         2795    2870    7.71             2795                0              2.6            20
F2        60    50    9      32.06         2887    2929    45.84            2887                0              1.4            20
F2        85    75    14     49.66         6233    6289    3425.23          6233                0              0.8            30

Table 10. Computational results with large-sized PCIM problem instances.

Dataset   |N|   T    |S+|   LR time (s)   LR LB   LR UB   Cplex time (h)   Cplex gap (%)   Heu. gap (%)   Iter. #
F1        80    50   9      70.91         3839    3923    >4               >195            2.1            60
F1        100   60   10     112.99        5764    5981    >4               >190            2.4            60
F2        120   70   10     259.73        8021    8280    >4               >198            3.1            60
F2        120   70   10     259.73        8021    8280    >4               >198            3.1            60
F3        80    50   11     78.65         3691    3836    >4               >192            3.7            60
F3        120   70   10     259.73        8021    8280    >4               >198            3.1            60

The run-time of the Lagrangian Relaxation heuristic increases smoothly with the dimensions of the PCIM problem instances, which illustrates the principal contribution of the heuristic approach. In order to assess the scalability of the Lagrangian Relaxation heuristic for solving practical PCIM problems, it is executed on large Facebook networks, for which CPLEX cannot even construct a feasible solution in computer memory. It is observed that the Lagrangian Relaxation heuristic still provides acceptable bounds for the optimal PCIM solutions. Table 11 reports the results of a computational study with five large problems, where the only criteria of interest are the heuristic gap and the solution time of the Lagrangian Relaxation heuristic. The results in Table 11 show that the proposed heuristic method delivers encouraging results for large-scale problems, establishing its practical value.

Table 11. Computational results with large-sized PCIM problem instances.

Dataset   |N|    T    |S+|   LR time (s)   LR LB    LR UB    Heu. gap (%)   Iter. #
F2        550    50   30     594.96        26,940   27,494   2.0            60
F2        720    40   84     831.72        24,467   25,070   2.4            60
F3        480    35   30     522.73        13,977   14,208   1.6            60
F4        1034   60   100    1589.34       28,991   29,822   2.8            60

5.2. A sensitivity analysis of the PCIM problem run-time dynamics

Three elements affect the solution time of program (P) for PCIM: the number of nodes in the Influence Graph, the number of time periods for evidence spread, and the number of positive seeds to be selected. In this section, the PCIM input parameters are varied selectively, and three sets of experiments are performed with the F3 dataset; in each case, only one of the three aforementioned factors is changed to see how it affects the solution time. The results in Fig. 7(a) show that the solution time increases rapidly with the growing number of nodes in the Influence Graph; e.g., solving a problem with 90 nodes takes about 50 times longer than solving a problem with 80 nodes. Fig. 7(b) shows the effect of the total number of time periods, T, on the run-time of (P), revealing a linear trend. Problems with a small number of nodes appear to remain tractable even for large T. The results of the third set of experiments (see Fig. 7(c)) show how the solution time of (P) is affected by the number of positive seeds in the PCIM problem. These results confirm the second observation given in Section 4.2.1. As shown in Fig. 7(c), the solution time resembles a concave function of the number of positive seeds, which motivates the ISR Algorithm: the number of positive seeds in a hard PCIM problem instance needs to be only slightly increased to find a Dummy problem with a significantly lower solution time.

6. Discussion

This section discusses the limitations of the presented models and methods, and concludes the paper.

6.1. Study limitations

This paper provides insightful findings and develops a framework for modeling the spread of influence in a social network. However, this study has limitations worth mentioning.
First, while the PC model relies on the theory and findings established in the sociology literature on human decision-making [66], it treats stochasticity only implicitly (through Bayesian updates) and does not emphasize the differences between network actors or the uncertainty in capturing such differences. The deterministic diffusion process makes the PCIM problem mathematically tractable, i.e., it allows one to solve it as a mixed-integer program, design efficient heuristics exploiting known and fixed network characteristics, and make insightful observations (after all, linear programs are often found useful in practice even though real-world problems are rarely truly deterministic). Future work, however, can involve stochastic optimization for PCIM.

Second, data-focused studies are needed to uncover and address potential challenges in specifying the model parameters, i.e., in learning how people really process subjective, as opposed to objective, evidence. Investigations of the latter have been conducted previously [77,34], which gives promise to the expansion of the presented research in this direction, too; such studies, however, lie in the consumer psychology domain. On a positive note, from the modeling and algorithmic perspectives, the PC model can be used with user-defined parameters, and its ability to produce practical insights is confirmed through the reported case studies.

6.2. Concluding remarks

This paper models social influence as a consequence of subjective evidence transfer, and quantitatively derives general insights about cascading behavior and belief reinforcement in social networks. The presented Parallel Cascade (PC) diffusion model defines the rules of exchange and accumulation of subjective evidence, which feeds into node-level hypothesis testing, en route to making decisions, forming judgments, etc. The preference of a null hypothesis over an alternative hypothesis determines a node's participation status with regard to further evidence propagation. The value of evidence collected and accumulated, in support of or against the null hypothesis, is calculated using Bayesian update logic. The optimization problem of finding the set of influential nodes to initiate the evidence spread in support of the null hypothesis under the PC diffusion model (PCIM) is formulated as a mathematical program and solved using CPLEX. The PCIM problem is shown to be NP-hard, and an efficient, guaranteed-performance heuristic tool set exploiting Lagrangian Relaxation is presented. The studies of the spread of evidence in social networks using the PC diffusion model showcase that the ability of the decision-maker to trigger a successful cascade, or to keep a cascade alive, is sensitive to the density of network connections and the presence of opposing opinions in a target cluster.

Fig. 7. Sensitivity analyses of (P) with α+ = α− = 0.7, β+ = β− = 0.9, e+ = 1.2, e− = 1.25: (a) run-time dynamics as a function of the total number of nodes in the network (T = 10, |S+| = 4, |S−| = 3), (b) run-time dynamics as a function of the number of time periods (|N| = 50, |S+| = 4, |S−| = 3), and (c) run-time dynamics as a function of the number of positive seeds (|N| = 50, T = 10, |S−| = 3).

Fig. 8. The heuristic and optimality gaps achieved with the Subgradient algorithm. [The plot shows, per iteration, the upper- and lower-bound sequences closing the heuristic gap around the (possibly unknown) optimal value.]
This paper focuses on node-level IM solutions, utilizing the fine-grained features of the network structure; however, it also opens a door to studying the problem at the network level, e.g., describing the general properties of the seeds' optimal locations based on metrics such as density and clusterization. Based on the presented PC diffusion model, one can potentially develop new centrality metrics for evaluating a network's ability to reinforce or preserve beliefs. Future research can also explore how PCIM instances with extremely large Influence Graphs can be reduced, e.g., via clustering, to become manageable.

The PC model quantifies belief reinforcement through social connections, and informs the changes in optimal seed allocation for creating successful cascades. As noted in Section 3, the model can incorporate actors' actions: based on the collected evidence, the actors may not only be active in spreading their opinions and judgments, but may also choose to buy a product, vote for a party, etc. Such actions will result in the acquisition of first-hand objective evidence by the actors, which can be processed differently from subjective evidence. The addition of actions to the model can lead to more insightful analyses, e.g., of low-quality but actively advertised goods, where customers may get excited about a product only until they buy one. Also, this paper opens up a new area for modeling the defendability of cohesive clusters in social networks against strong external opinions and for the identification of "vulnerable" nodes in network clusters. Furthermore, the paper establishes a connection between PCIM and location theory models. Further efforts will pursue the construction of a theoretical method for solving stochastic PCIM instances.
Moreover, future studies can apply the proposed optimization scheme to model the spread of evidence in growing social networks, and in situations where neither the structure of a social network nor the locations of the opponent's opinion leaders are precisely specified.

Acknowledgements

This work was supported in part by the National Science Foundation Grant ICES-1216082, and a Multidisciplinary University Research Initiative (MURI) Grant W911NF-09-1-0392. This support is gratefully acknowledged.

Appendix A. Proof of Theorem 1

The PCIM problem is shown to be NP-hard by a polynomial Turing reduction from the Maximum Coverage Problem (MCP), also referred to as the max k-cover or set k-cover problem in the literature [26]. The objective of MCP is to select a group of sets, where some sets may share elements, such that the total number of selected sets does not exceed a predefined limit and the total number of covered elements is maximized. MCP is first formally stated, and then the reduction from MCP to PCIM is presented.

Maximum coverage problem.
Instance: A number k > 0 and a collection of sets J = {J_1, J_2, ..., J_m}.
Objective: Find a subset J' ⊆ J such that |J'| ≤ k and the number of covered elements |⋃_{J_i ∈ J'} J_i| is maximized.

Given an arbitrary instance of MCP, define a particular instance of PCIM as follows. Assume the Influence Graph G(N, A) is given and let T = 1, |N| = m, |S+| = k and |S−| = 0. Let e+ > max θ_i+, i = 1, 2, ..., |N|, e− = 0, and, lastly, set α+ = α− = β+ = β− = 1. Set J_i, for i = 1, 2, ..., m, is defined such that j ∈ J_i iff j = i or (i, j) ∈ A, j = 1, 2, ..., |N| (i.e., J_i contains node i and all the nodes in the first hop of node i).
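The set construction in the reduction is mechanical: each node i generates the set J_i = {i} ∪ {j : (i, j) ∈ A}. The short Python sketch below (hypothetical helper names, not from the paper) builds these sets from an edge list and evaluates the MCP coverage objective, illustrating that choosing k positive seeds in the constructed PCIM instance is exactly choosing k sets in MCP.

```python
# Hypothetical sketch: building the MCP sets J_i used in the reduction.
# Each J_i holds node i together with its one-hop neighbors.

def build_cover_sets(nodes, edges):
    """nodes: iterable of node ids; edges: iterable of undirected pairs (i, j)."""
    J = {i: {i} for i in nodes}
    for i, j in edges:
        J[i].add(j)
        J[j].add(i)
    return J

def coverage(J, chosen):
    """MCP objective: the number of elements covered by the chosen sets."""
    return len(set().union(*(J[i] for i in chosen)))

# Usage: a 4-node path 1-2-3-4; with k = 1, picking node 2 covers {1, 2, 3}.
J = build_cover_sets([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)])
print(coverage(J, [2]))  # 3
```

With T = 1 in the constructed PCIM instance, a seed at node i activates exactly the nodes of J_i by time T, so maximizing the PCIM objective amounts to maximizing coverage.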
This transformation can be performed in time polynomial in the size of the arbitrary MCP instance. In order to show that an optimal solution to the PCIM problem maps to an optimal solution of MCP, let X*_i0, i = 1, 2, ..., |N| (X ∈ {0, 1}), be an optimal solution to the PCIM problem. Then Σ_{i=1}^{|N|} X*_i0 ≤ |S+|; X_jT ≤ Σ_{(i,j) ∈ A} X_i0 + X_j0 for j = 1, 2, ..., |N|; Y_it = 0 for i = 1, 2, ..., |N|, t = 0, 1, ..., T; and Σ_{i=1}^{|N|} Σ_{t=0}^{T} (X_it − Y_it) is maximized. The claim is that X*_i0 is an optimal solution for MCP. Note that X*_i0, i = 1, 2, ..., |N|, is a feasible solution for MCP because Σ_{i=1}^{|N|} X*_i0 ≤ |S+| = k. Suppose there exists a solution to MCP, X_i0 for i = 1, 2, ..., |N|, selecting sets J', such that |⋃_{J_i ∈ J'} J_i| > |⋃_{J_i ∈ J'*} J_i|, where J'* is the collection of sets selected by X*_i0. Solution X_i0, i = 1, 2, ..., |N|, is a feasible solution for PCIM: Σ_{i=1}^{|N|} X_i0 ≤ |S+| = k. Therefore, the PCIM objective function for this solution is Σ_{i=1}^{|N|} Σ_{t=0}^{T} (X_it − Y_it) = |⋃_{J_i ∈ J'} J_i| + k > |⋃_{J_i ∈ J'*} J_i| + k = Σ_{i=1}^{|N|} Σ_{t=0}^{T} (X*_it − Y*_it), which is a contradiction. Thus, X*_i0, i = 1, 2, ..., |N|, is an optimal solution for MCP. □

Appendix B. Subgradient search algorithm for the Lagrangian dual problem

The Lagrangian dual problem (LD_u) is presented in Section 4.1 for finding the best (lowest) upper bound for the optimal solution of (P). Since Z^{LD_u}(u) is non-differentiable, its subgradient is employed in the implementation of a search algorithm for finding improved multipliers.

Definition 1. Vector s is a subgradient of Z^{LD_u}(u) at point u_0 if

Z^{LD_u}(u) ≥ Z^{LD_u}(u_0) + s(u − u_0),  ∀u.   (28)

A multiplier u* is optimal for (LD_u) iff zero is a subgradient of Z^{LD_u} at u*. At iteration k of the Subgradient search algorithm, the subgradient can be expressed as

s^k = |S+| − Σ_i X^k_i0,   (29)

where X^k_i0, i = 1, 2, ..., |N|, are the optimizers of (LR_{u^k}).
According to Fisher [29], the iterative subgradient search algorithm for generating the sequence of Lagrangian multipliers u_k, given an initial value u_0, is defined as

u_{k+1} = u_k − l_k s_k,  (30)

where l_k denotes the step size, and Z_LDu(u_k) → Z_LDu(u*) if l_k → 0 with Σ_{i=0}^{k} l_i → ∞ [32]. As l_k approaches zero, it is guaranteed that the subgradient algorithm does not overstep u*; since the sum of the step sizes approaches positive infinity, convergence to u* is theoretically guaranteed. At the end of each iteration of the subgradient algorithm, the step size can be updated using the quality of the solution obtained for (LDu) at that iteration:

l_k = λ_k (Z_LDu(u_k) − Z_0) / ‖s_k‖²,  (31)

where λ_k is a positive scalar and Z_0 is a lower bound on Z_LDu. The appropriate range of values for λ_k can be determined experimentally; the range 0 < λ_k ≤ 2 has been found to work well in practice [29]. The maximum value in the selected range is assigned as the initial value λ_0, and it is halved when Z_LDu fails to decrease for a given number of consecutive iterations of the subgradient algorithm [40]. There is no mathematical proof of optimality for the subgradient algorithm with this step-size rule. As (P) is a maximization problem, the subgradient algorithm stops when the gap between the lower bound, obtained by the ISR and ASB algorithms presented in Section 4.2, and the upper bound, obtained by the subgradient algorithm, becomes less than a preselected threshold value, which guarantees the quality of the solutions of the Lagrangian relaxation heuristic. Alternatively, the subgradient algorithm can be terminated after a predetermined number of iterations or a predefined run-time limit has been reached [67,78]. In this paper, the quality of the gradually improved solutions drives the stopping criterion for the subgradient algorithm, yielding a guaranteed-performance heuristic method for solving (P).
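The loop in Eqs. (29)–(31) can be sketched compactly. In the sketch below, `solve_lr` is a placeholder oracle that, given a multiplier u, returns the relaxation value Z_LR(u) and the seed decisions X_i0 attaining it; the parameter names (`patience`, `tol`) and the equality-style treatment of the relaxed budget constraint are illustrative assumptions, not details taken from the paper.

```python
# A sketch of the subgradient search in Eqs. (29)-(31). `solve_lr` is a
# hypothetical oracle for solving (LR_u); `patience` and `tol` are
# illustrative parameters, not values from the paper.
def subgradient_search(solve_lr, budget, z_lower,
                       u0=0.0, lam0=2.0, patience=3, tol=1e-3, max_iter=200):
    u, lam = u0, lam0
    best_upper = float("inf")   # best (lowest) upper bound found so far
    stall = 0
    for _ in range(max_iter):
        z_lr, x_seed = solve_lr(u)
        s = sum(x_seed) - budget                  # subgradient s_k, Eq. (29)
        if z_lr < best_upper - 1e-12:
            best_upper, stall = z_lr, 0
        else:                                     # halve lambda after stalls
            stall += 1
            if stall >= patience:
                lam, stall = lam / 2.0, 0
        if s == 0 or best_upper - z_lower < tol:
            break                                 # gap-based stopping rule
        step = lam * (z_lr - z_lower) / (s * s)   # step size l_k, Eq. (31)
        u -= step * s                             # multiplier update, Eq. (30)
    return best_upper, u
```

On a two-variable toy relaxation the loop closes the gap in a handful of iterations; in the PCIM setting the oracle would solve the Lagrangian relaxation of (P) instead.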
Fig. 8 shows how the heuristic and optimality gaps in the Lagrangian relaxation method change as the bounds close in around the optimal solution.

References

[1] Angst CM, Agarwal R, Sambamurthy V, Kelley K. Social contagion and information technology diffusion: the adoption of electronic medical records in US hospitals. Management Science 2010;56(8):1219–41.
[2] Aral S, Muchnik L, Sundararajan A. Engineering social contagions: optimal network seeding and incentive strategies. In: Winter conference on business intelligence; 2011.
[3] Aral S, Walker D. Identifying social influence in networks using randomized experiments. IEEE Intelligent Systems 2011;26(5):91–6.
[4] Aral S, Walker D. Tie strength, embeddedness, and social influence: a large-scale networked experiment. Management Science 2014;60(6):1352–70.
[5] Arthur D, Motwani R, Sharma A, Xu Y. Pricing strategies for viral marketing on social networks. In: Internet and network economics. Springer; 2009. p. 101–12.
[6] Baumeister RF, Bratslavsky E, Finkenauer C, Vohs KD. Bad is stronger than good. Review of General Psychology 2001;5(4):323.
[7] Becker MH. Sociometric location and innovativeness: reformulation and extension of the diffusion model. American Sociological Review 1970:267–82.
[8] Berger J, Sorensen AT, Rasmussen SJ. Positive effects of negative publicity: when negative reviews increase sales. Marketing Science 2010;29(5):815–27.
[9] Bhagat S, Goyal A, Lakshmanan LV. Maximizing product adoption in social networks. In: Proceedings of the fifth ACM international conference on Web search and data mining. ACM; 2012. p. 603–12.
[10] Bimpikis K, Ozdaglar A, Yildiz E. Competing over networks; 2015, submitted for publication. Available online at: 〈http://web.mit.edu/asuman/www/publications.htm〉.
[11] Borgatti SP. Centrality and network flow. Social Networks 2005;27(1):55–71.
[12] Borgatti SP. Identifying sets of key players in a social network. Computational and Mathematical Organization Theory 2006;12(1):21–34.
[13] Chen W, Collins A, Cummings R, Ke T, Liu Z, Rincon D, et al. Influence maximization in social networks when negative opinions may emerge and propagate. In: Proceedings of the SIAM international conference on data mining; 2011. p. 379–90.
[14] Chen W, Lakshmanan LV, Castillo C. Information and influence propagation in social networks. Synthesis Lectures on Data Management 2013;5(4):1–177.
[15] Chen W, Wang C, Wang Y. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2010. p. 1029–38.
[16] Chen W, Wang Y, Yang S. Efficient influence maximization in social networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2009. p. 199–208.
[17] Chen W, Yuan Y, Zhang L. Scalable influence maximization in social networks under the linear threshold model. In: 2010 IEEE 10th international conference on data mining (ICDM). IEEE; 2010. p. 88–97.
[18] Choi S, Gale D, Kariv S. Learning in networks: an experimental study. Unpublished manuscript; 2005.
[19] De Nooy W, Mrvar A, Batagelj V. Exploratory social network analysis with Pajek, vol. 27. Cambridge University Press; 2011.
[20] Deroïan F. Formation of social networks and diffusion of innovations. Research Policy 2002;31(5):835–46.
[21] Diaby M, Bahl HC, Karwan MH, Zionts S. A Lagrangian relaxation approach for very-large-scale capacitated lot-sizing. Management Science 1992;38(9):1329–40.
[22] Ding C, He X, Simon HD. Nonnegative Lagrangian relaxation of k-means and spectral clustering. In: Machine Learning: ECML 2005, Lecture Notes in Computer Science. Springer; 2005. p. 530–8.
[23] Dinh TN, Zhang H, Nguyen DT, Thai MT. Cost-effective viral marketing for time-critical campaigns in large-scale social networks. IEEE/ACM Transactions on Networking (TON) 2014;22(6):2001–11.
[24] Domingos P, Richardson M.
Mining the network value of customers. In: Proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2001. p. 57–66.
[25] Easley D, Kleinberg J. Networks, crowds, and markets: reasoning about a highly connected world. Cambridge University Press; 2010.
[26] Feige U. A threshold of ln n for approximating set cover. Journal of the ACM (JACM) 1998;45(4):634–52.
[27] Feldman E, Lehrer F, Ray T. Warehouse location under continuous economies of scale. Management Science 1966;12(9):670–84.
[28] Fisher ML. The Lagrangian relaxation method for solving integer programming problems. Management Science 1981;27(1):1–18.
[29] Fisher ML. The Lagrangian relaxation method for solving integer programming problems. Management Science 2004;50(Suppl. 12):1861–71.
[30] Ghose A, Ipeirotis PG. Estimating the helpfulness and economic impact of product reviews: mining text and reviewer characteristics. IEEE Transactions on Knowledge and Data Engineering 2011;23(10):1498–512.
[31] Girvan M, Newman ME. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 2002;99(12):7821–6.
[32] Goffin J. On convergence rates of subgradient optimization methods. Mathematical Programming 1977;13(1):329–47.
[33] Golub B, Jackson MO. Naive learning in social networks and the wisdom of crowds. American Economic Journal: Microeconomics 2010:112–49.
[34] Goodman ND, Ullman TD, Tenenbaum JB. Learning a theory of causality. Psychological Review 2011;118(1):110.
[35] Goyal A, Bonchi F, Lakshmanan LV, Venkatasubramanian S. On minimizing budget and time in influence propagation over social networks. Social Network Analysis and Mining 2012:1–14.
[36] Goyal A, Lu W, Lakshmanan LV.
CELF++: optimizing the greedy algorithm for influence maximization in social networks. In: Proceedings of the 20th international conference companion on World Wide Web. ACM; 2011. p. 47–8.
[37] Granovetter M. Threshold models of collective behavior. American Journal of Sociology 1978;83(6):1420.
[38] Hakimi SL. Optimum locations of switching centers and the absolute centers and medians of a graph. Operations Research 1964;12(3):450–9.
[39] Held M, Karp RM. The traveling-salesman problem and minimum spanning trees. Operations Research 1970;18(6):1138–62.
[40] Held M, Wolfe P, Crowder HP. Validation of subgradient optimization. Mathematical Programming 1974;6(1):62–88.
[41] Hinz O, Skiera B, Barrot C, Becker JU. Seeding strategies for viral marketing: an empirical comparison. Journal of Marketing 2011;75(6):55–71.
[42] Iyengar R, Van den Bulte C, Valente TW. Opinion leadership and social contagion in new product diffusion. Marketing Science 2011;30(2):195–212.
[43] Jaynes ET. Probability theory: the logic of science. Cambridge University Press; 2003.
[44] Katz E. The two-step flow of communication: an up-to-date report on an hypothesis. Public Opinion Quarterly 1957;21(1):61–78.
[45] Kempe D, Kleinberg J, Tardos É. Maximizing the spread of influence through a social network. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2003. p. 137–46.
[46] Kempe D, Kleinberg J, Tardos É. Influential nodes in a diffusion model for social networks. In: Proceedings of the 32nd International Conference on Automata, Languages and Programming. LNCS, ICALP'05, vol. 3580; 2005. p. 1127–38.
[47] Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance N. Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2007. p. 420–9.
[48] Leskovec J, Krevl A. SNAP Datasets: Stanford large network dataset collection.
〈http://snap.stanford.edu/data〉; June 2014.
[49] Lu Y, Jerath K, Singh PV. The emergence of opinion leaders in a networked online community: a dyadic model with time dynamics and a heuristic for fast estimation. Management Science 2013;59(8):1783–99.
[50] Macy MW. Chains of cooperation: threshold effects in collective action. American Sociological Review 1991:730–47.
[51] Mahajan V, Muller E, Sharma S. An empirical comparison of awareness forecasting models of new product introduction. Marketing Science 1984;3(3):179–97.
[52] Manchanda P, Xie Y, Youn N. The role of targeted communication and contagion in product adoption. Marketing Science 2008;27(6):961–76.
[53] Merton RK. Selected problems of field work in the planned community. American Sociological Review 1947:304–12.
[54] Nam S, Manchanda P, Chintagunta PK. The effect of signal quality and contiguous word of mouth on customer acquisition for a video-on-demand service. Marketing Science 2010;29(4):690–700.
[55] Newman ME. Spread of epidemic disease on networks. Physical Review E 2002;66(1):016128.
[56] Padgett JF, Ansell CK. Robust action and the rise of the Medici. American Journal of Sociology 1993:1400–34.
[57] Pan F, Nagi R. Multi-echelon supply chain network design in agile manufacturing. Omega 2013;41(6):969–83.
[58] Peres R, Muller E, Mahajan V. Innovation diffusion and new product growth models: a critical review and research directions. International Journal of Research in Marketing 2010;27(2):91–106.
[59] Richardson M, Domingos P. Mining knowledge-sharing sites for viral marketing. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2002. p. 61–70.
[60] Sangachin MG, Samadi M, Cavuoto LA. Modeling the spread of an obesity intervention through a social network. Journal of Healthcare Engineering 2014;5(3):293–312.
[61] Siomina I, Värbrand P, Yuan D. Pilot power optimization and coverage control in WCDMA mobile networks. Omega 2007;35(6):683–96.
[62] Susarla A, Oh J-H, Tan Y. Social networks and the diffusion of user-generated content: evidence from YouTube. Information Systems Research 2012;23(1):23–41.
[63] Tang J, Sun J, Wang C, Yang Z. Social influence analysis in large-scale networks. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining. ACM; 2009. p. 807–16.
[64] Tansel BC, Francis RL, Lowe TJ. State of the art – location on networks: a survey. Part I. The p-center and p-median problems. Management Science 1983;29(4):482–97.
[65] Taylor SE. Asymmetrical effects of positive and negative events: the mobilization-minimization hypothesis. Psychological Bulletin 1991;110(1):67.
[66] Tenenbaum JB, Griffiths TL, Kemp C. Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences 2006;10(7):309–18.
[67] Trigeiro WW, Thomas LJ, McClain JO. Capacitated lot sizing with setup times. Management Science 1989;35(3):353–66.
[68] Tucker C. Identifying formal and informal influence in technology adoption with network externalities. Management Science 2008;54(12):2024–38.
[69] Valente TW, Frautschi S, Lee R, O'Keefe C, Schultz L, Steketee R, et al. Network models of the diffusion of innovations. Nursing Times 1994;90(35):52–3.
[70] Van den Bulte C, Joshi YV. New product diffusion with influentials and imitators. Marketing Science 2007;26(3):400–21.
[71] Van den Bulte C, Lilien GL. Medical innovation revisited: social contagion versus marketing effort. American Journal of Sociology 2001;106(5):1409–35.
[72] Wang C, Chen W, Wang Y. Scalable influence maximization for independent cascade model in large-scale social networks. Data Mining and Knowledge Discovery 2012;25(3):545–76.
[73] Wasserman S. Social network analysis: methods and applications, vol. 8. Cambridge University Press; 1994.
[74] Watts DJ, Dodds PS. Influentials, networks, and public opinion formation. Journal of Consumer Research 2007;34(4):441–58.
[75] Wejnert B.
Integrating models of diffusion of innovations: a conceptual framework. Sociology 2002;28(1):297.
[76] Whyte Jr. WH. The web of word of mouth. Fortune 1954;50(1954):140–3.
[77] Xu F, Tenenbaum JB. Word learning as Bayesian inference. Psychological Review 2007;114(2):245.
[78] Xu J, Nagi R. Solving assembly scheduling problems with tree-structure precedence constraints: a Lagrangian relaxation approach. IEEE Transactions on Automation Science and Engineering 2013;10(3):757–71.
[79] Yoganarasimhan H. Impact of social network structure on content propagation: a study using YouTube data. Quantitative Marketing and Economics 2012;10(1):111–50.
[80] Young HP. Individual strategy and social structure: an evolutionary theory of institutions. Princeton University Press; 2001.
[81] Zachary WW. An information flow model for conflict and fission in small groups. Journal of Anthropological Research 1977:452–73.

Probabilistic Graphical Models in Modern Social Network Analysis

Alireza Farasat · Alexander Nikolaev · Sargur N. Srihari · Rachael Hageman Blair

Rachael Hageman Blair, Department of Biostatistics, State University of New York at Buffalo. E-mail: [email protected]

Abstract The advent and availability of technology has brought us closer than ever through social networks. Consequently, there is a growing emphasis on mining social networks to extract information for knowledge and discovery. However, methods for Social Network Analysis (SNA) have not kept pace with the data explosion. In this review, we describe directed and undirected Probabilistic Graphical Models (PGMs), and describe recent applications to social networks.
Modern SNA is flooded with challenges that arise from the inherent size, scope, and heterogeneity of both the data and underlying population. As a flexible modeling paradigm, PGMs can be adapted to address some SNA challenges. Such challenges are common themes in Big Data applications, but must be carefully considered for reliable inference and modeling. For this reason, we begin with a thorough description of data collection and sampling methods, which are often necessary in social networks, and underlie any downstream modeling efforts. PGMs in SNA have been used to tackle current and relevant challenges, including the estimation and quantification of importance, propagation of influence, trust (and distrust), link and profile prediction, privacy protection, and news spread through micro-blogging. We highlight these applications, and others, to showcase the flexibility and predictive capabilities of PGMs in SNA. Finally, we conclude with a discussion of challenges and opportunities for PGMs in social networks.

Keywords Probabilistic Graphical Modeling · Social Network Analysis · Bayesian Networks · Markov Networks · Exponential Random Graph Models · Markov Logic Networks · Social Influence · Network Sampling

1 Introduction

Over forty years ago, social scientist Allen Barton stated that “If our aim is to understand people’s behavior rather than simply to record it, we want to know about primary groups, neighborhoods, organizations, social circles, and communities; about interaction, communication, role expectations, and social control.” (Barton, 1968 as reported in Freeman, 2004). This sentiment is fundamental to the concept of modularity. The importance of structural relationships in defining communities and predicting future behaviors has long been recognized, and is not restricted to the social sciences [48].
Social Network Analysis (SNA) has a rich history that is based on the defining principle that links between actors are informative. The advent and availability of Internet technology has created an explosion in online social networks and a transformation in SNA. The analysis of today’s social networks is a difficult Big Data problem, which requires the integration of statistics and computer science to leverage networks for knowledge mining and discovery [99]. SNA scientists once had to rely on tractable records of social interactions and experiments (e.g., Milgram’s small world experiment); now they have the luxury of accessing huge digital databases of relational social data. However, this gain in information comes at a price: many of the statistical tools for analyzing such databases break down due to the enormity of social networks and the complex interdependencies within the data. False discovery rates are not easily controlled, which makes the identification of meaningful signals and relationships difficult [42]. Moreover, sampling networks is typically required, which can propagate selection bias through any downstream inference procedures. SNA relies on diverse data representations and relational information, which may include (among others) tracked relationships among actors, events, and other covariate information [130]. Modeling social networks is especially challenging due to the heterogeneity of the populations represented, and the broad spectrum of information represented in the data itself. In this review, we focus on Probabilistic Graphical Models (PGMs), a flexible modeling paradigm, which has been shown to be an effective approach to modeling social networks [81, 91]. Modern applications, including the estimation of influence, privacy protection, trust (and distrust), microblogging, and web-browsing, are presented to highlight the flexibility and utility of PGMs in addressing current and relevant problems in modern SNA.
PGMs provide a compact representation of a high-dimensional joint probability distribution of variables by exploiting conditional independencies in the network of these variables; such a network, with local (in)dependency specifications, is called a model. PGM modeling is rooted in probabilistic reasoning and querying, and can also be used for generative purposes (sampling) [81]. In this review, we outline the basic theory, model parameter learning, and structural learning, but emphasize the practical application and implementation of these models to solve modern problems in SNA. We describe some of the unique statistical challenges that arise in using PGMs in SNA. These challenges are not isolated to PGMs. Rather, they propagate from the very foundation of the model (the data), through the local statistical models of the links and nodes, and finally to the graphical model. This review is organized from the bottom up: from data sampling, to directed and undirected graphical models. This paper is structured as follows. Section 2 provides an overview of data collection methods for SNA, reviews the challenges that arise in network sampling, and cites some network data repositories. In Section 3, directed probabilistic graphical models, static and dynamic, are discussed, accompanied by application examples in SNA. Section 4 turns to undirected graphical model types and their applications. Section 5 concludes the paper and outlines future directions and challenges for PGM-based research in SNA.

2 Data collection and sampling

Data collection from social networks is a fundamental challenge that inherently affects downstream analysis through sampling bias [11, 19]. The reproducibility and generalization of any statistical analysis performed depends critically on the sample population, and on how representative it is of the true population.
In traditional observational and clinical studies, randomization and large sample sizes are important aspects of experimental design [28]. The object of a study may be driven by attributes such as the presence of a disease, or a covariate such as profession, age, preferences, etc. In contrast, SNA focuses primarily on the relations among actors, not the actors themselves and their individual attributes. For this reason, the population is not usually comprised of actors sampled independently; rather, the sampling scheme is driven by the ties among the actors. Snowball sampling begins with an actor, or a set of actors, and moves through the network by sampling ties [13]. Snowball methods are useful for identifying modules within a population, e.g., leaders, sub-cultures, and communities. The inability to include isolated actors, which are not directly tied in but may be informative to the analysis, is a major limitation. Other disadvantages include the overestimation of connectivity, and the sensitivity of the sample to the initialization setting(s) of the snowball(s). Improvements on snowball sampling have been proposed to address some of these limitations [8, 44, 66, 133]. An alternative approach is to target actors in an ego-centric manner. There are two main sampling designs, with and without alter peer connections [63]. In this setting, a set of focal actors is selected, and their first-level ties are identified. In ego-centric networks with alter connections, those first-level ties are examined to determine the connections between them. Ego-centric networks without alter connections simply rely on focal actors and first-level ties; with this approach, extrapolation and generalization to the whole network is not possible. Online Social Networks (OSNs) present unique challenges due to their massive size and the nature of their heterogeneous attributes. A number of factors complicate the data collection process.
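Snowball sampling can be sketched in a few lines. The adjacency-dict representation and the `waves` parameter below are illustrative assumptions; the sketch also makes the stated limitation visible, since actors with no ties into the sampled region are never reached.

```python
# A minimal sketch of snowball sampling; `graph` is assumed to be an
# adjacency dict {node: set_of_neighbors}, and `waves` bounds how many
# times the snowball is expanded.
def snowball_sample(graph, seeds, waves):
    sampled = set(seeds)
    frontier = set(seeds)
    for _ in range(waves):
        next_frontier = set()
        for node in frontier:          # follow ties out of the current wave
            next_frontier |= graph[node] - sampled
        sampled |= next_frontier
        frontier = next_frontier
    return sampled
```

On a small graph, one wave from a single seed returns the seed and its neighbors, two waves add the neighbors' neighbors, and an isolated node is never included regardless of the number of waves.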
Individuals can customize personal privacy settings, limiting crawlers from obtaining information and ultimately creating a missing data problem for the analyst. The diversity and dynamic nature of the data itself makes pages difficult to parse for collection purposes. Furthermore, sampling is critical for tractable inference and analyses of large-scale OSNs. In most OSNs, we are faced with hidden populations, i.e., with unknown population size or unknown underlying distributions of the variables (edges or actors). In these cases, access to the network is facilitated through neighbors only. Crawling (through neighbors), either by random walks or graph traversals, is one of the most widely used network exploration techniques for OSNs.

– Random walk: The Metropolis-Hastings algorithm is a widely used Markov Chain Monte Carlo (MCMC) method for sampling social networks [26]. The random walk starts at a random (or targeted) node and proceeds iteratively, moving from node i to node j according to a transition probability. As n → ∞, the sampling distribution approaches the stationary distribution of actor characteristics, as if each sampled individual were drawn uniformly from the underlying population. In practice, heuristic diagnostics are performed to assess convergence; the success of the method can also depend on the starting point of the chain. Even with multiple chains, mixing can be slow and the chain can get stuck in regions of the graph. Note that these features are common to applications of MCMC methods, and are not restricted to OSNs [52].

– Graph traversals: Several graph traversal methods have been applied to OSNs. These techniques differ only slightly in the order in which they systematically visit the nodes of the network. Breadth-first search (BFS) and snowball sampling visit the graph through neighbor nodes [57]. Depth-first search (DFS) explores the graph from the seed node through the children nodes, and backtracks at dead-ends.
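A minimal sketch of a Metropolis-Hastings random walk for node sampling follows. The uniform-neighbor proposal with acceptance probability min(1, deg(i)/deg(j)) is a standard construction that yields a uniform stationary distribution over nodes (removing degree bias); it is used here for illustration and is not a detail taken from the text.

```python
import random

# Sketch of a Metropolis-Hastings random walk (MHRW) over a graph given as
# an adjacency dict {node: list_of_neighbors}.
def mhrw_sample(graph, start, n_steps, seed=0):
    rng = random.Random(seed)
    samples, current = [], start
    for _ in range(n_steps):
        proposal = rng.choice(graph[current])
        # accept with probability min(1, deg(current)/deg(proposal))
        if rng.random() < min(1.0, len(graph[current]) / len(graph[proposal])):
            current = proposal
        samples.append(current)     # on rejection the walk self-loops
    return samples
```

On a star graph, a plain random walk would visit the hub half the time, whereas the MHRW chain visits each node with roughly equal frequency once it has mixed.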
Factors such as sample size, as well as seed and algorithm choice, can introduce bias into the statistical analysis of a network. Several authors have performed detailed investigations of the efficiency and bias associated with sampling algorithms using different OSNs [18, 92]. Breadth-first search (BFS) is the most widely used method for OSN sampling and has been shown to be biased toward high-degree nodes [87, 160]. Variants of the M-H algorithm have been proposed: Metropolized Random Walk with Backtracking (MRWB), M-H Random Walk (MHRW), Re-Weighted Random Walk (RWRW) and Unbiased Sampling to Reduce Self Loops (USRS), which aim to reduce or correct sample bias [54, 125, 139, 152].

Publicly Available Data: Several data resources have been created to house a wealth of diverse social network data. These resources are usually open source, requiring, at a minimum, a user agreement. Leveraging these resources is ideal for the development and testing of methodologies related to SNA. Max Planck researchers have released OSN data used in publications, which includes crawled data from Flickr, YouTube, Wikipedia and Facebook [20, 21, 101, 149]. Several directed OSNs have been released in the Stanford Network Analysis Package (SNAP), e.g., from Epinions, Amazon, LiveJournal, Slashdot and Wikipedia voting [138]. Recently, a Facebook dataset collected with MHRW was released, which exhibited convergence properties and was shown to be representative of the underlying population [54]. However, the MHRW and UNI data sets contain only link information, thereby prohibiting attribute-based analyses. Document classification datasets have also been released [53]. A sample from the CiteSeer database contains 3,312 publications from one of six classes, and 4,732 links. The Cora dataset consists of 2,708 publications classified into seven categories, and the citation network has 5,429 links.
Each publication is described by a binary word vector indicating the presence of certain words from a collection of 1,433. WebKB consists of 877 scientific publications from five classes, contains 1,601 links, and includes binary word attributes similar to Cora. Terrorism databases are also publicly available [38, 141, 142]. The most extensive is the RAND Database of Worldwide Terrorism Incidents, which details terrorist attacks in nine distinct regions of the world across the time-span 1968–2009 (dates vary slightly depending on region) [38]. Several well-known challenges may arise in the analysis and representation of terrorist network data, including incomplete information, latent variables influencing node dynamics, and fuzzy boundaries between terrorists, supporters of terrorists, and the innocent [85, 136]. An alternative option to access data is to enroll in data challenges, which are often posed by corporations and operators of the networks themselves. For example, the Nokia mobile data challenge data was released in 2012 [90]. The data follows 200 users throughout the course of a year, and includes: usage (full call and message logs), status (GPS readings, operation mode), environment (accelerometer samples, wi-fi access points, bluetooth devices), personal (full contact list, calendar), and user profile. Formal requests are required to use this data, ensuring its use for research and development, and prohibiting commercial use. Twitter has just posed TREC 2013, a collection of 240 million tweets (statuses) collected over a two-month period [100]. This is the third year of TREC releases. The use of this data requires registration for a competition; the 2013 competition centers on real-time ad hoc search tasks.
3 Directed Probabilistic Graphical Models

Bayesian Networks (BNs) are a special class of PGMs that capture directed dependencies between variables, which may represent cause-and-effect relationships. We describe two different branches of BNs, static and dynamic, which may be used to model social networks at a single time point or across a series of time points, respectively. Both rely on the Markov assumptions, which enable the compact representation of the high-dimensional joint probability distribution of the variables in the model. Arguably, the use of directed graphs in SNA has been somewhat limited, although the applications themselves are diverse. We describe the basic principles of these directed PGMs and motivate them with applications in the literature, which showcase their utility in SNA. Static Bayesian Networks utilize data from a single snapshot of a social community at a given time point, described by a Directed Acyclic Graph (DAG). A DAG conveys precise information regarding the conditional independencies between the modeled variables (nodes). The resulting graph, G, can be translated directly into a factored representation of the joint distribution [67, 91]. BNs obey the Markov condition, which states that each variable, X_i, is independent of its non-descendants, given its parents in G. Under these assumptions, a BN for a set of variables {X_1, X_2, …, X_n} is a network with a structure that encodes the conditional independence relationships:

P(X_1, X_2, …, X_n) = P(G) ∏_{i=1}^{n} P(X_i | pa(X_i), Θ_i),

where P(G) is the prior distribution over the graph G, pa(X_i) are the parent nodes of child X_i, and Θ_i denotes the parameters of the local probability distribution. Depending on the data and modeling objectives, BN learning may require up to two layers of inference: structural and parameter learning.
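The factorization above is easy to make concrete. The three-node network (rain → wet ← sprinkler) and all CPT values below are invented purely for illustration; the point is that the joint probability of a full assignment is just the product of the local conditionals P(X_i | pa(X_i)).

```python
# A concrete instance of the BN factorization; network and CPT values
# are invented for illustration only.
parents = {"rain": [], "sprinkler": [], "wet": ["rain", "sprinkler"]}
cpt = {
    "rain":      {(): 0.2},                   # P(rain = 1)
    "sprinkler": {(): 0.5},                   # P(sprinkler = 1)
    "wet":       {(0, 0): 0.05, (0, 1): 0.8,  # P(wet = 1 | rain, sprinkler)
                  (1, 0): 0.9,  (1, 1): 0.99},
}

def joint(assign):
    """P(X_1,...,X_n) as the product of local conditionals P(X_i | pa(X_i))."""
    p = 1.0
    for var, pa in parents.items():
        p1 = cpt[var][tuple(assign[q] for q in pa)]
        p *= p1 if assign[var] == 1 else 1.0 - p1
    return p
```

Because the factors are proper conditional distributions, summing `joint` over all eight 0/1 assignments returns 1, which is a quick sanity check on any hand-built CPT.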
Identifying the DAG that best explains the data is an NP-hard problem [27]. Structural inference can be conducted by sampling the posterior distribution to obtain an ensemble of feasible graphs, or through the implementation of a greedy hill-climbing algorithm, to identify a single graph structure that best approximates the Maximum a Posteriori (MAP) probabilities [68]. In many applications of SNA, the structure is often assumed, at least to some degree. In this case, the statistical inference problem is local parameter inference conditional on the assumed structure of the network. The directionality and causal structure of the inferred model make the BN an attractive modeling paradigm for social networks that captures and conveys cause-and-effect relationships in a problem setting. Such examples may manifest in decision making (influence). Screen-Based Bayes Net Structure (SBNS) was developed as a search strategy for large-scale data, which relies on the adopted assumption of sparsity in the overall network structure [55]. Sparsity in BNs is a popular assumption that can safeguard against over-fitting [68]. SBNS enforces the sparsity through a two-stage process, which frames the structural learning problem as a Market Basket Analysis task [12]. The algorithm relies on the theory of frequent sets and support to first screen for local modules of nodes, and then connect them through a global structure search. The Market Basket framework lends itself to transaction-style data, which is by nature large, sparse and binary. In this case, actors are assumed to be linked to each other indirectly through items or events (Figure 1A). The learning problem is to identify an influence graph based on derived features of the binary transaction data. The method was shown to be effective for modeling a variety of SNs, including citation networks, collaboration data, and movie appearance records [12]. Koelle et al.
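Greedy hill-climbing over structures, as mentioned above, can be sketched schematically: at each step, apply the best single edge addition or deletion that keeps the graph acyclic. The `score` callable below is a placeholder for a decomposable model score (e.g., a MAP or BIC score); this is a generic sketch, not the SBNS algorithm.

```python
from itertools import permutations

# A schematic score-based greedy hill-climb over DAG structures; `score`
# is a hypothetical callable mapping an edge set to a model score.
def creates_cycle(edges, new_edge):
    # DFS from the head of new_edge; reaching its tail closes a cycle
    u, v = new_edge
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(w for (p, w) in edges | {new_edge} if p == node)
    return False

def hill_climb(nodes, score, max_steps=100):
    edges, best = set(), score(set())
    for _ in range(max_steps):
        moves = [edges | {e} for e in permutations(nodes, 2)
                 if e not in edges and not creates_cycle(edges, e)]
        moves += [edges - {e} for e in edges]       # single-edge deletions
        cand_score, cand = max(((score(m), m) for m in moves),
                               key=lambda t: t[0])
        if cand_score <= best:
            break                                   # local optimum reached
        best, edges = cand_score, cand
    return edges, best
```

The acyclicity check is what keeps the search inside DAG space; as the text notes, such greedy search only approximates the MAP structure and can stop at a local optimum.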
proposed applications of BNs to SNA for the prediction of novel links and pre-specified node features (e.g., leadership potential) [80]. The authors emphasize the ability of BNs to account for uncertainty, noise, and incompleteness in the network. For example, topology-based network measures such as degree centrality, which are often used as surrogates for importance, are subject to summarization over incomplete and sometimes erroneous data. Comparatively, a BN affords more flexibility, enabling measures such as importance to be estimated in a more data-dependent manner. Koelle et al. provide an example of combining topology-based network measures with covariate information (Figure 1B). Directed inference of this type leverages small local models, which can be naturally translated to regression or classification problems, depending on the child node (response variable). In this setting, the local BN can be evaluated at the node level, ranked probability estimates can be used for predictive purposes, and the output serves as a surrogate for model fit on a given structure.

Privacy protection is a major concern amongst users in online social networks [65]. Generally, people prefer that their personal information is shared in small circles of friends and family, and shielded from strangers [24]. Despite this common desire, relatively simple BNs have been shown to be successful in the invasion of privacy through the inference of personal attributes that have been shielded through privacy settings [65]. These BNs operate under the often accurate assumption that friends in social circles are likely to share common attributes. In 2006, the recommendation by He et al. to improve privacy was to hide friend lists through privacy settings, and to request that friends hide their personal attributes. Practically speaking, choosing the optimal privacy settings is complex, and can be tedious and difficult for an average user [96].
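The attribute-inference attack just described can be sketched with a tiny Naive Bayes classifier that guesses a user's hidden attribute from their friends' visible attributes, under the homophily assumption that friends share attributes. This is only an illustration of the idea, not the cited systems; the attribute values, priors, and likelihoods below are all invented.

```python
# Naive Bayes guess of a hidden attribute from friends' visible attributes.
# Assumes friends' attributes are conditionally independent given the
# user's own (hidden) attribute -- the standard Naive Bayes simplification.

def naive_bayes_attribute(friend_attrs, prior, likelihood):
    """Return the MAP attribute value given friends' visible attributes.

    prior[v]          -- P(user attribute = v)
    likelihood[v][f]  -- P(a friend shows attribute f | user attribute = v)
    """
    scores = {}
    for v in prior:
        score = prior[v]
        for f in friend_attrs:
            score *= likelihood[v].get(f, 1e-6)  # tiny floor for unseen values
        scores[v] = score
    return max(scores, key=scores.get)

# Invented example: a user with four friends, three of whom display "A".
prior = {"A": 0.5, "B": 0.5}
likelihood = {"A": {"A": 0.8, "B": 0.2},
              "B": {"A": 0.3, "B": 0.7}}
guess = naive_bayes_attribute(["A", "A", "B", "A"], prior, likelihood)
```

Even with uniform priors, a handful of friends' visible attributes is enough to tip the posterior, which is why hiding one's own attribute provides limited protection when friends' attributes remain visible.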
In 2010, a privacy wizard template was proposed, which automates a person's privacy settings based on an implicit set of rules derived using Naive Bayes (the simplest BN) or Decision Tree methods [43].

On the other side of the application spectrum, BNs are useful for recommending products and services to users, taking into account their interests, needs, and communication patterns. Belief propagation has been used to summarize belief about a product and propagate that belief through a BN [9, 159]. Belief propagation is the process in which node marginal distributions (beliefs) are updated in light of new evidence [82]. In the case of a BN, evidence (e.g., opinions or ratings) is absorbed and propagated through a computational object known as a junction tree, resulting in updated marginal distributions. Comparing the network marginals before and after evidence is entered and propagated conveys a system-wide effect of influence(s), and yields insights into how perception or ratings change when recommendations are passed through a network. Despite its simplicity, the BN approach has been shown to be competitive with the more classical Collaborative Filtering (CF)-based recommendation [158]. Trust (and distrust) can be highly variable, dynamic processes, which depend not only on the distance from a recommender, but also on the characteristics of the network users [88, 153]. Accounting for trust in recommendation systems is an open area of research.

Microblogging networks represent another effective venue for rapidly disseminating information and influence throughout a community. Twitter is the most well-known microblogging network, in which posts (tweets) are short and time-sensitive with respect to the reference of current topics [89]. Users within microblogging networks of this type participate through the act of following and being followed, which naturally gives rise to directed associations [75].
With over 50 million tweets submitted daily, ranking and querying microblogs has become an important and active area of open research [25, 97, 105, 110, 114]. Jabeur et al. proposed a retrieval model for tweet searches, which takes into account a number of factors, including hashtags, the influence of the microbloggers, and time [72, 73]. A query relevance function was developed based on a BN that leverages the PageRank algorithm to estimate parameters, such as influence, in the model (Figure 1C). The retrieval model was shown to outperform traditional methods for information retrieval on Twitter data from the TREC Tweets 2011 corpus [111].

[Figure 1 panels: (A) individuals linked through events and the inferred social-influence graph; (B) a local model combining attributes (sex, education, religion), centrality measures (degree centrality, link certainty), and derived metrics (individual importance, future leadership potential); (C) a layered Twitter search over microbloggers, tweets, and terms, with edge types (following, mentioning, tagging, publishing, re-tweeting, sharing) and node types (microblogger, tweet, reply, retweet, hashtag, web resource).]

Fig. 1 Simplified schematics of select examples of Bayesian Networks in social networks. (A) Inferring sparse Bayesian influence based on transaction-style data, which links actors to events. (B) Local models can be used to predict local metrics, such as individual importance or leadership potential, from attributes and centrality measures on the network itself. (C) Twitter is a microblogging community, which can be queried using a retrieval model described by a Bayesian Network.

Thus far, the BNs discussed summarize information at a single time point.
This represents an oversimplification of the true nature of the networks described, which are inherently dynamic [137]. In the described SN applications, the dynamic aspects are simplified by extracting data from a snapshot (or a series of snapshots) of the SN across a time period. The granularity of the discretization, whether coarse or fine, can bias the results of the analysis. Discretization can also give rise to many of the issues related to data collection discussed in Section 2. Modeling the dynamics of a network over its time course can be achieved in the BN framework with additional modeling assumptions.

Dynamic Bayesian Networks

Dynamic Bayesian Networks (DBNs) provide compact representations for encoding structured probability distributions over arbitrarily long time courses [103]. State-space models, such as Hidden Markov Models (HMMs) and Kalman Filter Models (KFMs), can be viewed as special classes of the more general DBN. Specifically, KFMs require unimodal linear Gaussian assumptions on the state-space variables. HMMs do not allow for factorizations within the state space, but can be extended to hierarchical HMMs for this purpose. DBNs enable a more general representation of sequential or time-course data. DBN modeling is achieved through the use of template models, which are instantiated, i.e., duplicated, over multiple time points. The relationships between the variables within a template are fixed, and represent the inherent dependencies between ground variables in the model. The objective is to model a template variable over a discretized time course, X^0, ..., X^T, and represent P(X^{0:T}) as a function of the templates over the range of time points. Reducing the temporal problem to conditional template models makes the problem computationally tractable, but requires the specification of a fixed structure across the entire time trajectory.
In a DBN, the probability for a random variable X spanning the time course can be given in factored form,

P(X^{0:T}) = P(X^0) ∏_{t=0}^{T-1} P(X^{t+1} | X^t),

where X^0 represents the initial state, and the conditional probability terms of the form P(X^{t+1} | X^t) convey the conditional independence assumptions. The conditional representation of the likelihood is similar in spirit to the static BN representation, but conveys conditional independence with respect to time. The Markov assumption enables this factorization, and has different, yet analogous, meanings in static and dynamic BNs. In a DBN, the Markov assumption expresses the memorylessness property, i.e., that the current state depends on the previous state and is conditionally independent of the more distant past: X^{t+1} ⊥ X^{0:t-1} | X^t. Comparatively, in static BNs, the Markov assumption only captures nodes' independence of their non-descendants, given the states of their parents. Both DBNs and static BNs represent joint distributions of random variables.

Similar to static BNs, DBNs may also require up to two layers of inference, structural and parameter learning, and the learning paradigms are rather similar. Structural learning is typically achieved by the same scoring strategies, but with the added constraint that the structure must repeat over time [49]. Such a constraint alleviates the computational burden for search strategies. Additionally, the best initial structure can be searched for independently from the remainder of the time course. The search is performed either through greedy hill climbing or sampling. Several options exist for parameter learning, including junction trees, belief propagation, and the EM algorithm [33, 78, 132]. Despite the fact that social networks are typically inherently dynamic, the applications of DBNs in SNA have been limited.
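The factored form above can be sketched for the simplest possible template model, a two-state chain in which the same transition template is instantiated at every time step. The initial distribution and transition probabilities below are invented for illustration.

```python
# DBN factorization P(X^{0:T}) = P(X^0) * prod_t P(X^{t+1} | X^t)
# for a two-state chain. The transition "template" is reused at each step.

initial = {0: 0.6, 1: 0.4}                       # P(X^0)
transition = {(0, 0): 0.7, (0, 1): 0.3,          # keyed as (x_t, x_{t+1})
              (1, 0): 0.4, (1, 1): 0.6}

def trajectory_probability(states):
    """Probability of one trajectory x^0, ..., x^T under the template model."""
    p = initial[states[0]]
    for t in range(len(states) - 1):
        p *= transition[(states[t], states[t + 1])]
    return p

p_traj = trajectory_probability([0, 0, 1, 1])
```

Because of the Markov assumption, the joint over an arbitrarily long trajectory is specified by just the initial distribution and one transition table, rather than a table exponential in T.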
Importantly, there have been many attempts to model social networks probabilistically over time, but not in the strict PGM context, which is the focus of this review; many of these advances are discussed in Section 5. Chapelle et al. used DBNs to model web users' browsing history [22]. The DBN extends the traditional and widely-used cascade model for browsing behavior to a more general model [77]. The dynamic studied here is that of click sequences, which is illustrated in Figure 2 for a single click (one time instance). The model takes into account the information at the query and session levels, differentiating perceived/actual attraction (a_u and A_i, respectively) and perceived/actual satisfaction (s_u and S_i, respectively) with links. At each click (time step), the hidden binary variables for examination (E_i) and satisfaction (S_i) track the time progression to predict future clicks. The DBN approach was shown to outperform traditional methods, and highlighted the sensitivity of click modeling to measures of relevance and popularity at the query level.

DBNs and HMMs are very popular in the area of speech recognition [115, 162]. Meetings are social events, in which valuable information is exchanged mainly through speech. Effectively processing, capturing, and organizing this information can be costly, but is critical in order to maximize the impact and information flow for participants. Dielmann et al. cast the problem of meeting structuring as a DBN, which partitions meetings into sequences of actions or phases based on audio [35]. The data included speaker order, location detected from a microphone array, talk rate, pitch, and overall energy (enthusiasm). DBNs outperformed baseline HMMs in detecting meeting actions in a smart room, such as dialogue, notes at the board, computer presentations, and presentations at the board. Twitter, and microblogs in general, have become a major resource for the media to obtain breaking news or learn of the occurrence of a critical event.
Recently, Sakaki et al. modeled Twitter activity using KFMs in an effort to identify events and event locations [124]. Each Twitter user is assumed to represent a sensor that monitors tweet features such as keywords, the locations of tweets, their length, and their content. Support Vector Machines (SVMs) are first used for event classification, followed by a Kalman filter to identify the location and the path itself. Location information of the quake is estimated through parameter learning at each time point. Through tweet modeling, the authors were able to predict 96% of Japan's earthquakes of a certain magnitude. Furthermore, they developed a reporting system, Toretter, which is quicker than the existing government reporting system in warning registered individuals through email of an impending quake [74]. This important and highly cited work can be generalized in this paradigm to model and predict other events.

Fig. 2 An example of a time instance in a DBN used for click modeling in a browser. The temporal dimension is the click sequence, which is progressed through binary latent variables depicting satisfaction (S_i) and examination (E_i). Attraction (A_i) and satisfaction (S_i) are modeled at the session level, as well as at the query level (a_u and s_u), which is assumed to be time-invariant.

4 Undirected Probabilistic Graph Models

Markov Networks (MNs), also known as Markov Random Fields (MRFs), are PGMs with undirected edges. Similar to directed BNs, an MN graph is a representation of the joint distribution between variables (nodes), where the absence of an edge between two nodes implies conditional independence between those nodes, given the other nodes in the network.
In this review, we restrict our focus to MNs, Markov Logic Networks (MLNs), and Exponential Random Graph Models (ERGMs), which can be viewed as generalizations of random graphs [47] and are widely used in SNA [109]. The basic formulation of these models and their utility in SNA will be highlighted.

Markov Networks can be decomposed into smaller complete sub-graphs known as cliques. A clique is a maximal clique if it cannot be extended to include additional adjacent nodes. Clique representation enables a compact factorization of the probability density function (pdf). Specifically, the pdf captured by a graph G can be represented in the form:

P(X) = (1/Z) ∏_{C∈Ω} ψ_C(X_C),   (1)

where C is a maximal clique in the set of maximal cliques Ω, and ψ_C(X_C) is the clique potential. The clique potentials are positive functions that capture the variable dependence within the cliques [82]. The normalizing constant, also known as the partition function, is given as:

Z = ∑_{X∈χ} ∏_{C∈Ω} ψ_C(X_C).

Each clique potential in an MN is specified by a factor, which can be viewed as a table of weights for each combination of values of the variables in the potential. In some special cases of MNs, such as log-linear models [104], clique potentials are represented by a set of functions, termed features, with associated weights (i.e., φ_C(X_C) = log(ψ_C(X_C)), where φ_C(X_C) is a feature derived from the values of the variables in the set X_C). The Hammersley-Clifford theorem specifies the conditions under which a positive probability distribution can be represented as an MN. Specifically, the given representation (Equation 1) implies conditional independencies between the maximal cliques and is, by definition, a Gibbs measure [61]. MN specification problems, including parameter estimation and structure learning from data, can be quite challenging.
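Before turning to those estimation difficulties, the factorization in Equation 1 can be verified by brute force on a toy network. The three-variable chain A-B-C below, with two maximal cliques {A, B} and {B, C}, and all potential values are invented for this sketch; brute-force enumeration of Z is of course only feasible for tiny models.

```python
from itertools import product

# Toy Markov network over three binary variables with two invented clique
# potentials psi_AB and psi_BC, each favoring agreement between its pair:
#   P(a, b, c) = psi_AB(a, b) * psi_BC(b, c) / Z

psi_ab = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}
psi_bc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}

# Partition function: sum of the unnormalized product over all assignments.
Z = sum(psi_ab[(a, b)] * psi_bc[(b, c)]
        for a, b, c in product((0, 1), repeat=3))

def prob(a, b, c):
    """Normalized joint probability from the clique factorization."""
    return psi_ab[(a, b)] * psi_bc[(b, c)] / Z
```

The difficulty alluded to in the text is already visible here: Z couples every potential in the model, so for n binary variables the naive sum runs over 2^n assignments.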
The main difficulty in MN parameter estimation is that the maximum likelihood problem formulated with Equation 1 has no analytical solution, due to the complex expression for Z [93]. The problem of finding the optimal structure of G using available data [76] is, as for BNs, even more challenging [16]. Existing approaches to structure learning are either constraint-based or score-based (see [37, 81, 106, 123, 129, 161] for more details).

MNs have found use in SNA with the emergence of online social networks (OSNs) and digital social media (see [14] for a review of key problems in SNA). The need to capture non-causal dependencies within and between data instances (e.g., profile information) and observed relationships (e.g., hyperlinks) in these applications is exacerbated by the presence of missing or hidden data in OSNs [156]. A popular problem instance in this domain, that of (missing) user profile prediction, has been attacked using MNs [107, 117, 140]. Along with the problem of predicting missing profiles, link prediction is among the most prominent problems in Big Data SNA. Multiple variations of MNs that have been used to estimate the probability that an (unobserved) link exists between nodes include Markov Logic Networks, Relational Markov Networks, Relational Bayesian Networks, and Relational Dependency Networks [5, 23, 143, 145]. Detection of community substructures is another area of MN application [41, 108]. Social network clustering is especially challenging in a dynamic context, e.g., in Mobile Social Networks [70]. Wan et al. employed undirected graphical models (i.e., Conditional Random Fields) constructed from mobile user logs that include both communication records and user movement information [151].
Communities can then be discovered by examining and subsetting (cutting) network relationships according to labels of interest, and through the use of weighted community detection algorithms. Relational Markov Networks can be used for labeling relationships in a social network with given content and link structure [150]. Several generative models have been proposed, which are motivated by MNs, and explain the effects of selection and influence (e.g., see [2]). Modeling the channeled spread of opinions and rumors, known more generally as diffusion modeling, is an active area of research in SNA [10, 94, 119]. Several applications of diffusion models have been proposed for social networks, including, but not limited to, the spread of information [30], viral marketing [77], the spread of diseases [7], and the spread of cooperation [127]. Given a social network, for each node a corresponding random variable indicates the state of the node (e.g., product or technology adoption), and links in the network represent dependency [155].

Markov Logic Networks employ a probabilistic framework that integrates MNs with first-order logic, such that the MN weights are positive for only a small subset of meaningful features viewed as templates [117]. Formally, let F_i denote a first-order logic formula, i.e., a logical expression comprising constants, variables, functions, and predicates, and let w_i ∈ R denote a scalar weight. An MLN is then defined as a set of pairs (F_i, w_i). From the MLN, the ground Markov network, M_{L,C}, is constructed [117] with the probability distribution [145]

P(X = x) = (1/Z) exp( ∑_i w_i n_i(x) ),   (2)

where n_i(x) is the number of true groundings of F_i in x, i.e., the number of instantiated formulae that hold. Figure 3 gives an example of a ground MLN represented as a pairwise MN for two individuals [104].
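A toy rendering of Equation 2 illustrates how formula weights shape world probabilities. The example below uses a single invented formula F1 over two ground atoms, Smokes(A) and Cancer(A), and enumerates all possible worlds by brute force; it is a sketch of the probability semantics only, not an implementation of a full MLN system.

```python
from itertools import product
import math

# One weighted formula F1: Smokes(A) => Cancer(A), with an invented weight.
w1 = 1.5

def n1(smokes, cancer):
    """Number of true groundings of F1 in a world (0 or 1 here, since
    there is a single grounding over the single constant A)."""
    return int((not smokes) or cancer)

# Enumerate all worlds x = (smokes, cancer) and apply Equation 2.
worlds = list(product((0, 1), repeat=2))
unnorm = {x: math.exp(w1 * n1(*x)) for x in worlds}
Z = sum(unnorm.values())
p = {x: u / Z for x, u in unnorm.items()}
```

The only world that violates F1, (smokes=1, cancer=0), is not ruled out as it would be in pure logic; it is merely down-weighted by a factor of exp(w1), which is the key softening that MLNs add to first-order rules.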
Many problems in statistical relational learning, such as link prediction [39], social network modeling, collective classification, link-based clustering, and object identification, can be formulated using instances of MLNs [117].

Fig. 3 An example of an MLN with two entities (individuals), A and B, the unary relations "smokes" and "cancer", and the binary relation "friend". The ground predicates are denoted by eight elliptical nodes. Two formulas, F1 ("someone who smokes has cancer") and F2 ("friends either both smoke or both do not smoke"), are captured. There exist two groundings of F1 (illustrated by the edges between the "smokes" and "cancer" nodes) and four groundings of F2, captured by the rest of the edges [145].

Dierkes et al. used MLNs to investigate the influence of Mobile Social Networks on consumer decision-making behavior. With the call detail records represented by a weighted graph, MLNs were employed in conjunction with logit models as the learning technique, based on lagged neighborhood variables. The resulting MLNs were used as predictive models for the analysis of the impact of word of mouth on churn (the decision to abandon a communication service provider) and purchase decisions [36]. As mentioned above, link mining and link prediction problems can also be addressed using MLNs, since MLNs combine logical and probabilistic reasoning in a single framework [40, 131]. Furthermore, the ability of MLNs to represent complex rules by exploiting relational information makes them an appropriate alternative for collective classification (e.g., classification of publications in a citation network, or of hyperlinked webpages) [31, 34].

The Ising model and its variations form a subclass of MNs with foundations in theoretical physics [6]. The Ising model is a discrete and pairwise MN, and is popular in applications in part due to its simplicity [82]. The variables in the model, X_1, ...,
X_p, are assumed to be binary, and their joint probability is given as:

p(X; Θ) = exp( ∑_{(j,k)∈E} θ_{jk} X_j X_k − Φ(Θ) )   for all X ∈ χ,

where χ = {0, 1}^p, and Φ(Θ) is the log of the partition function,

Φ(Θ) = log ∑_{x∈χ} exp( ∑_{(j,k)∈E} θ_{jk} x_j x_k ).

Special, efficient methods exist for learning the Ising model parameters from data [116]. While the model was originally found useful for understanding magnetism and phase transitions, its utility has since expanded to image processing, neural modeling, and studies of tipping points in economic and social domains [1]. In SNA, the Ising model can be employed to analyze factors such as network sub-structures and nodal features affecting the opinion formation process. A classical example within this area is a study of medical innovation spread, namely the adoption of the drug tetracycline by 125 physicians in four small cities in Illinois [17]. Figure 4 depicts the physicians' advisory network from a data set prepared by Ron Burt from the 1966 data collected by Coleman, Katz, and Menzel [29] about the spread of medical innovation. The figure illustrates the physicians' network at two different time points and shows how physicians changed their opinions and adopted the new medication over time.

Fig. 4 The spread of new drug adoption through an advisory network of physicians (nodes marked as adopted or not adopted): two snapshots at different time points, about two years apart (from left to right). The growth dynamics in the number of adopters can be analyzed with an Ising model.

Recently, the Ising model has been used to examine social behaviors [148], including collective decision making, opinion formation, and the adoption of new technologies or products [50, 60, 84]. For example, Fellows et al. proposed a random model of the full network by modeling nodal attributes as random variates.
They utilized the new model formulation to analyze a peer social network from the National Longitudinal Study of Adolescent Health [45]. Agliari et al. proposed a model to extract the underlying dynamics of social systems based on diffusive effects and people's strategic choices to convince others [3]. Through the adaptation of an Ising-model-based cost function for social interactions between individuals, they showed by numerical simulation that a steady state is reached through the natural dynamics of social systems.

Exponential Random Graph Models (ERGMs) [154], also known as the p*-class models, are among the most widely-used network approaches to modeling social networks in recent years [47, 113, 120, 121, 134]. A social network of individuals is denoted by a graph Gs with N nodes and M edges, M ≤ N(N − 1)/2. The corresponding adjacency matrix of Gs is denoted by Y = [y_ij]_{N×N}, where y_ij is a random variable defined as follows: y_ij = 1 if there exists a link between nodes i and j (for all i, j with i ≠ j), and y_ij = 0 otherwise. Based on an ERGM, the probability of an observed network, y, is:

P(Y = y; Θ) = (1/Z) exp( ∑_{i=1}^{K} θ_i f_i(y) ),   (3)

where f_i(y), i = 1, ..., K, are called sufficient statistics [98, 102], based on configurations of the observed graph, and Θ = {θ_1, ..., θ_K} is a K-vector of parameters (K is the number of statistics used in the model). Network configurations, including but not limited to the network edge count (a tie between two actors), as well as counts of 2-stars (two ties sharing an actor) and triads of various types, are related to communication patterns among actors in a social network (see [98] for more details about network configurations). The parameters of ERGMs reflect a wide variety of possible configurations in social networks [119]. In addition, Z is the normalization constant.
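To make Equation 3 concrete, the sufficient statistics f_i(y) can be computed directly from a small adjacency matrix. The four-node graph, the choice of statistics (edges, 2-stars, triangles), and the parameter values θ below are all invented for illustration; computing Z itself requires summing over every possible graph and is omitted, which is exactly why ERGM estimation relies on MCMC methods in practice.

```python
from itertools import combinations
import math

# Invented symmetric adjacency matrix for a 4-node undirected graph.
y = [[0, 1, 1, 0],
     [1, 0, 1, 1],
     [1, 1, 0, 0],
     [0, 1, 0, 0]]
n = len(y)

# Sufficient statistics f_i(y): edge, 2-star, and triangle counts.
edges = sum(y[i][j] for i, j in combinations(range(n), 2))
two_stars = sum(y[i][j] * y[i][k]                 # pairs of ties sharing node i
                for i in range(n)
                for j, k in combinations(range(n), 2)
                if j != i and k != i)
triangles = sum(y[i][j] * y[j][k] * y[i][k]
                for i, j, k in combinations(range(n), 3))

# Unnormalized ERGM probability exp(sum_i theta_i * f_i(y)) with made-up theta.
theta = {"edges": -1.0, "two_stars": 0.2, "triangles": 0.8}
unnormalized = math.exp(theta["edges"] * edges
                        + theta["two_stars"] * two_stars
                        + theta["triangles"] * triangles)
```

The signs chosen here mimic a common qualitative pattern: a negative edge parameter penalizes density overall, while positive 2-star and triangle parameters reward local clustering.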
Some of the first proposed models, e.g., random graphs and p1 models [47], used Bernoulli and dyadic dependence structures, which are generally overly simplistic [120]. In contrast, ERGMs are based on the Markov dependence assumption [47], supposing that two possible ties are conditionally dependent when they share an actor (node). Moreover, the Markov dependence assumption can be extended to attributed networks, in which each node has a set of attributes influencing its possible incoming and outgoing ties [120] (e.g., more experienced actors in an advisory network attract more incoming ties). When nodal attributes are taken into account as random variables, ERGMs and MNs can be integrated to model the social network, due to the similarities that they share (see the Appendix and [45, 98, 144]).

ERGMs have been widely employed to study network and friendship formation [135], and to study global network structure using the local structure of the observed network [146]. The observed network is considered as one realization from the many possible networks with similar important characteristics [120]. For example, Broekel et al. used ERGMs to identify factors determining the structure of inter-organizational networks based on a single observation [15]. Schaefer et al. used SNA to study the relation between weight status and friend selection, and ERGMs to measure the effects of body mass index on friend selection [128]. Moreover, Goodreau et al. used ERGMs to examine the generative processes that give rise to widespread patterns in friendship networks [59]. Cranmer and Desmarais used ERGMs to model co-sponsorship networks in the U.S. Congress and conflict networks in the international system. They found that several previously unexplored network parameters are acceptable predictors of the U.S. House of Representatives legislative co-sponsorship network [32].
ERGMs have also been utilized to model changing communication network structure, to classify networks based on the occurrence of their local features [146], and to identify the effects of micro-level structural properties of a physician collaboration network on hospitalization cost and readmission rate [147]. Finally, an ERGM-based model for clustering nodes according to their role in the network has been reported [126].

5 Discussion

Mining social networks for knowledge and discovery has proven to be a very challenging and active research area [79]. This review focused on PGMs, and motivated their use in social networks through a variety of diverse applications. An important consideration is the issue of scalability, which is a major challenge not only for PGMs, but for SNA in general. Structural and parameter learning in high dimensions can be prohibitive. In practice, several different network structures may be plausible and nearly equally likely. Moreover, both greedy- and sampling-based search strategies can get stuck at local minima. These numerical caveats can give rise to misleading networks, generating models, and subsequent predictions. ERGMs can exhibit degeneracy [64], which occurs when the generated networks show little resemblance to the generating model. Modifications to the concept of goodness of fit have been proposed to safeguard against the problems of degeneracy [58, 71].

Social networks continuously evolve over time. The methods we have discussed either utilize a static snapshot of the social network at a given time, or a fixed template structure that captures the dynamics. Template-based dynamics have proven their utility in a few social network applications. However, they are overly simplistic in their assumptions. More realistically, social networks can give rise to several interrelated streams that contain complex overlapping relational data [83].
Moreover, communities drift as new members join, old members leave or become inactive, and activities change. PGMs are not equipped to model temporal dynamics of this type. Data stream mining is an active area of research that aims to analyze web data as a stream, upon arrival [86]. There are considerable challenges related not only to the sheer volume and speed at which data must be processed, but also to changes in the features or targets being processed. Another major challenge, which has been extensively studied, is concept drift [51]. This phenomenon occurs when the probabilities of features and targets change in time; in other words, the probability distributions change in the stream. Estimation of posterior probabilities in DBNs is similar in spirit to drift estimation, but is much more severely constrained due to the Markov assumption.

Alternative methods for modeling the dynamics of a network have been proposed, including latent modeling approaches and the adoption of smooth transition assumptions. Sarkar et al. proposed a latent space model that assumes smooth transitions between time steps, i.e., networks that change drastically from one time step to the next are assigned a lower probability. They also adopt a standard Markov assumption, which states that time step t+1 is conditionally independent of all previous time steps given t; this is the assumption adopted in our discussion of DBNs. Hoff et al. describe a latent space approach that relies on mapping actors into a social space by leveraging assumed transitive tendencies in relationships in order to estimate proximity in the latent space [69]. The iterative FacetNet algorithm frames the dynamic problem in terms of a nonnegative matrix factorization, and uses the Kullback-Leibler divergence measure to enforce temporal smoothness [95]. TESLA extends the well-known graphical LASSO method for sparse regression, and penalizes changes between time steps using l1-regularization [4].
The TESLA algorithm was tested on both biological and social networks.

In this review, we have surveyed directed and undirected PGMs and highlighted their applications in modern social networks. Despite limitations related to scalability and inference, it is our opinion that the utility of PGMs has been somewhat under-realized in the social network arena. It is indisputable that methods for understanding social networks have not kept pace with the data explosion. There are several relevant topics and opportunities in social networks, e.g., link prediction, collective classification, modeling information diffusion, entity resolution, and viral marketing, where conditional independencies can be leveraged to improve performance. PGMs implicitly convey conditional independence and provide flexible modeling paradigms, which hold tremendous promise and untapped opportunity for SNA.

6 Acknowledgements

AF and AN are supported by a Multidisciplinary University Research Initiative (MURI) grant (Number W911NF-09-1-0392) for Unified Research on Network-based Hard/Soft Information Fusion, issued by the US Army Research Office (ARO) under the program management of Dr. John Lavery. RHB is supported through NSF DMS 1312250.

References

1. Afrasiabi, M.H., Guérin, R., Venkatesh, S.: Opinion formation in Ising networks. In: Information Theory and Applications Workshop (ITA), 2013, pp. 1–10. IEEE (2013)
2. Aggarwal, C.C.: An introduction to social network data analytics. Springer (2011)
3. Agliari, E., Burioni, R., Contucci, P.: A diffusive strategic dynamics for social systems. Journal of Statistical Physics 139(3), 478–491 (2010)
4. Ahmed, A., Xing, E.P.: Recovering time-varying networks of dependencies in social and biological studies. PNAS 106(29) (2009)
5. Al Hasan, M., Zaki, M.J.: A survey of link prediction in social networks. In: Social network data analytics, pp. 243–275. Springer (2011)
6.
Anderson, C.J., Wasserman, S., Crouch, B.: A p* primer: logit models for social networks. Social Networks 21(1), 37–66 (1999)
7. Anderson, R.M., May, R.M., et al.: Population biology of infectious diseases: Part I. Nature 280(5721), 361–367 (1979)
8. Atkinson, R., Flint, J.: Accessing hidden and hard-to-reach populations: Snowball research strategies. Social Research Update 33(1), 1–4 (2001)
9. Ayday, E., Fekri, F.: A belief propagation based recommender system for online services. In: Proceedings of the fourth ACM conference on Recommender systems, pp. 217–220. ACM (2010)
10. Bach, S.H., Broecheler, M., Getoor, L., O'Leary, D.P.: Scaling MPE inference for constrained continuous Markov random fields with consensus optimization. In: NIPS, pp. 2663–2671 (2012)
11. Berk, R.A.: An introduction to sample selection bias in sociological data. American Sociological Review, pp. 386–398 (1983)
12. Berry, M.J., Linoff, G.: Data mining techniques: For marketing, sales, and customer support. John Wiley & Sons, Inc. (1997)
13. Biernacki, P., Waldorf, D.: Snowball sampling: Problems and techniques of chain referral sampling. Sociological Methods & Research 10(2), 141–163 (1981)
14. Bonchi, F., Castillo, C., Gionis, A., Jaimes, A.: Social network analysis and mining for business applications. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3), 22 (2011)
15. Broekel, T., Hartog, M.: Explaining the structure of inter-organizational networks using exponential random graph models. Industry and Innovation 20(3), 277–295 (2013)
16. Bromberg, F., Margaritis, D., Honavar, V., et al.: Efficient Markov network structure discovery using independence tests. Journal of Artificial Intelligence Research 35(2), 449 (2009)
17. Van den Bulte, C., Lilien, G.L.: Medical innovation revisited: Social contagion versus marketing effort. American Journal of Sociology 106(5), 1409–1435 (2001)
18.
Callaway, D.S., Newman, M.E.J., Strogatz, S.H., Watts, D.J.: Network robustness and fragility: Percolation on random graphs. Physical Review Letters 85, 5468 (2000)
19. Canali, C., Colajanni, M., Lancellotti, R.: Data acquisition in social networks: Issues and proposals. In: Proc. of the International Workshop on Services and Open Sources (SOS11) (2011)
20. Cha, M., Mislove, A., Adams, B., Gummadi, K.P.: Characterizing social cascades in Flickr. In: Proceedings of the 1st Workshop on Online Social Networks (WOSN'08). Seattle, WA (2008)
21. Cha, M., Mislove, A., Gummadi, K.P.: A measurement-driven analysis of information propagation in the Flickr social network. In: Proceedings of the 18th Annual World Wide Web Conference (WWW'09). Madrid, Spain (2009)
22. Chapelle, O., Zhang, Y.: A dynamic Bayesian network click model for web search ranking. In: Proceedings of the 18th international conference on World Wide Web, pp. 1–10. ACM (2009)
23. Chen, H., Ku, W.S., Wang, H., Tang, L., Sun, M.T.: LinkProbe: Probabilistic inference on large-scale social networks. In: Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pp. 290–301. IEEE (2013)
24. Chen, X., Michael, K.: Privacy issues and solutions in social network sites. Technology and Society Magazine, IEEE 31(4), 43–53 (2012)
25. Cheong, M., Lee, V.: Integrating web-based intelligence retrieval and decision-making from the Twitter trends knowledge base. In: Proceedings of the 2nd ACM workshop on Social web search and mining, pp. 1–8. ACM (2009)
26. Chib, S., Greenberg, E.: Understanding the Metropolis-Hastings algorithm. The American Statistician 49(4), 327–335 (1995)
27. Chickering, D., Heckerman, D., Meek, C.: Large-sample learning of Bayesian networks is NP-hard. Computing Science and Statistics 33 (2001)
28. Cochran, W.G.: Sampling Techniques, 3rd edn. Wiley (1977)
29. Coleman, J.S., Katz, E., Menzel, H., et al.: Medical innovation: A diffusion study.
Bobbs-Merrill Company, Indianapolis (1966)
30. Cowan, R., Jonard, N.: Network structure and the diffusion of knowledge. Journal of Economic Dynamics and Control 28(8), 1557–1575 (2004)
31. Crane, R., McDowell, L.K.: Evaluating Markov logic networks for collective classification. In: Proceedings of the 9th MLG Workshop at the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2011)
32. Cranmer, S.J., Desmarais, B.A.: Inferential network analysis with exponential random graph models. Political Analysis 19(1), 66–86 (2011)
33. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38 (1977)
34. Dhurandhar, A., Dobra, A.: Collective vs independent classification in statistical relational learning. Submitted for publication (2010)
35. Dielmann, A., Renals, S.: Dynamic Bayesian networks for meeting structuring. In: Acoustics, Speech, and Signal Processing, 2004. Proceedings (ICASSP'04). IEEE International Conference on, vol. 5, pp. V–629. IEEE (2004)
36. Dierkes, T., Bichler, M., Krishnan, R.: Estimating the effect of word of mouth on churn and cross-buying in the mobile phone market with Markov logic networks. Decision Support Systems 51(3), 361–371 (2011)
37. Ding, S.: Learning undirected graphical models with structure penalty. J. CoRR (2011)
38. National Security Research Division, RAND: RAND database of worldwide terrorism incidents (1948). URL http://www.rand.org/nsrd/projects/terrorism-incidents.html
39. Domingos, P., Kok, S., Lowd, D., Poon, H., Richardson, M., Singla, P.: Markov logic. In: Probabilistic inductive logic programming, pp. 92–117. Springer (2008)
40. Domingos, P., Lowd, D., Kok, S., Nath, A., Poon, H., Richardson, M., Singla, P.: Markov logic: A language and algorithms for link mining. In: Link Mining: Models, Algorithms, and Applications, pp. 135–161. Springer (2010)
41.
Du, N., Wu, B., Pei, X., Wang, B., Xu, L.: Community detection in large-scale social networks. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pp. 16–25. ACM (2007)
42. Efron, B.: Size, power and false discovery rates. The Annals of Statistics 35(4), 1351–1377 (2007)
43. Fang, L., LeFevre, K.: Privacy wizards for social networking sites. In: Proceedings of the 19th international conference on World Wide Web, pp. 351–360. ACM (2010)
44. Faugier, J., Sargeant, M.: Sampling hard to reach populations. Journal of Advanced Nursing 26(4), 790–797 (1997)
45. Fellows, I., Handcock, M.S.: Exponential-family random network models. arXiv preprint arXiv:1208.0121 (2012)
46. Fienberg, S.E.: A brief history of statistical models for network analysis and open challenges. Journal of Computational and Graphical Statistics 21(4), 825–839 (2012)
47. Frank, O., Strauss, D.: Markov graphs. Journal of the American Statistical Association 81(395), 832–842 (1986)
48. Freeman, L.: The development of social network analysis. Empirical Press (2004)
49. Friedman, N., Murphy, K., Russell, S.: Learning the structure of dynamic probabilistic networks. In: Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pp. 139–147. Morgan Kaufmann Publishers Inc. (1998)
50. Galam, S.: Rational group decision making: A random field Ising model at t = 0. Physica A: Statistical Mechanics and its Applications 238(1), 66–80 (1997)
51. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Computing Surveys (CSUR) 46(4), 44 (2014)
52. Gelman, A., Shirley, K.: Inference from simulations and monitoring convergence. In: S. Brooks, A. Gelman, G.I. Jones, X.L. Meng (eds.) Handbook of Markov Chain Monte Carlo, pp. 163–174. Chapman & Hall: CRC Handbooks of Modern Statistical Methods (2011)
53.
Getoor, L.: Social network datasets (2012). URL http://www.cs.umd.edu/~sen/lbcproj/LBC.html
54. Gjoka, M., Kurant, M., Butts, C.T., Markopoulou, A.: Walking in Facebook: A case study of unbiased sampling of OSNs. In: INFOCOM, 2010 Proceedings IEEE, pp. 1–9 (2010)
55. Goldenberg, A., Moore, A.: Tractable learning of large Bayes net structures from sparse data. In: Proceedings of the twenty-first international conference on Machine learning, p. 44. ACM (2004)
56. Goldenberg, A., Zheng, A.X., Fienberg, S.E., Airoldi, E.M.: A survey of statistical network models. Foundations and Trends in Machine Learning 2(2), 129–233 (2010)
57. Goodman, L.A.: Snowball sampling. Annals of Mathematical Statistics 32(1), 148–170 (1961)
58. Goodreau, S.M.: Advances in exponential random graph (p*) models applied to a large social network. Social Networks 29(2), 231–248 (2007)
59. Goodreau, S.M., Kitts, J.A., Morris, M.: Birds of a feather, or friend of a friend? Using exponential random graph models to investigate adolescent social networks. Demography 46(1), 103–125 (2009)
60. Grabowski, A., Kosiński, R.: Ising-based model of opinion formation in a complex network of interpersonal interactions. Physica A: Statistical Mechanics and its Applications 361(2), 651–664 (2006)
61. Hammersley, J.M., Clifford, P.: Markov fields on finite graphs and lattices (1971)
62. Handcock, M., Hunter, D., Butts, C., Goodreau, S., Morris, M.: statnet: An R package for the statistical analysis and simulation of social networks. Manual, University of Washington (2006)
63. Handcock, M.S., Gile, K.J., et al.: Modeling social networks from sampled data. The Annals of Applied Statistics 4(1), 5–25 (2010)
64. Handcock, M.S., Robins, G., Snijders, T.A., Moody, J., Besag, J.: Assessing degeneracy in statistical models of social networks. Tech. rep., Working paper (2003)
65. He, J., Chu, W.W., Liu, Z.V.: Inferring privacy information from social networks.
In: Intelligence and Security Informatics, pp. 154–165. Springer (2006)
66. Heckathorn, D.D.: Respondent-driven sampling: a new approach to the study of hidden populations. Social Problems, pp. 174–199 (1997)
67. Heckerman, D.: Bayesian networks for data mining. Data Mining and Knowledge Discovery 1, 79–119 (1997)
68. Heckerman, D.: A tutorial on learning with Bayesian networks. Springer (2008)
69. Hoff, P.D., Raftery, A.E., Handcock, M.S.: Latent space approaches to social network analysis. Journal of the American Statistical Association 97(460), 1090–1098 (2002)
70. Humphreys, L.: Mobile social networks and social practice: A case study of Dodgeball. Journal of Computer-Mediated Communication 13(1), 341–360 (2007)
71. Hunter, D.R., Goodreau, S.M., Handcock, M.S.: Goodness of fit of social network models. Journal of the American Statistical Association 103(481) (2008)
72. Jabeur, L.B., Tamine, L., Boughanem, M.: Featured tweet search: Modeling time and social influence for microblog retrieval. In: Web Intelligence and Intelligent Agent Technology (WI-IAT), 2012 IEEE/WIC/ACM International Conferences on, vol. 1, pp. 166–173. IEEE (2012)
73. Jabeur, L.B., Tamine, L., Boughanem, M.: Uprising microblogs: A Bayesian network retrieval model for tweet search. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing, pp. 943–948. ACM (2012)
74. Jansen, B.J., Zhang, M., Sobel, K., Chowdury, A.: Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology 60(11), 2169–2188 (2009)
75. Java, A., Song, X., Finin, T., Tseng, B.: Why we twitter: understanding microblogging usage and communities. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop on Web mining and social network analysis, pp. 56–65. ACM (2007)
76. Karger, D., Srebro, N.: Learning Markov networks: maximum bounded tree-width graphs. In: Proc. SIAM-ACM Symposium on Discrete Algorithms, pp.
392–401 (2001)
77. Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a social network. In: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 137–146. ACM (2003)
78. Kjaerulff, U.: A computational scheme for reasoning in dynamic probabilistic networks. In: Proceedings of the Eighth international conference on Uncertainty in artificial intelligence, pp. 121–129. Morgan Kaufmann Publishers Inc. (1992)
79. Kleinberg, J.M.: Challenges in mining social network data: processes, privacy, and paradoxes. In: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 4–5. ACM (2007)
80. Koelle, D., Pfautz, J., Farry, M., Cox, Z., Catto, G., Campolongo, J.: Applications of Bayesian belief networks in social network analysis. In: Proc. of the 4th Bayesian Modeling Applications Workshop, UAI Conference (2006)
81. Koller, D., Friedman, N.: Probabilistic graphical models: principles and techniques. Massachusetts Institute of Technology (2009)
82. Koller, D., Friedman, N.: Probabilistic graphical models: principles and techniques. MIT Press (2009)
83. Koren, Y.: Collaborative filtering with temporal dynamics. Communications of the ACM 53(4), 447–455 (2009)
84. Krause, S.M., Böttcher, P., Bornholdt, S.: Mean-field-like behavior of the generalized voter-model-class kinetic Ising model. Physical Review E 85(3), 031126 (2012)
85. Krebs, V.E.: Mapping networks of terrorist cells. Connections 24(3), 43–52 (2002)
86. Krempl, G., Žliobaite, I., Brzeziński, D., Hüllermeier, E., Last, M., Lemaire, V., Noack, T., Shaker, A., Sievi, S., Spiliopoulou, M., et al.: Open challenges for data stream mining research. ACM SIGKDD Explorations Newsletter 16(1), 1–10 (2014)
87. Kurant, M., Markopoulou, A., Thiran, P.: On the bias of BFS (breadth first search), pp. 1–8 (2010)
88.
Kuter, U., Golbeck, J.: SUNNY: A new algorithm for trust inference in social networks using probabilistic confidence models. In: AAAI, vol. 7, pp. 1377–1382 (2007)
89. Kwak, H., Lee, C., Park, H., Moon, S.: What is Twitter, a social network or a news media? In: Proceedings of the 19th international conference on World Wide Web, pp. 591–600. ACM (2010)
90. Laurila, J.K., Gatica-Perez, D., Aad, I., Blom, J., Bornet, O., Do, T.M.T., Dousse, O., Eberle, J., Miettinen, M.: The mobile data challenge: Big data for mobile computing research. In: Proceedings of the Workshop on the Nokia Mobile Data Challenge, in Conjunction with the 10th International Conference on Pervasive Computing, pp. 1–8 (2012)
91. Lauritzen, S.L.: Graphical models. Oxford University Press (1996)
92. Lee, S.H., Kim, P.J., Jeong, H.: Statistical properties of sampled networks. Phys. Rev. E 73, 016102 (2006). DOI 10.1103/PhysRevE.73.016102
93. Lee, S.I., Ganapathi, V., Koller, D.: Efficient structure learning of Markov networks using l1-regularization. In: Advances in Neural Information Processing Systems, pp. 817–824 (2006)
94. Leenders, R.: Longitudinal behavior of network structure and actor attributes: modeling interdependence of contagion and selection. Evolution of Social Networks 1 (1997)
95. Lin, Y.R., Chi, Y., Zhu, S., Sundaram, H., Tseng, B.L.: FacetNet: a framework for analyzing communities and their evolutions in dynamic networks. In: Proceedings of the 17th international conference on World Wide Web, WWW '08, pp. 685–694. ACM (2008)
96. Lipford, H.R., Besmer, A., Watson, J.: Understanding privacy settings in Facebook with an audience view. UPSEC 8, 1–8 (2008)
97. Luo, Z., Tang, J., Wang, T.: Propagated opinion retrieval in Twitter. In: Web Information Systems Engineering–WISE 2013, pp. 16–28. Springer (2013)
98. Lusher, D., Koskinen, J., Robins, G.: Exponential Random Graph Models for Social Networks: Theory, Methods, and Applications.
Cambridge University Press (2012)
99. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Byers, A.H.: Big data: The next frontier for innovation, competition, and productivity (2011)
100. McCreadie, R., Soboroff, I., Lin, J., Macdonald, C., Ounis, I., McCullough, D.: On building a reusable Twitter corpus. In: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pp. 1113–1114. ACM (2012)
101. Mislove, A., Marcon, M., Gummadi, K.P., Druschel, P., Bhattacharjee, B.: Measurement and analysis of online social networks. In: Proceedings of the 5th ACM/USENIX Internet Measurement Conference (IMC'07). San Diego, CA (2007)
102. Morris, M., Handcock, M.S., Hunter, D.R.: Specification of exponential-family random graph models: terms and computational aspects. Journal of Statistical Software 24(4), 1548 (2008)
103. Murphy, K.P.: Dynamic Bayesian networks: representation, inference and learning. Ph.D. thesis, University of California (2002)
104. Murphy, K.P.: Machine learning: a probabilistic perspective. The MIT Press (2012)
105. Nagmoti, R., Teredesai, A., De Cock, M.: Ranking approaches for microblog search. In: Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, vol. 1, pp. 153–157. IEEE (2010)
106. Netrapalli, P., Banerjee, S., Sanghavi, S., Shakkottai, S.: Greedy learning of Markov network structure. In: Proc. of Allerton Conf. on Communication, Control and Computing, Monticello, USA (2010)
107. Neville, J., Jensen, D.: Relational dependency networks. The Journal of Machine Learning Research 8, 653–692 (2007)
108. Newman, M.E.: Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103(23), 8577–8582 (2006)
109. Newman, M.E., Watts, D.J., Strogatz, S.H.: Random graph models of social networks.
Proceedings of the National Academy of Sciences 99(suppl 1), 2566–2572 (2002)
110. O'Connor, B., Krieger, M., Ahn, D.: TweetMotif: Exploratory search and topic summarization for Twitter. In: ICWSM (2010)
111. Ounis, I., Macdonald, C., Lin, J., Soboroff, I.: Overview of the TREC-2011 microblog track. In: Proceedings of the 20th Text REtrieval Conference (TREC 2011) (2011)
112. Park, J., Newman, M.E.: Statistical mechanics of networks. Physical Review E 70(6), 066117 (2004)
113. Pattison, P., Wasserman, S.: Logit models and logistic regressions for social networks: II. Multivariate relations. British Journal of Mathematical and Statistical Psychology 52(2), 169–193 (1999)
114. Pochampally, R., Varma, V.: User context as a source of topic retrieval in Twitter. In: Workshop on Enriching Information Retrieval (with ACM SIGIR) (2011)
115. Rabiner, L.R.: A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
116. Ravikumar, P., Wainwright, M.J., Lafferty, J.D., et al.: High-dimensional Ising model selection using l1-regularized logistic regression. The Annals of Statistics 38(3), 1287–1319 (2010)
117. Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62(1-2), 107–136 (2006)
118. Rinaldo, A., Fienberg, S.E., Zhou, Y., et al.: On the geometry of discrete exponential families with application to exponential random graph models. Electronic Journal of Statistics 3, 446–484 (2009)
119. Robins, G., Pattison, P., Elliott, P.: Network models for social influence processes. Psychometrika 66(2), 161–189 (2001)
120. Robins, G., Pattison, P., Kalish, Y., Lusher, D.: An introduction to exponential random graph (p*) models for social networks. Social Networks 29(2), 173–191 (2007)
121. Robins, G., Pattison, P., Wasserman, S.: Logit models and logistic regressions for social networks: III. Valued relations. Psychometrika 64(3), 371–394 (1999)
122.
Robins, G., Snijders, T., Wang, P., Handcock, M., Pattison, P.: Recent developments in exponential random graph (p*) models for social networks. Social Networks 29(2), 192–215 (2007)
123. Roy, S., Lane, T., Werner-Washburne, M.: Learning structurally consistent undirected probabilistic graphical models. In: Proc. ICML (2009)
124. Sakaki, T., Okazaki, M., Matsuo, Y.: Earthquake shakes Twitter users: real-time event detection by social sensors. In: Proceedings of the 19th international conference on World Wide Web, pp. 851–860. ACM (2010)
125. Salganik, M.J., Heckathorn, D.D.: Sampling and estimation in hidden populations using respondent-driven sampling. Sociological Methodology 34(1), 193–240 (2004)
126. Salter-Townshend, M., Murphy, T.B.: Role analysis in networks using mixtures of exponential random graph models. Journal of Computational and Graphical Statistics (just-accepted), 00–00 (2014)
127. Santos, F.C., Pacheco, J.M., Lenaerts, T.: Evolutionary dynamics of social dilemmas in structured heterogeneous populations. Proceedings of the National Academy of Sciences of the United States of America 103(9), 3490–3494 (2006)
128. Schaefer, D.R., Simpkins, S.D.: Using social network analysis to clarify the role of obesity in selection of adolescent friends. American Journal of Public Health (0), e1–e7 (2014)
129. Schmidt, M., Murphy, K., Fung, G., Rosales, R.: Structure learning in random fields for heart motion abnormality detection. In: CVPR (2010)
130. Scott, J., Carrington, P.J.: The SAGE handbook of social network analysis. SAGE Publications (2011)
131. Singla, P., Domingos, P.: Lifted first-order belief propagation. In: AAAI, vol. 8, pp. 1094–1099 (2008)
132. Smyth, P., Heckerman, D., Jordan, M.I.: Probabilistic independence networks for hidden Markov probability models. Neural Computation 9(2), 227–269 (1997)
133. Snijders, T.A.: Estimation on the basis of snowball samples: how to weight?
Bulletin de Méthodologie Sociologique 36(1), 59–70 (1992)
134. Snijders, T.A., Pattison, P.E., Robins, G.L., Handcock, M.S.: New specifications for exponential random graph models. Sociological Methodology 36(1), 99–153 (2006)
135. Song, X., Jiang, S., Yan, X., Chen, H.: Collaborative friendship networks in online healthcare communities: An exponential random graph model analysis. In: Smart Health, pp. 75–87. Springer (2014)
136. Sparrow, M.K.: The application of network analysis to criminal intelligence: An assessment of the prospects. Social Networks 13(3), 251–274 (1991)
137. Spiliopoulou, M.: Evolution in social networks: A survey. In: Social network data analytics, pp. 149–175. Springer (2011)
138. Stanford: Stanford Network Analysis Package (SNAP) (2011). URL http://snap.stanford.edu
139. Stutzbach, D., Rejaie, R., Duffield, N., Sen, S., Willinger, W.: On unbiased sampling for unstructured peer-to-peer networks. Networking, IEEE/ACM Transactions on 17(2), 377–390 (2009)
140. Taskar, B., Abbeel, P., Koller, D.: Discriminative probabilistic models for relational data. In: Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence, pp. 485–492. Morgan Kaufmann Publishers Inc. (2002)
141. National Consortium for the Study of Terrorism and Responses to Terrorism (START): International Center for Political Violence and Terrorism Research (ICPVTR) (2012). URL http://www.pvtr.org/ICPVTR/
142. National Consortium for the Study of Terrorism and Responses to Terrorism (START): GTD Global Terrorism Database (2013). URL http://www.start.umd.edu/gtd/
143. Thi, D.B., Hoang, T.A.N.: Features extraction for link prediction in social networks. In: Computational Science and Its Applications (ICCSA), 2013 13th International Conference on, pp. 192–195. IEEE (2013)
144. Thiemichen, S., Friel, N., Caimo, A., Kauermann, G.: Bayesian exponential random graph models with nodal random effects. arXiv preprint arXiv:1407.6895 (2014)
145. Tresp, V., Nickel, M.: Relational models.
Encyclopedia of Social Network Analysis and Mining. Ed. by J. Rokne and R. Alhajj. Heidelberg: Springer (2013)
146. Uddin, S., Hamra, J., Hossain, L.: Exploring communication networks to understand organizational crisis using exponential random graph models. Computational and Mathematical Organization Theory 19(1), 25–41 (2013)
147. Uddin, S., Hossain, L., Hamra, J., Alam, A.: A study of physician collaborations through social network and exponential random graph. BMC Health Services Research 13(1), 234 (2013)
148. Vega-Redondo, F.: Complex social networks, vol. 44. Cambridge University Press (2007)
149. Viswanath, B., Mislove, A., Cha, M., Gummadi, K.P.: On the evolution of user interaction in Facebook. In: Proceedings of the 2nd ACM SIGCOMM Workshop on Social Networks (WOSN'09). Barcelona, Spain (2009)
150. Wan, H., Lin, Y., Wu, Z., Huang, H.: A community-based pseudolikelihood approach for relationship labeling in social networks. In: Machine Learning and Knowledge Discovery in Databases, pp. 491–505. Springer (2011)
151. Wan, H.Y., Lin, Y.F., Wu, Z.H., Huang, H.K.: Discovering typed communities in mobile social networks. Journal of Computer Science and Technology 27(3), 480–491 (2012)
152. Wang, D., Li, Z., Xie, G.: Towards unbiased sampling of online social networks. In: Communications (ICC), 2011 IEEE International Conference on, pp. 1–5 (2011)
153. Wang, Y., Vassileva, J.: Bayesian network-based trust model. In: Web Intelligence, 2003. WI 2003. Proceedings. IEEE/WIC International Conference on, pp. 372–378. IEEE (2003)
154. Wasserman, S., Pattison, P.: Logit models and logistic regressions for social networks: I. An introduction to Markov graphs and p*. Psychometrika 61(3), 401–425 (1996)
155. Wortman, J.: Viral marketing and the diffusion of trends on social networks (2008)
156. Xiang, R., Neville, J.: Collective inference for network data with copula latent Markov networks.
In: Proceedings of the sixth ACM international conference on Web search and data mining, pp. 647–656. ACM (2013)
157. Yang, T., Chi, Y., Zhu, S., Gong, Y., Jin, R.: Detecting communities and their evolutions in dynamic social networks: a Bayesian approach. Machine Learning 82, 157–189 (2011)
158. Yang, X., Guo, Y., Liu, Y.: Bayesian-inference based recommendation in online social networks. In: INFOCOM, 2011 Proceedings IEEE, pp. 551–555. IEEE (2011)
159. Yang, X., Guo, Y., Liu, Y.: Bayesian-inference-based recommendation in online social networks. Parallel and Distributed Systems, IEEE Transactions on 24(4), 642–651 (2013)
160. Ye, S., Lang, J., Wu, S.F.: Crawling online social graphs. In: Proceedings of the 12th International Asia-Pacific Web Conference, pp. 236–242 (2010)
161. Zhu, J., Lao, N., Xing, E.P.: Grafting-Light: fast, incremental feature selection and structure learning of Markov random fields. In: Proc. 16th ACM SIGKDD (2010)
162. Zweig, G., Russell, S.: Speech recognition with dynamic Bayesian networks (1998)

7 Appendix

Similarity between MNs and ERGMs. While MNs and ERGMs have been developed in different scientific domains, they both specify exponential family distributions. MN models treat social network nodes as random variables, and hence, their utility is most obvious in modeling processes on networks; ERGMs, on the other hand, have been conceptualized to model network formation, where it is the edge presence indicators that are treated as random variables (these random variables are dependent if their corresponding edges share a node). But in fact, this application-related difference in what to treat as random is not fundamental. This Appendix more rigorously exposes the similarity between MNs and ERGMs by re-defining an ERGM as a PGM. We begin, however, by reviewing the branch of literature devoted exclusively to ERGMs.
Similar to MNs, a well-discussed problem of ERGMs for analyzing social networks is the challenge of parameter estimation [122] due to the lack of sufficient observed data. Robins et al. outline this and some other problems associated with ERGMs, e.g., degeneracy in model selection and bimodal distribution shapes [122] (see also [62, 64, 118, 134]). The roots of ERGMs in the Principle of Maximum Entropy [112] and the Hammersley-Clifford theorem have been previously pointed out [56, 119]. Here, we illustrate how MNs and ERGMs are similar in form and structure using the most popular sufficient statistics in ERGMs. Under the assumption of Markov dependence, for a given social network, one can build a corresponding Markov network via the following conversion: 1) each node in the Markov network corresponds to an edge in the social network (Fienberg called this construct a "usual graphical model" for ERGMs [46]); 2) when two edges share a node in the social network, a link is built between the two corresponding nodes in the Markov network. Corresponding to each possible edge in a social network, a node in the MN is introduced; note that the original social network and the MN are not the same network!

Consider an ERGM whose sufficient statistics include the number of edges, $f_1(y)$, the numbers of $k$-stars, $f_i(y)$, $i = 2, \ldots, N-1$, and the number of triangles, $f_N(y)$. In an MN, a maximum entropy (maxent) model proposes the following form for the internal energy of the system: $E_c(x) = -\sum_i \alpha_{ci} g_{ci}$, where $g_{ci}$ is the $i$th feature of clique $c \in \Omega$ and $\alpha_{ci}$ is its corresponding weight in $G$. Thus, $\psi_c(x) = \exp\{\beta_c \sum_{i=1}^{N} \alpha_{ci} g_{ci}\}$. Since there are too many parameters in the MN, their number can be reduced by imposing homogeneity constraints similar to those of ERGMs [120]. Before imposing such constraints, the following facts are needed. It is straightforward to demonstrate that $G$ encompasses cliques of size $\{3, \ldots, N-1\}$. In addition, all substructures in $G_s$ can be redefined by features in $G$. Considering these points, we can rewrite the joint probability of all variables represented by the MN, $P(X)$, as follows:

$$P(X) = \frac{1}{Z(\alpha)} \prod_{c=1}^{C} \exp\left(\beta_c \sum_{i=1}^{N} \alpha_{ci} g_{ci}\right) = \frac{1}{Z(\alpha)} \exp\left(\sum_{c=1}^{C} \beta_c \sum_{i=1}^{N} \alpha_{ci} g_{ci}\right). \quad (4)$$

In (4), $Z(\alpha)$ is the partition function, which is a function of the parameters. The homogeneity assumption here means $\alpha_{ci} = \theta'_i$ for all $c = 1, \ldots, C$; then $P(X)$ is:

$$P(X) = \frac{1}{Z(\theta')} \exp\left(\sum_{i=1}^{N} \theta'_i \sum_{c=1}^{C} \beta_c g_{ci}\right). \quad (5)$$

In (5), let $Z' = Z(\theta')$. In addition, denote $\sum_{c=1}^{C} \beta_c g_{ci}$ by $f'_i$, meaning that the counts of substructure $i$ in all cliques $c$ are added up with weights $\beta_c$.

Fig. 5 A social network with five actors (left) and its corresponding Markov network (right).

Finally, replacing $f'_i$ in (5):

$$P(X) = \frac{1}{Z'} \exp\left(\sum_{i=1}^{N} \theta'_i f'_i\right). \quad (6)$$

Comparing the ERGM distribution $P(Y = y)$ with (6) confirms that ERGMs and MNs are similar, and under the following conditions they are identical: 1) $\theta_i = \theta'_i$; 2) $f_i = f'_i = \sum_{c=1}^{C} \beta_c g_{ci}$.

The following numerical example depicts the similarities between ERGMs and MNs. A social network with five actors, $N = 5$, is assumed (Figure 5, left). Under the Markov dependency assumption, there exists a unique corresponding Markov network, shown in Figure 5 (right), with 10 nodes. There are 15 cliques (so-called factors) of size three or four, $\Phi = \{\phi_1(y_{12}, y_{13}, y_{14}, y_{15}), \ldots, \phi_{15}(y_{24}, y_{45}, y_{25})\}$. As already mentioned, the factor associated with each clique is an exponential function of its internal energy. For instance: $\phi_1(x) = \frac{1}{\lambda} \exp\{-\beta_1 E_1(y_{12}, y_{13}, y_{14}, y_{15})\}$, where $E_1(x) = -\sum_i \alpha_{1i} g_{1i}$ and $\lambda$ is the distribution parameter.
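To make the ERGM side of the correspondence concrete, the sketch below computes the edge, 2-star, and triangle counts for a small undirected graph and evaluates the unnormalized exponential-family weight exp(sum_i theta_i f_i(y)). The toy edge set, parameter values, and function names are illustrative assumptions of ours, not taken from the paper, and the partition function is deliberately omitted.

```python
import itertools
import math

def ergm_statistics(n, edges):
    """Edge, 2-star, and triangle counts for an undirected graph on n nodes."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n_edges = len(edges)
    # A 2-star is an unordered pair of edges sharing a node: sum of C(deg, 2).
    n_two_stars = sum(math.comb(len(adj[i]), 2) for i in range(n))
    # A triangle is a set of three mutually connected nodes.
    n_triangles = sum(1 for a, b, c in itertools.combinations(range(n), 3)
                      if b in adj[a] and c in adj[a] and c in adj[b])
    return n_edges, n_two_stars, n_triangles

def unnormalized_weight(theta, stats):
    """exp(sum_i theta_i * f_i(y)); dividing by Z would give a probability."""
    return math.exp(sum(t * f for t, f in zip(theta, stats)))

# An illustrative five-actor network (the edge set is ours, not Figure 5's).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
stats = ergm_statistics(5, edges)
print(stats)                                        # (edges, 2-stars, triangles)
print(unnormalized_weight([-1.0, 0.2, 0.5], stats)) # illustrative parameters
```

Evaluating this weight for every possible graph on the same node set and normalizing would yield the full ERGM distribution; the MN view organizes exactly the same computation clique by clique.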
This simple example shows how ERGMs and MNs are the same in terms of the underlying concept and the expressed probability distribution.

Social Networks 40 (2015) 154–162

On efficient use of entropy centrality for social network analysis and community detection

Alexander G. Nikolaev ∗, Raihan Razib, Ashwin Kucheriya
Department of Industrial and Systems Engineering, 438 Bell Hall, State University of New York at Buffalo, Buffalo, NY 14260, United States

Keywords: Social network modeling; Centrality; Entropy; Community detection; Clustering

Abstract: This paper motivates and interprets entropy centrality, the measure understood as the entropy of flow destination in a network. The paper defines a variation of this measure based on a discrete, random Markovian transfer process and showcases its increased utility over the originally introduced path-based network entropy centrality. The re-defined entropy centrality allows for varying locality in centrality analyses, thereby distinguishing locally central and globally central network nodes. It also leads to a flexible and efficient iterative community detection method. Computational experiments for clustering problems with known ground truth showcase the effectiveness of the presented approach. © 2014 Elsevier B.V. All rights reserved.

1. Introduction

Despite the abundance of existing methods for measuring centrality in social networks, new research challenges and opportunities continue to emerge. In application to large network datasets, computational efficiency of evaluation becomes a major indicator of the utility of centrality measures.
Even more importantly, the typically reliable path-based measures lose sensitivity when the number of paths contributing to their formulae grows too large, making the evaluation of node centrality with respect to nearby neighbors (as opposed to the whole network) particularly difficult. In searching for answers to these new challenges, it is desirable to design centrality measures with solid grounding in theory, while not compromising the interpretability sought by social science practitioners. This paper develops a centrality measure whose computation for a given node does not require dyad-based path enumeration. Instead, the presented measure relies on an absorbing Markovian process evolving over finite time, which allows for matrix multiplication-based computation of centrality. Depending on the absorption rate and evolution time, the presented measure enables centrality analysis at varying localities around a node of interest, thereby distinguishing locally central and globally central network nodes. The measure offers an information theory-based approach to measuring centrality, and takes a particular, previously unoccupied spot in the typology of flow-based centrality metrics.

∗ Corresponding author. Tel.: +1 716 645 4710. E-mail address: [email protected] (A.G. Nikolaev). http://dx.doi.org/10.1016/j.socnet.2014.10.002

Different measures of centrality capture different aspects of what it means for a node to be “central” to the network. In his seminal paper, Freeman (1979) argued that node degree centrality, the number of direct links incident to a node, indexes the node’s activity; node betweenness centrality, based on the position of a node with respect to the all-pair shortest paths in a network, exhibits the node’s potential for network control; and closeness centrality, the sum of geodesic distances from a node to all the other nodes, reflects its communication independence or efficiency.
Borgatti (2005) conceptualized a typology of centrality measures based on the ways that traffic flows through the network. Two characteristics, the route the traffic follows (geodesics, paths, trails, or walks) and the method of propagation (parallel duplication, serial duplication, or transfer), define the two-dimensional typology. Each measure of centrality makes assumptions about the importance of the various types of traffic flow, and hence, each measure of centrality can be assessed by where it falls in the typology. For example, betweenness centrality is perfect for networks featuring flows along geodesics. A node with high betweenness centrality is essentially a traffic checkpoint that can shut down the flow. At the same time, betweenness is an inappropriate measure in networks where flow is not constrained to follow geodesics. Non-geodesic paths avoid the checkpoints altogether, making an alternative measure essential. Over the years, researchers have proposed a number of different centrality measures, including eigenvector centrality (Bonacich, 1972), information centrality (Stephenson and Zelen, 1989), subgraph centrality (Estrada and Rodriguez-Velazquez, 2005), alpha centrality (Bonacich and Lloyd, 2001), etc. However, their meaning with respect to Borgatti’s typology has not always been clearly defined or analyzed. Tutzauer (2007) began to address this issue and proposed a centrality measure for networks characterized by path-based transfer flows. The path-based transfer model assumes that an object travels from a particular node (the one whose centrality is being evaluated) to a destination (the node itself or one of its neighbors) along a random path.
More specifically, a path is sequentially built: if the flow-originating node is randomly selected to be the next in the sequence, then the flow is over before it begins; otherwise, the object is randomly passed to one of the original node’s immediate neighbors. Given that the object has arrived at the new node, the next transfer step destination is then randomly chosen from among its neighbors (including the current node, but not including any of the previously visited nodes), and again the flow either stops (if the current node in the sequence is selected) or continues on in the same fashion (if a different node is selected). For the described transfer model, the centrality of a given node can be defined as the entropy of the transfer’s final destination. In other words, it can be expressed via the probabilities of transfer paths from the node to each of the other nodes. Although the motivation for this entropy-based measure is intuitively and technically clear, the research community has been slow to adopt it for application purposes, largely due to the need for exhaustive path enumeration in evaluating the defined centrality. This paper develops the idea of Tutzauer (2007), and presents a new, high-utility entropy centrality measure based on a discrete Markovian transfer process. In the presented model, a transferred object randomly walks through a network; then, the resulting measure, the walk destination entropy, can be efficiently computed, which opens new ways for insightful, computationally efficient analyses of networks. The structure of the paper is as follows. Section 2 introduces essential notation and the fundamentals of the path-transfer flow process, builds a Markov model for the study of this process, presents an expression for the entropy centrality measure, and offers an illustrative computational example.
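The path-transfer process just described can be simulated directly. The sketch below is an illustrative Python rendition (the function name and the toy star network are assumptions, not from the paper): at each step the current holder picks uniformly among itself and its unvisited neighbors, and picking itself ends the flow.

```python
import random

def path_transfer_destination(adj, origin, rng=random):
    """Simulate one Tutzauer-style path-transfer flow: at each step the
    current holder picks uniformly among itself and its not-yet-visited
    neighbors; picking itself absorbs (ends) the flow."""
    visited = {origin}
    current = origin
    while True:
        candidates = [v for v in adj[current] if v not in visited]
        choice = rng.choice([current] + candidates)
        if choice == current:        # self-selection: the flow stops here
            return current
        visited.add(choice)
        current = choice

# Assumed toy network: a star with center 0 linked to nodes 1, 2, 3.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
dest = path_transfer_destination(adj, origin=0)
```

Repeating the simulation many times and taking the entropy of the empirical destination distribution approximates the path-based entropy centrality of the origin node, which is exactly what exhaustive path enumeration computes analytically.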
Section 3 uses entropy centrality to design an algorithm for community detection in networks, and reports computational results with the algorithm applied to clustering problems with known ground truth. Section 4 offers discussion and concluding remarks.

2. Model description

2.1. Mathematical preliminaries

The mathematical representation of a network is a directed or undirected graph G = (V, E), where V = {1, 2, . . ., N} is a finite, nonempty set of nodes (vertices), and E is a relation (a tie configuration) on V. The elements of E are called edges. The edge (i, j) ∈ E is incident with the vertices i and j, and i and j are incident with the edge (i, j) ∈ E. Moreover, (i, j) ∈ E is a link if i ≠ j and a loop if i = j. The adjacency matrix of G has elements b_ij, i = 1, 2, . . ., N, j = 1, 2, . . ., N, such that b_ij = 1 if nodes i and j in the network are connected with an edge and b_ij = 0 otherwise.

2.2. Centrality and entropy connection

To motivate the connection between the centrality of a given node and the concept of entropy, consider a network of friends transferring an object among themselves. The more central the originating node is, the more difficult it is to predict the object’s final destination. If the node is central, the object has a greater probability of traveling far in multiple potential directions. In contrast, a less central node has a more limited choice of immediate transfer options, and the process is more likely to stop (be absorbed) before the number of transfer options increases, which makes its destination more predictable. This idea can be more easily understood if one considers an extreme example of a network of one extrovert person and many introverts. An introvert is a node in the network with no or very few incident links, while an extrovert is a node adjacent to many nodes in the network.
Assume that, according to a random rule, an object transfer process can terminate after the object is passed from one node to another, i.e., the object will eventually be absorbed by some node, termed the destination node. In the case of high absorption probabilities, if the object transfer process originates from the extrovert (following the transfer process described above), the probability that it ends up at any given node is close to 1/N. In contrast, if the transfer process originates from the introvert, then the flow first needs to reach the hub to go beyond it, limiting the likelihood that “far-away” nodes are reached at all. The level of uncertainty of the object’s destination, as a function of its origin, can be captured as destination entropy. The concept of entropy was first introduced in physics and later developed in the information and communication sciences; entropy enjoys distinct and intuitive interpretations in multiple applied domains. In adopting it for use in social network analysis, one avoids having to assess a node’s position with respect to paths connecting all node pairs, and instead focuses on the node’s potential to diversify flow propagation.

2.3. Path transfer and random walk flows as foundations for entropy centrality computation

In assessing the value of node position using network flow, researchers have historically focused on paths as channels that flow may follow. Entropy centrality does not explicitly measure the ability of a node to interfere with path-based exchanges between other nodes; instead, it views a node of interest as a flow originator. The treatment of paths and flow types, relevant to the concept of entropy centrality, deserves a more in-depth discussion. This paper’s contribution to centrality theory is akin to that of Newman (2005), who first proposed to use walks, instead of only shortest paths, for betweenness measurement.
In entropy centrality calculation, the idea of analyzing random walks is developed further by allowing walks to be randomly interrupted; the longer a given planned object route, i.e., the more exchanges (transfers) it requires, the less likely it is to be completed. To further illustrate this point, a review and discussion of path-transfer flows is in order. Examples of path-transfer flows abound in trading and smuggling networks (Tutzauer, 2007), especially when the traded or smuggled commodity is discrete, as in the case of exotic animals, nuclear weapons material and parts, fossils, artworks and antiquities, and even trafficked humans. For a more peaceful example, consider a group of people linked by friendship ties, with one of them having a specific object. To model a path-transfer process, think of the object being passed from one person to another. The flow (i.e., object transfer) originates at a particular person in the group (i.e., a node in the graph). If that person does not pass the object to any one of their immediate friends, the flow is over before it begins; otherwise, the object flows (i.e., is transferred) to a randomly selected person. The next person then chooses whether to pass the object to their immediate friends, and again the flow either stops or continues. The object thus traverses a path in the network, traveling along the links, stopping when the process is absorbed at some node or if the object’s trajectory completes a loop. According to the original model formulation, each of the eligible neighbors is assumed to be selected with equal likelihood, although this assumption can be relaxed without loss of generality. The main restriction in the path-transfer process is that the object cannot be passed to the nodes it has already visited. This paper relaxes the restriction that flow must follow paths in the entropy centrality definition.
Instead, it develops a model based on a special case of random walk, where each node has a positive probability of absorbing the flow for good (Newman, 2005; Noh and Rieger, 2004). The motivation for this alternative definition of the entropy centrality computation mechanism is two-fold. First, the path-based centrality is extremely difficult to use in practice. The necessity of complete path enumeration in its computation makes the original measure (Tutzauer, 2007) unsuitable for the analysis of well-connected networks containing over ten nodes. In contrast, the relevant transfer and absorption probabilities for random walks can be easily calculated using matrix-analytic methods. Second, note an important nuance in the entropy centrality concept that can be utilized with its definition based on random walks. Entropy centrality is calculated using a serial transfer model; however, because multiple transfer destination probabilities enter the entropy expression simultaneously, the measure may be better suited to analyzing serial duplication processes. An iterative, step-by-step analysis of a random walk originating from a given node would inform one of the temporal (periodic) dynamics of the flow destination entropy, i.e., indicate how fast (in how many periods) the network can be informed/conquered if the spread of influence is initiated from the given node. Consider modeling a community becoming engaged in discussing a pertinent topic/issue picked up by one of its members from a news outlet. All the sequences in which people can converse come into play, and some conversations can occur simultaneously, as long as multiple community members are informed of the topic/issue by their neighbor(s).
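The claim that transfer and absorption probabilities for such walks "can be easily calculated using matrix-analytic methods" refers to standard absorbing-chain algebra: with Q the transient-to-transient block and R the transient-to-absorbing block of the transition matrix, the absorption probabilities are B = (I − Q)⁻¹R. A minimal numpy sketch, with an assumed toy two-node chain (not from the paper):

```python
import numpy as np

# Toy absorbing chain (assumed for illustration): two transient states
# that hold the object or pass it to each other, each with absorption
# probability a = 0.2 into its own absorbing state.
a = 0.2
Q = np.array([[0.4, 0.4],      # transient -> transient block
              [0.4, 0.4]])
R = np.array([[a, 0.0],        # transient -> absorbing block
              [0.0, a]])

# Fundamental matrix N = (I - Q)^{-1}; then B[i, j] is the probability
# that a walk starting in transient state i is absorbed in state j.
N = np.linalg.inv(np.eye(2) - Q)
B = N @ R
```

Here each row of B sums to one (absorption is certain), and no path enumeration is involved; for a network, Q and R are built directly from the adjacency structure.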
Thoughtful, as opposed to gossip-generating, conversations between people are rarely broadcast; they take place sequentially, and the same news can be discussed by the same two individuals multiple times (think of mulling over a political situation rather than sharing news of a rock star appearing at a night club). In summary, walk-based entropy centrality can be most useful for identifying influential community members with respect to a serial duplication process. This observation defines the measure’s place in Borgatti’s typology, reaffirming the motivation for introducing it.

2.4. Markov model and entropy centrality

Consider a connected network represented by graph G = (V, E), with V being a set of N nodes indexed 1 through N, and with E being a relation on V. Refer to Fig. 1 for an illustrative example of a small network with N = 6 nodes and |E| = 8 edges. In a random-walk-based flow process, the immediate destination of an object transferred from an object-holder depends only on the current object position, and not on the sequence of nodes that the object visited prior to the current state; therefore, its position over time (in time periods) can be modeled as a Markovian process, or a Markov chain. For example, an object being transferred over the network in Fig. 1 could move from node 1 to node 4, and then in the next period, back from node 4 to node 1. It is also assumed that each node has the option to hold on to the object in any given period even though it is connected to other nodes, thus taking a pause in communication. Additionally, each node can stop the flow for good, with the probability of such an event referred to as the absorption probability a_v, v ∈ V. Fig. 2 depicts the node absorption probabilities fixed at a = 0.2, which implies that in a single period node 1 can transfer the object to three nodes (i.e., self, node 2 or node 4), each with probability (1 − 0.2)/3 ≈ 0.267. Fig.
2 also adds auxiliary nodes to the original network: labeled with apostrophes (primes), these nodes represent the absorbing states of the Markov chain. Note that, to avoid clutter in Fig. 2, the loop transitions are not depicted. Consider a stochastic process with the state diagram given by Figs. 1 and 2 combined (including both the loops and the absorbing states). This process is a Markov chain with transition probability matrix denoted by P, with elements p_ij, i, j ∈ {1, 2, . . ., N, 1′, 2′, . . ., N′}, as given in Table 1. The measure of centrality for node i = 1, 2, . . ., N, quantified by the entropy of the object destination, given that the transferred object originates from node i and experiences t transitions, is defined as

$$H_i^t = -\sum_{j=1}^{N} \big(p_{ij}^{(t)} + p_{ij'}^{(t)}\big) \log\big(p_{ij}^{(t)} + p_{ij'}^{(t)}\big). \quad (1)$$

If the base of the logarithm in formula (1) is chosen to be 2, then the entropy centrality is measured in bits; the results in the subsequent sections of this paper are reported using the more conventional natural logarithm. The expression in (1) involves terms of the form $p_{ij}^{(t)} + p_{ij'}^{(t)}$; one such term gives the probability that an object originating at node i finds itself, after t time periods elapse, in the possession of node j. The closer these probabilities are to each other across nodes j ∈ {1, 2, . . ., N}, the more difficult it is to predict/guess the object’s position (at time t), and the larger the entropy.

Fig. 2. An expanded state diagram for a Markovian transfer process, with auxiliary nodes for absorbing states.

Table 1
Transition probability matrix for the Markovian transfer process.

From\To   1      2      3      4      5      6      1′    2′    3′    4′    5′    6′
1         0.267  0.267  0      0.267  0      0      0.2   0     0     0     0     0
2         0.2    0.2    0.2    0.2    0      0      0     0.2   0     0     0     0
3         0      0.2    0.2    0.2    0      0.2    0     0     0.2   0     0     0
4         0.16   0.16   0.16   0.16   0      0.16   0     0     0     0.2   0     0
5         0      0      0      0      0.4    0.4    0     0     0     0     0.2   0
6         0      0      0.2    0.2    0.2    0.2    0     0     0     0     0     0.2
1′        0      0      0      0      0      0      1     0     0     0     0     0
2′        0      0      0      0      0      0      0     1     0     0     0     0
3′        0      0      0      0      0      0      0     0     1     0     0     0
4′        0      0      0      0      0      0      0     0     0     1     0     0
5′        0      0      0      0      0      0      0     0     0     0     1     0
6′        0      0      0      0      0      0      0     0     0     0     0     1

Note that the number of transitions t, which can be fixed at any integer, defines the desired locality of the centrality analysis: it is thus termed the transfer locality. In particular, a node in a network may have high relative centrality for small t but low relative centrality for large t. Also, as t increases, the centrality measure for any node approaches a constant, at a rate depending on how fast the process is expected to be absorbed.

2.5. The effect of transfer locality adjustment

Given a locality value t, one evaluates a node’s centrality with respect to the part of the network that is likely to be reached from the node by a transfer process in a limited time, i.e., in t steps of a random walk. In other words, by adjusting the transfer locality, one “magnifies” the local neighborhood surrounding the node, thus reducing the impact of “far-away”, hard-to-reach nodes on the resulting entropy centrality value. When t is large, entropy centrality describes nodes’ network positions on a global (whole-network) scale. To understand the implications of varying transfer locality in centrality analyses, consider the social network of Zachary’s karate club. In a classical study, 34 members of a karate club were observed over a 2-year period. A network of friendships between the club members was constructed using a variety of measures to estimate the strength of ties. An unweighted version of the club network is given in Fig.
3; the following analysis focuses on the six nodes labeled 1, 5, 12, 29, 33 and 34; these appear in bold circles in the figure. Fig. 4 reports entropy centrality values for the six nodes, at varied levels of t. As the transfer locality increases, each node’s centrality value monotonically increases, implying that, given more time, the node can reach more peers (remember, the node for which the centrality is computed is viewed as the flow originator). Importantly, observe that the rates of entropy increase as a function of t differ across nodes. For example, nodes 5 and 29 see their centrality values dramatically increase with growing t, indicating that such nodes can become influential only if the transfer process they originate does not die early. Meanwhile, nodes 1, 33 and 34, located at “the heart” of well-connected clusters (small or large), do not see their centralities grow by much. Interestingly, node 29 has low centrality in its local neighborhood and high centrality with respect to the whole network, surpassing the locally central nodes 5 and 33. The sensitivity of entropy centrality to a node’s position with respect to network clusters leads to the idea of fixing the transfer locality value in such a way that clusters can be identified in any network.

3. Community structure detection

This section describes how entropy centrality can be used to reveal community structure in networks. The presented community detection algorithm is inspired by the algorithm proposed by Girvan and Newman (2002), which iteratively removes high-betweenness edges in a hierarchical clustering procedure. The algorithm proposed in this paper also removes one edge at a time and re-computes the corresponding transition and absorption probabilities for each node.

Fig. 3. Zachary’s karate club social network.

Fig. 4. Entropy centrality vs transfer locality plot for the karate club problem.
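The entropy centrality computation of Section 2 reduces to a matrix power. The sketch below builds the 2N × 2N transition matrix for an N-node network with uniform absorption probability a (as in Fig. 2) and evaluates formula (1) for every node; the function name is an assumption, and the example edge list is read off the entries of Table 1, so treat it as an assumed reconstruction of the Fig. 1 network.

```python
import numpy as np

def entropy_centrality(adj, a=0.2, t=5):
    """Walk-destination entropy H_i^t of formula (1) for every node.
    adj: 0/1 symmetric adjacency matrix; a: absorption probability;
    t: transfer locality (number of transitions)."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    P = np.zeros((2 * n, 2 * n))
    for i in range(n):
        opts = 1 + int(adj[i].sum())           # self plus neighbors
        P[i, i] = (1 - a) / opts               # hold the object
        P[i, :n][adj[i] > 0] = (1 - a) / opts  # pass to a neighbor
        P[i, n + i] = a                        # absorbed for good
        P[n + i, n + i] = 1.0                  # absorbing state loops
    Pt = np.linalg.matrix_power(P, t)
    q = Pt[:n, :n] + Pt[:n, n:]                # p_ij^(t) + p_ij'^(t)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(q > 0, q * np.log(q), 0.0)  # 0 log 0 := 0
    return -terms.sum(axis=1)

# Assumed example (edges read off Table 1): 6 nodes, 8 edges.
edges = [(1, 2), (1, 4), (2, 3), (2, 4), (3, 4), (3, 6), (4, 6), (5, 6)]
A = np.zeros((6, 6), int)
for i, j in edges:
    A[i - 1, j - 1] = A[j - 1, i - 1] = 1
H = entropy_centrality(A, a=0.2, t=5)
```

Since each row of the q matrix sums to one over the N destinations, each H value lies between 0 and log N, and varying t directly implements the transfer locality adjustment discussed above.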
The algorithm pseudocode is as follows:

INPUT: Number of nodes in the network N; transition probability matrix P for the Markov chain with auxiliary nodes; transfer locality t; number of algorithm iterations K.
For k = 1 to K
  For i = 1 to N and j = 1 to N, i ≠ j
    Remove the link between nodes i and j, if it exists
    Revise the probabilities for transitions from nodes i and j
    Compute the average entropy over all the nodes using (1)
    Remember/update the link for which the entropy decrease is maximum
  End
  Remove the link for which the entropy decrease is maximum
End
Sort to identify the obtained clusters
OUTPUT: The clusters.

Given an undirected graph and a fixed value of the absorption probability for all the nodes, the transition probability matrix P for a Markov chain with auxiliary states is created first, as explained in Section 2. The desired centrality locality t is chosen next; during experimentation, it was empirically discovered that locality values close to the diameter of a given network, and absorption probability values in the range [0.1, 0.2], are convenient choices for successful global community detection. The algorithm proceeds by identifying and removing network edges such that the average entropy centrality over all the nodes is reduced the most (see the pseudocode above). Empirical investigations with the designed community detection algorithm, run to discover clusters in networks with known ground truth, are reported next.

3.1. The Zachary karate club network

Returning to the Zachary karate club experiment, recall the part of the club’s story that made it famous in social network analysis circles: during the 2-year observational study, a split occurred between the club members. A disagreement, which developed between the administrator of the club and the club’s instructor, ultimately resulted in the instructor leaving and starting a new club, taking about half of the original club’s members with him. The node colors in Fig.
3 indicate how exactly the two factions ended up splitting. Fig. 5 presents the results of the entropy centrality-based community detection algorithm, executed on the karate club network. The algorithm discovers the two main club communities, offering a strict refinement of the community structure reported in Girvan and Newman (2002) and agreeing with the findings of Medus et al. (2005). For finding the two-community division, 25 iterations of the algorithm were executed with locality t = 5. This partition corresponds almost exactly to the actual factions in the club, with the only exception being some “outliers”, the nodes with the lowest degree values. The outliers, nodes 5, 10, 11, 12 and 29, were detected first, which is a desirable property for a community detection algorithm that looks to find closely connected groups. Note also that increasing the number of algorithm iterations produces more granular clusters (perhaps smaller friendship groups or families); however, any further refinements could not be validated due to the lack of data.

Fig. 5. Entropy centrality algorithm results for the karate club problem, depicting the sequence of community formation.

Table 2
Clustering algorithms comparison – karate club data (34 nodes, 78 edges).

                      Girvan–Newman algorithm       Entropy-based algorithm
Number of iterations  Clusters  Outliers  Time (s)  Clusters  Outliers  Time (s)
10                    1         0         5.6       2         2         0.5
15                    2         1         7.4       3         4         0.6
25                    4         2         9.5       4         5         0.77

Table 3
Clustering algorithms comparison – football network data (115 nodes, 613 edges).

                      Girvan–Newman algorithm       Entropy-based algorithm
Number of iterations  Clusters  Outliers  Time (s)  Clusters  Outliers  Time (s)
50                    1         0         577.5     1         0         205.3
100                   3         0         904.4     3         0         388.2
150                   6         0         1046.1    7         0         553.1
200                   12        3         1094.0    12        3         673.6
250                   14        4         1134.7    12        13        738.9

Table 2 reports the number of clusters identified by the Girvan–Newman algorithm (Girvan and Newman, 2002) and the presented entropy-based algorithm as a function of the number of iterations, together with the respective computational times. In each algorithm, an iteration constitutes the removal of a single edge from the network: in identifying an edge to be removed, the former algorithm computes betweenness centralities for all the edges, while the latter computes entropy centralities for all the nodes and the changes in these centralities that would result from the removal of every edge. Thus, the number of centrality evaluations required in each iteration of the presented algorithm is O(N) times greater than that in the Girvan–Newman algorithm. Yet, the observed algorithm runtimes differ by an order of magnitude in favor of the entropy-based algorithm, due to the high efficiency of entropy centrality computation. Note that both algorithms are hierarchical, in that they will continue breaking communities apart until the last edge has been removed from the network at hand; an analyst is free to stop this process at any point. In Table 2, and in the tables corresponding to the subsequent experiments, cluster-based metrics are reported for multiple breakpoints in the algorithms’ execution, for illustrative purposes. All the experimental results presented in this paper have been obtained using MATLAB version R2012b on a desktop with an Intel i3-2120 processor (3.3 GHz, 2 cores) and 8 GB RAM.

Fig. 6. Clusters for NCAA Division I-A football teams.

Fig. 7. Clusters for the Dolphin Network.

3.2. The US Division I football network

Another example is based on the structure of a US college football league (football here is American football, not soccer). The network under study is a representation of the schedule of Division I games in the year 2000, with nodes representing teams (identified by their college names) and edges representing regular-season games between the teams they connect.
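The Section 3 pseudocode can be rendered compactly as follows. This is an illustrative sketch, not the authors' MATLAB implementation: the helper `avg_entropy` is restated inline so the snippet stands alone, the function names are assumptions, and the toy two-triangles-plus-bridge network is invented for the example.

```python
import numpy as np

def avg_entropy(A, a=0.2, t=5):
    """Average walk-destination entropy over all nodes (formula (1))."""
    n = A.shape[0]
    P = np.zeros((2 * n, 2 * n))
    for i in range(n):
        opts = 1 + int(A[i].sum())
        P[i, i] = (1 - a) / opts
        P[i, :n][A[i] > 0] = (1 - a) / opts
        P[i, n + i] = a
        P[n + i, n + i] = 1.0
    Pt = np.linalg.matrix_power(P, t)
    q = Pt[:n, :n] + Pt[:n, n:]
    with np.errstate(divide="ignore", invalid="ignore"):
        H = -np.where(q > 0, q * np.log(q), 0.0).sum(axis=1)
    return H.mean()

def detect_communities(A, iters, a=0.2, t=5):
    """Greedy edge removal: per iteration, tentatively delete each edge,
    measure the drop in average entropy, and permanently remove the
    edge whose deletion lowers the average entropy the most."""
    A = A.copy()
    for _ in range(iters):
        base, best, best_edge = avg_entropy(A, a, t), None, None
        for i, j in zip(*np.triu_indices_from(A, k=1)):
            if A[i, j]:
                A[i, j] = A[j, i] = 0              # tentative removal
                drop = base - avg_entropy(A, a, t)
                A[i, j] = A[j, i] = 1              # restore
                if best is None or drop > best:
                    best, best_edge = drop, (i, j)
        if best_edge is None:
            break                                   # no edges left
        i, j = best_edge
        A[i, j] = A[j, i] = 0                       # permanent removal
    return A  # connected components of A are the clusters

# Assumed toy network: two triangles joined by a single bridge edge.
A = np.zeros((6, 6), int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
A2 = detect_communities(A, iters=1)
```

Each outer iteration removes exactly one edge, mirroring the pseudocode; the communities are then read off as the connected components of the pruned adjacency matrix.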
What makes this network interesting is that the true community structure is also available. The teams are divided into conferences containing around 8–12 teams each. Games are more frequent between members of the same conference than between members of different conferences, with teams playing an average of about seven intra-conference games and four inter-conference games in the season. Inter-conference play is not uniformly distributed; teams that are geographically close to one another but belong to different conferences are more likely to play one another than teams separated by large geographic distances (Girvan and Newman, 2002). The entropy centrality-based community detection algorithm was applied to this network to identify the conference structure. The algorithm was executed for 200 iterations with transfer locality t = 5. The results are presented in Fig. 6. Almost all teams were correctly grouped; a few independent teams that did not belong to any conference were also successfully identified. Overall, only four teams were misclassified: Boise State in actuality belongs in the Western Athletic Conference, Western Michigan and Marshall in the Mid-American Conference, while Utah State is a conference-independent team. Table 3 offers a comparison of the computational times of the Girvan–Newman and entropy-based algorithms executed on the football network dataset. Naturally, as the number of iterations increases, the network becomes more sparse.

Table 4
Clustering algorithms comparison – dolphin network data (62 nodes, 159 edges).

                      Girvan–Newman algorithm       Entropy-based algorithm
Number of iterations  Clusters  Outliers  Time (s)  Clusters  Outliers  Time (s)
45                    8         0         39.2      4         16        8.7
75                    11        12        48.0      6         22        11.5
100                   12        20        51.0      6         31        13.8

Fig. 8. Clusters for the “Les Miserables” weighted network (100 iterations, 31 outliers).
It is worth noting that the number of shortest paths remaining in the network decreases sharply, which explains why the iteration cost of the Girvan–Newman algorithm drops faster than that of the entropy-based algorithm.

3.3. The Dolphin Network

Finally, the algorithm was run on a classic network called “The Dolphin Network” (Lusseau, 2003). The network represents a community of 62 bottlenose dolphins in Doubtful Sound, New Zealand. The algorithm was executed for 45 iterations with locality t = 5. The results are presented in Fig. 7 and Table 4: the dolphin community is split into six main clusters identified by the entropy centrality community detection algorithm. These results match well with those reported by previously existing clustering algorithms. The singled-out outliers, nodes 5, 12, 13, 32 and 36, were detected first, while nodes 23, 49 and 61 were not separated into an isolated cluster. In its current design, the entropy centrality algorithm cannot be used to distinguish overlapping communities, which is another community detection task typically explored with this dataset. Note that the interpretation of entropy centrality emphasizes diversity in multi-way information exchanges between nodes, as opposed to emphasizing connectivity. Therefore, the presented clustering algorithm first and foremost seeks to achieve high cohesion within each discovered group, and this is why outliers tend to be removed early in all the attempted experiments.

4. Discussion and conclusion

This paper introduces a measure of node centrality defined as the entropy of flow destination in a walk-based transfer process with the Markovian property. Entropy centrality can be particularly useful in large social network analysis, where the multitude of paths between node pairs makes the differences in the typically used betweenness centrality values almost negligible. Entropy centrality is well-interpretable and easy to compute exactly using matrix multiplication.
By design, entropy centrality can be interpreted as a measure of a node’s potential for information spread: the more diverse the set of destinations a node can engage, the higher the centrality it boasts. Moreover, by adjusting the settings of the Markovian transfer process, one is able to measure entropy centrality at different localities, establishing the value of every node’s position with respect to its local neighbors or, globally, with respect to the whole network. The notion of locality in the entropy centrality definition is akin to that of reach, used to define reach centrality; however, in the stochastic transfer process context, the two are not quite the same. Entropy centrality is conducive to quantifying the properties of a serial duplication network flow process, thus taking a particular spot in Borgatti’s typology of social network processes/metrics. This observation motivates further investigations into the utility of entropy centrality for viral marketing studies, where the spread of ideas or products takes place simultaneously over multiple network paths or walks. A profit-sharing product distribution strategy, where distributors are constantly recruited by the existing distributors directly from the consumer population, is a good example of such a potential analysis application. Entropy centrality can also be useful for network visualization, with globally central nodes placed onto a canvas first, uniformly spaced, and with the other, surrounding nodes becoming more distant as their local centrality drops. Such a visualization would emphasize the information exchange capabilities of nodes at multiple levels, instead of relying exclusively on the local network structure captured by its edges.

Table 5
Entropy-based clustering algorithm – Les Miserables data (77 nodes, 254 edges, weighted).

Number of iterations  Clusters  Outliers  Time (s)
25                    2         20        14.6
50                    4         25        27.2
100                   5         31        47.9
150                   4         48        58.8
This paper also explores how entropy centrality can be utilized by an iterative algorithm to effectively detect communities. Computational experiments on networks with known cluster ground truth showcase the effectiveness of the presented method. Additional insights, drawn from the experiments with the entropy centrality-based clustering algorithm, are notable. First, entropy centrality appears to remain informative on weighted networks (the matrix-based way of computing entropy centrality values does not require any significant revision). Fig. 8 and Table 5 give the results and computational times for the presented clustering algorithm applied to a weighted dataset of character co-appearances in the text of "Les Miserables": the discovered communities are consistent with the results previously reported in the literature. Second, it is observed that, in its current form, entropy centrality may not be as useful for analyzing directed networks. When applied to prisoner relationship data (67 nodes, 182 edges), the entropy centrality-based clustering algorithm failed to discover well-interpretable clusters. This is due to the fact that in a directed network, many actors have very limited options for information spread, and are isolated as "outliers" early in the algorithm's execution. Also, experiments with larger networks revealed that, runtime-wise, the clustering algorithm's applicability has limitations similar to those of the Girvan–Newman algorithm: more specifically, the former can work with datasets of up to 500 nodes, whereas the latter becomes slow when clustering just 100 nodes. Meanwhile, the computational efficiency of a one-time evaluation of entropy centrality values over all network nodes remains very high, as expected. On a final note, future research on the use of entropy centrality for social network analysis can focus on evaluating strategic network positions of groups of nodes.
Having directly computed all the absorption probabilities for the Markovian transfer process (i.e., from state i ∈ N into state j ∈ N), one can search for subsets of strategically positioned nodes at various localities. Other research directions include increasing the computational efficiency of the presented methods, and devising methods for detecting overlapping network communities.

Acknowledgment

This research has been supported in part by the National Science Foundation (Award #62288) and by a Multidisciplinary University Research Initiative (MURI) grant (#W911NF-09-1-0392) for "Unified Research on Network-based Hard/Soft Information Fusion", issued by the US Army Research Office (ARO) under the program management of Dr. John Lavery.

References

Bonacich, P., 1972. Factoring and weighting approaches to status scores and clique identification. J. Math. Sociol. 2 (1), 113–120.
Bonacich, P., Lloyd, P., 2001. Eigenvector-like measures of centrality for asymmetric relations. Soc. Netw. 23 (3), 191–201.
Borgatti, S.P., 2005. Centrality and network flow. Soc. Netw. 27 (1), 55–71.
Estrada, E., Rodriguez-Velazquez, J.A., 2005. Subgraph centrality in complex networks. Phys. Rev. E 71 (5), 056103.
Freeman, L.C., 1979. Centrality in social networks: conceptual clarification. Soc. Netw. 1 (3), 215–239.
Girvan, M., Newman, M.E.J., 2002. Community structure in social and biological networks. Proc. Natl. Acad. Sci. U. S. A. 99 (12), 7821–7826.
Lusseau, D., 2003. The emergent properties of a dolphin social network. Proc. Biol. Sci. 270, S186–S188.
Medus, A., Acuna, G., Dorso, C., 2005. Detection of community structures in networks via global optimization. Phys. A 358, 593–604.
Newman, M.E.J., 2005. A measure of betweenness centrality based on random walks. Soc. Netw. 27 (1), 39–54.
Noh, J.D., Rieger, H., 2004. Random walks on complex networks. Phys. Rev. Lett. 92, 118701.
Stephenson, K., Zelen, M., 1989.
Rethinking centrality: methods and examples. Soc. Netw. 11 (1), 1–37.
Tutzauer, F., 2007. Entropy as a measure of centrality in networks characterized by path-transfer flow. Soc. Netw. 29 (2), 249–265.

Engagement Capacity and Engaging Team Formation for Reach Maximization of Online Social Media Platforms∗

Alexander Nikolaev, University at Buffalo, 312 Bell Hall, Buffalo, New York 14260, [email protected]
Shounak Gore, University at Buffalo, 113 Davis Hall, Buffalo, New York 14260, [email protected]
Venu Govindaraju, University at Buffalo, 113 Davis Hall, Buffalo, New York 14260

ABSTRACT

The need to assess the "health" of online social media platforms and strategically grow them guides the efforts of researchers and practitioners. For those platforms that primarily rely on user-generated content, the reach – the degree of participation referring to the percentage and involvement of users – is a key indicator of success. This paper lays a theoretical foundation for measuring engagement as a driver of reach that achieves growth via positive externality effects. The paper takes a game-theoretic approach to quantifying engagement, viewing a platform's social support capital as a cooperatively created value and finding a fair distribution of this value among the contributors. It introduces engagement capacity, a measure of the ability of users and user groups to engage peers, and formulates the Engaging Team Formation Problem (EngTFP) to identify sets of users that "make the platform go". We distinguish our analyses, which underlie the reach maximization efforts, from the pre-existing influence maximization work and compare the engagement capacity with network-based metrics. Computational investigations with MedHelp and Twitter data reveal the properties of engagement capacity and the utility of EngTFP.
Categories and Subject Descriptors

A3 [Crowdsourcing Systems and Social Media]; A4 [Economics and Markets]; A8 [Social Networks and Graph Analysis]

Keywords

social networks, engagement, reach, team formation problem, influence maximization

1. INTRODUCTION

Measuring social influence online is becoming a more sophisticated, refined and granular process. A variety of professional services, such as Klout and PeerIndex, have come up lately that aim at measuring influence with as much granularity as possible. Social media influence can be defined as an individual's ability to affect the thinking or actions of peers; in this setting, it is of interest to identify the most influential persons with the objective of strategically changing political preferences in a community, or advertising products/services to increase their sales. However, influence measurement loses meaning when an online platform does not seek to exploit its userbase, but instead simply works to maintain user activity. The whole purpose of activities such as sharing pictures or posts, contributing comments or "likes" in response to peers' posts, or exchanging opinions in social forum threads is to share experiences and maintain (friendly) communication between fellow users of the network. This form of communication keeps the users engaged, and ultimately defines the influential power of the platform that they use. Take as an example the downfall of MySpace and Orkut alongside the success of Facebook. As more and more users migrated from Orkut to Facebook, the utilization of Orkut declined, leading to its eventual abandonment. The ability to detect such turns and strategically grow online platforms is of interest to many practitioners, putting the question of assessing a platform's "health" up on the research agenda.

∗Corresponding author. [email protected]
Taking a page from RE-AIM, a widely used systematic program evaluation framework, we use the term "reach" when referring to an online platform's ability to attract new users [39, 13], and define "engagement" as any existing user's degree of involvement in the platform's activities. The interplay between engagement and reach can be understood by leveraging the research that has for decades sought to understand the structure, stagnation, and growth of consumer markets. Studies of cascade emergence in demand-side economics have found that customers' willingness to choose (adopt) a product grows with the number of people that have already adopted it; this increase in value, otherwise known as positive network externality, occurs with each sale of a product unit [43, 18, 20]. On an online platform, positive network externalities occur in two instances: (1) when a new user joins the network and authors a new post, and (2) when an existing user contributes new user-generated content [43]. Thus, the internal growth of a platform (achieved through added user-generated content) leads to its external growth (the increase in the number of newly registered users): i.e., higher engagement leads to higher reach.

Figure 1: Examples of communication threads of different structure.

We posit that engagement occurs when a user contributes content in response to someone else's contribution(s). A network of directed relationships reflecting "which post attracts which" captures the sequence and structure of engagement (see Figure 1). An online forum grows through its users: every post (even "weather talk") fosters user "bonding" and creates a positive externality effect, even more so when it is read and responded to. The propagator – the creator of seed content – is said to engage the responder, while the responder is said to engage in the forum's activity.
Filling the methodological gap in the development of theory-supported methods for quantifying engagement, we introduce a new term, "engagement capacity", interpreted as a user's ability to engage peers and measured as their share in the platform's/forum's engagement power. We view the value generated by a massive online social network community as a direct product of communication between its members, attributed to the submitted contributions' content and structure (order), and only in part to their volume (note that irrelevant interactions are promptly removed from pro-health sites, a majority of which are moderated). In order to assess any individual user's engagement capacity, we employ cooperative game theory, a branch of science that calculates "fair-share" equilibria in settings with agents that form coalitions to achieve a common goal. Its first fundamental advance is due to Shapley [37], who introduced a method for calculating fair contributed values for "players" forming unstructured coalitions, where any contributor can interact or team up with any other contributor with the objective of generating synergistic value. This is a perfect setup for measuring engagement ability: the un-normalized and normalized (e.g., by the number of contributions or the amount of time spent online) engagement capacity can allow for fair comparison between users as they contribute to the growth of a platform. It is important to see the potential in studying "engagement" as opposed to studying "influence". The latter requires one to build a fixed-time snapshot of a social network that will define how users can affect each other's decisions with respect to a particular query; the former is query-independent. Yet, influence maximization research and centrality analyses are the closest in spirit to the presented work, so we briefly review those literatures in Section 2.
Section 3 then discusses the details of using cooperative game theory for measuring engagement capacity. Section 4 covers data collection and reports some computational results. Section 5 gives concluding remarks and discusses future work.

2. BACKGROUND AND RELATED WORK

Studies of social influence mechanisms have aimed to explain the patterns behind the spread of ideas, technologies, and viral product adoption in social networks. The earliest efforts made "word-of-mouth" an established term in social influence research [23, 45, 17] and developed diffusion-based models to describe innovation adoption [7], disease spread [27], and other phenomena in sociology [22, 41] and politics [9]. Metrics for quantifying people's ability to influence peers' decision-making, based on their structural network properties, have also been developed; they are known as centrality metrics [6]. It should be noted that most of such methods were proposed and used in the pre-big-data era. Later, with the advent of the Internet and the spread and growth of online social networks, various ranking-based algorithms gained popularity [32]. In studies of Question-and-Answer forums, the question of measuring the motivation of a user to contribute to a forum first came up [3, 33]. Simple "ad-hoc" measures, e.g., the number of upvotes on websites like StackOverflow, or the number of "likes" on Facebook and Twitter, were used to inform forum utilization and signal user interest. Natural language processing (NLP) works distill the specifics of a post to understand its success in attracting responses [15, 4, 21, 44]. Note that NLP methods that focus on post content, rather than on the dynamics of users creating posts in interaction with each other, require much human supervision and tend to be slow in processing big data. Moreover, as intuition suggests, none of the above-mentioned measures quantifies how much merit a user deserves for engaging others through interaction.
Influence maximization is a branch of social influence research worth a separate discussion. Works in this area formulate and solve the optimization problem of selecting a group of initiators (also called seeds or opinion leaders) to generate the largest influence cascade, i.e., the largest number of product or idea adopters [10, 19]. Identifying the most influential subset of the user base of an online resource turns out to be valuable for viral marketing [34, 14], delivering personalized recommendations [38], microblogging effectiveness [5], and health forum analysis [40]. The conventional influence maximization formulations, however, are not suitable for determining how to best keep an online platform active, i.e., answering the question "What set of users is the most important for keeping the whole user base engaged?" To this end, new methods for measuring engagement are in order.

Figure 2: Engagement capacity and network-based metric values for the users of a forum with three threads as depicted in Figure 1 (best viewed in color).

Little research has been done on measuring engagement thus far. The terms "engagingness" and "responsiveness" were introduced in modeling email message chains [30, 29]. In a study of Twitter, the count of re-tweets was taken as a measure of engagingness [1]. However, all these works were based on only direct responses from one user to another, with indirect communication (engaging one user through another) not accounted for.

3. MEASURING THE ENGAGEMENT CAPACITY OF A USER

This section presents a method to quantify users' ability to engage peers, taking into account the flow of communication between them. It uses cooperative game theory, a branch of science that calculates "fair-share" equilibria in settings with agents that form coalitions to achieve a common goal. The first fundamental advance in this theory is due to Shapley [37], who introduced a method for calculating fair contributed values for agents forming unstructured coalitions, where any contributing agent can interact or team up with any other contributor, with the objective of generating synergistic value. Indeed, the value generated by a coalition (team) typically exceeds the total value that could be generated by the coalition members individually (in isolation from each other) – such synergy is characteristic of most collective human efforts. Shapley's work has been extended by Myerson [25, 26], and later by other researchers, to problems with constraints on cooperation structure and frequency [31, 28, 46, 8]. The most relevant to our proposed work are the extensions that recognize that in certain situations, the value generated by a coalition may depend on the order in which players interact [28, 35]. We extend and adapt this line of work to render it useful in the thread-based communication context, by allowing for multiple interactions between the users (as players) in forum threads, accounting for contribution (forum post) order.

Consider a setting where forum users cooperatively generate engagement value: if they contribute posts that attract more posts, then they are promoting the forum's growth and increasing the forum's "engagement power" achieved through positive externality effects. We define the engagement capacity of a user as their share in the overall forum's ability to attract posts, based on all threads the user has contributed to. It is assumed that the engagement capacity of a user can be positive if and only if the user has been active in at least one thread; that is, passive readers (e.g., forum lurkers) cannot add engagement value to the forum. Moreover, the users who contribute but are never responded to also have zero engagement capacity; indeed, if a forum consisted of only such members, it would generate no communication at all.
Based on a single thread, the engagement capacity of a user depends on the position(s) of the user's contribution(s) in the thread's flow. As such, a thread starter is (in part) responsible for engaging all the users that have replied to the thread: a user whose post generates further posts gets partial engagement value credit from all such contributions. On the other hand, the posts immediately preceding a newly contributed post should get more credit for engaging it than the posts appearing earlier in the thread: otherwise, the new contribution would be expected to have occurred earlier or to be directly responding to those older posts. Let us look closely at the examples in Figure 1: assume that these three forum threads have been created by the same five users. User A originates each thread. User B replies to user A's original messages in all three threads. In the first thread (Figure 1a), users C, D and E reply to B, instead of directly replying to user A. In the second thread (Figure 1b), users C and D also reply to A's question directly. The third thread (Figure 1c) consists of a single "line" thread branch. Based on the directed networks in Figure 1, the averages of some standard centrality metrics are reported in Figure 2 (the detailed explanation of how the engagement capacity metric in the figure is computed, and the discussion of how the metrics relate to each other, will be given in the subsequent sections). A methodological approach to computing engagement capacity is presented next, followed by a discussion of some computational aspects of engagement analysis.

3.1 Cooperative Games on k-Coalitions

Nowak and Radzik [28] and Sánchez et al. [35] explain that for a transferable utility (cooperative) game on a directed graph, the worth of a coalition can depend not only on the individual properties of coalition members but also on the order in which they interact in a coalition.
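For contrast with such ordered settings, the classic (order-free) Shapley value [37] can be illustrated with a minimal sketch, computing each player's average marginal contribution over all join orders; the three-player characteristic function below is a toy assumption of ours, not an example from the paper.

```python
from itertools import permutations

# Minimal sketch of the classic Shapley value: average each player's
# marginal contribution v(S + {p}) - v(S) over all join orders.
def shapley(players, v):
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = set()
        for p in order:
            phi[p] += v(coalition | {p}) - v(coalition)
            coalition.add(p)
    return {p: phi[p] / len(orders) for p in phi}

# Toy game (our assumption): one unit of value is created as soon as
# players A and B are both present; C contributes nothing.
v = lambda S: 1.0 if {"A", "B"} <= S else 0.0
values = shapley(["A", "B", "C"], v)
# A and B each receive 0.5; C receives 0.
```

Because the characteristic function here ignores ordering, A and B split the value evenly; the extensions discussed next drop exactly this symmetry.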
Nowak and Radzik [28] define the value Ψ^{NR} that generalizes the Shapley value for transferable utility games. For a game with player set N and value function v, where player subsets S ⊂ N form ordered coalitions T ≡ (i_1, ..., i_{|T|}) from the set π(S) of all possible orderings of S, this value for player i amounts to

\Psi^{NR}_i(N, v) = \sum_{S \subseteq N \setminus \{i\}} \sum_{T \in \pi(S)} \frac{(|N| - |T| - 1)!}{|N|!} \left( v(T \cup i) - v(T) \right),

which, with Ω(N) as the set of all possible ordered subsets of N, H(T) as the set of players in coalition T, and i(T) as the position of player i in coalition T, can be concisely re-written as

\Psi^{NR}_i(N, v) = \sum_{T \in \Omega(N),\, i \in H(T),\, i(T) = |T|} \frac{\Delta^*_v(T)}{|T|!},

where \Delta^*_v(T), T ∈ Ω(N), T ≠ ∅, are termed the generalized coefficients of v, also known as the coordinates of v in the generalized unanimity basis [16]. Sánchez and Bergantiños [35] define another Shapley value extension, distinguishing coalitions by the players' positions within them:

\Psi^{SB}_i(N, v) = \sum_{T \in \Omega(N),\, i \in H(T)} \frac{\Delta^*_v(T)}{|T| \, (|T|!)}.

More recently, del Pozo et al. [8] defined a generalization of Ψ^{NR} and Ψ^{SB} as a parametric family of functions {Ψ^α}_{α∈[0,1]}, where the value generated by an ordered coalition is shared proportionally to the positions of its members:

\Psi^{\alpha}_i(N, v) = \sum_{T \in \Omega(N),\, i \in H(T)} \frac{\Delta^*_v(T)}{|T|!} \cdot \frac{\alpha^{|T| - i(T)}}{\sum_{j=0}^{|T|-1} \alpha^j}.

We extend and adapt the line of work on ordered transferable utility games [28, 35, 8] to render it useful in the thread-based communication context, where interactions take place between online platform users (as players) as they contribute posts to forum threads in response to each other, in sequence, and possibly multiple times to the same thread. To this end, define a k-coalition as a connected ordered sequence of player appearances. As with the coalitions used to define {Ψ^α}_{α∈[0,1]}, a k-coalition is distinguished not only by its membership, but also by its ordering.
Similarly to i(T), we let i(T, k), k = 1, 2, ..., K, denote the position of the k-th appearance of player i ∈ H(T) in k-coalition T ∈ Ω_K(N), with Ω_K(N) denoting the set of all k-coalitions in which any given player can appear at most K times. The value generated by a k-coalition is shared proportionally to the positions of the appearances of the coalition members:

\Psi^{K\text{-}\alpha}_i(N, v) = \sum_{T \in \Omega_K(N),\, i \in H(T),\, k = 1, \ldots, K} \frac{\Delta^*_v(T)}{|T|!} \cdot \frac{\alpha^{|T| - i(T,k)}}{\sum_{j=0}^{|T|-1} \alpha^j}. \quad (1)

The family of parametric functions {Ψ^{K-α}}_{K∈I^+, α∈[0,1]} encompasses, as special cases, the conventional Shapley value as well as the functions Ψ^{NR}, Ψ^{SB}, and {Ψ^α}_{α∈[0,1]}. The concept of engagement capacity is now introduced, in conjunction with the term engaging subthread in forum communication. Define a subthread as an uninterrupted chain (sequence) of posts of a thread that contains (and thus begins with) the first post of the thread; in the directed tree graph representing a forum thread, every path that begins at the root is a subthread. A subthread is called engaging if it is succeeded by at least one post in its respective thread. Therefore, the number of forum posts contributed in response to, or following up on, another post (or a sequence of posts) constitutes the total engagement value generated by the forum users. The share of each user in this value is called engagement capacity: it measures each user's ability to engage peers, computed retrospectively, i.e., based on their past activity records. Given an online forum, let N denote the set of all the users in the forum's userbase. Define (N, v, P) as a game on the set P of all the forum's subthreads. A subthread-restricted game (N, v_U, p) can be interpreted as the game where users U ⊂ N contribute posts to form p ∈ P; this setup is similar, but not identical, to a game in a communication "situation" described in [8].
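The subthread definitions above can be sketched in a few lines; the dictionary-based thread encoding below is our illustrative assumption, applied to thread (a) of Figure 1.

```python
# Sketch of the subthread definitions: a subthread is a root-anchored path
# in the reply tree; it is engaging if its last post has at least one reply.
# The `children` encoding of a thread is our assumption, not the paper's.

def subthreads(children, root):
    """Yield every path that begins at the root post."""
    stack = [[root]]
    while stack:
        path = stack.pop()
        yield path
        for reply in children.get(path[-1], []):
            stack.append(path + [reply])

def is_engaging(children, path):
    """Engaging: succeeded by at least one post in the thread."""
    return bool(children.get(path[-1]))

# Thread (a) of Figure 1: A starts; B replies to A; C, D, E reply to B.
children_a = {"A": ["B"], "B": ["C", "D", "E"]}
paths = ["".join(p) for p in subthreads(children_a, "A")]
engaging = ["".join(p) for p in subthreads(children_a, "A")
            if is_engaging(children_a, p)]
# Subthreads: A, AB, ABC, ABD, ABE; the engaging ones are A and AB.
```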
Given a forum's snapshot (historical data), set K to be the largest number of posts contributed by the same user to any subthread, and let Δ^*_v(T) return the total number of posts immediately succeeding the engaging subthreads p ∈ P that have the same membership, size and structure as k-coalition T ∈ Ω_K(N). The engagement capacity of forum user i ∈ N is the value that (1) returns for this user as a solution of the game (N, v, P); it is hereafter denoted by η_i, or by η_{i,F} to specify that the computation is based on a particular set F of forum threads. Note that the coefficient α ∈ [0, 1] in {Ψ^{K-α}}_{K∈I^+, α∈[0,1]} captures the engagement share tradeoff between thread contributors. As such, if a new post is submitted in response to multiple (preceding) consecutive posts, then its immediate predecessor gets more credit for attracting it, with the credit to the earlier predecessors discounted by factors of α, α², α³, etc., respectively.

3.2 Calculating Engagement Capacity

Consider the example in Figure 1a, and denote the thread depicted in it by "(a)". Using (1), the engagement capacity of user A is found to be

η_{A,(a)} = 1 + 3 · α/(α + 1),

where user A gets the engagement value of 1 by contributing to engaging subthread A and 3α/(α + 1) by contributing to engaging subthread AB; meanwhile, subthreads ABC, ABD and ABE are not engaging. Similarly, the engagement capacity of user B amounts to

η_{B,(a)} = 3 · 1/(1 + α).

Note that users C, D and E have zero engagement capacity in this example; also, the total engagement value generated and shared is equal to four, i.e., the number of posts contributed by users in response to their peers. Table 1 gives the engagement capacity values for each of the threads in Figure 1, with α = 1, i.e., giving all the subthread propagators equal credit for engaging a responder.
Table 1: Engagement capacity values computed for the threads in Figure 1.

User    A       B       C       D       E
(a)     2.5     1.5     0       0       0
(b)     4.5     0.5     0.5     0.5     0
(c)     2.083   1.083   0.583   0.25    0

Figure 3: Distribution of engagement capacity values for MedHelp and Twitter users (best viewed in color).

In these examples, the early contributors to a thread have higher engagement capacities than the late ones. However, in general, this is not necessarily the case; for example, with α = 0, η_{A,(a)} = 1 and η_{B,(a)} = 3. The engagement capacity values can be interpreted directly, or upon normalization. One sensible way is to normalize by the number of contributed posts, to identify the users whose contribution content is engaging irrespective of the volume. The case where a forum post is immediately followed (in the same thread) by another post of the same user is worth a special discussion. In general, the treatment of such occurrences is up to the researcher: e.g., one may choose to "merge" such posts and treat them as one. However, in a moderated forum, one may assume that every contribution is distinct, i.e., makes a new point; since every post grows the network and creates positive externalities, rewarding the user (with a share in the total engagement value) for both posts makes sense. Another issue that becomes apparent in the considered example is that a user may gain a high engagement capacity by engaging themselves (as opposed to others): this issue will be addressed in Section 4, with the introduction of "targeted engagement capacity." Engagement capacities of users can be, rather conveniently, dynamically computed and updated in real time, as new content is added, without the need to redo the analysis for the whole history of the forum/platform every time it changes. Recall that each new user post submitted in response to another post, or sequence of posts, brings in one unit of engagement value to the communication thread and to the platform as a whole.
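The per-post credit split behind these numbers can be sketched as an incremental update, in the spirit of the dynamic computation just described; this is our reading of the sharing rule, and the function and variable names are ours.

```python
# Sketch of the incremental update: each new reply distributes one unit of
# engagement value over the subthread it responds to; the appearance at
# position p of a length-L subthread receives alpha**(L - p) / sum_j alpha**j.
# Our reading of the sharing rule; names are illustrative.

def add_post(eta, subthread, alpha):
    """Credit the authors of `subthread` for attracting one new reply."""
    L = len(subthread)
    denom = sum(alpha**j for j in range(L))
    for p, user in enumerate(subthread, start=1):
        eta[user] = eta.get(user, 0.0) + alpha**(L - p) / denom

def thread_a(alpha):
    """Thread (a) of Figure 1: B replies to [A]; C, D, E reply to [A, B]."""
    eta = {}
    add_post(eta, ["A"], alpha)           # B's reply engages A
    for _ in "CDE":
        add_post(eta, ["A", "B"], alpha)  # each reply engages A, then B
    return eta

# With alpha = 1: eta_A = 2.5 and eta_B = 1.5, matching column (a) of
# Table 1; with alpha = 0: eta_A = 1 and eta_B = 3, as noted in the text.
```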
A newly added post increases the engagement capacity values of all the users contributing to the subthread leading to the new post (but not including it). Consider a k-coalition T′ ∈ Ω_K(N) with the same membership, size and structure as this subthread. The increase in the engagement capacity of user i ∈ H(T′), resulting from the addition of the new post, is given by

\Delta_i = \sum_{k=1,2,\ldots,K} \frac{\alpha^{|T'| - i(T',k)}}{\sum_{j=0}^{|T'|-1} \alpha^j}. \quad (2)

Equation (2) specifies how every new contribution changes the engagement capacity values of forum users. This equation can be used in real time to efficiently track the contributed engagement dynamics. Note that the expression in (2) can be evaluated in O(n) time: its denominator requires finding the subthread succeeded by the new post, and its numerator requires the information about the positions of user appearances in this subthread.

4. TARGETED ENGAGEMENT CAPACITY AND ENGAGING TEAM FORMATION

Equation (2) of Section 3 specifies how to fairly distribute the unit engagement value, brought to the platform with any new post submitted into an existing thread, among the thread contributors. Naturally, some users may be successful in engaging certain peers and not so successful in engaging others. This realization highlights the value of solving the game of engaging one given user j: the peers of j would split the engagement value generated by j's responses to their posts or post sequences. In this case, only those subthreads that are followed up on by the posts of j would be considered engaging. More generally, one can address the question of evaluating the ability of a given set of users, V ⊂ N, to engage another set of users, W ⊂ N, in forum communication. To this end, we introduce the term targeted engagement capacity: denoted by η_{V→W}, it is defined as the sum of the shares allocated to the members of V in the game of engaging the members of W.
In the setting of the game on k-coalitions described in Section 3.1, one has

\eta_{V \to W} = \sum_{i \in V} \; \sum_{T \in \Omega_K(N),\, i \in H(T),\, k = 1, \ldots, K} \frac{\Delta^*_v(T)}{|T|!} \cdot \frac{\alpha^{|T| - i(T,k)}}{\sum_{j=0}^{|T|-1} \alpha^j}, \quad (3)

where \Delta^*_v(T) returns the total number of posts by the members of W immediately succeeding the engaging subthreads p ∈ P that have the same membership, size and structure as k-coalition T ∈ Ω_K(N). As an interesting special case, note that η_i ≠ η_{i→N\{i}}: the difference between these quantities is η_{i→{i}}; it indicates how much user i tends to engage in back-and-forth conversations, as opposed to occasionally contributing to multiple forum threads. Note that targeted engagement capacity can be updated dynamically in the same manner as the originally defined engagement capacity, by a trivial extension of (2).

Figure 4: Engagement capacity per user contribution, for the users ordered by increasing contribution volume.

The ability to measure how successful a particular user is in engaging other particular users allows us to attack the following question: "What group (team) of active users is most engaging?" This question is of special interest to any growing online platform, since such teams of users can be rewarded, encouraged, and assisted in further increasing peer engagement and retention. To help this cause, the Engaging Team Formation Problem (EngTFP) is introduced. The EngTFP objective is to select a subset of users with a maximal targeted engagement capacity towards all the other users: max_U η_{U→N\U}. The problem can be formulated with additional constraints, e.g., those specifying which historical engagement data to take into account and/or the maximum size of the subset to be selected (e.g., |U| ≤ b). To set up an instance of EngTFP, first use (3) to compute all the pairwise targeted engagement capacity values, {η_{i→j}}_{i∈N, j∈N, i≠j}. Note that, in general, η_{i→j} ≠ η_{j→i} for i ≠ j.
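A small end-to-end sketch of this setup, on the toy thread (a) of Figure 1 with α = 1: compute the pairwise targeted values η_{i→j} and then brute-force the EngTFP objective max_U η_{U→N\U} for teams of bounded size. The encoding of replies, our reading of Eq. (3), and all names are illustrative assumptions, not the paper's code.

```python
from itertools import combinations

# Pairwise targeted engagement values eta[i][j] (our reading of Eq. (3)):
# a reply by `author` distributes one unit over the subthread it follows,
# credited within the game of engaging that author; self-credit is excluded.

def pairwise_eta(users, replies, alpha=1.0):
    eta = {i: {j: 0.0 for j in users if j != i} for i in users}
    for author, subthread in replies:
        L = len(subthread)
        denom = sum(alpha**j for j in range(L))
        for p, user in enumerate(subthread, start=1):
            if user != author:
                eta[user][author] += alpha**(L - p) / denom
    return eta

def eng_tfp(users, eta, b):
    """Brute force: best team U, |U| <= b, maximizing eta over pairs
    (i in U, j outside U); feasible only for small user sets."""
    best_val, best_team = float("-inf"), set()
    for r in range(1, b + 1):
        for U in combinations(users, r):
            val = sum(eta[i][j] for i in U
                      for j in users if j not in U and j != i)
            if val > best_val:
                best_val, best_team = val, set(U)
    return best_team, best_val

# Thread (a) of Figure 1: B replies to [A]; C, D, E each reply to [A, B].
users = ["A", "B", "C", "D", "E"]
replies = [("B", ["A"]), ("C", ["A", "B"]),
           ("D", ["A", "B"]), ("E", ["A", "B"])]
eta = pairwise_eta(users, replies, alpha=1.0)
team, value = eng_tfp(users, eta, b=2)
```

On this toy instance the best two-member team is {A, B}: together they engage C, D and E, but earn no credit for engaging each other once both are on the team, which is exactly the trade-off that makes EngTFP combinatorial.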
Let X_i, i ∈ N, be binary decision variables such that, for any i ∈ N, X_i = 1 if i is selected into U, and zero otherwise. Let Y_{ij}, i ∈ N, j ∈ N, i ≠ j, be auxiliary binary variables that are set to 1 if and only if X_i = 1 and X_j = 0. Then, EngTFP is given by

max \sum_{i \in N} \sum_{j \in N, j \neq i} \eta_{i \to j} Y_{ij}
s.t. Y_{ij} \leq X_i, \; \forall i, j,
     Y_{ij} \leq 1 - X_j, \; \forall i, j,
     X_i, Y_{ij} \in \{0, 1\}, \; \forall i, j.

A non-linear formulation of EngTFP would not require any auxiliary variables and would maximize \sum_{i \in N, j \in N, j \neq i} \eta_{i \to j} (X_i^2 - X_i X_j); this formulation is quadratic but not convex, and does not allow for a dominant convex decomposition [24], confirming that EngTFP is combinatorially challenging. EngTFP is a special case of the MAX2SAT problem, which is known to be NP-Hard [12]; for a review of the algorithmic work on MAX2SAT, see [11]. The complexity of EngTFP lies in the fact that, once user i is selected into a team, the team members can no longer be rewarded for engaging i. Indeed, if a platform decides to pay some users for helping grow it, the users to be paid should be good at attracting other users, but not each other. Again, it is important to underline the difference between EngTFP and the influence maximization problem. Influence maximization strives to enhance a certain effect (e.g., change in political opinion, product adoption, etc.) throughout an existing and known user network. On the other hand, EngTFP solutions aim at helping a platform maintain or build its network or non-network userbase. EngTFP informs a decision-maker which users should be rewarded, virtually (e.g., via badges, titles, points) or physically (e.g., via gift cards, discount codes, cash), for igniting communication, which the users achieve through content generation, question asking, social support provision, information exchange, etc.

5. COMPUTATIONAL INVESTIGATIONS

This section reports the engagement capacity analyses conducted with data from two active online communication platforms differing in purpose.
We begin with an account of the data collection, and then present the numerical findings.

5.1 Data Collection

We collected forum contribution records from the online healthcare platform MedHelp, one of the biggest active and freely accessible online sites for pro-health social networking. It has about 200 social support forums and about as many “ask an expert” forums. The website has close to 3 million active and inactive threads and attracts about 8 million visitors every month. The MedHelp users interact through discussion boards, contribute personal journal entries exploiting weight and mood tracker features, and post notes on their friends' home pages. The data most relevant to the present study are those of the users' interactions on discussion forums: such forums allow the users to give each other social support. A web crawler was developed and used to collect the question-answer type data from the “cholesterol control” and “weight-loss and dieting” forums. The weight-loss and dieting forum has about 7,000 threads dating back to early 1999. The cholesterol control forum has about 280 threads. Each thread consists of a single question followed by answers and/or relevant comments. Note that not all the threads have replies; the unanswered threads do not affect the engagement capacity computation. After cleaning up the threads with no replies, we end up with a database of 4,296 unique users for the MedHelp data. Note that the users are allowed to contribute to any forum, and hence some users have been active on both of the considered forums while others have contributed to only one of them. A data tree is built for each analyzed thread. In each such tree, every user contribution is represented by a node; a directed link is drawn from node A to node B if B replies to A. This tree captures the relations between the users who interact in a particular thread discussion.
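The data-tree construction described above can be sketched as follows. This is a minimal pure-Python version; the field names `post_id`, `parent_id`, and `user` are illustrative, not the crawler's actual schema.

```python
def build_thread_tree(posts):
    """Build a reply tree for one thread.
    `posts` is a list of (post_id, parent_id, user) tuples, where
    parent_id is None for the root question. Returns a dict mapping
    each post_id to its list of direct replies (the directed links
    A -> B meaning B replies to A)."""
    children = {post_id: [] for post_id, _, _ in posts}
    for post_id, parent_id, _ in posts:
        if parent_id is not None:
            children[parent_id].append(post_id)
    return children

def branch_depth(children, root):
    """Length of the deepest reply chain below `root` (root counts as 1)."""
    replies = children[root]
    if not replies:
        return 1
    return 1 + max(branch_depth(children, r) for r in replies)

# Hypothetical thread: a question by user "A" with two reply branches.
thread = [(1, None, "A"), (2, 1, "B"), (3, 1, "C"), (4, 2, "D")]
tree = build_thread_tree(thread)
print(tree[1])               # direct replies to the question: [2, 3]
print(branch_depth(tree, 1)) # deepest chain 1 -> 2 -> 4: depth 3
```

The same per-thread trees underlie both the MedHelp and the Twitter analyses; only the interpretation of a link (reply vs. re-tweet) differs.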
The tree structure also tells us who initiates a particular discussion by posting a question, who engages the maximum number of people by furthering a discussion with good comments, and who interacts frequently in a particular conversation. The engagement capacities computed and presented hereafter are based on both forums (all their threads) together. Another dataset was created using Twitter, based on about 20,000 tweets, with their re-tweets, related to the 2014 FIFA World Cup. A tweet thread consists of a tweet and all of its re-tweets. A directed link is drawn from node A to node B if user B re-tweets user A's tweet. Thus, we have a number of communication threads where every root is a particular original tweet and the other nodes are re-tweets. This tree structure, like the MedHelp tree, allows us to find the initiator, the most engaging user, and the person who interacts frequently in a particular conversation. A total of 31,467 unique Twitter users contributed to these threads.

5.2 Numerical Results

Using the dynamic approach to engagement capacity measurement with the collected data, we assess how the different influence-related measures behave as compared to the proposed scheme of measuring user engagement. Figure 2 reports the values of the different metrics for the example in Figure 1, considering that these three threads constitute the whole forum. In Figure 2, the users are arranged over the horizontal axis in the descending order of their engagement capacity values. The other metrics' values were calculated using the standard networkx [36] package in Python. All the experiments were conducted on a MacBook Pro machine with an Intel i5 2.3GHz processor. As expected, the engagement capacity values are observed to depend on the number of users engaged, the length of the communication branches, and the frequency of communication. The deeper a branch goes, the more value each of the upstream users gains.
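The last property (upstream users gaining value from deep branches) can be illustrated with a simplified discounted-credit scheme. This is only an intuition-level sketch with an assumed geometric discount α per reply level, not the game-theoretic computation of the earlier sections.

```python
def discounted_upstream_credit(children, authors, root, alpha=0.99):
    """Toy illustration: each post at depth d below a user's post adds
    alpha**d to that user's credit, so deeper reply chains reward all
    upstream users. `children` maps post -> direct replies; `authors`
    maps post -> user. A simplified stand-in for engagement capacity."""
    credit = {}
    def walk(post, ancestors):
        # every ancestor is "upstream" of this post and earns
        # discounted credit for it
        for depth, anc in enumerate(reversed(ancestors), start=1):
            user = authors[anc]
            credit[user] = credit.get(user, 0.0) + alpha ** depth
        for reply in children.get(post, []):
            walk(reply, ancestors + [post])
    walk(root, [])
    return credit

# Hypothetical chain: C replies to B, B replies to A's question.
children = {1: [2], 2: [3], 3: []}
authors = {1: "A", 2: "B", 3: "C"}
print(discounted_upstream_credit(children, authors, 1, alpha=0.5))
```

With α = 0.5, A earns 0.5 for B's reply plus 0.25 for C's deeper reply, B earns 0.5, and C, who never elicits a response, earns nothing; this mirrors the behavior of users A, B, and E discussed next.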
User A, involved in both engaging a lot of users (Figure 1a) and seeding deeper communication branches (Figure 1c), has the highest engagement value of all the users. User B also exhibits a high engagement value because it engages a fair share of users directly (Figure 1b) and also gets partial credit whenever the depth of the branch increases (Figure 1c). User E has zero engagement value due to failing to engage any peers – this user always stops the communication.

Table 2: Targeted engagement values for top 8 engaging MedHelp users

      A    B    C    D    E    F    G    H
A   3.9  2.9  1.9  2.5  1.4  1.2  0    2
B   2.3  2.8  3.8  4.4  4.4  1    2.4  0
C   4.3  3.5  4.6  5    1.1  2.2  3    1.5
D   4.5  1.9  0    0    1.7  1    4.6  0.8
E   1.2  0.9  0    2.4  1.5  4    4.7  2.7
F   3.8  1.4  1.7  2.6  4.7  5    2.2  4.5
G   1.3  2.5  3.4  0.6  2.1  0.6  2.4  3.5
H   1.3  3.4  0    2.3  0.2  0.5  0.9  1.6

Table 3: Targeted engagement values for top 8 engaging Twitter users

      A    B    C    D    E    F    G    H
A   8.1  7.8  4.7  2    5.3  6.3  0.1  0.3
B   4.1  8.9  7.8  8.1  6.1  4.1  1.4  3.5
C   2.7  5.8  8.8  2.5  7.3  4.5  1.3  3.2
D   2.7  4.7  6.8  0.9  1.3  1.9  1.3  6.3
E   6.7  2.7  3.1  8.6  2.1  6.1  2.4  1.7
F   2.4  7.9  3.9  1.8  7.4  5    2.6  1.2
G   6.7  4.9  5.1  7.6  6.5  1.1  7.6  9.2
H   4.6  6.4  7.8  2.6  8.4  4.4  3.8  8.7

The results in Figure 2 suggest that the engagement capacity calculation works as expected, in line with intuition. It can be seen that engagement capacity positively correlates with out-degree: this makes sense, since high engagement capacity signals a high ability of a user to spread information, and thus, highly engaging users are expected to be connected to more people. Engagement capacity also correlates with betweenness centrality, since the latter indicates how capable users are of transferring information between otherwise disconnected subgroups. The PageRank and in-degree behave very differently from engagement capacity, since they focus on describing the information flows into a user, not the other way around. We now turn to the collected MedHelp and Twitter data.
In all the subsequent analyses, the (targeted) engagement capacities were computed with α = 0.99. Figure 3(a) shows that the distribution of engagement capacity is Gaussian for both the considered MedHelp user base and the Twitter user base. The engagement values achieved by Twitter users are generally higher than those of the MedHelp users because of the higher overall level of communication (i.e., re-tweets contributed vs. posts submitted). Figure 3(b) shows that the distribution of the number of contributions per user on each platform is a power law (the probability density functions look like straight lines on the log-log scale, with the imperfections due to small sample sizes): this is a common observation in social media analyses, which has also been recently found for pro-health forums [42]. Figure 3 reveals something very important about the nature of the engagement capacity metric: it informs us more of the personality of a user (indeed, personal characteristics/abilities/talents, e.g., IQ, are typically normally distributed in humans) as opposed to the measures characterizing behavioral patterns or activity levels. It appears that engagement capacity truly meets the objective of measuring the innate ability of a user to ignite cascades and attract peers to respond.

[Figure 5: Comparison of different metrics, panels (a)–(e), for the Twitter users arranged according to the increasing engagement capacity.]

Figure 4 depicts the engagement capacity values, normalized by the number of contributions, for the users in the MedHelp forum (Figure 4(a)) and on Twitter (Figure 4(b)); a horizontal shift is applied to some points in the plots for better visualization.
The plots verify that the engagement value is not entirely dependent on the number of contributions made by the corresponding user; moreover, the correlation between these two quantities is positive on Twitter but negative on MedHelp, signaling that social support provision is difficult to maintain just by increasing activity. Also, there is a larger variance in the engagement per contribution in the Twitter data, which can be attributed to the shorter and broader communication trees that get formed on Twitter as compared to MedHelp. The plots also show that some users manage to be very engaging even though they contribute to relatively few communication trees. Next, the various metrics used for measuring influence are compared against the proposed engagement capacity metric in Figure 5, based on the Twitter data. The Twitter users are first arranged in the increasing order of their engagement capacity values. Then, they are partitioned into bins of equal width separated by the percentile points (forming a total of 100 bins). The horizontal axis in Figure 5 contains the percentile values. The highest value in each bin (for each particular metric) was used to plot the graphs in Figure 5. The vertical axis shows the metric values. Figures 5(a)–5(e) present the engagement capacity, PageRank, eigenvector centrality, betweenness centrality, and degree centrality values, respectively, for the Twitter users. Note that the MedHelp data revealed similar results, which are omitted for brevity. As can be seen from Figure 5, none of these metrics, taken individually, seems to have a significant correlation with the proposed engagement value. In order to find any possible relation between the engagement value and the other metrics taken together, we perform a linear regression of the engagement value on them. The regression returned a very high R² value – around 87% for each of the data sets.
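A regression of this kind can be sketched as follows: fit the engagement values on the full set of network metrics by ordinary least squares and report the coefficient of determination R². The feature values below are synthetic stand-ins, not the paper's measurements.

```python
import numpy as np

def r_squared(features, target):
    """Fit target ~ features (with intercept) by ordinary least squares
    and return the coefficient of determination R^2."""
    X = np.column_stack([np.ones(len(target)), features])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    predictions = X @ beta
    ss_res = np.sum((target - predictions) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-user PageRank, betweenness, and degree:
metrics = rng.random((200, 3))
# Engagement constructed as a noisy combination of the metrics, so no
# single column explains it well but jointly they explain most of it:
engagement = metrics @ np.array([0.5, 1.5, -0.8]) \
    + 0.05 * rng.standard_normal(200)
print(round(r_squared(metrics, engagement), 3))
```

On such jointly-generated data R² is close to 1; a value of about 87%, as reported above, likewise indicates that the metrics taken together explain most of the variance in the engagement values.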
This shows that engagement capacity captures multiple aspects of what it means for a node to be important/central in a directed communication tree, as judged by the metrics traditionally used to measure influence. Now let us take a look at the top 10 engaging users (users with the highest engagement value) for both MedHelp and Twitter. A MedHelp user can up-vote a particular answer in a given thread if she feels the answer is good. This up-vote is denoted by a “star” in MedHelp. Depending on the number of such stars, users are then recognized as top contributors for various forums. Four of the top 10 engaging users turn out to be among the top contributors on MedHelp. This shows that knowledgeable and widely accepted users can end up having a high engagement value, although this is not guaranteed. We also find the top 10 tweet creators in the Twitter dataset. Two of the ten such users were accounts associated with news-related agencies. For an event like the FIFA World Cup, a news agency seems like the right place where other users would go for information. Three of the remaining eight users have a large number of followers and re-tweets. Engagement capacity can thus reveal engaging users both in terms of the content they serve as well as the followers they attract.

Table 4: EngTFP results for the best team of size k

k                      2           3            4           5           unconstrained
# of users in top 10   2           2            2           3           5
Time                   ∼8 hours    ∼9.5 hours   ∼10 hours   ∼11 hours   ∼25 hours

Next, we use the top eight of the above users to show what targeted engagement capacity can capture. Using equation (3), the results for the MedHelp data are presented in Table 2, while those for the Twitter data are presented in Table 3. Users A through H are the top eight most engaging users in each forum, with A being the most engaging, B the second most engaging, and so on. The columns in these tables are for the users whose engagement is calculated, and the rows are for the targeted users.
It can be observed that, indeed, η_{i→j} ≠ η_{j→i} for i ≠ j, i.e., the targeted engagement value from one user i to another user j is not the same as that from user j to user i. Some of the targeted engagement values are zero, signaling no interaction, in the stated direction, between the user whose engagement capacity was evaluated and the targeted user. For example, in Table 2, the value for row C and column D is zero. This means there was no communication in the MedHelp forums where D replied to C directly or indirectly. Last but not least, we present the study of EngTFP applied to the MedHelp data. A total of 1,000 random users were selected from the MedHelp dataset, with the objective of finding the most engaging subset of size k among them. The results for several EngTFP instances, solved as integer programs in the SCIP optimization suite [2], are summarized in Table 4. In the instances, the value of k was varied between two and five, and in one instance, k remained unrestricted. Comparing the best teams' members against the list of the top 10 most engaging MedHelp users, it can be seen that EngTFP does select some highly engaging users, but not all of them. This is because by selecting a pair of users who manage to engage each other, one loses much targeted engagement. For the same reason, the unconstrained problem optimizes at k as low as 14. Out of the 14 users selected, five are among the top ten most engaging users. Four of the remaining nine have quite low engagement values, which indicates that EngTFP selects some obscure users in order to maximize the reach. The last five of the 14 users have mid-ranged engagement values. Table 4 also shows the time taken to solve each EngTFP instance. It should be noted that the high run time is attributed to the fact that each run included computing the pairwise targeted engagement capacity values for every pair of users and then solving the EngTFP. 6.
CONCLUSION

This paper introduces engagement capacity – a metric that serves the well-defined purpose of measuring the ability of online platform users to engage each other in communication on the platform, creating more user-generated content and increasing the platform's reach through positive externality effects. We present two methods for computing engagement capacity: the basic method, rooted in cooperative game theory, and the dynamic method, which performs the same computations incrementally, eliminating the need to re-calculate the engagement value from scratch every time the communication structure changes, e.g., with the addition of new threads and posts to a forum. The reported experimental results show how the new metric compares against the previously existing network metrics typically used to assess the influential power of nodes. The regression results show that the proposed engagement value captures different aspects of those pre-existing network metrics in a single value. The engagement capacity reveals the different dynamics of communication and engagement in two social media platforms differing in purpose, MedHelp and Twitter. The experimental results show how the targeted engagement capacity works, and how it can be used to evaluate the ability of one user to engage another user. Future research into the expansion and utility of this concept will allow one to analyze how and why certain users manage to engage others, and will facilitate research into the mechanisms of engagement. Finally, we show through one sample study how one can formulate and solve an Engaging Team Formation Problem (EngTFP) to identify the users who are critical to a platform's success. This extension is expected to be practically valuable, as many services and organizations can then reward such users in calculated ways.

7. REFERENCES

[1] P. Achananuparp, E.-P. Lim, J. Jiang, and T.-A. Hoang. Who is retweeting the tweeters?
Modeling, originating, and promoting behaviors in the Twitter network. ACM Transactions on Management Information Systems (TMIS), 3(3):13, 2012.
[2] T. Achterberg. SCIP: Solving constraint integer programs. Mathematical Programming Computation, 1(1):1–41, 2009. http://mpc.zib.de/index.php/MPC/article/view/4.
[3] L. A. Adamic, J. Zhang, E. Bakshy, and M. S. Ackerman. Knowledge sharing and Yahoo Answers: Everyone knows something. In Proceedings of the 17th International Conference on World Wide Web, pages 665–674. ACM, 2008.
[4] E. Agichtein, Y. Liu, and J. Bian. Modeling information-seeker satisfaction in community question answering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(2):10, 2009.
[5] F. Bonchi, C. Castillo, and D. Ienco. The meme ranking problem: Maximizing microblogging virality. Journal of Intelligent Information Systems, page 29, 2013.
[6] S. P. Borgatti. Identifying sets of key players in a social network. Computational & Mathematical Organization Theory, 12(1):21–34, 2006.
[7] J. S. Coleman, E. Katz, H. Menzel, et al. Medical Innovation: A Diffusion Study. Bobbs-Merrill, Indianapolis, 1966.
[8] M. del Pozo, C. Manuel, E. González-Arangüena, and G. Owen. Centrality in directed social networks. A game theoretic approach. Social Networks, 33(3):191–200, 2011.
[9] F. Deroïan. Formation of social networks and diffusion of innovations. Research Policy, 31(5):835–846, 2002.
[10] P. Domingos and M. Richardson. Mining the network value of customers. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 57–66. ACM, 2001.
[11] M. Fürer and S. P. Kasiviswanathan. Exact Max 2-Sat: Easier and faster. Springer, 2007.
[12] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to NP-Completeness. 1979.
[13] R. E. Glasgow, T. M. Vogt, and S. M. Boles. Evaluating the public health impact of health promotion interventions: The RE-AIM framework. American Journal of Public Health, 89(9):1322–1327, 1999.
[14] A. Goyal, F. Bonchi, L. V. Lakshmanan, and S. Venkatasubramanian. On minimizing budget and time in influence propagation over social networks. Social Network Analysis and Mining, pages 1–14, 2012.
[15] F. M. Harper, D. Raban, S. Rafaeli, and J. A. Konstan. Predictors of answer quality in online Q&A sites. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 865–874. ACM, 2008.
[16] J. C. Harsanyi. A simplified bargaining model for the n-person cooperative game. International Economic Review, 4(2):194–220, 1963.
[17] E. Katz. The two-step flow of communication: An up-to-date report on an hypothesis. Public Opinion Quarterly, 21(1):61–78, 1957.
[18] M. L. Katz and C. Shapiro. Network externalities, competition, and compatibility. The American Economic Review, pages 424–440, 1985.
[19] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137–146. ACM, 2003.
[20] S. J. Liebowitz and S. E. Margolis. Network externality: An uncommon tragedy. The Journal of Economic Perspectives, pages 133–150, 1994.
[21] Q. Liu, E. Agichtein, G. Dror, E. Gabrilovich, Y. Maarek, D. Pelleg, and I. Szpektor. Predicting web searcher satisfaction with existing community-based answers. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 415–424. ACM, 2011.
[22] M. W. Macy. Chains of cooperation: Threshold effects in collective action. American Sociological Review, pages 730–747, 1991.
[23] R. K. Merton. Selected problems of field work in the planned community. American Sociological Review, pages 304–312, 1947.
[24] C. C. Moallemi and B. Van Roy. Convergence of min-sum message-passing for convex optimization. IEEE Transactions on Information Theory, 56(4):2041–2050, 2010.
[25] R. B. Myerson. Graphs and cooperation in games. Mathematics of Operations Research, 2(3):225–229, 1977.
[26] R. B. Myerson. Conference structures and fair allocation rules. International Journal of Game Theory, 9(3):169–182, 1980.
[27] M. E. Newman. Spread of epidemic disease on networks. Physical Review E, 66(1):016128, 2002.
[28] A. S. Nowak and T. Radzik. The Shapley value for n-person games in generalized characteristic function form. Games and Economic Behavior, 6(1):150–161, 1994.
[29] B.-W. On, E.-P. Lim, J. Jiang, A. Purandare, and L.-N. Teow. Mining interaction behaviors for email reply order prediction. In Advances in Social Networks Analysis and Mining (ASONAM), 2010 International Conference on, pages 306–310. IEEE, 2010.
[30] B.-W. On, E.-P. Lim, J. Jiang, and L.-N. Teow. Engagingness and Responsiveness Behavior Models on the Enron Email Network and Its Application to Email Reply Order Prediction. Springer, 2013.
[31] G. Owen. Values of games with a priori unions. In Mathematical Economics and Game Theory, pages 76–88. Springer, 1977.
[32] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1999.
[33] J. Preece, B. Nonnecke, and D. Andrews. The top five reasons for lurking: Improving community experiences for everyone. Computers in Human Behavior, 20(2):201–223, 2004.
[34] M. Samadi, A. Nikolaev, and R. Nagi. A subjective evidence model for influence maximization in social networks. Omega, 2015.
[35] E. Sanchez and G. Bergantiños. On values for generalized characteristic functions. Operations-Research-Spektrum, 19(3):229–234, 1997.
[36] D. A. Schult and P. Swart. Exploring network structure, dynamics, and function using networkx. In Proceedings of the 7th Python in Science Conference (SciPy 2008), pages 11–16, 2008.
[37] L. S. Shapley. A value for n-person games. Technical report, DTIC Document, 1952.
[38] X. Song, Y. Chi, K. Hino, and B. L. Tseng. Information flow modeling based on diffusion rate for prediction and ranking. In Proceedings of the 16th International Conference on World Wide Web, pages 191–200. ACM, 2007.
[39] M. Stearns, S. Nambiar, A. Nikolaev, A. Semenov, and S. McIntosh. Towards evaluating and enhancing the reach of online health forums for smoking cessation. Network Modeling Analysis in Health Informatics and Bioinformatics, 3(1):1–11, 2014.
[40] X. Tang and C. C. Yang. Identifying influential users in an online healthcare social network. In Intelligence and Security Informatics (ISI), 2010 IEEE International Conference on, pages 43–48. IEEE, 2010.
[41] T. W. Valente, S. Frautschi, R. Lee, C. O'Keefe, L. Schultz, R. Steketee, L. Chitsulo, A. Macheso, Y. Nyasulu, M. Ettling, et al. Network models of the diffusion of innovations. Nursing Times, 90(35):52–53, 1994.
[42] T. van Mierlo. The 1% rule in four digital health social networks: An observational study. Journal of Medical Internet Research, 16(2), 2014.
[43] T. van Mierlo, D. Hyatt, and A. T. Ching. Mapping power law distributions in digital health social networks: Methods, interpretations, and practical implications. Journal of Medical Internet Research, 17(6), 2015.
[44] X. Wang, K. Zhao, and W. Street. Predicting user engagement in online health communities based on social support activities. In Ninth INFORMS Workshop on Data Mining and Analytics, San Francisco, CA, 2014.
[45] W. H. Whyte Jr. The web of word of mouth. Fortune, 50:140–143, 1954.
[46] E. Winter. The Shapley value. Handbook of Game Theory with Economic Applications, 3:2025–2054, 2002.