Community Discovery and Profiling with Social Messages∗

Wenjun Zhou, University of Tennessee, 247 Stokely Mgmt. Center, Knoxville, TN 37996, [email protected]
Hongxia Jin, IBM Research at Almaden, 650 Harry Road, San Jose, CA 95120, [email protected]
Yan Liu, Univ. of Southern California, 941 Bloom Walk, Los Angeles, CA 90089, [email protected]

ABSTRACT

Discovering communities from social media and collaboration systems has been of great interest in recent years. Existing work shows the promise of jointly modeling content and social links to discover social communities, whose definition varies by application. We believe that a community depends not only on the group of people who actively participate, but also on the topics they communicate about or collaborate on. This is especially true for workplace email communications. Within an organization, it is not uncommon for employees to multifunction, and groups of employees collaborate on multiple projects at the same time. In this paper, we aim to automatically discover and profile users' communities by taking into account both the contacts and the topics. More specifically, we propose a community profiling model called COCOMP, where the community labels are latent, and each social document corresponds to an information sharing activity among the most probable community members regarding the most relevant community issues. Experiment results on several social communication datasets, including emails and Twitter messages, demonstrate that the model can discover users' communities effectively and provide concrete semantics.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications—Data Mining

Keywords: Community Discovery, Email, Social Media, Collaboration, Generative Models

∗The work was done when the author was an intern at IBM Almaden Research Center.

KDD'12, August 12–16, 2012, Beijing, China. Copyright 2012 ACM 978-1-4503-1462-6/12/08.

1. INTRODUCTION

Given a large collection of social messages, we are interested in profiling a user's communities, which correspond to the user's current focus areas. A focus area is an ad-hoc community in which several users interact on certain topics. Building such a profile automatically will be helpful for a number of subsequent analytical tasks, such as helping users visualize and organize social communications, classifying new messages into corresponding focus areas, and prioritizing new messages by the user's activeness or relevance in that area.

Current developments in data mining and machine learning provide useful techniques to discover communities from text and social links. For example, topic models can extract topics discussed in documents [9, 5] and represent each topic with a number of ranked key words. On the other hand, social network analysis can identify social relationships according to communication patterns [10]. Yet many challenges remain to be addressed.

First of all, the definition of community has to be well aligned to the application. A community might be a group of people who are closely linked in a social network, or those who share common interests (but do not necessarily interact directly with each other). We believe that a semantically meaningful community has to consider both aspects, especially in a collaboration network.

Further, most existing work takes a flattened view of social linkage. More specifically, the link between a pair of users is represented by a collapsed evaluation of their relationship, such as closeness or similarity [12], or shared topics [14]. However, communities are usually more than pair-wise connections. The linkage between a pair of users may be sliced into more than one shared community. After all, nowadays it is not uncommon that employees of an organization multifunction, and some employees may collaborate on multiple projects at the same time.

In this paper, we propose a community discovery and profiling method based on an extension of the generative model [5]. A key element in this method is a latent community assignment, given which the distributions of topics and social links can be determined. The intuition is that each social message document, when created and shared, corresponds to a sharing activity within the community (both topic-wise and person-wise). More specifically, we extend topic models and assume a generative process with a latently assigned community for each document. Then, based on the assigned community, words and participants are randomly sampled from the vocabulary and the pool of people. Such a Bayesian topic model is trained by Gibbs sampling, so that based on the observed words and the people who take active roles in the social media, it can discover the most prominent topics and participants, as well as obtain best bets on the communities to be assigned.

Our solution has a number of advantages. First of all, it fills the gap of discovering multi-layered social communities, and provides a semantic description (i.e., a mixture of topics) of each layer. This can potentially help us find more meaningful communities, which could be missed by existing methods.
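To make this generative view concrete, the following toy sketch samples one social message: a community is chosen, participants join via Bernoulli trials on their per-community activeness, and each token's topic and word are drawn from the community's mixtures. All dimensions and parameter values here are hypothetical and fixed by hand; in the full model they are drawn from the Dirichlet and Beta priors described later, and then learned from data.

```python
import random

random.seed(0)

# Hypothetical toy dimensions (not from the paper):
# M communities, K topics, V word types, P people, N_WORDS tokens per message.
M, K, V, P, N_WORDS = 3, 4, 20, 6, 10

# Community-level parameters, fixed by hand for illustration instead of
# being drawn from their Dirichlet/Beta priors as in the full model.
topic_mix = [[1.0 / K] * K for _ in range(M)]    # theta_m: topic mixture
activeness = [[0.4] * P for _ in range(M)]       # eta_{m,p}: person activeness
topic_word = [[1.0 / V] * V for _ in range(K)]   # phi_k: topic-word distribution
community_weights = [1.0 / M] * M                # psi: community activeness

def generate_document():
    """Generate one social message: a community, its participants, its words."""
    c = random.choices(range(M), weights=community_weights)[0]
    # Each person joins via an independent Bernoulli trial on eta_{c,p}.
    participants = [p for p in range(P) if random.random() < activeness[c][p]]
    words = []
    for _ in range(N_WORDS):
        z = random.choices(range(K), weights=topic_mix[c])[0]   # token's topic
        w = random.choices(range(V), weights=topic_word[z])[0]  # token's word
        words.append(w)
    return c, participants, words

c, people, words = generate_document()
```

Training then inverts this process: given the observed words and participants, the latent community and topic assignments are inferred by Gibbs sampling.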
In addition, because the latent community can serve as an anchor point for mounting information shared across multiple sources and platforms, our model may be applied to other kinds of knowledge sharing systems, such as instant messaging, online discussion forums, and group wikis. With this tool, users can summarize their focus areas automatically, making it easy to manage documents, build profiles, find experts [4], and target relevant users in a social network.

The rest of the paper is organized as follows. In Section 2, we discuss the motivation of the study, the characteristics of the data, and the state of the art in related studies. We describe our model and its technical details in Section 3. We evaluate our model on real-world datasets and compare its performance with existing models in Section 4. Finally, we conclude in Section 5.

2. SOCIAL MESSAGES

In this paper, we use the term "social messages" to refer to text documents that are associated with a group of people. In the following, we overview the basic characteristics of various types of social messages, with an emphasis on their commonalities.

Social Media: A tweet (or Facebook update) is created by a user and broadcast to his or her followers (or friends), who are allowed to look at and respond to it. Usually a tweet is visible to many followers, and only a few (compared to all followers) make an active response, such as a re-tweet, forward, or comment. Those active responses indicate the interest and relevance of those followers.

Emails: We also consider the email as one type of social message, since it also involves multiple people and helps spread information. Each email has designated recipients, who are related to the message at the discretion of the sender. Compared to social media, emails tend to be more targeted by the sender, and all recipients are considered relevant to some extent.

Collaborative Content: This category includes publications (with co-authors), patents (with co-inventors), and wiki pages (with collaborators). They all include text and people, so they can be modeled with the same structure. The people whose names are declared share the same published content, which presumably represents their interest and expertise.

Among the above typical types of social messages, one feature in common is that each document is created and shared among a group of people. If we consider the people as nodes in a graph, then the social links being considered are clique-type hyper-edges in that graph. Such social linkage is different from document linkage, such as hyperlinks on blogs that link to other blogs, and references at the end of a paper that link to other papers.

In the rest of the paper, when referring to social messages, we take into account many different types of documents that can be modeled the same way. Unifying different types of social messages has the advantage of integrating different sources of information and providing more comprehensive profiles of the users.

2.1 A Motivating Example

Consider a single user's perspective. Figure 1 illustrates the collaboration network of user u, who resides at the center. All other nodes are his or her visible contacts. There is a link between two nodes if there has been any direct message exchange between them.

[Figure 1: A user's social communities. (a) Single-Layer; (b) Multi-Layer.]

Figure 1(a) shows the traditional single-layer view of pair-wise linkage, where each link is evaluated individually based on this pair only, for example by the frequency of messages, the number of shared contacts, or important topics. Figure 1(b) shows a multi-layered view of u's communities and provides a few examples of why it makes more sense than the single-layered view. Imagine that the user u is associated with three communities, A1, A2, and A3, simultaneously. u and e collaborate in communities A2 and A3 at the same time, so their connection should have two different layers that apply to different activity areas, depending on whether f or d is involved. Sometimes, a message in a community does not involve all people in that community. For example, u and e, since they work so closely, may exchange emails relevant to community A3. Without c or d being involved, such messages should still be routed to community A3 due to their relevance by content. Furthermore, u could see the linkage between f and g, since f put g as an additional recipient on some of the emails he or she writes to u. In this case, even if u and g do not exchange emails directly, A2 should include person g. These results would be missed by purely analyzing linkages.

2.2 Related Work

Discovering communities has been of interest to many previous studies. Being interested in studying collaboration networks, we find that social network analysis and topic modeling are both relevant. Specifically, social network analysis typically focuses on the closeness of social linkage [7] or the evolution of social connections [18]. Due to the complexity of the network, social links are typically simplified into a single layer of measurement, without consideration of the general context.

On the other hand, models for topic discovery from textual documents have also been extensively studied, including probabilistic Latent Semantic Analysis (pLSA) [9], Latent Dirichlet Allocation (LDA) [5], variations [8], and applications [11, 3]. Based on the words in a large corpus of documents, these models can extract human-comprehensible topics represented by lists of keywords. Since documents are commonly related to people, recent developments have taken into account the people who are related to the topics. For example, the Author-Topic (AT) model [17] considers the interests of each author across multiple documents, and aims to derive the representative topics for individual authors. The Author-Recipient-Topic (ART) model [13, 14] considers topics that are specific to each author-recipient pair. These models focus on profiling individuals or pairwise relations; however, they do not model communities directly. Other works try to augment social network analysis with topic modeling [6, 16, 19, 12], and such models can discover users' mixed membership in various topical communities. However, these models are single-layered, without considering the general context of "who are collaborating on what concurrently."

3. THE COMMUNITY MODEL

To discover the latent communities, we develop a generative model called COCOMP, which stands for COllaborator COMmunity Profiling. It attempts to discover communities in social media documents by considering context in both topics and collaboration groups. The basic rationale is to assume that each social media document corresponds to a conversation session within one community, which is defined both by topics and by participants. In other words, the topics of a social media document are derived from the community's topic mixture, and the people involved in the thread tend to be those who actively participate in that work area.

3.1 Generative Process

[Figure 2: Graphical representation of COCOMP.]

Figure 2 shows the generative process of the latent community model. Like traditional topic models, it assumes that there are K word distributions φ1:K, which correspond to K latent topics and have Dirichlet priors with hyperparameter β:

    φk | β ∼ Dirich(β), k = 1, 2, . . . , K.    (1)

Also, we assume that there are M communities, each of which has two components: a topic mixture θ and a participant mixture η. More specifically, in community m (m = 1, 2, . . . , M):

• The topic mixture θm represents the weight of different topics in this community, and has a Dirichlet distribution with hyperparameter α:

    θm | α ∼ Dirich(α).    (2)

• The participant mixture ηm represents each person's activeness in this community. In other words, ηm,p represents person p's activeness in community m (p = 1, 2, . . . , P). We assume that ηm,p has a Beta distribution with hyperparameters α0 and β0:

    ηm,p | α0, β0 ∼ Beta(α0, β0).    (3)

Finally, there is a community activeness vector ψ, which is assumed to have a Dirichlet distribution with hyperparameter μ:

    ψ | μ ∼ Dirich(μ).    (4)

Further, we assume the following generative process for this collection of D email documents. For d = 1, 2, . . . , D:

1. A latent community cd is assigned by the maximum likelihood of community membership, according to words, topics, and people:

    cd = arg max_c LLH(d, c),    (5)

where LLH(d, c) is the log likelihood of document d being assigned to community c.

2. For each person p (p = 1, 2, . . . , P), run a Bernoulli trial according to his or her personal activeness in community cd, to see whether he or she is involved. Specifically,

    id,p | η, cd ∼ Bernoulli(ηcd,p).    (6)

3. Suppose that this document has Nd tokens. For each token in the document, a word is generated in a similar fashion to LDA. Specifically, for n = 1, 2, . . . , Nd:

(a) Draw a topic assignment zd,n from the cd-th community's topic mixture:

    zd,n | θ, cd ∼ Multi(θcd).    (7)

(b) Draw a word wd,n from the zd,n-th topic-word distribution:

    wd,n | zd,n, φ ∼ Multi(φzd,n).    (8)

3.2 Model Training

We used Gibbs sampling for training the model. In this subsection, we derive the conditional posterior distributions for sampling these parameters sequentially. For document d, the posterior probability of its community assignment is

    P(cd = j | c−d) = (μj + Dj^(−d)) / (Σc μc + D − 1),    (9)

where Dj^(−d) is the number of documents (except the d-th document) that are assigned to community j. After updating the community densities, the document is assigned to the community in which it has the highest likelihood.

For the i-th token in the d-th document, which is assigned to community cd, the conditional posterior of its latent topic is

    P(zd,i = j | ·) ∝ (αj + ncd,j^−(d,i)) / (Σk αk + Σk ncd,k^−(d,i)) · (βwd,i + nj,wd,i^−(d,i)) / (Σv βv + Σv nj,v^−(d,i)),    (10)

where nj,v is the number of times a unique word type v is assigned to topic j, nc,j is the number of times a token in community c is assigned to topic j, and the superscript −(d, i) means to exclude the i-th token in the d-th document.

After the training process, each parameter is estimated from the ending state:

    P(c) = μ̂c = (μc + Dc) / (Σc μc + D),    (11)
    P(p|c) = η̂c,p = (α0 + Dc,p) / (α0 + β0 + Dc),    (12)
    P(k|c) = θ̂c,k = (αk + nc,k) / (Σk αk + nc),    (13)
    P(v|k) = φ̂k,v = (βv + nk,v) / (Σv βv + nk),    (14)

where Dc is the number of documents assigned to community c, and Dc,p is the number of times a person p is involved in community c.

4. EXPERIMENTS

In this section, we present experiments with the COCOMP model. First, we describe the overall setup of the experiments, and then show discovery results for each dataset.

4.1 Experimental Setup

First, we introduce the setup of the experiments, including data summaries, data preprocessing, implementation and platform, as well as the evaluation metric.

Datasets: Our model has been tested with various types of social messages. Given that we are mainly interested in a single user's perspective, we collected individual users' email exchanges and celebrities' Twitter interactions. Table 1 lists some basic statistics of each user's social media dataset. For each user, we include bi-directional communications. In other words, for the email datasets, we include both inbox and outbox; and for the Twitter datasets, we include tweets written by the user as well as those that mention the user (such as re-tweets).

Table 1: Basic Statistics of Example Users

    User       Source    # Docs    # Contacts
    arnold-j   Enron     2,120     2,280
    whalley-g  Enron     718       1,560
    zhouw      IBM       623       690
    hongxia    IBM       607       817
    obama      Twitter   6,134     6,890
    justin     Twitter   5,077     3,228

Preprocessing: Basic data preprocessing has been conducted on the raw social media documents, such as parsing and unwrapping HTML tags, removing stopwords, and transforming the text into bags of words. We also excluded a small number of incomplete documents. A document is considered incomplete at the preprocessing stage if it does not involve at least two different users, or if its remaining bag of words after stopword removal is empty.

Implementation: Our model has been implemented in Java, based on modifications of the MALLET [15] package. All experiments were run within Eclipse on a Dell Latitude, running 64-bit Windows 7 Professional with 8.00 GB RAM.

Evaluation Metric: Perplexity has been a common metric for evaluating language models. For community c, with word sequence wc whose length (number of tokens) is Nc, the perplexity can be computed as

    perplexity(wc) = exp{ − ln p(wc) / Nc }.    (15)

Assuming independence among words, we have

    ln p(wc) = Σ_{v=1}^{V} nc,v [ln P(v|k) + ln P(k|c)],    (16)

where P(v|k) and P(k|c) are the posterior probabilities computed at the end of training.

4.2 Enron Email Dataset

Since the Enron dataset is publicly available, the effectiveness of our model can be verified using it as a benchmark. In Table 2 we list the top topics and top people for prominent communities discovered for the user Greg Whalley, assuming that there are 30 topics and 10 communities. For each community, we assign a label corresponding to the topics and the people, which is listed in the last column.

As we can see, most communities include greg.whalley at the top of the contributor list. This means that the user is active in such communities, so his rank is higher. When building his profile, we want to know how relevant an activity area (i.e., community) is to him, so it is desirable that this user appears at the top of the contributor list for many communities, which means those communities are his major activity areas. Also, we can see that some topics may rank high in several communities, but the compositions of topics and people are different.

From another data source [2], we know that Greg Whalley was the president of Enron, John Lavorato was a CEO, and Louise Kitchen was the president of Enron Online. We do not have information for mark.frevert or liz.taylor, who might be assistants of Greg Whalley and his contacts. Their roles are consistent with the topics and communities we have discovered.

4.3 Zhouw Email Dataset

Although the Enron dataset is public, the actual social network and communities are not well-known.
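As a brief illustration of the training procedure in Section 3.2, the two collapsed Gibbs updates (Eqs. (9) and (10)) can be sketched in isolation on plain Python lists. This is a minimal sketch only: the count bookkeeping of a full sampler, and the authors' MALLET-based Java implementation, are omitted.

```python
def community_posterior(doc_counts, mu, d_community):
    """Eq. (9): posterior over communities for one document, after
    removing that document's own current assignment from the counts."""
    counts = list(doc_counts)
    counts[d_community] -= 1           # exclude document d itself
    denom = sum(mu) + sum(counts)      # sum_c mu_c + D - 1
    return [(mu[j] + counts[j]) / denom for j in range(len(counts))]

def topic_posterior(alpha, beta, n_comm_topic, n_topic_word, w):
    """Eq. (10): posterior over topics for one token of word type w in a
    document assigned to community c; the current token is assumed to be
    already excluded from both count tables."""
    scores = []
    for j in range(len(n_comm_topic)):
        comm_part = (alpha[j] + n_comm_topic[j]) / (sum(alpha) + sum(n_comm_topic))
        word_part = (beta[w] + n_topic_word[j][w]) / (sum(beta) + sum(n_topic_word[j]))
        scores.append(comm_part * word_part)
    total = sum(scores)
    return [s / total for s in scores]
```

A full sampler would draw a new value from each of these distributions, update the count tables, and sweep over all documents and tokens until the log likelihood stabilizes.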
[Table 2: Greg Whalley Communities (K = 30, C = 10). The table layout was lost in extraction; recoverable content: each listed community shows its top topics as ranked keyword lists with weights (e.g., "bill, assembl, senat, day, california"; "pleas, util, transact, review, date"; "inform, servic, avail, email, custom"; "secur, servic, bush, salomon, presid, contract"; "publish, name, memo, target, task"), its top people with weights (greg.whalley, john.lavorato, louise.kitchen, mark.frevert, liz.taylor), and a label in the last column (legislations, operations, IT, publishing, online). greg.whalley tops most contributor lists.]

[Figure 3: Zhouw communities (K = 10, C = 3). (a) Community labels: Community 1 (wt = .1975) "CS Department", Community 2 (wt = .7125) "Research Contacts", Community 3 (wt = .0900) "Interns"; top people include zhouw, jin, kimsu, kaweaver, jldicoio, rjbarber, ammartin, starkan, okoyeife, ljalali; the interns community's top topic is Topic 4 (summer, intern, genese, ca, side, location). (b) Communities' topic distributions.]
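The perplexity metric of Section 4.1 (Eqs. (15)–(16)), used in the comparisons below, can be computed as in the following sketch. As an illustrative assumption, the per-word probability is marginalized over the K topics using the estimated P(v|k) and P(k|c); the paper's Eq. (16) writes the per-word term with the topic index fixed.

```python
import math

def community_perplexity(word_counts, p_word_given_topic, p_topic_given_comm):
    """exp(-ln p(w_c) / N_c): lower is better.

    word_counts maps word type v -> count n_{c,v} within community c;
    p_word_given_topic[k][v] and p_topic_given_comm[k] are the posterior
    estimates from the end of training.
    """
    n_c = sum(word_counts.values())
    log_lik = 0.0
    for v, n in word_counts.items():
        # Marginal probability of word type v under this community.
        p_v = sum(p_word_given_topic[k][v] * p_topic_given_comm[k]
                  for k in range(len(p_topic_given_comm)))
        log_lik += n * math.log(p_v)
    return math.exp(-log_lik / n_c)
```

As a sanity check, under a uniform model over V word types the perplexity equals exactly V, matching the intuition that perplexity measures the effective size of the model's vocabulary choice.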
We can see clearly that the composition of topics, as indicated by the heights of the rectangles, is very different from one community to another. ● −614000 Log Likelihood −620000 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 200 400 600 800 1000 Iteration Figure 4: Convergence process. The widths of the rectangles represent the size of each community. In other words, they are proportional to the number of documents in each community. Clearly, a lot of the emails are related to research. Despite the fact of being a bit more complex than the plain LDA model, since we consider people in addition, our model converges reasonably fast. Figure 4 shows the convergence process for model training. Although we run 1, 000 iterations, the log likelihood begins to stabilize around the 200-th iteration. Other datasets show similar results. 392 People Topics jin .8193 qwang .6506 tpmoran .4458 zhangwe .4096 jpierce .0084 jschoudt .0840 11 1 People Topics perform, context, priorit, messag, cluster, sourc .4148 databas, crawl, email, model, request, issu .2047 17 2 interact, analytic, learn, algorithm, implement, cluster .6336 propos, profil, parameter, show, unit, technic .1142 Community 2 wt = 0.1390 Data Analytics #2: Message Prioritization People Community 9 wt = 0.1126 Data Analytics #1: User Profiling jin .9714 qwang .6000 zhangwe .4714 tpmoran .4571 caverzan .4000 basmith .2430 jin .8621 Topics lots .1552 10 .1207 applic, action, docket, patent, recommend, disclosur, .5155 leakedeots 9 decis, manag, intend, receiv, review, response .1323 dulce .1207 signin .1034 jgeagan .0690 Community 7 wt = 0.0977 Digital-Right-Management (Traitor Tracing) Community 8 wt = 0.1109 Patenting #1 Hongxia Topics 12 drm, mkb, traitor, trace, spec, workshop .5658 0 inform, discuss, week, access, plan, content .1192 Community 5 wt = 0.1026 Patenting #2 Community 3 wt = 0.0993 Global Tech Outlook People People jin .8657 rocky .1940 kashef .1045 turley .1045 Dnorthfield .0060 Mcnlly .0060 People jin .7500 
Topics jalal .0667 19 ehaber .0667 zmhou .0500 rjprill dill jin .7581 qwang .3226 Topics rocky .0968 13 .0806 patent, Input, data, applic, result, search .5733 basmith .0500 rjprill .0806 9 .1202 .0330 kashef .0650 decis, manag, intend, receiv, review, response 7 propos, gto, topic, driven, check, busi .4823 social, mobil, technolog, interest, challeng, busi .1836 Figure 5: Hongxia communities (K = 20, C = 10). there is clearly the benefit of finding communities that are interesting to the user. As for the quality of the topics, Figure 6 shows a simple comparison of perplexity for different parameter sets. The three bars on the left correspond to perplexities derived from LDA [5], using k = 3, k = 5, and k = 10, respectively. The two bars on the right represent perplexities derived from COCOMP on the same training data, using 10-topic-3-community (K = 10, C = 3), and 10-topic-5community(K = 10, C = 5), respectively. The COCOMP model performs consistently better than LDA for having lower perplexity. In return of handling more complexity by considering the people information, the communities discovered by COCOMP make better sense than those derived directly from by LDA. In order to find three communities, we run the LDA model with the parameter k = 3. The basic idea is to identify the topic mixture of each document, and assign the documents to the most important topic. Then based on such document communities, we find the most frequently involved people. The results are listed in Table 3. Table 3: Zhouw Communities by LDA (K = 3) 2500 Community 3 conference systems web data user thanks zhouw jin okoyeife kimsu kaweaver mmejias 2446.88 2446.90 2447.06 2152.90 1500 2000 2100.25 Perplexity Community 2 button valley open please silicon embedded zhouw Software okoyeife imber basmith Storage 1000 Community 1 talk project manag pm time am Software rjbarber Research vitaly jldicoio Storage 0 500 The communities, as indicated by the key words, make some sense. 
For example, Community 1 seems to be the announcement of talks and activities. The top participants include major email lists, to which general announcements are typically sent to. However, a major problem is that the interns group is not found. Despite of the saliently different topics, because there are a small number of emails in the interns community, the LDA fails to capture that community. By considering people (social contacts) in addition, K3 K5 K10 C3T10 C5T10 Figure 6: A comparison of perplexity. 393 Table 4: Hongxia Communities by LDA (K = 10) Topic 5 interact talk social analysis correl check jin tpmoran zhangwe qwang caverzan basmith Topic 3 databas perform crawl women webinar text jin tpmoran zhangwe qwang caverzan basmith Topic 1 patent docket lectur applic application seri jin rocky qwang zhangwe tpmoran basmith Topic 7 test disclosur requir dlp server miss jin qwang tpmoran zhangwe caverzan basmith 4.4 Hongxia Collaboration Dataset Topic 4 step integr week cluster servic june jin qwang tpmoran zhangwe caverzan basmith Topic 2 context messag sourc oper result priorit jin qwang jschoudt jspierce hbadenes tpmoran clustered into each of those 10 topics. For the sake of comparison, we also illustrate the top 6 communities/topics in Table 4. Again, the weight is calculated based on the number of documents clustered into each community/topic. As we can see from the table, LDA also detected the two topics/commnities about data analytics, namely Topic 5 and Topic 2. However, the top people involved in both communities are not as precise as the results from our proposed model. More significantly the data analytics activities are not detected very completely. Indeed, many are mixed up with other topics as clearly demonstrated by Topic 3, 4 and 7 in the table. The fact that topics are mixed in the detected communities also means that the documents clustered into each topic are mixed, resulting in the inaccurate weighting of the community. 
In this subsection, we apply the COCOMP model to social collaboration data from another author of this paper, collected over a three-month period. The hongxia dataset contains mostly emails, along with small amounts of data crawled from other social software at IBM, such as wikis and communities. Assuming 20 latent topics and 10 communities, the top 6 communities are shown in Figure 5.

As shown in Figure 5, the top two communities are both about Hongxia's big data analytics research, which is her main research direction. Community 2 focuses on the user and community profiling work, while Community 1 focuses on using user profiles to perform context-aware prioritization of incoming messages and updates for the user. As one can understand, these research endeavors overlap in nature. While she works heavily with three other colleagues in both areas, the other participants in the two areas differ. Our model clearly detects two overlapping communities that focus on two somewhat overlapping research activities; moreover, these are indeed her most active areas.

The second major activity for Hongxia is patenting. Our model detected two different communities in which Hongxia is involved with regard to patenting, namely Communities 8 and 5, ranked #3 and #4 among the top 6. Again, both the topics and the people in these two communities overlap; indeed, Topic 9 appears in both. Our model cleanly separates these two overlapping communities, which reflects the multiple layers of Hongxia's social links with the shared members, an example of what we illustrated in Figure 1 in Section 2. Ranked #5 among the top 6 is Community 3, which is mainly about the GTO (Global Technology Outlook) planning activities she was involved in, including proposal writing, submission, and reviewing; it consists of a different group of colleagues with whom she collaborates lightly. Lastly, Community 7 covers another light activity area during this three-month period: Digital Rights Management, an older research area in which Hongxia was heavily involved in the past. "Traitor tracing" was the main research topic in this area, as indicated by Topic 12.

For comparison, we also experimented with LDA on the same dataset. We had LDA detect the top 10 topics and cluster each document into its dominant topic, and we then derived the 10 communities for Hongxia by extracting the top people involved in the corresponding documents.

The LDA results are clearly worse. LDA ranks Topic/Community 2 last among the top 6 communities, which is incorrect. Similarly, Topics 3 and 7 are both partly about the data analytics work, but each is mixed with several of Hongxia's other actual activities. For example, Topic 7 mixes data analytics with patenting, even though these two activities have no topical overlap; as a result, the top people shown for both of these communities are the same people involved in the data analytics work. Topic/Community 1 is also a mixture of topics (again with data analytics activities), though to a lesser extent. It is mainly about patenting, but it merges the two different groups of people Hongxia worked with on patenting; in fact, some of those people are not shown at the top, while some people from the data analytics work still appear among the top 6 participants due to the topic mixture. Topic 4 is partly about "traitor tracing" but heavily mixed with other topics (mainly the data analytics activities); as a result, the correct group of people is not shown at the top at all.

4.5 Twitter Datasets

To study the effectiveness of our model on Twitter data, we chose two celebrity Twitter users and studied their tweet exchanges with other Twitter users. One account is United States President Barack Obama (Figure 7), and the other is the famous singer Justin Bieber (Figure 8). We crawled all tweets posted by these two users, as well as the reply tweets from other users, starting from November 1, 2009. In total, we collected 6,134 tweets for Barack Obama and 5,077 tweets for Justin Bieber. Twitter users employ structural conventions such as user-to-message relations (i.e., initial tweet author, via, cc, by), message types (i.e., broadcast, conversation, or retweet), and resource types (i.e., URLs, hashtags, keywords) to work within the 140-character limit.
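These structural conventions can be pulled out of raw tweet text with a few regular expressions. The snippet below is a minimal sketch, not the preprocessing pipeline used in the paper; the sample tweet and variable names are hypothetical.

```python
import re

# Hypothetical sample tweet; user handles here are illustrative only.
tweet = "RT @whitehouse: Health reform update http://wh.gov/x #hcr via @user15178"

# Extract the structural conventions described above:
# user mentions (user-to-message relations), hashtags, and URLs.
mentions = re.findall(r"@(\w+)", tweet)        # user handles after '@'
hashtags = re.findall(r"#(\w+)", tweet)        # tags after '#'
urls = re.findall(r"https?://\S+", tweet)      # embedded links
is_retweet = tweet.startswith("RT ")           # simple retweet marker

print(mentions)    # -> ['whitehouse', 'user15178']
print(hashtags)    # -> ['hcr']
print(urls)        # -> ['http://wh.gov/x']
print(is_retweet)  # -> True
```

A real pipeline would also need to handle edge cases (punctuation inside handles, shortened URLs, quoted retweets), but this suffices to separate the message structure from the message text before topic modeling.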
Since the data are quite noisy, with many URLs and acronyms, we preprocessed them by removing HTML tags and extremely infrequent words.

Figure 7: Obama communities (K = 30, C = 5).

As shown in Figure 7, when we model Obama's tweets with 30 topics and 5 communities, his communities from November 2009 to February 2010 can be roughly characterized as: President, which is about commentary on his Presidency; Public Policy, which relates to domestic and international politics and policy; Holidays, which is mainly about wishing the best to American families and friends; Senate Votes, which has to do with health bills; and Blessing Haiti, following the tremendous earthquake. It is interesting to observe that user45593 is the Twitter account whitehouse, which is expected to relate to the President regarding domestic and international policies. user7008 is Nicki Minaj, a singer who originated from Haiti; not surprisingly, she is most concerned about the tragedy that happened in her home country. user251666 is Martha Coakley, the Massachusetts Attorney General, who tops the participant list of Community 4, which is about law making.

We also show Justin Bieber's Twitter communities in Figure 8. For a pop star like Justin, it is expected that the majority of his "collaborators" are his fans. However, we are able to separate those fans into several groups, such as those who like to express emotions (Community 3) and those who hold very casual conversations (Community 4). Community 5 is mainly about Justin's broadcasts of media, such as showing his fans videos. In Communities 1 and 2, different groups of fans talk about his new album "My World" and his Golden Ticket concert, respectively.

Finally, we found that Justin Bieber appears in Barack Obama's Community 1, ranked second in the participant list, right after Obama himself. What does Justin have to do with Obama? On Justin's Wikipedia page [1], we found that "Bieber performed Stevie Wonder's Someday at Christmas for U.S. President Barack Obama and first lady Michelle Obama at the White House for Christmas in Washington, which was broadcast on December 20, 2009, on U.S. television broadcaster TNT." They are probably connected because of this event, and also through Michelle, who is also in Community 1. For that big event, as one of the most popular pop stars, with countless fans all over the country, Justin was promoted into one of the President's activity areas. It is interesting to see how two very influential Twitter users are connected with each other on Twitter.

5. CONCLUDING REMARKS

In this paper, we design a latent community model, called COCOMP, to uncover each user's communities, along with their associated topics and participants. In particular, it models each community as a mixture of topics with a corresponding group of users who collaborate on those topics. With a latent assignment of community membership, we assume that each social media document corresponds to a sharing activity within a community (both topic-wise and person-wise). Experimental results on email and social media datasets demonstrate the effectiveness of our model. For future work, our model can be extended in various ways. For example, instead of treating all people involved in a document identically, there is a need to separate active members from passive members (i.e., those who only receive the messages).
Also, since social media contents change over time, it would be meaningful to develop dynamic models that capture this evolving process, and online algorithms that monitor the changes over time.

Figure 8: Justin communities (K = 10, C = 5).

6. ACKNOWLEDGEMENTS

This research is continuing through participation in the Social Media in Strategic Communication (SMISC) program sponsored by the U.S. Defense Advanced Research Projects Agency (DARPA) under Agreement Number W911NF-12-10034.
The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies or endorsements, either expressed or implied, of any of the above organizations or any person connected with them.

7. REFERENCES

[1] http://en.wikipedia.org/wiki/Justin_Bieber, retrieved on October 16, 2011.
[2] Enron employee status. Retrieved from http://www.isi.edu/~adibi/Enron/Enron_Employee_Status.xls on May 30, 2012.
[3] A. Ahmed, E. P. Xing, W. W. Cohen, and R. F. Murphy. Structured correspondence topic models for mining captioned figures in biological literature. In KDD, pages 39–48, 2009.
[4] K. Balog and M. de Rijke. Finding experts and their details in e-mail corpora. In WWW, pages 1035–1036, 2006.
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Machine Learning Res., 3:993–1022, 2003.
[6] J. Chang, J. L. Boyd-Graber, and D. M. Blei. Connections between the lines: augmenting social networks with text. In KDD, pages 169–178, 2009.
[7] W. de Nooy, A. Mrvar, and V. Batagelj. Exploratory Social Network Analysis with Pajek. Cambridge University Press, 2005.
[8] E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. Proc. of the National Academy of Sciences, 101:5220–5227, 2004.
[9] T. Hofmann. Probabilistic latent semantic analysis. In Proc. of the Fifteenth Conf. on Uncertainty in Artificial Intelligence (UAI'99), pages 289–296. Morgan Kaufmann, 1999.
[10] T. Lappas, K. Liu, and E. Terzi. Finding a team of experts in social networks. In KDD, pages 467–476, 2009.
[11] Q. Liu, Y. Ge, Z. Li, E. Chen, and H. Xiong. Personalized travel package recommendation. In ICDM, pages 407–416, 2011.
[12] Y. Liu, A. Niculescu-Mizil, and W. Gryc. Topic-link LDA: joint models of topic and author community. In A. P. Danyluk, L. Bottou, and M. L. Littman, editors, ICML'09, volume 382, pages 665–672. ACM, 2009.
[13] A. McCallum, A. Corrada-Emmanuel, and X. Wang. Topic and role discovery in social networks. In IJCAI, pages 786–791, 2005.
[14] A. McCallum, X. Wang, and A. Corrada-Emmanuel. Topic and role discovery in social networks with experiments on Enron and academic email. J. Artif. Intell. Res., 30:249–272, 2007.
[15] A. K. McCallum. MALLET: a machine learning for language toolkit, 2002. http://mallet.cs.umass.edu.
[16] A. Qamra, B. L. Tseng, and E. Y. Chang. Mining blog stories using community-based and temporal clustering. In CIKM, pages 58–67, 2006.
[17] M. Rosen-Zvi, T. L. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proc. of the 20th Conf. in Uncertainty in Artificial Intelligence, pages 487–494, 2004.
[18] E. Zheleva, C. Park, and L. Getoor. Co-evolution of social and affiliation networks. In KDD, pages 1007–1015, 2009.
[19] D. Zhou, E. Manavoglu, J. Li, C. L. Giles, and H. Zha. Probabilistic models for discovering e-communities. In WWW, pages 173–182, 2006.