Community Discovery and Profiling with Social Messages

Wenjun Zhou∗
University of Tennessee
247 Stokely Mgmt. Center
Knoxville, TN 37996
[email protected]

Hongxia Jin
IBM Research at Almaden
650 Harry Road
San Jose, CA 95120
[email protected]

Yan Liu
Univ. of Southern California
941 Bloom Walk
Los Angeles, CA 90089
[email protected]
ABSTRACT

Discovering communities from social media and collaboration systems has been of great interest in recent years. Existing work shows the promise of modeling content and social links, aiming at discovering social communities, whose definition varies by application. We believe that a community depends not only on the group of people who actively participate, but also on the topics they communicate about or collaborate on. This is especially true for workplace email communications. Within an organization, it is not uncommon that employees multifunction, and groups of employees collaborate on multiple projects at the same time. In this paper, we aim to automatically discover and profile users' communities by taking into account both the contacts and the topics. More specifically, we propose a community profiling model called COCOMP, where the community labels are latent, and each social document corresponds to an information sharing activity among the most probable community members regarding the most relevant community issues. Experimental results on several social communication datasets, including emails and Twitter messages, demonstrate that the model can discover users' communities effectively and provide concrete semantics.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—Data Mining

Keywords
Community Discovery, Email, Social Media, Collaboration, Generative Models

1. INTRODUCTION

Given a large collection of social messages, we are interested in profiling a user's communities, which correspond to the user's current focus areas. A focus area is an ad-hoc community, in which several users interact on certain topics. Building such a profile automatically will be helpful for a number of subsequent analytical tasks, such as helping users visualize and organize social communications, classifying new messages into corresponding focus areas, and prioritizing new messages by the user's activeness or relevance in that area.

Current developments in data mining and machine learning provide useful techniques to discover communities from text and social links. For example, topic models can extract topics discussed in documents [9, 5] and represent each topic with a number of ranked keywords. On the other hand, social network analysis can identify social relationships according to communication patterns [10]. Yet many challenges remain to be addressed. First of all, the definition of community has to be well aligned to the application. A community might be a group of people who are closely linked in a social network, or those who share common interests (but do not necessarily interact directly with each other). We believe that a semantically meaningful community has to consider both aspects, especially in a collaboration network. Further, most existing work takes a flattened view of social linkage. More specifically, the link between a pair of users is represented by a collapsed evaluation of their relationship, such as closeness or similarity [12], or shared topics [14]. However, communities are usually more than pair-wise connections. The linkage between a pair of users may be sliced into more than one shared community. After all, nowadays it is not uncommon that employees of an organization multifunction, and some employees may collaborate on multiple projects at the same time.

In this paper, we propose a community discovery and profiling method based on an extension of the generative model [5]. A key element in this method is a latent community assignment, given which the distributions of topics and social links can be determined. The intuition is that each social message document, when created and shared, corresponds to a sharing activity within the community (both topic-wise and person-wise). More specifically, we extend the topic models and assume a generative process with a latently assigned community for each document. Then, based on the assigned community, words and participants are randomly sampled from the vocabulary and the pool of people. Such a Bayesian topic model is trained by Gibbs sampling, so that, based on the observed words and the people who take active roles in the social media, it can discover the most prominent topics and participants, as well as obtain best bets on the communities to be assigned.

∗The work was done when the author was an intern at IBM Almaden Research Center.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD'12, August 12–16, 2012, Beijing, China.
Copyright 2012 ACM 978-1-4503-1462-6/12/08 ...$15.00.
Our solution has a number of advantages. First of all, it fills the gap of discovering multi-layered social communities, and provides a semantic description (i.e., a mixture of topics) of each layer. This can potentially help us find more meaningful communities, which could be missed by existing methods. In addition, because the latent community can serve as an anchor point for mounting information shared across multiple sources and platforms, our model may be applied to other kinds of knowledge sharing systems, such as instant messaging, online discussion forums and group wikis. With this tool, it is possible for a user to summarize his or her focus areas in an automatic fashion, so that it is easy to manage documents, build profiles, find experts [4] and target relevant users in a social network.
The rest of the paper is organized as follows. In Section 2, we discuss the motivation of the study, the characteristics of the data, and the state of the art in related studies. Then we describe our model in Section 3, where we provide the technical details. We evaluate our model on real-world datasets, and compare its performance with existing models, in Section 4. Finally, we conclude in Section 5.
2. SOCIAL MESSAGES
In this paper, we use the term "social messages" to refer to text documents that are associated with a group of people. In the following, we overview the basic characteristics of various types of social messages, with an emphasis on their commonalities.
Social Media A tweet (or Facebook update) is created by a user and broadcast to his or her followers (or friends), who can view and respond to it. Usually a tweet is visible to many followers, and only a few (compared to all followers) make an active response, such as a re-tweet, forward, or comment. Those active responses indicate the interest and relevance of those followers.
Emails We also consider email as one type of social message, since it also involves multiple people and helps spread information. Each email has designated recipients, who are related to the message at the discretion of the sender. Compared to social media, emails tend to be more targeted by the sender, and all recipients are considered relevant to some extent.
Figure 1: A user's social communities. (a) Single-Layer; (b) Multi-Layer.
Collaborative Content This category includes publications (with co-authors), patents (with co-inventors), and wiki pages (with collaborators). They all include text and people, so they can also be modeled with the same structure. The people, whose names are declared, share the same published content, which presumably represents their interests and expertise.

Among the above typical types of social messages, one common feature is that each document is created and shared among a group of people. If we consider the people as nodes in a graph, then the social links being considered are clique-type hyper-edges in that graph. Such social linkage is different from document linkage, such as hyperlinks on blogs that point to other blogs, or references at the end of a paper that link to other papers.

In the rest of the paper, when referring to social messages, we take into account many different types of documents that can be modeled in the same way. Unifying different types of social messages has the advantage of integrating different sources of information and providing more comprehensive profiles of the users.

2.1 A Motivating Example

Consider a single user's perspective. Figure 1 illustrates the collaboration network of user u, who resides at the center. All other nodes are his or her visible contacts. There is a link between two nodes if there has been any direct message exchange between them.

Figure 1(a) shows the traditional single-layer view of pair-wise linkage, where each link is evaluated individually based on that pair only, for example, by the frequency of messages, the number of shared contacts, or important topics.

Figure 1(b) shows a multi-layered view of u's communities and provides a few examples of why it makes more sense than the single-layered view. Imagine that the user u is associated with three communities, A1, A2 and A3, simultaneously. u and e collaborate in communities A2 and A3 at the same time, so their connection should have two different layers that apply to different activity areas, depending on whether f or d is involved. Sometimes, a message in a community does not involve all people in that community. For example, u and e, since they work so closely, may exchange emails relevant to community A3 without c or d being involved; such messages should still be routed to community A3 due to their relevance by content. Furthermore, u could see the linkage between f and g, since f puts g as an additional recipient in some of the emails he or she writes to u. In this case, even if u and g do not exchange emails directly, A2 should include person g. These results would be missed by purely analyzing linkages.
2.2 Related Work
Figure 2: Graphical representation of COCOMP.
Discovering communities has been of interest in many previous studies. Being interested in studying collaboration networks, we find that social network analysis and topic modeling are both relevant. Specifically, social network analysis typically focuses on the closeness of social linkage [7], or the evolution of social connections [18]. Due to the complexity of the network, social links are typically simplified into a single layer of measurement, without consideration of the general context. On the other hand, models for topic discovery from textual documents have also been extensively studied, including probabilistic Latent Semantic Analysis (pLSA) [9], Latent Dirichlet Allocation (LDA) [5], variations [8] and applications [11, 3]. Based on the words in a large corpus of documents, these models can extract human-comprehensible topics represented by a list of keywords.

Since documents are commonly related to people, recent developments have taken into account people who are related to the topics. For example, the Author-Topic (AT) model [17] considers the interests of each author across multiple documents, and aims to derive the representative topics for individual authors. The Author-Recipient-Topic (ART) model [13, 14] considers topics that are specific to each author-recipient pair. These models focus on profiling individuals or pairwise relations; however, they do not model communities directly. Other works try to augment social network analysis with topic modeling [6, 16, 19, 12], and such models can discover users' mixed membership in various topical communities. However, these models are single-layered, without considering the general context of "who are collaborating on what concurrently".

3. THE COMMUNITY MODEL

To discover the latent communities, we develop a generative model, called COCOMP, which stands for COllaborator COMmunity Profiling. It attempts to discover communities in social media documents by considering context in both topics and collaboration groups. The basic rationale is to assume that each social media document corresponds to a conversation session within one community, which is defined both by topics and by participants. In other words, the topics of a social media document are derived from the community's topic mixture, and the people involved in the thread tend to be those who actively participate in the work area.

3.1 Generative Process

Figure 2 shows the generative process of the latent community model. Like traditional topic models, it is assumed that there are K word distributions φ1:K, which correspond to K latent topics and are assumed to be Dirichlet distributions with prior β:

    φk | β ∼ Dirich(β), k = 1, 2, . . . , K.   (1)

Also, we assume that there are M communities, each of which has two components: a topic mixture θ and a participant mixture η. More specifically, in community m (m = 1, 2, . . . , M):

• The topic mixture θm represents the weight of different topics in this community, and has a Dirichlet distribution with hyperparameter α:

    θm | α ∼ Dirich(α).   (2)

• The participant mixture ηm represents each person's activeness in this community. In other words, ηm,p represents person p's activeness in community m (p = 1, 2, . . . , P). We assume that ηm,p has a Beta distribution with hyperparameters α0 and β0:

    ηm,p | α0, β0 ∼ Beta(α0, β0).   (3)

Finally, there is a community activeness vector ψ, which is assumed to have a Dirichlet distribution with hyperparameter μ:

    ψ | μ ∼ Dirich(μ).   (4)

Further, we assume the following generative process for the collection of D email documents. For d = 1, 2, . . . , D:

1. A latent community cd is assigned by the maximum likelihood of community membership, according to words, topics and people:

    cd = arg max_c LLH(d, c),   (5)

where LLH(d, c) is the log likelihood of document d being assigned to community c.

2. For each person p (p = 1, 2, . . . , P), run a Bernoulli trial according to his or her personal activeness in community cd, to see whether he or she is involved. Specifically,

    id,p | η, cd ∼ Bernoulli(ηcd,p).   (6)

3. Suppose that this document has Nd tokens. For each token in the document, a word is generated in a fashion similar to LDA. Specifically, for n = 1, 2, . . . , Nd:

(a) Draw a topic assignment zd,n from the cd-th community's topic mixture:

    zd,n | θ, cd ∼ Multi(θcd).   (7)

(b) Draw a word wd,n from the zd,n-th topic-word distribution:

    wd,n | zd,n, φ ∼ Multi(φzd,n).   (8)
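The generative process above can be simulated as follows. This is a minimal illustrative sketch: the sizes and hyperparameter values are invented for the example, and the document's community is drawn from ψ here, whereas step 1 of the model assigns it by maximum likelihood during inference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumptions, not the paper's settings).
M, K, V, P = 3, 10, 500, 20   # communities, topics, vocabulary size, people
alpha, beta = 0.5, 0.1        # Dirichlet priors for topic and word distributions
alpha0, beta0 = 1.0, 1.0      # Beta prior for per-person activeness
mu = 1.0                      # Dirichlet prior for community proportions

# Community-level parameters.
phi = rng.dirichlet(np.full(V, beta), size=K)     # topic-word distributions (Eq. 1)
theta = rng.dirichlet(np.full(K, alpha), size=M)  # per-community topic mixtures (Eq. 2)
eta = rng.beta(alpha0, beta0, size=(M, P))        # per-community person activeness (Eq. 3)
psi = rng.dirichlet(np.full(M, mu))               # community proportions (Eq. 4)

def generate_document(n_tokens=50):
    """Sample one social message: a community, its participants, and its words."""
    c = rng.choice(M, p=psi)                      # latent community for this document
    involved = rng.random(P) < eta[c]             # Bernoulli trial per person (Eq. 6)
    z = rng.choice(K, size=n_tokens, p=theta[c])  # topic per token (Eq. 7)
    words = np.array([rng.choice(V, p=phi[k]) for k in z])  # word per token (Eq. 8)
    return c, involved, words

c, involved, words = generate_document()
```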
3.2 Model Training

We use Gibbs sampling to train the model. In this subsection, we derive the conditional posterior distributions for sampling these parameters sequentially.

For document d, the posterior probability for its community assignment is

    P(cd = j | c−d) = (μj + Dj^(−d)) / (Σ_{c=1..C} μc + D − 1),   (9)

where Dj^(−d) is the number of documents (excluding the d-th document) that are assigned to community j. After updating the community densities, the document is assigned to the community in which it has the highest likelihood.

For the i-th token in the d-th document, which is assigned to community cd, the conditional posterior for its latent topic is

    P(zd,i = j | ·) ∝ [(β_{wd,i} + n_{j,wd,i}^−(d,i)) / (Σv βv + Σv n_{j,v}^−(d,i))] · [(αj + n_{cd,j}^−(d,i)) / (Σk αk + Σk n_{cd,k}^−(d,i))],   (10)

where n_{j,v} is the number of times a unique word type v is assigned to topic j, n_{c,j} is the number of times a token in community c is assigned to topic j, and the superscript −(d, i) means to exclude the i-th token in the d-th document.

After the training process, each parameter is estimated from the ending state:

    P(c) = (μc + Dc) / (Σc μc + D),   (11)
    P(p|c) = (α0 + Dc,p) / (α0 + β0 + Dc),   (12)
    P(k|c) = (αk + nc,k) / (Σk αk + nc),   (13)
    P(v|k) = (βv + nk,v) / (Σv βv + nk),   (14)

where Dc,p is the number of times a person p is involved in community c.

4. EXPERIMENTS

In this section, we present experiments with the COCOMP model. First, we describe the overall setup of the experiments, and then show discovery results from each dataset.

4.1 Experimental Setup

First, we introduce the setup of the experiments, including data summaries, data preprocessing, implementation and platform, as well as the evaluation metric.

Datasets Our model has been tested with various types of social messages. Given that we are mainly interested in a single user's perspective, we collected individual users' email exchanges and celebrities' Twitter interactions. Table 1 lists some basic statistics of each user's social media dataset. For each user, we include bi-directional communications. In other words, for the email datasets, we include both inbox and outbox; and for the Twitter datasets, we include tweets written by the user as well as those that mention the user (such as re-tweets).

Table 1: Basic Statistics of Example Users

    User      | Source  | # Docs | # Contacts
    arnold-j  | Enron   | 2,120  | 2,280
    whalley-g | Enron   | 718    | 1,560
    zhouw     | IBM     | 623    | 690
    hongxia   | IBM     | 607    | 817
    obama     | Twitter | 6,134  | 6,890
    justin    | Twitter | 5,077  | 3,228

Preprocessing Basic data preprocessing has been conducted on the raw social media documents, such as parsing and unwrapping HTML tags, removing stopwords, and transforming the text into bags of words. We also excluded a small number of incomplete documents. A document is considered incomplete at the preprocessing stage if it does not involve at least two different users, or if its remaining bag of words after stopword removal is empty.

Implementation Our model has been implemented in Java, based on modifications of the MALLET [15] package. All experiments are run within Eclipse on a Dell Latitude, running 64-bit Windows 7 Professional with 8.00GB RAM.

Evaluation Metric Perplexity has been a common metric for evaluating language models. For community c, with word sequence wc, whose length (number of tokens) is Nc, the perplexity can be computed as

    perplexity(wc) = exp{ −ln p(wc) / Nc }.   (15)

Assuming independence among words, we have

    ln p(wc) = Σ_{v=1..V} nc,v [ln P(v|k) + ln P(k|c)],   (16)

where P(v|k) and P(k|c) are the posterior probabilities computed at the end of the training.

4.2 Enron Email Dataset

Since the Enron dataset is publicly available, the effectiveness of our model can be verified using it as a benchmark. In Table 2 we list the top topics and top people for prominent communities discovered for the user Greg Whalley, assuming that there are 30 topics and 10 communities. For each community, we assign a label corresponding to the topics and the people, which is listed in the last column.

As we can see, most communities include greg.whalley at the top of the contributor list. This means that the user is active in such communities, so his rank is higher. When building his profile, we want to know how relevant an activity area (i.e., community) is to him, so it is desirable that this user appears at the top of the contributor list for many communities, which means the communities are his major activity areas. Also, we can see that some topics may rank high in several communities, but the compositions of topics and people are different.

From another data source [2], we know that Greg Whalley was the president of Enron, John Lavorato was a CEO, and Louise Kitchen was the president of Enron Online. We do not have information for mark.frevert or liz.taylor, who might be assistants of Greg Whalley and his contacts. Their roles are consistent with the topics and communities we have discovered.

4.3 Zhouw Email Dataset

Although the Enron dataset is public, the actual social network and communities are not well known, and there is
Table 2: Greg Whalley Communities (K = 30, C = 10)

Community 2 (Label: legislations)
  Topics: 1 bill, assembl, senat, day, california (.1335); 15 pleas, util, transact, review, date (.0612); 23 today, number, follow, transact, ani (.0511)
  People: greg.whalley (.5806), john.lavorato (.2500), louise.kitchen (.2016)

Community 5 (Label: operations)
  Topics: 4 pleas, meet, eb, system, expens (.0942); 23 today, number, follow, transact, ani (.0371); 25 inform, servic, avail, email, custom (.0226)
  People: greg.whalley (.6738), mark.frevert (.2032), liz.taylor (.1497)

Community 6 (Label: IT)
  Topics: 18 secur, servic, bush, salomon, presid, contract (.0744); 25 inform, servic, avail, email, custom, click (.0370); 16 may, includ, oper, account, result, rais (.0240)
  People: greg.whalley (.4300), louise.kitchen (.1500), mark.frevert (.1200)

Community 7 (Label: publishing online)
  Topics: 12 publish, name, memo, target, task (.1913); 8 itext, img, onlin, valu, pleas (.0863); 22 corpor, com, best, onli, monday (.0089)
  People: greg.whalley (.4746), liz.taylor (.1864), louise.kitchen (.1441)
Figure 3: Zhouw communities (K = 10, C = 3). (a) Community Labels: Community 1 (wt = .1975, "CS Department"), Community 2 (wt = .7125, "Research Contacts") and Community 3 (wt = .0900, "Interns"), each listing its top topics (e.g., Topic 4: summer, intern, genese, ca, side, location, at .5846) and top people with weights. (b) Communities' Topic Distributions: a mosaic plot over topics T1-T10 and communities C1-C3.
no gold standard to check the exact correctness of the important communities found by COCOMP. As a result, we ran COCOMP on collections of our own email datasets, which are presented below.

First, we look at results on the zhouw dataset. This dataset contains one of the authors' emails sent and received during her internship at IBM. Assuming 10 topics and 3 communities, we find the communities shown in Figure 3.

As we can see in Figure 3(a), the three communities discovered are quite clear. The first community is the CS department in general, the second community is the smaller research unit, and the third community has to do with the intern group. Both the people and the keywords make sense. For example, in the intern community, the top contacts are other interns, who share similar weights in this community. The top-ranked topic, Topic 4, clearly shows that "summer" and "intern" are the top keywords, and "genese" is the summer intern coordinator.

Figure 3(b) is a mosaic plot of the topic distribution in each community. Each color represents a topic, and each column represents a community. We can see clearly that the composition of topics, as indicated by the heights of the rectangles, is very different from one community to another.
Figure 4: Convergence process (log likelihood vs. iteration; the y-axis ranges from about −620,000 to −608,000, and the x-axis from 200 to 1,000 iterations).
The widths of the rectangles represent the size of each community; in other words, they are proportional to the number of documents in each community. Clearly, a lot of the emails are related to research.

Despite being a bit more complex than the plain LDA model, since we consider people in addition, our model converges reasonably fast. Figure 4 shows the convergence process for model training. Although we ran 1,000 iterations, the log likelihood begins to stabilize around the 200-th iteration. Other datasets show similar results.
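The training loop behind this convergence plot sweeps over documents and resamples each document's latent community. A minimal sketch of the prior part of that community-resampling step (in the spirit of Eq. 9): the sizes and prior value are invented for the example, and a full implementation would also multiply in the likelihood of the document's words and participants under each community, which is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: D documents, C communities, symmetric prior mu (assumptions).
D, C, mu = 100, 5, 1.0
assignments = rng.integers(0, C, size=D)        # current community of each document
counts = np.bincount(assignments, minlength=C)  # documents per community

def resample_community(d):
    """One collapsed-Gibbs step for document d's community (prior factor only)."""
    counts[assignments[d]] -= 1                 # exclude document d from the counts
    p = (mu + counts) / (C * mu + D - 1)        # posterior over communities
    assignments[d] = rng.choice(C, p=p / p.sum())
    counts[assignments[d]] += 1

for _ in range(10):                             # a few sweeps over all documents
    for d in range(D):
        resample_community(d)
```

Tracking the joint log likelihood after each sweep yields a convergence curve like Figure 4.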
Figure 5: Hongxia communities (K = 20, C = 10). The top six communities, by weight: Community 2 (wt = 0.1390, "Data Analytics #2: Message Prioritization"; top topic: perform, context, priorit, messag, cluster, sourc), Community 9 (wt = 0.1126, "Data Analytics #1: User Profiling"; top topics include: interact, analytic, learn, algorithm, implement, cluster and propos, profil, parameter, show, unit, technic), Community 8 (wt = 0.1109, "Patenting #1"; top topic: applic, action, docket, patent, recommend, disclosur), Community 5 (wt = 0.1026, "Patenting #2"; top topic: patent, input, data, applic, result, search), Community 3 (wt = 0.0993, "Global Tech Outlook"; top topic: propos, gto, topic, driven, check, busi), and Community 7 (wt = 0.0977, "Digital-Right-Management (Traitor Tracing)"; top topic: drm, mkb, traitor, trace, spec, workshop). Each box also lists its top people with activeness weights; jin ranks first in every community.
In return for handling more complexity by considering the people information, the communities discovered by COCOMP make better sense than those derived directly from LDA. In order to find three communities, we run the LDA model with the parameter k = 3. The basic idea is to identify the topic mixture of each document, and assign each document to its most important topic. Then, based on such document communities, we find the most frequently involved people. The results are listed in Table 3.

Table 3: Zhouw Communities by LDA (K = 3)

Community 1: talk, project, manag, pm, time, am
  People: Software, rjbarber, Research, vitaly, jldicoio, Storage
Community 2: button, valley, open, please, silicon, embedded
  People: zhouw, Software, okoyeife, imber, basmith, Storage
Community 3: conference, systems, web, data, user, thanks
  People: zhouw, jin, okoyeife, kimsu, kaweaver, mmejias

The communities, as indicated by the keywords, make some sense. For example, Community 1 seems to be the announcements of talks and activities; the top participants include major email lists, to which general announcements are typically sent. However, a major problem is that the interns group is not found. Despite the saliently different topics, because there are only a small number of emails in the interns community, LDA fails to capture that community. By considering people (social contacts) in addition, there is clearly the benefit of finding communities that are interesting to the user.

As for the quality of the topics, Figure 6 shows a simple comparison of perplexity for different parameter sets. The three bars on the left correspond to perplexities derived from LDA [5], using k = 3, k = 5, and k = 10, respectively. The two bars on the right represent perplexities derived from COCOMP on the same training data, using 10 topics and 3 communities (K = 10, C = 3) and 10 topics and 5 communities (K = 10, C = 5), respectively. The COCOMP model performs consistently better than LDA, having lower perplexity.

Figure 6: A comparison of perplexity (LDA: K3 = 2446.88, K5 = 2446.90, K10 = 2447.06; COCOMP: C3T10 = 2152.90, C5T10 = 2100.25).
Table 4: Hongxia Communities by LDA (K = 10)
Topic 5
interact
talk
social
analysis
correl
check
jin
tpmoran
zhangwe
qwang
caverzan
basmith
Topic 3
databas
perform
crawl
women
webinar
text
jin
tpmoran
zhangwe
qwang
caverzan
basmith
Topic 1
patent
docket
lectur
applic
application
seri
jin
rocky
qwang
zhangwe
tpmoran
basmith
Topic 7
test
disclosur
requir
dlp
server
miss
jin
qwang
tpmoran
zhangwe
caverzan
basmith
4.4 Hongxia Collaboration Dataset
Topic 4
step
integr
week
cluster
servic
june
jin
qwang
tpmoran
zhangwe
caverzan
basmith
Topic 2
context
messag
sourc
oper
result
priorit
jin
qwang
jschoudt
jspierce
hbadenes
tpmoran
clustered into each of those 10 topics. For the sake of comparison, we also illustrate the top 6 communities/topics in
Table 4. Again, the weight is calculated based on the number of documents clustered into each community/topic.
As we can see from the table, LDA also detected the two
topics/commnities about data analytics, namely Topic 5 and
Topic 2. However, the top people involved in both communities are not as precise as the results from our proposed
model. More significantly the data analytics activities are
not detected very completely. Indeed, many are mixed up
with other topics as clearly demonstrated by Topic 3, 4 and
7 in the table. The fact that topics are mixed in the detected
communities also means that the documents clustered into
each topic are mixed, resulting in the inaccurate weighting
of the community. As one can see, LDA result ranks the
Topic/Community 2 as the last of the top 6 communities.
This rank is incorrect. Similarly, both Topics 3 and 7 are
partly about data analytics work but each mixed up with
several different actual activities for Hongxia. For example,
Topic 7 mixed data analytics with patenting. But these two
activities have no overlapping in terms of topics. As a result, the top people shown for these two communities are
still those same people involved in the data analytics activity area. Topic/Community 1 is also a mix of topics (mixing
with data analytics activities), but to a less extent. It is
mainly about patenting. However it mixes the two different
groups of people that user Hongxia involved in working with
on patenting. In fact some of those people are not shown on
top. Instead some of the people involved in data analytics
still appear in top 6 participants in this community due to
topic mixture. Topic 4 is partly about ”traitor tracing” but
heavily mixed with other topics (mainly with the data analytics activities). As a result, the correct group of people
involved are not even shown on top.
In this subsection, we apply the COCOMP model on social collaboration data from another author of the paper during a 3-month period. The hongxia dataset contains mostly
emails, but also small amounts of data crawled from other
social software in IBM, such as wiki and communities. The
top 6 communities are shown in Figure 5, assuming 20 latent
topics and 10 communities.
As shown in Figure 5, the top 2 communities are both
about her big data analytics research activities. Data analytics is her main research direction. Community 2 is focused
on the user and community profiling work while community
1 is focused on using user profile to perform context aware
prioritization on incoming messages/updates for the user.
As one can understand, these research endeavors overlap in
the research nature. While she heavily works with 3 other
colleagues in both these two areas, some others involved in
these two areas are different. Our model clearly detects two
overlapping communities that focus on two somewhat overlapping research activities. Moreover, these two activity areas are indeed her most focused activities.
The second major activity for Hongxia is related to patenting. Our model detected two different communities that
Hongxia is involved with regarding patenting, namely community 8 and community 5, ranked #3 and #4 respectively
in top 6. Again both the topics and the people involved
in these two communities overlap. Indeed both Topic 9 is
shown in these two communities. Our model managed to
clearly detect these two overlapping communities. This indicates the multi-layers of Hongxia’s social links with those
overlapping members in these two communities, an example
of what we illustrated in Figure 1 in Section 2.
Ranked #5 among the top 6 communities detected by our model is Community 3. It is mainly about the GTO (Global Technology Outlook) planning activities in which she is involved, including proposal writing, submission, and reviewing. This community consists of a different group of colleagues with whom she collaborates lightly.
Lastly, Community 7 covers another light activity area in this 3-month period: Digital Rights Management, an older research area in which Hongxia was heavily involved in the past. "Traitor tracing" was the main research topic in this area, as indicated by Topic #12.
For comparison purposes, we also experiment with LDA on the same dataset. We use LDA to detect the top 10 topics and cluster each document under its dominant topic. We then derive 10 communities for Hongxia by extracting the top people involved in the corresponding documents.
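The LDA baseline described above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' actual pipeline; the function name, the pairing of each document with its list of people, and the parameter choices are all assumptions made for exposition.

```python
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_baseline_communities(docs, doc_people, n_topics=10, top_k=5):
    """Cluster documents by their dominant LDA topic, then form one
    'community' per topic from the most frequent people appearing in
    that topic's documents."""
    X = CountVectorizer(stop_words="english").fit_transform(docs)
    doc_topic = LatentDirichletAllocation(
        n_components=n_topics, random_state=0).fit_transform(X)
    dominant = doc_topic.argmax(axis=1)  # dominant topic per document
    communities = {}
    for t in range(n_topics):
        counts = Counter(p for d, people in enumerate(doc_people)
                         if dominant[d] == t for p in people)
        communities[t] = [p for p, _ in counts.most_common(top_k)]
    return communities
```

Note that, unlike COCOMP, this baseline has no notion of a community's own user distribution: people are attached to a topic cluster only after the fact, which is why heavily mixed topics can surface the wrong group of people.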
4.5 Twitter Datasets
To study the effectiveness of our model on Twitter data, we choose two celebrity Twitter users and examine their tweet exchanges with other Twitter users. One account is United States President Barack Obama (Figure 7), and the other is the famous singer Justin Bieber (Figure 8). We crawled all the tweets posted by these two users, as well as the replied-to tweets by other users, starting from November 1, 2009. As a result, we collected 6,134 tweets for Barack Obama and 5,077 tweets for Justin Bieber. Twitter users follow certain structural conventions, such as user-to-message relations (i.e., initial tweet author, via, cc, by), types of message (i.e., broadcast, conversation,
or retweet messages), and types of resources (i.e., URLs, hashtags, keywords) to overcome the 140-character limit. Since the data are quite noisy with many URLs and acronyms, we preprocessed the data by removing HTML tags and extremely infrequent words.

Figure 7: Obama communities (K = 30, C = 5). [Figure omitted; it shows five communities with their weights, top topics, and top participants: President (wt = .1656), Public Policy (wt = .2617), Holidays (wt = .2412), Senate Votes (wt = .1704), and Blessing Haiti (wt = .1611).]

As shown in Figure 7, if we model Obama's tweets with 30 topics and 5 communities, his communities from November 2009 to February 2010 can be roughly represented as: President, which comments on his presidency; Public Policy, which relates to domestic and international politics and policy; Holidays, which mainly wishes the best for American families and friends; Senate Votes, which has to do with health bills; and Blessing Haiti, following the tremendous earthquake. It is interesting to observe that user45593 is the Twitter account whitehouse, which is expected to relate to the President regarding domestic and international policies. user7008 is Nicki Minaj, a singer who originated from Haiti; not surprisingly, she is most concerned about the tragedy that happened in her home country. user251666 is Martha Coakley, the Massachusetts Attorney General, who tops the list of participants in Community 4, which is about lawmaking.

We also show Justin Bieber's Twitter communities in Figure 8. For a pop star like Justin, it is expected that the majority of his "collaborators" are his fans. However, we are able to group those fans into several communities, such as those who like to express emotions (Community 3) and those who hold very casual conversations (Community 4). Community 5 is mainly about Justin's broadcasts of media, such as showing his fans some videos. In Communities 1 and 2, different groups of fans were talking about his new album "My World" and his Golden Ticket concert, respectively.

Finally, we found that Justin Bieber appeared in Barack Obama's Community 1, where he was ranked second in the participant list, right after Obama himself. What does Justin have to do with Obama? On Justin's Wikipedia page [1], we found that "Bieber performed Stevie Wonder's Someday at Christmas for U.S. President Barack Obama and first lady Michelle Obama at the White House for Christmas in Washington, which was broadcast on December 20, 2009, on U.S. television broadcaster TNT." They are probably connected because of this event, and also through Michelle, who is also in Community 1. As an extremely popular pop star with countless fans all over the country, Justin was thus promoted into one of the President's activity areas. For two very influential Twitter users, it is interesting to see how they are connected with each other on Twitter.
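The tweet preprocessing used in this section (stripping HTML tags and dropping extremely infrequent words) could be sketched along these lines; the regular expressions and the frequency threshold are illustrative assumptions, not the authors' exact choices.

```python
import re
from collections import Counter

def preprocess_tweets(tweets, min_count=2):
    """Remove HTML tags, lowercase and tokenize each tweet, then drop
    words appearing fewer than `min_count` times in the collection."""
    tag = re.compile(r"<[^>]+>")
    tokenized = [re.findall(r"[a-z#@][\w#@']*", tag.sub(" ", t).lower())
                 for t in tweets]
    freq = Counter(w for toks in tokenized for w in toks)
    return [[w for w in toks if freq[w] >= min_count] for toks in tokenized]
```

A corpus-wide frequency cutoff like this is a common way to remove the long tail of noisy URLs and acronyms without maintaining an explicit stop list.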
5. CONCLUDING REMARKS
In this paper, we design a latent community model, called
COCOMP, to uncover the communities of each user as well
as their associated topics and communities. In particular, it
models each community as a mixture of topics with a corresponding group of users who collaborate together on these
topics. With a latent assignment of community membership,
we assume that each social media document corresponds to
a sharing activity within a community (both topic-wise and
person-wise). Experiment results on email and social media
datasets demonstrate the effectiveness of our model.
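The modeling assumption summarized above — each document is generated by first picking a community, which then supplies both its topics and its participants — can be illustrated with a small generative sketch. All dimensions, distributions, and variable names here are illustrative assumptions, not the model's actual specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the paper's settings):
C, K, V, U = 3, 5, 50, 20   # communities, topics, vocabulary, users

pi    = rng.dirichlet(np.ones(C))           # a user's community weights
theta = rng.dirichlet(np.ones(K), size=C)   # per-community topic mixture
phi   = rng.dirichlet(np.ones(V), size=K)   # per-topic word distribution
psi   = rng.dirichlet(np.ones(U), size=C)   # per-community user distribution

def generate_document(n_words=10, n_people=3):
    """One 'sharing activity': pick a community, then draw the document's
    words (through the community's topics) and its participants from that
    same community."""
    c = rng.choice(C, p=pi)
    topics = rng.choice(K, size=n_words, p=theta[c])
    words = [rng.choice(V, p=phi[z]) for z in topics]
    people = list(rng.choice(U, size=n_people, replace=False, p=psi[c]))
    return c, words, people
```

The key point of the sketch is that words and people are coupled through the single community draw `c`, which is what distinguishes this family of models from running topic modeling and social network analysis separately.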
For future work, our model can be extended in various ways. For example, instead of treating all people involved identically, there is a need to separate active members from passive members (i.e., those who only receive the messages). Also, since social media content changes over time, it is worthwhile to develop dynamic models that capture the evolving process and online algorithms that monitor the changes over time.
Figure 8: Justin communities (K = 10, C = 5). [Figure omitted; it shows five communities with their weights, top topics, and top participants: Emotions (wt = .2138), Conversations (wt = .1916), Media (wt = .2189), Album (wt = .2039), and The Show (wt = .1718).]

6. ACKNOWLEDGEMENTS
This research is continuing through participation in the Social Media in Strategic Communication (SMISC) program sponsored by the U.S. Defense Advanced Research Projects Agency (DARPA) under Agreement Number W911NF-12-1-0034. The views and conclusions contained herein are those of the authors and should not be interpreted as representing the official policies or endorsements, either expressed or implied, of any of the above organizations or any person connected with them.
7. REFERENCES
[1] http://en.wikipedia.org/wiki/Justin_Bieber, retrieved on October 16th, 2011 at around 5pm.
[2] Enron employee status. Retrieved from http://www.isi.edu/adibi/Enron/Enron_Employee_Status.xls on May 30th, 2012 at around 9am.
[3] A. Ahmed, E. P. Xing, W. W. Cohen, and R. F. Murphy. Structured correspondence topic models for mining captioned figures in biological literature. In KDD, pages 39-48, 2009.
[4] K. Balog and M. de Rijke. Finding experts and their details in e-mail corpora. In WWW, pages 1035-1036, 2006.
[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. J. Machine Learning Res., 3:993-1022, 2003.
[6] J. Chang, J. L. Boyd-Graber, and D. M. Blei. Connections between the lines: augmenting social networks with text. In KDD, pages 169-178, 2009.
[7] W. de Nooy, A. Mrvar, and V. Batagelj. Exploratory Social Network Analysis with Pajek. Cambridge University Press, 2005.
[8] E. Erosheva, S. Fienberg, and J. Lafferty. Mixed-membership models of scientific publications. Proc. of the National Academy of Sciences, 101:5220-5227, 2004.
[9] T. Hofmann. Probabilistic latent semantic analysis. In Proc. of the Fifteenth Conf. on Uncertainty in Artificial Intelligence (UAI'99), pages 289-296. Morgan Kaufmann, 1999.
[10] T. Lappas, K. Liu, and E. Terzi. Finding a team of experts in social networks. In KDD, pages 467-476, 2009.
[11] Q. Liu, Y. Ge, Z. Li, E. Chen, and H. Xiong. Personalized travel package recommendation. In ICDM, pages 407-416, 2011.
[12] Y. Liu, A. Niculescu-Mizil, and W. Gryc. Topic-link LDA: joint models of topic and author community. In A. P. Danyluk, L. Bottou, and M. L. Littman, editors, ICML'09, volume 382, pages 665-672. ACM, 2009.
[13] A. McCallum, A. Corrada-Emmanuel, and X. Wang. Topic and role discovery in social networks. In IJCAI, pages 786-791, 2005.
[14] A. McCallum, X. Wang, and A. Corrada-Emmanuel. Topic and role discovery in social networks with experiments on Enron and academic email. J. Artif. Intell. Res., 30:249-272, 2007.
[15] A. K. McCallum. MALLET: A machine learning for language toolkit, 2002. http://mallet.cs.umass.edu.
[16] A. Qamra, B. L. Tseng, and E. Y. Chang. Mining blog stories using community-based and temporal clustering. In CIKM, pages 58-67, 2006.
[17] M. Rosen-Zvi, T. L. Griffiths, M. Steyvers, and P. Smyth. The author-topic model for authors and documents. In Proc. of the 20th Conf. in Uncertainty in Artificial Intelligence, pages 487-494, 2004.
[18] E. Zheleva, C. Park, and L. Getoor. Co-evolution of social and affiliation networks. In KDD, pages 1007-1015, 2009.
[19] D. Zhou, E. Manavoglu, J. Li, C. L. Giles, and H. Zha. Probabilistic models for discovering e-communities. In WWW, pages 173-182, 2006.