Social Media and Social Computing

3.3 Network-Centric Community Detection
 A Unified Process
 Comparison
– Spectral clustering essentially tries to minimize the number of edges
between groups.
– Modularity considers how much the number of edges within groups
exceeds the number expected at random.
– Spectral partitioning is forced to split the network into
approximately equal-size clusters.
3.4 Hierarchy-Centric Community Detection
 Hierarchy-centric methods
– build a hierarchical structure of communities based on network
topology
– two types of hierarchical clustering
• Divisive
• Agglomerative
 Divisive Clustering
– 1. Put all objects in one cluster
– 2. Repeat until all clusters are singletons
• a) choose a cluster to split
 what criterion?
• b) replace the chosen cluster with the sub-clusters
 split into how many?
– A Method: Cut the “weakest” tie
• At each iteration, find out the weakest edge.
 This kind of edge is most likely to be a tie connecting two communities.
• Remove the edge.
 Once a network is decomposed into two connected components, each component
is considered a community.
• Update the strength of links.
• This iterative process is applied to each community to find sub-communities.
– “Finding and evaluating community structure in networks,” M.
Newman and M. Girvan, Physical Review E, 2004
• find the weak ties based on “edge betweenness”
• Edge betweenness
 the number of shortest paths between pairs of nodes that pass along the edge
 utilized to find the “weakest” tie for hierarchical clustering
C_B(e(v_i, v_j)) =
  { Σ_{v_s, v_t ∈ V, s < t} σ_st(e(v_i, v_j)) / σ_st    if i < j
  { 0                                                   if i = j
  { C_B(e(v_j, v_i))                                    if i > j
• where
 𝜎𝑠𝑡 is the total number of shortest paths between nodes 𝑣𝑠 and 𝑣𝑡
 𝜎𝑠𝑡 (𝑒(𝑣𝑖 , 𝑣𝑗 )) is the number of shortest paths between nodes 𝑣𝑠 and 𝑣𝑡 that pass
along the edge 𝑒(𝑣𝑖 , 𝑣𝑗 ).
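The definition above can be sketched in code. The following is a minimal Python implementation of edge betweenness for a small unweighted, undirected graph, using a Brandes-style accumulation; the adjacency-dict layout and function name are assumptions for illustration, not from the paper:

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness C_B(e) for an unweighted undirected graph.

    adj: dict mapping each node to an iterable of its neighbors.
    Returns a dict mapping frozenset({u, v}) to its betweenness score.
    """
    cb = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # BFS from s: shortest-path counts (sigma) and predecessor lists
        dist, sigma = {s: 0}, {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies onto edges, farthest nodes first
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                cb[frozenset((v, w))] += c
                delta[v] += c
    # Each unordered pair {s, t} was counted from both endpoints
    return {e: c / 2 for e, c in cb.items()}
```

On two triangles joined by a single edge, that bridge gets the highest score (all 3 × 3 cross pairs route through it, giving 9), so it would be the first edge removed.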
– An edge with high betweenness tends to be a bridge between two
communities
– The algorithm progressively removes the edges with the highest
betweenness.
– Negatives of divisive clustering
• the edge-betweenness-based scheme is computationally expensive
• removing a single edge leads to the recomputation of betweenness for
all remaining edges
 Agglomerative Clustering
– begins with base (singleton) communities
– merges them into larger communities according to a certain criterion.
• One example criterion: modularity
 Let e_ij be the fraction of edges in the network that connect nodes in community i
to those in community j
 Let a_i = Σ_j e_ij; then the modularity is Q = Σ_i (e_ii − a_i²)
 values approaching Q = 1 indicate networks with strong community structure
 values for real networks typically fall in the range from 0.3 to 0.7
 intuition: (number of edges within the same community) − (number of edges
between different communities)
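As a sketch (not from the chapter), Q can be computed directly from an edge list and a community assignment; the symmetric convention below places half of each between-community edge in e_ij and half in e_ji, so a within-community edge lands fully on e_ii:

```python
def modularity(edges, community):
    """Q = sum_i (e_ii - a_i^2) for an undirected graph.

    edges: list of (u, v) pairs; community: dict node -> community label.
    """
    m = len(edges)
    labels = sorted(set(community.values()))
    idx = {lab: i for i, lab in enumerate(labels)}
    k = len(labels)
    e = [[0.0] * k for _ in range(k)]   # e[i][j]: fraction of edges between i and j
    for u, v in edges:
        i, j = idx[community[u]], idx[community[v]]
        e[i][j] += 1 / (2 * m)          # half the edge in each direction
        e[j][i] += 1 / (2 * m)
    a = [sum(row) for row in e]         # a_i = sum_j e_ij
    return sum(e[i][i] - a[i] ** 2 for i in range(k))
```

For two triangles joined by one bridge edge, grouping each triangle as a community gives Q = 2 · (3/7 − 0.25) ≈ 0.357, inside the typical 0.3-0.7 range.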
– Two communities are merged if the merge results in the largest increase
of overall modularity
– The merge continues until no merge can be found to improve the
modularity.
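A minimal sketch of this greedy merging, in the spirit of Newman's fast agglomerative algorithm (the data layout, tie-breaking, and stopping rule are my own assumptions): each merge picks the connected pair of communities with the largest modularity gain ΔQ = 2(e_ij − a_i·a_j) and stops once no merge improves Q.

```python
def greedy_modularity(edges):
    """Greedy agglomerative modularity maximization for a simple
    undirected graph. Returns the final communities as a list of sets."""
    m = len(edges)
    nodes = {v for edge in edges for v in edge}
    # e[i][j] (i != j): half the fraction of edges between communities i and j
    # e[i][i]: fraction of edges inside community i
    e = {v: {} for v in nodes}
    a = {v: 0.0 for v in nodes}          # a[i] = sum_j e[i][j]
    for u, v in edges:
        e[u][v] = e[u].get(v, 0.0) + 1 / (2 * m)
        e[v][u] = e[v].get(u, 0.0) + 1 / (2 * m)
        a[u] += 1 / (2 * m)
        a[v] += 1 / (2 * m)
    members = {v: {v} for v in nodes}    # community label -> node set
    while True:
        best_dq, pair = 1e-12, None      # only strictly positive gains merge
        for i in e:
            for j in e[i]:
                if i < j:                # only adjacent communities can raise Q
                    dq = 2 * (e[i][j] - a[i] * a[j])
                    if dq > best_dq:
                        best_dq, pair = dq, (i, j)
        if pair is None:
            break
        i, j = pair
        # merge community j into i: the i-j edges become internal to i
        e[i][i] = e[i].get(i, 0.0) + e[j].get(j, 0.0) + 2 * e[i][j]
        for k, w in e[j].items():
            if k in (i, j):
                continue
            e[i][k] = e[i].get(k, 0.0) + w
            e[k][i] = e[k].get(i, 0.0) + w
            del e[k][j]
        del e[i][j]
        del e[j]
        a[i] += a[j]
        del a[j]
        members[i] |= members.pop(j)
    return list(members.values())
```

On the two-triangles-plus-bridge graph, the greedy merges recover the two triangles and then stop, because merging them across the bridge would decrease Q.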
Dendrogram according to Agglomerative Clustering based on Modularity
– In the dendrogram, the circles at the bottom represent the individual
nodes of the network.
– As we move up the tree, the nodes join together to form larger and
larger communities, as indicated by the lines, until we reach the top,
where all are joined together in a single community.
– Alternatively, the dendrogram depicts an initially connected network
splitting into smaller and smaller communities as we go from top to
bottom.
– A cross section of the tree at any level, such as the one indicated by the
dotted line, gives the communities at that level.
 Divisive vs. Agglomerative Clustering
– Zachary's karate club study
Zachary observed 34 members of a karate club over a period of two years.
During the course of the study, a disagreement developed between the
administrator (node 34) of the club and the club's instructor (node 1),
which ultimately resulted in the instructor leaving and starting a new
club, taking about half of the original club's members with him.
– Divisive
• “Community structure in social and biological networks,” Michelle
Girvan and M. E. J. Newman, PNAS, 2002  Using edge betweenness
– Agglomerative
• “Fast algorithm for detecting community structure in networks,” M. E.
J. Newman, Physical Review E, 2004  Using modularity
Dendrograms of the karate club network produced by divisive and agglomerative clustering
Summary of Community Detection
 Node-Centric Community Detection
– cliques, k-cliques, k-clubs
 Group-Centric Community Detection
– quasi-cliques
 Network-Centric Community Detection
– Clustering based on vertex similarity
– Latent space models, block models, spectral clustering, modularity
maximization
 Hierarchy-Centric Community Detection
– Divisive clustering
– Agglomerative clustering
3.5 Community Evaluation
 Here, we consider a “Social Network with Ground Truth”
– Community membership for each actor is known  an ideal case
– For example,
• Synthetic networks generated based on predefined community
structures
 L. Tang and H. Liu. “Graph mining applications to social network analysis.” In C.
Aggarwal and H. Wang, editors, Managing and Mining Graph Data, chapter 16,
pages 487–513. Springer, 2010
• Some well-studied tiny networks like Zachary’s karate club with 34
members
 M. Newman. “Modularity and community structure in networks.” PNAS,
103(23):8577–8582, 2006.
 Simple comparison between the ground truth and the identified
community structure
– Visualization
– One-to-one mapping
 The number of communities after grouping can be different from
the ground truth
 No clear community correspondence between clustering result
and the ground truth
How to measure the
clustering quality?
Each number denotes a node, and each circle or block denotes a community
1) Both communities {1, 3} and {2} map to the community {1, 2, 3} in the ground truth
2) Node 2 is wrongly assigned
 Normalized Mutual Information (NMI) can be used
 Entropy
– Measures the uncertainty of a random variable
– Measure of disorder
– The amount of information contained in a random variable X (or in a
distribution X)

H(X) = − Σ_{x∈X} p(x) log_b p(x)

• The entropy of X is the sum, over all possible outcomes x of X, of the
probability of x times the log of the reciprocal of that probability
• Common choices for the base b are 2, Euler's number e, and 10. With b = 2
the unit of entropy is the bit, with b = e the nat, and with b = 10 the digit.
 Entropy and coin tossing [from Wikipedia]
– Consider tossing a fair coin, where heads and tails are equally likely.
Since there are only two outcomes, H and T, the entropy is 1:
– H(X) = −Σ_{x∈X} p(x) log_2 p(x) = −(1/2 · log_2(1/2) + 1/2 · log_2(1/2)) = 1
– For an unfair coin, one side is more likely than the other, so the entropy
is less than 1. Because we can predict the outcome more reliably, the
amount of information, i.e., the entropy, is smaller. Among all coins, a
fair coin, where each side comes up with probability 1/2, has the largest
entropy.
– Entropy can thus be regarded as the same concept as uncertainty.
– The higher the uncertainty, the more information there is and the larger
the entropy.
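The coin calculation can be checked directly with a small sketch (`entropy` is a hypothetical helper; base b = 2, so the unit is bits):

```python
from math import log

def entropy(probs, b=2):
    """H(X) = -sum_x p(x) * log_b p(x); zero-probability terms contribute 0."""
    return -sum(p * log(p, b) for p in probs if p > 0)
```

A fair coin gives H = 1 bit, the maximum for two outcomes; a biased coin such as [0.9, 0.1] gives less, since its outcome is easier to predict.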
 Mutual Information
– Measures the amount of information shared between two random
variables (or two distributions)
– Measures how closely related two random variables (or two
distributions) X and Y are, or how much they depend on each other
– References in Korean
• http://shineware.tistory.com/7
• http://www.dbpia.co.kr/Journal/ArticleDetail/339089
 Normalized Mutual Information (NMI)
– Measures the amount of information shared between two random
variables (or two distributions)
– Measures how closely related two random variables (or two
distributions) X and Y are
– Its value is between 0 and 1
 Treating a partition as a random variable, we can compute the
matching quality between the ground truth and the identified
clustering
 NMI Example (1/2)
– Partition a (π^a): [1, 1, 1, 2, 2, 2]  communities {1, 2, 3} and {4, 5, 6}
– Partition b (π^b): [1, 2, 1, 3, 3, 3]  communities {1, 3}, {2}, and {4, 5, 6}
 NMI Example (2/2)
– Partition a (π^a): [1, 1, 1, 2, 2, 2], with community sizes n^a_h
– Partition b (π^b): [1, 2, 1, 3, 3, 3], with community sizes n^b_l
– n_{h,l}: number of nodes shared by the h-th community of π^a and the
l-th community of π^b
– NMI(π^a, π^b) = 0.8278
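One common form of NMI normalizes the mutual information by the geometric mean of the two partitions' entropies. A sketch follows; since the slide does not give the formula, this normalization choice is an assumption, though it does reproduce the 0.8278 above:

```python
from collections import Counter
from math import log2, sqrt

def nmi(a, b):
    """Normalized mutual information between two partitions.

    a, b: equal-length lists of community labels, one per node.
    """
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    joint = Counter(zip(a, b))   # n_{h,l}: nodes shared by community h of a, l of b
    mi = sum((c / n) * log2(n * c / (ca[x] * cb[y]))
             for (x, y), c in joint.items())

    def h(counts):               # entropy of a partition, in bits
        return -sum((c / n) * log2(c / n) for c in counts.values())

    return mi / sqrt(h(ca) * h(cb))
```

Identical partitions give NMI = 1; for the example partitions a and b the value is ≈ 0.8278.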
 Accuracy of Pairwise Community Memberships
– Consider all the possible pairs of nodes and check whether they reside
in the same community
– An error occurs if
• Two nodes belonging to the same community are assigned to different
communities after clustering
• Two nodes belonging to different communities are assigned to the same
community
– Construct a contingency table
– Ground truth: {1, 2, 3} and {4, 5, 6}
– Clustering result: {1, 3}, {2}, and {4, 5, 6}
– Accuracy = (4 + 9) / (4 + 2 + 9 + 0) = 13/15 ≈ 0.87
– Balanced Accuracy (BAC) = 1 − Balanced Error Rate (BER)
• BAC = (1/2) (a/(a+c) + d/(b+d)) = 1 − BER
• BER = (1/2) (c/(a+c) + b/(b+d))
• where, over all node pairs, a = pairs in the same community in both the
ground truth and the clustering, b = pairs together only in the clustering,
c = pairs together only in the ground truth, d = pairs separated in both
• This measure assigns equal importance to “false positives” and
“false negatives”, so that trivial or random predictions incur an error
of 0.5 on average.
– For the example above:
BAC = (1/2) (a/(a+c) + d/(b+d)) = (1/2) (4/6 + 9/9) ≈ 0.83
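The counts a, b, c, d and both measures can be reproduced with a short sketch (function and variable names are my own):

```python
from itertools import combinations

def pair_counts(truth, pred):
    """Count node pairs by agreement: a = same community in both,
    b = same only in pred, c = same only in truth, d = same in neither."""
    a = b = c = d = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t, same_p = truth[i] == truth[j], pred[i] == pred[j]
        if same_t and same_p:
            a += 1
        elif same_p:        # together in clustering, apart in ground truth
            b += 1
        elif same_t:        # together in ground truth, split by clustering
            c += 1
        else:
            d += 1
    return a, b, c, d

truth = [1, 1, 1, 2, 2, 2]   # ground truth: {1,2,3}, {4,5,6}
pred = [1, 2, 1, 3, 3, 3]    # clustering:   {1,3}, {2}, {4,5,6}
a, b, c, d = pair_counts(truth, pred)
accuracy = (a + d) / (a + b + c + d)
bac = 0.5 * (a / (a + c) + d / (b + d))
```

Here (a, b, c, d) = (4, 0, 2, 9), so accuracy = 13/15 ≈ 0.87 and BAC = (1/2)(4/6 + 9/9) ≈ 0.83, matching the worked numbers above.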
 Evaluation without Ground Truth
– This is the most common situation
– Quantitative evaluation functions: modularity
• Once we have a network partition, we can compute its modularity
• The method with higher modularity wins
• modularity
 Let e_ij be the fraction of edges in the network that connect nodes in community
i to those in community j
 Let a_i = Σ_j e_ij; then the modularity is Q = Σ_i (e_ii − a_i²)
 values approaching Q = 1 indicate networks with strong community structure
 values for real networks typically fall in the range from 0.3 to 0.7
 intuition: (number of edges within the same community) − (number of edges
between different communities)
Book Available at
• Morgan & Claypool Publishers
• Amazon
If you have any comments, please feel free to contact:
• Lei Tang, Yahoo! Labs, [email protected]
• Huan Liu, ASU, [email protected]