Community Detection in Social Networks Soumyakant

Community Detection
in Social Networks
Soumyakant Priyadarshan
Department of Computer Science and Engineering
National Institute of Technology Rourkela
Rourkela-769 008, Odisha, India
Community detection in social networks
Thesis submitted in June 2013
to the Department of Computer Science and Engineering
of National Institute of Technology Rourkela
in partial fulfillment of the requirements for the degree of
Bachelor of Technology in Computer Science and Engineering
by
Soumyakant Priyadarshan
[Roll: 109CS0176]
with the supervision of
Prof. K. Sathyababu
Department of Computer Science and Engineering
National Institute of Technology Rourkela
Rourkela-769 008, Odisha, India.
June 11, 2013
Certificate
This is to certify that the thesis entitled COMMUNITY DETECTION IN SOCIAL NETWORKS, submitted by SOUMYAKANT PRIYADARSHAN (109CS0176) in partial fulfillment of the requirements for the completion of the Bachelor of Technology degree in Computer Science and Engineering at the National Institute of Technology, Rourkela, is an authentic work carried out by him under my supervision and guidance. To the best of my knowledge, neither this thesis nor any part of it has been submitted for any degree or diploma award elsewhere.
Prof. K. Sathyababu
Assistant Professor
Department of Computer Science and Engineering
NIT Rourkela
Acknowledgment
With great satisfaction and pride I present my thesis on the project undertaken as the Research Project paper during my final year, in partial fulfillment of my Bachelor of Technology degree in Computer Science and Engineering at NIT Rourkela.
I am thankful to Prof. K. Sathyababu for being the best guide and advisor for this research work in every field I have taken up to complete my requirement. His ideas and inspiration have helped me turn this nascent idea of mine into a fully-fledged project. Without his presence I might never have tasted the flavor of research work.
I am also thankful to my batch-mates for supporting me at times in the implementation part. I am also grateful to all the professors of my department for being a constant source of inspiration and motivation during the course of the project.
I would like to dedicate this project to my parents, who have always stood by me at each and every point of my life.
Once again, I am thankful to all who have been a part of my final year research project.
Soumyakant Priyadarshan
Abstract
A social network is a collection of nodes representing individuals or organisations with dyadic or binary relationships between them. It is usually represented by a graph G(V,E), where V is the set of vertices representing the individuals participating in the network and E is the set of edges representing the interactions between the vertices. Examples of social networks include scientists co-authoring a paper and employees of a company working on a common project. A community represents a group of individuals such that the frequency of interactions within the group is higher than that of the interactions between groups. The community detection problem refers to the problem of finding such groups in real world social networks. A number of methods to address this problem have been proposed earlier, and Newman distinguishes these into two categories: bottom-up sociological approaches and top-down computer science approaches. Modularity is a property of the network that measures how good a division is, in the sense that there are many edges within communities and only a few between them. In modularity based algorithms, each node of the graph is initially considered as an individual community and the communities are joined iteratively based on the increase in modularity caused by their joining. The ones producing the maximum change in modularity are joined. There are a few drawbacks associated with modularity based methods: they require information regarding the entire structure of the network, which is not possible to determine in the case of vast real world networks. Also, modularity optimization methods are not able to determine overlapping communities. In order to detect overlapping communities, clique percolation can be used. Clique percolation is based on the assumption that a community consists of fully connected subgraphs, and it detects overlapping communities by searching for adjacent cliques. But it is a hard method to implement well due to the difficulty of producing intermediate representations of percolating structures. In this project we implement the k-clique percolation method using a clique matrix and a binary matrix in order to store the intermediate percolating structure, with the aim of simplifying the implementation.
Contents
List of Figures
1 Introduction
1.1 Introduction to social network analysis
1.1.1 Social Network
1.1.2 Social Network Analysis
1.1.3 Purpose of SNA
1.1.4 Importance of Social Network Analysis
1.2 Introduction to community detection
1.2.1 Community
1.2.2 Community Detection
1.2.3 Purpose of community detection
1.3 Communities in social media
1.3.1 Two types of groups in social media
1.3.2 Is it necessary to extract groups based on network topology?
1.3.3 Importance of network interaction
2 Literature Review
2.1 Spectral bisection
2.2 Hierarchical clustering
2.3 THE MODULARITY MEASURE
2.3.1 Disadvantages
2.4 Max-Min Modularity
2.5 Clique percolation method
2.5.1 Steps of clique percolation algorithm
2.5.2 Example of clique percolation method
2.6 Issues in the existing methods
2.7 Objective
3 Algorithmic implementation
3.1 Problem formulation
3.2 Data structures used
3.3 Description of algorithm
3.3.1 Process initialization
3.3.2 Clique Detection
3.3.3 Determining connected components
4 Simulations and Results
4.1 A step by step example
4.1.1 Adjacency matrix input
4.1.2 Formation of clique matrix
4.2 Simulation
4.2.1 Simulation 1
4.2.2 Simulation 2
4.2.3 Simulation 3
4.2.4 Simulation 4
4.3 Analysis
4.4 Conclusions
4.5 Future works
5 Bibliography
List of Figures
1.1 Community Structure
2.1 Two networks with same modularity score but network in the right has more absent links than left one
2.2 A graph division and its complement
2.3 Examples of cliques
2.4 k-clique communities for k=3 and k=4
2.5 (a) Maximal cliques detected (b) Overlap matrix created (c) Binary matrix created (d) k-clique communities for k=3
4.1 Adjacency matrix of a network with 10 nodes and the cliques detected
4.2 Corresponding clique matrix (a) and binary matrix (b)
4.3 Clique size vs Number of Dolphin social network
4.4 Clique size vs time Dolphin social network
4.5 Clique size vs Number of Books about US politics
4.6 Clique size vs time of Books about US politics
4.7 Clique size vs Number of American College Football network
4.8 Clique size vs time of American College Football network
4.9 Clique size vs Number of Coauthorships in network science
4.10 Clique size vs time of Coauthorships in network science
Chapter 1
Introduction
1.1 Introduction to social network analysis
1.1.1 Social Network
The basic idea of a social network is very simple. A social network can be considered as a set of nodes representing individuals or organisations with dyadic or binary relations between them. It is usually represented by a graph G(V,E), where V is the set of nodes representing the individuals and E is the set of edges representing the interactions between them. A few examples of social networks are scientists co-authoring a paper and employees of a company working on a common project.
1.1.2 Social Network Analysis
Social network analysis (SNA) refers to the process of extracting information from a social network regarding the individuals participating in it, through mapping and measuring the relationships and flows between the nodes, which may represent people, groups, organizations, computers, URLs and other connected information/knowledge entities.
1.1.3 Purpose of SNA
• To make sense of a social network, that is, to extract information about the individuals from the network they participate in.
• To find the structure of social networks.
• To understand the evolution of social networks.
• To discover complex communication patterns and characteristic features.
1.1.4 Importance of Social Network Analysis
• Information sharing.
• Marketing in e-commerce and e-business.
• Determining influential entities.
• Building effective social and political campaigns.
• Predicting future events.
• Tracking terrorists.
• Location based crowd sourcing.
1.2 Introduction to community detection
1.2.1 Community
A community is formed by individuals such that those within a group interact with each other more frequently than with those outside the group. A network community (also sometimes referred to as a module or cluster) is typically thought of as a group of nodes with more and/or better interactions amongst its members than between its members and the remainder of the network.
Figure 1.1: Community Structure
1.2.2 Community Detection
Community detection is the process of discovering groups in a network where individuals' group memberships are not explicitly given. The problem of cluster or community detection in real world graphs, such as large social networks, web graphs and biological networks, is a problem of considerable practical interest and has received a lot of attention recently. To extract such sets of nodes, one typically chooses an objective function that captures the above intuition of a community as a set of nodes with better internal connectivity than external connectivity. Then, since the objective is typically NP-hard to optimize exactly, one employs heuristics or approximation algorithms to find sets of nodes that approximately optimize the objective function and that can be understood or interpreted as real communities. Alternatively, one might define communities operationally to be the output of a community detection procedure, hoping they bear some relationship to the intuition as to what it means for a set of nodes to be a good community. Once extracted, such clusters of nodes are often interpreted as organizational units in social networks, functional units in biochemical networks, ecological niches in food web networks, or scientific disciplines in citation and collaboration networks.
1.2.3 Purpose of community detection
• Understanding the interactions between people.
• Visualizing and navigating huge networks.
• Forming the basis for other tasks such as data mining.
• Social networks often include community groups based on common location, interests, occupation, etc. Communities are present in metabolic networks based on functional groupings. Communities are formed in citation networks based on research topic. Identifying these sub-structures within a network can provide knowledge about how network function and topology affect each other.
1.3 Communities in social media
1.3.1 Two types of groups in social media
• Explicit groups: formed by user subscriptions.
• Implicit groups: implicitly formed by social interactions.
1.3.2 Is it necessary to extract groups based on network topology?
• Not all social media websites provide a community platform.
• Not all people want to make the effort to join groups. Through community extraction, communities can be suggested to people based on their interests.
• Groups in the real world change dynamically.
• Besides social media websites, it is essential to extract communities in other networks such as citation networks, the World Wide Web and metabolic networks for various practical purposes.
1.3.3 Importance of network interaction
• Rich information about the relationships between users can be obtained by analysing network interaction, which can complement other kinds of information, e.g. user profiles.
• It provides basic information that is essential for other tasks, e.g. recommendation.
• Analysing network interaction helps in network visualization and navigation.
Chapter 2
Literature Review
2.1 Spectral bisection
A social network is usually represented by an undirected graph. The Laplacian of an undirected graph G with n vertices is given by the n × n symmetric matrix L. The diagonal element $L_{ii}$ of the matrix L represents the degree of vertex i, and the off-diagonal element $L_{ij}$ is $-1$ if vertices i and j are connected in the given graph and zero otherwise. So it can be deduced that $L = D - A$, where D is the diagonal matrix of vertex degrees and A is the adjacency matrix. The degree is $D_{ii} = \sum_j A_{ij}$. Therefore it can be easily deduced that all rows and columns of the Laplacian matrix L add up to zero. Thus the vector $\mathbf{1} = (1, 1, 1, \ldots)$ is always an eigenvector with eigenvalue zero.
If the network can be separated perfectly into communities, i.e., it can be divided into g non-overlapping groups of vertices $G_k$ ($k = 1 \ldots g$) such that there are edges only within the communities and none between communities, then the Laplacian will be block diagonal. Each diagonal block will form the Laplacian of its own component, and therefore will have an eigenvector $v_k$ with eigenvalue zero and elements $v_k(i) = 1$ if $i \in G_k$ and 0 otherwise. Thus there will be g different eigenvectors with eigenvalue 0.[12]
If the network cannot be separated perfectly into communities, then the above condition will no longer hold exactly. Generally there will be the one eigenvector $\mathbf{1}$ with eigenvalue zero, and $g - 1$ eigenvalues slightly greater than zero, since all eigenvalues of the graph Laplacian are non-negative. The corresponding eigenvectors will be given by linear combinations of the eigenvectors $v_k$ as defined above. Therefore, one should be able to find the blocks themselves, at least approximately, by looking for eigenvalues of the graph Laplacian only slightly greater than zero and taking linear combinations of the corresponding eigenvectors.
The drawback of the spectral bisection method is that it only bisects graphs, i.e., it divides the graph into two partitions. A larger number of communities can be obtained by repeated bisection, but this does not always give satisfactory results. In real world networks, we do not have any prior idea about how many communities are present and how many times the bisection should be performed.
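The zero-eigenvalue property described above can be checked numerically. The following is a minimal sketch in plain Python, assuming a hypothetical six-node graph made of two disjoint triangles; the helper names laplacian and mat_vec are illustrative and do not come from the thesis.

```python
def laplacian(adj):
    """Return L = D - A for a symmetric 0/1 adjacency matrix (list of lists)."""
    n = len(adj)
    lap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                lap[i][j] = sum(adj[i])   # degree D_ii = sum_j A_ij
            else:
                lap[i][j] = -adj[i][j]    # -1 for connected pairs, 0 otherwise
    return lap


def mat_vec(mat, vec):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


# Hypothetical example: two disjoint triangles {0,1,2} and {3,4,5}.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
n = 6
adj = [[0] * n for _ in range(n)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

lap = laplacian(adj)
ones = [1] * n
v1 = [1, 1, 1, 0, 0, 0]   # indicator vector of the first community
v2 = [0, 0, 0, 1, 1, 1]   # indicator vector of the second community

# L applied to each of these vectors gives the zero vector,
# so each one is an eigenvector with eigenvalue zero.
zero_for_ones = mat_vec(lap, ones)
zero_for_v1 = mat_vec(lap, v1)
zero_for_v2 = mat_vec(lap, v2)
```

With g = 2 perfectly separated communities, both indicator vectors are mapped to zero, which is the block-diagonal case discussed in the text. Finding the near-zero eigenvalues of an imperfectly separated network would need a numerical eigen-solver, which this sketch deliberately leaves out.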
2.2 Hierarchical clustering
In the hierarchical clustering method[12], a similarity measure that quantifies some type of similarity between node pairs is defined. Usually topological similarity is quantified. Commonly used measures are the cosine similarity, the Jaccard index, and the Hamming distance between rows of the adjacency matrix. Then similar nodes are grouped into communities according to this measure. There are several common schemes used to group similar nodes. The single linkage clustering method classifies two groups as separate if all node pairs between the groups have a similarity less than a given threshold value. In complete linkage clustering, all nodes are considered to belong to the same group if every pair of them has similarity greater than the threshold.
The advantage of the hierarchical clustering method is that it does not require the size or number of groups to be provided beforehand; therefore, it has been applied to various social networks with predefined similarity metrics, such as the modularity and betweenness measures. However, such methods are usually slow and their performance depends highly on the chosen metric.
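As an illustration of the similarity measures named above, the sketch below computes the Jaccard index, cosine similarity and Hamming distance between rows of an adjacency matrix, and then groups nodes with a simple single-linkage rule (two nodes end up in the same group if a chain of pairs with similarity at or above the threshold links them). The four-node graph, the threshold value and all function names are assumptions made for this example only.

```python
import math


def jaccard(row_a, row_b):
    """Jaccard index of the neighbour sets encoded by two 0/1 adjacency rows."""
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)
    either = sum(1 for a, b in zip(row_a, row_b) if a or b)
    return both / either if either else 0.0


def cosine(row_a, row_b):
    """Cosine similarity between two adjacency rows."""
    dot = sum(a * b for a, b in zip(row_a, row_b))
    norm = math.sqrt(sum(a * a for a in row_a)) * math.sqrt(sum(b * b for b in row_b))
    return dot / norm if norm else 0.0


def hamming(row_a, row_b):
    """Number of positions in which two adjacency rows differ."""
    return sum(1 for a, b in zip(row_a, row_b) if a != b)


def single_linkage_groups(adj, threshold, similarity=jaccard):
    """Group nodes linked by chains of pairs with similarity >= threshold."""
    n = len(adj)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if similarity(adj[i], adj[j]) >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Hypothetical 4-node graph with edges 0-2, 0-3, 1-2, 1-3.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
groups = single_linkage_groups(adj, threshold=0.5)
```

Nodes 0 and 1 have identical neighbour sets, as do nodes 2 and 3, so with this threshold the sketch separates the graph into the two groups of structurally equivalent nodes.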
2.3 THE MODULARITY MEASURE
Modularity is a property of the network that measures how good a division is, in the sense that there are many edges within communities and only a few between them. The idea is to compare the division to a randomized network with exactly the same vertices and degrees in which edges are placed randomly.[4]
Consider a particular division of a network into k communities. The division of the graph into communities can be represented by a k × k symmetric matrix e. Each element $e_{ij}$ represents the fraction of edges between the communities i and j. Thus $\sum_i e_{ii}$ gives the fraction of edges that lie within the same community, and $\sum_j e_{ij}$ gives the fraction of edges that have at least one end in community i.
• Modularity Q = (number of edges within groups) - (expected number of edges within groups).
• $Q = \sum_i (e_{ii} - a_i^2)$
• $e_{ii}$ = fraction of edges present within community i.
• $a_i = \sum_j e_{ij}$ = fraction of edges that have at least one vertex within community i.
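The modularity formula can be evaluated directly from an edge list. The sketch below assumes the usual convention in which each edge contributes half an edge end to the community at either end, so that $a_i$ is the fraction of edge ends attached to community i; the example graph (two triangles joined by a single edge) and the function name are hypothetical.

```python
def modularity(edges, community_of):
    """Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping node -> community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    inside = {}   # number of edges with both ends in community i
    ends = {}     # number of edge ends attached to community i
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
        ends[cu] = ends.get(cu, 0) + 1
        ends[cv] = ends.get(cv, 0) + 1
    # e_ii = inside / m, a_i = ends / (2m)
    return sum(inside.get(c, 0) / m - (ends[c] / (2 * m)) ** 2 for c in ends)


# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
single = {node: "A" for node in range(6)}

q_split = modularity(edges, split)     # division into the two triangles
q_single = modularity(edges, single)   # everything in one community
```

For the two-triangle division, $e_{AA} = e_{BB} = 3/7$ and $a_A = a_B = 1/2$, so $Q = 2(3/7 - 1/4) = 5/14$, while placing all nodes in a single community gives $Q = 0$.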
2.3.1 Disadvantages
• Requires information about the entire structure of the graph.
• Fails to identify communities smaller than a certain scale.
• Measures the existing links between the nodes but does not consider the absent links between the nodes in the same community.
Figure 2.1: Two networks with the same modularity score, but the network on the right has more absent links than the one on the left
2.4 Max-Min Modularity
The idea of Max-Min (MM) modularity[3] is based on the intuition that a good division of a network into communities is one in which not only the number of edges between groups is smaller than expected, but also the number of unrelated pairs within groups is smaller than expected.
Given a graph G=(V,E), where V is the set of vertices and E is the set of edges, G′=(V,E′) is said to be the complement graph of G if, for all i,j ∈ V with i ≠ j, (i,j) ∈ E′ if and only if (i,j) ∉ E.[3]
Figure 2.2: A graph division and its complement
• Max-Min modularity: $Q_{Max-Min}$ = modularity of the original graph - modularity of the complement graph.
• It measures the existing links in a community and also accounts for the absent links within the same community.
• It cannot detect overlapping communities.
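A small sketch of this difference is given below. It follows the simplified definition stated above (modularity of the graph minus modularity of its plain complement) and is not meant to reproduce every detail of the method in [3]; the example graph and all function names are assumptions for illustration.

```python
from itertools import combinations


def modularity(edges, community_of):
    """Q = sum_i (e_ii - a_i^2), with a_i the fraction of edge ends in community i."""
    m = len(edges)
    if m == 0:
        return 0.0
    inside, ends = {}, {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
        ends[cu] = ends.get(cu, 0) + 1
        ends[cv] = ends.get(cv, 0) + 1
    return sum(inside.get(c, 0) / m - (ends[c] / (2 * m)) ** 2 for c in ends)


def complement_edges(nodes, edges):
    """Edges of the complement graph: all distinct node pairs not in the edge list."""
    present = {frozenset(e) for e in edges}
    return [(u, v) for u, v in combinations(sorted(nodes), 2)
            if frozenset((u, v)) not in present]


def max_min_modularity(nodes, edges, community_of):
    """Modularity of the graph minus modularity of its complement."""
    return (modularity(edges, community_of)
            - modularity(complement_edges(nodes, edges), community_of))


# Hypothetical example: triangles {0,1,2} and {3,4,5} joined by the edge (2, 3).
nodes = range(6)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

q_mm = max_min_modularity(nodes, edges, split)
```

Here every absent link lies between the two communities, so the complement graph has modularity -1/2 for this division and the Max-Min score, 5/14 + 1/2 = 6/7, is higher than the plain modularity. A division with many absent links inside a community would be penalised instead.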
2.5 Clique percolation method
The sequential clique percolation algorithm is an efficient method of detecting overlapping communities in a network.[8, 9, 16] Let G(V,E) be a graph, where V and E represent the vertex and edge sets respectively. A subset S ⊆ V is said to be a clique if (u,v) ∈ E for all u,v ∈ S with u ≠ v.[8, 16] A clique S is said to be maximal if there exists no clique S′ such that S ⊂ S′.
• A k-clique of a graph is a subset of k vertices that is fully connected, i.e., there exists an edge between each and every pair of vertices in the subset.[8, 16]
• A clique is a fully connected subgraph of a graph.
Figure 2.3: Examples of cliques
• Two k-cliques are said to be adjacent if they share k-1 nodes.
• A k-clique community is the set of all h-cliques (h ≥ k) that are reachable from each other through a series of adjacent k-cliques.
2.5.1 Steps of clique percolation algorithm
• Maximal clique detection. A k-clique is maximal if it is not contained in any other h-clique with h ≥ k.
• Create the clique-clique overlap matrix. Each entry in the matrix indicates the number of common nodes between the respective cliques.
• Replace every off-diagonal element smaller than k-1 and every diagonal element smaller than k by 0, and every remaining element by 1.
• Extract the connected components from the matrix.
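The steps above can be sketched as follows, starting from a list of maximal cliques that is assumed to be already known. The sketch treats two cliques as connected when they share at least k-1 nodes, consistent with the adjacency definition in Section 2.5; the clique list and the function name are hypothetical.

```python
def k_clique_communities(maximal_cliques, k):
    """Clique percolation on a list of maximal cliques (each an iterable of nodes)."""
    # Cliques smaller than k cannot contain a k-clique, so they are dropped
    # (this plays the role of zeroing diagonal entries smaller than k).
    cliques = [set(c) for c in maximal_cliques if len(c) >= k]
    n = len(cliques)
    # Clique-clique overlap matrix: number of shared nodes.
    overlap = [[len(cliques[i] & cliques[j]) for j in range(n)] for i in range(n)]
    # Thresholding: off-diagonal entries need at least k-1 shared nodes.
    binary = [[1 if i != j and overlap[i][j] >= k - 1 else 0 for j in range(n)]
              for i in range(n)]
    # Connected components of the binary matrix via depth-first search.
    seen, communities = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, members = [start], set()
        seen.add(start)
        while stack:
            i = stack.pop()
            members |= cliques[i]
            for j in range(n):
                if binary[i][j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        communities.append(sorted(members))
    return sorted(communities)


# Hypothetical maximal cliques of a small network.
cliques = [{0, 1, 2, 3}, {2, 3, 4}, {4, 5, 6}, {6, 7}]
communities_k3 = k_clique_communities(cliques, 3)
communities_k4 = k_clique_communities(cliques, 4)
```

For k=3 the first two cliques share two nodes and merge, while the third shares only node 4 with them, so node 4 belongs to two communities. This overlap is exactly what modularity based methods cannot express.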
2.5.2 Example of clique percolation method
Figure 2.4: k-clique communities for k=3 and k=4
Figure 2.5: (a) Maximal cliques detected (b) Overlap matrix created (c) Binary matrix created (d) k-clique communities for k=3
2.6 Issues in the existing methods
• The existing graph partitioning methods usually require input parameters such as the number of partitions and their sizes. But it is typically not possible to know the required number of partitions, and the partitions in real world cases may not be of the same size.
• Modularity based methods require information regarding the entire structure of the network, which is not possible to determine in the case of large real world networks such as the WWW.
• Modularity based methods are also not able to determine overlapping communities, but in real world networks, entities or nodes may participate in multiple communities.
• Clique percolation methods require extra space and computation overhead for computing and storing the overlap matrix.
2.7 Objective
The clique percolation method does not require any initial input such as the number of partitions or the entire structure of the network. It is also able to detect overlapping communities in real world networks. A naive approach to implementing the clique percolation algorithm would be to generate all the maximal cliques, store them, and then compare each pair of them to find out the connectivity between them. But this would require high computational and space overhead. Our objective in this project is to use the simple backtracking algorithm given by Bron and Kerbosch[2], with minor modifications, to generate maximal cliques while avoiding the generation of sub-maximal and duplicate cliques, so that it fits our purpose of implementing the clique percolation algorithm, and to introduce minor modifications with the aim of reducing the complications in storing the intermediate percolating structures, thereby reducing the space and computation overhead.
Chapter 3
Algorithmic implementation
3.1 Problem formulation
Let G=(V,E) be a graph, where V is the set of vertices and E ⊆ V × V is the set of edges. A k-clique is a subset c ⊆ V of size k such that (i,j) ∈ E for all i,j ∈ c with i ≠ j. A k-clique community is the union of all the h-cliques, k ≤ h, that can be reached from each other through a series of adjacent k-cliques.
3.2 Data structures used
• Input: A[N][N] // adjacency matrix
• N: number of nodes in the network
• A general backtracking algorithm is used to detect maximal cliques.
• Stack S is used to keep track of detected cliques.
• A clique matrix C[][] is used instead of the overlap matrix.
• C[i][j]=1 if vertex j belongs to clique i.
• A binary matrix B[][] represents the connected cliques. B[i][j]=1 if cliques i and j are connected.
• M = A; // copy adjacency matrix to a temporary matrix
3.3 Description of algorithm
A maximal clique is a clique that is not contained within any other clique. In order to detect the maximal cliques in the input graph we use the Bron-Kerbosch[2] backtracking algorithm. It is a recursive algorithm and depends on three sets.
• Set of nodes that have already been defined as a part of the clique.
• Set of the nodes that are connected to all the nodes of the previous set.
• Set of nodes that have already led to a valid clique formation and are not to be touched again.
I have realized these three sets by using a stack S that stores the nodes of the clique currently being constructed, a set neighbour that holds the neighbours of the current node being processed, and a set processed that holds the nodes that have already been processed.
The first of the above listed sets is represented by the stack S in our implementation. The second set is computed recursively by
• $N = \mathit{neighbour}_i \cap \mathit{neighbour}_j \cap \ldots \cap \mathit{neighbour}_k$
where i, j, ... are part of the current clique and k is the current node being processed. The third set is represented by the set
• $\mathit{not} = N \cap \mathit{processed}$.
The detailed processes of initialization, clique generation and detection of the connected components are given in the subsequent subsections.
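For orientation, the three-set recursion can be written compactly with Python sets. This is a sketch of the textbook Bron-Kerbosch procedure without pivoting, not the stack and clique-matrix implementation described in this chapter; the adjacency dictionary and the names maximal_cliques and expand are assumptions made for the example.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion.

    adj: dict mapping node -> set of neighbouring nodes (undirected graph).
    """
    found = []

    def expand(current, candidates, excluded):
        # current:    nodes of the clique under construction (role of the stack S)
        # candidates: nodes adjacent to every node in current, still to be tried
        # excluded:   processed nodes adjacent to every node in current (the not set)
        if not candidates and not excluded:
            found.append(sorted(current))   # nothing can extend it: maximal
            return
        for node in sorted(candidates):
            expand(current + [node],
                   candidates & adj[node],
                   excluded & adj[node])
            candidates = candidates - {node}   # node is now processed
            excluded = excluded | {node}

    expand([], set(adj), set())
    return sorted(found)


# Hypothetical graph: triangle 0-1-2, bridge 2-3, triangle 3-4-5.
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4, 5},
    4: {3, 5},
    5: {3, 4},
}
cliques = maximal_cliques(adj)
```

When the candidate set is empty but the excluded set is not, the current clique is contained in one that was already reported, so the branch ends without output. This is how sub-maximal and duplicate cliques are avoided.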
3.3.1 Process initialization
The process is initialized by Algorithm 1, and Algorithm 2 is recursively called to generate the cliques. Algorithm 1 initializes the process by assigning the first node to the stack. It computes the neighbour set of the node and the not set by the formula $\mathit{not} = \mathit{neighbour} \cap \mathit{processed}$ and passes them as arguments to Algorithm 2. After the completion of the recursion tree for a particular node, all the maximal cliques containing that particular node are obtained. Algorithm 1 puts the node in the processed set and begins the recursion process for the next node by calling Algorithm 2. After the completion of the iteration for all the nodes in the network, all the maximal cliques have been generated and stored in the clique matrix C.
Algorithm 1: Clique percolation
input : A[N][N], where A is the adjacency matrix of the network
output: Set of maximal cliques stored in the clique matrix C and the connections between them stored in the binary matrix B
top ← -1;
Processed ← null;
neighbour ← null;
not ← null;
k ← 0;
for i ← 1 to N do
    neighbour ← {neighbours of i} - Processed;    // neighbour set initialization
    not ← Processed ∩ {neighbours of i};          // not set initialization
    top ← top + 1;
    S[top] ← i;                                   // node entered into stack
    cliqueDetect(neighbour, not);
    Processed ← Processed ∪ {i};                  // node marked as processed
end
3.3.2 Clique Detection
The algorithm recursively calculates the neighbour set N and the not set for each of the candidate nodes forwarded by Algorithm 1 and calls itself until the stopping criterion is satisfied. The algorithm stops when the sets N and not are both null. This condition shows that a maximal clique has been detected. The contents of the stack are stored in the clique matrix C and the algorithm returns one step back. If the set N is null but not is not null, the clique so formed is not maximal and is discarded.
Algorithm 2: cliqueDetect(neighbour, not)
Processed ← null;
if neighbour = null and not = null then    // termination condition satisfied
    t ← top;
    while t ≥ 0 do                         // stack content stored in clique matrix
        C[k][S[t]] ← 1;
        t ← t - 1;
    end
    k ← k + 1;                             // row number of clique matrix incremented
end
for ∀ j ∈ neighbour do
    top ← top + 1;
    S[top] ← j;
    // candidate nodes and not set passed to next level
    cliqueDetect(neighbour ∩ {neighbours of j} - Processed, not ∩ {neighbours of j});
    Processed ← Processed ∪ {j};
    not ← not ∪ {j};
end
top ← top - 1;
return;
3.3.3 Determining connected components
After all the maximal cliques have been detected and stored, Algorithm 3 computes the degree of overlapping among the cliques. The degree of overlapping between two cliques i, j is calculated by adding the products of the corresponding elements of rows i and j of the clique matrix C. The algorithm creates the binary matrix B, where B[i][j]=1 if cliques i and j have at least k-1 nodes in common, in line with the adjacency definition of Section 2.5.
Algorithm 3: connectClique()
// n: number of cliques stored in C, k: clique size parameter
for i ← 0 to n - 1 do
    for j ← i + 1 to n - 1 do
        sum ← 0;                           // degree of overlapping of cliques i and j
        for v ← 1 to N do
            sum ← sum + C[i][v] * C[j][v];
        end
        if sum ≥ k - 1 then                // at least k-1 common nodes: i, j are connected
            B[i][j] ← 1;
        end
    end
end
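The row-product computation of the overlap can be sketched in a few lines. The code below assumes, following the adjacency definition of Section 2.5, that two cliques are connected when they share at least k-1 vertices, and it fills B symmetrically for convenience; the clique list, node count and function names are hypothetical.

```python
def clique_matrix(cliques, n_nodes):
    """C[i][v] = 1 if vertex v belongs to clique i, else 0."""
    return [[1 if v in clique else 0 for v in range(n_nodes)] for clique in cliques]


def connect_cliques(c, k):
    """B[i][j] = 1 if cliques i and j share at least k-1 vertices (assumed threshold)."""
    n_cliques = len(c)
    b = [[0] * n_cliques for _ in range(n_cliques)]
    for i in range(n_cliques):
        for j in range(i + 1, n_cliques):
            # Dot product of rows i and j counts the common vertices.
            shared = sum(x * y for x, y in zip(c[i], c[j]))
            if shared >= k - 1:
                b[i][j] = b[j][i] = 1
    return b


# Hypothetical maximal cliques on 7 nodes (numbered from 0).
cliques = [{0, 1, 2, 3}, {2, 3, 4}, {4, 5, 6}]
c = clique_matrix(cliques, n_nodes=7)
b = connect_cliques(c, k=3)
```

Because the overlap is recomputed from the rows of C whenever it is needed, no separate clique-clique overlap matrix has to be stored. Only the 0/1 clique matrix and the binary matrix are kept.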
Chapter 4
Simulations and Results
4.1 A step by step example
Let us go through a step by step example of the whole process to get a clear idea of how detecting the cliques and identifying the connected components among them actually works.
4.1.1 Adjacency matrix input
Figure 4.1: Adjacency matrix of a network with 10 nodes and the cliques detected
4.1.2 Formation of clique matrix
Figure 4.2: Corresponding clique matrix (a) and binary matrix (b)
The binary matrix formed shows the cliques that are interconnected with each other for k=3.
4.2 Simulation
In order to examine the complexity of the algorithm, it was applied to four different undirected and unweighted networks with different numbers of nodes and edges. The minimum clique size was varied from 3 to 8. The number of cliques and the processing time were plotted against clique size, and the results were observed and analysed.
4.2.1 Simulation 1
• Network used: Dolphin social network[11].
• Number of nodes: 62
Figure 4.3: Clique size vs Number of Dolphin social network
Figure 4.4: Clique size vs time Dolphin social network
The algorithm was first applied to the Dolphin social network[11] with 62 nodes. It was observed that the maximum number of cliques, around 47, was obtained when k=3. The number of cliques gradually reduced as the k value was increased, and it became 0 when the k value approached 6. So the graph is sparse. The time comparison shows that the algorithm performs better when the number of cliques is larger.
4.2.2 Simulation 2
• Network used: Books about US politics[1].
• A network of books about recent US politics sold by the online bookseller Amazon.com. Edges represent frequent co-purchasing of books by the same buyers. The network was compiled by V. Krebs.
• Number of nodes: 105
Figure 4.5: Clique size vs Number of Books about US politics
Figure 4.6: Clique size vs time of Books about US politics
The algorithm was applied to a second network, Books about US politics[1], with 105 nodes. It was observed that the maximum number of cliques, around 100, was obtained when k=3. The number of cliques gradually reduced as the k value was increased, and it became 0 when the k value approached 7. So the graph is moderately dense. The time comparison shows that the algorithm performs better when the number of cliques is larger. Both the algorithms take the same time for clique detection as the same method is used, but the new method performs better in the determination of connected components.
4.2.3
Simulation 3
• Network used: American College Football [6].
• Network of American football games between Division IA colleges during
the regular season of Fall 2000.
• Number of nodes: 115
Figure 4.7: Clique size vs. number of cliques for the American College Football network
Figure 4.8: Clique size vs. time for the American College Football network
The algorithm was applied to a third network, American College Football [6],
with 115 nodes. The maximum number of cliques, around 110, was obtained when
k=3. The number of cliques gradually decreased as k was increased and approached
0 as k approached 8, but did not reach zero, so the graph is somewhat denser
than the previous network. The time comparison shows that the algorithm performs
better when the number of cliques is larger. Both algorithms take the same time
for clique detection, as the same method is used, but the new method performs
better in determining how the cliques are connected.
4.2.4
Simulation 4
• Network used: Coauthorships in network science [13].
• Coauthorship network of scientists working on network theory and experiment, as compiled by M. Newman in May 200
• Number of nodes: 1589
Figure 4.9: Clique size vs. number of cliques for Coauthorships in network science
Figure 4.10: Clique size vs. time for Coauthorships in network science
The algorithm was applied to a fourth network, Coauthorships in network
science [13], with 1589 nodes. The maximum number of cliques, around 400, was
obtained when k=3. The number of cliques gradually decreased as k was increased
and approached 0 as k approached 8, but did not reach zero. The clique-to-node
ratio is smaller in this case than for the previous network, so the graph is
more sparsely connected than the previous network. The time comparison shows
that the algorithm performs better when the number of cliques is larger. Both
algorithms take the same time for clique detection, as the same method is used,
but the new method performs better in determining how the cliques are connected.
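The clique-to-node ratios behind this comparison can be checked directly from the approximate clique counts reported for k=3 in the four simulations. The snippet below only restates those figures; the dictionary layout and variable names are illustrative.

```python
# Approximate number of cliques at k=3 and node counts, as reported in
# Simulations 1-4 (the clique counts are "around" values, not exact).
networks = {
    "Dolphin social network": (47, 62),
    "Books about US politics": (100, 105),
    "American College Football": (110, 115),
    "Coauthorships in network science": (400, 1589),
}

ratios = {name: cliques / nodes for name, (cliques, nodes) in networks.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} cliques per node")
```

With these approximate counts the coauthorship network has roughly 0.25 cliques per node, against roughly 0.96 for the football network.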
4.3
Analysis
The simulations above show that in all four cases the number of cliques detected
is largest when the clique size is three, and that it decreases as the clique
size is increased. The overlapping nature of the partition formed therefore
reduces as the clique size increases. Comparison of the simulations shows that
the algorithm performs better when the number of cliques present is larger, and
has a better time complexity. As the algorithm does not require any extra data
structure to store the cliques formed, we can conclude that it incurs less space
overhead during the computation.
4.4
Conclusions
Community detection is the problem of detecting subgraphs in a network whose
within-group edge density is higher than their between-group edge density.
Communities can overlap, as in real-world social networks, because the
individuals in the network may be involved in several communities. The k-clique
percolation method is one of the efficient methods for detecting overlapping
communities in a network. We implemented and analysed the k-clique percolation
algorithm on different networks with different network structures and varying
clique size. K-clique percolation is a hard problem to implement well, due to
the difficulty of producing intermediate representations of percolating
structures. The method is challenged by the presence of a large number of
cliques. We implemented the method using a clique matrix that stores the
generated cliques and at the same time represents the degree of overlap between
them, thereby making the representation and computation of intermediate
structures simpler.
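As an illustration of how a clique matrix can carry both the cliques and their degree of overlap, the sketch below builds a clique-clique overlap matrix and reads the k-clique communities off it. It is a minimal sketch under assumptions, not the implementation used in this work: the matrix layout (diagonal entry = clique size, off-diagonal entry = number of shared nodes) follows the description in Palla et al. [14], and the function names and toy cliques are hypothetical.

```python
def overlap_matrix(cliques):
    """M[i][j] = number of nodes shared by cliques i and j (diagonal = size)."""
    return [[len(a & b) for b in cliques] for a in cliques]

def k_clique_communities(cliques, k):
    """Merge cliques of size >= k that share at least k-1 nodes."""
    m = overlap_matrix(cliques)
    keep = [i for i in range(len(cliques)) if m[i][i] >= k]
    seen, communities = set(), []
    for start in keep:
        if start in seen:
            continue
        # Depth-first search over cliques linked by an overlap of >= k-1 nodes.
        stack, members = [start], set()
        seen.add(start)
        while stack:
            i = stack.pop()
            members |= cliques[i]
            for j in keep:
                if j not in seen and m[i][j] >= k - 1:
                    seen.add(j)
                    stack.append(j)
        communities.append(members)
    return communities

# Toy maximal cliques (hypothetical data).
cliques = [frozenset({0, 1, 2, 3}), frozenset({2, 3, 4}), frozenset({4, 5, 6})]
communities = k_clique_communities(cliques, k=3)
```

For k=3 the first two cliques share two nodes and merge into one community, while the third shares only node 4 and stays separate, so node 4 belongs to both communities. That is the overlapping behaviour the method is meant to capture.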
4.5
Future work
The k-clique percolation algorithm depends heavily on the clique detection
algorithm being used. It can be improved by using more efficient clique
detection algorithms, which is a current research area. Computation of the
degree of overlap among the cliques has O(n^2) complexity. It can be reduced by
using better data structures to store the intermediate structures.
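One possible direction, offered only as a sketch and not as a result of this work, is to index cliques by node so that overlaps are computed only for pairs of cliques that actually share a node, instead of for every pair. The function name and the toy data are hypothetical, and in the worst case (every clique sharing a common node) this still touches every pair.

```python
from collections import defaultdict
from itertools import combinations

def sparse_overlaps(cliques):
    """Return {(i, j): shared node count} for overlapping clique pairs, i < j."""
    # Inverted index: node -> indices of the cliques that contain it.
    node_to_cliques = defaultdict(list)
    for idx, clique in enumerate(cliques):
        for node in clique:
            node_to_cliques[node].append(idx)

    overlaps = defaultdict(int)
    for members in node_to_cliques.values():
        # Each shared node adds 1 to the overlap of every pair that contains it.
        for i, j in combinations(members, 2):
            overlaps[(i, j)] += 1
    return dict(overlaps)

# Toy maximal cliques (hypothetical data).
cliques = [frozenset({0, 1, 2, 3}), frozenset({2, 3, 4}), frozenset({4, 5, 6})]
overlaps = sparse_overlaps(cliques)
```

Pairs that never appear in the result have zero overlap, so non-overlapping cliques cost nothing beyond building the index.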
Chapter 5
Bibliography
[1] Books about US politics. http://networkdata.ics.uci.edu/data.php?d=polbooks.
[2] Coen Bron and Joep Kerbosch. Algorithm 457: finding all cliques of an
undirected graph. Communications of the ACM, 16(9):575–577, 1973.
[3] Jiyang Chen, Osmar R Zaïane, and Randy Goebel. Detecting communities in
social networks using max-min modularity. SDM 2009, pages 978–989, 2009.
[4] Aaron Clauset, Mark EJ Newman, and Cristopher Moore. Finding community
structure in very large networks. Physical Review E, 70(6):066111, 2004.
[5] Tanja Falkowski, Anja Barth, and Myra Spiliopoulou. Dengraph: A density-based
community detection algorithm. In Web Intelligence, IEEE/WIC/ACM
International Conference on, pages 112–115. IEEE, 2007.
[6] Michelle Girvan and Mark EJ Newman. Community structure in social
and biological networks. Proceedings of the National Academy of Sciences,
99(12):7821–7826, 2002.
[7] Enrico Gregori, Luciano Lenzini, and Simone Mainardi. Parallel k-clique
community detection on large-scale networks. 2012.
[8] Jussi M Kumpula, Mikko Kivelä, Kimmo Kaski, and Jari Saramäki. Sequential
algorithm for fast clique percolation. Physical Review E, 78(2):026109, 2008.
[9] Conrad Lee, Fergal Reid, Aaron McDaid, and Neil Hurley. Detecting highly
overlapping community structure by greedy clique expansion. arXiv preprint
arXiv:1002.1827, 2010.
[10] Zhenping Li, Shihua Zhang, Rui-Sheng Wang, Xiang-Sun Zhang, and Luonan
Chen. Quantitative function for community detection. Physical Review E,
77(3):036109, 2008.
[11] David Lusseau, Karsten Schneider, Oliver J Boisseau, Patti Haase, Elisabeth
Slooten, and Steve M Dawson. The bottlenose dolphin community of Doubtful
Sound features a large proportion of long-lasting associations. Behavioral
Ecology and Sociobiology, 54(4):396–405, 2003.
[12] Mark EJ Newman. Detecting community structure in networks. The European
Physical Journal B-Condensed Matter and Complex Systems, 38(2):321–330,
2004.
[13] Mark EJ Newman. Finding community structure in networks using the
eigenvectors of matrices. Physical Review E, 74(3):036104, 2006.
[14] Gergely Palla, Imre Derényi, Illés Farkas, and Tamás Vicsek. Uncovering the
overlapping community structure of complex networks in nature and society.
Nature, 435(7043):814–818, 2005.
[15] Clara Pizzuti. GA-Net: A genetic algorithm for community detection in social
networks. In Parallel Problem Solving from Nature–PPSN X, pages 1081–1090.
Springer, 2008.
[16] Fergal Reid, Aaron McDaid, and Neil Hurley. Percolation computation in
complex networks. In Advances in Social Networks Analysis and Mining
(ASONAM), 2012 IEEE/ACM International Conference on, pages 274–281.
IEEE, 2012.
[17] Erin N Sawardecker, Marta Sales-Pardo, and Luís A Nunes Amaral. Detection
of node group membership in networks with group overlap. The European
Physical Journal B, 67(3):277–284, 2009.
[18] Karsten Steinhaeuser and Nitesh V Chawla. Community detection in a large
real-world social network. In Social Computing, Behavioral Modeling, and
Prediction, pages 168–175. Springer, 2008.