Power-law node degree distribution in online affiliation networks

i
i
“preparation” — 2010/5/29 — 10:33 — page 1 — #1
i
i
Power-law node degree distribution in online
affiliation networks
Szymon Chojnacki, Mieczyslaw Klopotek
Institute of Computer Science PAS,
J.K. Ordona 21, 01-237 Warsaw, Poland
[email protected]
Abstract. The purpose of this article is to present a preferential attachment
model adjusted to generation of bipartite graphs. The original model is able to
produce unipartite graphs with node degree distribution following power law relation. The motivation for extending classic model is the fact that multi-partite
graph topologies are becoming more and more popular in social networks. This
phenomenon raises questions about our ability to transform models that describe unipartite graphs to the new settings. In some cases bipartite structure is
given explicitly, in other cases the bipartite structure may be induced from more
complex topologies. We present both empirical results concerning node degree
distribution in real-life bipartite networks and a modified preferential attachment
model that is able to reflect the properties observed in these networks.
1
Introduction
Bipartite or affiliation networks describe a situation in which we have two types of
nodes and direct links only between nodes of different kinds. One type of nodes could
be interpreted as actors and the second as events in which actors take part [1]. From
this network we could induce two unipartite graphs containing only nodes of one type.
However, it has been shown that despite some similarities, in general the structural
properties of both representations may differ significantly [2].
More formally, a graph is an ordered pair G = (V, E) comprising a set of vertices
(or nodes) V and a set of edges (or links) E. A bipartite network (or bigraph) is a
graph which vertices can be labeled by two types A and B. The difference with classic
unipartite graph is the fact that V consists of two sets V = {VA ∪ VB : VA ∩ VB = }
and edges exist only between nodes of different types E ⊆ VA × VB . Degree of a node
is the number of its direct neighbors. Probability that a degree kv of node v of type A
is exactly z can be calculated as:
P (kv = z) =
|v ∈ VA : |{w : (v, w) ∈ E}| = z|
.
|VA |
(1)
In this article we focus on degree distributions obtained from our graph generative
model. Empirical results show that the density function of a degree distribution in real
life datasets often follows power law relation. It means that the frequency of an event
decreases at a higher rate than its size increases. Such property is obtained from a
polynomial relation:
i
i
i
i
i
i
“preparation” — 2010/5/29 — 10:33 — page 2 — #2
i
2
i
Szymon Chojnacki, Mieczyslaw Klopotek
f (x) = a · x−k .
(2)
In the above equation a and k are constants and k is called scaling exponent. The
tail in power law distribution vanishes slower than exponentially, which is described as a
heavy - tail. The shape of density function in scale invariant (or scale free) distribution
does not change when we multiply variable x by c. One can verify that:
f (cx) = a · (cx)−k = c−k f (x) ∝ f (x).
(3)
This property is visible when we take logarithm of both sides of Eq. 2. In this way we
obtain a linear relationship between variables log(f (x)) = −k log(x) + log(a) in which
k is responsible for the slope of a line drawn by the points.
The scale invariant distribution is unavailable for purely random graphs [3]. The
first model that was able to reflect this property by an atomic process was preferential
attachment model [4]. We modify this model and show that our extension is also able
to create power law degree distribution in bipartite networks.
The rest of the paper is organized as follows: Section 2 surveys the related work.
Section 3 gives results of our exploratory analysis. In Section 4 we describe in detail
the preferential attachment model for bipartite random graphs generation, we conduct
both formal mathematical analysis and show the results of experimental simulations.
We conclude and discuss the implications of our findings in the last fifth section
2
Related work
In preferential attachment model we initialize graph with a set of m0 vertices. In each
iteration new node is added to a network and is connected to p nodes. A probability that
a node already existing in a graph would be connected to a new node is proportional
to its degree [4]. This process reflects the rich gets richer phenomenon. It generates
asymptotical power law node degree distribution with scaling exponent equal to 3.
Extensions of the base model were proposed to build networks with the exponent ranging
from 2 to infinity. In winners don’t take all model probability that a node will be selected
is modified by a mixture of preferential attachment and uniform distribution [5]. The
same idea results from stochastic urn transfer model [6]. Generalization described in
[7] allows to add edges between nodes existing in a network during each iteration. The
mixture of three processes in one iteration: adding a node, adding an edge and rewiring
an edge was proposed in [1]. Copying model is an example of a graph generator that
results in power law distribution and is not based on preferential attachment [8, 9].
Results of exploratory analysis conducted on real world networks have lead to additional observations beyond node degree distribution and construction of models that
try to reflect these characteristics. Small diameter of real networks resulting from Milgram’s experiments [10], which is usually referred to as small world phenomenon or six
degrees of separation is explained by models that allow local structures to be connected
to random distant nodes [11, 12]. It has been shown that the proportion of the number of edges to the number of nodes increases over time. This densification feature was
explained by a generic Community Guided Attachment model, in which an underlying
i
i
i
i
i
i
“preparation” — 2010/5/29 — 10:33 — page 3 — #3
i
Power-law node degree distribution in online affiliation networks
i
3
hierarchical structure of nodes is considered and a probability of a link between two
nodes is inversely proportional to the distance in a tree between the two nodes [13]. It
has been shown that an effective diameter of a network decreases over time [13] and
a very powerful model to simulate this pattern, based on adjacency matrix multiplication, was proposed [14]. In [9] the users of a social network were divided into three
groups: linkers, inviters and passive, which enabled to created a model that simulates
the structure of largest connected component and middle region in a real network.
Most of the described research was focused on the analysis of unipartite graphs. Early
analysis of affiliation networks was limited by the size of available data. In [15] a network
of twenty-six chief executive officers and their membership in fifteen civic clubs and
corporate boards was analyzed. Even if row data have bipartite structure, the analysis
is usually limited to the projection onto one of the modalities. However, it has been
shown that even though some properties are common for both views (e.g. relative size of
the largest connected component), other important measures may differ (e.g. clustering
coefficient) [2]. In a model for bipartite graph generation described in [2] number of
nodes is fixed and we draw degree for each node from a predefined distribution, in the
second step the ends of generated edges are connected randomly. An iterative model for
bipartite graph generation was proposed in [16]. In each iteration one node of preselected
modality is added, its degree is drawn from a predefined distribution and for every edge
a decision is made whether to join it to existing nodes following preferential attachment
or join it to a newly created node. Empirical results show that node degree distribution
in random graphs output from this process resembles that of real - world datasets.
In this article we give empirical evidence for an observation that node degree distribution in affiliation networks induced from online social networks is power law and
present modified version of preferential attachment model, which iteratively generates
network meeting this criterion.
3
Observations
In this section we describe degree distributions obtained for two bipartite large graphs
built from three real-life datasets. The datasets are defined in the first subsections. In
the second subsection we describe the results.
3.1
BibSonomy dataset
The BibSonomy dataset was used during ECML/PKDD 2009 Discovery Challenge [17].
It contains a full snapshot of BibSonomy.org users’ activity until the end of 2008. The
users of BibSonomy(http://www.bibsonomy.org) can bookmark favorite publications
and web pages and assign tags (keywords) to saved resources. We utilized tas (tag
assignments) file which contained 401 104 tag assignments made by 3 617 user for 235
328 resources using 93 756 distinct tags. From this data we built a bipartite graph of
user - resource connections.
i
i
i
i
i
i
“preparation” — 2010/5/29 — 10:33 — page 4 — #4
i
4
3.2
i
Szymon Chojnacki, Mieczyslaw Klopotek
CiteULike dataset
The structure of main table in this dataset is analogous to the tag assignment table in
the BibSonomy dataset. The two datasets differ in magnitude and type of used resource,
in case of CiteULike (http://www.citeulike.org) we had publications instead of web
pages. We obtained additional table with assignments of CiteULike users to thematic
groups. Tas file contains 2 657 227 tag assignments made by 18 467 users for 557 101
articles by means of 166 504 tags. The second file linking groups with users contains only
2 336 groups and 5 208 users. We built a bipartite graph of user - group connections.
3.3
Degree distributions
The degree distributions of nodes of two types for the eight bipartite graphs are presented at Fig. 1 and Fig. 2. In most cases the points are shaped in a straight line on
a log-log scale, which is a necessary condition for power law distribution. We obtained
scaling exponent by solving least squares linear regression problem. We checked which
observations are overly influential with respect to DFFITS statistic [18]. For most distribution, there are outliers at the beginning and at the end of the domain. Therefore,
we considered only points from the fifth to the twenty fifth for cleaned parameters
estimation. Estimated exponents are in (1,4) interval.
BibSonomy
1 000 000
100 000
frequency
10 000
φ = 3.33
user
1 000
resource
100
φ = 1.60
10
1
1
10
100
1 000
10 000
100 000
node's degree
Fig. 1. Node degree distributions for BibSonomy dataset. A straight line of points on a log-log
scale is characteristic for power law distributions. The dotted line is obtained by estimation of
linear regression with least squares method for points between the fifth and the twenty fifth
(overly influential points are omitted). Scaling coefficients are given next to the dotted lines.
i
i
i
i
i
i
“preparation” — 2010/5/29 — 10:33 — page 5 — #5
i
i
5
Power-law node degree distribution in online affiliation networks
CiteULike
10 000
φ = 1.73
frequency
1 000
user
100
group
φ = 3.01
10
1
1
10
100
node's degree
Fig. 2. Node degree distributions for CiteULike dataset.
The presented empirical results indicate that similarly to uniparite graphs, bipartite
online graphs exhibit power-law feature of node degree distributions. This propery was
explained by preferential attachment mechanism in unipartite graphs. In the following
section we generalize the mechanism to the setting of bipartite graphs.
4
Generative procedure
Our model can be perceived as an extension of classic preferential attachment model
for bipartite graphs. In the following subsections we present the generative procedure
and check stability of estimated exponents.
4.1
Detailed description
Initially we start with a small number of edges m0 connecting m0 vertices of type A
with m0 vertices of type B. During each iteration we add a node v of type A and a
node w of type B to the graph. We connect both nodes with p and q nodes of different
types respectively (Fig. 3). We call p and q basic node degree.
The selection of the nodes that we will connect to is based on a preferential attachment rule, which states that the probability that a new edge will be connected to a
node is proportional to its degree. After t iterations the number of vertices and edges
is equal to |V (t)| = 2 · (t + m0 ) and |E(t)| = t · (p + q) + m0 . If we use ki to represent
the degree of node i, then for a bipartite graph Eq. 4 describes the relation between the
number of edges and the sums of node degrees.
i
i
i
i
i
i
“preparation” — 2010/5/29 — 10:33 — page 6 — #6
i
6
i
Szymon Chojnacki, Mieczyslaw Klopotek
q
p
A
B
Fig. 3. During each iteration one node of type A and one node of type B are added to the
graph. A node of type A is connected to p nodes of type B and a node of type B is connected to
q nodes of type A. The selection of nodes is random, based on preferential attachment principle
(probability of being selected is proportional to nodes degree).
X
i∈VA
ki =
X
kj = |E|
(4)
j∈VB
Degree ki of node i of type B changes over time with accordance to the following
differential equation:
∂ki
ki · p
=
∂t
(p + q) · t
(5)
The denominator of right side of the equation is equal to the number of edges in
graph G(t) when t >> m0 . The nominator is a multiplication of node’s degree times p.
If we sum the right side of the equation over all nodes of type B we obtain p, which is
the amount of edges that we added to nodes of type B during one iteration. Moreover
the right side is also proportional to node degree and therefore satisfies preferential
attachment rule.
4.2
Experiments
In this subsection we verify the impact of initial number of edges on the stability of
scaling exponents in graphs obtained after finite number of iterations. The results let us
suspect that power law distribution is obtained early. We simulated the above described
i
i
i
i
i
i
“preparation” — 2010/5/29 — 10:33 — page 7 — #7
i
Power-law node degree distribution in online affiliation networks
i
7
estimated scalling exponent
3
2.5
2
1.5
1
0.5
0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Initial number of edges
nodes of type A nodes of type B
frequency (log scale)
Fig. 4. Estimated scaling exponent for node degree distribution of node modalities obtained by
changing initial number of edges. All points were used for least squares regression estimation.
Node’s degree during first attachment was set as p = 2 and q = 3.
node degree (log scale)
Fig. 5. Node of type B degree distribution and estimated scaling exponents. Graph generated
after 20 000 iterations, for initial number of edges m0 = 5 and base degrees p = 2 and q = 3.
procedure with initial number of nodes varying from one to thirty. We run twenty
thousand node addition iterations for each graph.
We can see a slightly increasing trend of the value of the exponent for both modalities
(Fig. 4). It means that if the number of initial edges (and initial nodes with degree equal
i
i
i
i
i
i
“preparation” — 2010/5/29 — 10:33 — page 8 — #8
i
8
i
Szymon Chojnacki, Mieczyslaw Klopotek
to one) is higher than the generated distribution is steeper. In case of nodes of type B,
which have higher basic degree, the exponent destabilizes for smaller number of initial
edges. We verified the deviant distribution and the reason for flatter distribution shape
is the fact that some of the m0 nodes were not selected by preferential attachment rule
even once during 20 000 iterations.
Based on visual analysis of generated distributions we asserted that in all situations
we obtained a straight line shape of points (except for outliers), which is a good indicator
of power law distribution. In Fig. 5 we present the distribution for one of generated
graphs.
5
Conclusion
In this article we analysed node degree distributions in bipartite real-world graphs.
It turns out that power law relation is common in such networks. This observation
motivated us to modify classic preferential attachment model and analyze its properties
in case of bigraphs. We showed that preferential attachment mechanism can be used
to generate random affiliation networks with power-law node degree distributions. The
results of performed simulations suggest that the number of initial edges in random
graphs influences the stability of estimated scaling exponent even after several thousand
of iterations.
References
[1] R. Albert and A. Barabási, “Topology of evolving networks: Local events and universality,”
Physical Review Letters, vol. 85, no. 24, pp. 5234–5237, 2000.
[2] M. Newman, D. Watts, and S. Strogatz, “Random graph models of social networks,” P
Natl Acad Sci USA, vol. 99, pp. 2566–2572, 2002.
[3] P. Erdos and A. Renyi, “On the evolution of random graphs,” Publication of the Mathematical Institute of the Hungarian Academy of Sciences, 1960.
[4] A. Barabási and R. Albert, “Emergence of scaling in random networks,” Science (New
York, N.Y.), vol. 286, no. 5439, pp. 509–512, 1999.
[5] D. M. Pennock, G. W. Flake, S. Lawrence, E. J. Glover, and C. L. Giles, “Winners don’t
take all: Characterizing the competition for links on the web,” Proc. Natl. Acad. Sci. USA,
vol. 99, no. 8, pp. 5207–5211, 2002.
[6] M. Levene, T. I. Fenner, G. Loizou, and R. Wheeldon, “A stochastic model for the evolution of the web,” Computer Networks, vol. 39, no. 3, pp. 277–287, 2002.
[7] S. Dorogovtsev, J. Mendes, and A. Samulkin, “Structure of growing networks: Exact
solution of the barabasi-albert model,” Physical Review Letters, vol. 85, pp. 4633–4636,
2000.
[8] J. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, “The web as a
graph: Measurements, models, and methods,” Computing and Combinatorics, pp. 1–17,
1999.
[9] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal,
“Stochastic models for the web graph,” in Proceedings of the 41st Annual Symposium
on Foundations of Computer Science (FOCS), (Redondo Beach, CA, USA), pp. 57–65,
IEEE CS Press, 2000.
[10] S. Milgram, “The small-world problem,” Psychology Today, vol. 2, pp. 60–67, 1967.
i
i
i
i
i
i
“preparation” — 2010/5/29 — 10:33 — page 9 — #9
i
Power-law node degree distribution in online affiliation networks
i
9
[11] D. J. Watts and S. H. Strogatz, “Collective dynamics of ’small-world’ networks,” Nature,
vol. 393, pp. 440–442, jun 1998.
[12] B. Waxman, “Routing of mulitpoint connections,” IEEE Journal of Selected Areas in
Communications, vol. 6, pp. 1617–1622, 1988.
[13] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graphs over time: densification laws, shrinking diameters and possible explanations,” in KDD ’05: Proceeding of the eleventh ACM
SIGKDD international conference on Knowledge discovery in data mining, (New York,
NY, USA), pp. 177–187, ACM Press, 2005.
[14] J. Leskovec, D. Chakrabarti, J. M. Kleinberg, and C. Faloutsos, “Realistic, mathematically tractable graph generation and evolution, using kronecker multiplication.,” in PKDD
(A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, eds.), vol. 3721 of Lecture
Notes in Computer Science, pp. 133–145, Springer, 2005.
[15] J. Galaskiewicz, S. Wasserman, B. Rauschenbach, W. Bielefeld, and P. Mullaney, “The influence of class, status, and market position on corporate interlocks in a regional network,”
Social Forces, 1985.
[16] J.-L. Guillaume and M. Latapy, “Bipartite structure of all complex networks.,” Inf. Process. Lett., vol. 90, no. 5, pp. 215–221, 2004.
[17] F. Eisterlehner, A. Hotho, and R. Jäschke, eds., ECML PKDD Discovery Challenge 2009
(DC09), vol. 497 of CEUR-WS.org, Sept. 2009.
[18] F. Harrell, “Regression modelling strategies,” Springer, 2001.
i
i
i
i