
A SURVEY PAPER ON ESTIMATING THE ONLINE SOCIAL
NETWORKING SITE
Deepti Bhagwani*, Setu Kumar Chaturvedi
*[email protected]
Department of Computer Science Engineering
Technocrat Institute of Technology Bhopal
Rajiv Gandhi Proudyogiki Vishwavidyalaya Bhopal M.P. India
Abstract:
A social network is a set of people, organizations or other social entities connected by a set of social
relationships such as friendship, co-working or information exchange. Estimating the server load of
online social network sites is among the most challenging research topics in network management.
In addition, the web servers of online social networks are located in various countries. This paper
provides the necessary background on online social networking sites and on the techniques that are
applied to social networking sites to maintain them.
Keywords: Online Social Media, Online Social Networking, Human Factor, Information System.
Introduction
With the advance of the Internet age, online social networks have shown rapid growth. The Internet
has become a means of interacting with people for business communication as well as personal
contact, and social networks use the Internet to provide strong bonds between individuals.
Nowadays there are various resources through which interested individuals can become part of the
online social networking community. Online social networks (LSNW 2010), (FS 2013) provide a
powerful reflection of the structure and dynamics of the society of the 21st century and of the
interaction of the Internet generation with both technology and people. Indeed, the intense growth
of social multimedia and user-generated content is revolutionizing all phases of the content value
chain, including production, processing, distribution and consumption. It has also brought to the
multimedia sector a new, previously underestimated and now critical aspect of science and
technology: online social interaction and networking. The significance of this rapidly evolving
research field is clearly evidenced by the many associated emerging technologies and applications,
including online content sharing services and communities, multimedia communication through the
Internet, online social multimedia search, interactive services, health care, entertainment, and
security applications. This has generated a new research area called Online Social Multimedia
Computing, in which well-known and established computing and multimedia networking technologies
are brought together with emerging social media research.
OSN Internet services are changing the way we communicate with others, entertain ourselves and
live. Social networking is one of the primary reasons that many people have become avid Internet
users, including people who, until the emergence of social networks, could not find anything of
interest on the web. It is a strong indicator of what is actually happening online. The Web 2.0 era
has passed, leaving great power in the hands of end users.
Nowadays, users both produce and consume significant quantities of multimedia content. Moreover,
these behaviors, when combined with online social networking, have formed a new Internet era in
which multimedia content sharing through social networking sites (SNSs) is a daily practice. More
than 200 SNSs of worldwide impact are known today, and this number is growing rapidly. Many of the
existing top web sites are either pure SNSs or offer some social networking capabilities (Rupam
Some 2013). Apart from the well-known “first tier” social networks with hundreds of millions of
users spread throughout the world, there are also many small social networking sites that are
equally popular within the more limited geographical scope of their membership, which may be a
city, a country or a continent.
ONLINE SOCIAL NETWORKING SITES
Social networking sites now reach 82 percent of the world’s online population, which represents
1.2 billion users around the world. The adoption of social networking sites has largely mirrored
the global Internet adoption curve and developed proportionately, showing that as soon as people
began to get connected, they began connecting with one another. An even more striking feature of
social networking’s emergence is the amount of time people currently spend on it. As a percentage
of the time people spend online, social networking activity has more than tripled in the last few
years. In October 2011, social networking was ranked as the most popular content category in
worldwide engagement, accounting for 19 percent of total time spent online. Nearly 1 in every 5
minutes spent online is now spent on social networking sites, a stark contrast with March 2007,
when the category accounted for only 6 percent of time spent online. Time spent on social
networking sites increased during this period by taking share predominantly from web-based email
and instant messengers, reflecting its emergence as the primary communication channel for users.
Clearly, social networking has evolved over the years to become an integral part of the global
online experience, in various ways both mirroring and augmenting the offline social experience
(SW 2011).
Fig 1: The Rise of the Global Social Networking Audience
Source: comScore Media Metrix Worldwide, March 2007 – October 2011
Fig 2: Time Spent Online on Key Internet Categories
Source: comScore Media Metrix Worldwide, March 2007 – October
A social network is a set of people, organizations or other social entities connected by a set of
social relationships such as friendship, working together or exchange of information. Social
network analysis focuses on the pattern of relationships among people, organizations and other
social entities. This section provides an overview of different social networking sites (Rupam
Some 2013).
Flickr
Flickr is a photo-sharing site based on a social network. The Flickr network contains over 1.8
million users and 22 million links. Flickr (M. Molloy and B. Reed 1995) is an image and video
hosting website, web services suite, and OSN. Flickr provides both private and public image
storage. A user uploading an image can set privacy controls that determine who can view it. A
photo can be marked as either public or private. Private images are visible by default only to the
uploader, but they can also be marked as visible to friends and/or family. Privacy settings are
also affected by adding photographs from a user’s photo stream to a “group pool”. If the group is
private, all members of that group can see the photo. If the group is public, the photograph
becomes public as well. Flickr also provides a “contact list” which can be used to control image
access for a specific set of users, in a way similar to the social tier tools of other OSNs (Alan
Mislove, Massimiliano Marcon and Krishna P. Gummadi 2007).
Facebook
Facebook is the world’s largest social network, with over 350 million active users, half of whom
visit the site once per day (FS 2013). It provides a platform on which users who share a common
interest, idea, task or goal can interact and develop or maintain personal relationships. It also
lets users invite friends and guests to join their events. Users can share professional data as
well, which can be beneficial for business purposes. Games are also available, which can be played
individually or in groups across the world. Facebook provides a bulletin board for users to buy
and sell products from each other, and companies use the site as a means of advertising their
products. Facebook launched an API for its platform in 2007, providing a framework for software
developers to create applications that interact with core Facebook features. However, the API
places several restrictions on access to an individual’s complete social graph.
Orkut
Orkut is a social networking site run by Google. Orkut is a “pure” social network, as the sole
purpose of the site is social networking and no content is shared. Its purpose is to provide an
online meeting place where people can socialize, make new connections and find others who share
their interests. Features include messaging, text chat, video chat, and the ability to personalize
the view of the site using a wide range of colors and themes. Anyone aged 18 or above can join.
Orkut™ is available in 48 languages and is especially popular in Brazil and India. In 2010, Orkut™
had more than 100 million users worldwide. Brazil accounted for the most visitors with 48%, while
39.2% were from India and 2.2% were from the United States. The site was earlier popular in Iran,
but the Iranian government now blocks access to Orkut™, claiming it is a threat to national
security and Islamic values. The governments of the United Arab Emirates and Saudi Arabia have
also blocked access to the site (Alan Mislove, Massimiliano Marcon and Krishna P. Gummadi 2007)
(Yong-Yeol Ahn, Seungyeop Han, Haewoon Kwak, Young-Ho Eom, Sue Moon, Hawoong Jeong 2007).
Twitter
Twitter is an OSN and microblogging service that enables users to send and read short
140-character text messages, called "tweets". Registered users can read and post tweets, but
unregistered users can only read them. The service quickly gained worldwide popularity, with 300
million registered users in 2012, who posted 340 million tweets per day. The service also handled
1.6 billion search queries per day (LSNW 2010).
LinkedIn
LinkedIn is a business-oriented social networking site, which was founded in December 2002 and
launched in May 2003. It has become a fast means of connecting professionals, who share knowledge
about their areas of interest and other professional matters, and it helps in exchanging human
resources. Many companies also use this network to hire the professionals required in various
segments. Users can be in touch with thousands of professionals by linking with each other. As of
October 2009, LinkedIn had more than 50 million registered users, covering more than 200 countries
and territories worldwide. LinkedIn allows users to opt out of displaying their network. Compared
with other OSNs, LinkedIn’s business model is unique: it controls what a viewer may see based on
whether she or he has a paid account (LSNW 2010).
YouTube
YouTube is a popular video-sharing site that includes a social network. The YouTube data presented
in the cited study was obtained on January 15th, 2007 and consists of over 1.1 million users and
4.9 million links. Like Flickr, YouTube exports an API, and it allows links to be queried only in
the forward direction. Unfortunately, YouTube’s user identifiers do not follow a standard format
(Alan Mislove, Massimiliano Marcon and Krishna P. Gummadi 2007).
LiveJournal
LiveJournal is a popular blogging site whose users form a social network. It contains over 5.2
million users and 72 million links (Alan Mislove, Massimiliano Marcon and Krishna P. Gummadi
2007).
Cyworld
Cyworld is the largest and oldest online social networking service in South Korea. It began
operation in September 2001, and its growth has been explosive ever since. Cyworld’s 15 million
registered users, as of November 2006, are an impressive number considering South Korea’s total
population of 48 million. Like any SNS, Cyworld lets users establish, maintain and dissolve a
friend (called ilchon) relationship online (Yong-Yeol Ahn, Seungyeop Han, Haewoon Kwak, Young-Ho
Eom, Sue Moon, Hawoong Jeong 2007).
MySpace
MySpace was the largest social networking service in the world at the time of the cited study,
with more than 190 million users. It began its service in July 2003, and the number of users grew
explosively. According to Alexa.com, it was the world’s 5th most popular website (4th among
English websites) (Yong-Yeol Ahn, Seungyeop Han, Haewoon Kwak, Young-Ho Eom, Sue Moon, Hawoong
Jeong 2007).
Materials and Methods
Network Size Estimation
This section presents an estimator for the graph size (number of nodes). The estimator uses
observations of node pairs which are “far away” from each other in the random walk (S. J.
Hardiman, P. Richmond, and S. Hutzler 2009). This condition is needed to ensure that both nodes in
a pair are (approximately) uncorrelated, each drawn from the stationary distribution.
Specifically, the estimator examines node pairs whose index distance is at least a threshold m
[5]. Formally,
I = {(𝑘, 𝑙) | 𝑚 ≤ |𝑘 − 𝑙| ⋀ 1 ≤ 𝑘, 𝑙 ≤ 𝑟}
The estimator counts weighted neighbor collisions. A neighbor collision is a pair of indices
(k, l) such that $v_{x_k}$ and $v_{x_l}$ share a common neighbor. Formally, let $A_i$ be the set of
vertices adjacent to $v_i$, so that $d_i = |A_i|$ is the degree of $v_i$ and $A_i \cap A_j$ is the
set of nodes neighboring both $v_i$ and $v_j$. Given a random walk $(x_1, x_2, \ldots, x_r)$, we
define a new variable $\phi_{k,l} = |A_{x_k} \cap A_{x_l}|$. Note that if $(k, l) \in I$, then
$x_k$ and $x_l$ are approximately independent draws from the stationary distribution, which
selects $v_i$ with probability $d_i / D$, where $D = \sum_i d_i$. Weighting each neighbor
collision by the degrees of its two endpoints then gives

$$E\left[\frac{\phi_{k,l}}{d_{x_k} d_{x_l}}\right] = \frac{1}{D^2} \sum_i \sum_j |A_i \cap A_j| = \frac{1}{D^2} \sum_k d_k^2$$

To see why $\sum_i \sum_j |A_i \cap A_j| = \sum_k d_k^2$, consider the following combinatorial
proof. For a node $v_k$, the number of connected triplets $(v_i, v_k, v_j)$ with no restrictions
on i and j is $d_k^2$. Thus, the total number of connected triplets is $\sum_k d_k^2$.
Alternatively, for nodes $v_i$ and $v_j$ the number of connected triplets $(v_i, v_k, v_j)$ is
$|A_i \cap A_j|$. Thus, the total number of connected triplets can also be expressed by
$\sum_i \sum_j |A_i \cap A_j|$.
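The weighted neighbor collisions described above can be turned into a size estimate. The following is a minimal sketch, not the exact estimator of the cited work, whose final formula is not reproduced in this survey. It assumes a simple random walk on an undirected graph, whose stationary distribution visits node v_i with probability d_i / D (D being the sum of all degrees). Under that assumption the mean of d_{x_k} / d_{x_l} over far-apart pairs is approximately n times the mean of the weighted collisions, so their ratio approximates n. The names `Graph`, `random_walk` and `estimate_size` are illustrative only.

```python
import random
from typing import Dict, List, Optional, Set

Graph = Dict[int, Set[int]]  # undirected adjacency sets (assumed representation)


def random_walk(adj: Graph, start: int, length: int,
                rng: Optional[random.Random] = None) -> List[int]:
    """Simple random walk visiting `length` nodes, moving to a uniform neighbor."""
    rng = rng or random.Random()
    walk = [start]
    for _ in range(length - 1):
        # sorted() only makes the choice reproducible for a seeded generator
        walk.append(rng.choice(sorted(adj[walk[-1]])))
    return walk


def estimate_size(walk: List[int], adj: Graph, m: int) -> float:
    """Ratio estimate of the node count from index pairs with |k - l| >= m."""
    ratio_sum = 0.0      # sum of d_{x_k} / d_{x_l} over the pairs in I
    collision_sum = 0.0  # sum of phi_{k,l} / (d_{x_k} * d_{x_l}) over the same pairs
    r = len(walk)
    for k in range(r):
        for l in range(r):
            if abs(k - l) < m:
                continue  # pair too close in the walk to be treated as independent
            a, b = adj[walk[k]], adj[walk[l]]
            ratio_sum += len(a) / len(b)
            collision_sum += len(a & b) / (len(a) * len(b))
    if collision_sum == 0:
        raise ValueError("no neighbor collisions observed; the walk is too short")
    return ratio_sum / collision_sum


# Toy usage on a triangle (3 nodes) with a hand-written walk.
triangle: Graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
toy_estimate = estimate_size([0, 1, 2, 0, 1, 2], triangle, m=2)
```

The double loop is quadratic in the walk length, which is acceptable for a sketch; a practical implementation would index the walk by neighbor to count collisions more cheaply.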
Crawling large graphs
Crawling large, complex graphs presents unique challenges. This section describes the general
approach taken in the cited measurement study (Alan Mislove, Massimiliano Marcon and Krishna P.
Gummadi 2007).
Crawling the entire connected component
The primary challenge in crawling large graphs is
covering the entire connected component. At each
step, one can generally only obtain the set of links
into or out of a specified node. In the case of online
social networks, crawling the graph efficiently is
important since the graphs are large and highly
dynamic. Common algorithms for crawling graphs
include breadth-first search (BFS) and depth-first
search. Often, crawling an entire connected
component is not feasible, and one must resort to
using samples of the graph. Crawling only a subset
of a graph by ending a BFS early (called the
snowball method) is known to produce a biased
sample of nodes. In particular, partial BFS crawls
are likely to overestimate node degree and
underestimate the level of symmetry (L. Becchetti,
C. Castillo, D. Donato, and A. Fazzone 2006). In
social network graphs, collecting samples via the
snowball method has been shown to underestimate
the power-law coefficient, but to more closely
match other metrics, including the overall clustering
coefficient. Some previous studies of social
networks have used small graph samples.
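As an illustration of the snowball method described above, the following minimal sketch runs a BFS crawl that can be stopped early. The callback `get_links` is a hypothetical stand-in for whatever site API returns the links of a node; it is not an API of any of the networks discussed here, and `max_nodes` is an assumed budget parameter.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, List, Optional


def bfs_crawl(seed: Hashable,
              get_links: Callable[[Hashable], Iterable[Hashable]],
              max_nodes: Optional[int] = None) -> List[Hashable]:
    """Breadth-first crawl from `seed`; stops after `max_nodes` nodes if given."""
    seen = {seed}          # nodes already discovered
    queue = deque([seed])  # discovered nodes whose links are not yet fetched
    crawled = []           # nodes whose links were fetched, in BFS order
    while queue and (max_nodes is None or len(crawled) < max_nodes):
        node = queue.popleft()
        crawled.append(node)
        for neighbor in get_links(node):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return crawled


# Toy graph standing in for a real site; a dict lookup plays the role of the API.
toy_graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
full_crawl = bfs_crawl(0, toy_graph.__getitem__)
snowball_sample = bfs_crawl(0, toy_graph.__getitem__, max_nodes=3)
```

Stopping early keeps only the nodes closest to the seed, which is the source of the sampling bias discussed above.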
Using only forward links
Crawling directed graphs, as opposed to undirected
graphs, presents additional challenges. In particular,
many graphs can only be crawled by following links
in the forward direction (i.e., one cannot easily
determine the set of nodes which point into a given
node). Using only forward links does not
necessarily crawl an entire weakly connected
component (WCC); instead, it explores the
connected component reachable from
the set of seed users. This limitation is typical for
studies that crawl online networks, including
measurement studies of the Web (S. H. Lee, P.-J.
Kim, and H. Jeong 2006).
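The difference between a forward-only crawl and a crawl of the whole WCC can be seen on a toy directed graph. The sketch below is illustrative only: the graph, the helper names `reachable`, `crawl_forward` and `crawl_both_directions`, and the assumption that reverse links are available at all are not taken from the cited study.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set


def reachable(seeds: Iterable[int], links: Dict[int, Iterable[int]]) -> Set[int]:
    """All nodes reachable from `seeds` by following `links`."""
    seen = set(seeds)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def crawl_forward(seeds: Iterable[int], out_links: Dict[int, List[int]]) -> Set[int]:
    """Crawl using forward (outgoing) links only."""
    return reachable(seeds, out_links)


def crawl_both_directions(seeds: Iterable[int],
                          out_links: Dict[int, List[int]]) -> Set[int]:
    """Crawl using forward and reverse links, i.e. the WCC containing the seeds."""
    merged = defaultdict(set)
    for src, dsts in out_links.items():
        for dst in dsts:
            merged[src].add(dst)  # forward link
            merged[dst].add(src)  # reverse link
    return reachable(seeds, merged)


# Node 4 points into node 2 but is not pointed to by anyone; 5 -> 6 is a separate WCC.
toy_links = {1: [2], 2: [3], 3: [], 4: [2], 5: [6], 6: []}
forward_only = crawl_forward([1], toy_links)
whole_wcc = crawl_both_directions([1], toy_links)
```

Here the forward-only crawl from node 1 misses node 4, which belongs to the same WCC but can only be found through a reverse link.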
Size Estimation of Facebook
The authors of the cited study (Liran Katzir, Edo Liberty, Oren Somekh and Ioana A. Cosma 2012)
used two crawls performed on Facebook. The first crawl consisted of 984,830 uniformly sampled
users collected during April 2009. The second crawl was performed during October 2010 and
consisted of 988,116 users (M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou 2010). The second
crawl performed a simple random walk on the Facebook graph and therefore selected users with
probability proportional to their degree.
Fig 3: An example of a directed graph crawl. The users reached by following only forward links are shown in the
shaded cloud, and those reached using both forward and reverse links are shown in the dashed cloud. Using both forward
and reverse links allows the entire WCC to be crawled, while using only forward links results in a subset of the WCC.
Subnetwork size estimation
Since the actual size of Facebook is not known (other than from Facebook’s own reports), they
first estimate the size of a subgraph whose size is known. They selected a random subset of
1,000,000 Facebook users and tried to estimate the size of this sub-population using the first
algorithm. This is done for two reasons: first, to test the subgraph size estimation algorithm,
and second, to make sure that Facebook’s network topology and statistics are suitable for their
estimators. They present an error curve, a confidence interval curve, and a comparison curve.
These results corroborate that their subgraph size estimators behave almost identically to the
complete graph estimators. This was expected, since the analysis of the two is essentially
identical. A more important finding is that the network topology and node degree distribution of
Facebook are indeed suitable for their estimators to perform well.
Estimating the size of Facebook
They now estimate the size of the entire Facebook network. Presenting accuracy plots in this case
is not possible since the true size of Facebook is not known. The uniform Facebook sample
collected during April 2009 contains 2053 collisions and 2052 non-unique elements. Substituting
these into Equations (1) and (2) of the cited work yields estimates of 237,197,785 and
236,984,623 users respectively. The very same month, Facebook ([FBS]) reported having “more than
200 million active users”, and “more than 250 million active users” three months later. The crawl
that was performed during October 2010 contained 4099 collisions and 4064 non-unique users, taking
50 random walk steps between samples. This gives estimates of 475,566,857 and 475,864,724
respectively (FS 2013). Facebook at the same time reported having “more than 500 million active
users”. This is summarized in Table 1.
                        April 2009          October 2010
Sampling                Uniform             Degree distribution
Number of samples       0.98 ∙ 10^6         1 ∙ 10^6
Number of collisions    2053                4099
Number of non-unique    2052                4064
Collision estimator     237 ∙ 10^6          475 ∙ 10^6
Non-unique estimator    236 ∙ 10^6          475 ∙ 10^6
Facebook report         200 – 250 ∙ 10^6    500 ∙ 10^6
Table 1: Crawl details and consequent size estimates of the entire Facebook network for April 2009
and October 2010.
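The collision counts above relate to a population size through a birthday-paradox argument. The sketch below is an assumption-laden illustration rather than Equations (1) and (2) of the cited work, which are not reproduced in this survey. For r uniform samples, a given pair collides with probability 1/n, so n is approximately r(r - 1) / (2C) for C observed collisions. For degree-proportional samples, a commonly used correction reweights by the sampled degrees. All function names are illustrative.

```python
from collections import Counter
from typing import Hashable, Mapping, Sequence


def count_collisions(samples: Sequence[Hashable]) -> int:
    """Number of unordered sample pairs that hit the same user."""
    return sum(c * (c - 1) // 2 for c in Counter(samples).values())


def uniform_size_estimate(num_samples: int, num_collisions: int) -> float:
    """Birthday-paradox estimate for uniform samples: r(r - 1) / (2C)."""
    if num_collisions == 0:
        raise ValueError("no collisions observed; more samples are needed")
    return num_samples * (num_samples - 1) / (2 * num_collisions)


def degree_weighted_size_estimate(samples: Sequence[Hashable],
                                  degree: Mapping[Hashable, int]) -> float:
    """Estimate for degree-proportional samples: (sum d) * (sum 1/d) / (2C)."""
    collisions = count_collisions(samples)
    if collisions == 0:
        raise ValueError("no collisions observed; more samples are needed")
    sum_deg = sum(degree[s] for s in samples)
    sum_inv_deg = sum(1.0 / degree[s] for s in samples)
    return sum_deg * sum_inv_deg / (2 * collisions)


# Toy usage: four samples, user 1 drawn twice, every user with degree 2.
toy_samples = [1, 2, 1, 3]
toy_uniform = uniform_size_estimate(len(toy_samples), count_collisions(toy_samples))
toy_weighted = degree_weighted_size_estimate(toy_samples, {1: 2, 2: 2, 3: 2})

# Plugging in the April 2009 figures quoted in the text.
april_2009 = uniform_size_estimate(984_830, 2053)
```

With the April 2009 figures this simple formula gives roughly 2.36 ∙ 10^8, which is of the same order as, though not identical to, the estimates quoted above; the published numbers come from the authors' own equations.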
CLUSTERING COEFFICIENT ESTIMATION
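There are two types of clustering coefficient estimators: network average and global clustering
coefficient estimators (Stephen J. Hardiman, Liran Katzir 2013). The main observations used in
both are as follows. Given a random walk $(x_1, x_2, \ldots, x_r)$, we define a new variable
$\phi_k = A_{x_{k-1}, x_{k+1}}$ for every $2 \le k \le r - 1$, where $A$ is the adjacency matrix
of the graph. For any function $f(x_k)$ the following holds:

$$E[\phi_k f(x_k)] = \sum_i \frac{d_i}{D} f(v_i) E[\phi_k \mid x_k = v_i] = \sum_i \frac{d_i}{D} f(v_i) \frac{2 l_i}{d_i^2} = \frac{1}{D} \sum_i f(v_i) \frac{2 l_i}{d_i}$$

Here $d_i$ is the degree of $v_i$, $D = \sum_i d_i$, and $l_i$ is the number of triangles that
contain $v_i$. The first equality holds due to the law of total expectation. The second equality
holds because there are $d_i^2$ equally probable combinations of $(x_{k-1}, v_i, x_{k+1})$, out of
which only $2 l_i$ form a triangle $(v_j, v_i, v_k)$ or a reverse triangle $(v_k, v_i, v_j)$.
Notice that in a triangle or a reverse triangle $v_j$ is connected to $v_k$ ($A_{j,k} = 1$). The
third equality holds due to algebraic manipulation.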
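The identity above suggests how a network average clustering coefficient could be estimated from a walk. The following is a sketch under stated assumptions, not a reproduction of the cited estimator. It takes the local clustering coefficient of v_i to be 2 l_i / (d_i (d_i - 1)), chooses f(x_k) = 1 / (d_{x_k} - 1), and divides by the walk average of 1 / d_{x_k} (whose expectation is n / D for a simple random walk). Nodes of degree 1 are taken to contribute zero. The function name and graph representation are illustrative.

```python
from typing import Dict, List, Set

Graph = Dict[int, Set[int]]  # undirected adjacency sets (assumed representation)


def estimate_avg_clustering(walk: List[int], adj: Graph) -> float:
    """Estimate the network average clustering coefficient from a random walk."""
    r = len(walk)
    if r < 3:
        raise ValueError("the walk needs at least three nodes")
    weighted_phi = 0.0
    for k in range(1, r - 1):
        d = len(adj[walk[k]])
        # phi_k = 1 when the nodes before and after x_k are adjacent to each other
        if d > 1 and walk[k + 1] in adj[walk[k - 1]]:
            weighted_phi += 1.0 / (d - 1)
    inverse_degrees = sum(1.0 / len(adj[x]) for x in walk)
    return (weighted_phi / (r - 2)) / (inverse_degrees / r)


# Toy usage with hand-written walks on a triangle and on a star.
triangle: Graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
star: Graph = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
triangle_estimate = estimate_avg_clustering([0, 1, 0, 2, 0, 1], triangle)
star_estimate = estimate_avg_clustering([1, 0, 2, 0, 3], star)
```

The triangle walk was chosen so that half of its middle steps backtrack, as a long random walk on a triangle would on average; a real estimate would use a long walk generated at random.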
Results and Discussion
OSNs           Number of Users
Flickr         1.8 million
Facebook       350 million
Orkut          100 million
Twitter        300 million
LinkedIn       50 million
YouTube        1.1 million
LiveJournal    5.2 million
Cyworld        15 million
MySpace        190 million
Table 2: Comparative analysis of various OSNs
Approach                             Concept                                 Performance
Network Size Estimation              Based on random walk                    Provides good accuracy in many
                                                                             cases.
Crawling large graphs                Based on breadth-first search (BFS)     Provides good performance.
                                     and depth-first search
Size Estimation of Facebook          Based on subnetwork size estimation     Consistently provides more accurate
                                                                             estimates while using a smaller
                                                                             number of samples.
Clustering Coefficient Estimation    Based on network average and global     The algorithm is strictly more
                                     clustering coefficient estimators       accurate.
Table 3: Comparative analysis of various Estimation Techniques
Conclusion
This paper provides an up-to-date review of online social networking sites and of estimation
techniques for social networks. The literature has been reviewed with respect to different aspects
of the estimation of social networking sites. The survey of recent work in the field of social
network analysis shows that there are several open research directions in the estimation of
social networking sites.
Acknowledgements
This study is part of a dissertation on the estimation of online social networking sites using
clustering techniques. The work is self-funded and supported by the Department of Computer Science
Engineering, TIT Bhopal, M.P., India.
References
A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener
(2000) “Graph Structure in the Web: Experiments and Models”. In Proceedings of the 9th
International World Wide Web Conference (WWW’00), Amsterdam.
Alan Mislove, Massimiliano Marcon and Krishna P. Gummadi (2007) “Measurement and Analysis of
Online Social Networks”. IMC’07, San Diego, California, USA.
Liran Katzir, Edo Liberty, Oren Somekh and Ioana A. Cosma (2012) “Estimating Sizes of Social
Networks via Biased Sampling”. Microsoft Innovations Lab, Israel.
L. Becchetti, C. Castillo, D. Donato, and A. Fazzone (2006) “A Comparison of Sampling Techniques
for Web Graph Characterization”. In Proceedings of the Workshop on Link Analysis (LinkKDD’06),
Philadelphia, PA.
M. Molloy and B. Reed (1995) “A critical point for random graphs with a given degree sequence”.
Random Structures and Algorithms, 6(2-3): 161-180.
M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou (2010) “Walking in Facebook: A case study of
unbiased sampling of OSNs”. In Proc. of IEEE INFOCOM ’10, San Diego, CA.
Stephen J. Hardiman, Liran Katzir (2013) “Estimating Clustering Coefficients and Size of Social
Networks via Random Walk”. International World Wide Web Conference Committee (IW3C2), Rio de
Janeiro, Brazil.
S. H. Lee, P.-J. Kim, and H. Jeong (2006) “Statistical properties of sampled networks”. Physical
Review E, 73.
S. J. Hardiman, P. Richmond, and S. Hutzler (2009) “Calculating statistics of complex networks
through random walks with an application to the on-line social network”. European Physical
Journal B, 71(4): 611–622.
Rupam Some (2013) “A Survey on Social Network Analysis and its Future Trends”. International
Journal of Advanced Research in Computer and Communication Engineering, 2(6).
Yong-Yeol Ahn, Seungyeop Han, Haewoon Kwak, Young-Ho Eom, Sue Moon, Hawoong Jeong (2007) “Analysis
of Topological Characteristics of Huge Online Social Networking Services”. International
Conference on World Wide Web (WWW’07), pp 835-844.
Facebook Statistics [FS] (2013) http://www.facebook.com/press/info.php?statistics
It’s a Social World: Top 10 Need-to-Knows About Social Networking and Where It’s Headed [SW]
(2011) Available from http://www.comscore.com
List of social networking websites [LSNW] (2010) Available from
http://en.wikipedia.org/wiki/list_of_social_networking_websites.