Influence and Correlation in Social Network

PRESENTED BY PNINA NISSIM
Correlation in Social Network
 A correlation is a single number that describes the degree of
relationship between two variables.
 It is highly interesting to interpret users’ actions in the context of
their online friends and to correlate the actions of socially connected
users.
[Figure: socially connected users posting opinions about Game of Thrones, such as "I love Game of Thrones", "I hate Game of Thrones!", and "Stop watching Game of Thrones!".]
Previous Work
 L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan (2006)
examined the membership problem in an online community.
 C. Marlow, M. Naaman, D. Boyd, and M. Davis (2006)
considered the tag usage problem in Flickr.
These studies have established the existence of correlation
between user actions and social affiliations, but they do
not address the source of the correlation.
Causes of Correlation in Social Network
 Influence (induction) – the action of a user is triggered by a recent action of one of their friends.
 Homophily – individuals often befriend others who are similar to
them, and hence perform similar actions.
 Environment (confounding factors, external influence) – external
factors are correlated both with the event that two individuals
become friends and with their actions.
Social Influence
 Social influence occurs when one’s opinions, emotions, or
behaviors are affected by others.
In the presence of social influence, an idea, a norm of behavior, or
a product diffuses through the social network like an epidemic.
Identifying the cases in which influence prevails is an
important step in strategy design.
Models of Social Correlation
 A directed graph 𝐺 is generated from an unknown probability distribution.
 The nodes are the agents in the social network.
 After an agent performs the action for the first time, we say that
the agent has become active.
 Let W denote the set of agents that are active at the end of a
certain time period [0, 𝑇].
[Figure: example network with nodes labeled by activation time (0 to 4).]
Homophily Model
 The set 𝑊 of active nodes is first selected according to some
distribution
 The graph 𝐺 is picked from a distribution that depends on 𝑊.
Confounding Model
 There is a confounding variable 𝑋
 Both the network 𝐺 and the set of active individuals 𝑊 come from
distributions correlated with 𝑋.
Generalization – Correlation Model
 The pair (𝐺, 𝑊) is selected according to a joint probability
distribution.
 The time of activation for individuals in W is picked i.i.d. according
to a distribution 𝜏 on [0, 𝑇].
 The main assumption: the probability that an agent is active can
be affected by whether their friends become active, but not by when
they become active.
Influence Model
 The graph 𝐺 is drawn according to some distribution.
 In each of the time steps 1, … , 𝑇 each non-active agent decides
whether to become active.
 The probability of becoming active for each agent u is a function
𝑝(𝑥) of the number 𝑥 of other agents 𝑣 that have an edge to u and
are already active.
[Figure: example graph with nodes v, u, and w.]
The Function p(·)
 A logistic function with the logarithm of the number of friends
provides a good fit for the probability.
 The probability 𝑝(𝑎) of activation for an agent
with 𝑎 already-active friends:
$$p(a) = \frac{e^{\alpha \ln(a+1) + \beta}}{1 + e^{\alpha \ln(a+1) + \beta}}$$
where 𝛼 and 𝛽 are coefficients.
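The formula above can be sketched in a few lines of Python. This is only an illustration: the function name `activation_probability` is hypothetical, and the code uses the algebraically equivalent form 1 / (1 + e^(−z)) with z = 𝛼 ln(𝑎+1) + 𝛽.

```python
import math


def activation_probability(a: int, alpha: float, beta: float) -> float:
    """Logistic activation probability p(a) for an agent with `a` active friends.

    Hypothetical helper, not taken from the slides.
    """
    z = alpha * math.log(a + 1) + beta
    # 1 / (1 + e^{-z}) is the same value as e^{z} / (1 + e^{z}).
    return 1.0 / (1.0 + math.exp(-z))
```

With 𝛼 > 0 the probability grows with the number of active friends; with 𝛼 = 0 it does not depend on 𝑎 at all.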
Measuring Social Correlation
 The coefficient 𝛼 measures social correlation: a large value of
𝛼 indicates a large degree of correlation.
 $Y_{a,t}$: the number of users who at the beginning of time 𝑡 had 𝑎
active friends and started using the tag at time 𝑡.
 $N_{a,t}$: the number of users who had 𝑎 active friends at time 𝑡, but
did not start using the tag (at time 𝑡).
 $Y_a = \sum_t Y_{a,t}$, $N_a = \sum_t N_{a,t}$
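As a rough sketch, the aggregated counts could be computed as follows. The names `count_outcomes`, `friends`, `activation_time`, and `num_steps` are hypothetical. The code makes two assumptions: "active at the beginning of time t" means an activation time strictly smaller than t, and users who are already active are no longer counted in either total.

```python
from collections import Counter


def count_outcomes(friends, activation_time, num_steps):
    """Return (Y, N), where Y[a] and N[a] are summed over all time steps.

    friends[u]         -- agents whose activity can affect u (assumed input format)
    activation_time[u] -- first time step at which u used the tag (absent if never)
    """
    Y, N = Counter(), Counter()
    for t in range(num_steps):
        for u in friends:
            t_u = activation_time.get(u)
            if t_u is not None and t_u < t:
                continue  # u is already active, so it no longer makes a decision
            # Friends activated strictly before t count as active at time t.
            a = sum(1 for v in friends[u]
                    if activation_time.get(v, num_steps) < t)
            if t_u == t:
                Y[a] += 1  # u starts using the tag at time t
            else:
                N[a] += 1  # u had a active friends but did not start at t
    return Y, N
```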
Example
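[Figure: example network with nodes labeled by activation time (0 to 4).]
$Y_{0,0} = 1$, $Y_{0,1} = 1$, $Y_{0,4} = 1$, so $Y_0 = 3$
$Y_{1,2} = 2$, $Y_{1,4} = 1$, so $Y_1 = 3$
$N_{0,0} = 13$, $N_{0,1} = 11$, $N_{0,2} = 8$, $N_{0,3} = 7$, $N_{0,4} = 6$, so $N_0 = 45$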
Maximum Likelihood Method
 We compute the values of 𝛼 and 𝛽 that maximize the expression
$$\prod_{a} p(a)^{Y_a} \, \bigl(1 - p(a)\bigr)^{N_a}, \qquad p(a) = \frac{e^{\alpha \ln(a+1) + \beta}}{1 + e^{\alpha \ln(a+1) + \beta}}$$
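One way to carry out this maximization is Newton's method on the log-likelihood, sketched below in plain Python. This is an illustrative choice, not necessarily the solver used in the original work. The name `fit_alpha_beta` and the dictionary inputs `Y` and `N` (mapping 𝑎 to the counts $Y_a$ and $N_a$) are assumptions.

```python
import math


def fit_alpha_beta(Y, N, iterations=50, tol=1e-10):
    """Maximize sum_a [Y_a log p(a) + N_a log(1 - p(a))] over (alpha, beta)."""
    alpha, beta = 0.0, 0.0
    levels = sorted(set(Y) | set(N))
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for a in levels:
            x = math.log(a + 1)           # the regressor ln(a + 1)
            y = Y.get(a, 0)
            n = y + N.get(a, 0)           # total observations with a active friends
            p = 1.0 / (1.0 + math.exp(-(alpha * x + beta)))
            g_a += (y - n * p) * x        # gradient with respect to alpha
            g_b += (y - n * p)            # gradient with respect to beta
            w = n * p * (1.0 - p)
            h_aa += w * x * x             # entries of the negated Hessian
            h_ab += w * x
            h_bb += w
        det = h_aa * h_bb - h_ab * h_ab
        if det < 1e-12:
            raise ValueError("need data for at least two distinct values of a")
        # Newton step: solve the 2x2 system (negated Hessian) * delta = gradient.
        d_alpha = (h_bb * g_a - h_ab * g_b) / det
        d_beta = (h_aa * g_b - h_ab * g_a) / det
        alpha += d_alpha
        beta += d_beta
        if abs(d_alpha) + abs(d_beta) < tol:
            break
    return alpha, beta
```

The sketch has no step-size safeguard, so it may fail to converge on degenerate data, for example when every user with a given 𝑎 activates.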
The Shuffle Test
 The Test:
shuffle the timestamps of user activities and check if the new
estimate of social correlation is significantly different from the
estimate based on the user activity log.
The Shuffle Test
 𝛼: the social correlation coefficient when user $w_i$ is first
activated at time $t_i$.
 𝛼′: the social correlation coefficient when user $w_i$ is first
activated at time $t'_i := t_{\pi(i)}$, for a random permutation 𝜋.
The shuffle test declares that the model exhibits no social
influence if the values of 𝜶 and 𝜶′ are close to each other.
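The permutation step of the test can be sketched as below. The name `shuffle_timestamps` and the `activation_time` dictionary (user to first activation time) are hypothetical. The sketch only builds the shuffled log; 𝛼′ would then be estimated from it with the same procedure that produced 𝛼.

```python
import random


def shuffle_timestamps(activation_time, rng=None):
    """Reassign the observed activation times to the same users at random.

    Implements t'_i := t_{pi(i)} for a uniformly random permutation pi.
    """
    rng = rng or random.Random()
    users = list(activation_time)
    times = [activation_time[u] for u in users]
    rng.shuffle(times)  # random permutation of the time stamps
    return dict(zip(users, times))
```

The set of active users and the multiset of time stamps are unchanged; only the assignment of times to users differs.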
Why Does it Work?
 In an instance generated from the correlation model, the time
stamps 𝑡𝑖 are independent, identically distributed (i.i.d.) from a
distribution 𝜏 over [0, 𝑇].
 The second instance, constructed by the shuffle, only permutes the time
stamps, and hence the new $t'_i$ values are still i.i.d. from the same
distribution 𝜏.
The two instances come from the exact same distribution, and
hence they should lead to the same expected social correlation
coefficient 𝜶.
The Edge-Reversal Test
In this test we reverse the direction of all the edges and run logistic
regression on the data using the new graph as well.
 Social influence spreads in the direction specified by the edges of
the graph, and hence reversing the edges should intuitively change
the estimate of the correlation.
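Reversing the graph is a small transformation. The sketch below assumes the graph is stored as a dictionary `successors` mapping each node to the set of nodes its edges point to; both that representation and the name `reverse_graph` are hypothetical.

```python
def reverse_graph(successors):
    """Return a new graph in which every edge u -> v becomes v -> u."""
    reversed_graph = {u: set() for u in successors}
    for u, targets in successors.items():
        for v in targets:
            reversed_graph.setdefault(v, set()).add(u)
    return reversed_graph
```

The regression is then run twice, once on the original graph and once on the reversed one, and the two estimates of 𝛼 are compared.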
Simulations
 Three generative models.
 In each model, we will try to keep other aspects of the model as
close to Flickr’s data as possible.
o Number of users and Connections
o The number of users that become active in each time step
The No-Correlation Model
 There is no social correlation, influence or otherwise, in the
pattern of activations.
 In each time step, we look at the real data to see how many new
agents use the tag, and pick the same number of agents uniformly at
random from the set of agents that have already joined the network
and have not been picked yet.
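A minimal sketch of this generative process follows. The names are hypothetical: `join_time` maps each agent to the time step at which it joined the network, and `new_active_counts[t]` is the number of new tag users observed in the real data at step t.

```python
import random


def simulate_no_correlation(join_time, new_active_counts, rng=None):
    """Activate the observed number of agents per step, chosen uniformly at random."""
    rng = rng or random.Random()
    activation_time = {}
    for t, count in enumerate(new_active_counts):
        # Agents that have joined the network and have not been picked yet.
        eligible = [u for u, joined in join_time.items()
                    if joined <= t and u not in activation_time]
        for u in rng.sample(eligible, min(count, len(eligible))):
            activation_time[u] = t
    return activation_time
```

Because the choice ignores the graph entirely, any correlation measured on this output is due to chance.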
The Influence Model
 Influence is the only form of social correlation.
 This model is parameterized in terms of two parameters, 𝛼 and 𝛽.
 In every time step, each node that has joined the network but is
not yet active flips a coin independently to
decide whether to become active in this time step.
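The process can be sketched as below, assuming the coin's bias is the logistic probability 𝑝(𝑎) defined earlier in the deck. For brevity the sketch treats every node as having joined the network from the start, which is a simplification. The names `simulate_influence` and `friends` are hypothetical.

```python
import math
import random


def simulate_influence(friends, num_steps, alpha, beta, rng=None):
    """friends[u] is the set of agents with an edge to u (assumed input format)."""
    rng = rng or random.Random()
    activation_time = {}
    for t in range(num_steps):
        newly_active = []
        for u in friends:
            if u in activation_time:
                continue  # only non-active agents flip a coin
            a = sum(1 for v in friends[u] if v in activation_time)
            p = 1.0 / (1.0 + math.exp(-(alpha * math.log(a + 1) + beta)))
            if rng.random() < p:
                newly_active.append(u)
        # Apply activations after the loop so that decisions within one time
        # step depend only on agents that were active before the step.
        for u in newly_active:
            activation_time[u] = t
    return activation_time
```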
The Correlation (no-influence) Model
 Agents that are close to each other in the network are affected by
the same external factors that make them more likely to be
activated.
 The model is parameterized in terms of one parameter 𝐿.
 Select a set 𝑆 of 𝐿 nodes.
The Correlation (no-influence) Model – Selecting S
 Pick a number of centers at random and add them to 𝑆.
 Add the ball of radius 2 around each node in 𝑆 to 𝑆.
Stop this process as soon as the size of 𝑺 reaches the prespecified number 𝑳.
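One possible reading of this selection procedure is sketched below. It takes centers one at a time in random order, grows a breadth-first ball of radius 2 around each, and stops once 𝑳 nodes are collected, truncating the last ball if necessary. Treating the graph as an undirected adjacency dictionary `neighbors` is an assumption. So are the one-at-a-time handling of centers and the truncation. All names are hypothetical.

```python
import random
from collections import deque


def ball(neighbors, center, radius=2):
    """Nodes within `radius` hops of `center`, in breadth-first order."""
    seen = {center}
    order = [center]
    queue = deque([(center, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == radius:
            continue  # do not expand beyond the radius
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append((nxt, dist + 1))
    return order


def select_correlated_set(neighbors, target_size, radius=2, rng=None):
    """Grow S from random centers until it holds `target_size` nodes."""
    rng = rng or random.Random()
    centers = list(neighbors)
    rng.shuffle(centers)
    selected = set()
    for center in centers:
        for node in ball(neighbors, center, radius):
            if len(selected) >= target_size:
                return selected
            selected.add(node)
    return selected
```

Because 𝑆 is built from balls, its members are clustered in the network, which is what makes the resulting activations correlated without any influence.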
The Correlation (no-influence) Model
 Generate the set of agents that become active in each time step in
a manner similar to the one in the no-correlation model, except that
in each time step we pick the set of agents to become active
uniformly at random from 𝑆.
Measuring Correlation
The first set of experiments focuses on the measurement of
correlation in the network.
We can compute the social correlation coefficient by
applying logistic regression to each model.
Results
[Figure: estimated values of 𝛼 under the correlation model, the influence model, and the no-correlation model.]
Shuffle Test for Influence Model
The value of 𝛼 decreases after the tagging timestamps are shuffled.
Shuffle Test for Correlation Model
For almost all tags, the values of 𝛼 obtained with and without the shuffle are very close.
Edge-Reversal Test for Influence Model
As in the shuffle test for the influence model, there is a significant
difference in the values of 𝛼 before and after the edges are reversed.
Edge-Reversal Test for Correlation Model
The values of 𝛼 with and without edge reversal essentially coincide.
Experiments on Real Data
 The techniques are effective for the simulated data
 Are they effective for the real-world data, namely on the
Flickr social network?
 Experiment: Analyzing the tagging behavior of users
Real Data
Images may be tagged either by the uploader
or by other users, if the uploader permits it.
The Flickr Dataset
 16 months.
 800K users
 We restricted our attention to the set of
users who have tagged any photo with any
tag, which is about 340K users.
 The proportion of u’s contacts that do not
have u as a contact is 28.5%.
The Flickr Dataset
 Out of about 10K tags that users had used, a set of 1,700 was
selected, and each tag was analyzed independently. The selected tags cover:
 various types (events, colors, objects, etc.)
 various numbers of users (most were used by more than
1,000 users)
 various growth patterns: bursty (e.g., “halloween”, “katrina”),
smooth (e.g., “photos”), and periodic (e.g., “moon”).
Measuring Correlation
For almost all the tags, the value of 𝛼 is higher than 1, suggesting that
correlation is prevalent in users’ tagging activities.
Distinguishing Influence – The Shuffle Test
The correlation cannot be attributed to influence.
Distinguishing Influence – The Edge-Reversal Test
The correlation cannot be attributed to influence.