Porto - ECML PKDD 2015

Complexity and Efficient Algorithms Group /
Department of Computer Science
Approximating Structural Properties of Graphs by Random Walks
Christian Sohler
Very Large Networks
Examples




Social networks
The human brain
Crystals
Chip design
Size


10⁹ – 10²³ vertices
Petabytes of additional information possible
Very Large Networks
Classical graph problems




Connectivity
MinCut, MaxCut
Graph clustering
Graph isomorphism
Difficulties

Graph does not fit into main memory
Classification of Very Large Networks – A Vision
Example questions



Is a country a democracy or a totalitarian
country?
Is a patient schizophrenic?
Is software malicious?
Formalization


Given a set of graphs with class labels
(training set)
Find a classifier for new graphs
Classification of Very Large Networks – A Vision
A typical scenario



Hundreds or thousands of graphs
Each graph is extremely large
Graphs are sparse
A possible approach

(12,3,-5,10,0,0,…,20,3)

Describe graphs by features
(graph properties)
Apply classical learning algorithms
The challenge

Computation of tens of thousands of features
for graphs with billions of vertices
Classification of Very Large Networks – A Sampling Approach
Random Sampling

Compute a graph property approximately
by random sampling
Informal Question

What can we learn from the local structure
of a sparse graph about its global properties?
Sampling from Graphs

How can we sample a graph?
Classification of Very Large Networks – A Sampling Approach
Examples of different sampling strategies
1. Sample set S of s vertices and look at all edges within S
(the subgraph G[S] induced by S)
2. Sample set S of s edges and look at their graph
3. Sample a set S of s vertices and perform a BFS from each of them
4. Sample a set S of s vertices and perform a random walk from each of them
 Many more possibilities…
Question

Which is the right sampling strategy for my learning problem?
Depends on the problem…
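The four sampling strategies listed above can be sketched in a few lines each. This is a minimal illustration, assuming the graph is given as a Python dictionary of adjacency lists; the function names and the toy cycle graph are hypothetical and not part of the talk.

```python
import random
from collections import deque

def induced_subgraph(adj, s, rng):
    """Strategy 1: sample s vertices and keep the edges with both ends in S."""
    S = set(rng.sample(sorted(adj), s))
    return {(u, v) for u in S for v in adj[u] if v in S and u < v}

def edge_sample(adj, s, rng):
    """Strategy 2: sample s edges uniformly at random."""
    edges = sorted((u, v) for u in adj for v in adj[u] if u < v)
    return set(rng.sample(edges, s))

def bfs_sample(adj, start, limit):
    """Strategy 3: BFS from start, stopped once `limit` vertices are discovered."""
    seen, queue = {start}, deque([start])
    while queue and len(seen) < limit:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen and len(seen) < limit:
                seen.add(v)
                queue.append(v)
    return seen

def random_walk(adj, start, length, rng):
    """Strategy 4: simple random walk of `length` steps from start."""
    walk = [start]
    for _ in range(length):
        walk.append(rng.choice(adj[walk[-1]]))
    return walk

# Toy graph (an assumption for illustration): a cycle on 8 vertices.
adj = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
rng = random.Random(0)
print(induced_subgraph(adj, 4, rng))
print(edge_sample(adj, 3, rng))
print(bfs_sample(adj, 0, 5))
print(random_walk(adj, 0, 6, rng))
```

On the cycle, the BFS and the random walk both stay near the start vertex, while the induced subgraph of a small sample is typically almost empty; this is one way to see why the right strategy depends on the graph class.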
Classification of Very Large Networks – A Sampling Approach
Question 1

Assume you have some classification task that involves city maps. Which
of our four sampling methods is your method of choice?
Possible Answers
1. Sample set S of s vertices and look at all edges within S
2. Sample set S of s edges and look at their graph
3. Sample a set S of s vertices and perform a BFS from each of them
4. Sample a set S of s vertices and perform a random walk from each of them
Classification of Very Large Networks – A Sampling Approach
Question 2

Assume you have some classification task that involves social networks.
Which of our four sampling methods is your method of choice?
Possible Answers
1. Sample set S of s vertices and look at all edges within S
2. Sample set S of s edges and look at their graph
3. Sample a set S of s vertices and perform a BFS from each of them
4. Sample a set S of s vertices and perform a random walk from each of them
First Wrap-Up
Motivation


Some classification problems involve sets of huge graphs
No efficient algorithms are known for some fundamental graph problems
Sampling approach

We would like to pick small samples from the graph(s) and use them for
graph classification
Challenge


There are many different sampling procedures
We need to understand which is the right one for which problem
Sampling from Very Large Networks
Property Testing [Rubinfeld, Sudan, 1996, Goldreich, Goldwasser, Ron, 1998]

Formal framework to study sampling algorithms for very large networks
Relaxation of „Standard Decision Problems“




Want to distinguish whether input graph G has a property or is far away from it
If G neither has the property nor is far away from it the algorithm may give an
arbitrary answer
Randomized algorithms with bounded (worst case) error probability
The algorithm only looks at a small part of the graph
Different graph models

Dense graphs, bounded degree graphs, directed graphs
Property Testing in Bounded Degree Graphs
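Bounded degree graphs [Goldreich, Ron, 2002]
[Figure: example graph on vertices 1 to 5 and its adjacency-list representation; "/" marks a missing neighbor]
1: 2 4 /
2: 1 3 5
3: 2 / /
4: 1 5 /
5: 2 4 /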
Undirected Graph G=(V,E)
Maximum degree bounded by D
D constant
Oracle access



V={1,…,n}
n is known to the algorithm
Query(i,j) returns j-th neighbor of vertex i or a
symbol that indicates that this neighbor does
not exist
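The oracle model can be made concrete with a small wrapper around adjacency lists. This is a sketch under assumptions: the class name `BoundedDegreeOracle`, the query counter, and the use of `None` as the "neighbor does not exist" symbol are illustrative choices, and the five-vertex graph with D = 3 is only an example.

```python
class BoundedDegreeOracle:
    """Oracle access to a graph with maximum degree D (hypothetical helper)."""

    def __init__(self, adj, D):
        self.adj = adj          # vertex -> list of neighbors, vertices 1..n
        self.D = D
        self.n = len(adj)       # n is known to the algorithm
        self.queries = 0        # counts oracle queries (query complexity)

    def query(self, i, j):
        """Return the j-th neighbor of vertex i (1-indexed), or None if absent."""
        self.queries += 1
        if not 1 <= j <= self.D:
            raise ValueError("j must be in 1..D")
        neighbors = self.adj[i]
        return neighbors[j - 1] if j <= len(neighbors) else None

# Example graph on 5 vertices with D = 3.
adj = {1: [2, 4], 2: [1, 3, 5], 3: [2], 4: [1, 5], 5: [2, 4]}
oracle = BoundedDegreeOracle(adj, D=3)
print(oracle.query(2, 3))   # third neighbor of vertex 2
print(oracle.query(3, 2))   # vertex 3 has only one neighbor
print(oracle.queries)
```

Counting queries inside the oracle makes it easy to check the query complexity of a tester empirically.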
Property Testing in Bounded Degree Graphs
[Figure: example of a graph property: connected]
Graph properties

A graph property is a set of graphs that is
closed under isomorphism
Definition [Goldreich, Ron, 2002]

ε-far
G=(V,E) is ε-far from P, if one has to modify
more than εDn edges to obtain a bounded
degree graph with property P.
Property Testing in Bounded Degree Graphs
Property Tester for property P [Goldreich, Ron, 2002]



Oracle access to input graph G
Accepts with probability at least 2/3, if G has property P
Rejects with probability at least 2/3, if G is ε-far from P
Quality measures


Query complexity: Maximum number of oracle queries
Running time
A First Example: Connectivity
ConnectivityTester(G,ε,D) [Goldreich, Ron, 2002]
(1) Sample set S with s=8/(εD) vertices uniformly at random from V
(2) For every vertex from S:
(3)   Perform a BFS until
      (a) 4/(εD) vertices have been discovered or
      (b) all vertices of a small connected component have been discovered
(4)   if (b) then reject
(5) accept
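The steps of the tester translate almost directly into code. The sketch below assumes direct access to adjacency lists instead of oracle queries and samples the start vertices with replacement; the function name and the two toy graphs are illustrative assumptions.

```python
import math
import random
from collections import deque

def connectivity_tester(adj, eps, D, rng):
    """Reject iff some sampled vertex lies in a small connected component."""
    vertices = sorted(adj)
    s = math.ceil(8 / (eps * D))          # number of sampled start vertices
    limit = math.ceil(4 / (eps * D))      # BFS budget per start vertex
    for _ in range(s):
        start = rng.choice(vertices)      # assumption: sampling with replacement
        seen, queue = {start}, deque([start])
        while queue and len(seen) <= limit:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        # The BFS exhausted a component of size <= limit that is not all of G.
        if not queue and len(seen) <= limit and len(seen) < len(vertices):
            return False                  # reject
    return True                           # accept

cycle = {i: [(i - 1) % 40, (i + 1) % 40] for i in range(40)}
matching = {i: [i ^ 1] for i in range(80)}          # 40 disjoint edges
rng = random.Random(0)
print(connectivity_tester(cycle, 0.5, 2, rng))      # connected graph: accepted
print(connectivity_tester(matching, 0.5, 2, rng))   # every component has 2 vertices: rejected
```

The tester has one-sided error: a connected graph is never rejected, because a BFS can only exhaust its queue after seeing all n vertices.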
Observation
• ConnectivityTester accepts every connected graph
Claim
• If G is ε-far from connected, then G has more than εDn/2 connected
components.
Claim
• At least εDn/4 of the connected components have size at most 4/(εD).
Theorem
• ConnectivityTester is a property tester with query complexity O(1/(ε²D)).
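The observation, the two claims and the theorem fit together in a short calculation. The following is a sketch of the standard argument, not a transcription of the original proof; in particular, the bound of at most 2c edge modifications to connect c components under the degree bound is stated here as an assumption.

```latex
% Step 1 (assumed accounting): a graph with c components can be made connected,
% respecting the degree bound D, with at most 2c edge modifications.
% If G is \varepsilon-far from connected, then 2c > \varepsilon D n, hence
c > \frac{\varepsilon D n}{2}.

% Step 2: components with more than 4/(\varepsilon D) vertices are vertex-disjoint,
% so there are fewer than n / (4/(\varepsilon D)) = \varepsilon D n / 4 of them. Therefore
\#\{\text{components of size} \le 4/(\varepsilon D)\}
  > \frac{\varepsilon D n}{2} - \frac{\varepsilon D n}{4} = \frac{\varepsilon D n}{4}.

% Step 3: each small component contains at least one vertex, so a uniformly random
% vertex lies in a small component with probability at least \varepsilon D / 4, and
\Pr[\text{all } s = 8/(\varepsilon D) \text{ samples miss}]
  \le \left(1 - \frac{\varepsilon D}{4}\right)^{8/(\varepsilon D)} \le e^{-2} < \frac{1}{3}.

% Step 4: each BFS discovers at most 4/(\varepsilon D) vertices and asks at most D
% queries per discovered vertex, so the total number of queries is at most
\frac{8}{\varepsilon D} \cdot \frac{4}{\varepsilon D} \cdot D = \frac{32}{\varepsilon^2 D}
  = O\!\left(\frac{1}{\varepsilon^2 D}\right).
```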
Second Wrap-Up – Introduction to Property Testing
Property Testing


Approximately decide based on random sampling whether a graph has a
property or is far away from it
Quality measure: Query complexity
Connectivity


Sampling + BFS
Check whether the sample violates the property
Second Wrap-Up – Introduction to Property Testing
Question 3

Is the following algorithm a property tester for planarity (for the right choice of f)?
PlanarityTester(G,ε,D)
(1) Sample set S with s=f(ε,D) vertices uniformly at random from V
(2) For every vertex from S:
(3)   Perform a BFS until
      (a) f(ε,D) vertices have been discovered or
      (b) the discovered graph is not planar
(4)   if (b) then reject
(5) accept
Second Wrap-Up – Introduction to Property Testing
Bad news
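• There is a class of graphs in which every cycle
has length Ω(log n) and that are ε-far from
planar
• A BFS that discovers only f(ε,D) vertices sees no cycle in such a
graph, so the tester above accepts it
Good news
• The sampling is fine, we just need to modify
our acceptance condition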
Random Walks, Stationary Distributions & Convergence
Random Walk

In each step:
move from current vertex v to a neighbor chosen uniformly at random
Convergence


If G is connected and not bipartite, a random walk converges to a unique
stationary distribution
Pr[Random Walk is at vertex v] ∝ deg(v)
Random Walks, Stationary Distributions & Convergence
Random Walks on Maps



A random walk on a planar graph has the
tendency to stay local
It takes a long time to reach the stationary
distribution
Reason: The network has sparse cuts
Random Walks on Social Networks



A random walk will quickly move to a „random
place“
Fast convergence
The network does not have sparse cuts
Random Walks, Stationary Distributions & Convergence
Lazy Random Walk

In each step:
- Probability to move from current vertex v to neighbor u is 1/(2D)
- stays at v with remaining probability
Convergence of Lazy Random Walks

Stationary distribution is uniform
Rate of Convergence


Can be expressed in terms of the conductance of G or the second largest
eigenvalue of the transition matrix
O(log n) steps, if G is an expander graph
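The lazy random walk and its uniform stationary distribution can be checked on a tiny example by iterating the exact distribution. This is a minimal sketch, assuming adjacency lists and a degree bound D; the function name and the path graph are illustrative.

```python
def lazy_walk_step(dist, adj, D):
    """One step of the lazy walk: move to each neighbor w.p. 1/(2D), else stay."""
    new = {v: 0.0 for v in adj}
    for v, mass in dist.items():
        move = mass / (2 * D)
        for u in adj[v]:
            new[u] += move
        new[v] += mass - move * len(adj[v])   # remaining probability stays at v
    return new

# Toy example: path on 4 vertices, degree bound D = 2, walk started at vertex 0.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
dist = {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0}
for _ in range(200):
    dist = lazy_walk_step(dist, adj, D=2)
print({v: round(p, 3) for v, p in dist.items()})   # close to uniform (0.25 each)
```

The distribution becomes uniform even though the degrees differ, because the transition matrix of the lazy walk is symmetric. How many steps are needed depends on the graph: a path has a sparse cut and mixes slowly compared with an expander of the same size.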
Conductance, Expanders & Small Worlds
Definition

The expansion Φ(U) of a set U is defined as
$\Phi(U) = \dfrac{|\{(u,v) \in E : u \in U \text{ and } v \in V \setminus U\}|}{D \cdot |U|}$

The conductance Φ_G of G is $\Phi_G = \min_{U :\, 1 \le |U| \le |V|/2} \Phi(U)$
Definition

A graph G=(V,E) is called φ-expander, if Φ_G ≥ φ for some constant φ.
Interpretations


Expander graphs satisfy the „small-world phenomenon“
Conductance can be viewed as a measure for the social connectivity of a
network
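For very small graphs the two definitions can be evaluated by brute force. The sketch below enumerates all sets U with 1 ≤ |U| ≤ |V|/2, so it takes exponential time and only serves to illustrate the definitions; the function names and the example graph are assumptions.

```python
from itertools import combinations

def expansion(adj, U, D):
    """Phi(U) = (number of edges leaving U) / (D * |U|)."""
    U = set(U)
    cut = sum(1 for u in U for v in adj[u] if v not in U)
    return cut / (D * len(U))

def conductance(adj, D):
    """Phi_G = min over all U with 1 <= |U| <= |V|/2 (tiny graphs only)."""
    V = sorted(adj)
    return min(expansion(adj, U, D)
               for size in range(1, len(V) // 2 + 1)
               for U in combinations(V, size))

# Two triangles joined by a single edge: a sparse cut separates them.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(conductance(adj, D=3))   # 1/9, attained by U = {0, 1, 2}
```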
Testing Expanders
Facts


A lazy random walk converges to uniform distribution
A lazy random walk converges quickly in expander graphs
Hope


A lazy random walk converges much slower, if the graph is ε-far from an
expander graph
In particular, we hope that the distribution of the endpoints of a Θ(log n)-step
lazy random walk differs significantly from the uniform distribution
Question

If so, how could we exploit this to design a property testing algorithm?
The Birthday Problem & Testing Uniform Distributions
Birthday Problem



n possible birthdays
k persons with birthday chosen uniformly at random
How large must k be so that with constant probability two persons have the
same birthday?
Analysis




p = (1/n, …, 1/n)^T
||p||² = 1/n is the collision probability of two birthdays
If we have k persons then the expected number of collisions is $\binom{k}{2} \cdot \|p\|^2$
So, for k = Θ(√n) we expect to see a collision
Testing Uniform Distributions
Observation


The uniform distribution minimizes the expected number of pairwise
collisions
If a distribution q differs significantly from the uniform distribution then
||q||²>>||p||²
TestUniformDistribution(distribution q)
1. Sample Q(n) elements according to q
2. if the number of pairwise collisions is too large then reject
3. else accept
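A collision-counting uniformity test along these lines could look as follows. This is a sketch under assumptions: the sample size `factor * sqrt(n)` and the acceptance threshold `slack` times the uniform expectation are illustrative constants, not the ones from the analysis, and the samplers are hypothetical.

```python
import random
from math import comb, isqrt

def count_collisions(samples):
    """Number of pairs (i < j) with samples[i] == samples[j]."""
    counts = {}
    for x in samples:
        counts[x] = counts.get(x, 0) + 1
    return sum(comb(c, 2) for c in counts.values())

def uniformity_test(sampler, n, rng, factor=4, slack=2.0):
    """Accept iff the collision count is at most `slack` times its uniform expectation.

    `factor` and `slack` are illustrative assumptions, not values from the talk.
    """
    k = factor * isqrt(n) + 1                 # Theta(sqrt(n)) samples
    samples = [sampler(rng) for _ in range(k)]
    expected_uniform = comb(k, 2) / n         # C(k,2) * ||p||^2 with ||p||^2 = 1/n
    return count_collisions(samples) <= slack * expected_uniform

n = 10_000
rng = random.Random(1)
uniform = lambda r: r.randrange(n)            # uniform over all n elements
skewed = lambda r: r.randrange(n // 100)      # uniform over only 1% of the elements
print(uniformity_test(uniform, n, rng), uniformity_test(skewed, n, rng))
```

The skewed sampler has collision probability 100/n instead of 1/n, so its samples collide about a hundred times more often than uniform samples would.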
Testing Expanders
TestingExpanders(G)
1. Sample set S of s vertices uniformly at random
2. for each v ∈ S do
3.    Let q be the distribution of endpoints of a Θ(log n)-step lazy random walk starting at v
4.    if TestUniformDistribution(q) rejects then reject
5. accept
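Putting the lazy random walk and the collision test together gives a rough sketch of the expander tester. All constants (`walk_factor`, `sample_factor`, `slack`) are assumptions chosen for illustration; the analyzed algorithm fixes them as functions of ε and the expansion parameter, and it would use oracle queries rather than direct adjacency lists.

```python
import random
from math import ceil, comb, isqrt, log2

def lazy_walk_endpoint(adj, D, start, steps, rng):
    """Endpoint of a lazy random walk: each neighbor w.p. 1/(2D), else stay."""
    v = start
    for _ in range(steps):
        i = rng.randrange(2 * D)
        if i < len(adj[v]):
            v = adj[v][i]
    return v

def expander_tester(adj, D, s, rng, walk_factor=4, sample_factor=4, slack=2.0):
    """Reject if the walk endpoints from some sampled start vertex collide too often."""
    vertices = sorted(adj)
    n = len(vertices)
    steps = walk_factor * ceil(log2(n))          # Theta(log n) steps
    k = sample_factor * isqrt(n) + 1             # Theta(sqrt(n)) walks per start vertex
    threshold = slack * comb(k, 2) / n           # slack * expected uniform collisions
    for _ in range(s):
        start = rng.choice(vertices)
        counts = {}
        for _ in range(k):
            end = lazy_walk_endpoint(adj, D, start, steps, rng)
            counts[end] = counts.get(end, 0) + 1
        if sum(comb(c, 2) for c in counts.values()) > threshold:
            return False                         # reject
    return True                                  # accept

# 100 disjoint copies of K4 (D = 3): walks never leave their 4-vertex component.
adj = {v: [u for u in range(4 * (v // 4), 4 * (v // 4) + 4) if u != v]
       for v in range(400)}
print(expander_tester(adj, D=3, s=2, rng=random.Random(0)))   # False
```

In the example every walk is confined to four vertices, so the 81 endpoints collide far more often than the uniform threshold allows and the tester rejects.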
History
• Algorithm was proposed by [Goldreich and Ron, 2000] and conjectured to be
a property tester
• First complete analysis by [Czumaj and Sohler, 2010]
(but weaker than conjectured)
• Later improved by [Nachmias and Shapira, 2010] and [Kale and Seshadhri, 2011]
Final Result
Theorem [Nachmias and Shapira, 2010, Kale and Seshadhri, 2011]

Algorithm TestingExpanders accepts every φ-expander and rejects every
graph that is ε-far from a Θ(φ²)-expander. The algorithm has a running time
of O(n^(1/2+δ)).
Key structural property of "ε-far" graphs


If G is ε-far from a Θ(φ²)-expander then there exists a set U of Ω(εn)
vertices with Φ(U) = O(φ²).
Implies that for many vertices, the distribution of endpoints of a random
walk of length O(log n) is significantly different from the uniform distribution
Third Wrap-Up – Testing Expansion
(Lazy) Random Walks



Moves from a vertex to a random neighbor
Converges to uniform distribution
Speed of convergence depends on graph structure
Testing Expansion




Random Walk converges quickly in expander graphs
Random Walk converges slower if we are far from expander graphs
Number of collisions among end points of random walks is minimized in
expander graphs
We can test expansion by counting collisions
Graph Clustering & Web Communities
Web Graph Communities


Set of vertices that induces an expander graph and has a sparse cut to the
rest of the graph
Question: Is the web graph composed of a set of at most k communities?
Definition



A subset C ⊆ V is called (Φ_in, Φ_out)-cluster, if
Φ(G[C]) ≥ Φ_in, where Φ(G[C]) is the conductance of the induced subgraph G[C]
Φ(C) ≤ Φ_out
Definition

A partition of V into at most k (Φ_in, Φ_out)-clusters is called (k, Φ_in, Φ_out)-clustering
Testing k-Clusterings
[Figure: four disjoint graphs, each labeled "Expander"]
A Simple Case?


Distinguish between a union of at most k expander graphs with no edges in
between and a set of more than k (large) expander graphs with no edges
in between
Can we use our previous algorithm to test for a k-clustering?
No! We do not know the size of the clusters (expander graphs) and estimating
the support size of a distribution is hard [Raskhodnikova et al., 2009]
New idea


If two vertices come from the same cluster, the random walks quickly
converge to the same distribution
So, we could try to sample a set of vertices and check for sets of vertices
whose random walks induce the same distributions
Testing Closeness of Distributions
Main Idea [Batu et al. 2013; Chan et al. 2014]

If p ≈ q then the following experiments should give roughly the same
number of collisions between elements from S and T:



Draw two sets S and T of m elements from p
Draw two sets S and T of m elements from q
Draw set S of m elements from p and set T of m elements from q

If p and q differ significantly, at least one of the three values is different
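The three experiments can be combined into an estimate of the squared ℓ2 distance, using the identity ||p−q||² = ⟨p,p⟩ + ⟨q,q⟩ − 2⟨p,q⟩ and the fact that the expected number of collisions between two sets of m samples drawn from a and b is m²·⟨a,b⟩. This is a sketch of that idea, not the tester from the cited papers; the function names and samplers are assumptions.

```python
import random
from collections import Counter

def cross_collisions(S, T):
    """Number of pairs (x in S, y in T) with x == y."""
    counts = Counter(T)
    return sum(counts[x] for x in S)

def l2_distance_sq_estimate(sample_p, sample_q, m, rng):
    """Estimate ||p - q||^2 from the three collision experiments."""
    draw = lambda sampler: [sampler(rng) for _ in range(m)]
    c_pp = cross_collisions(draw(sample_p), draw(sample_p))   # S, T both from p
    c_qq = cross_collisions(draw(sample_q), draw(sample_q))   # S, T both from q
    c_pq = cross_collisions(draw(sample_p), draw(sample_q))   # S from p, T from q
    return (c_pp + c_qq - 2 * c_pq) / (m * m)

rng = random.Random(0)
p = lambda r: r.randrange(10)          # uniform on {0, ..., 9}
q = lambda r: 10 + r.randrange(10)     # uniform on {10, ..., 19}
print(l2_distance_sq_estimate(p, q, 2000, rng))   # roughly 1/10 + 1/10
print(l2_distance_sq_estimate(p, p, 2000, rng))   # roughly 0
```

When p and q are equal, the three collision counts agree up to sampling noise and the estimate is near zero; when they differ, at least one count deviates.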
Testing Closeness of Distributions
Theorem [Batu et al. 2013; Chan et al. 2014]

There is a tester that w.p. 2/3 accepts, if ||p-q||≤ε/2 and rejects, if ||p-q||≥ε.
The query complexity of the algorithm is O(b/ε²), where b is an upper
bound on ||p||² and ||q||².
We will need b to be O(1/n)
The Algorithm
ClusteringTest
1. Sample set S of s vertices uniformly at random
2. For any v ∈ S let D(v) be the distribution of end points of a random walk of
   length Θ(log n) starting at v
3. for each pair u,v ∈ S do
4.    if D(u) and D(v) are close then add an edge (u,v) to the "cluster graph"
      on vertex set S
5. accept, if and only if the cluster graph is a collection of at most k cliques
Testing k-Clusterings
Observation

Algorithm ClusteringTest distinguishes between at most k expanders and
more than k (large) expanders
Can we generalize it to testing of (k, Φ_in, Φ_out)-clusterings?
Testing k-Clusterings - Soundness
Challenge

Since the clusters may be connected in a (k, Φ_in, Φ_out)-clustering the
stationary distribution may be uniform over G (and not over the cluster)
Need to show that for proper length of the random walk there is an
"intermediate" distribution that is "reasonably stable" w.r.t. ℓ2-error
The Algorithm
ClusteringTest
1. Sample set S of s vertices uniformly at random
2. For any v ∈ S let D(v) be the distribution of end points of a random walk of
   length Θ(log n) starting at v
3. if ||D(v)||² > O(1/n) then reject
4. for each pair u,v ∈ S do
5.    if D(u) and D(v) are close then add an edge (u,v) to the "cluster graph"
      on vertex set S
6. accept, if and only if the cluster graph is a collection of at most k
   connected components
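The steps of ClusteringTest can be sketched end to end with collision-based estimates of ||D(v)||² and ||D(u)−D(v)||². This is an illustration under assumptions, not the analyzed algorithm: the walk length `4 * ceil(log2 n)`, the norm bound 16/n, the closeness threshold 1/n, and all function names are hypothetical choices, and lazy walks are used throughout.

```python
import random
from collections import Counter
from math import ceil, log2

def endpoints(adj, D, start, steps, m, rng):
    """Endpoints of m lazy random walks (each neighbor w.p. 1/(2D), else stay)."""
    result = []
    for _ in range(m):
        v = start
        for _ in range(steps):
            i = rng.randrange(2 * D)
            if i < len(adj[v]):
                v = adj[v][i]
        result.append(v)
    return result

def inner(S, T):
    """Collision-based estimate of the inner product of two endpoint distributions."""
    counts = Counter(T)
    return sum(counts[x] for x in S) / (len(S) * len(T))

def clustering_test(adj, D, k, s, m, rng):
    n = len(adj)
    norm_bound, close = 16 / n, 1 / n              # assumed thresholds
    steps = 4 * ceil(log2(n))                      # Theta(log n) steps
    S = rng.sample(sorted(adj), s)
    A = {v: endpoints(adj, D, v, steps, m, rng) for v in S}   # two independent
    B = {v: endpoints(adj, D, v, steps, m, rng) for v in S}   # sample sets per vertex
    if any(inner(A[v], B[v]) > norm_bound for v in S):        # ||D(v)||^2 check
        return False
    parent = {v: v for v in S}                     # union-find on the cluster graph
    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v
    for i, u in enumerate(S):
        for v in S[i + 1:]:
            dist = inner(A[u], B[u]) + inner(A[v], B[v]) - 2 * inner(A[u], B[v])
            if dist <= close:                      # D(u) and D(v) are close
                parent[find(u)] = find(v)
    return len({find(v) for v in S}) <= k          # at most k connected components

# Three disjoint copies of K8 (D = 7), i.e. three expanders with no edges in between.
adj = {v: [u for u in range(8 * (v // 8), 8 * (v // 8) + 8) if u != v]
       for v in range(24)}
print(clustering_test(adj, D=7, k=3, s=12, m=200, rng=random.Random(0)))
print(clustering_test(adj, D=7, k=2, s=12, m=200, rng=random.Random(0)))
```

On this example the test is expected to accept for k = 3 and, whenever the sample contains vertices from all three cliques, to reject for k = 2.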
Testing k-Clusterings - Completeness
Required Properties of a (k, Φ_in, Φ_out)-clustering


For most vertices v: The distribution D(v) of end points of a lazy random
walk of proper length has ||D(v)||² = O(1/n)
For most pairs u,v from the same cluster: ||D(v)- D(u)||² is very small
Useful Tool – Higher Order Cheeger's Inequality [Lee et al. 2014]

Relates (k, Φ_in, Φ_out)-clustering to the k+1 largest eigenvalues
Testing k-Clusterings - Soundness
Structural property of "ε-far" graphs (similarly to expanders)

If G is ε-far from a (k, Φ_in*, Φ_out*)-clustering then there exists a partition into
k+1 sets C_1,…,C_(k+1), each of Ω(ε²n/k) vertices and with Φ(C_i) = O(Φ_in*/ε²).
Testing k-Clusterings
Theorem [Czumaj, Peng, Sohler, 2015]


Algorithm ClusteringTest accepts every (k, Φ_in, Φ_out)-clustering with
probability at least 2/3 and rejects every graph that is ε-far from every
(k, Φ_in*, Φ_out*)-clustering with probability at least 2/3, where
Φ_out = O(ε⁴ Φ_in²) and Φ_in* = Θ(ε⁴ Φ_in²/log n) for constants k,D.
The running time of the algorithm is O*(√n).
Fourth Wrap-Up
Testing Clusterings




End points of Random Walk of proper length should be uniform on its
cluster with not much probability „outside“
If Random Walks start from two different points of the same cluster, their
end point distributions are similar
Collision statistics can be used to pairwise test similarity of distributions
This can be used to approximate the cut structure
Take away message

The distribution of end points of random walks (possibly comparing
different starting vertices) contains a lot of information about the cut
structure of a graph
Summary
Vision

Learning from very large sets of massive graphs
Approach


Feature computation by random sampling
Analysis in the framework of property testing
Two Examples


Expanders (connectivity measure in social networks)
Clustering (structure of social networks)
Thank you!
Source
Slide 2: Allan Ajifo and cobalt123; Creative Commons license
Slide 3: GustavoG and Jasper Nance; Creative Commons license
Slide 4: Wikipedia; Jason Brown; Creative Commons license
Slide 5: GustavoG; Creative Commons license
Slide 6: GoldenRibbon; Creative Commons license