Graph Algorithms
Juliana Freire
Modified slides by Jimmy Lin
Today’s Agenda
- Graph problems and representations
- Parallel breadth-first search
- PageRank
What’s a graph?
- G = (V, E), where
  - V represents the set of vertices (nodes)
  - E represents the set of edges (links)
  - Both vertices and edges may contain additional information
- Different types of graphs:
  - Directed vs. undirected edges
  - Presence or absence of cycles
- Graphs are everywhere:
  - Hyperlink structure of the Web
  - Physical structure of computers on the Internet
  - Interstate highway system
  - Social networks
Road Networks
- Route planning
- Traffic density

Document Graphs
http://en.wikipedia.org/wiki/File:WorldWideWebAroundWikipedia.png

Social Networks
Some Graph Problems
- Finding shortest paths
  - Routing Internet traffic and UPS trucks
- Finding minimum spanning trees
  - Telco laying down fiber
- Finding Max Flow
  - Airline scheduling
- Identify “special” nodes and communities
  - Breaking up terrorist cells, spread of avian flu
- Bipartite matching
  - Monster.com, Match.com
- And of course... PageRank
Graphs and MapReduce
- Graph algorithms typically involve:
  - Performing computations at each node: based on node features, edge features, and local link structure
  - Propagating computations: “traversing” the graph
- Key questions:
  - How do you represent graph data in MapReduce?
  - How do you traverse a graph in MapReduce?
Representing Graphs
- G = (V, E)
- Two common representations:
  - Adjacency matrix
  - Adjacency list
Adjacency Matrices
Represent a graph as an n × n square matrix M
- n = |V|
- Mij = 1 means a link from node i to j

       1  2  3  4
    1  0  1  0  1
    2  1  0  1  1
    3  1  0  0  0
    4  1  0  1  0

(Figure: the four-node directed graph corresponding to this matrix.)
Adjacency Matrices
Represent a graph as an n × n square matrix M
- n = |V|
- Mij = the weight of the link from node i to j

(Figure: the same four-node example graph, now with positive edge weights; each matrix cell Mij holds the weight of edge i → j, and 0 marks an absent edge.)
Adjacency Matrices: Critique
- Advantages:
  - Amenable to mathematical manipulation
  - Iteration over rows corresponds to computations on outlinks
  - Iteration over columns corresponds to computations on inlinks
- Disadvantages:
  - Many real graphs are sparse – the number of actual edges is much smaller than the number of possible edges
  - Lots of zeros for sparse matrices
  - Lots of wasted space
Adjacency Lists
Take adjacency matrices… and throw away all the zeros

       1  2  3  4
    1  0  1  0  1          1: 2, 4
    2  1  0  1  1          2: 1, 3, 4
    3  1  0  0  0          3: 1
    4  1  0  1  0          4: 1, 3
Adjacency Lists: Critique
- Advantages:
  - Much more compact representation
  - Easy to compute over outlinks
- Disadvantages:
  - More difficult to compute over inlinks
  - How can this be dealt with in MapReduce?
Challenge Question
- How would you invert a graph in MapReduce?

(Figure: a four-node example graph and its inversion, with every edge direction reversed.)
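One common answer: map over each node’s adjacency list and emit every edge reversed, then let the reducer collect the inlinks. A minimal single-machine simulation in Python (the in-memory “shuffle” dict and the function names are illustrative, not a real MapReduce API):

```python
from collections import defaultdict

def map_invert(node, adj_list):
    """For each outlink node -> m, emit the reversed edge (m, node)."""
    for m in adj_list:
        yield (m, node)

def reduce_invert(node, sources):
    """Collect every source pointing at node: its inverted adjacency list."""
    return (node, sorted(sources))

# Simulate the job on the four-node example graph from the earlier slides.
graph = {1: [2, 4], 2: [1, 3, 4], 3: [1], 4: [1, 3]}

shuffle = defaultdict(list)          # stands in for the sort/shuffle phase
for node, adj in graph.items():
    for key, value in map_invert(node, adj):
        shuffle[key].append(value)

inverted = dict(reduce_invert(n, s) for n, s in shuffle.items())
print(sorted(inverted.items()))
# [(1, [2, 3, 4]), (2, [1]), (3, [2, 4]), (4, [1, 2])]
```

The key idea carries over directly to Hadoop: the shuffle does the grouping by destination node for free, which is exactly the step that is awkward with a plain adjacency-list representation.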
Single Source Shortest Path
- Problem: find shortest path from a source node to one or more target nodes
  - Shortest might also mean lowest weight or cost
- First, a refresher: Dijkstra’s Algorithm
Dijkstra’s Algorithm Example

for all v in V
    d[v] = maxint
d(s) = 0
Q = {s, n1, n2, n3, n4}

Get from Q the node with smallest distance
(Figure: the example graph from CLR; d(s) = 0, all other distances ∞.)
Dijkstra’s Algorithm Example

u = s
for all v in u.AdjList
    if d[v] > d[u] + w(u,v)
        d[v] = d[u] + w(u,v)

d(s) = 0, d(n1) = 10, d(n2) = 5
Q = {n2, n1, n3, n4}

Get from Q the node with smallest distance
(Figure: after expanding s, d(n1) = 10 and d(n2) = 5; n3 and n4 remain ∞. Example from CLR.)
Dijkstra’s Algorithm Example

u = n2
for all v in u.AdjList
    if d[v] > d[u] + w(u,v)
        d[v] = d[u] + w(u,v)

d(s) = 0, d(n1) = 8, d(n2) = 5, d(n3) = 7, d(n4) = 14
Q = {n3, n1, n4}

Get from Q the node with smallest distance
(Figure: after expanding n2, d(n1) improves to 8, d(n3) to 7, and d(n4) to 14. Example from CLR.)
Dijkstra’s Algorithm Example
(Figures: the remaining iterations expand n3, n1, and n4 in turn; d(n4) improves first to 13 and then to 9. Final distances: d(s) = 0, d(n1) = 8, d(n2) = 5, d(n3) = 7, d(n4) = 9. Example from CLR.)
Dijkstra’s Algorithm

Dijkstra’s algorithm is based on maintaining a global priority queue of nodes with priorities equal to their distances from the source node. At each iteration, the algorithm expands the node with the shortest distance and updates distances to all reachable nodes.

1: Dijkstra(G, w, s)
2:   for all vertex v ∈ V do
3:     d[v] ← ∞
4:   d[s] ← 0
5:   Q ← {V}
6:   while Q ≠ ∅ do
7:     u ← ExtractMin(Q)
8:     for all vertex v ∈ u.AdjacencyList do
9:       if d[v] > d[u] + w(u, v) then
10:        d[v] ← d[u] + w(u, v)
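The pseudo-code above can be turned into runnable form with a binary heap standing in for the priority queue. The sketch below uses Python’s heapq with lazy deletion of stale entries instead of a decrease-key operation; the example graph is a reconstruction of the CLR instance from the distance values shown in the preceding slides:

```python
import heapq

def dijkstra(graph, s):
    """graph: {node: [(neighbor, weight), ...]}. Returns d, the shortest
    distance from s to every node, following the slide's pseudo-code."""
    d = {v: float('inf') for v in graph}
    d[s] = 0
    pq = [(0, s)]                      # global priority queue keyed by d[v]
    while pq:
        dist_u, u = heapq.heappop(pq)  # ExtractMin(Q)
        if dist_u > d[u]:
            continue                   # stale entry: u was already expanded
        for v, w in graph[u]:          # relax every edge out of u
            if d[v] > d[u] + w:
                d[v] = d[u] + w
                heapq.heappush(pq, (d[v], v))
    return d

# Reconstructed CLR example instance stepped through in the slides.
graph = {
    's':  [('n1', 10), ('n2', 5)],
    'n1': [('n2', 2), ('n4', 1)],
    'n2': [('n1', 3), ('n3', 2), ('n4', 9)],
    'n3': [('s', 7), ('n4', 6)],
    'n4': [('n3', 4)],
}
print(dijkstra(graph, 's'))  # {'s': 0, 'n1': 8, 'n2': 5, 'n3': 7, 'n4': 9}
```

Note the contrast with what follows: this sequential version depends on a global priority queue, which is exactly the data structure MapReduce cannot maintain.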
Single Source Shortest Path
- Problem: find shortest path from a source node to one or more target nodes
  - Shortest might also mean lowest weight or cost
- Single processor machine: Dijkstra’s Algorithm
- MapReduce: parallel Breadth-First Search (BFS)
Finding the Shortest Path
- Consider simple case of equal edge weights
- Solution to the problem can be defined inductively
- Here’s the intuition:
  - Define: b is reachable from a if b is on adjacency list of a
  - DISTANCETO(s) = 0
  - For all nodes p reachable from s, DISTANCETO(p) = 1
  - For all nodes n reachable from some other set of nodes M,
    DISTANCETO(n) = 1 + min(DISTANCETO(m), m ∈ M)

(Figure: node n reachable from nodes m1, m2, m3, which lie at distances d1, d2, d3 from source s.)
(Image: a water wave. Source: Wikipedia (Wave).)

Visualizing Parallel BFS
(Figure: a ten-node graph, n0–n9, with BFS exploring outward from the center node in waves.)
From Intuition to Algorithm
- Data representation:
  - Key: node n
  - Value: d (distance from start), adjacency list (list of nodes reachable from n)
  - Initialization: for all nodes except for the start node, d = ∞
- Mapper:
  - ∀m ∈ adjacency list: emit (m, d + 1)
- Sort/Shuffle:
  - Groups distances by reachable nodes
- Reducer:
  - Selects minimum distance path for each reachable node
  - Additional book-keeping needed to keep track of actual path
Multiple Iterations Needed
- Each MapReduce iteration advances the “known frontier” by one hop
  - Subsequent iterations include more and more reachable nodes as the frontier expands
  - Multiple iterations are needed to explore the entire graph
- Preserving graph structure:
  - Problem: Where did the adjacency list go?
  - Solution: mapper emits (n, adjacency list) as well
    - Need a wrapper object with a variable to indicate the type of content
BFS Pseudo-Code
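The pseudo-code on this slide is an image in the original deck; the mapper/reducer logic it describes can be sketched as a single-machine simulation in Python. Each node’s value carries both its current distance and its adjacency list so the graph structure survives the iteration, and the tagged tuples play the role of the wrapper object mentioned on the previous slide (the function names and in-memory shuffle are illustrative, not the Hadoop API):

```python
from collections import defaultdict

INF = float('inf')

def bfs_map(node, dist, adj_list):
    yield (node, ('graph', (dist, adj_list)))   # preserve structure + own d
    if dist < INF:                              # discovered: propose d + 1
        for m in adj_list:
            yield (m, ('dist', dist + 1))

def bfs_reduce(values):
    dist, adj_list = INF, []
    for kind, payload in values:
        if kind == 'graph':
            d, adj_list = payload
            dist = min(dist, d)
        else:
            dist = min(dist, payload)           # keep the minimum distance
    return (dist, adj_list)

def bfs_iteration(state):
    """One MapReduce pass: state maps node -> (distance, adjacency list)."""
    shuffle = defaultdict(list)
    for node, (dist, adj) in state.items():
        for key, value in bfs_map(node, dist, adj):
            shuffle[key].append(value)
    return {node: bfs_reduce(vals) for node, vals in shuffle.items()}

# Four-node example graph from the representation slides; source is node 1.
state = {1: (0, [2, 4]), 2: (INF, [1, 3, 4]), 3: (INF, [1]), 4: (INF, [1, 3])}
for _ in range(3):   # the external driver runs the iterations
    state = bfs_iteration(state)
print(sorted((n, d) for n, (d, _) in state.items()))
# [(1, 0), (2, 1), (3, 2), (4, 1)]
```

Each call to bfs_iteration advances the frontier by one hop, which is why the driver loop on the next slides matters: the number of iterations, not the code inside one iteration, determines when all shortest distances are final.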
Stopping Criterion
- How many iterations are needed in parallel BFS (equal edge weight case)?
- Convince yourself: when a node is first “discovered”, we’ve found the shortest path
- Now answer the question...
  - Six degrees of separation? Everyone on the planet is connected to everyone else by at most 6 steps
- Practicalities of implementation in MapReduce:
  - Termination checking must be done outside MapReduce
  - Need a driver program that orchestrates the iterations: M1→R1 → M2→R2 → … → Mn→Rn
  - Use Hadoop “counters” to keep track of the number of nodes with d = ∞
Hadoop Counters
Incrementing the desired counters from your map and reduce methods is done through the Context of your mapper or reducer (in a previous Hadoop version it was done through the Reporter.incrCounter() method, but the Reporter no longer exists in Hadoop 0.20).

So let’s see an example. We’ll take the word count example revised for version 0.20.1 to illustrate the use of counters. We will create a counter group called WordsNature that will count how many unique tokens there are in all, how many unique tokens start with a digit, and how many unique tokens start with a letter.

So our enum declaration will look like that:

static enum WordsNature { STARTS_WITH_DIGIT, STARTS_WITH_LETTER, ALL }

We will also need a very basic StringUtils class:

package com.philippeadjiman.hadooptraining;

public class StringUtils {
    public static boolean startsWithDigit(String s) {
        if (s == null || s.length() == 0)
            return false;
        return Character.isDigit(s.charAt(0));
    }
    public static boolean startsWithLetter(String s) {
        if (s == null || s.length() == 0)
            return false;
        return Character.isLetter(s.charAt(0));
    }
}

Since we are interested in unique tokens, we will put the code related to the counters into the reduce method. So here is how the reduce method will look:

public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    String token = key.toString();
    if (StringUtils.startsWithDigit(token)) {
        context.getCounter(WordsNature.STARTS_WITH_DIGIT).increment(1);
    } else if (StringUtils.startsWithLetter(token)) {
        context.getCounter(WordsNature.STARTS_WITH_LETTER).increment(1);
    }
    context.getCounter(WordsNature.ALL).increment(1);
    for (IntWritable value : values) {
        sum += value.get();
    }
    context.write(key, new IntWritable(sum));
}

http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/Counters.html
http://www.philippeadjiman.com/blog/2010/01/07/hadoop-tutorial-series-issue-3-counters-in-action/
Comparison to Dijkstra
- Dijkstra’s algorithm is more efficient
  - At any step it only pursues edges from the minimum-cost path inside the frontier
- MapReduce explores all paths in parallel
  - Lots of “waste”
  - Useful work is only done at the “frontier”
- Why can’t we do better using MapReduce?
Weighted Edges
- Now add positive weights to the edges
  - Why can’t edge weights be negative?
- Simple change: adjacency list now includes a weight w for each edge
  - In the mapper, emit (m, d + w) instead of (m, d + 1) for each node m
- That’s it?
Stopping Criterion
- How many iterations are needed in parallel BFS (positive edge weight case)?
- Convince yourself: when a node is first “discovered”, we’ve found the shortest path

Not true!
Additional Complexities
(Figure: a search frontier expanding from n1. A direct edge n1 → n6 has weight 10, while a chain of unit-weight edges n1 → n2 → n3 → n4 → n5 → n6 has total weight 5. BFS only discovers the shortest path between n1 and n6 at the 5th iteration.)
Stopping Criterion
- How many iterations are needed in parallel BFS (positive edge weight case)?
  - The algorithm terminates when shortest distances at every node no longer change
  - How would you implement this using counters?
Graphs and MapReduce
- Graph algorithms typically involve:
  - Performing computations at each node: based on node features, edge features, and local link structure
  - Propagating computations: “traversing” the graph
- Generic recipe:
  - Represent graphs as adjacency lists
  - Perform local computations in mapper
  - Pass along partial results via outlinks, keyed by destination node
  - Perform aggregation in reducer on inlinks to a node
  - Iterate until convergence: controlled by external “driver”
  - Don’t forget to pass the graph structure between iterations
Random Walks Over the Web
- Random surfer model:
  - User starts at a random Web page
  - User randomly clicks on links, surfing from page to page
- PageRank
  - Characterizes the amount of time spent on any given page
  - Mathematically, a probability distribution over pages
  - Harder to trick
- PageRank captures notions of page importance
  - Correspondence to human intuition?
  - One of thousands of features used in web search
  - Note: query-independent
PageRank: Defined
Given page x with inlinks t1…tn, where
- C(t) is the out-degree of t
- α is probability of random jump
- N is the total number of nodes in the graph

    PR(x) = α (1/N) + (1 − α) Σ_{i=1..n} PR(t_i) / C(t_i)

(Figure: page X with inlinks t1, t2, …, tn.)
Computing PageRank
- Properties of PageRank:
  - Can be computed iteratively
  - Effects at each iteration are local
- Sketch of algorithm:
  - Start with seed PRi values
  - Each page distributes PRi “credit” to all pages it links to
  - Each target page adds up “credit” from multiple in-bound links to compute PRi+1
  - Iterate until values converge
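The sketch above can be run directly in the simplified setting (no random jump factor, no dangling nodes). An illustrative Python version, using the five-node example graph from the sample-iteration slides that follow:

```python
def pagerank_simple(graph, iterations):
    """Simplified PageRank: no random jump factor, no dangling nodes."""
    pr = {node: 1.0 / len(graph) for node in graph}     # seed PR values
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, outlinks in graph.items():
            share = pr[node] / len(outlinks)            # distribute credit
            for m in outlinks:
                nxt[m] += share                         # add up credit
        pr = nxt
    return pr

# Five-node example graph used in the sample-iteration slides.
graph = {'n1': ['n2', 'n4'], 'n2': ['n3', 'n5'], 'n3': ['n4'],
         'n4': ['n5'], 'n5': ['n1', 'n2', 'n3']}
pr = pagerank_simple(graph, 1)
print({n: round(v, 3) for n, v in pr.items()})
# {'n1': 0.067, 'n2': 0.167, 'n3': 0.167, 'n4': 0.3, 'n5': 0.3}
```

One iteration reproduces the numbers on the “Sample PageRank Iteration (1)” slide; note that because every node has outlinks, the total mass stays at 1.0 across iterations.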
Simplified PageRank
- First, tackle the simple case:
  - No random jump factor
  - No dangling links
- Then, factor in these complexities…
  - Why do we need the random jump?
  - Where do dangling links come from?
Sample PageRank Iteration (1)
(Figure: Iteration 1. Every node starts with PR = 0.2 and distributes it evenly over its outlinks; e.g., n1 sends 0.1 to each of n2 and n4, while n5 sends 0.066 to each of n1, n2, and n3. Resulting values: n1 (0.066), n2 (0.166), n3 (0.166), n4 (0.3), n5 (0.3).)
Sample PageRank Iteration (2)
(Figure: Iteration 2. Starting from n1 (0.066), n2 (0.166), n3 (0.166), n4 (0.3), n5 (0.3), the resulting values are n1 (0.1), n2 (0.133), n3 (0.183), n4 (0.2), n5 (0.383).)
PageRank in MapReduce
(Figure: the Map phase reads each node with its adjacency list — n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3] — and emits PageRank mass to each outlinked node, keyed by destination. The shuffle groups mass by destination node, and the Reduce phase sums the incoming mass for each node and writes the node back out with its adjacency list.)
PageRank Pseudo-Code
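The pseudo-code on this slide is an image in the original deck; an illustrative single-machine simulation of the map/reduce decomposition for the simplified case is sketched below. As in the BFS sketch, tagged tuples distinguish graph structure from PageRank mass, and the names are not a real Hadoop API:

```python
from collections import defaultdict

def pr_map(node, rank, adj_list):
    yield (node, ('graph', adj_list))             # pass graph structure along
    for m in adj_list:
        yield (m, ('mass', rank / len(adj_list)))  # distribute PageRank mass

def pr_reduce(values):
    rank, adj_list = 0.0, []
    for kind, payload in values:
        if kind == 'graph':
            adj_list = payload
        else:
            rank += payload                       # sum incoming PageRank mass
    return (rank, adj_list)

def pr_iteration(state):
    """One pass: state maps node -> (rank, adjacency list)."""
    shuffle = defaultdict(list)
    for node, (rank, adj) in state.items():
        for key, value in pr_map(node, rank, adj):
            shuffle[key].append(value)
    return {node: pr_reduce(vals) for node, vals in shuffle.items()}

state = {'n1': (0.2, ['n2', 'n4']), 'n2': (0.2, ['n3', 'n5']),
         'n3': (0.2, ['n4']), 'n4': (0.2, ['n5']),
         'n5': (0.2, ['n1', 'n2', 'n3'])}
state = pr_iteration(state)
print(sorted((n, round(r, 3)) for n, (r, _) in state.items()))
# [('n1', 0.067), ('n2', 0.167), ('n3', 0.167), ('n4', 0.3), ('n5', 0.3)]
```

This reproduces the first sample iteration while following the generic recipe: local computation in the mapper, aggregation over inlinks in the reducer, and the graph structure re-emitted so the next iteration can find it.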
Complete PageRank
- Two additional complexities:
  - What is the proper treatment of dangling nodes?
  - How do we factor in the random jump factor?

    PR(x) = α (1/N) + (1 − α) Σ_{i=1..n} PR(t_i) / C(t_i)

- Solution:
  - Use counters to keep track of lost mass, or create a special key — when a dangling node is encountered, emit (specialkey, pr)
  - Second pass to redistribute “missing PageRank mass” and account for random jumps:

    p' = α (1/|G|) + (1 − α) (m/|G| + p)

  - p is the PageRank value from before, p' is the updated PageRank value
  - |G| is the number of nodes in the graph
  - m is the missing PageRank mass
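A quick numeric check of the redistribution formula, with hypothetical values (α = 0.1, |G| = 5, missing mass m = 0.2):

```python
def redistribute(p, m, G, alpha):
    """Second pass: p' = alpha * (1/|G|) + (1 - alpha) * (m/|G| + p)."""
    return alpha * (1.0 / G) + (1 - alpha) * (m / G + p)

# Hypothetical values: 5-node graph, jump probability 0.1, missing mass 0.2.
p_new = redistribute(p=0.24, m=0.2, G=5, alpha=0.1)
print(round(p_new, 3))  # 0.1*0.2 + 0.9*(0.04 + 0.24) = 0.272
```

Summed over all |G| nodes, p' totals α + (1 − α)(m + Σp), which equals 1 whenever Σp = 1 − m: the mass lost to dangling nodes is exactly restored and the result remains a probability distribution.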
PageRank Convergence
- Alternative convergence criteria (the driver program needs to check for convergence):
  - Iterate until PageRank values don’t change
  - Fixed number of iterations
- Convergence for web graphs?
Beyond PageRank
- Link structure is important for web search
  - PageRank is one of many link-based features: HITS, SALSA, etc.
  - One of many thousands of features used in ranking…
- Adversarial nature of web search
  - Link spamming
  - Spider traps
  - Keyword stuffing
  - …
Efficient Graph Algorithms
- Runtime dominated by copying intermediate data across the network
- Sparse vs. dense graphs
  - MapReduce algorithms are often impractical for dense graphs
- Optimization is possible through the use of in-mapper combiners
  - E.g., in parallel BFS, compute the min in the mapper
  - Only useful if nodes pointed to by multiple nodes are processed by the same map task
  - Maximize opportunities for local aggregation
    - Simple tricks: sorting the dataset in specific ways
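An in-mapper combiner for parallel BFS might look like the following sketch (illustrative Python, not the Hadoop API): the mapper buffers the minimum proposed distance per destination node across map() calls and emits only once, when the map task finishes.

```python
class BFSMapperWithCombiner:
    """Buffers the minimum proposed distance per destination node and
    emits only once, when the map task finishes (close())."""
    def __init__(self):
        self.best = {}   # in-mapper combiner state, held across map() calls

    def map(self, node, dist, adj_list):
        for m in adj_list:
            proposed = dist + 1
            # Local aggregation: keep only the minimum per destination.
            if proposed < self.best.get(m, float('inf')):
                self.best[m] = proposed

    def close(self):
        return sorted(self.best.items())   # emitted once per map task

mapper = BFSMapperWithCombiner()
mapper.map('n1', 2, ['n3', 'n4'])
mapper.map('n2', 1, ['n3'])   # proposes 2 for n3, beating n1's proposal of 3
print(mapper.close())  # [('n3', 2), ('n4', 3)]
```

Instead of two (dist) values for n3 crossing the network, only the smaller one is emitted, which is precisely the saving the slide describes: it pays off only when several source nodes pointing at the same destination land in the same map task.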
Twister
http://www.iterativemapreduce.org/
- Extensions to the MapReduce programming model
- Uses pub/sub messaging for all the communication/data transfer requirements
  - Eliminates the overhead of transferring data via file systems
- No built-in filesystem
- Fault tolerance: work in progress…
Twister
[Ekanayake HPDC 2010]
- Termination condition evaluated by main():

while (!complete) {
    monitor = driver.runMapReduceBCast(cData);
    monitor.monitorTillCompletion();
    DoubleVectorData newCData = ((KMeansCombiner) driver
            .getCurrentCombiner()).getResults();
    totalError = getError(cData, newCData);
    cData = newCData;
    if (totalError < THRESHOLD) {
        complete = true;
        break;
    }
}

Bill Howe, UW
Spark
http://spark.apache.org
- MapReduce implements an acyclic data flow model
  - Not suitable for applications that re-use a working data set across multiple parallel operations
  - E.g., iterative machine learning, interactive data analysis
- Spark supports these applications while retaining scalability and fault tolerance
  - Resilient distributed data sets (RDDs) – read-only collections of objects, cached in memory
  - Fault tolerance through lineage – an RDD has enough information about how it was derived
- Huge speedups for iterative machine learning (10x)
- Support for interactive queries over large data (39 GB)
Spark
- Reduction output collected at the driver program
  - “…does not currently support a grouped reduce operation as in MapReduce” — all output is sent to the driver

val spark = new SparkContext(<Mesos master>)
var count = spark.accumulator(0)
for (i <- spark.parallelize(1 to 10000, 10)) {
  val x = Math.random * 2 - 1
  val y = Math.random * 2 - 1
  if (x*x + y*y < 1) count += 1
}
println("Pi is roughly " + 4 * count.value / 10000.0)
Pregel [Malewicz PODC 2009]
- API designed for graph processing
  - clustering: k-means, canopy, DBScan
- Computations are a sequence of iterations
  - In each iteration, a user-defined function is invoked for each vertex in parallel
- Assumes each vertex has access to its outgoing edges
- So an edge representation is simply: Edge(from, to)
- f(v, s) – behavior at vertex v in step s
  - Read messages from step s-1
  - Send messages for step s+1
  - Modify the state of v and its outgoing edges
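A toy, single-machine rendering of this vertex-centric model for single-source shortest paths (this is not the actual Pregel API — just an illustration of f(v, s) reading messages from step s−1 and sending messages for step s+1):

```python
INF = float('inf')

def superstep(states, edges, inbox):
    """One synchronous superstep: f(v, s) runs for every vertex, reading
    messages from step s-1 and sending messages for step s+1."""
    new_inbox = {v: [] for v in states}
    for v in states:
        improved = min([states[v]] + inbox[v])
        if improved < states[v]:            # state changed: notify neighbors
            states[v] = improved
            for u, w in edges[v]:           # vertex sees its outgoing edges
                new_inbox[u].append(improved + w)
    return new_inbox

edges = {'a': [('b', 1), ('c', 4)], 'b': [('c', 1)], 'c': []}
states = {'a': INF, 'b': INF, 'c': INF}
inbox = {'a': [0], 'b': [], 'c': []}        # seed the source with distance 0
while any(inbox.values()):                  # halt when no messages in flight
    inbox = superstep(states, edges, inbox)
print(states)  # {'a': 0, 'b': 1, 'c': 2}
```

Compared with the MapReduce BFS sketch earlier, the graph structure stays resident across supersteps and only messages move, which is the core efficiency argument for vertex-centric frameworks.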
Haloop
http://code.google.com/p/haloop/
- HaLoop is a modified version of the Hadoop MapReduce framework that supports iterative computations
  - Also improves efficiency!
- Issue with iterations in Hadoop:
  - Much of the data may be unchanged from iteration to iteration, but it must be re-loaded and re-processed at each iteration, wasting I/O, network bandwidth, and CPU resources
- Approach: inter-iteration caching

(Figure: the loop body runs mappers and reducers with four caches — Mapper input cache (MI), Mapper output cache (MO), Reducer input cache (RI), and Reducer output cache (RO).)

Bill Howe, UW
Haloop
- Benefits:
  - By caching loop-invariant data, HaLoop avoids re-loading and re-shuffling the unchanged parts of the graph across the network at every step
Questions?