
Towards Modeling the
Performance of a Fast Connected
Components Algorithm on Parallel
Machines
Steven Lumetta
Arvind Krishnamurthy
David Culler
UC-Berkeley
In Proceedings of IEEE/ACM Supercomputing 1995
Problem


Connected components problem is important in
the simulation of physical phenomena
Sequential solutions are well known



DFS
BFS
As usual, the computational scientists need and
are willing to pay for more performance.

Therefore must parallelize!
Problem

Theoretical work focuses on PRAM




Example: Shiloach-Vishkin requires CRCW
Inherent contention in algorithm makes even EREW
versions difficult
This is not very useful because CRCW PRAM
machines cannot be built to scale efficiently
Parallel solutions are difficult on practical
machines

“Solutions typically emphasize performance over
portability, scalability, and generality, but still rarely
obtain good performance.”
Solution


Modify the S-V algorithm to work more like a data
parallel algorithm
Hybrid parallel algorithm





Divide work into local and global phases
Local phase runs fast sequential version on all processors
Global phase combines solution
Can now run efficiently on a distributed memory
machine because algorithm reflects the memory
hierarchy
As usual, want to minimize communication to
maximize performance
Solution



Implemented in Split-C, a variant of C, to
achieve a scalable and portable
implementation for distributed memory
machines
Used a probabilistic mesh to generate
graphs to test the algorithm
Demonstrated results on three different
platforms
Hybrid Algorithm Overview




Partition graph to each processor
Use DFS locally to find components in
each partition
Collapse each component into a single
node so global graph is much simpler than
original
Use variant of S-V globally to combine
results
Step 1: Local Phase



Perform local DFS on each processor’s
portion of the graph
Collapse each local connected component
into a representative node
Mark each component with a unique value
for the global phase
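Note: a minimal C sketch of this local phase is shown below. The CSR-style
adjacency layout and the field names are illustrative assumptions; the
paper's actual implementation is written in Split-C.

    /* Label each locally connected component with a representative id
     * using an iterative DFS.  The representative can later be made
     * globally unique by pairing it with the processor number. */
    #include <stdlib.h>

    typedef struct {
        int nverts;      /* vertices owned by this processor       */
        int *adj_start;  /* CSR offsets into adj_list (nverts + 1) */
        int *adj_list;   /* local neighbor indices                 */
        int *rep;        /* output: representative id per vertex   */
    } LocalGraph;

    void label_local_components(LocalGraph *g)
    {
        int *stack = malloc(g->nverts * sizeof(int));
        for (int v = 0; v < g->nverts; v++)
            g->rep[v] = -1;                      /* unvisited */

        for (int v = 0; v < g->nverts; v++) {
            if (g->rep[v] != -1) continue;       /* already labeled */
            int top = 0;
            stack[top++] = v;
            g->rep[v] = v;                       /* v becomes the representative */
            while (top > 0) {
                int u = stack[--top];
                for (int e = g->adj_start[u]; e < g->adj_start[u + 1]; e++) {
                    int w = g->adj_list[e];
                    if (g->rep[w] == -1) {       /* first visit: join v's component */
                        g->rep[w] = v;
                        stack[top++] = w;
                    }
                }
            }
        }
        free(stack);
    }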
Step 2: Global Init


Modify the pointers of each remote edge
to point at component representative
nodes rather than the original nodes
This completes the collapse of the graph, and
we are now ready to start the global S-V
loop
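Note: a plain-C sketch of this step is shown below. The RemoteEdge layout
and the get_remote_rep helper are assumptions that stand in for a Split-C
global-pointer read of the owning processor's rep[] array.

    typedef struct {
        int local_vert;    /* endpoint owned by this processor          */
        int owner_proc;    /* processor owning the other endpoint       */
        int remote_vert;   /* index of the other endpoint on that proc  */
    } RemoteEdge;

    /* Simulated table of every processor's rep[] array; in the real code
     * this lookup would be a remote read through a global pointer. */
    static int **all_reps;    /* all_reps[p][v] = representative of v on proc p */

    static int get_remote_rep(int owner_proc, int remote_vert)
    {
        return all_reps[owner_proc][remote_vert];
    }

    void redirect_remote_edges(RemoteEdge *edges, int nedges, const int *local_rep)
    {
        for (int i = 0; i < nedges; i++) {
            /* local endpoint collapses onto its local representative */
            edges[i].local_vert = local_rep[edges[i].local_vert];
            /* remote endpoint collapses onto its representative on the owner */
            edges[i].remote_vert = get_remote_rep(edges[i].owner_proc,
                                                  edges[i].remote_vert);
        }
    }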
Step 3: Global Iterations






Termination check
Hooking: Standard hooking for S-V. Conditions on
merging ensure that representative nodes remain
unique.
Star Formation: Collapse nodes in a component to
ensure that rep node is a single, consistent value.
Self-Loop Removal: Remove all edges that point to nodes
within the component
Edge-List Concatenation: Move all leaf node edges to the
root
Repeat
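Note: a serial C sketch of this loop's control structure (hooking plus
pointer jumping) is shown below. Self-loop removal and edge-list
concatenation are omitted, and the real global phase runs with parent[]
and the edge lists distributed across processors, so this is only a
shared-memory illustration.

    #include <stdbool.h>

    typedef struct { int u, v; } Edge;

    /* Compute components of the collapsed graph.  parent[] must start with
     * parent[i] = i for every representative node i. */
    void sv_components(int *parent, int nreps, const Edge *edges, int nedges)
    {
        bool changed = true;
        while (changed) {                            /* termination check */
            changed = false;

            /* Hooking: only a root may hook, and only onto a smaller id,
             * which keeps representatives unique and prevents cycles. */
            for (int i = 0; i < nedges; i++) {
                int pu = parent[edges[i].u];
                int pv = parent[edges[i].v];
                if (parent[pu] == pu && pv < pu) {        /* pu is a root */
                    parent[pu] = pv;
                    changed = true;
                } else if (parent[pv] == pv && pu < pv) { /* pv is a root */
                    parent[pv] = pu;
                    changed = true;
                }
            }

            /* Star formation: every node points at its grandparent until
             * each component is a flat star. */
            for (int v = 0; v < nreps; v++) {
                if (parent[v] != parent[parent[v]]) {
                    parent[v] = parent[parent[v]];
                    changed = true;
                }
            }
        }
    }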
Step 4: Local Cleanup

Update each node with the value from its
representative.
Split-C Language


Looks just like C with some annotations and a
data parallel programming model
Mirrors basic structure of distributed memory
machines



Allows implementation to be portable across multiple
architectures
Can run on Cray T3D, IBM SP-1 and SP-2, Intel
Paragon, Thinking Machines CM-5, Meiko CS-2, and
networks of workstations
Simplifies implementation because the protocol details
are hidden
Split-C Language

Uses a Non-Uniform Memory Access
(NUMA) model



Single address space
Global pointers can point to memory on other
processors
Can distinguish between global and local
pointers if needed
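Note: since Split-C syntax may be unfamiliar, below is a plain-C model of
the global-pointer idea. The struct, MY_PROC constant, and remote_read
helper are illustrative assumptions; Split-C expresses the same thing with
a pointer qualifier and compiler-generated communication.

    #include <stdio.h>

    /* A "global pointer" names memory anywhere in the machine: it pairs
     * the owning processor with an address in that processor's memory. */
    typedef struct {
        int  proc;
        int *addr;
    } GlobalPtr;

    enum { MY_PROC = 0 };                  /* this processor's id (illustrative) */

    /* Stand-in for the communication layer; on the machines in the paper
     * this round trip costs roughly 1-20 microseconds. */
    static int remote_read(int proc, int *addr)
    {
        (void)proc;                        /* single-process demo: just load it */
        return *addr;
    }

    static int deref(GlobalPtr g)
    {
        if (g.proc == MY_PROC)
            return *g.addr;                 /* local: an ordinary load         */
        return remote_read(g.proc, g.addr); /* remote: goes through the network */
    }

    int main(void)
    {
        int x = 42;
        GlobalPtr g = { MY_PROC, &x };     /* "global" pointer to local x */
        printf("%d\n", deref(g));
        return 0;
    }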
Split-C Language

Split-C is implemented as an extension of
GCC 2.4.5


Local code is generated like normal C
Global accesses are optimized to take
advantage of hardware specific
communication capabilities
Parallel Platforms

Used three large-scale parallel machines
for performance analysis



Cray T3D
Meiko CS-2
Thinking Machines CM-5
Cray T3D

DEC Alpha 21064





64-bit, dual issue RISC
150 MHz
8kB split instruction and
data cache
3D torus topology (according to Wikipedia)
A global read is mapped into a short instruction
sequence that runs in about 1 microsecond
Source: Wikipedia
Meiko CS-2






90 MHz dual-issue SPARC RISC Processor
Large cache
Multi-staged packet-switched fat tree topology
Each node runs Solaris
Dedicated “ELAN” communications co-processor
handles remote requests to access memory
Remote read takes about 20 microseconds
Thinking Machines CM-5




Cypress Sparc clocked at 33 MHz
64 kB unified cache
Remote read mapped to a CMAML active
message which generates a reply with the
value
12 microseconds for remote read
Local Node Performance
Probabilistic Mesh




Want to find connected components on a
probabilistic mesh graph
Comes from Swendsen-Wang cluster
dynamics algorithm
In a 2D or 3D mesh, each edge is present
with probability p
Easy to partition: just give each processor
a sub-square or sub-cube of equal size
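Note: a C sketch of how one processor's block of a 2D probabilistic mesh
could be generated is shown below. The function name, RNG, and edge layout
are illustrative choices, not the paper's code.

    #include <stdlib.h>

    typedef struct { int u, v; } Edge;

    /* Keep each candidate mesh edge independently with probability p. */
    static int keep_edge(double p)
    {
        return rand() < p * ((double)RAND_MAX + 1.0);
    }

    /* Generate the kept edges inside an s-by-s block of an n-by-n mesh whose
     * upper-left cell is (row0, col0).  Vertex ids are global row-major
     * indices; edges whose other endpoint lies outside the block become the
     * remote edges handled in the global phase. */
    int build_block_edges(int n, int s, int row0, int col0, double p,
                          Edge *out /* capacity >= 2 * s * s */)
    {
        int count = 0;
        for (int i = 0; i < s; i++) {
            for (int j = 0; j < s; j++) {
                int u = (row0 + i) * n + (col0 + j);
                if (col0 + j + 1 < n && keep_edge(p))   /* edge to right neighbor */
                    out[count++] = (Edge){ u, u + 1 };
                if (row0 + i + 1 < n && keep_edge(p))   /* edge to neighbor below  */
                    out[count++] = (Edge){ u, u + n };
            }
        }
        return count;
    }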
Probabilistic Mesh
Performance Results

T3D had the highest performance




Best processor
Best network
Compared parallel results to C90
implementation, which was the best single
processor implementation at the time
Metric is millions of nodes per second

Utility of this is explained in a few slides
Performance Results
Performance Results
Graph Size and Dimension


Number of connected components grows linearly
with respect to number of nodes
Claim that work required for probabilistic mesh
is linear in size of the graph



They were not very rigorous
Since work is proportional to graph size, nodes per
second metric is meaningful
Surface-to-volume ratio of each partition is the
primary parameter for determining
communication needs in the global phase

A higher ratio requires more remote reads, which are slow
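Note: a rough numeric illustration of the surface-to-volume effect, with an
arbitrarily chosen partition side length:

    #include <stdio.h>

    /* For a sub-square (2D) or sub-cube (3D) of side s, edges that cross the
     * partition boundary (and need remote reads) grow like s^(d-1), while
     * local work grows like s^d, so higher-dimensional partitions of the same
     * size communicate relatively more. */
    int main(void)
    {
        int s = 100;                                 /* nodes per side of one partition */
        double cross2d = 4.0 * s,     local2d = (double)s * s;
        double cross3d = 6.0 * s * s, local3d = (double)s * s * s;
        printf("2D: boundary edges / nodes ~ %.3f\n", cross2d / local2d);
        printf("3D: boundary edges / nodes ~ %.3f\n", cross3d / local3d);
        return 0;
    }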
Graph Size and Dimension
Graph Size and Dimension
Graphs Selected for Measurement
Edge Probability

Two types of graphs




Liquid – lots of small components
Solid – a few large components
Very fast transition from liquid to solid
Transition occurs when each node has on
average two edges, which allows long
paths to form quickly
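Note: a small arithmetic check of that claim, assuming an interior node of a
d-dimensional mesh has 2d candidate edges:

    #include <stdio.h>

    int main(void)
    {
        for (int d = 2; d <= 3; d++) {
            /* expected degree is 2*d*p, so average degree 2 occurs at p = 1/d */
            double p_transition = 2.0 / (2.0 * d);
            printf("%dD mesh: average degree 2 at p = %.2f\n", d, p_transition);
        }
        return 0;
    }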
Edge Probability
Edge Probability
Sample Graph Instances


Since the graphs are randomly generated, to get fair
results they run a number of experiments and
average the measurements
For a given graph size and edge probability, the
number of connected components followed a
normal distribution, which allowed them to
measure the average with a small number of
samples
Number of CC Distribution
Local Performance on CM-5
Local Performance on All
Note: This provides a baseline for parallel performance
Global Performance

Liquid graphs scale at a constant fraction
of the ideal speedup


Little communication required because the graph is
highly disconnected, so most components stay within
a single processor's partition
Solid graphs get diminishing returns


Bottlenecks are introduced by processors that
hold nodes representing the big components
This makes the algorithm run in a more
sequential manner
3D20/30 (Liquid) on T3D
3D40/30 (Solid) on T3D
Improved Algorithm for
Solid Graphs




Reduce load imbalance
problem
After DFS, assign the largest
component on each
processor a value that
allows it to connect only
to other large components
(see the sketch below)
Causes the graph to
collapse faster
Still has load imbalance,
but not nearly as bad
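Note: one possible realization of this tweak is sketched below; the flag
array and the use of rep[v] == v to identify representatives are assumptions
carried over from the earlier local-phase sketch, not necessarily the
paper's exact encoding.

    /* After the local DFS, mark every vertex of this processor's largest
     * component.  During the first global iterations, hooking could then be
     * restricted to edges whose endpoints are both marked, so the big
     * components merge with each other before the rest of the graph joins. */
    void mark_largest_component(const int *rep, const int *size, int nverts,
                                unsigned char *is_large /* out, per vertex */)
    {
        int best = -1;                      /* representative of largest component */
        for (int v = 0; v < nverts; v++)
            if (rep[v] == v && (best < 0 || size[v] > size[best]))
                best = v;                   /* size[v] = component size, valid at reps */

        for (int v = 0; v < nverts; v++)
            is_large[v] = (unsigned char)(best >= 0 && rep[v] == best);
    }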
Cray T3D Speedup
CM-5 Speedup
Speedup Analysis



T3D blew away the CM-5 because it has a
better communication network
On the CM-5, a remote memory access was
taking about 100 microseconds with contention,
compared to 12 microseconds on an
uncongested network
This is a standard result for local/global
data parallel algorithms
Conclusions

Hybrid Algorithm allowed for efficient
realization of PRAM S-V algorithm on real
machines


T3D implementation outperforms all
previously known implementations.
The Split-C implementation was useful because
it was portable across machines while still
delivering high performance
Conclusions

Liquid-solid transition was abrupt and
extremely important to execution time


Made the difference between linear speedup
and diminishing returns
Load imbalances can arise in non-obvious
ways when porting PRAM algorithms to
real machines