Towards Modeling the Performance of a Fast Connected Components Algorithm on Parallel Machines
Steven Lumetta, Arvind Krishnamurthy, David Culler (UC Berkeley)
In Proceedings of IEEE/ACM Supercomputing 1995

Problem
- The connected components problem is important in the simulation of physical phenomena.
- Sequential solutions are well known (DFS, BFS).
- As usual, the computational scientists need, and are willing to pay for, more performance. Therefore we must parallelize!

Problem
- Theoretical work focuses on the PRAM model; for example, Shiloach-Vishkin (S-V) requires a CRCW PRAM.
- Inherent contention in the algorithm makes even EREW versions difficult.
- This is not very useful, because CRCW PRAM machines cannot be built to scale efficiently.
- Parallel solutions are difficult on practical machines: "Solutions typically emphasize performance over portability, scalability, and generality, but still rarely obtain good performance."

Solution
- Modify the S-V algorithm to work more like a data-parallel algorithm.
- Hybrid parallel algorithm: divide the work into local and global phases. The local phase runs a fast sequential version on every processor; the global phase combines the solutions.
- The algorithm can now run efficiently on a distributed-memory machine because it reflects the memory hierarchy.
- As usual, we want to minimize communication to maximize performance.

Solution
- Implemented in Split-C, a variant of C, to achieve a scalable, portable implementation for distributed-memory machines.
- Used probabilistic meshes to generate graphs for testing the algorithm.
- Demonstrated results on three different platforms.

Hybrid Algorithm Overview
- Partition the graph across the processors.
- Use DFS locally to find the components in each partition.
- Collapse each component into a single node, so the global graph is much simpler than the original.
- Use a variant of S-V globally to combine the results.

Step 1: Local Phase
- Perform a local DFS on each processor's portion of the graph (a sequential sketch of this step follows the step-by-step overview).
- Collapse each local connected component into a representative node.
- Mark each component with a unique value for the global phase.

Step 2: Global Initialization
- Modify the pointers of each remote edge to point at component representative nodes rather than at the original nodes.
- This completes collapsing the graph; we are now ready to start the global S-V loop.

Step 3: Global Iterations
- Termination check.
- Hooking: standard S-V hooking. Conditions on merging ensure that representative nodes remain unique.
- Star formation: collapse the nodes in a component so that the representative node is a single, consistent value.
- Self-loop removal: remove all edges that point to nodes within the same component.
- Edge-list concatenation: move all leaf-node edges to the root.
- Repeat.

Step 4: Local Cleanup
- Update each node with the value from its representative.
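The local phase (Step 1) maps directly onto ordinary sequential code. Below is a minimal C sketch of that step, assuming a CSR-style adjacency-list representation of one processor's partition; the type and function names (Graph, local_components) are illustrative assumptions, not the authors' Split-C code, and the global S-V loop with its remote pointer updates is omitted.

    /* Local phase sketch: iterative DFS that labels every node in the
     * partition with the id of its component representative. */
    #include <stdlib.h>

    /* CSR-style adjacency: neighbors of node v are adj[xadj[v] .. xadj[v+1]-1]. */
    typedef struct {
        int  n;       /* number of local nodes        */
        int *xadj;    /* n+1 offsets into adj         */
        int *adj;     /* concatenated neighbor lists  */
    } Graph;

    /* Returns the number of local components found; label[v] is set to the
     * DFS root of v's component, which plays the role of the representative
     * node that the global phase later operates on. */
    int local_components(const Graph *g, int *label)
    {
        int *stack = malloc(g->n * sizeof(int));
        int ncomp = 0;

        for (int v = 0; v < g->n; v++)
            label[v] = -1;                    /* -1 means "not yet visited" */

        for (int root = 0; root < g->n; root++) {
            if (label[root] != -1)
                continue;                     /* already in some component */
            int top = 0;
            stack[top++] = root;
            label[root] = root;               /* root becomes the representative */
            while (top > 0) {
                int v = stack[--top];
                for (int e = g->xadj[v]; e < g->xadj[v + 1]; e++) {
                    int w = g->adj[e];
                    if (label[w] == -1) {
                        label[w] = root;
                        stack[top++] = w;
                    }
                }
            }
            ncomp++;
        }
        free(stack);
        return ncomp;
    }

An explicit stack is used instead of recursion because a single component of a nearly solid mesh can contain most of the partition, which would overflow the call stack.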
Split-C Language
- Looks just like C, with some annotations and a data-parallel programming model.
- Mirrors the basic structure of distributed-memory machines.
- Allows the implementation to be portable across multiple architectures: it can run on the Cray T3D, IBM SP-1 and SP-2, Intel Paragon, Thinking Machines CM-5, Meiko CS-2, and networks of workstations.
- Simplifies the implementation because the protocol details are hidden.

Split-C Language
- Uses a Non-Uniform Memory Access (NUMA) model.
- Single address space: global pointers can point to memory on other processors.
- The programmer can distinguish between global and local pointers if needed.

Split-C Language
- Split-C is implemented as an extension of GCC 2.4.5.
- Local code is generated like normal C.
- Global accesses are optimized to take advantage of hardware-specific communication capabilities.

Parallel Platforms
- Used three large-scale parallel machines for the performance analysis: the Cray T3D, the Meiko CS-2, and the Thinking Machines CM-5.

Cray T3D
- DEC Alpha 21064: 64-bit, dual-issue RISC, 150 MHz, 8 kB split instruction and data caches.
- 3D torus topology (source: Wikipedia).
- A global read is mapped into a short instruction sequence that completes in about 1 microsecond.

Meiko CS-2
- 90 MHz dual-issue SPARC RISC processor with a large cache.
- Multi-stage packet-switched fat-tree topology.
- Each node runs Solaris.
- A dedicated "ELAN" communications co-processor handles remote requests to access memory.
- A remote read takes about 20 microseconds.

Thinking Machines CM-5
- Cypress SPARC clocked at 33 MHz, 64 kB unified cache.
- A remote read is mapped to a CMAML active message, which generates a reply carrying the value; a remote read takes about 12 microseconds.

Local Node Performance (chart slide)

Probabilistic Mesh
- Want to find connected components on a probabilistic mesh graph.
- The graphs come from the Swendsen-Wang cluster-dynamics algorithm: in a 2D or 3D mesh, each edge is present with probability p (a small generation sketch follows these measurement slides).
- Easy to partition: give each processor a sub-square or sub-cube of equal size.

Probabilistic Mesh Performance Results
- The T3D had the highest performance: the best processor and the best network.
- Compared the parallel results to a C90 implementation, which was the best single-processor implementation at the time.
- The metric is millions of nodes per second; its utility is explained in a few slides.

Performance Results (chart slides)

Graph Size and Dimension
- The number of connected components grows linearly with the number of nodes.
- The authors claim that the work required for a probabilistic mesh is linear in the size of the graph, though they were not very rigorous about it.
- Since work is proportional to graph size, the nodes-per-second metric is meaningful.
- The surface-to-volume ratio of each partition is the primary parameter determining communication needs in the global phase: a larger surface requires more remote reads, which are slow.

Graph Size and Dimension (chart slides)

Graphs Selected for Measurement (chart slide)

Edge Probability
- Two types of graphs: "liquid" (lots of small components) and "solid" (a few large components).
- The transition from liquid to solid is very fast.
- The transition occurs when each node has, on average, two edges, which allows long paths to form quickly.

Edge Probability (chart slides)

Sample Graph Instances
- Since the graphs are randomly generated, the authors run a number of experiments and average the results to get a fair measurement.
- For a given graph size and edge probability, the number of connected components followed a normal distribution, which allowed them to measure the average with a small number of samples.

Number of CC Distribution (chart slide)

Local Performance on CM-5 (chart slide)

Local Performance on All (chart slide)
- Note: this provides a baseline for parallel performance.
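As a concrete illustration of the input construction described in the Probabilistic Mesh slides, the following C sketch keeps each edge of an m x m grid independently with probability p (the bond construction underlying Swendsen-Wang cluster dynamics). The function name, the use of rand(), and the edge-list output format are assumptions made for the example, not the paper's test harness.

    /* Generate a 2-D probabilistic mesh: nodes are numbered row-major,
     * and every horizontal and vertical grid edge survives independently
     * with probability p.  Surviving edges are printed as "u v" pairs. */
    #include <stdio.h>
    #include <stdlib.h>

    static void generate_mesh(int m, double p, unsigned int seed)
    {
        srand(seed);
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < m; j++) {
                int u = i * m + j;
                /* edge to the right-hand neighbor */
                if (j + 1 < m && (double)rand() / RAND_MAX < p)
                    printf("%d %d\n", u, u + 1);
                /* edge to the neighbor in the row below */
                if (i + 1 < m && (double)rand() / RAND_MAX < p)
                    printf("%d %d\n", u, u + m);
            }
        }
    }

    int main(void)
    {
        /* In 2D each interior node has four potential edges, so p = 0.5
         * gives an average degree of two -- right at the liquid/solid
         * transition described above. */
        generate_mesh(8, 0.5, 12345);
        return 0;
    }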
Global Performance
- Liquid graphs scale at a constant fraction of the ideal speedup: little communication is required because they are strongly disconnected.
- Solid graphs see diminishing returns: bottlenecks are introduced by the processors that hold the nodes representing the big components, which makes the algorithm run in a more sequential manner.

3D20/30 (Liquid) on T3D (chart slide)

3D40/30 (Solid) on T3D (chart slide)

Improved Algorithm for Solid Graphs
- Reduces the load-imbalance problem.
- After the local DFS, assign the largest component on each processor a value that allows it to hook only to other large components (a sketch of one possible id encoding follows the Conclusions slides).
- This causes the graph to collapse faster.
- Load imbalance remains, but it is not nearly as bad.

Cray T3D Speedup (chart slide)

CM-5 Speedup (chart slide)

Speedup Analysis
- The T3D blew away the CM-5 because it has a better communication network.
- On the CM-5, a remote memory access took about 100 microseconds under contention, compared to 12 microseconds on an uncongested network.
- This is a standard result for local/global data-parallel algorithms.

Conclusions
- The hybrid algorithm allowed an efficient realization of the PRAM S-V algorithm on real machines.
- The T3D implementation outperforms all previously known implementations.
- The Split-C implementation was valuable because it was portable across machines while still delivering high performance.

Conclusions
- The liquid-solid transition was abrupt and extremely important to execution time: it made the difference between linear speedup and diminishing returns.
- Load imbalances can arise in non-obvious ways when porting PRAM algorithms to real machines.
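The id trick sketched in "Improved Algorithm for Solid Graphs" can be illustrated with a few lines of C. Everything below (the reserved id range, the names component_id and may_hook, and the strictly-greater hooking rule) is an assumption chosen to demonstrate the idea, not the paper's exact scheme.

    #include <limits.h>
    #include <stdio.h>

    /* Ordinary components keep ids below LARGE_BASE; the largest component
     * on processor p gets LARGE_BASE + p, which outranks every ordinary id. */
    #define LARGE_BASE (INT_MAX / 2)

    /* Choose the global id for a local component representative. */
    static int component_id(int local_rep, int is_largest, int proc)
    {
        return is_largest ? LARGE_BASE + proc : local_rep;
    }

    /* Hooking rule for the global phase: hook only toward a strictly larger
     * id, so a large component can merge only with another large component,
     * while small components are pulled into the large ones. */
    static int may_hook(int my_id, int neighbor_id)
    {
        return neighbor_id > my_id;
    }

    int main(void)
    {
        int small = component_id(42, 0, 3);    /* ordinary component  */
        int big0  = component_id(7, 1, 0);     /* largest on proc 0   */
        int big1  = component_id(9, 1, 1);     /* largest on proc 1   */

        printf("small -> big0: %d\n", may_hook(small, big0));  /* 1: allowed   */
        printf("big0 -> small: %d\n", may_hook(big0, small));  /* 0: forbidden */
        printf("big0 -> big1:  %d\n", may_hook(big0, big1));   /* 1: allowed   */
        return 0;
    }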