Investigating the Root Causes of I/O Interference in HPC Storage Systems
Orçun Yildiz, Matthieu Dorier, Shadi Ibrahim, Rob Ross, Gabriel Antoniu
KerData Team, Rennes Bretagne Atlantique Research Center
(to appear in IPDPS'16)
Grid'5000 Winter School, Grenoble, February 2016

Motivation
• "A supercomputer is a device for turning compute-bound problems into I/O-bound problems" — Ken Batcher
• Performance variability: still a major challenge on the road to Exascale
• Many concurrent applications -> I/O interference
• Difficult to predict the interference and its impact

Objectives
• Explore the root causes of I/O interference
• Enable a better understanding of the I/O interference phenomenon
• Help researchers tackle the interference problem along the I/O path

What is Interference?
• Performance degradation observed by an application in contention with other applications for access to a shared resource.

Potential sources of interference
[Figure: shared components along the I/O path]

Micro-benchmark
• IOR-like micro-benchmark:
  - Split MPI_COMM_WORLD into two groups
  - Each group issues a collective I/O operation with one of the following patterns:
    • Contiguous — each process issues a single 64 MB write request
    • Strided — non-contiguous case; each process issues 256 requests of 256 KB
  (Minimal C/MPI sketches of the contiguous and strided variants are given further below.)

Reporting method: Introducing the "Δ-graphs"
• The two applications start their accesses with a time offset dt; each Δ-graph plots the write time of one application as a function of dt.
• The gap between the measured write time and the I/O time when the application runs alone shows the performance degradation due to interference.
• The graph represents the point of view of one of the two applications.
• Credit — CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination, by Matthieu Dorier, IPDPS'14.

Experimental Setup
• Rennes site, Parasilo and Paravance clusters
  - Two 8-core Intel Xeon 2.4 GHz CPUs, 128 GB RAM and 10G Ethernet per node
  - PVFS -> 12 nodes of Parasilo
  - Benchmark -> 60 nodes of Paravance (960 cores)

Why Grid'5000?
• Mira, Blue Waters machines:
  - Partial allocation
  - No root access
• Grid'5000:
  - Full control
  - Root access with customized images
  - Fast support

Emulating HPC on Grid'5000
[Diagram: Paravance nodes running the MPI applications (intra-cluster communication) connected over 10G Ethernet, using TCP, to PVFS on the local disks of the Parasilo nodes]

Influence of the backend storage device
• Time taken by an application running on 1 core to write 2 GB locally and contiguously, per storage backend:
[Figure: local write time for the different storage backends]
• Hard disks have the lowest performance and lead to the highest slowdown.

Interplay between access pattern and storage device
[Δ-graphs: Write Time (s) vs. dt (s) for Disk, SSD and RAM — panels: Contiguous (all devices), Strided (disk), Strided (SSD and RAM)]
• The access pattern can change the interference behavior.

Influence of the network interface
[Δ-graph: Write Time (s) vs. dt (s), 1 client per node vs. 16 clients per node]
• Using fewer writers per multicore node lowers the interference, in addition to improving performance.

The role of the network on interference
[Δ-graphs: Write Time (s) vs. dt (s), 1G vs. 10G Ethernet — panels: sync ON, sync OFF]
• A slower network may not cause higher interference!
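A minimal code sketch (C with MPI-IO) of the contiguous variant of the micro-benchmark and of the dt offset used in the Δ-graphs. This is not the authors' implementation: the /pvfs file paths, the command-line dt argument and the choice to delay only the second group (negative dt in the Δ-graphs corresponds to delaying the other group instead) are assumptions made for illustration.

/* Minimal sketch (not the authors' code): MPI_COMM_WORLD is split into two
 * halves emulating applications A and B; each half performs one collective,
 * contiguous 64 MB write per process to its own shared file, with group B
 * starting dt seconds later, as in the Δ-graph methodology. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REQ_SIZE (64UL * 1024 * 1024)   /* one contiguous 64 MB request per process */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* Split MPI_COMM_WORLD into two groups playing the roles of App A and App B. */
    int app = (world_rank < world_size / 2) ? 0 : 1;
    MPI_Comm app_comm;
    MPI_Comm_split(MPI_COMM_WORLD, app, world_rank, &app_comm);
    int app_rank;
    MPI_Comm_rank(app_comm, &app_rank);

    char *buf = malloc(REQ_SIZE);
    memset(buf, 'x', REQ_SIZE);

    /* dt (seconds) is taken from the command line; App B starts dt later. */
    double dt = (argc > 1) ? atof(argv[1]) : 0.0;
    MPI_Barrier(MPI_COMM_WORLD);
    if (app == 1 && dt > 0.0) {
        double start = MPI_Wtime();
        while (MPI_Wtime() - start < dt)
            ;                            /* simple wait to apply the offset */
    }

    /* Each application writes to its own file on the parallel file system
     * (the path is an assumption); one contiguous block per process. */
    char path[256];
    snprintf(path, sizeof(path), "/pvfs/app%d.dat", app);

    MPI_File fh;
    MPI_File_open(app_comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);

    double t0 = MPI_Wtime();
    MPI_Offset offset = (MPI_Offset)app_rank * REQ_SIZE;
    MPI_File_write_at_all(fh, offset, buf, (int)REQ_SIZE, MPI_BYTE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);                 /* close pushes the data out of client caches */
    double t1 = MPI_Wtime();

    if (app_rank == 0)
        printf("app %c: write time %.2f s (dt = %.1f s)\n", 'A' + app, t1 - t0, dt);

    free(buf);
    MPI_Comm_free(&app_comm);
    MPI_Finalize();
    return 0;
}

Running this sketch repeatedly while sweeping dt, and plotting the reported write time of one group against dt, yields a Δ-graph of the kind shown in the experiment slides.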
Impact of using disjoint storage servers
[Δ-graphs: Write Time (s) vs. dt (s), 12 shared PVFS servers vs. 6+6 disjoint PVFS servers — panels: sync ON, sync OFF]
• Targeting separate servers can eliminate the interference.

Impact of the request size
[Δ-graphs: Write Time (s) vs. dt (s) for 64 KB, 128 KB, 256 KB and 512 KB requests — panels: sync ON, sync OFF]
• Interference-free behavior does not necessarily mean optimal performance!

Asymmetric Δ-graphs: Revisiting the Incast* problem
• Representative clients of each application and their observed TCP window sizes:
[Plots: TCP window size (×2048 bytes) and progress (%) vs. time (s) for a client of App A and a client of App B]
*More about Incast: A. Phanishayee, E. Krevat, V. Vasudevan, D. Andersen, G. Ganger, G. Gibson, and S. Seshan, "Measurement and analysis of TCP throughput collapse in cluster-based storage systems," Carnegie Mellon University, Tech. Rep., 2007.

Counter-intuitive results from an Incast perspective
[Δ-graph: 1G vs. 10G Ethernet]
• Using a low-bandwidth network mitigates the Incast problem by constraining the request rate.
[Δ-graph: 12 shared PVFS servers vs. 6+6 disjoint PVFS servers]
• Splitting the servers into two groups -> each server interacts with half as many clients.

Well, when does the Incast start to happen?
[Δ-graph: Write Time (s) vs. dt (s) for 960 (default), 704, 512, 352, 256 and 128 clients]
• As the number of clients decreases, the interference trend becomes symmetrical: Incast -> regular contention.

Conclusion
• Interference hurts system performance.
• The causes of interference can be very diverse.
• Interference results from the interplay between several components along the I/O path:
  - The impact of the request size on interference depends on the configuration of the other components.
  - Using a low-bandwidth network may eliminate the interference.

Thank You!
Orçun Yildiz, PhD Student, KerData Team
[email protected]
Grid'5000 tagged in HAL
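For completeness, a companion sketch of the strided (non-contiguous) pattern, with the request size left as a parameter as in the request-size experiment. The round-robin interleaving of the per-process requests in a shared file and the function name strided_write are assumptions; the slides only state that each process issues 256 requests of 256 KB.

/* Minimal sketch (assumption, not the authors' code) of the strided pattern:
 * each process issues nreq non-contiguous requests of req_size bytes to a
 * shared file, interleaved round-robin across the processes of its group,
 * expressed as an MPI file view built from MPI_Type_vector. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Collective strided write on communicator 'comm' to file 'path'.
 * Defaults in the slides: nreq = 256 requests of req_size = 256 KB. */
double strided_write(MPI_Comm comm, const char *path, int nreq, int req_size)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    size_t total = (size_t)nreq * req_size;
    char *buf = malloc(total);
    memset(buf, 'x', total);

    /* File view: nreq blocks of req_size bytes, separated by a stride of
     * nprocs * req_size bytes, starting at this process' slot in the round. */
    MPI_Datatype filetype;
    MPI_Type_vector(nreq, req_size, nprocs * req_size, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    MPI_File fh;
    MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, (MPI_Offset)rank * req_size, MPI_BYTE, filetype,
                      "native", MPI_INFO_NULL);

    double t0 = MPI_Wtime();
    MPI_File_write_all(fh, buf, (int)total, MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    double t1 = MPI_Wtime();

    MPI_Type_free(&filetype);
    free(buf);
    return t1 - t0;   /* write time contributed to the Δ-graph */
}

Calling this function with req_size set to 64 KB, 128 KB, 256 KB or 512 KB, inside the same dt-offset harness sketched earlier, would reproduce the style of the request-size experiment.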