Investigating the Root Causes of I/O Interference
in HPC Storage Systems

Orçun Yildiz, Matthieu Dorier, Shadi Ibrahim, Rob Ross, Gabriel Antoniu
KerData Team, Rennes Bretagne Atlantique Research Center
(to appear in IPDPS’16)

Grid’5000 Winter School
Grenoble, February 2016
Motivation
• “A supercomputer is a device for turning compute-bound
problems into I/O-bound problems” — Ken Batcher
• Performance variability: still a major challenge on the road to Exascale
• Many concurrent applications → I/O interference
• The interference and its impact are difficult to predict
Objectives:
• Exploring the root causes of I/O interference
• Enabling a better understanding of the I/O interference phenomenon
• Helping researchers tackle the interference problem along the I/O path
What is Interference?
• The performance degradation observed by an application when it contends with other applications for access to a shared resource.
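This degradation can be quantified as a simple slowdown factor — a hypothetical helper for illustration, not part of the paper's formalism — the ratio of I/O time under contention to I/O time when the application runs alone:

```python
def slowdown_factor(t_contended: float, t_alone: float) -> float:
    """Ratio of I/O time under contention to I/O time in isolation.

    1.0 means no interference; 2.0 means the shared resource
    doubled the application's I/O time.
    """
    if t_alone <= 0:
        raise ValueError("t_alone must be positive")
    return t_contended / t_alone

# Example: an app writing in 40 s alone takes 70 s under contention
print(slowdown_factor(70.0, 40.0))  # 1.75
```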
Potential sources of interference
Micro-benchmark
• IOR-like micro-benchmark:
- Split MPI_COMM_WORLD into two groups
- Each group issues collective I/O operations with the
following patterns:
• Contiguous — each process issues one 64 MB write request
• Strided — non-contiguous case; each process issues
256 requests of 256 KB
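The two patterns can be sketched as file-offset generators — a minimal illustration with names of our own choosing, not the benchmark's actual code. Each process writes 64 MB in both cases: as a single contiguous request, or as 256 interleaved 256 KB requests.

```python
# Hypothetical sketch of the benchmark's two access patterns.
MB = 1024 * 1024
KB = 1024

def contiguous_offsets(rank: int, block: int = 64 * MB):
    """One (offset, size) request per process, blocks back to back."""
    return [(rank * block, block)]

def strided_offsets(rank: int, nprocs: int,
                    nreq: int = 256, req: int = 256 * KB):
    """nreq interleaved requests: round i places process `rank`
    at stripe i, slot `rank` (round-robin across processes)."""
    return [((i * nprocs + rank) * req, req) for i in range(nreq)]

# Both patterns move the same volume per process: 256 x 256 KB = 64 MB
total = sum(size for _, size in strided_offsets(0, nprocs=4))
print(total == 64 * MB)  # True
```

The difference the paper exercises is thus purely in how that 64 MB reaches the storage servers: one large sequential request versus many small scattered ones.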
Reporting method: Introducing the “Δ-graphs”
[Figure: App A’s and App B’s accesses shifted by a delay dt on a time axis; the curve shows the performance degradation due to interference relative to the I/O time when the application runs alone.]
The graph represents the point of view of one of the 2 applications.
Credit — CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination, Matthieu Dorier et al., IPDPS’14
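The shape of an ideal Δ-graph can be reproduced with a toy fair-sharing model — our illustration, not the paper's: two identical applications each write the same volume, and share the bandwidth equally while both are active. dt is the start delay of the other application relative to ours.

```python
# Toy model of a symmetric Δ-graph under perfectly fair sharing.
def delta_graph_time(dt: float, solo_time: float) -> float:
    """Our app's write time as a function of the start delay dt.

    Worst case at dt = 0 (full overlap, 2x slowdown); no
    interference once |dt| >= solo_time (no overlap at all).
    Derivation: the app runs alone for |dt| seconds of its work,
    and at half bandwidth for the rest.
    """
    if abs(dt) >= solo_time:
        return solo_time
    return 2 * solo_time - abs(dt)

T = 40.0  # solo write time in seconds (example value)
print(delta_graph_time(0.0, T))   # 80.0  (2x degradation)
print(delta_graph_time(20.0, T))  # 60.0
print(delta_graph_time(50.0, T))  # 40.0  (no overlap)
```

Real Δ-graphs in the following slides deviate from this symmetric triangle — and those deviations are precisely what points at the root causes.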
Experimental Setup
• Rennes site: Parasilo and Paravance clusters
- Two 8-core Intel Xeon 2.4 GHz CPUs per node
- 128 GB RAM
- 10G Ethernet network
• PVFS → 12 nodes of parasilo
• Benchmark → 60 nodes of paravance (960 cores)
Why Grid’5000?
• Mira, Blue Waters machines:
- Partial allocation
- No root access
• Grid’5000:
- Full control
- Root access with customized images
- Fast support
Emulating HPC on Grid’5000
[Diagram: the Paravance cluster runs the MPI applications, with intra-cluster communication using TCP over Ethernet; the Parasilo cluster hosts PVFS on its local disks; the two are connected by 10G Ethernet.]
Influence of the backend storage device
• Time taken by an application running on 1 core to write 2 GB locally and contiguously:
Hard disks have the lowest performance and lead to the highest slowdown
Interplay between access pattern and storage device
[Figures: write time vs. dt for the contiguous pattern (all devices), the strided pattern on disk, and the strided pattern on SSD and RAM, comparing Disk, SSD, and RAM backends]
The access pattern can impact the interference behavior
Influence of the network interface
[Figure: write time vs. dt with 1 client per node vs. 16 clients per node]
Using fewer writers per multicore node lowers the interference, in addition to improving performance
The role of the network in interference
[Figures: write time vs. dt with 1G vs. 10G Ethernet, with sync ON (left) and sync OFF (right)]
A slower network may not cause higher interference!
Impact of using disjoint storage servers
[Figures: write time vs. dt with 12 shared PVFS servers vs. 6+6 disjoint PVFS servers, with sync ON (left) and sync OFF (right)]
Targeting separate servers can eliminate the interference
Impact of the request size
[Figures: write time vs. dt for request sizes of 64 KB, 128 KB, 256 KB, and 512 KB, with sync ON (left) and sync OFF (right)]
Interference-free behavior does not necessarily mean optimal performance!
Asymmetric Δ-graphs: Revisiting the Incast* problem
Representative clients for each application and their observed TCP window sizes:
[Figures: TCP window size (×2048 bytes) and progress (%) over time for App A and App B]
*More about Incast: A. Phanishayee, E. Krevat, V. Vasudevan, D. Andersen, G. Ganger, G. Gibson, and S. Seshan, “Measurement and analysis of TCP throughput collapse in cluster-based storage systems,” Carnegie Mellon University, Tech. Rep., 2007.
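The Incast effect itself can be caricatured with a toy many-to-one model — a drastic simplification of the cited CMU report, with parameter values of our own choosing: when synchronized senders overflow the switch buffer in front of the storage server, losses trigger retransmission timeouts and goodput collapses as the number of senders grows.

```python
# Toy Incast model: N senders burst packets through a switch whose
# output buffer holds `buffer_pkts` packets; an overflow costs one
# retransmission timeout (RTO). All values are illustrative.
def goodput(n_senders: int, pkts_per_sender: int = 8,
            buffer_pkts: int = 64, link_rate: float = 1250.0,
            rto: float = 0.2) -> float:
    """Delivered packets per second for one synchronized burst."""
    offered = n_senders * pkts_per_sender
    dropped = max(0, offered - buffer_pkts)
    # transmission time for all packets, plus one RTO if any loss occurred
    time = offered / link_rate + (rto if dropped > 0 else 0.0)
    return offered / time

print(goodput(4) > goodput(64))  # True: more senders, lower goodput
```

This is the mechanism behind the asymmetry above: whichever application still has many active clients when the other finishes suddenly faces the server alone, and its TCP windows recover.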
Counter-intuitive results from an Incast perspective
[Figure: write time vs. dt with 1G vs. 10G Ethernet]
Using a low-bandwidth network mitigates the Incast problem by constraining the request rate
[Figure: write time vs. dt with 12 PVFS servers vs. 6+6 PVFS servers]
Splitting the servers into two groups → each server has to interact with half as many clients
Well, when does Incast start to happen?
[Figure: write time vs. dt for 960 (default), 704, 512, 352, 256, and 128 clients]
The interference trend becomes symmetrical: Incast → regular contention
Conclusion
• Interference hurts system performance
• The causes of interference can be very diverse
• Interference results from the interplay between several components along the I/O path
- The impact of the request size on interference depends on the configuration of the components
- Using a low-bandwidth network may eliminate the interference
Thank You!
Orçun YILDIZ
PhD Student
KerData Team
[email protected]
Grid5K tagged in HAL