New Experimental Results in Communication-Aware Processor Allocation for Supercomputers

Michael Bender, SUNY Stony Brook
David Bunde, Knox College
Vitus Leung, Sandia National Laboratories
Kevin Pedretti, Sandia National Laboratories
Cynthia Phillips, Sandia National Laboratories

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

Computational Plant (Cplant)
• Commodity-based supercomputers at Sandia National Laboratories (off-the-shelf components)
• Up to 2048 processors
• Production computing environment
• Our job: improve parallel node allocation on Cplant to optimize performance

The Cplant System
• DEC Alpha processors
• Myrinet interconnect (Sandia-modified)
• MPI
• Different sizes/topologies: usually a 2D or 3D grid with toroidal wraps
  – Ross = 2048-processor 3D mesh
  – Zermatt = 128-processor 2D mesh
  – Alaska = ~600 processors, heavily augmented 2D mesh (cannibalized)
• Modified Linux OS (now public domain)
• Four processors per switch (compute, I/O, and service nodes)

Scheduling Environment
• Users submit jobs to a queue (online)
• Users specify the number of processors and a runtime estimate
  – If a job runs past this estimate by 5 minutes, it is killed
• No preemption, no migration, no multitasking (security)
• Actual runtime depends on the set of processors allocated and on the placement of other jobs
• Goals:
  – User: minimum response time
  – Bureaucracy (GAO): high utilization

Scheduler/Allocator Association
• Scheduler and allocator affect each other's performance
[Figure: performance dependencies between scheduler and allocator]

Scheduler/Allocator Dissociation
• Job: user executable, # processors, requested time, ...
• Flow: PBS scheduler → node allocator → Cplant queue → job
• Scheduler enforces policy
  – Management sets priorities for access and a utilization policy
• Allocator can optimize performance

What's a Good Allocation?
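The scheduler/allocator split above can be illustrated with a toy sketch. Everything here (the class names, the FIFO policy, the naive allocator) is hypothetical and is not the PBS or Cplant interface: the point is only that the scheduler decides *when* a job starts, while a pluggable allocator decides *which* free nodes it gets.

```python
from collections import deque

class FifoScheduler:
    """Toy scheduler: enforces queue policy, delegates node choice."""

    def __init__(self, allocator, total_nodes):
        self.queue = deque()          # jobs wait in submission order (no preemption)
        self.allocator = allocator    # pluggable: picks *which* nodes a job gets
        self.free = set(range(total_nodes))

    def submit(self, job_id, num_procs):
        self.queue.append((job_id, num_procs))

    def dispatch(self):
        """Start queued jobs, in order, while the allocator can place them."""
        started = []
        while self.queue:
            job_id, k = self.queue[0]
            nodes = self.allocator(self.free, k)
            if nodes is None:
                break                 # head job must wait (no backfilling here)
            self.queue.popleft()
            self.free -= nodes
            started.append((job_id, sorted(nodes)))
        return started

def first_free_allocator(free, k):
    """Naive 1D allocator: lowest-numbered free nodes, ignoring locality."""
    if len(free) < k:
        return None
    return set(sorted(free)[:k])

sched = FifoScheduler(first_free_allocator, total_nodes=8)
sched.submit("job-a", 3)
sched.submit("job-b", 4)
print(sched.dispatch())  # [('job-a', [0, 1, 2]), ('job-b', [3, 4, 5, 6])]
```

Swapping in a locality-aware allocator changes placement without touching the scheduling policy, which is exactly the dissociation the deck argues for.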
[Figure: good vs. bad allocation of a job on a 2D mesh]
• Objective: allocate jobs to processors to minimize network contention by keeping each job's processors close together
• Especially important for commodity networks

Quantitative Effect of Processor Locality
[Figure: runtimes of two allocations; the compact allocation runs faster, with one speed-up anomaly (blank square = empty processor)]

Communication Hops on a 2D Grid
• L1 distance = number of hops (≈ number of switches) between two processors on the grid
[Figure: example hop counts on a 2D grid]

Allocation Problem
• Given n available points on a grid (some points unavailable)
• Find a set of k available points with minimum average (or total) pairwise L1 distance
• Example: green allocation has total distance 3(2) + 3(1) = 9

Empirical Correlation
• Leung et al., 2002
• Related support: Mache and Lo, 1996

Previous Work
• Various work forcing a convex set
  – Insufficient processor utilization
• Mache, Lo, Windisch: MC algorithm
• Krumke et al.: 2-approximation; NP-hard with a general metric
• Complexity open for grids
• Dispersion problem (max distance): linear time for fixed k (Fekete and Meijer)

Optimal Unconstrained Shape (Bender, Bender, Demaine, Fekete 2004)
• Almost a circle, but not quite: only 0.05 percent difference in area
[Figure: optimal unconstrained shape (0.650245952951)]

Previous Results (Bender et al. 2005)
• 7/4-approximation (2 − 1/(2d) in d dimensions)
• PTAS: (1+ε)-approximation in polynomial time for fixed ε
• MC is a 4-approximation
• 1D: exact dynamic program in linear time
• O(n log n) time for k = 3
• Simulations (performance on job streams)

Experiments: Placement Algorithm MC
• Search in shells outward from a minimum-size region of the preferred shape
• Weight processors by shell
• Return the processor set with minimum total weight
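The allocation problem above is easy to state in code. The sketch below computes total pairwise L1 distance and finds an optimal k-subset by brute force; the exhaustive search is exponential and purely illustrative (the deck's real algorithms are the MC heuristic and the approximation schemes).

```python
from itertools import combinations

def l1(p, q):
    """L1 (Manhattan) distance ~ number of hops between two grid points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_pairwise_distance(points):
    """Total L1 distance over all processor pairs in an allocation."""
    return sum(l1(p, q) for p, q in combinations(points, 2))

def best_allocation(free_points, k):
    """Brute-force optimum over all k-subsets (exponential; illustration only)."""
    return min(combinations(free_points, k), key=total_pairwise_distance)

# A compact 2x2 block beats a spread-out allocation of the same size:
compact = [(0, 0), (0, 1), (1, 0), (1, 1)]
spread  = [(0, 0), (0, 3), (3, 0), (3, 3)]
print(total_pairwise_distance(compact))  # 8
print(total_pairwise_distance(spread))   # 24
```

Minimizing the total and minimizing the average are equivalent for fixed k, since the average just divides the total by the number of pairs, k(k−1)/2.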
Alternative: One-Dimensional Reduction
• Order processors so that processors close in linear order are close in the physical processor graph
• Then do one-dimensional processor allocation
  – Bin packing (best fit, first fit, sum of squares)
  – Pack jobs onto the line (or ring), allowing fragmentation

New System: Red Storm
• 12,960 dual-core 2.4 GHz AMD Opterons
• 39.19 TB memory, 340 TB disk
• 124 TF peak performance
• 3D mesh

Impact
• Changed the node allocator on Cplant
  – 1D default allocator
  – 2D algorithms implemented
• Carried over to Red Storm system software
  – 1D and 2D algorithms implemented
  – Selectable at compilation
• R&D 100 winner (Leung, Bender, Bunde, Pedretti, Phillips 2006)

Red Storm Development Machine
[Figure: one Cray XT3/4 cabinet, with I/O and compute nodes marked]

Does Bandwidth Make a Difference?

                      Real time (s)   User time (s)   Sys time (s)
  1/4 link bandwidth  15623.353       1012.302        50.298
  Full bandwidth       6314.818       1010.752        50.003

• Yes!

Red Storm Development Machine
[Figure: YZ S-curve ordering of compute nodes]

Red Storm Development Machine
[Figure: ZY S-curve ordering of compute nodes]

Hilbert (Space-Filling) Curves
• For 2D and 3D grids
• Previous applications:
  – I/O-efficient and cache-oblivious computation
  – Compression (images)
  – Domain decomposition

Red Storm Development Machine
[Figure: Zoltan Hilbert space-filling curve ordering of compute nodes]

Red Storm Development Machine
[Figure: spliced Hilbert space-filling curve ordering of compute nodes]

Results (Makespan in Seconds)

            YZ       ZY       random   Zoltan   spliced
  MC1x1     5807.1 (curve-independent)
  SS        5830.6   7003.2   6610.1   6699.6   6021.1
  FF        5868.6   7039.5   6639.6   6758.7   6052.3
  BF        5826.2   7022.6   6631.9   6739.1   6023.4
  simple    6102.4 (curve-independent)

• Consistent with simulations (Bender et al. 2005)

Results (Makespan Normalized, relative to MC1x1)

            YZ       ZY       random   Zoltan   spliced
  MC1x1     1
  SS        1.0040   1.206    1.1383   1.1537   1.0369
  FF        1.0106   1.2122   1.1434   1.1639   1.0422
  BF        1.0033   1.2093   1.1420   1.1605   1.0372
  simple    1.0509

Red Storm Development Machine
[Figure: I/O and compute nodes]
• Is it I/O or interprocess communication?
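The 1D reduction needs an ordering in which nearby curve positions are nearby on the mesh; a Hilbert curve is one such ordering. Below is the classic bit-manipulation formulation of the 2D Hilbert index for an n×n grid (n a power of two), used to order a small grid. This is a generic sketch, not the Zoltan implementation mentioned above.

```python
def xy2d(n, x, y):
    """Hilbert-curve index of grid point (x, y) on an n x n grid (n a power of 2)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so recursion sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

# Order a 4x4 grid of processors along the curve; a contiguous run of
# curve positions then corresponds to a compact region of the mesh.
n = 4
order = sorted(((x, y) for x in range(n) for y in range(n)),
               key=lambda p: xy2d(n, *p))
print(order[:4])  # [(0, 0), (1, 0), (1, 1), (0, 1)] -- a compact 2x2 block
```

Given this ordering, the bin-packing heuristics (best fit, first fit, sum of squares) can run purely on the 1D sequence of free/busy processors.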
Results (Makespan Normalized, each row relative to its YZ value)

        YZ   ZY       random   Zoltan   spliced
  BF    1    1.2053   1.1383   1.1567   1.0338
  BF2   1    1.2398   1.176    1.1828   1.0443

• Not I/O
• Consistent with Cplant experiments (Leung et al. 2002)
• Consistent with Pittsburgh Supercomputing Center experiments (Weisser et al. 2006)

Experiments: Test Set
• All-to-all communications

  Job size   Number of jobs
  2          1820
  5          660
  15         620
  20         660

• High communication: a best case for runtime improvements
• Small number of repetitions (3)

Questions
• What's the right allocation for a stream (online)?
• Scheduling + allocation of MPP jobs
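The BF row here is arithmetically consistent with the "Makespan in Seconds" table earlier in the deck: dividing BF's makespans by its own YZ-curve makespan reproduces the normalized values. A quick check, using only numbers taken from the tables:

```python
# BF makespans in seconds, from the "Makespan in Seconds" table.
bf_seconds = {"YZ": 5826.2, "ZY": 7022.6, "random": 6631.9,
              "Zoltan": 6739.1, "spliced": 6023.4}

# Normalize the row by its own YZ value, rounding to 4 places as the deck does.
bf_normalized = {curve: round(t / bf_seconds["YZ"], 4)
                 for curve, t in bf_seconds.items()}
print(bf_normalized)
# {'YZ': 1.0, 'ZY': 1.2053, 'random': 1.1383, 'Zoltan': 1.1567, 'spliced': 1.0338}
```

(No seconds are given for the BF2 run, so only the BF row can be cross-checked this way.)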