
ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters
Vignesh Ravi, Michela Becchi, Gagan Agrawal, Srimat Chakradhar
Context
• GPUs are used in supercomputers
  – Some of the top500 supercomputers use GPUs
    • Tianhe-1A: 14,336 Xeon X5670 processors, 7,168 Nvidia Tesla M2050 GPUs
    • Stampede: about 6,000 nodes (Xeon E5-2680 8C, Intel Xeon Phi)
  – Need resource managers and scheduling schemes for heterogeneous clusters including many-core GPUs
• GPUs are used in cloud computing
Categories of Scheduling Objectives
• Traditional schedulers for supercomputers aim to
improve system-wide metrics: throughput & latency
• A market-based service world is emerging: focus on
provider’s profit and user’s satisfaction
– Cloud: pay-as-you-go model
• Amazon: different users (On-Demand, Free, Spot, …)
– Recent resource managers for supercomputers (e.g. MOAB)
have the notion of service-level agreement (SLA)
State of the Art & Motivation
• Open-source batch schedulers start to support GPUs
  – TORQUE, SLURM
  – Users guide the mapping of jobs to heterogeneous nodes
  – Simple scheduling schemes (goals: throughput & latency)
• Recent proposals describe runtime systems & virtualization frameworks for clusters with GPUs
  – [gViM HPCVirt '09][vCUDA IPDPS '09][rCUDA HPCS '10][gVirtuS Euro-Par 2010][our HPDC'11, CCGRID'12, HPDC'12]
  – Simple scheduling schemes (goals: throughput & latency)
• Proposals on market-based scheduling policies focus on homogeneous CPU clusters
  – [Irwin HPDC'04][Sherwani Soft.Pract.Exp.'04]
Our Goal: Reconsider market-based scheduling for heterogeneous clusters including GPUs
Considerations
• Community looking into code portability between CPU
and GPU
– OpenCL
– PGI CUDA-x86
– MCUDA (CUDA-C), Ocelot, SWAN (CUDA-OpenCL), OpenMPC
→ Opportunity to flexibly schedule a job on the CPU or the GPU (see the device-query sketch at the end of this slide)
• In cloud environments, oversubscription is commonly used to reduce infrastructure costs
→ Use of resource sharing to improve performance by
maximizing hardware utilization
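To make the portability point concrete, here is a minimal, hypothetical device-query sketch in Python using pyopencl (not part of ValuePack). It simply lists the CPU and GPU devices that a single OpenCL code base could target on one node, assuming pyopencl and an OpenCL runtime are installed.

```python
import pyopencl as cl  # assumes pyopencl and an OpenCL runtime are installed

# The same OpenCL kernel can be dispatched to any device listed here,
# which is what lets a scheduler map a job to either the CPU or the GPU.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(platform.name, "|", device.name, "|",
              cl.device_type.to_string(device.type))
```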
Problem Formulation
• Given a CPU-GPU cluster
• Schedule a set of jobs on the cluster
– To maximize the provider’s profit / aggregate user satisfaction
• Exploit the portability offered by OpenCL
– Flexibly map the job onto either the CPU or the GPU
• Maximize resource utilization
– Allow sharing of multi-core CPU or GPU
Assumptions/Limitations
• 1 multi-core CPU and 1 GPU per node
• Single-node, single GPU jobs
• Only space-sharing, limited to two jobs per resource
Market-based Scheduling Formulation: Value Function
• For each job, a Linear-Decay Value Function [Irwin HPDC'04] (sketched in code below):
  Yield = maxValue - decay * delay
[Figure: yield/value vs. execution time, starting at Max Value and decaying linearly]
• Max Value → importance/priority of the job
  Decay → urgency of the job
• Delay due to:
  – queuing, execution on a non-optimal resource, resource sharing
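A minimal sketch of the linear-decay value function in Python. The Job fields (max_value, decay, walltime_cpu, walltime_gpu) are illustrative names, not the paper's actual code; the later sketches reuse them.

```python
from dataclasses import dataclass

@dataclass
class Job:
    max_value: float     # importance/priority of the job
    decay: float         # urgency: value lost per unit of delay
    walltime_cpu: float  # user-provided walltime estimate on the CPU
    walltime_gpu: float  # user-provided walltime estimate on the GPU

def yield_value(job: Job, delay: float) -> float:
    """Linear-decay value function: Yield = maxValue - decay * delay.

    `delay` accounts for queuing, execution on a non-optimal resource,
    and slowdown due to resource sharing.
    """
    return job.max_value - job.decay * delay
```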
Overall Scheduling Approach
Jobs arrive in batches and flow through three phases (a minimal code sketch of this flow follows below):
• Phase 1: Mapping
  – Each job is enqueued on its optimal resource (CPU queue or GPU queue)
  – Oblivious of other jobs (based on optimal walltime)
• Phase 2: Sorting
  – Jobs in each queue are sorted to improve yield
  – Takes inter-job scheduling considerations into account
• Phase 3: Re-mapping
  – Different schemes: when to remap? what to remap?
• Jobs then execute on the CPU or the GPU
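A minimal sketch of the three-phase flow, reusing the illustrative Job fields from above. The function and parameter names (schedule_batch, sort_key, remap) are hypothetical; the Phase 2 and Phase 3 helpers are sketched with the later slides.

```python
def schedule_batch(jobs, cpu_queue, gpu_queue, sort_key, remap):
    """One scheduling round over a batch of newly arrived jobs.

    `sort_key` (Phase 2) and `remap` (Phase 3) are supplied by the caller;
    sketches of both appear with the later slides.
    """
    # Phase 1: map each job onto its optimal resource, oblivious of other jobs
    for job in jobs:
        if job.walltime_cpu <= job.walltime_gpu:
            cpu_queue.append(job)
        else:
            gpu_queue.append(job)

    # Phase 2: sort each pending queue to improve the aggregate yield
    cpu_queue.sort(key=sort_key)
    gpu_queue.sort(key=sort_key)

    # Phase 3: remap jobs between queues, either when a resource would
    # otherwise idle (uncoordinated schemes) or when the queues become
    # imbalanced (coordinated scheme)
    remap(cpu_queue, gpu_queue)
```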
Phase 1: Mapping
• Users provide walltimes on both CPU and GPU
  – The walltime is used as an indicator of the optimal/non-optimal resource
  – Each job is mapped onto its optimal resource
NOTE: in our experiments we assumed
maxValue = optimal walltime
Phase 2: Sorting
• Sort jobs based on Reward [Irwin HPDC'04] (a code sketch follows below):
  Reward_i = (PresentValue_i - OpportunityCost_i) / Walltime_i
• Present Value – f(maxValue_i, discount_rate)
– Value after discounting the risk of running a job
– The shorter the job, the lower the risk
• Opportunity Cost
  – Degradation in value due to the selection of one among several alternatives
  OpportunityCost_i = Walltime_i * (Σ_j decay_j - decay_i)

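A sketch of the reward computation under stated assumptions: the slide only says PresentValue_i = f(maxValue_i, discount_rate), so the exact discounting form below (shorter jobs carry less risk) is an illustrative choice, not the paper's formula.

```python
def present_value(job, walltime, discount_rate):
    # PresentValue_i = f(maxValue_i, discount_rate); the concrete form here
    # (discounting by job length, so shorter jobs are less risky) is an
    # assumption for illustration.
    return job.max_value / (1.0 + discount_rate) ** walltime

def opportunity_cost(job, walltime, pending_jobs):
    # OpportunityCost_i = Walltime_i * (sum_j decay_j - decay_i):
    # value the other pending jobs lose while job i holds the resource.
    total_decay = sum(j.decay for j in pending_jobs)
    return walltime * (total_decay - job.decay)

def reward(job, walltime, pending_jobs, discount_rate=0.01):
    # Reward_i = (PresentValue_i - OpportunityCost_i) / Walltime_i
    pv = present_value(job, walltime, discount_rate)
    oc = opportunity_cost(job, walltime, pending_jobs)
    return (pv - oc) / walltime
```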
Phase 3: Remapping
• When to remap:
– Uncoordinated schemes
• queue is empty and resource is idle
– Coordinated scheme
• When CPU and GPU queues are imbalanced
• What to remap:
– Which job will have the best reward on the non-optimal resource?
– Which job will suffer the least reward penalty?
Phase 3: Uncoordinated Schemes
1. Last Optimal Reward (LOR)
– Remap job with least reward on optimal resource
– Idea: least reward → least risk in moving
2. First Non-Optimal Reward (FNOR)
– Compute the reward job could produce on non-optimal resource
– Remap job with highest reward on non-optimal resource
– Idea: consider non-optimal penalty
3. Last Non-Optimal Reward Penalty (LNORP)
– Remap job with least reward degradation
RewardDegradation_i = OptimalReward_i - NonOptimalReward_i
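A sketch of the three uncoordinated selection rules. The helper names (pick_job_to_remap, walltime_opt, walltime_nonopt, reward_on) are hypothetical; reward_on can be the reward() sketch above.

```python
def pick_job_to_remap(pending, scheme, walltime_opt, walltime_nonopt, reward_on):
    """Choose one pending job to move to its non-optimal resource.

    `reward_on(job, walltime)` evaluates a job's reward for a given walltime;
    `walltime_opt` / `walltime_nonopt` return a job's walltime on its
    optimal / non-optimal resource.
    """
    if scheme == "LOR":    # least reward on the optimal resource
        return min(pending, key=lambda j: reward_on(j, walltime_opt(j)))
    if scheme == "FNOR":   # highest reward on the non-optimal resource
        return max(pending, key=lambda j: reward_on(j, walltime_nonopt(j)))
    if scheme == "LNORP":  # least degradation: optimal minus non-optimal reward
        return min(pending, key=lambda j: reward_on(j, walltime_opt(j))
                                          - reward_on(j, walltime_nonopt(j)))
    raise ValueError(f"unknown scheme: {scheme}")
```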
Phase 3: Coordinated Scheme
Coordinated Least Penalty (CORLP)
• When to remap: imbalance between queues
– Imbalance affected by: decay rates and execution times of
jobs
– Total Queuing-Delay * Decay-Rate Product (TQDP), sketched in code below:
  TQDP = Σ_j queuing_delay_j * decay_j
– Remap if |TQDP_CPU - TQDP_GPU| > threshold
• What to remap
– Remap the job with the least remapping penalty (reward degradation)
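A sketch of the TQDP imbalance test; it assumes each job tracks a queuing_delay attribute, which is an illustrative addition to the Job sketch above.

```python
def tqdp(pending):
    # Total Queuing-Delay * Decay-Rate Product of one pending queue:
    # TQDP = sum_j queuing_delay_j * decay_j
    # (assumes each job tracks its current queuing_delay)
    return sum(j.queuing_delay * j.decay for j in pending)

def should_remap(cpu_pending, gpu_pending, threshold):
    # Coordinated scheme (CORLP): trigger remapping only when the CPU and
    # GPU pending queues are sufficiently imbalanced.
    return abs(tqdp(cpu_pending) - tqdp(gpu_pending)) > threshold
```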
Resource Sharing Heuristic
• Limitation: two jobs can space-share a CPU or GPU
• Factors affecting sharing
  – Slowdown incurred by jobs using half of a resource
  + More resources available for other jobs
• Jobs are categorized as low-, medium-, or high-scaling (based on models/profiling)
• When to enable sharing
  – A large fraction of jobs in the pending queues have negative yield
• Which jobs share a resource: Scalability-DecayRate factor (sketched below)
  – Jobs are grouped based on scalability
  – Within each group, jobs are ordered by decay rate (urgency)
  – Pick the top K fraction of jobs, with K tunable (low scalability and low decay first)
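A sketch of the Scalability-DecayRate selection, assuming each job carries an illustrative `scalability` label from profiling in addition to its decay rate.

```python
def sharing_candidates(pending, k):
    """Pick the fraction `k` of pending jobs that will space-share a resource.

    Assumes each job has a `scalability` label in {"low", "medium", "high"}
    (from models/profiling) and a `decay` rate; low-scalability jobs (they
    lose little when given half a resource) with low decay (least urgent)
    come first.
    """
    order = {"low": 0, "medium": 1, "high": 2}
    ranked = sorted(pending, key=lambda j: (order[j.scalability], j.decay))
    return ranked[: int(k * len(pending))]
```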
Overall System Prototype
[Architecture diagram]
• Master Node – centralized decision making
  – Submission queue
  – Cluster-level scheduler (scheduling schemes & policies)
  – Pending, execution, and finished queues, separate for CPU and GPU
  – TCP communicator to the compute nodes
• Compute Nodes (multi-core CPU + GPU each) – execution & sharing mechanisms
  – Node-level runtime with a TCP communicator
  – CPU execution processes: OS-based scheduling & sharing
  – GPU execution processes: GPU consolidation framework
• Assumption: shared file system
GPU Sharing Framework (GPU-related Node-Level Runtime)
[Diagram]
• Front-end: GPU execution processes
  – CUDA app1 … appN, each linked against a CUDA interception library
  – Intercepted CUDA calls travel over a front-end/back-end communication channel
• Back-end: GPU consolidation framework
  – A back-end server receives the CUDA calls from the front-ends
  – A virtual context and a workload consolidator map applications onto CUDA streams (stream1 … streamN)
  – Manipulates kernel configurations to allow GPU space sharing
  – Sits on top of the CUDA runtime, CUDA driver, and the GPU
• Simplified version of our HPDC'11 runtime
Experimental Setup
• 16-node cluster
– CPU: 8-core Intel Xeon E5520 (2.27 GHz), 48 GB memory
– GPU: Nvidia Tesla C2050 (1.15 GHz), 3 GB device memory
• 256-job workload
– 10 benchmark programs
– 3 configurations: small, large, very large datasets
– Various application domains: scientific computations, financial
analysis, data mining, machine learning
• Baselines
– TORQUE (always optimal resource)
– Minimum Completion Time (MCT) [Maheswaran et al., HCW'99]
Comparison with Torque-based Metrics – Throughput & Latency
[Figure: completion time (Comp. Time-UM, Comp. Time-BM) and average latency (Ave. Lat-UM, Ave. Lat-BM), normalized over the best case, for TORQUE, MCT, LOR, FNOR, LNORP, CORLP; the proposed schemes are about 10-20% better]
• Baselines suffer from idle resources
• By privileging shorter jobs, our schemes reduce queuing delays
Results with Yield Metric – Average Yield: Effect of Job Mix
[Figure: relative average yield vs. CPU/GPU job mix ratio (25C/75G skewed-GPU, 50C/50G uniform, 75C/25G skewed-CPU) for Torque, MCT, LOR, FNOR, LNORP, CORLP; improvements range from 2.3x up to 8.8x over the baselines depending on the mix]
• Better on skewed job mixes:
  – More idle time in case of the baseline schemes
  – More room for dynamic mapping
Results with Yield Metric – Average Yield: Effect of Value Function
[Figure: relative average yield for linear-decay and step-decay value functions, for TORQUE, MCT, LOR, FNOR, LNORP, CORLP; improvements of up to 3.8x and up to 6.9x over the baselines]
• Adaptability of our schemes to different value functions
Results with Yield Metric – Average Yield: Effect of System Load
[Figure: relative average yield vs. total number of jobs (128, 256, 384, 512) for TORQUE, MCT, LOR, FNOR, LNORP, CORLP; up to 8.2x better than the baselines]
• As load increases, the yield from the baselines decreases linearly
• The proposed schemes achieve an initially increasing and then sustained yield
Yield Improvements from Sharing – Effect of Sharing
[Figure: yield improvement (%) from CPU-only, GPU-only, and CPU & GPU sharing vs. the sharing K factor (fraction of jobs allowed to share, 0.1-0.6); up to ~23% improvement]
• Careful space sharing can help performance by freeing resources
• Excessive sharing can be detrimental to performance
Summary
• Value-based Scheduling on CPU-GPU clusters
- Goal: improve aggregate yield
• Coordinated and uncoordinated scheduling schemes
for dynamic mapping
• Automatic space sharing of resources based on
heuristics
• Prototypical framework for evaluating the proposed
schemes
• Improvement over state-of-the-art
- Based on completion time & latency
- Based on average yield