ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters
Vignesh Ravi, Michela Becchi, Gagan Agrawal, Srimat Chakradhar

Context
• GPUs are used in supercomputers
  – Some of the top500 supercomputers use GPUs
  – Tianhe-1A: 14,336 Xeon X5670 processors, 7,168 Nvidia Tesla M2050 GPUs
  – Stampede: about 6,000 nodes (Xeon E5-2680 8C, Intel Xeon Phi)
• GPUs are used in cloud computing
• Need resource managers and scheduling schemes for heterogeneous clusters including many-core GPUs

Categories of Scheduling Objectives
• Traditional schedulers for supercomputers aim to improve system-wide metrics: throughput & latency
• A market-based service world is emerging: focus on the provider's profit and the user's satisfaction
  – Cloud: pay-as-you-go model
    » Amazon: different users (On-Demand, Free, Spot, ...)
  – Recent resource managers for supercomputers (e.g. MOAB) have the notion of a service-level agreement (SLA)

State of the Art & Motivation
• Open-source batch schedulers are starting to support GPUs
  – TORQUE, SLURM
  – Users guide the mapping of jobs to heterogeneous nodes
  – Simple scheduling schemes (goals: throughput & latency)
• Recent proposals describe runtime systems & virtualization frameworks for clusters with GPUs
  – [gViM HPCVirt '09] [vCUDA IPDPS '09] [rCUDA HPCS '10] [gVirtuS Euro-Par 2010] [our HPDC '11, CCGRID '12, HPDC '12]
  – Simple scheduling schemes (goals: throughput & latency)
• Proposals on market-based scheduling policies focus on homogeneous CPU clusters
  – [Irwin HPDC '04] [Sherwani Soft.Pract.Exp. '04]
• Our goal: reconsider market-based scheduling for heterogeneous clusters including GPUs

Considerations
• The community is looking into code portability between CPU and GPU
  – OpenCL, PGI CUDA-x86, MCUDA (CUDA-C), Ocelot, SWAN (CUDA-OpenCL), OpenMPC
  → Opportunity to flexibly schedule a job on CPU or GPU
• In cloud environments, oversubscription is commonly used to reduce infrastructure costs
  → Use resource sharing to improve performance by maximizing hardware utilization

Problem Formulation
• Given a CPU-GPU cluster, schedule a set of jobs on the cluster
  – To maximize the provider's profit / aggregate user satisfaction
• Exploit the portability offered by OpenCL
  – Flexibly map each job onto either CPU or GPU
• Maximize resource utilization
  – Allow sharing of a multi-core CPU or GPU
• Assumptions/Limitations
  – 1 multi-core CPU and 1 GPU per node
  – Single-node, single-GPU jobs
  – Only space-sharing, limited to two jobs per resource

Market-based Scheduling Formulation: the Value Function (Yield/Value)
• For each job, a Linear-Decay Value Function [Irwin HPDC '04] defines its yield:
  Yield = maxValue − decay × delay
  [Figure: value vs. execution time for the linear-decay value function: maxValue up to time T, then linear decay]
• maxValue → importance/priority of the job
• decay → urgency of the job
• Delay due to: queuing, execution on the non-optimal resource, resource sharing
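To make the value function concrete, below is a minimal sketch in Python of the linear-decay yield defined above. The Job fields and the way the delay is assembled (queuing delay plus the extra execution time caused by a non-optimal mapping or by sharing) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Job:
    max_value: float        # importance/priority of the job
    decay: float            # urgency: value lost per unit of delay
    optimal_walltime: float # walltime on the job's optimal resource

def linear_decay_yield(job: Job, queuing_delay: float, actual_walltime: float) -> float:
    """Yield = maxValue - decay * delay.

    The delay is everything beyond the job's optimal walltime: time spent
    queuing plus extra execution time from running on the non-optimal
    resource or from space-sharing a resource.
    """
    delay = queuing_delay + (actual_walltime - job.optimal_walltime)
    return job.max_value - job.decay * delay

# Example: a job worth 100 that loses 2 units of value per second of delay.
job = Job(max_value=100.0, decay=2.0, optimal_walltime=30.0)
print(linear_decay_yield(job, queuing_delay=5.0, actual_walltime=40.0))  # 100 - 2*15 = 70
```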
Overall Scheduling Approach: Scheduling Flow
• Jobs arrive in batches
• Phase 1: Mapping: enqueue each job into the CPU queue or the GPU queue
  – Jobs are enqueued on their optimal resource; Phase 1 is oblivious of other jobs (based on optimal walltime)
• Phase 2: Sorting: sort the jobs in each queue to improve yield
  – Inter-job scheduling considerations
• Phase 3: Re-mapping: different schemes answer when to remap and what to remap
• Execute on CPU / execute on GPU

Phase 1: Mapping
• Users provide walltimes on CPU and GPU
  – The walltimes are used as an indicator of the optimal/non-optimal resource
  – Each job is mapped onto its optimal resource
• NOTE: in our experiments we assumed maxValue = optimal walltime

Phase 2: Sorting
• Sort jobs based on Reward [Irwin HPDC '04]:
  Reward_i = (PresentValue_i − OpportunityCost_i) / Walltime_i
• Present value
  – f(maxValue_i, discount_rate)
  – Value after discounting the risk of running a job
  – The shorter the job, the lower the risk
• Opportunity cost
  – Degradation in value due to the selection of one among several alternatives
  – OpportunityCost_i = Walltime_i × Σ_{j≠i} decay_j

Phase 3: Remapping
• When to remap:
  – Uncoordinated schemes: a queue is empty and its resource is idle
  – Coordinated scheme: when the CPU and GPU queues are imbalanced
• What to remap:
  – Which job will have the best reward on the non-optimal resource?
  – Which job will suffer the least reward penalty?

Phase 3: Uncoordinated Schemes
1. Last Optimal Reward (LOR)
  – Remap the job with the least reward on its optimal resource
  – Idea: least reward → least risk in moving
2. First Non-Optimal Reward (FNOR)
  – Compute the reward each job could produce on the non-optimal resource
  – Remap the job with the highest reward on the non-optimal resource
  – Idea: consider the non-optimal penalty
3. Last Non-Optimal Reward Penalty (LNORP)
  – Remap the job with the least reward degradation
  – RewardDegradation_i = OptimalReward_i − NonOptimalReward_i

Phase 3: Coordinated Scheme (Coordinated Least Penalty, CORLP)
• When to remap: imbalance between the queues
  – Imbalance is affected by the decay rates and execution times of the jobs
  – Total Queuing-Delay Decay-Rate Product (TQDP), per queue: TQDP = Σ_j queuing_delay_j × decay_j
  – Remap if |TQDP_CPU − TQDP_GPU| > threshold
• What to remap: remap the job with the least penalty (reward degradation)
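The sketch below, using an illustrative job record, shows how the Phase 2 reward ordering and the CORLP trigger could be computed from the formulas above. The exponential form of the present-value discount and the field names are assumptions; the reward, opportunity-cost, and TQDP expressions follow the slides.

```python
from dataclasses import dataclass

@dataclass
class QueuedJob:
    max_value: float      # importance/priority
    decay: float          # urgency: value lost per unit of delay
    walltime: float       # walltime on this queue's resource
    queuing_delay: float  # time spent waiting so far

def present_value(job: QueuedJob, discount_rate: float) -> float:
    # Discount maxValue by the risk of occupying the resource for the job's
    # walltime; shorter jobs are discounted less.  The exponential form is an
    # illustrative choice of f(maxValue, discount_rate).
    return job.max_value / ((1.0 + discount_rate) ** job.walltime)

def opportunity_cost(job: QueuedJob, queue: list) -> float:
    # OpportunityCost_i = Walltime_i * sum of the other queued jobs' decay rates:
    # the value the rest of the queue loses while job i holds the resource.
    return job.walltime * sum(j.decay for j in queue if j is not job)

def reward(job: QueuedJob, queue: list, discount_rate: float = 0.01) -> float:
    # Reward_i = (PresentValue_i - OpportunityCost_i) / Walltime_i
    return (present_value(job, discount_rate) - opportunity_cost(job, queue)) / job.walltime

def sort_queue(queue: list) -> list:
    # Phase 2: order each queue so that the highest-reward jobs run first.
    return sorted(queue, key=lambda j: reward(j, queue), reverse=True)

def tqdp(queue: list) -> float:
    # Total Queuing-Delay Decay-Rate Product for one queue.
    return sum(j.queuing_delay * j.decay for j in queue)

def corlp_should_remap(cpu_queue: list, gpu_queue: list, threshold: float) -> bool:
    # CORLP trigger: remap only when the CPU and GPU queues are imbalanced.
    return abs(tqdp(cpu_queue) - tqdp(gpu_queue)) > threshold
```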
Resource Sharing Heuristic
• Limitation: two jobs can space-share a CPU/GPU
• Factors affecting sharing
  – (−) Slowdown incurred by jobs using half of a resource
  – (+) More resources available for other jobs
• Jobs are categorized as low, medium, or high scaling (based on models/profiling)
• When to enable sharing
  – A large fraction of jobs in the pending queues have negative yield
• What jobs share a resource: Scalability-DecayRate factor
  – Jobs are grouped based on scalability
  – Within each group, jobs are ordered by decay rate (urgency)
  – Pick the top K fraction of jobs, where K is tunable (low scalability, low decay); a sketch of this selection follows the GPU sharing framework overview below

Overall System Prototype
• Master node (centralized decision making)
  – Submission queue
  – Cluster-level scheduler: scheduling schemes & policies
  – Pending, execution, and finished queues (one per resource type: CPU, GPU)
  – TCP communicator
• Compute nodes (execution & sharing mechanisms), each with a multi-core CPU and a GPU
  – Node-level runtime with a TCP communicator
  – CPU execution processes: OS-based scheduling & sharing
  – GPU execution processes: GPU consolidation framework
• Assumption: shared file system

GPU Sharing Framework (GPU-related node-level runtime)
• Front-end: GPU execution processes
  – CUDA app1 ... CUDA appN, each linked against a CUDA interception library
• Front-end / back-end communication channel
  – CUDA calls arrive from the front-ends at the back-end server
• Back-end: GPU consolidation framework (on top of the CUDA runtime, CUDA driver, and the GPU)
  – Virtual context
  – Workload consolidator: manipulates kernel configurations to allow GPU space sharing (CUDA stream1 ... CUDA streamN)
• Simplified version of our HPDC '11 runtime
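As promised under the resource sharing heuristic above, here is a sketch of the Scalability-DecayRate candidate selection. The scalability categories, the negative-yield trigger fraction, and the field names are illustrative assumptions built from the slides, not the framework's actual interface.

```python
from dataclasses import dataclass

# Low-scaling jobs lose the least when confined to half of a resource,
# so they are the preferred sharing candidates.
SCALING_ORDER = {"low": 0, "medium": 1, "high": 2}

@dataclass
class PendingJob:
    name: str
    decay: float          # urgency
    scaling: str          # "low", "medium", or "high" (from models/profiling)
    current_yield: float  # yield the job would obtain if started now

def sharing_enabled(pending: list, negative_fraction: float = 0.5) -> bool:
    # Enable sharing only when a large fraction of the pending jobs already
    # have negative yield (the trigger fraction is a tunable guess).
    if not pending:
        return False
    negative = sum(1 for j in pending if j.current_yield < 0)
    return negative / len(pending) >= negative_fraction

def sharing_candidates(pending: list, k: float = 0.3) -> list:
    # Scalability-DecayRate selection: group by scalability (low first),
    # order by decay rate within each group (least urgent first), and
    # pick the top K fraction, i.e. low-scalability, low-decay jobs.
    ranked = sorted(pending, key=lambda j: (SCALING_ORDER[j.scaling], j.decay))
    return ranked[: max(1, int(k * len(pending)))]
```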
Experimental Setup
• 16-node cluster
  – CPU: 8-core Intel Xeon E5520 (2.27 GHz), 48 GB memory
  – GPU: Nvidia Tesla C2050 (1.15 GHz), 3 GB device memory
• 256-job workload
  – 10 benchmark programs
  – 3 configurations: small, large, very large datasets
  – Various application domains: scientific computations, financial analysis, data mining, machine learning
• Baselines
  – TORQUE (always optimal resource)
  – Minimum Completion Time (MCT) [Maheswaran et al., HCW '99]

Comparison with Throughput & Latency-based Metrics
[Chart: completion time (Comp. Time-UM, Comp. Time-BM) and average latency (Ave. Lat-UM, Ave. Lat-BM), normalized over the best case, for TORQUE, MCT, LOR, FNOR, LNORP, and CORLP; the proposed schemes are 10-20% better on completion time and ~20% better on average latency]
• The baselines suffer from idle resources
• By privileging shorter jobs, our schemes reduce queuing delays

Results with the Average Yield Metric: Effect of Job Mix
[Chart: relative average yield of TORQUE, MCT, LOR, FNOR, LNORP, and CORLP for CPU/GPU job mix ratios of 25C/75G (skewed-GPU), 50C/50G (uniform), and 75C/25G (skewed-CPU); up to 8.8x and up to 2.3x better than the baselines, depending on the mix]
• Better on skewed job mixes:
  – More idle time with the baseline schemes
  – More room for dynamic mapping

Results with the Average Yield Metric: Effect of the Value Function
[Chart: relative average yield of TORQUE, MCT, LOR, FNOR, LNORP, and CORLP under a linear-decay and a step-decay value function; up to 6.9x and up to 3.8x better than the baselines]
• Shows the adaptability of our schemes to different value functions

Results with the Average Yield Metric: Effect of System Load
[Chart: relative average yield of TORQUE, MCT, LOR, FNOR, LNORP, and CORLP as the total number of jobs grows from 128 to 256, 384, and 512; up to 8.2x better than the baselines]
• As the load increases, the yield from the baselines decreases linearly
• The proposed schemes achieve an initially increasing yield and then a sustained yield

Yield: Effect of Sharing (Improvements from Sharing)
[Chart: yield improvement (%) vs. the sharing K factor (0.1 to 0.6, the fraction of jobs allowed to share), for CPU-only, GPU-only, and CPU & GPU sharing; up to ~23% improvement]
• Careful space sharing can help performance by freeing resources
• Excessive sharing can be detrimental to performance

Conclusion
• Value-based scheduling on CPU-GPU clusters
  – Goal: improve the aggregate yield
• Coordinated and uncoordinated scheduling schemes for dynamic mapping
• Automatic space sharing of resources based on heuristics
• Prototype framework for evaluating the proposed schemes
• Improvement over the state of the art
  – Based on completion time & latency
  – Based on average yield