
Optimizing Resource Provisioning
by using GPU Usage Pattern Extraction
in GPU-based Cloud Environment
Outline
• Introduction
• Background and Motivation
• System Overview
• Scheduling Policies
• Experimental Evaluation
• Related Work
• Conclusion and Future Work
Introduction
• Many large-scale cloud providers, such as Amazon EC2, Nimbix,
Peer1 Hosting, and Penguin Computing, supply GPU services.
• GPU usage in such cloud environments suffers from low resource
utilization, long turnaround times, and low system throughput.
• This is caused by the static provisioning of GPU resources (i.e., dedicated access).
• To optimize resource provisioning, one approach is to schedule
multiple applications across multiple GPUs.
• However, running multiple applications on the same multitasked
GPU device may degrade the performance of one or more of those
applications.
Introduction (cont.)
• Therefore, to optimize resource provisioning, it is crucial to:
• Obtain the characteristics/behaviors of applications before actual execution.
• Explore suitable scheduling algorithms.
• Support for acquiring the characteristics and behaviors of
applications has advanced over the years:
• CUPTI
• PAPI, Tau, Vampir
• Mystic
• The disadvantages of these existing approaches are summarized in
the table below.
| Related Work                     | Obtains application behavior before execution | Modifies the source code | Extra overhead | Accuracy of the result |
|----------------------------------|------------------------------------------------|--------------------------|----------------|------------------------|
| CUPTI [11]                       | No                                             | No                       | Little         | High                   |
| PAPI [12], Tau [13], Vampir [14] | No                                             | Yes                      | Little         | High                   |
| Mystic [10]                      | Yes                                            | No                       | Large          | Low                    |
| XXX (this work)                  | Yes                                            | No                       | Little         | High                   |
Introduction (cont.)
• In this paper, we define a GPU usage pattern as an application's access
pattern to a GPU device during its execution, represented by a directed
graph in which each vertex indicates a pivotal CUDA activity, such as a
GPU kernel execution, a GPU memory allocation, a Host-to-Device
memory copy, or a Device-to-Host memory copy (see the sketch after
this list).
• Extract the GPU usage pattern.
• Propose two scheduling algorithms.
• Our contributions:
• A method for extracting GPU usage patterns from intermediate code.
• Two scheduling methods for different application scenarios.
• The system XXX, which is readily deployable within current data
centers without hardware modification.
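To make the definition concrete, here is a minimal sketch of how such a usage-pattern graph could be represented. The vertex labels and the example activity sequence are illustrative assumptions, not the paper's actual data structure.

```python
# A minimal sketch of a GPU usage pattern as a directed graph.
# Vertex labels (MALLOC, H2D, KERNEL, D2H) are illustrative; the
# paper's actual vertex set and attributes may differ.
from collections import defaultdict

class UsagePatternGraph:
    def __init__(self):
        self.edges = defaultdict(list)   # vertex -> successor vertices
        self.vertices = []               # CUDA activities in program order

    def add_activity(self, label):
        """Append a pivotal CUDA activity and link it to its predecessor."""
        v = (len(self.vertices), label)  # unique id + activity label
        if self.vertices:
            self.edges[self.vertices[-1]].append(v)
        self.vertices.append(v)
        return v

# Hypothetical pattern for a simple vector-add style application:
# allocate, copy inputs in, run the kernel, copy the result out.
g = UsagePatternGraph()
for activity in ["MALLOC", "H2D", "H2D", "KERNEL", "D2H"]:
    g.add_activity(activity)

print(g.vertices)  # [(0, 'MALLOC'), (1, 'H2D'), ..., (4, 'D2H')]
```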
Background and Motivation
• The rationale for using intermediate code
• Source code vs. intermediate code (safety, ease of use)
• CUDA also supports Java, Python, and other interpreted
languages; compiling them generates bytecode, which is what
we call intermediate code. Similarly, the intermediate code
produced from a C/C++ program can be analyzed in the same
way.
• Among these intermediate codes, C/C++ intermediate code is
the most difficult to analyze, so this paper takes it as the
example.
• The feasibility of using static analysis
• Feature 1 of GPU-based applications: total CPU time << total
GPU time.
• Feature 2 of GPU-based applications: the CPU code controls the
flow and prepares data, while the GPU code is responsible for
the computation.
Background and Motivation (cont.)
• Why program features are extracted to optimize scheduling
• Applications have different GPU usage patterns. Running multiple applications on the same
multitasked GPU device may lead to performance degradation, and resource contention is the
fundamental cause of this decline. Performance degradation means longer program turnaround
times and lower system throughput.
• Lexical analyzer
• A lexical analyzer parses the meaning of each string in the code. In this paper, we use it to
identify key functions, syntax modules, and so on (a sketch follows below).
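As an illustration, here is a minimal sketch of how a lexical pass could pick key CUDA calls out of intermediate code. The token pattern, the set of key calls, and the sample input are assumptions for the example, not the paper's actual analyzer.

```python
# A minimal sketch of a lexical pass that flags key CUDA activities
# in intermediate code. The regular expression and the set of "key
# calls" are illustrative assumptions, not the paper's analyzer.
import re

KEY_CALLS = {
    "cudaMalloc": "MALLOC",
    "cudaMemcpy": "MEMCPY",      # direction resolved from arguments
    "cudaLaunchKernel": "KERNEL",
}

TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def scan_key_calls(intermediate_code: str):
    """Return the key CUDA activities in the order they appear."""
    activities = []
    for tok in TOKEN.findall(intermediate_code):
        if tok in KEY_CALLS:
            activities.append(KEY_CALLS[tok])
    return activities

sample = """
  call cudaMalloc
  call cudaMemcpy        ; host-to-device
  call cudaLaunchKernel
  call cudaMemcpy        ; device-to-host
"""
print(scan_key_calls(sample))  # ['MALLOC', 'MEMCPY', 'KERNEL', 'MEMCPY']
```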
System Overview
• Architecture
• Key algorithms
• GPU usage pattern extraction (3 stages)
• Computing the times of the key CUDA calls (see the sketch below)
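The slides do not spell out how the key call times are computed; the following is only a plausible sketch, assuming the simplest static cost model, in which copy times are estimated from transfer sizes and a nominal bus bandwidth, and kernel time from a calibrated per-element cost. All constants and function names here are hypothetical.

```python
# A plausible sketch of statically estimating key CUDA call times.
# The cost model (bus bandwidth, per-element kernel cost) is a
# hypothetical assumption; the paper's actual model may differ.
PCIE_BANDWIDTH = 12e9   # bytes/s, assumed effective PCIe throughput
KERNEL_COST    = 2e-9   # s per element, assumed calibrated offline

def memcpy_time(num_bytes: int) -> float:
    """Estimated duration of an H2D or D2H copy."""
    return num_bytes / PCIE_BANDWIDTH

def kernel_time(num_elements: int) -> float:
    """Estimated duration of a kernel over num_elements items."""
    return num_elements * KERNEL_COST

# Example: 64 MB of float32 input copied in, processed, copied out.
n = 16 * 1024 * 1024
calls = [("H2D", memcpy_time(4 * n)),
         ("KERNEL", kernel_time(n)),
         ("D2H", memcpy_time(4 * n))]
for name, t in calls:
    print(f"{name}: {t * 1e3:.2f} ms")
```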
Scheduling Policies
• Interference-aware scheduler using GPU resource demand
• If an idle GPU exists, the application is assigned to that GPU.
• Otherwise, obtain the application's GPU resource demand vector 𝑣1 and compute
the similarity between 𝑣1 and the demand vector of every application already
running on each GPU. The higher the similarity, the higher the interference score.
• Interference-aware scheduler using the GPU key call graph
• If an idle GPU exists, the application is assigned to that GPU.
• Otherwise, obtain the application's GPU key call graph and compute the similarity
between it and every graph on each GPU. The higher the similarity, the higher the
interference score (a sketch of the first policy follows below).
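To illustrate the first policy, here is a minimal sketch of demand-vector interference-aware scheduling, assuming cosine similarity as the similarity measure and a place-on-the-least-interfering-GPU rule; both choices are assumptions, since the slide does not name the actual similarity function or placement rule.

```python
# A minimal sketch of the demand-vector interference-aware policy.
# Cosine similarity and the argmin placement rule are assumptions;
# the slides do not specify the actual similarity measure.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def schedule(app_demand, gpus):
    """gpus: list of lists of demand vectors of already-running apps."""
    # Rule 1: prefer an idle GPU.
    for gpu_id, running in enumerate(gpus):
        if not running:
            return gpu_id
    # Rule 2: higher similarity means higher interference, so place the
    # app on the GPU whose worst-case similarity is lowest.
    scores = [max(cosine(app_demand, v) for v in running)
              for running in gpus]
    return min(range(len(gpus)), key=lambda g: scores[g])

# Hypothetical demand vectors: (compute, memory bandwidth, PCIe).
gpus = [[(0.9, 0.2, 0.1)], [(0.1, 0.8, 0.3)]]
print(schedule((0.8, 0.1, 0.2), gpus))  # -> 1: compute-heavy app avoids GPU 0
```

The second policy follows the same skeleton, with graph similarity between key call graphs in place of the vector similarity.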
Experimental Evaluation
• Accuracy of GPU usage pattern extraction
• Scheduling performance (our schedulers vs. least-loaded (LL) and round-robin (RR))
• System performance: average normalized turnaround time (ANTT) and system
throughput (STP); a sketch of both metrics follows below
• GPU utilization (our schedulers vs. LL)
• Quality of launch sequence selection (coefficient of variation, COV)
• Scheduling decision quality (scheduling fairness)
• Scheduling overhead of GPU usage pattern extraction
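For reference, ANTT and STP are standard multiprogram performance metrics; a minimal sketch of their usual definitions is below, assuming T_alone is an application's runtime with dedicated GPU access and T_shared its runtime under co-location. The sample runtimes are illustrative.

```python
# Standard definitions of ANTT and STP for n co-scheduled apps.
# t_alone[i]: runtime of app i with dedicated GPU access;
# t_shared[i]: runtime of app i when co-located. (Illustrative data.)

def antt(t_alone, t_shared):
    """Average normalized turnaround time: lower is better, >= 1."""
    return sum(s / a for a, s in zip(t_alone, t_shared)) / len(t_alone)

def stp(t_alone, t_shared):
    """System throughput: higher is better, at most n."""
    return sum(a / s for a, s in zip(t_alone, t_shared))

t_alone  = [10.0, 20.0, 15.0]   # hypothetical isolated runtimes (s)
t_shared = [12.0, 26.0, 18.0]   # hypothetical co-located runtimes (s)
print(f"ANTT = {antt(t_alone, t_shared):.2f}")  # ~1.23
print(f"STP  = {stp(t_alone, t_shared):.2f}")   # ~2.44
```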
Related Work
• Obtaining the characteristics/behaviors of applications
• S. Browne, J. Dongarra, et al. "A Portable Programming Interface for Performance
Evaluation on Modern Processors." International Journal of High Performance
Computing Applications 14.3 (2000): 189-204.
• S. S. Shende and A. D. Malony. "The TAU Parallel Performance System." International
Journal of High Performance Computing Applications 20.2 (2006): 287-311.
• A. Knüpfer, et al. "The Vampir Performance Analysis Tool-Set." Tools for High
Performance Computing: Proceedings of the International Workshop on Parallel Tools
for High Performance Computing, July 2008, HLRS, Stuttgart. Springer, 2008: 139-155.
• Scheduling policies
• R. Phull, et al. "Interference-Driven Resource Management for GPU-Based
Heterogeneous Clusters." International Symposium on High-Performance Parallel and
Distributed Computing (HPDC). ACM, 2012: 109-120.
• D. Sengupta, et al. "Scheduling Multi-Tenant Cloud Workloads on Accelerator-Based
Systems." SC14: International Conference for High Performance Computing,
Networking, Storage and Analysis. IEEE, 2014: 513-524.
• Y. Ukidave, X. Li, and D. Kaeli. "Mystic: Predictive Scheduling for GPU-Based Cloud
Servers Using Machine Learning." IEEE International Parallel and Distributed
Processing Symposium (IPDPS). IEEE, 2016: 353-362.
Thanks!