Dynamic Workload Division in GPU-CPU Heterogeneous Systems
THESIS
Presented in Partial Fulfillment of the Requirements for the Degree Master of Science
in the Graduate School of The Ohio State University
By
Wei Chen
Graduate Program in Electrical and Computer Engineering
The Ohio State University
2013
Master's Examination Committee:
Dr. Xiaorui Wang, Advisor
Dr. Füsun Özgüner
Copyright by
Wei Chen
2013
Abstract
GPUs provide powerful computational capability and great potential for efficiency optimization. As a result, the CPU-GPU heterogeneous architecture remains a hot area of high performance computing. However, energy consumption is still the bottleneck of the entire system when the system and its corresponding framework must perform massive-scale calculation. Most existing studies focus on how to lower the GPU power requirement, but they do not consider the CPU and GPU as one architecture.
This thesis is based on the GreenGPU heterogeneous architecture. Because of the new generation of platform, one of the assumptions is that the operating system and its corresponding driver adjust DVFS to its best operating point. I implement a workload division algorithm using Tesla CUDA GPUs and AMD CPUs to balance the time difference caused by the workload split. Results on a real physical testbed show that the new workload division algorithm provides at least 5X better accuracy than the previous algorithm, without extra energy cost in the workload division procedure. This more accurate workload division benefits the overall system energy consumption, especially when the workload is huge.
Dedication
This document is dedicated to my family.
Acknowledgments
First and foremost, I would like to express my sincere gratitude to my advisor, Dr.
Xiaorui Wang, for his insightful inspiration and dedicated teaching. His invaluable advice
and constructive guidance motivated me to explore the intricate territory of energy
efficiency in high performance computing.
My gratitude also goes to my fellow student Kai Ma for his discussions of and
comments on my research topics.
Additionally, I would like to thank Dr. Füsun Özgüner for being a member of my
thesis committee.
Vita
2011 ........................................................... B.S. Electrical Information Engineering,
Sichuan University.
2012 to present ........................................... M.S. Department of Electrical and
Computer Engineering, The Ohio State
University
Publications
Kai Ma, Xue Li, Wei Chen, Chi Zhang, and Xiaorui Wang. “GreenGPU: A Holistic
Approach to Energy Efficiency in GPU-CPU Heterogeneous Architectures”. Accepted by
ICPP 2012
Zitao Liu, Wenchao Yu, Wei Chen, Shuran Wang, and Fengyi Wu. “Short Text
Feature Selection for Micro-Blog Mining”. Accepted by CiSE2010
Fields of Study
Major Field: Electrical and Computer Engineering
Table of Contents
Abstract
Dedication
Acknowledgments
Vita
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
Chapter 2: GPU-CPU Heterogeneous Architecture
2.1 Related Work
2.2 Motivation
2.3 Contribution
Chapter 3: System Design and Implementation
3.1 System Design
3.2 Algorithm Implementation
3.3 Physical Test-bed and Implementation
Chapter 4: Result Analysis and Conclusion
4.1 Implementation on Tesla GPU
4.2 Division Precision Improvement
4.3 Binary Search Implementation
4.4 Conclusion and Future Study
References
List of Tables
Table 1 Relationship of Division Ratio and Power Reduction
Table 2 Software Deployment on Test-bed
List of Figures
Figure 1 Single Processor Performance reaches a bottleneck
Figure 2 CUDA Execution Path
Figure 3 Algorithm Running Details explains boundary moving
Figure 4 Kmeans result indicates the balanced ratio point
Figure 5 Hotspot results prove the successful porting to the new platform
Figure 6 NN results present a special condition
Figure 7 Updated Hotspot results (a)
Figure 8 Updated Hotspot results (b)
Figure 9 Updated Hotspot results (c)
Figure 10 Recursion times of the Binary Search implementation
Figure 11 BS 1% Hotspot result (a)
Figure 12 BS 1% Hotspot result (b)
Figure 13 BS 1% Hotspot result (c)
Figure 14 Other testbench results from different initial GPU/CPU workloads
Figure 15 BS and GreenGPU Energy Consumption Comparison
Chapter 1: Introduction
The development of semiconductor technology keeps increasing the number of transistors
on a single die, so the computational capability of processors grows stronger and stronger.
The graphics processing unit (GPU) was originally designed to accelerate the construction
of images with a relatively simple architecture; however, its chip resources are considered
to have great computational potential. With architectural advances, especially in the
number of cores, the GPU can achieve much better performance than a general-purpose
CPU. Additionally, the GPU programming environment keeps improving: NVIDIA provides the
CUDA SDK, the Nsight integrated development environment for Visual Studio, the Nsight
Eclipse edition for Linux/UNIX, and so on. These developments have made the GPU a
heterogeneous accelerator for the CPU in parallel computation, a hot area of High
Performance Computing.
Single-processor performance improvement stopped around 2003, almost ten years ago, as
the red line in Figure 1 shows; instead, multiprocessors with parallelism have taken over
the trend. Parallelism brings new models such as data-, thread-, and request-level
parallelism; however, these models require explicit restructuring of the application.
A good starting point for discussion is Flynn's taxonomy. This computer architecture
classification indicates that the Single Instruction Multiple Data (SIMD) structure,
which processes multiple data segments with a single instruction, can provide better
performance and energy efficiency for a given workload [1]. NVIDIA CUDA-enabled GPUs are
optimized for SIMD, and they power Tianhe-1A, one of the Top 10 supercomputers as of June
2012, with its heterogeneous architecture of Xeon X5670 CPUs and NVIDIA Tesla 2050
GPUs [2].
Figure 1 Single Processor Performance reaches a bottleneck
This GPU-CPU heterogeneous architecture provides a huge performance gain; meanwhile, it
also offers potential energy-saving opportunities. The current generation of GPUs has
hundreds of cores. Although new generations of CPUs also have multi-core architectures,
the GPU has a clear advantage in power consumption per core, so involving the GPU in the
computation is a good choice. However, this does not mean the GPU should do all the work.
When the CPU sits idle, its power is simply wasted, so we try to move part of the
workload to the idling CPU and put it to work. The intuitive idea is that involving the
CPU will reduce the total running time of a given workload, which will in turn reduce the
total energy consumption.
The rest of the thesis is organized as follows. Chapter 2 presents the theoretical
background of GPU-CPU heterogeneous computing and the motivation of this study. Chapter 3
describes the system design and presents the algorithm in detail together with its
implementation; we then discuss the real physical test-bed on which we obtain reliable
results. Finally, Chapter 4 presents the results and their analysis, and points out
directions for future research.
Chapter 2: GPU-CPU Heterogeneous Architecture
In this chapter, we first discuss the related territory and the work that has been done,
and then point out the motivation of this study. The contributions of this thesis are
listed at the end of the chapter.
2.1 Related Work
Workload division has attracted researchers' attention in recent years. This thesis is
based on the GreenGPU prototype. Ma et al. [1] propose the GreenGPU architecture, a
two-tier solution that dynamically splits and distributes workload between GPU and CPU in
the first tier and performs DVFS control on both CPU and GPU in the second tier. Luk et
al. [4] propose the Qilin framework, which dynamically schedules different workloads onto
different processing elements to reduce the total execution time. However, energy
consumption was not discussed in that work.
Initial research on GPU power provides multiple improvements in energy and power
efficiency. Takizawa et al. [5] propose a stream programming framework that dynamically
selects the best available processor for a specified workload task on a CPU-GPU hybrid
system. Rofouei et al. [6] report experimental results comparing the cost/performance of
GPU operation against a CPU-only system. Those studies show the optimization potential of
the GPU-CPU heterogeneous architecture.
2.2 Motivation
Power and energy are critical issues in the construction of HPC systems. No matter how
huge an HPC system is, a Thermal Design Power is used as the target for its power supply
and cooling system. New architectures and frameworks are implemented in those huge
systems, and even in smaller servers and workstations, so that processor clock rates can
be adjusted dynamically to reduce the total power consumption. We compare dynamic energy
and dynamic power below.
The dynamic energy is

$$DE = \frac{1}{2} \times \text{Capacitive Load} \times \text{Voltage}^2$$

The dynamic power is

$$DP = \frac{1}{2} \times \text{Capacitive Load} \times \text{Voltage}^2 \times \Delta\text{Frequency}$$
This small difference shows that frequency adjustment does nothing to the dynamic energy;
however, it can affect the power of the system.
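To see why, a first-order sketch (assuming the voltage stays fixed, static power is ignored, and run time scales inversely with frequency) computes the energy for a workload of $N$ cycles:

$$E = P \times t = \left(\frac{1}{2} \times C \times V^2 \times f\right) \times \frac{N}{f} = \frac{1}{2} \times C \times V^2 \times N$$

where $C$ is the capacitive load, $V$ the voltage, and $f$ the clock frequency. The frequency cancels: scaling $f$ down lowers power but stretches the run time, leaving the energy of the fixed workload unchanged to first order.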
However, hardware development now provides the newest drivers and corresponding operating
system support to perform an encapsulated DVFS control. Recent research also shows that,
due to limited memory locations and their synchronization scheme, massively parallel GPU
programs may produce less accurate results.
One of the highlights of GreenGPU is that Kai et al. implemented workload division on a
real physical test-bed with AMD Phenom II CPUs and an NVIDIA GeForce 8800 GTX GPU. To
reproduce and improve their experiments, I selected a new hardware platform, the NVIDIA
Tesla 2070. However, to meet the reliability requirements of HPC, NVIDIA has improved its
power management system. This updated system performs a better DVFS management scheme and
prohibits the manual DVFS configuration that was possible on the earlier GeForce
platform. In a similar way, the CPU power configuration has moved to another scheduling
scheme. I can therefore assume that the existing drivers will find the best DVFS level,
which makes one tier of GreenGPU unnecessary. However, the 5% CPU-GPU workload division
precision of the GreenGPU architecture does not match the requirements of real-world
workloads. For example, many bioinformatics Quality Threshold Clustering (QTC) tasks have
huge data inputs and run for days; even a 0.1% difference in workload division can change
the execution time and total energy cost [3].
2.3 Contribution
Based on the previous work, I simplify the previous GreenGPU to a one-tier solution.
In particular, the contributions of this thesis are:
1) Implementing the modified GreenGPU architecture on a current CPU-GPU physical test-bed;
2) Improving the workload division precision by introducing binary search;
3) Improving the running time of the workload division algorithm;
4) Deploying edge value control to prevent potential deadlocks.
Chapter 3: System Design and Implementation
In this chapter, we discuss the design of the workload division system and the
methodology of the algorithm. We also provide details about the implementation on the
real physical test-bed.
3.1 System Design
In GPU-CPU heterogeneous parallel system, GPU and CPU usually connected by
external bus, and they have their individual external off-chip DRAM storage. So they can
communicate with DMA. Generally, CPU and GPU work cooperatively in a MasterSlave mode. CPU is the Master end, which perform the data organize and schedule
execution of GPU. Figure 2 [7] shows a classic CUDA program pseudo-code. The serial
code which is considered as blocks of program segments that runs on the CPU (host), we
could call it CPU task. Meanwhile the GPU operation including data transfer between the
CPU and GPU storage (Communication task) and its running kernel was called GPU
task. Because the CPU and GPU execution process is relatively independent, and their
data transfer can be performed by an individual DMA module, thus the CPU task and
GPU task and the communication task can run simultaneously if there is no data
dependency. Additionally, CUDA provide asynchronies interfaces and functions to
ensure the program can run properly with different level of synchronization.
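As a minimal sketch of this task overlap (the kernel name, sizes, and split are illustrative assumptions, not the thesis code), a CUDA stream lets the host queue the communication task and the GPU task asynchronously, run the CPU task concurrently, and synchronize once at the end:

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *d, int n) {           // GPU task (illustrative)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

void cpuTask(float *h, int n) {                      // CPU task (illustrative)
    for (int i = 0; i < n; ++i) h[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *hGpuPart, *hCpuPart, *dPart;
    cudaMallocHost(&hGpuPart, n * sizeof(float));    // pinned host memory,
    cudaMallocHost(&hCpuPart, n * sizeof(float));    // so async copies can overlap
    cudaMalloc(&dPart, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Communication task and GPU task are only queued here...
    cudaMemcpyAsync(dPart, hGpuPart, n * sizeof(float), cudaMemcpyHostToDevice, s);
    kernelA<<<(n + 255) / 256, 256, 0, s>>>(dPart, n);
    cudaMemcpyAsync(hGpuPart, dPart, n * sizeof(float), cudaMemcpyDeviceToHost, s);

    cpuTask(hCpuPart, n);          // ...while the CPU task runs concurrently.

    cudaStreamSynchronize(s);      // synchronization point: wait for the GPU side
    cudaStreamDestroy(s);
    cudaFree(dPart);
    cudaFreeHost(hGpuPart);
    cudaFreeHost(hCpuPart);
    return 0;
}
```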
[Figure 2 sketch: serial host code (e.g., KernelA<<<nBlk,nTid>>>(args); then KernelB<<<nBlk,nTid>>>(args);) alternates with memory copies to and from the device and parallel kernel execution; control flow stays on the CPU while data flows to the GPU.]
Figure 2 CUDA Execution Path
Given this execution model, the CPU or GPU may waste energy: due to their different
computational capabilities and workloads, one of them may already have stopped working
and dropped into an idle mode at a synchronization point. From the aspect of energy
consumption, this workload arrangement is intuitively not optimal. One possible solution
is to run the "quicker" task at a low frequency and voltage so that it reaches the
synchronization point together with the "slower" task, which saves energy. Another
solution is to transfer some of the workload to the free processor and keep it busy.
The workload division unit lives in the serial CPU code and dynamically adjusts the
workloads between CPU and GPU based on their execution times, iteration by iteration, to
eliminate the idle time. Another basic assumption is that the operations inside each
iteration are similar, so after a few iterations the division ratio of the next iteration
can be predicted within a reasonable range. Although the input data vary even within a
single huge task, the neighborhoods of fetched and incoming data are similar in many
ways. Thus, once the workload division program selects a reasonable ratio for the current
program segment, that ratio remains usable for a short period of time, thanks to the
stability of the input data. Fortunately, before running such complex tasks, people
usually have a general view of the distribution of the input data. A periodically invoked
workload division unit can therefore perform a reliable and efficient energy-optimizing
adjustment.
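A sketch of how such a division unit might time one iteration on the host (the function names and the split interface are hypothetical placeholders, not the thesis's actual code around the Rodinia kernels):

```cuda
#include <cuda_runtime.h>
#include <chrono>

// Time one iteration of a divided workload: queue the GPU share on the
// default stream, run the CPU share concurrently, then collect both times.
// runGpuShare/runCpuShare are placeholders for the divided benchmark code.
void measureIteration(double r, double &tCpu, double &tGpu,
                      void (*runGpuShare)(double, cudaStream_t),
                      void (*runCpuShare)(double)) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);          // bracket the GPU share with events
    runGpuShare(1.0 - r, 0);         // GPU executes (1 - r) of the workload
    cudaEventRecord(stop);

    auto c0 = std::chrono::steady_clock::now();
    runCpuShare(r);                  // CPU executes r of the workload
    auto c1 = std::chrono::steady_clock::now();
    tCpu = std::chrono::duration<double>(c1 - c0).count();

    cudaEventSynchronize(stop);      // wait until the GPU share is done
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    tGpu = ms / 1000.0;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```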
3.2 Algorithm Implementation
The input data can be divided by lines: a barrier in the input stream decides which part
goes to the GPU and which to the CPU. Accordingly, Table 1 shows how the position of the
barrier, i.e., the workload ratio r = (workload on CPU) / (total workload), affects the
running time and can result in extra energy consumption.
Table 1 Relationship of Division Ratio and Power Reduction

Value of r         Running Time       Potential Energy Reduction
r > r_mean[i]      t_CPU > t_GPU      P_CPU*(t_CPU - t_GPU)
r < r_mean[i]      t_CPU < t_GPU      P_GPU*(t_GPU - t_CPU)
r = r_mean[i]      t_CPU = t_GPU      0
At the beginning, the workload division unit is assigned an initial guess, which may vary
from program to program; the default is 50%. The ratio can be an integer percentage or a
floating-point number.
The basic idea of the workload division is to use binary search to find the optimized
point as quickly as possible. The testing code is run several times on a sample input;
each iteration pushes either the upper bound or the lower bound of the division range
closer to the expected optimized point.
Figure 3 Algorithm Running Details explains boundary moving
In detail, as shown in Figure 3, the orange part is the ratio r running on the CPU and
the green part is the ratio running on the GPU. We use the running time of each version
of the code as the measurement. If r is bigger than it should be, the CPU executes its
part of the workload for too long and is considered over-burdened, so this barrier
position becomes the upper bound of the division range. In a similar way, if the GPU
takes longer to finish its part, the barrier position becomes the lower bound. This
oscillation eventually locates a reasonable workload division with the expected
precision. Because the algorithm is based on binary search, the precision is better than
1% after as few as seven iterations, since

$$100\% \div 2^{7} < 1\%$$

After the algorithm reaches 1% precision, we can still let it go deeper to find a more
precise division ratio if required.
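A minimal host-side sketch of this search, building on the measureIteration helper sketched in Section 3.1 (the tolerance, iteration cap, and names are assumptions, not GreenGPU's exact constants):

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Timing helper sketched in Section 3.1.
void measureIteration(double r, double &tCpu, double &tGpu,
                      void (*runGpuShare)(double, cudaStream_t),
                      void (*runCpuShare)(double));

// Binary-search workload division: shrink [lo, hi] around the balance point.
// Seven halvings of the full range already give a step below 1%; more
// iterations give 0.1% and beyond.
double divideWorkload(double lo, double hi, double timeLimit, int maxIter,
                      void (*runGpuShare)(double, cudaStream_t),
                      void (*runCpuShare)(double)) {
    double r = 0.5 * (lo + hi);      // default initial guess: 50%
    for (int it = 0; it < maxIter; ++it) {
        double tCpu, tGpu;
        measureIteration(r, tCpu, tGpu, runGpuShare, runCpuShare);
        if (std::fabs(tCpu - tGpu) <= timeLimit)
            break;                   // TIMELIMIT reached: division deployed
        if (tCpu > tGpu)
            hi = r;                  // CPU over-burdened: barrier is an upper bound
        else
            lo = r;                  // GPU over-burdened: barrier is a lower bound
        r = 0.5 * (lo + hi);
    }
    return r;                        // CPU share of the workload
}
```

With maxIter = 10, the final step is bounded below 100% / 2^10, i.e., roughly the 0.1% precision discussed in Section 4.3.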
One more feature of this algorithm, called edge value control, is used to prevent
deadlock. Before going into detail, note that the termination condition of the algorithm
is a pre-defined TIMELIMIT: if the difference between the GPU and CPU running times is
less than this TIMELIMIT, the division is considered successfully deployed. However, the
precision limit creates a potential situation in which the GPU or CPU should do all the
work. For example, in a basic MatrixMul with 1% workload division precision, letting the
CPU do even 1% of the work lowers the total system efficiency, so the GPU should do all
the work; we add edge control to terminate the algorithm in this situation. Another
situation is that a measurement error may cause the lower and upper bounds to miss the
best balance point; in that case, we expand the lower and upper bounds to make sure the
balance point lies between them.
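A sketch of what this edge value control could look like (the one-step thresholds and the bound-widening rule are my illustrative reading of the description above, not the thesis code):

```cuda
#include <algorithm>

// Edge value control (sketch): when the ratio lands within one precision
// step of an edge, hand all work to one side and terminate, instead of
// oscillating around 0% or 100%.
double applyEdgeControl(double r, double step) {
    if (r < step)       return 0.0;  // CPU share negligible: GPU does all the work
    if (r > 1.0 - step) return 1.0;  // GPU share negligible: CPU does all the work
    return r;
}

// Bound expansion (sketch): if measurement error pushed the balance point
// outside [lo, hi], widen the range so the search can recapture it.
void expandBounds(double &lo, double &hi, double step,
                  double tCpuAtHi, double tGpuAtHi,
                  double tCpuAtLo, double tGpuAtLo) {
    if (tCpuAtHi < tGpuAtHi) hi = std::min(1.0, hi + step); // point is above hi
    if (tCpuAtLo > tGpuAtLo) lo = std::max(0.0, lo - step); // point is below lo
}
```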
3.3 Physical Test-bed and Implementation
In this section, we introduce the hardware test-bed. The physical test-bed in our
experiments consists of a workstation with an AMD Opteron CPU [8] and an NVIDIA Tesla
2070 GPU card [9]. An external power meter is connected between the workstation and a
110V AC outlet, and the workstation reads the power from the meter over USB. This
heuristic test-bed cannot perform a global optimization, due to the precision of the
time-recording function and the lack of deeper control over GPU-CPU DVFS, overclocking,
low-power states for DRAM, disk I/O states, or even turning off cores. However, this
study can easily be integrated with other optimization systems, e.g., [10], to achieve a
better energy saving.
For the implementation, our testbench was deployed on a well-configured system; the
details are listed in Table 2:
Table 2 Software Deployment on Test-bed

Software                          Version
Workstation Operating System      Ubuntu 12.04 Desktop
HPC OS                            Red Hat Enterprise Linux Server release 6.3
NVIDIA driver                     301.32
CUDA version                      4.2.18
GCC/G++ version                   4.6.x
Rodinia version                   2.0
Rodinia is the testbench suite used in our study. It provides multiple testbenches with
GPU-only, CPU-only, and GPU-CPU hybrid programs and formatted input data. However, the
input data of the programs we need are so small that multiple iterations complete within
a second, which is not enough to make the power meter reading change and settle at a
reliably readable value. I manually increased the size of the input data by simply
copying the input file contents multiple times.
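A throwaway sketch of that scaling step (the file name and repeat count are placeholders):

```cuda
#include <fstream>
#include <sstream>
#include <string>

// Inflate a Rodinia input file by appending its own contents, so one run
// lasts long enough for a stable power-meter reading.
int main() {
    const std::string path = "input.txt";    // placeholder input file
    const int copies = 16;                   // placeholder scale factor

    std::stringstream buf;
    {
        std::ifstream in(path);
        buf << in.rdbuf();                   // slurp the original contents
    }
    std::ofstream out(path, std::ios::app);  // append copies in place
    for (int i = 1; i < copies; ++i)
        out << buf.str();
    return 0;
}
```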
One defect of this implementation is that the power measurement could be more accurate: a
single power meter can only reflect the total system power with and without the GPU
working. On the other hand, the GPU card provides computation service only and does not
drive the X server, which increases the reliability of the results.
Chapter 4: Result Analysis and Conclusion
In this chapter, we discuss the testbench results. I start from the implementation on the
new GPU platform, and then present a comparison of the algorithm with the old-generation
GreenGPU.
4.1 Implementation on Tesla GPU
We use one of our test cases from Rodinia as an example: Kmeans, a data clustering
algorithm, shown in Figure 4. An interesting issue is that such algorithms do not usually
show a perfect GPU performance gain, because their loops often do not match the tail
recursion requirement and they carry heavy data dependencies within each iteration.
Figure 4 Kmeans result indicates the balanced ratio point
In this particular test, we initially assigned the GPU 95% of the total workload. Over
several iterations, the CPU execution time and its corresponding portion of the workload
increase. The end point shows the best optimized division: the GPU takes care of only
about 28% of the total workload.
The other testbench, Hotspot, provides a more general case.
Figure 5 Hotspot results prove the successful porting to the new platform
Figure 5 shows a run in which we initially assign 20% of the workload to the CPU. Due to
the difference between the GPU and CPU execution times, the workload on the GPU increases
until their execution times are almost the same. It also reveals that this version of the
Hotspot code is not a good parallel version: the GPU has 40X more cores than the CPU, yet
they finish equal workloads in the same time. Still, the changing workload shows that the
previous algorithm has been successfully ported to the new platform.
Additionally, the GreenGPU architecture also points out a situation for
fine-grained-parallelism programs: no matter how much workload is assigned to the CPU
initially, the algorithm eventually transfers all the workload to the GPU. My
implementation reproduces that situation, as shown in Figure 6.
Figure 6 NN results present a special condition
The NN results show both the workload division and the corresponding energy consumption.
After just a few iterations, all workload is assigned to the GPU side, and the CPU energy
consumption drops significantly along with the CPU's share of the workload. Here,
Division refers to the ratio CPU Energy Consumption / (CPU + GPU Energy Consumption).
4.2 Division Precision Improvement
The previous experiments implement the workload division algorithm on the new platform;
however, each iteration of the workload division moves the boundary by only 5%. In HPC
territory, 5% of the total workload can itself be a huge task. Moreover, with the
previous algorithm, the system might not reach a final division solution because it can
oscillate between two percentages: e.g., if the best division point is 52%, the algorithm
may bounce back and forth between 50% and 55%.
A heuristic solution is to reduce the step length, e.g., from 5% to 3%. In fact, this
solution gives our workload division algorithm a little more accuracy.
Figure 7 Updated Hotspot results (a)
Figure 8 Updated Hotspot results (b)
Figure 9 Updated Hotspot results (c)
Figures 7-9 show the results of reducing the step length to 3%. Although the final result
shows the best workload division point at about 51%, if the initial workload is far away
from that value, it takes too many iterations to find the point: in Figures 7 and 9, the
algorithm runs almost 20 times to get a result.
Our experiments could also use 1% instead of 3%; however, we can expect that to take
dozens of iterations to locate the point. In that case, just running the program without
division might be faster.
4.3 Binary Search Implementation
As discussed in Section 3.2, the binary search (BS) method was deployed in the algorithm.
With the same data input, it significantly reduces the number of iterations and also
reaches 0.1% precision.
Figure 10 Recursion times of the Binary Search implementation
In Figure 10, the G marks refer to the results of using GreenGPU and the B marks to our
BS algorithm. GreenGPU's iteration count varies depending on the initial GPU workload
ratio, while the BS algorithm is driven mainly by the precision target; as the figure
shows, BS reaches higher precision within the same number of iterations as GreenGPU.
In detail, within 10 iterations the new workload division algorithm finds the best
division point and confirms the location with a checking scheme. The algorithm can also
find a more accurate division percentage if we adjust the precision parameter, without
greatly increasing the algorithm's cost. The first two columns show the fixed-step
algorithm at 3% and 5% precision. Our BS algorithm spans the 3rd to the 6th column, with
5% to 0.1% precision; its iteration count is significantly reduced. The 7th and 8th
columns are a fair comparison between the 3% fixed-step division and the 0.1% BS division
on Hotspot. The 9th column is the NN testbench with an initial GPU ratio of 95%, compared
from left to right at fixed step lengths of 3% and 5% and BS at 0.1%.
From the aspect of running time and iterations, Figures 11-13 provide the comparison with
Figures 7-9.
Figure 11 BS 1% Hotspot result (a)
Figure 12 BS 1% Hotspot result (b)
Figure 13 BS 1% Hotspot result (c)
Although the results show the best division averaging about 54%, they demonstrate a great
improvement in iteration count, so the total running time is reduced. We cannot avoid
occasionally probing some of the worst CPU-GPU divisions, but the fixed step length
cannot avoid that either, and the fixed-step approach has more chances to try other bad
divisions around the worst ones.
Figure 14 Other testbench results from different initial GPU/CPU workloads
Figure 14 shows that the other test results also limit the division to about 10
iterations, which is approximately log2(1000) and therefore corresponds to a precision of
0.1%. It also shows that the total running time reaches a minimum for this specific
testbench. These results were generated on the workstation platform; the HPC platform
generates very similar results, since we used the same type of GPU card and a single
processor.
Figure 15 BS and GreenGPU Energy Consumption Comparison
From the aspect of energy, we focus only on the workload division procedure, using Kmeans
and Hotspot as examples. The G marks refer to the results of using GreenGPU and the B
marks to our BS algorithm. Depending on the initial GPU ratio, the results show that on
average we consume 2-6X less energy than the previous workload division algorithm during
the division procedure. This algorithm also provides a more accurate division solution,
which will also benefit the total energy consumption.
4.4 Conclusion and Future Study
Current studies show growing interest in energy efficiency for GPU-CPU architectures.
Based on a prototype study of GreenGPU, a heterogeneous architecture, I deploy a workload
division algorithm on the new NVIDIA Tesla GPU platform. This algorithm provides a more
accurate division solution that benefits the overall system energy consumption. Compared
to the previous GreenGPU algorithm, which needs 20-30 iterations to finish the workload
division, the binary search algorithm needs only 7-10 iterations. With fewer iterations,
the new algorithm speeds up the execution of the workload division procedure: we obtain a
2-6X faster division procedure, depending on the precision demands. That makes this
algorithm easier to integrate with other optimization systems.
In future work, the workload division unit should be placed at a higher level above the
testbenches instead of inside them. New interfaces could be implemented to divide input
data properly with an integrated data clustering method. The unit could also provide
access to cloud computation, distributing workload to GPUs or CPUs in separate external
devices.
References
[1] Kai Ma, Xue Li, Wei Chen, Chi Zhang, and Xiaorui Wang. GreenGPU: A Holistic
Approach to Energy Efficiency in GPU-CPU Heterogeneous Architectures. In The 41st
International Conference on Parallel Processing, September 2012.
[2] Yisong Li, Xuejun Yang, Tao Tang, Guibin Wang, and Xinhai Xu. An Integrated
Energy Optimization Approach for CPU-GPU Heterogeneous Systems Based on Critical
Path Analysis. Chinese Journal of Computers, Vol. 35 No.1 Jan. 2012.
[3] Anthony Danalis, Collin McCurdy, and Jeffrey S. Vetter. Efficient Quality Threshold
Clustering for Parallel Architectures. In IEEE International Parallel & Distributed
Processing Symposium (IPDPS), May 2012.
[4] C.-K. Luk, S. Hong, and H. Kim. Qilin: exploiting parallelism on heterogeneous
multiprocessors with adaptive mapping. In MICRO, 2009.
[5] H. Takizawa, K. Sato, and H. Kobayashi. SPRAT: Runtime Processor Selection for
Energy-aware Computing. In The Third International Workshop on Automatic Performance
Tuning, 2008.
[6] M. Rofouei, T. Stathopoulos, S. Ryffel, W. Kaiser, and M. Sarrafzadeh. Energy-Aware
High Performance Computing with Graphic Processing Units. In Workshop on Power Aware
Computing and Systems, December 2008.
[7] NVIDIA. NVIDIA CUDA C Programming Guide version 4.20.
http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
[8] AMD Server Processors.
http://www.amd.com/US/PRODUCTS/SERVER/PROCESSORS/Pages/serverprocessors.aspx
[9] NVIDIA. NVIDIA Tesla C2050/C2070 GPU Computing Processor.
http://www.nvidia.com/docs/IO/43395/NV_DS_Tesla_C2050_C2070_jul10_lores.pdf
[10] S. Hong and H. Kim. An integrated GPU power and performance model. In
ISCA, 2010.