Dynamic Workload Division in GPU-CPU Heterogeneous Systems

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By
Wei Chen
Graduate Program in Electrical and Computer Engineering
The Ohio State University
2013

Master's Examination Committee:
Dr. Xiaorui Wang, Advisor
Dr. Füsun Özgüner

Copyright by Wei Chen 2013

Abstract

GPUs provide powerful computational capability and great potential for efficiency optimization. As a result, CPU-GPU heterogeneous architectures remain a focal point of high performance computing. However, energy consumption is still the bottleneck of the entire system when the system and its corresponding framework must perform massive-scale computation. Most existing studies focus on lowering GPU power requirements alone; they do not consider the CPU and GPU as a single architecture. This thesis builds on the GreenGPU heterogeneous architecture. Because the work targets a new generation of platform, one of its assumptions is that the operating system and its corresponding driver already adjust DVFS to a near-optimal point. I implement a workload division algorithm using Tesla CUDA GPUs and AMD CPUs to balance the execution-time difference caused by the workload split. Results on a real physical testbed show that the new workload division algorithm is at least 5X more accurate than the previous algorithm, without extra energy cost in the workload division procedure. This more accurate workload division can reduce overall system energy consumption, especially when the workload is large.

Dedication

This document is dedicated to my family.

Acknowledgments

First and foremost, I would like to express my sincere gratitude to my advisor, Dr. Xiaorui Wang, for his insightful inspiration and dedicated teaching. His invaluable advice and constructive guidance motivated me to explore the intricate territory of energy efficiency in high performance computing. My gratitude also goes to my fellow student Kai Ma for his discussion of and comments on my research topics. Additionally, I would like to thank Dr. Füsun Özgüner for being a member of my thesis committee.

Vita

2011 .............................. B.S. Electrical Information Engineering, Sichuan University
2012 to present ............ M.S., Department of Electrical and Computer Engineering, The Ohio State University

Publications

Kai Ma, Xue Li, Wei Chen, Chi Zhang, and Xiaorui Wang. "GreenGPU: A Holistic Approach to Energy Efficiency in GPU-CPU Heterogeneous Architectures". Accepted by ICPP 2012.

Zitao Liu, Wenchao Yu, Wei Chen, Shuran Wang, and Fengyi Wu. "Short Text Feature Selection for Micro-Blog Mining". Accepted by CiSE 2010.

Fields of Study

Major Field: Electrical and Computer Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
Chapter 2: GPU-CPU Heterogeneous Architecture
  2.1 Related Work
  2.2 Motivation
  2.3 Contribution
Chapter 3: System Design and Implementation
  3.1 System Design
  3.2 Algorithm Implementation
  3.3 Physical Test-bed and Implementation
Chapter 4: Result Analysis and Conclusion
  4.1 Implementation on Tesla GPU
  4.2 Division Precision Improvement
  4.3 Binary Search Implementation
  4.4 Conclusion and Future Study
References

List of Tables

Table 1 Relationship of Division Ratio and Power Reduction
Table 2 Software Deployment on Test-bed

List of Figures

Figure 1 Single processor performance reaches a bottleneck
Figure 2 CUDA Execution Path
Figure 3 Algorithm running details: how the boundaries move
Figure 4 Kmeans result indicates the balanced ratio point
Figure 5 Hotspot results prove the successful port to the new platform
Figure 6 NN results present a special condition
Figure 7 Updated Hotspot results (a)
Figure 8 Updated Hotspot results (b)
Figure 9 Updated Hotspot results (c)
Figure 10 Iteration counts of the Binary Search implementation
Figure 11 BS 1% Hotspot result (a)
Figure 12 BS 1% Hotspot result (b)
Figure 13 BS 1% Hotspot result (c)
Figure 14 Other test bench results from different initial GPU/CPU workloads
Figure 15 BS and GreenGPU energy consumption comparison

Chapter 1: Introduction

The development of semiconductor technology keeps increasing the number of transistors on a single die, so the computational capability of processors keeps growing. The graphics processing unit (GPU) was originally designed to accelerate the construction of images with a relatively simple architecture; however, its chip resources are now considered to have great computational potential. With architectural advances, especially in core count, the GPU can deliver much better performance than a general-purpose CPU on suitable workloads. Additionally, the GPU programming environment keeps improving: NVIDIA provides the CUDA SDK, the Nsight integrated development environment for Visual Studio, and an Eclipse-based Nsight edition for Linux/UNIX. These developments have turned the GPU into a practical heterogeneous accelerator for the CPU in parallel computation, a central topic in high performance computing.

Single-processor performance improvement largely stopped in 2003, almost ten years ago, as the red line in Figure 1 shows; instead, multiprocessors exploiting parallelism have taken over the trend. Parallelism provides new models such as data-, thread-, and request-level parallelism, but these models require explicit restructuring of the application. A good starting point for discussion is Flynn's taxonomy. In this computer architecture classification, the Single Instruction Multiple Data (SIMD) model, in which a single instruction processes multiple data segments of a given workload, can provide better performance and energy efficiency [1]. NVIDIA CUDA-enabled GPUs are optimized for SIMD execution; that capability helped create Tianhe-1A, a top-10 supercomputer as of June 2012, built on a heterogeneous architecture of Xeon X5670 CPUs and NVIDIA Tesla 2050 GPUs [2].

Figure 1 Single processor performance reaches a bottleneck

This GPU-CPU heterogeneous architecture provides a huge performance gain and, at the same time, offers potential energy-saving opportunities. The current generation of GPUs has hundreds of cores. Although new generations of CPUs also have multi-core architectures, the GPU has a clear advantage in power consumption per core, so involving the GPU in the computation is a good choice. This does not mean, however, that the GPU should do all the work: a CPU sitting idle is simply wasted power. We therefore try to move part of the workload to the idle CPU and keep it working. The intuition is that involving the CPU reduces the total running time of a given workload, which in turn reduces the total energy consumption.
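To make the SIMD model discussed above concrete, the following minimal CUDA kernel (an illustrative sketch, not code from this thesis's test-bed) applies one instruction stream to many data elements in parallel:

```cuda
// Illustrative SIMD-style kernel: every thread executes the same
// instruction on a different data element.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n)
        c[i] = a[i] + b[i];  // same instruction, different data
}

// Host launch: cover all n elements with 256-thread blocks, e.g.
// vectorAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```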
The rest of the thesis is organized as follows. Chapter 2 presents the background of GPU-CPU heterogeneous study and the motivation. Chapter 3 describes the system design, presents the algorithm in detail together with its implementation, and then discusses the real physical test-bed on which we obtain reliable results. Finally, Chapter 4 presents the results and their analysis, and points out directions for future research.

Chapter 2: GPU-CPU Heterogeneous Architecture

In this chapter, we first discuss the related work and achievements in this territory, then point out the motivation of this study. The contributions of this thesis are listed at the end of the chapter.

2.1 Related Work

Workload division has attracted researchers' attention in recent years. This thesis is based on the GreenGPU prototype. Ma et al. [1] propose the GreenGPU architecture, a two-tier solution that dynamically splits and distributes workload between GPU and CPU in the first tier and performs DVFS control on both CPU and GPU in the second tier. Luk et al. [4] propose the Qilin framework, which dynamically schedules different workloads onto different processing elements to reduce total execution time; however, energy consumption was not discussed in that work. Early research on GPU power has produced multiple improvements in energy and power efficiency. Takizawa et al. [5] proposed a stream programming framework that dynamically selects the best available processor for running a specified workload task on a CPU-GPU hybrid system. Rofouei et al. [6] report experimental results comparing the cost/performance of GPU operation against a CPU-only system. These studies show the optimization potential of GPU-CPU heterogeneous architectures.

2.2 Motivation

Power and energy are critical issues in building HPC systems. No matter how large an HPC system is, a thermal design power serves as the target for its power supply and cooling system. New architectures and frameworks are implemented in such large systems, and even in smaller servers and workstations, so that processor clock rates can be adjusted dynamically to reduce total power consumption. We compare dynamic energy with dynamic power. The dynamic energy is

\[ E_{dynamic} = \frac{1}{2} \times C_{load} \times V^{2} \]

and the dynamic power is

\[ P_{dynamic} = \frac{1}{2} \times C_{load} \times V^{2} \times f \]

where \(C_{load}\) is the capacitive load, \(V\) the supply voltage, and \(f\) the switching frequency. This small difference shows that adjusting the frequency alone does nothing to the energy of a fixed computation; it only affects the power of the system.
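A one-line derivation under the standard CMOS dynamic power model above makes this explicit (here \(N\), the number of switched cycles in a fixed task, is an introduced symbol):

\[
E_{dynamic} = P_{dynamic} \times t
            = \left( \frac{1}{2}\, C_{load}\, V^{2}\, f \right) \times \frac{N}{f}
            = \frac{1}{2}\, C_{load}\, V^{2}\, N
\]

The frequency cancels out, which is why DVFS saves energy only insofar as a lower frequency permits a lower voltage, and energy scales with \(V^{2}\).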
Nevertheless, hardware development now ships drivers and corresponding operating system support that perform encapsulated DVFS control. Recent research also shows that, due to limited memory locality and its synchronization scheme, massively parallel GPU programs may produce less accurate results.

One highlight of GreenGPU is that Ma et al. implemented workload division on a real physical test-bed with AMD Phenom II CPUs and an NVIDIA GeForce 8800 GTX GPU. To reproduce and improve their experiments, I selected a newer hardware platform, the NVIDIA Tesla 2070. Driven by the reliability requirements of HPC, NVIDIA has since improved its power management system: the updated system performs a better DVFS management scheme and prohibits the manual DVFS configuration that was possible on the earlier GeForce platform. Similarly, CPU power configuration has moved to a new scheduling scheme. I can therefore assume that the existing drivers find the best DVFS level, which makes one tier of GreenGPU unnecessary. However, the 5% CPU-GPU workload division precision of the GreenGPU architecture does not match the requirements of real-world workloads; e.g., many biometric Quality Threshold Clustering (QTC) tasks have huge data inputs and run for days, so even 0.1% of the workload can make a difference in execution time and total energy cost [3].

2.3 Contribution

Based on the previous work, I simplify the previous GreenGPU to a one-tier solution. In particular, the contributions of this thesis are: 1) implementing the modified GreenGPU architecture on a current CPU-GPU physical test-bed; 2) improving the workload division precision by introducing binary search; 3) improving the running time of the workload division algorithm; and 4) deploying edge value control to prevent potential deadlocks.

Chapter 3: System Design and Implementation

In this chapter, we discuss the design of the workload division system and the methodology of the algorithm. We also provide details about the implementation of the real physical test-bed.

3.1 System Design

In a GPU-CPU heterogeneous parallel system, the GPU and CPU are usually connected by an external bus and have their own individual off-chip DRAM storage, so they communicate via DMA. Generally, the CPU and GPU work cooperatively in master-slave mode: the CPU is the master end, organizing the data and scheduling the execution of the GPU. Figure 2 [7] shows classic CUDA program pseudo-code: serial host code (the CPU task) alternates with parallel device kernels such as KernelA<<<nBlk,nTid>>>(args) and KernelB<<<nBlk,nTid>>>(args), with memory copies to and from the device. The GPU operations, including the data transfers between CPU and GPU storage (the communication task) and the running kernel, form the GPU task. Because CPU and GPU execution is relatively independent and their data transfers can be performed by a dedicated DMA module, the CPU task, the GPU task, and the communication task can run simultaneously when there is no data dependency. Additionally, CUDA provides asynchronous interfaces and functions to ensure a program runs properly at different levels of synchronization.

Figure 2 CUDA Execution Path

Given this execution model, the CPU or GPU may waste energy at the synchronization points: because of their different computational capabilities and workloads, one of them may have already stopped working and switched into idle mode. From the standpoint of energy consumption, such a workload arrangement is intuitively not optimal. One possible solution is to run the "quicker" task at a low frequency and voltage so that it reaches the synchronization point together with the "slower" task, saving energy. Another solution is to transfer some of the workload to the free processor and keep it busy.

The workload division unit lives in the serial CPU code and dynamically adjusts the workload split between CPU and GPU, iteration by iteration, based on their execution times, in order to eliminate idle time. Another basic assumption is that the operations inside each iteration are similar; thus, after a few iterations, the next iteration's division ratio can be predicted within a reasonable range. Although the input data varies even within a single huge task, neighboring fetched data and incoming data are similar in many ways. Thus, when the workload division program selects a reasonable ratio for the current program segment, that ratio can still be considered usable for a short period of time, based on the stability of the input data. Fortunately, before running such complex tasks, people usually have a general view of the distribution of the input data. A periodically invoked workload division unit can therefore perform a reliable and efficient energy-optimizing adjustment.
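As an illustration of how the host can overlap and time the two sides, the following hedged sketch launches the GPU portion asynchronously on a stream while the CPU portion runs on the host thread; the kernel, the CPU task, and all buffer names are placeholders, not the thesis code:

```cuda
#include <cuda_runtime.h>
#include <chrono>

// Placeholder device and host tasks (illustrative only).
__global__ void scaleKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

static void processOnCpu(float *data, int n)
{
    for (int i = 0; i < n; ++i) data[i] *= 2.0f;
}

void runIteration(float *h_buf, float *d_in, float *d_out,
                  int gpuElems, int cpuElems, double &tGpu, double &tCpu)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // GPU task plus communication task on the stream; the host thread stays
    // free. (Pinned host memory would be needed for a truly async copy.)
    cudaEventRecord(start, stream);
    cudaMemcpyAsync(d_in, h_buf, gpuElems * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    scaleKernel<<<(gpuElems + 255) / 256, 256, 0, stream>>>(d_in, d_out, gpuElems);
    cudaEventRecord(stop, stream);

    // CPU task runs concurrently on the remaining elements.
    auto c0 = std::chrono::steady_clock::now();
    processOnCpu(h_buf + gpuElems, cpuElems);
    auto c1 = std::chrono::steady_clock::now();
    tCpu = std::chrono::duration<double, std::milli>(c1 - c0).count();

    cudaEventSynchronize(stop);              // the synchronization point
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    tGpu = ms;                               // tGpu vs. tCpu drives the division unit

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
}
```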
3.2 Algorithm Implementation

The input data can be divided by lines: a barrier in the input stream decides which part goes to the GPU and which to the CPU. Accordingly, Table 1 shows how the position of the barrier, i.e., the workload ratio r = workload on CPU / total workload, affects the running time and the potential energy reduction.

Table 1 Relationship of Division Ratio and Power Reduction

    Value of r       Running time       Potential energy reduction
    r > r_mean[i]    t_CPU > t_GPU      P_CPU * (t_CPU - t_GPU)
    r < r_mean[i]    t_CPU < t_GPU      P_GPU * (t_GPU - t_CPU)
    r = r_mean[i]    t_CPU = t_GPU      0

At the beginning, the workload division unit is assigned an initial guess, which may vary from program to program; the default is 50%. The ratio can start as an integer percentage or a floating-point number.

The basic idea of the workload division is to use binary search to find the optimal point as quickly as possible. The test code runs several times on a sample input; each iteration pushes either the upper bound or the lower bound of the division range closer to the expected optimal point.

Figure 3 Algorithm running details: how the boundaries move

In detail, as shown in Figure 3, the orange part is the ratio r running on the CPU and the green part is the ratio running on the GPU. We use the running time of each configuration as the measurement. If r is larger than it should be, the CPU spends too long executing its portion of the workload and is considered over-burdened, so this barrier position becomes the upper bound of the division range. Similarly, if the GPU takes longer to finish its part, the barrier position becomes the lower bound. This oscillation eventually locates a reasonable workload division with the expected precision. Because the algorithm is a binary search, the precision reaches 1% within 7 iterations, since

\[ 100\% \div 2^{7} < 1\% \]

After the algorithm reaches 1% precision, we can let it go deeper to find a more precise division ratio if required.

One more feature of this algorithm, called edge value control, prevents deadlock. First, the termination condition of the algorithm is a predefined TIMELIMIT: if the difference between the GPU and CPU running times is below TIMELIMIT, the division is considered successfully deployed. However, the precision limit creates situations in which the GPU or CPU should simply do all the work. For example, in a basic MatrixMul with a 1% division precision, letting the CPU do even 1% of the work lowers total system efficiency; we should let the GPU do everything. We add edge control to terminate the algorithm in this situation. Another situation is that measurement error may push the lower and upper bounds past the best balance point; in that case, we expand the lower and upper bounds to make sure the balance point lies between them.
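The following is a minimal sketch of the division loop described above; the runSplit callback (one iteration at a given CPU ratio, returning both execution times) and the constant values are assumed placeholders, not the thesis implementation:

```cpp
#include <cmath>
#include <functional>

// One iteration at CPU ratio r reports both execution times.
struct SplitTimes { double tCpu, tGpu; };

// Binary-search workload division with edge value control.
double divideWorkload(const std::function<SplitTimes(double)> &runSplit,
                      double rInit = 0.5)
{
    const double TIMELIMIT = 0.05;   // acceptable |tCpu - tGpu|, assumed value
    const double PRECISION = 0.001;  // 0.1% target division precision
    double lo = 0.0, hi = 1.0;       // search bounds on the CPU share r
    double r = rInit;

    while (hi - lo > PRECISION) {
        SplitTimes t = runSplit(r);

        if (std::fabs(t.tCpu - t.tGpu) < TIMELIMIT)
            return r;                        // balanced: division deployed

        if (t.tCpu > t.tGpu) hi = r;         // CPU over-burdened: new upper bound
        else                 lo = r;         // GPU over-burdened: new lower bound
        r = 0.5 * (lo + hi);                 // bisect the remaining range

        // Edge value control: near the edges, hand all work to one side
        // rather than oscillating forever below the precision limit.
        if (r < PRECISION)       return 0.0; // GPU takes everything
        if (r > 1.0 - PRECISION) return 1.0; // CPU takes everything
    }
    // If measurement noise pushed the bounds past the balance point, the
    // caller can expand [lo, hi] and retry (omitted here for brevity).
    return r;
}
```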
3.3 Physical Test-bed and Implementation

In this section, we introduce the hardware test-bed. The physical test-bed in our experiments consists of a workstation with an AMD Opteron CPU [8] and an NVIDIA Tesla 2070 GPU card [9]. An external power meter is connected between the workstation and the 110V AC outlet, and the workstation reads the power from the meter over USB. This heuristic test-bed cannot perform a global optimization, due to the precision of the time-recording function and the lack of deep access to GPU-CPU DVFS control, overclocking, low-power states for DRAM, disk I/O states, or even turning off cores. However, this study could easily be integrated with other optimization systems, e.g., [10], to achieve a better energy-saving gain.

For the implementation, our test benches were deployed on a well-configured system; the details are listed in Table 2:

Table 2 Software Deployment on Test-bed

    Software                          Version
    Workstation operating system      Ubuntu 12.04 Desktop
    HPC OS                            Red Hat Enterprise Linux Server release 6.3
    NVIDIA driver                     301.32
    CUDA version                      4.2.18
    GCC/G++ version                   4.6.x
    Rodinia version                   2.0

Rodinia is the test bench suite used in this study. It provides multiple benchmarks as GPU-only, CPU-only, and GPU-CPU hybrid programs with formatted input data. However, the input data of the programs we need is so small that multiple iterations complete within a second, which is not enough to change the power meter reading and hold it at a reliably readable value. I manually increased the size of the input data by copying the input file contents multiple times.

One defect of this implementation is that the power measurement should be more accurate: a single power meter can only reflect the total system power, including the GPU. On the other hand, the GPU card provides computation service only and does not drive the X server, which increases the reliability of the results.
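Since the meter exposes only instantaneous total-system power over USB, the energy of an interval has to be integrated from periodic samples. A hedged sketch of that accumulation follows; the readPower callback is a placeholder for whatever the meter's USB protocol returns (watts):

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Integrate energy (joules) over a measurement window by sampling the
// meter and applying the trapezoidal rule between consecutive readings.
double measureEnergyJoules(const std::function<double()> &readPower,
                           double windowSeconds, double sampleHz = 10.0)
{
    using clock = std::chrono::steady_clock;
    const double dt = 1.0 / sampleHz;

    double energy = 0.0;
    double prev = readPower();
    auto t0 = clock::now();

    while (std::chrono::duration<double>(clock::now() - t0).count() < windowSeconds) {
        std::this_thread::sleep_for(std::chrono::duration<double>(dt));
        double cur = readPower();
        energy += 0.5 * (prev + cur) * dt;   // trapezoid: watts * seconds = joules
        prev = cur;
    }
    return energy;
}
```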
Chapter 4: Result Analysis and Conclusion

In this chapter, we discuss the test bench results. I start from the implementation on the new GPU platform and then present an algorithm comparison with the old-generation GreenGPU.

4.1 Implementation on Tesla GPU

We use one of our Rodinia test cases, Kmeans, a data clustering algorithm, as an example (Figure 4). An interesting issue is that such clustering algorithms do not usually show a perfect GPU performance gain, because their loops rarely match the tail-recursion requirement and they carry heavy data dependencies within each iteration.

Figure 4 Kmeans result indicates the balanced ratio point

In this particular test, we initially assigned the GPU 95% of the total workload. Over several iterations, the CPU execution time and its corresponding portion of the workload increase. The end point shows the optimal division: the GPU handles only about 28% of the total workload.

The Hotspot test bench provides a more general case.

Figure 5 Hotspot results prove the successful port to the new platform

Figure 5 shows a run that starts with 20% of the workload on the CPU. Because of the difference between GPU and CPU execution times, the workload on the GPU increases until the two execution times are almost equal. It also reveals that this version of the Hotspot code is not well parallelized: the GPU has roughly 40X as many cores as the CPU, yet they finish equal shares of work in the same time. The changing workload also confirms that the previous algorithm has been successfully ported to the new platform.

Additionally, the GreenGPU work pointed out a special situation for finely parallel programs: no matter how much workload is assigned to the CPU initially, eventually all of it is transferred to the GPU. My implementation reproduces that situation, as shown in Figure 6.

Figure 6 NN results present a special condition

The NN results provide both the workload division and its corresponding energy consumption. After just a few iterations, all workload is assigned to the GPU side, and CPU energy consumption drops significantly along with the CPU's share of the workload. Here "division" refers to the ratio CPU energy / (CPU + GPU energy).

4.2 Division Precision Improvement

The previous experiments port the workload division algorithm to the new platform; however, each iteration of that division moves the boundary by only 5%. In HPC territory, 5% of a total workload can itself be a huge task. Moreover, with the previous algorithm the system may never settle on a final division, because it can oscillate between two percentages: e.g., if the optimal division point is 52%, the algorithm may bounce back and forth between 50% and 55%.

A heuristic solution is to reduce the step length, e.g., from 5% to 3%. This does make the workload division somewhat more accurate.

Figure 7 Updated Hotspot results (a)

Figure 8 Updated Hotspot results (b)

Figure 9 Updated Hotspot results (c)

Figures 7-9 show the results of reducing the step length to 3%. Although the final result puts the best division point at about 51%, when the initial workload is far from that value it takes too many iterations to find it: in Figures 7 and 9 the algorithm runs almost 20 times to get a result. We could also choose 1% instead of 3%, but we can expect that to take dozens of iterations to locate the point; in that case, simply running the program without division might be faster.
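As a rough back-of-the-envelope comparison (an illustrative calculation, not a measured result): a fixed step of size s starting from ratio r_0 needs about |r_0 - r*| / s iterations to reach the optimum r*, while binary search needs about log2(1/precision) iterations regardless of the starting point. For r_0 = 95% and r* = 51%:

\[
\frac{|0.95 - 0.51|}{0.03} \approx 15 \ \text{iterations with 3\% steps}, \qquad
\left\lceil \log_{2} \frac{1}{0.001} \right\rceil = 10 \ \text{iterations with 0.1\% binary search}.
\]

This is consistent with the roughly 20 runs observed above, once the oscillation around the optimum is included.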
4.3 Binary Search Implementation

As discussed in Section 3.2, the binary search (BS) method was deployed in the algorithm. With the same data input, it significantly reduces the number of iterations while reaching 0.1% precision.

Figure 10 Iteration counts of the Binary Search implementation

In Figure 10, the G marks refer to the results of GreenGPU and the B marks to our BS algorithm. GreenGPU's iteration count varies depending on the initial GPU workload ratio, whereas the BS algorithm is driven mainly by the target precision; BS also reaches higher precision within the same number of iterations as GreenGPU. In detail, within 10 iterations the new workload division algorithm finds the best division point and confirms that location with a checking scheme. The algorithm can also find a more accurate division percentage if we tighten the precision parameter, without greatly increasing the algorithm's cost. The first two columns show the fixed-step algorithm at 3% and 5% precision. Our BS algorithm occupies the 3rd through 6th columns, from 5% down to 0.1% precision; its iteration count is significantly reduced. The 7th and 8th columns are a fair comparison between Hotspot with 3% fixed-step division and with 0.1% BS division. The 9th column is the NN test bench with an initial GPU ratio of 95%, comparing, from left to right, fixed steps of 3% and 5% and BS at 0.1%.

In terms of running time and iterations, Figures 11-13 can be compared with Figures 7-9.

Figure 11 BS 1% Hotspot result (a)

Figure 12 BS 1% Hotspot result (b)

Figure 13 BS 1% Hotspot result (c)

Although these results put the best division at an average of about 54%, they represent a great improvement in iteration count, so the total running time drops because of far fewer iterations. We cannot avoid occasionally visiting some bad CPU-GPU divisions, but the fixed step length does not avoid them either, and it has more chances to try other bad divisions near the worst ones precisely because of its fixed step.

Figure 14 Other test bench results from different initial GPU/CPU workloads

Figure 14 shows that the other test results also limit the division to about 10 iterations, which is approximately \(\log_{2}(1000)\) and corresponds to a precision of 0.1%. It also shows that the total running time reaches a minimum value for this specific test bench. These results were generated on the workstation platform; the HPC platform produces very similar results, since we used the same type of GPU card and a single processor.

Figure 15 BS and GreenGPU energy consumption comparison

From the energy standpoint, we focus only on the workload division procedure, using Kmeans and Hotspot as examples. Again the G marks refer to GreenGPU and the B marks to our BS algorithm. Depending on the initial GPU ratio, the results show that we consume on average 2-6X less energy during the division procedure than the previous workload division algorithm. The more accurate division the algorithm produces also benefits the total energy consumption.

4.4 Conclusion and Future Study

Current research shows growing interest in energy efficiency for GPU-CPU architectures. Based on the GreenGPU prototype study of a heterogeneous architecture, I deployed a workload division algorithm on the new NVIDIA Tesla GPU platform. This algorithm provides a more accurate division, which benefits the overall system energy consumption. Where the previous GreenGPU algorithm needs 20-30 iterations to finish the workload division, the Binary Search algorithm needs only 7-10. With fewer iterations, the new algorithm speeds up the execution of the division procedure: we obtain a 2-6X faster workload division procedure depending on the precision demanded, which also makes the algorithm easier to integrate with other optimization systems.

In future work, the workload division unit should be placed at a higher level, above the test benches rather than inside them. New interfaces could be implemented to divide input data properly with an integrated data clustering method, and access to cloud computation could distribute workload to GPUs or CPUs in separate external devices.
References

[1] Kai Ma, Xue Li, Wei Chen, Chi Zhang, and Xiaorui Wang. GreenGPU: A Holistic Approach to Energy Efficiency in GPU-CPU Heterogeneous Architectures. In The 41st International Conference on Parallel Processing (ICPP), September 2012.

[2] Yisong Li, Xuejun Yang, Tao Tang, Guibin Wang, and Xinhai Xu. An Integrated Energy Optimization Approach for CPU-GPU Heterogeneous Systems Based on Critical Path Analysis. Chinese Journal of Computers, Vol. 35, No. 1, January 2012.

[3] Anthony Danalis, Collin McCurdy, and Jeffrey S. Vetter. Efficient Quality Threshold Clustering for Parallel Architectures. In IEEE International Parallel & Distributed Processing Symposium (IPDPS), May 2012.

[4] C.-K. Luk, S. Hong, and H. Kim. Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping. In MICRO, 2009.

[5] H. Takizawa, K. Sato, and H. Kobayashi. SPRAT: Runtime Processor Selection for Energy-aware Computing. In The Third International Workshop on Automatic Performance Tuning, 2008.

[6] M. Rofouei, T. Stathopoulos, S. Ryffel, W. Kaiser, and M. Sarrafzadeh. Energy-Aware High Performance Computing with Graphic Processing Units. In Workshop on Power Aware Computing and Systems, December 2008.

[7] NVIDIA. NVIDIA CUDA C Programming Guide, version 4.2. http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf

[8] AMD Server Processors. http://www.amd.com/US/PRODUCTS/SERVER/PROCESSORS/Pages/serverprocessors.aspx

[9] NVIDIA. NVIDIA Tesla C2050/C2070 GPU Computing Processor. http://www.nvidia.com/docs/IO/43395/NV_DS_Tesla_C2050_C2070_jul10_lores.pdf

[10] S. Hong and H. Kim. An Integrated GPU Power and Performance Model. In ISCA, 2010.