Graphic Processing Unit (GPU) Based Hardware Acceleration for High Speed 3D Cone-Beam Computed Tomography (CT) Reconstruction

Szi-Wen Chen and Chang-Yuan Chu
Department of Electronic Engineering, Chang Gung University, Tao-Yuan, Taiwan

Abstract— The use of cone-beam projection based Computed Tomography (CT) is growing in the clinical area because it can directly provide 3D information. At the same time, rapid volumetric image reconstruction is of paramount importance to clinicians for prompt diagnosis and analysis of complex tissue alterations. To meet the high computational demand of 3D image reconstruction algorithms, High Performance Computing (HPC) platforms such as the Graphic Processing Unit (GPU) offer a possible solution for such a computation-demanding application. This study parallelizes an analytical 3D CT image reconstruction algorithm, the so-called FDK method, and implements it on a commercial GPU for high-speed 3D image reconstruction. Our performance evaluation indicates that the GPU can significantly accelerate the execution of the 3D image reconstruction task.

Keywords— Computed Tomography (CT), Graphic Processing Unit (GPU), hardware acceleration, 3D reconstruction.

I. INTRODUCTION

Computed Tomography (CT) has become an important noninvasive tool in diagnostic medicine. To meet the requirements of high quality, safety, and time efficiency, the modern CT technique employs a cone-beam X-ray source that produces the 2D projections required for directly reconstructing a 3D volume. This is because the cone-beam technique can scan a wider area, in a shorter time, than conventional multi-slice reconstruction. In fact, cone-beam CT is applied not only to medical diagnosis, but also to non-destructive inspection systems and explosive detection systems in airports.
However, 3D reconstruction remains a difficult problem. According to [1], volume reconstruction from 2D projections relies on algorithms with high computational complexity O(N⁴), where N is the number of detector pixels in one detector row. As a result, a cone-beam X-ray CT device usually requires special accelerating hardware for the 3D image reconstruction. For example, when using 400 views of 2D projections, each 1024×1024 pixels in floating-point format, the input projection data occupies approximately 1.6 GB, and the output reconstruction of a 1024³-voxel volume occupies 4 GB. It would also take over an hour to execute the 3D reconstruction on a single-core PC [2]. Clearly, this tremendous demand for computing power and memory is a present barrier to CT applications.

In fact, multi-core technology has become widely adopted and has been demonstrated to be very useful in various High Performance Computing (HPC) applications. With powerful dual-core, quad-core, or even eight-core processors, we may find new solutions to a variety of CT image reconstruction problems, which are considered massive computational tasks [1]-[6]. On the other hand, the recent development of Graphic Processing Units (GPUs) has opened HPC up to a number of scientific problems, including the hardware acceleration of CT reconstruction [7]-[10]. Since a current GPU offers massive multithreading capability that can handle the computational complexity of 3D cone-beam reconstruction, the GPU can be even more powerful than a multi-core CPU for this task. GPUs have become even more popular for HPC applications since NVIDIA introduced a C-like programming environment called the Compute Unified Device Architecture (CUDA). In this paper, a GPU based implementation of 3D cone-beam CT image reconstruction is presented. This paper is organized as follows.
Section II describes the GPU employed in this study, the FDK algorithm (a well-known analytical 3D image reconstruction method), and the proposed parallel implementation strategy of the FDK algorithm on the GPU. Section III evaluates the performance by comparing the results of our implementation with those obtained by a number of previous related works. Finally, section IV briefly concludes the paper.

Á. Jobbágy (Ed.): 5th European IFMBE Conference, IFMBE Proceedings 37, pp. 567–570, 2011. www.springerlink.com

II. MATERIALS AND METHODS

A. The GPU Architecture and CUDA-based Programming

The GPU architecture was originally developed to offload computer graphics rendering from the microprocessor. In addition to efficiently manipulating computer graphics, the highly parallel architecture of modern GPUs makes them even more effective than general-purpose CPUs for a wide range of complex HPC applications. In fact, computing intensity is growing much faster on the GPU than on the CPU, since most of the transistors in a GPU are devoted to data calculations for computer graphics or scientific computational algorithms rather than to data caching. One may therefore expect the GPU to be perfectly suited to problems that can be well formulated and expressed as data-parallel calculations, so that these calculations may be executed simultaneously on many processing units, referred to as thread processors [9].

Because the native GPU programming model is inconvenient and clumsy, NVIDIA created a C-like software platform, called the Compute Unified Device Architecture (CUDA), for massively parallel high-performance computing on NVIDIA GPUs. CUDA includes C/C++ development tools, a number of function libraries that simplify programming, and a hardware abstraction mechanism [11]. With CUDA, a programmer first analyzes the algorithm to be implemented, as well as its data, to determine the optimal numbers of threads, blocks, and grids (as defined below) so that the GPU may be fully utilized. The implementation can then be expressed simply by writing C/C++ code for CUDA. On current NVIDIA architectures, the resulting compiled CUDA function (known as a "kernel" in CUDA's terms) can be called by multiple threads, so the kernel can run concurrently in large blocks of threads. A thread block is defined as a number of threads that execute synchronously on a cluster of thread processors and operate on the same data in shared memory. Blocks that execute the same kernel can be further grouped into a grid of blocks. Therefore, the maximum number of concurrent threads a GPU devotes to executing a single kernel can be very large.

B. The Feldkamp/FDK Algorithm

Similar to the original filtered back-projection algorithm, the Feldkamp algorithm [12], alternatively known as the FDK algorithm, was developed specifically for cone-beam CT reconstruction. Fig. 1 shows a schematic picture of the cone-beam CT system with a planar detector.

Fig. 1. A schematic drawing of the cone beam CT system.

As an analytical reconstruction method, the FDK algorithm conducts a series of steps that transform a number of 2D projections taken on the planar detector into a 3D reconstructed region. The FDK algorithm mainly consists of the following three steps:

1. Weighting the projection data P(u,v,θ).
2. Filtering the weighted projection image row by row.
3. Accumulating the back-projected data contributed by each projection into the corresponding voxel r(x,y,z),

where P(u,v,θ) represents the projection image pixel indexed by (u,v), and θ is the incident angle of the X-ray.
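Steps 1 and 2 above can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' code: the flat-detector pixel pitches `du`, `dv`, the source-to-detector distance `D`, and the exact weight w(u,v) = D/√(D² + u² + v²) are assumptions consistent with the standard FDK formulation, and the ramp filter is realized as a simple FFT multiplication by |f| per detector row.

```python
import numpy as np

def fdk_weight(proj, du, dv, D):
    """Step 1: weight each detector pixel by D / sqrt(D^2 + u^2 + v^2).

    proj : 2D projection, rows indexed by v, columns by u.
    du, dv : detector pixel pitches; D : source-to-detector distance.
    (All geometry parameters here are illustrative assumptions.)
    """
    nv, nu = proj.shape
    u = (np.arange(nu) - (nu - 1) / 2.0) * du   # centered u coordinates
    v = (np.arange(nv) - (nv - 1) / 2.0) * dv   # centered v coordinates
    uu, vv = np.meshgrid(u, v)
    return proj * D / np.sqrt(D**2 + uu**2 + vv**2)

def ramp_filter_rows(proj):
    """Step 2: high-pass (ramp) filter each detector row via the FFT."""
    nu = proj.shape[1]
    ramp = np.abs(np.fft.fftfreq(nu))           # Ram-Lak ramp |f|
    return np.real(np.fft.ifft(np.fft.fft(proj, axis=1) * ramp, axis=1))
```

Because the ramp is zero at f = 0, the filter removes the DC component of every row, which is what counteracts the smoothing introduced by the later accumulation step.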
First, the projection data P(u,v,θ) is weighted by a weighting function, denoted as w, which accounts for the variation in intensity of each ray due to the shape of the cone-beam X-ray. Then, as in the original filtered back-projection reconstruction, each projection image is preprocessed by a high-pass filter in order to reduce the smoothing effect caused by the accumulation process. Finally, the value of an individual voxel r(x,y,z) is estimated by summing the corresponding filtered projection pixels Pf(u,v,θ) over all projection angles θ, where Pf(u,v,θ) denotes the weighted and filtered 2D projection data obtained at projection angle θ. It should be noted that the coordinates (x,y,z) and (u,v,θ) are related by a two-step transformation: first rotating (x,y) into (s,t) by an angle θ, and then projecting the 3D coordinate (s,t,z) onto the 2D detector coordinate (u,v) through the cone-beam projection, as illustrated in Fig. 1. Therefore, given a 3D coordinate (x,y,z), the FDK algorithm allows one to find the corresponding projection coordinate (u,v) for a given θ, and then accumulate the associated projection data P(u,v,θ) into the reconstruction voxel r(x,y,z).

C. GPU-based Implementation

As described previously, the FDK reconstruction method consists of three steps: 1) weighting the projection images, 2) ramp-filtering the projection data row by row, and 3) back-projecting the filtered projection data into the voxel values. Since the third step is computationally intensive, with more than 95% of the total reconstruction time devoted to back-projection, we mainly focus here on the GPU-based implementation of this portion of the reconstruction process. In this aspect, the weighted and filtered projection data set was divided into smaller subsets, and each subset was consecutively uploaded to the GPU texture memory, since storing the data locally reduces the number of off-chip memory accesses and thus enhances performance.

A voxel-based back-projection is proposed in this study. Note that the calculation of each voxel value is performed by a single thread. In each thread, the coordinate transformation for the voxel is computed first in order to locate its corresponding projection data. The 3D volumetric region is partitioned into 2D slices, and each slice is further divided into a number of rows of voxels. As a result, in our application each row, or portion of a row, consisting of 256 voxels forms a thread block (i.e., the block size is 256), and the back-projection kernel is called by these concurrent threads. Moreover, it should be noted that in addition to the number of thread processors, the number of registers required by each thread also limits the maximum number of concurrent threads. In general, the CUDA compiler automatically determines the optimal number of registers for each thread, typically assigning 20-32 registers per thread [11].

To enhance the overall performance step by step, the following five optimization steps were applied gradually to the back-projection (their effect is shown in Fig. 2):

Step 1. Access all projections in DRAM; once a projection pixel is back-projected into its corresponding voxel, write the updated voxel value back to DRAM.
Step 2. Access all projections in DRAM; only after the projection data over all angles have been back-projected into their corresponding voxel, write the final reconstructed voxel value to DRAM.
Step 3. Upload each projection consecutively to the GPU texture memory.
Step 4. Upload a subset of projections consecutively to the texture memory, optimizing the use of texture.
Step 5. Restrict the number of registers used.

Fig. 2. The execution times (ms) required for back-projection on the GPU as the five optimization steps are applied gradually.
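The voxel-driven back-projection described in this section can be sketched in NumPy as follows. Each loop iteration over voxels corresponds conceptually to one GPU thread; the geometry parameters (`dso`, the source-to-origin distance, and `dsd`, the source-to-detector distance) and the nearest-neighbour interpolation are illustrative assumptions, since the paper does not give a reference implementation.

```python
import numpy as np

def backproject(filtered_projs, thetas, n, dso, dsd):
    """FDK step 3: accumulate filtered projections into an n^3 volume.

    filtered_projs : list of weighted, ramp-filtered 2D projections.
    thetas         : corresponding view angles (radians).
    dso, dsd       : assumed source-origin / source-detector distances.
    """
    vol = np.zeros((n, n, n), dtype=np.float32)
    c = (n - 1) / 2.0
    ax = np.arange(n) - c
    x, y, z = np.meshgrid(ax, ax, ax, indexing="ij")
    for proj, th in zip(filtered_projs, thetas):
        # rotate (x, y) by theta into the view-aligned frame (s, t)
        s = x * np.cos(th) + y * np.sin(th)
        t = -x * np.sin(th) + y * np.cos(th)
        # cone-beam projection of (s, t, z) onto detector coords (u, v)
        mag = dsd / (dso + s)                   # perspective magnification
        cu = (proj.shape[1] - 1) / 2.0          # detector center (u)
        cv = (proj.shape[0] - 1) / 2.0          # detector center (v)
        u = np.rint(t * mag + cu).astype(int)   # nearest-neighbour sample
        v = np.rint(z * mag + cv).astype(int)
        ok = (u >= 0) & (u < proj.shape[1]) & (v >= 0) & (v < proj.shape[0])
        # FDK distance weighting, then accumulation into the voxels
        vol[ok] += (dso / (dso + s[ok])) ** 2 * proj[v[ok], u[ok]]
    return vol
```

In the GPU version, the per-voxel coordinate transformation and accumulation inside the loop run one voxel per thread, with each 256-voxel row of a slice mapped to one thread block as described above.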
In our implementation, we restrict the number of registers per thread to 15-20, so that the maximum possible number of concurrent threads can be further increased, thus further accelerating the reconstruction. In our GPU-based back-projection, we gradually applied the five optimization steps listed above in order to enhance the overall performance step by step, towards the optimum. Fig. 2 shows the execution times required for back-projection on the GPU as these five optimization steps are applied in turn. The results in Fig. 2 were obtained by reconstructing a 256³ volume from an input data set consisting of 128 views of 256×256 pixels each. As Fig. 2 indicates, Step 4 plays a crucial role in the GPU-based implementation: optimizing the use of texture memory significantly speeds up the entire reconstruction process (a 3.71× speedup over Step 1, which does not use texture).

III. PERFORMANCE EVALUATION

In our evaluation, we used the "Take" program [13], developed by Jens Müller-Merbach, to generate an input data set of cone-beam projections consisting of 128 views of 512×512 pixels each. We evaluated the GPU performance by reconstructing a 512³ volume from this data set. Table 1 compares a variety of CT reconstruction systems from a number of existing implementations on Field Programmable Gate Arrays (FPGAs), multi-core CPUs, and GPUs, including our work. All results were obtained by reconstructing a 512³ volume. As Table 1 indicates, our GPU-based implementation achieves the best performance (58.80 pps) among these existing techniques.
(Note that here pps denotes the number of processed projections per second.) On the other hand, one may also see that the performance of the multi-core CPU approach is only comparable with that of the FPGA when reconstructing a 512³ volume (7.85 pps vs. 8.96 pps). Although this study has demonstrated that the shared-cache mechanism of a multi-core CPU enables a remarkable enhancement of computational performance in CT reconstruction, the massive amount of data produced by 2D projections of higher resolution (for example, 512×512 pixels) results in a lower cache hit rate, so the number of DDR3 memory accesses inevitably increases, undesirably reducing the execution speed. We may therefore conclude from these performance results that the GPU is the best candidate for performing high-speed CT reconstruction, in comparison to both the multi-core CPU and the FPGA.

Table 1 The execution times in the back-projection step required for reconstructing a 3D region of 512³ voxels, achieved by a number of existing implementations

Hardware                               Time (s)   Processed projections/second (pps)
FPGA by Li et al. [14]                 33.5       8.96
Multi-core CPU by Chu and Chen [5]     16.304     7.85
GPU (8800 GTX) by Xu and Mueller [7]   8.9        40.45
GPU (8800 GTX) by Yang et al. [8]      8.7        41.38
GPU (GTS 250) by this work             2.177      58.80

IV. CONCLUSION

In this paper, a GPU based implementation of an analytical cone-beam 3D CT image reconstruction method, the FDK algorithm, was presented. The performance was evaluated by comparing the results of our implementation with those obtained by a number of previous related works on different platforms, including FPGA, multi-core CPU, and GPU.
Consequently, our performance results reveal that the CUDA-based GPU substantially outperforms both the multi-core CPU and the FPGA for massively parallel HPC applications. We may therefore conclude that the GPU is currently the best candidate for performing high-speed CT reconstruction, compared to the multi-core CPU and FPGA.

ACKNOWLEDGMENT

This work was supported by the National Science Council, Taiwan, under Contract NSC 99-2221-E-182-020.

REFERENCES

1. Sakamoto M, Murase M (2007) Parallel implementation for 3-D CT image reconstruction on Cell Broadband Engine. IEEE Proc., ICME, Beijing, China, 2007, pp. 276–279.
2. Sakamoto M, Murase M (2009) A parallel implementation of 3-D CT image reconstruction on the Cell Broadband Engine. International Journal of Adaptive Control and Signal Processing 24:117–127.
3. Liang R, Pan Z, Krokos M et al. (2008) Fast hardware-accelerated volume rendering of CT scans. Journal of Display Technology 4:431–436.
4. Cruvinel P, Pereira M, Saito J et al. (2009) Performance improvement of tomographic image reconstruction based on DSP processors. IEEE Trans. on Instrumentation and Measurement 58:3295–3304.
5. Chu C, Chen S (2009) Parallel implementation for cone-beam based 3D Computed Tomography (CT) medical image reconstruction on multi-core processors. IFMBE Proc. vol. 25/IV, World Congress on Med. Phys. & Biomed. Eng., Munich, Germany, 2009, pp. 2066–2069.
6. Chen C, Lee S, Cho Z (1990) A parallel implementation of 3-D CT image reconstruction on hypercube multiprocessor. IEEE Trans. on Nuclear Science 37:1333–1346.
7. Xu F, Mueller K (2007) Real-time 3D computed tomographic reconstruction using commodity graphics hardware. Physics in Medicine and Biology 52:3405–3419.
8. Yang H, Li M, Koizumi K et al. (2007) Accelerating backprojections via CUDA architecture. Proc., 9th International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine, Lindau, Germany, 2007, pp. 52–55.
9. Noël P, Walczak A, Hoffmann K et al. (2008) Clinical evaluation of GPU-based cone beam computed tomography. Proc. High-Performance Medical Image Computing and Computer Aided Intervention (HP-MICCAI '08).
10. Scherl H, Keck B, Kowarschik M et al. (2007) Fast GPU-based CT reconstruction using the Common Unified Device Architecture (CUDA). IEEE Proc. vol. 6, 2007 Nuclear Science Symposium Conference Record (NSS '07), 2007, pp. 4464–4466.
11. Halfhill T (2008) Parallel processing with CUDA: Nvidia's high-performance computing platform uses massive multithreading. Microprocessor Report 22:1–8.
12. Feldkamp L, Davis L, Kress J (1984) Practical cone-beam algorithm. J. Opt. Soc. Am. A 1:612–619.
13. Müller-Merbach J (1996) Simulation of X-ray projections for experimental 3D tomography. Report no. LiTH-ISY-R-1866, ISSN 1400-3902.
14. Li J, Papachristou C, Shekhar R (2005) An FPGA-based computing platform for real-time 3D medical imaging and its application to cone-beam CT reconstruction. The Journal of Imaging Science and Technology 49:237–245.

Author: Szi-Wen Chen
Institute: Dept. of Electronic Engineering, Chang Gung University
Street: 259 Wen-Hwa 1st Road
City: Kwei-Shan, Tao-Yuan
Country: Taiwan
Email: [email protected]

Author: Chang-Yuan Chu
Institute: Dept. of Electronic Engineering, Chang Gung University
Street: 259 Wen-Hwa 1st Road
City: Kwei-Shan, Tao-Yuan
Country: Taiwan
Email: [email protected]