Graphic Processing Unit (GPU) Based Hardware Acceleration for High Speed 3D
Cone-Beam Computed Tomography (CT) Reconstruction
Szi-Wen Chen and Chang-Yuan Chu
Department of Electronic Engineering, Chang Gung University, Tao-Yuan, Taiwan
Abstract—The use of cone-beam projection based Computed Tomography (CT) is growing in the clinical area because it can provide 3D information. At the same time, rapid volumetric image reconstruction is of paramount importance to clinicians for prompt diagnosis and analysis of complex tissue alterations. To meet the high computational demand of 3D image reconstruction algorithms, High Performance Computing (HPC) platforms such as the Graphic Processing Unit (GPU) offer a promising solution for such a computation-intensive application. This study parallelizes an analytical 3D CT image reconstruction algorithm, the so-called FDK method, and implements it on an existing commercial GPU for high-speed 3D image reconstruction. Our performance evaluation results indicate that the GPU can significantly accelerate the execution of the 3D image reconstruction task.
Keywords— Computed Tomography (CT), Graphic Processing Unit (GPU), hardware acceleration, 3D reconstruction.
I. INTRODUCTION
Computed Tomography (CT) has become an important noninvasive tool in diagnostic medicine. To meet the requirements of high quality, safety, and time efficiency, modern CT techniques employ a cone-beam X-ray that produces the 2D projections required for directly reconstructing a 3D volume; the cone-beam technique can scan a wider area, in a shorter time, than conventional multi-slice reconstruction. In fact, cone-beam CT is applied not only to medical diagnosis, but also to non-destructive inspection systems and explosive detection systems in airports. However, the difficulty of 3D reconstruction remains. According to [1], the volume reconstruction from 2D projections is performed by specific algorithms with high computational complexity O(N⁴), where N is the number of detector pixels in one detector row.
In general, a cone-beam X-ray CT device usually requires special accelerating hardware for the 3D image reconstruction. For example, when using 400 views of 2D projections of 1024×1024 pixels each in floating-point format, the input projection data occupy approximately 1.6 GB, and the output reconstruction of a 1024³-voxel volume occupies 4 GB. Moreover, executing the 3D reconstruction on a single-core PC would take over an hour [2]. Obviously, this tremendous demand for computing power and memory is the clear and present barrier to CT applications.
In fact, multi-core technology has become widely adopted and has been demonstrated to be very useful in various High Performance Computing (HPC) applications. With these powerful dual-core, quad-core, or even eight-core processors, we may find brand-new solutions to a variety of CT image reconstruction problems, which are considered massive computational tasks [1]-[6]. On the other hand, the recent development of Graphic Processing Units (GPUs) has opened HPC approaches to a number of scientific problems, including the hardware acceleration of CT reconstruction [7]-[10]. Since current GPUs provide a massive multithreading capability that can handle the computational complexity of 3D cone-beam reconstruction, the GPU is in this respect even more powerful than the multi-core CPU. GPUs have become still more popular for HPC applications since NVIDIA introduced a C-like programming environment called the Compute Unified Device Architecture (CUDA). In this paper, a GPU-based implementation of 3D cone-beam CT image reconstruction is presented. The paper is organized as follows. Section II describes the GPU employed in this study, the FDK algorithm, a well-known analytical 3D image reconstruction method, and the proposed parallel implementation strategy of the FDK algorithm on the GPU. The performance was then evaluated by comparing the results of our implementation with those obtained from a number of previous related works, as described in Section III. Finally, the paper is briefly concluded in Section IV.
II. MATERIALS AND METHODS
A. The GPU Architecture and CUDA-based Programming
In fact, the GPU architecture was originally developed to accelerate computer graphics rendering, offloading that work from general-purpose microprocessors. In addition to being able to efficiently manipulate computer graphics, the highly parallel architecture of modern GPUs
Á. Jobbágy (Ed.): 5th European IFMBE Conference, IFMBE Proceedings 37, pp. 567–570, 2011.
www.springerlink.com
makes them even more effective than general-purpose CPUs for a wide range of complex HPC applications. In fact, computing intensity is growing much faster on GPUs than on CPUs, since most of the transistors in a GPU are devoted to performing data calculations for computer graphics or scientific computing rather than to data caching. Therefore, one may expect the GPU to be perfectly suitable for problems that can be well formulated and expressed as data-parallel calculations, so that these calculations may be executed simultaneously on many processing units, referred to as thread processors [9].

Owing to a degree of inconvenience and clumsiness in the native GPU programming model, NVIDIA created a C-like software platform, called the Compute Unified Device Architecture (CUDA), for massively parallel high-performance computing on NVIDIA GPUs. In general, CUDA includes C/C++ development tools, a number of function libraries that simplify programming, and a hardware abstraction mechanism [11]. With CUDA, a programmer first analyzes the algorithm to be implemented, as well as its data, in order to determine the optimal numbers of threads, blocks, and grids (as defined below) so that the GPU may be fully utilized. The implementation can then be expressed simply by writing C/C++ code for CUDA. On current NVIDIA architectures, the resulting compiled CUDA function (known as a "kernel" in CUDA terminology) can be called by multiple threads, so the kernel can run concurrently across large blocks of threads. Note that a thread block is defined as a set of threads that execute synchronously on a cluster of thread processors and manipulate the same data in shared memory. Moreover, blocks that execute the same kernel can be further grouped into a grid of blocks. Therefore, the maximum number of concurrent threads a GPU can devote to executing a single kernel may be very large.

B. The Feldkamp/FDK Algorithm

Similar to the original filtered back-projection algorithm, the Feldkamp algorithm [12], alternatively known as the FDK algorithm, was developed specifically for cone-beam CT reconstruction. Fig. 1 shows a schematic picture of a cone-beam CT system with a planar detector. As an analytical reconstruction method, the FDK algorithm conducts a series of steps that transform a number of 2D projections taken on the planar detector into a 3D reconstructed region. In general, the FDK algorithm mainly consists of the three steps listed below:

1. Weighting the projection data P(u,v,θ).
2. Filtering the weighted projection image row by row.
3. Accumulating the back-projected data contributed by each projection into the corresponding voxel r(x,y,z),

where P(u,v,θ) represents the projection image pixel indexed by (u,v), and θ is the incident angle of the X-ray.

Fig. 1. A schematic drawing of the cone-beam CT system.
First, the projection data P(u,v,θ) are weighted by a weighting function, denoted w, representing the variation in ray intensity due to the shape of the cone-beam X-ray. Then, as in the original filtered back-projection reconstruction, each projection image is preprocessed by a high-pass filter in order to reduce the smoothing effect caused by the accumulation process. Finally, the value of an individual voxel r(x,y,z) is estimated by summing the corresponding filtered projection pixels Pf(u,v,θ) over all projection angles θ, where Pf(u,v,θ) denotes the weighted, filtered 2D projection data obtained at projection angle θ. It should be noted that the coordinates (x,y,z) and (u,v,θ) are related by a two-step transformation: first rotating from (x,y) to (s,t) by an angle θ, and then projecting the 3D coordinate (s,t,z) onto a 2D coordinate (u,v) through a cone-beam projection, as illustrated in Fig. 1. Therefore, given a 3D coordinate (x,y,z), the FDK algorithm allows one to find the corresponding projection coordinate (u,v) for a given θ, and then accumulate the associated projection data P(u,v,θ) into the reconstruction voxel r(x,y,z).
C. GPU-based Implementation
As described previously, the FDK reconstruction method
consists of three steps: 1) weighting the projection images, 2) ramp-filtering the projection data row by row, and 3) back-projecting the filtered projection data into the voxel values.
Since the third step is computationally intensive and more
than 95% of the total reconstruction time is devoted to the
back-projection algorithm, we here mainly focus on the
GPU-based implementation of this portion of the reconstruction process. In this aspect, the weighted and filtered projection data set was divided into smaller subsets, and each subset was consecutively uploaded to the GPU texture memory, since storing the data locally reduces the number of off-chip memory accesses and thus enhances performance.

A voxel-based back-projection is proposed in this study. Note that here the calculation of each voxel value is performed by a single thread. In each thread, the coordinate transformation for the voxel is computed first in order to determine its corresponding projection data. The 3D volumetric region is partitioned into 2D slices, and each slice is then further divided into a number of rows of voxels. As a result, in our application each row, or a portion of a row, consisting of 256 voxels forms a thread block (i.e., the block size = 256), and the back-projection kernel is simply called by these concurrent threads. Moreover, it should be noted that in addition to the number of thread processors, the number of registers required by each thread also limits the maximum number of concurrent threads. In general, the CUDA compiler can automatically determine the number of registers for each thread; typically, the compiler assigns 20-32 registers per thread [11]. In our implementation, we restrict the number of registers per thread to 15-20 so that the maximum possible number of concurrent threads can be further increased, thus further accelerating the reconstruction.

Note that in our GPU-based back-projection, we gradually apply the five optimization steps itemized below in order to enhance the overall performance step by step, towards the optimum:

Step 1. Access all projections in DRAM; once a projection pixel is back-projected into its corresponding voxel, write the updated voxel value to DRAM.
Step 2. Access all projections in DRAM; after the projection data over all angles are back-projected into their corresponding voxels, write the final reconstructed voxel value to DRAM.
Step 3. Upload each projection consecutively to the GPU texture memory.
Step 4. Upload a subset of projections consecutively to the texture memory, optimizing the use of texture.
Step 5. Restrict the number of registers used.

Fig. 2. The execution times required for back-projection on the GPU as the five optimization steps are applied gradually.
Fig. 2 shows the execution times required for back-projection on the GPU as the above five optimization steps are applied gradually. The numerical results in Fig. 2 were derived from reconstructing a 256³ volume using an input data set consisting of 128 views of 256×256 pixels each. According to these results, Step 4 clearly plays a crucial role in the GPU-based implementation. This observation supports the fact that optimizing the use of texture significantly speeds up the entire reconstruction process (a 3.71× speedup compared with the case without the use of texture in Step 1).
III. PERFORMANCE EVALUATION
In our evaluation, we used the “Take” program [13] developed by Jens Müller-Merbach to generate the input cone-beam projection data set, which consists of 128 projections of 512×512 pixels each. We evaluated the GPU performance by reconstructing a 512³ volume from this data set. Table 1 provides a performance comparison of a variety of CT reconstruction systems, covering a number of existing implementations on Field Programmable Gate Arrays (FPGAs), multi-core CPUs, and GPUs, including our work. All the results were obtained by reconstructing a 512³ volume. As indicated in Table 1, our GPU-based implementation achieves the best performance (58.80 pps) in comparison with these existing techniques. (Note that here pps denotes the number of processed projections per second.)
On the other hand, one may also see that the performance of the multi-core CPU approach is only comparable with that of the FPGA when reconstructing a 512³ volume (7.85 pps vs. 8.96 pps). Although it has been demonstrated in this study that the shared cache mechanism in the multi-core CPU enables a remarkable enhancement of computational performance in CT reconstruction, the massive amount of data produced by higher-resolution 2D projections (for example, 512×512 pixels) would actually result in a lower cache hit rate, so the number of DDR3 memory accesses inevitably increases, undesirably reducing the execution speed. Therefore, we may further conclude from these performance results that the GPU may be considered the best candidate for performing high-speed CT reconstructions, in comparison with both the multi-core CPU and the FPGA.
Table 1 The execution times in the back-projection step required for reconstructing a 3D region of 512³ voxels, achieved by a number of existing implementations

Hardware                                 | Time (s) | Processed projections/second (pps)
FPGA by Li et al. [14]                   | 33.5     | 8.96
Multi-core CPU by Chu and Chen [5]       | 16.304   | 7.85
GPU (8800 GTX) by Xu and Mueller [7]     | 8.9      | 40.45
GPU (8800 GTX) by Yang et al. [8]        | 8.7      | 41.38
GPU (GTS 250), this work                 | 2.177    | 58.80
IV. CONCLUSION
In this paper, a GPU-based implementation of an analytical cone-beam 3D CT image reconstruction method, the FDK algorithm, is presented. The performance was evaluated by comparing the results of our implementation with those obtained from a number of previous related works on different platforms, including FPGA, multi-core CPU, and GPU. Our performance results reveal that the CUDA-based GPU substantially outperforms both the multi-core CPU and the FPGA for massively parallel HPC applications. Therefore, we may conclude that the GPU is currently the best candidate for performing high-speed CT reconstructions, compared with the multi-core CPU and FPGA.
ACKNOWLEDGMENT
This work was supported by the National Science Council, Taiwan, under Contract NSC 99-2221-E-182-020.
REFERENCES
1. Sakamoto M, Murase M (2007) Parallel implementation for 3-D CT image reconstruction on Cell Broadband Engine. IEEE Proc., ICME, Beijing, China, 2007, pp. 276–279.
2. Sakamoto M, Murase M (2009) A parallel implementation of 3-D CT image reconstruction on the Cell Broadband Engine. International Journal of Adaptive Control and Signal Processing 24:117–127.
3. Liang R, Pan Z, Krokos M et al. (2008) Fast hardware-accelerated volume rendering of CT scans. Journal of Display Technology 4:431–436.
4. Cruvinel P, Pereira M, Saito J et al. (2009) Performance improvement of tomographic image reconstruction based on DSP processors. IEEE Trans. on Instrumentation and Measurement 58:3295–3304.
5. Chu C, Chen S (2009) Parallel implementation for cone-beam based 3D Computed Tomography (CT) medical image reconstruction on multi-core processors. IFMBE Proc. vol. 25/IV, World Congress on Med. Phys. & Biomed. Eng., Munich, Germany, 2009, pp. 2066–2069.
6. Chen C, Lee S, Cho Z (1990) A parallel implementation of 3-D CT image reconstruction on hypercube multiprocessor. IEEE Trans. on Nuclear Science 37:1333–1346.
7. Xu F, Mueller K (2007) Real-time 3D computed tomographic reconstruction using commodity graphics hardware. Physics in Medicine and Biology 52:3405–3419.
8. Yang H, Li M, Koizumi K et al. (2007) Accelerating backprojections via CUDA architecture. Proc., 9th International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine, Lindau, Germany, 2007, pp. 52–55.
9. Noël P, Walczak A, Hoffmann K et al. (2008) Clinical evaluation of GPU-based cone beam computed tomography. Proc., High-Performance Medical Image Computing and Computer Aided Intervention (HP-MICCAI'08).
10. Scherl H, Keck B, Kowarschik M et al. (2007) Fast GPU-based CT reconstruction using the Common Unified Device Architecture (CUDA). IEEE Proc. Vol. 6, 2007 Nuclear Science Symposium Conference Record (NSS '07), 2007, pp. 4464–4466.
11. Halfhill T (2008) Parallel processing with CUDA: Nvidia's high-performance computing platform uses massive multithreading. Microprocessor Report 22:1–8.
12. Feldkamp L, Davis L, Kress J (1984) Practical cone-beam algorithm. J. Opt. Soc. Am. A 1:612–619.
13. Müller-Merbach J (1996) Simulation of X-ray projections for experimental 3D tomography. ISSN 1400-3902, report no. LiTH-ISY-R-1866.
14. Li J, Papachristou C, Shekhar R (2005) An FPGA-based computing platform for real-time 3D medical imaging and its application to cone-beam CT reconstruction. The Journal of Imaging Science and Technology 49:237–245.
Author: Szi-Wen Chen
Institute: Dept. of Electronic Engineering, Chang Gung University
Street: 259 Wen-Hwa 1st Road
City: Kwei-Shan, Tao-Yuan
Country: Taiwan
Email: [email protected]

Author: Chang-Yuan Chu
Institute: Dept. of Electronic Engineering, Chang Gung University
Street: 259 Wen-Hwa 1st Road
City: Kwei-Shan, Tao-Yuan
Country: Taiwan
Email: [email protected]