Abstract
The nonrigid registration algorithm based on Bspline FreeForm Deformation (FFD) plays a key role and is widely applied in medical image processing due to the good flexibility and robustness. However, it requires a tremendous amount of computing time to obtain more accurate registration results especially for a large amount of medical image data. To address the issue, a parallel nonrigid registration algorithm based on Bspline is proposed in this paper. First, the Logarithm Squared Difference (LSD) is considered as the similarity metric in the Bspline registration algorithm to improve registration precision. After that, we create a parallel computing strategy and lookup tables (LUTs) to reduce the complexity of the Bspline registration algorithm. As a result, the computing time of three timeconsuming steps including Bsplines interpolation, LSD computation, and the analytic gradient computation of LSD, is efficiently reduced, for the Bspline registration algorithm employs the Nonlinear Conjugate Gradient (NCG) optimization method. Experimental results of registration quality and execution efficiency on the large amount of medical images show that our algorithm achieves a better registration accuracy in terms of the differences between the best deformation fields and ground truth and a speedup of 17 times over the singlethreaded CPU implementation due to the powerful parallel computing ability of Graphics Processing Unit (GPU).
1. Introduction
Image registration is a process used in medical image analysis to determine a spatial transformation that aligns image data according to the spatial coordinate of pixels. The process involves a reference image and a moving image; the moving image is deformed by a spatial transformation to align with the reference image [1]. With the development of medical imaging technologies, it is easy to obtain many images with different modalities and gain a more complete understanding of anatomy and function by registering and fusing these images. Generally, according to spatial transformation, image registration is categorized into two types: rigid registration and nonrigid registration. The rigid registration only describes the motion which is limited to global rotations and translations, and the nonrigid registration always includes very complex local and elastic deformations except rigid transformation.
Practically, nonrigid registration is always widely applied in the many clinical procedures to improve the geometric precision, such as disease diagnostic [2], neurosurgery [3] and ImageGuided Radiation Therapy (IGRT) [4]. Up to present, several classical nonrigid registration algorithms have been proposed and improved in the performance of registration results, including viscous fluid registration [5], demons registration [6], Finite Element Model (FEM) [7] and Bspline registration [8]. Bspline registration always employs uniform cubic Bspline curves to determine a displacement vector field which maps voxels in the moving images to those in the reference images [8]. Each voxel’s displacement between the moving image and the reference image is parameterized in terms of uniformly spaced control points that are aligned with the voxel mesh, and displacement vectors are obtained via interpolation of the controlpoint coefficients using piecewise continuous Bspline basis functions. Recently, Bspline registration became very popular because of their flexibility and robustness and is therefore able to perform both monomodality and multimodality image registration. However, In the Bspline registration algorithm, there are three timeconsuming steps: Bspline interpolation, similarity metric computation, and similarity metric gradient computation. First, in the process of Bspline interpolation, a coarse array of Bspline coefficients is considered as the input and a fine array of displacement values is computed as the output defining the deformation vector field from the moving image to the reference image. Therefore, Bspline interpolation is expensive in terms of executing time for the fine deformation mesh with many control points. In addition, similarity metric is a very crucial step to measure the registration results, and it has an important influence on the accuracy of the registration results. In addition, similarity metric is calculated in each iteration of the Bspline registration algorithm, and its computational efficiency has a serious impact on the performance of the whole registration process. Finally, similarity metric gradient computation requires evaluating the partial derivatives of the similarity metric with respect to each splinecoefficient value of all control points; it is therefore a very timeconsuming step to obtain better accuracy of registration results in the case of much more control points. In summary, the Bspline registration algorithm always needs more execution time for the larger amount of medical image data, which limits the clinical applications of Bspline registration [9, 10].
Aiming to improve the accuracy and significantly accelerate Bspline registration algorithm, we develop a Bsplinebased nonrigid registration algorithm that is suitable for executing on GPU by considering the LSD as similarity metric. In our algorithm, all of the computationintensive tasks are designed and implemented on GPU using the parallel computing framework, Computing Unified Device Architecture (CUDA). Experiments show that our algorithm improves the accuracy of registration results and achieves a speedup ratio of 17 times comparing to the singlethreaded serial implementation. The main contributions are summarized as follows:(1)We introduce the LSD as similarity metric in the Bspline registration algorithm to obtain the higher registration accuracy for medical images.(2)We design a parallel computing strategy and create LUTs for Bsplines registration to improve the execution performance. Meanwhile, three timeconsuming steps including Bsplines interpolation, the similarity metric LSD, and its gradient computation are designed in the form of the kernel functions which are executed in parallel on GPU.
The rest of this paper is organized as follows. Section 2 presents the related works of accelerating Bspline registration using cluster, manycore and GPU. Section 3 describes our Bspline registration algorithm employed the LSD as the similarity metric and NCG optimization. Section 4 explains the parallel computing details of our proposed algorithm based on CUDA. Section 5 shows experiments results and analysis. Finally, Section 6 concludes our work.
2. Related Work
The parallel computing technology is an effective way to accelerate many general purpose applications and has a very crucial influence on the field of high performance computing. Parallel operations can be performed independently on different partition of medical images due to the good data parallel characteristic of medical images. At present, parallel computing has been widely used in medical image processing [11, 12], for example, image registration [13, 14]. There are three parallel computing technologies for accelerating medical image registration: cluster, manycore, and GPU computing.
Cluster computing is a wellestablished technology of accelerating image processing algorithms and nonrigid registration is also no exception. The main strategy of cluster computing is composed of two stages: dividing the medical image into many different small partitions and processing them independently on different computers while minimizing the communication cost. Zheng et al. [15] used mutual information as a similarity metric and developed parallel nonrigid registration algorithms on a computer cluster. Results reported in Zheng et al. [15] for Bsplines show a speedup of for processors compared to a sequential implementation.
The recent developments in manycore processor technology offer new opportunities for truly largescale and costeffective parallel computing on a single chip. Meanwhile, multiprocessor computing techniques are utilized for accelerating image registration algorithms. Jiang et al. [16] used a FPGAbased implementation leading to a speedup of 3.2 times compared with a 2.666 GHz CPU execution. Rohlfing and Maurer Jr. [17] reduced computation time more than 50 times by using 64 CPUs of a shared memory supercomputer. More recently, Rohrer et al. [18] presented a multicore implementation of the Bspline computation based on a Cell Broadband Engine platform, reporting a speedup of 40 times over a sequential CPU implementation. Although these techniques provide considerable computation time improvements, they require either high technical knowledge or hardware with prices inhibiting wide adoption.
Modern GPU is the extension of manycore technology and has a large number of cores on the chip, and this manycore architecture of GPU is more suitable for parallel computing. For algorithms that can be parallelized within the SIMD model, a single GPU offers the computing power equivalent to a small cluster of desktop computers. As noted in the introduction, a bottleneck of the Bspline FFD registration algorithm is the cubic Bspline computation. Consequently, some work has been done to speed up this part by using GPU. For example, Sigg and Hadwiger [19] developed a cubic Bspline interpolation method utilizing linear interpolation of the GPU hardware; however, such an approach suffers from the limitations imposed by the low fixedpoint precision of the hardware interpolator as reported by Ruijters et al. [20], which limits its applicability to robust and precise medical image registration.
In addition, researchers proposed many improved approaches to accelerate Bspline registration algorithm with respect to Bspline interpolation and similarity metric computation. Modat et al. [21] presented a parallelfriendly formulation algorithm that is suitable for GPU, it only take less than 1 min to perform registration of T1weighted MR images without reducing the registration accuracy. Shackleford et al. [22] employed the SSD as similarity metric and accelerated Bspline registration within the streamprocessing model using GPU, and this algorithm achieved a good speedup over the singlethreaded CPU. Ikeda et al. [23] proposed an efficient GPUbased acceleration method for the nonrigid registration of multimodal images, and their main contribution is efficient utilization of onchip memory for Normalized Mutual Information (NMI). Ellingwood et al. [24] adopted the sum of squared tissue volume difference as a similarity metric and thus developed a multilevel Bspline transform method to implement nonrigid masspreserving registration of two CT lung images on GPU, and this method outperforms its multithreaded CPU version in terms of running time.
3. Our Algorithm
Aiming to significantly improve the accuracy and the performance of Bspline registration algorithm, we propose a CUDAbased parallel Bspline registration algorithm which employs LSD as the similarity metric and used NCG as optimization strategy. In this section, we describe the details of our Bspline nonrigid image registration algorithm. Firstly, we represent the theory of describing nonrigid deformation using the cubic Bspline basis function. Secondly, we introduce the LSD as the similarity metric of our algorithm. Finally, we explain the NCG optimization procedure and the gradient of similarity metric in our algorithm.
3.1. BSpline FreeForm Deformation
Bspline FFD is used as the nonrigid transformation model, and threedimensional (3D) deformation field is defined by 3D mesh of control points in the Cartesian coordinates. The deformation field for a voxel located at coordinates in the reference image can be mathematically described as follows [15, 21]:
In (1), is the cubic Bspline basis function in the direction [8], and represents the deformation mesh. In the 3D medical image data, a partition region is composed by the eight adjacent control points, and the 3D spacing coordinate of this partition region is shown as follows:where , , and are the voxel numbers of the , , and directions of the control point mesh. Displacement of voxels in each region are affected around 64 control points. The 64 control points coordinates are calculated in the following scheme [8, 25]:
In the medical image volume data, the voxel with the coordinate has the local coordinate in the region , and the value is normalized in as follows [8, 25]:
3.2. The Similarity Metric LSD
A similarity metric quantifies the degree of alignment and is used to optimize the transformation’s parameters to achieve maximal alignment. The similarity metric has a key influence on the accuracy of registration results. Sum of Squared Difference (SSD) is a very classical similarity metric based on pixel computation, and it is widely used in the monomodality medical image registration. However, when SSD is used as the similarity metric, due to the fact that the SSD values of the different iterations are very closed in the later stage of optimization process, it leads to appear premature convergence rate and also affects the accuracy of the registration results.
Since the function value of the logarithm transformation has different influence on the different intervals of the definition domain, especially when the independent variable value is more and more small, the function value changes more quickly. In other words, the difference between two values of logarithmic function is more sensitive to the interval which has a small logarithmic values than the interval which has a large logarithmic values. In the latter stage of the registration process, the difference between the reference image and the moving image subject to local deformation becomes much smaller, the logarithmic operation can make the SSD of similarity metric to change more quickly, and it is helpful to find best deformation mesh more effectively. Therefore, in order to further improve the accuracy of registration results, we perform logarithmic operation on SSD and introduce the LSD as the similarity metric of the Bspline registration algorithm. LSD can be mathematically described as follows:
In (5), represents the reference image, is the moving image with Bspline deformation, and , , and are separately the voxel numbers in the , , and directions of reference and moving images. In order to promote smooth deformation, a penalty term is added to the LSD value. The final cost function to be optimized is a balance between the LSD and the deformation penalty:where . The penaltyterm was first used for image registration by Rueckert et al. [8].
3.3. BSpline Registration Optimization Based on NCG
Conjugate Gradient (CG) is a very effective optimization method for optimization problem. CG overcomes the shortcoming of slow convergence of the gradient descent method and also avoids the shortage from Newton’s method, which needs to store and compute the Hessian matrix and the inverse matrix [26]. The NCG method is an extension of CG and suitable for minimizing general nonlinear functions. Due to the fact that the Bspline registration process is a nonlinear optimization problem, the NCG method is adopted for the entire registration process optimization in our algorithm. So it is important to calculate the gradient of the similarity metric for control points in every iteration and update the control point mesh according to the negative gradient direction from the previous search. Iteratively, the process of updating parameters of control point mesh can be mathematically described in where denotes the displacement vector of control points in the iteration process. is the iteration number. is a scalar, which represents the displacement magnitude of the negative gradient direction, and is obtained using the inexact linear search method. represents the search direction, which is calculated aswhere is a variable, which is calculated as
As seen in (10), we thus require the derivative of the cost function:
In order to decrease the computational complexity, we propose a voxelcentric approach to evaluate the pointcentric gradient using the chain rule. We first compute the gradient for every voxel and then gather the information from all voxels to compute the gradient for control points. In (11), is calculated via the chain rule aswhere denotes the partial derivative of similarity metric for voxels, and it is calculated as follows:where is the moving image after the deformation, is the reference image, and denotes the initial gradient of the moving image. In (12), the second term is the derivative value of deformation field with respect to a control point, and the calculation form is described in
In (14), is the only metric related to the Bspline basis function, and it is also a constant value when the local coordinates of voxels in the mesh partition are determined. In order to simplify the parallel computing process of our algorithm, we set to 0.
4. Parallel Accelerating Procedure for Our Algorithm
To effectively improve the execution efficiency of the Bspline registration, we employ the GPU to accelerate the timeconsuming steps of our algorithm. In this section, we present the parallel computing details of our algorithm using the GPU. First, we explain the main procedure of our algorithm based on GPU. Secondly, we develop the parallel computing strategy of Bspline registration based on the characteristic of locality of Bspline transformation and design the LUTs for storing the reusable data for parallel kernel functions executing on the GPU. Then, we present kernel functions for these timeconsuming steps. Finally, we summarize our proposed parallel algorithm based on GPU.
4.1. The Main Procedure of Our Algorithm
In general, data parallelism is the main requirement for an algorithm to benefit from GPU execution. Our nonrigid registration algorithm based on Bspline FFD comprises three components, which may be considered independently: transformation of the moving image using the splines interpolation, evaluation of a similarity metric, and optimization against this similarity metric. In these three components, the optimization process always requires adequate iterations and the current iteration is always dependent to the previous iteration, so it is serial process and is not implemented in parallel on the GPU. Individually, Bsplines interpolation and evaluation of a similarity metric may be formulated in a data parallel manner as they mainly consist of voxelwise computations. In our algorithm, NCG is used for the optimization, so the first order derivative of the similarity metric is computed to determine the optimal direction of the next iteration. Since the LSD is used as the similarity metric for evaluating the registration quality in our algorithm, we also design the parallel kernel function of the gradient of similarity metric LSD in terms of the highly parallel data features. The main framework of our algorithm using GPU is shown in Algorithm 1.

4.2. Parallel Computing Strategy for BSpline Registration
In the Bspline registration algorithm, the elastic deformation of the moving image is uniform cubic Bspline transformation which is parameterized by control point mesh. Figure 1 shows that the elastic deformation of the 2D moving image is described using control point mesh. In Figure 1(a), since the red control point is the current central control point of 4 × 4 regions, the displacement vector of the central control point will be established by the around 4 × 4 regions and is entirely unrelated to the other regions of the moving image; in Figure 1(b), the displacement vector of each pixel located in the partition region is determined by its neighboring 4 × 4 control points when updating the Bspline transformation on the moving images. In other words, relative displacement changes of any control points only affect the pixels in the local region and are independent of the other global pixels of the moving images, so Bspline transformation has a good ability to describe the local elastic deformation of moving images. Therefore, this locality characteristic of Bspline transformation is a key foundation for designing the parallel computing strategy.
(a)
(b)
To improve effectively the executing speed of our Bspline registration algorithm using GPU, we develop a parallel computing strategy for Bspline registration and create LUTs to store the reusable data in the parallel processing procedure. Our parallel computing strategy is dividing the moving images into many different small and nonoverlapping partition regions in terms of the locality characteristic of Bspline transformation and then designing the parallel kernel functions to process them independently by many different threads on the different cores of GPU while utilizing the different memory of GPU effectively and minimizing the communication cost of thread blocks.
In order to ensure that the registration process on the GPU can effectively be executed in parallel, the following initialization needs to be done on the CPU before the entire process: image volume data is loaded into host memory and copied from the host memory to the GPU memory; then bounding it to the texture memory of the GPU can improve the data access efficiency; the initial Bspline mesh parameters are loaded from the host memory into the global memory of the GPU; we precompute some reusable values, such as the Bspline basis function products, which are constructed and stored as LUTs in the texture memory of the GPU. In terms of the symmetry of volume data, we can create three LUTs for the parallel process as follows:(1)The lookup table of basis function products. For the same normalized local coordinate , we precompute basis function products and store them in the , which can be reused in the kernel functions of computing similarity metric and the gradient of similarity metric. Therefore, it can avoid repeated calculations and effectively improve the computational speed. For example, in Figure 1(a), the two pixels (A and B) are in their respective local partition regions with the same local coordinate, so they can reuse the same value of Bspline basis function products.(2)The lookup table of control point indexes. Since all voxels in the same region compute their displacement vectors using the same 64 control points, we precompute the index set of control points in the same region and create this LUT for storing them to improve the executing efficiency.(3)The spatial gradient lookup table of moving images. Because the gradient of the initial moving image remains the same value in the entire registration process, we precompute the gradient of moving images on the CPU and create a LUT to save them in the texture memory of GPU, which can reduce the complexity of parallel calculation.
4.3. Parallel Kernel Functions of Our Algorithm
In every iteration of our algorithm, Bspline transformation, similarity metric, and the gradient of similarity metric are three timeconsuming steps. In terms of the above parallel computing strategy, we need to design three kernel functions for these timeconsuming steps, and these kernel functions are executed in parallel on the GPU by multiple threads to improve the execution efficiency of our Bspline image registration algorithm. Since Bspline transformation and similarity metric computation are calculated for each voxel of the moving images, we only design a parallel kernel function which sequentially calculates these two steps.
4.3.1. BSpline Transformation and Similarity Metric Computation
We use the LSD as the similarity metric to evaluate the quality of deformation field in the proposed algorithm. As described in Section 3.2, LSD accumulates the square of the intensity difference between the reference images and the moving images subject to the deformation field. In consideration of the computation of LSD with a good parallelization, we design a kernel function kernelFunctionSimilarity() to compute LSD and Bspline transformation in parallel on the GPU in this paper. The kernel function is illustrated in Algorithm 2.

4.3.2. Similarity Metric Gradient for Control Points
As shown in Figure 1, the displacement of each voxel in the same region is affected by displacements of 16 control points around the region when the moving image is subject to Bspline deformation. We compute the partial derivatives of all voxels for each control point in the same region, and all voxels housing in the same region have the contributions to 16 control points around this partition region, so each thread is corresponding to a partition region rather than a control point to improve the executing performance of the kernel function on the GPU. The kernel function CalSimilarityGradientKernelFunc() calculating the gradient of similarity metric for control points is designed as Algorithm 3.

4.4. Summary of the Proposed Algorithm
Our algorithm is summarized and illustrated in Figure 2. In the beginning of the registration process, the image data, precomputing LUTs, and the initial Bspline mesh deformation parameters are loaded from the CPU to the GPU. More specifically, in order to improve the efficiency of accessing the memory in GPU, reference images and precomputing LUTs are bound to the texture memory of the GPU, and the initial Bspline mesh data and moving images are loaded into the global memory of the GPU; secondly, two kernel functions are executed in parallel by multithreading on the GPU, which are designed to calculate the Bspline transformation for the moving images according to the deformation mesh, the similarity metric, and the similarity metric gradient for control points respectively; and finally the similarity metric and its gradient values are transferred from the GPU to the CPU for the NCG optimization; the optimal Bspline deformation mesh is outputted when the optimized condition was meted; otherwise update Bspline deformation mesh parameters and continue to repeat the above registration process until the optimal values are found.
5. Experiment Results and Discussion
We evaluated the performance of the proposed algorithm on a medical imaging dataset that was obtained from the database of retrospective image registration evaluation (RIRE) at the Vanderbilt university [27]. This dataset contains a lot of medical image volume data of different patients from several protocols, including CT, T1weighted (MRT1), T2weighted (MRT2), and proton density (MRPD) with a variety of slice thickness. All the corresponding slices from different protocols are originally aligned with each other. We implemented the proposed algorithm using CUDA C in the Microsoft Visual Studio .NET 2010 and used CUDA6.0 of driver and runtime libraries [28]. The experimental hardware configuration is shown in Table 1.
5.1. Accuracy Comparison with Different Similarity Metrics
In the first experiment, to validate the registration accuracy of the proposed algorithm, we distorted the reference image MRT1 with a known nonrigid geometric transformation field, or the socalled ground truth. Then, we applied the proposed algorithm and compared the registration results to the other two Bspline registration algorithms using Sum of Absolute Difference (SAD) and SSD as a similarity metric, respectively, which are implemented in the Insight Segmentation and Registration Toolkit (ITK) [29]. And finally, we compared the obtained deformations fields with the ground truth. Figure 3 shows the registration results of three algorithms for brain MRT1 images. Note that the registered moving images obtained by the proposed algorithm is visually more similar in deformation to the reference image than the images produced by the SSDbased algorithm and the SADbased algorithm, respectively. Moreover, the estimated transformation field resulting from our algorithm is more similar to the ground truth than those obtained using the other two algorithms.
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
(i)
To measure the registration accuracy of the proposed algorithm, we computed the mean and standard deviation of the MSE between the ground truth and estimated displacement vectors. Table 2 displays the MSE statistics of the estimated deformation, when compared to the ground truth. The first column shows the mean ground truth deformation, which represents the magnitude of the displacement vector that is used to generate moving images in each experiment. For each mean value of ground truth, eight different transformation fields with this mean are generated and applied to the reference image to generate the corresponding moving image. The other columns display the average and standard deviation of MSE obtained by using three different similarity metrics for the generated eight pairs of referencemoving images. The results obtained using the proposed algorithm are considerably small compared to those of SADbased and SSDbased algorithm, and our algorithm shows the superior accuracy in comparison to the other two algorithms.
5.2. Accuracy Comparison with CPU Implementation
For our Bspline registration algorithm employing LSD as a similarity metric and NCG as a optimization strategy, we, respectively, implemented it based on CPU and GPU and compared the registration accuracy impact from the parallel computing framework for the four groups of medical images. Since the resolution of these images is 512 × 512, the control point mesh size, the number of control points, and the number of regions are 16 × 16, 35 × 35, and 32 × 32, respectively. In the GPUbased implementation of our algorithm, we set the grid size to 32 × 32 and the block size to 16 × 16 for the kernel function kernelFunctionSimilarity() and set the grid size to 4 × 4 and the block size to 8 × 8 for the kernel function CalSimilarityGradientKernelFunc(). Owing NCG is, respectively, adopted in the two implementations; we also set the maximum iteration number to 100 in the optimization process. Figure 4 shows the results using two implementations for the first group of chest CT images. Figure 5 shows the results that a reference image and moving image of the human head are matched using two implementations. As can be seen from Figures 4 and 5, the CPUbased implementation and the GPUbased implementation of our algorithm can both work well in describing the elastic deformation between reference and moving images, and they can both obtain highly accurate results.
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
(i)
(a)
(b)
(c)
(d)
(e)
(f)
(g)
(h)
(i)
In order to evaluate the accuracy of registration results, some researchers used some mathematical metrics. Ellingwood et al. [24] used the normalized Root Mean Square Error (RMSE) to mathematically evaluate accuracy of the GPU resultant displacement field against the singlethreaded CPU implementation. RMSE is calculated as follows:where denotes the resultant displacement field associated with a voxel at the coordinate . To measure the registration quality, we generate the deformation field by running the registration process for 100 iterations and then compare the results against two CPUbased implementations (singlecore and dualcore). The average RMSE for the four groups of medical images is 0.027 ± 0.011 comparing our GPUbased implementation against the singlecore CPU implementation, and the average RMSE value is 0.024 ± 0.008 comparing our GPUbased implementation against the dualcore CPU implementation, so our GPUbased implementation and two CPUbased implementations can achieve almost the same accuracy of image registration. To sum up, according to two aspects of the image registration results and the RMSE metric, the experimental results show that both the two CPUbased implementations and our GPUbased implementation can generate nearidentical vector fields with respect to the reference images and almost achieve good registration results with the same precision.
5.3. Execution Speed Evaluation
In order to evaluate and compare the registration performance, five different algorithm versions are implemented: the singlethreaded CPU implementation; the dualcore CPU implementation using OpenMP; the basic GPU implementation (this GPU version has a straightforward and naive kernel function that calculates the gradient of similarity metric for each control point); the GPU implementation without LUTs (this version designs kernel function that calculates the gradient of similarity metric for each partition rather than each control point to decrease the calculation complexity, but it computes temporarily some intermediate values such as basis function products and does not still use LUTs); our algorithm (our algorithm computes three timeconsuming steps on the GPU and uses three LUTs and instruction optimization strategy in kernel functions to maximize execution performance).
These versions are performed for four groups of medical images. In our algorithm, we first set the maximum iteration number to 100 and define the initial mesh spacing size to 10 × 10. Then, we set the grid and block size for the two kernel functions kernelFunctionSimilarity() (the grid and block size are 32 × 32 and 16 × 16, respectively) and CalSimilarityGradientKernelFunc() (the grid and block size are, respectively, 5 × 5 and 10 × 10). The performance of our algorithm is compared with the two CPU versions and two GPU versions under two measures: total execution times and speedup ratio. The speedup ratio is a persuasive measure used to compare the performance of GPU to CPU, which is mathematically described as , where time refers to the total execution times. The execution times and speedup ratio comparisons of five versions for 2D images are, respectively, illustrated in Figure 6.
(a) The running times comparison of five versions
(b) The speedup ratio comparison of five versions
As shown in Figure 6, our algorithm requires approximately 25 s to complete the registration process for the reference and moving images with the size of 512 × 512. By comparing the execution time of our algorithm with the singlecore CPU version (the total time is about 220 s) and the dualcore CPU version (the total time is about 130 s), the running speed of our algorithm is around 9 times faster than the singlecore CPU version and 5 times faster than the dualcore CPU version. In addition, our algorithm is about 10 seconds less than the basic GPU version in terms of running time, and our algorithm is about 4 seconds less than the GPU version without LUTs. In Figure 6, the results show that our algorithm reduces the calculation complexity and accelerates the speed of image registration effectively due to employing the parallel computing strategy and LUTs.
In the next experiment, we evaluate the impact of the initial Bspline mesh spacing size on the registration efficiency. For the reference and moving images with the size of 512 × 512, we first set the maximum iteration number to 100 and, respectively, set the initial Bspline mesh size to 8 × 8, 10 × 10, 16 × 16 and 32 × 32 for four groups of different images. Then, the grid and block size are, respectively, 32 × 32 and 16 × 16 for the kernel function kernelFunctionSimilarity() in our algorithm. In addition, Bspline spacing sizes in four various datasets require different numbers of control points and regions, so we set reasonable values for the grid and block sizes to adapt each of datasets for the kernel function CalSimilarityGradientKernelFunc().Figure 7 illustrates the execution times and speedup ratio comparisons of five algorithm versions for different mesh spaces.
(a) The execution times comparison of five versions
(b) The speedup ratio comparison of five versions
As shown in Figure 7, the running times of the singlecore CPU version decrease about 60 s because control point mesh spacing size gets larger and the number of control points used to describe deformation field becomes less. The execution times of three GPU versions decreased less with the decrease of the amount of control points. Especially, in our algorithm, the mesh spacing size has a less impact on the execution speed of the entire registration process because the main timeconsuming steps are executed by multithreading in parallel on the GPU.
In order to evaluate the calculation efficiency of our algorithm for the different amounts of image volume data, we furthermore compare our algorithm with the other four versions in the elastic registration for four groups of 3D medical images with different sizes. The sizes of four group medical images are shown in Table 3. The execution time and speedup ratio comparison of five versions for 3D images with different resolutions are shown in Figure 8.
(a) The execution times comparison of five versions
(b) The speedup ratio comparison of five versions
In Figure 8, experimental results show that the execution time of the singlecore CPU version increases more significantly with the increasing amount of 3D volume data, but the execution time of our algorithm almost has no growth because of the powerful parallel computing ability of GPU. Meanwhile, our algorithm computes the gradient of the similarity metric for each partition rather than each control point to reduce the calculation complexity and improve the efficiency of global memory access, and we effectively create three LUTs and utilize kernel instruction optimization strategy in the design of kernel functions. Therefore, compared with the other two GPU versions, our algorithm utilizes the GPU parallel computing ability more fully, further improves the execution efficiency of Bspline registration, and especially achieves a better speedup ratio with the increasing amount of image volume data.
6. Conclusion
In this paper, we have introduced the LSD as similarity metric and developed a parallel computing strategy and some effective LUTs that greatly reduce the complexity of Bsplinebased registration and then designed highly parallel kernel functions for computing the similarity metric and its gradient on GPU. We have demonstrated the accuracy and execution speed of our algorithm via experiments using the public medical image datasets. Experimental results highlight that our algorithm takes full advantage of the GPU parallel computing, and the entire Bspline nonrigid registration process achieved the speed ratio of 17 times for the large amount of volume data compared with the CPUbased implementation. The future work will focus on the parallel registration algorithm based on new similarity metrics for multimodality medical images.
Competing Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is supported by National Natural Science Foundation of China under Grant nos. 61562057 and 61461025, Lanzhou Science and Technology Innovation and Entrepreneurship Projects under Grant no. 2014RC7, the Natural Science Foundation of Gansu Province in China under Grant no. 145RJZA080, and the Youth Science Foundation of Lanzhou Jiaotong University under Grant nos. 2013005 and 2013009.