Abstract
Compressive sensing (CS) has been shown to enable dramatic acceleration of MRI acquisition in some applications. Being an iterative reconstruction technique, CS MRI reconstructions can be more timeconsuming than traditional inverse Fourier reconstruction. We have accelerated our CS MRI reconstruction by factors of up to 27 by using a split Bregman solver combined with a graphics processing unit (GPU) computing platform. The increases in speed we find are similar to those we measure for matrix multiplication on this platform, suggesting that the split Bregman methods parallelize efficiently. We demonstrate that the combination of the rapid convergence of the split Bregman algorithm and the massively parallel strategy of GPU computing can enable realtime CS reconstruction of even acquisition data matrices of dimension 4096^{2} or more, depending on available GPU VRAM. Reconstruction of twodimensional data matrices of dimension 1024^{2} and smaller took ~0.3 s or less, showing that this platform also provides very fast iterative reconstruction for smalltomoderate size images.
1. Introduction
Magnetic resonance imaging (MRI) is an important application of compressive sensing (CS) [1–4]. CS in MRI [5] has the potential to significantly improve both the speed of acquisition and quality of MR images, but requires an iterative reconstruction that is more computationally intensive than traditional inverse Fourier reconstruction. Compressive sensing accelerates MR acquisitions by reducing the amount of data that must be acquired. Reconstruction of this partial data set is then accomplished by iteratively constraining the resulting image to be sparse in some domain while enforcing consistency of the measured subset of Fourier data.
One practical barrier to the routine adoption of CS MRI is the delay between acquisition and reconstruction of images. Compressed sensing solvers work almost entirely with vector and image arithmetic, making them an excellent candidate for acceleration through using graphics processing units (GPUs) for parallelization. Here we illustrate how GPUs can be used to achieve significant increases in speed of CS reconstructions of large MRI data sets.
GPU computing means using the GPU to perform general purpose scientific and engineering computation. Highend video cards can contain hundreds of separate floatingpoint units, allowing for massively parallel computations at a fraction of the cost of CPUbased supercomputers, as measured on a per gigaFLOP basis (one gigaFLOP is equivalent to one billion floating point operations per second). GPU computing works best in single instruction, multiple data (SIMD) situations, such as the solution of large systems of linear equations. The power of GPU computing is already being realized in several advanced medical image reconstruction applications [6–9].
2. Materials and Methods
2.1. Hardware
The reconstruction platform tested was a highperformance GPU server designed to serve an MRI scanner as a dedicated reconstructor. This system contained a sixcore, 2.67 GHz Xeon X5650 with 12 GB of 1333 MHz DDR3 RAM, 32 KB of L1 cache, 256 KB of L2 cache, and 12 MB of L3 cache. An NVIDIA Tesla C2050 was used for this experiment. The Tesla C2050 contains 448 CUDA cores arrayed as 14 Streaming Multiprocessors (SMs) with 32 cores per SM. It has 3 GB of VRAM and 64 KB of memory per SM. All SMs share 768 KB of L2 cache. The Tesla C2050 has a theoretical maximum doubleprecision performance of 515 GFLOPs.
2.2. Reconstruction Problem
The optimization problem we are choosing to explore is the reconstruction of partial MRI data sets using CS. MRI is proving to be a fertile application for CS because many MR images can be highly compressed by transform coding with little loss of information. Conventional MRI data is acquired according to the Fourier sampling pattern required to satisfy the Nyquist criterion while producing an image of a given field of view and spatial resolution. To sample the Fourier domain “compressively,” only a random subset of the full Nyquist Fourier sampling scheme is acquired. Reconstruction of the conventional fully sampled MRI data requires simply an inverse Fourier transform, but the inverse problem becomes underdetermined when part of the Fourier data is omitted, so approximate methods must be used.
One highly successful method in particular has been to formulate the CS MRI reconstruction as a sparse recovery problem, in which an image is found that is consistent with the acquired Fourier data while having the sparsest representation in a chosen basis (e.g., gradient and wavelet). The typical formulation of the reconstruction of a complex image from a partial Fourier data set is then where is a measurement operator that transforms the sparse representation to the image domain then performs the subsampled Fourier measurement, and the and norms are, respectively, where the bar denotes complex conjugation. With the addition of the norm, the problem is more difficult to solve, and iterative techniques such as interior point methods [10, 11], iterative soft thresholding [12, 13], and gradient projection [14–16] are typically employed.
2.3. Software
The opensource split Bregman code of Goldstein and Osher [16] was chosen as the starting point for the GPUbased CS solver. The solver was originally written in Matlab. This solver was chosen for its rapid convergence and lack of array reduction steps, which hinders parallelization. We modified the original code to work with Jacket 1.8.0 (AccelerEyes, Atlanta, GA) and Matlab R2010b (Mathworks, Natick, MA). CUDA Toolkit 4.0 and CUDA developer driver 270.41.19 were used for all computations.
Algorithm 1 briefly outlines the procedure for running the split Bregman reconstruction on the GPU. Note that the split Bregman algorithm runs for a fixed number of iterations, so there is no variation in run time due to different descent trajectories as with a tolerancebased stopping criterion. Furthermore, the choice of image reconstructed has no bearing on the results, since the fixed number of iterations ensure that the same number of operations are performed on any input data set.

At the beginning of the reconstruction, the Fourier data in main memory must be transferred to the GPU with Jacket’s gsingle and gdouble Matlab commands. Next, temporary storage is allocated on the GPU using Jacket’s gzero and gones commands. All subsequent arithmetic operations, including the Fourier transform, are carried out on the GPU using function overloading. Function overloading simplifies code syntax by enabling a single function to encapsulate different functionality for different types of arguments. The specific behavior is typically chosen by the compiler or at run time. In our case, Matlab automatically calls the Jacket library if an operation is requested on a matrix that lies in GPU memory, while identical operations on a matrix in main memory are carried out with Matlab’s builtin functions. After the last loop iteration on the GPU, the solution is transferred back to main memory with overloaded versions of Matlab’s double and single commands; all temporary storage is automatically freed.
2.4. Experiments
Two numerical experiments were performed. The first was a pure matrix multiplication, designed to measure practical peak floatingpoint performance of the CPU and GPU as realized by Jacket 1.8.0. To remove dependencies on the multithreading performance of Matlab R2010b and provide easier comparison of the CS reconstructions, the CPU experiment was run both with and without multithreading enabled.
The second experiment, and the focus of this work, was a CS MRI reconstruction of a weighted breast image subjected to a 50% undersampling in Fourier (spatial frequency) space. Total variation was used as the sparsity constraint and was defined as the sum of the magnitudes of pixels in the gradient image. (See [5] for more details about CS MRI reconstruction in general.) Figure 1 shows sample images from the experiment and the random Fourier sampling pattern.
(a)
(b)
(c)
(d)
CS MRI reconstructions were performed for powersoftwo image sizes ranging from 32^{2} to 8192^{2} (up to 4096^{2} only for double precision, due to memory limitations). This range covers the range of realistic MR acquisition matrix sizes for 2D scans with allowance for specialty techniques at very low or very high resolutions or future developments in imaging capabilities. The largest matrix sizes can also be indicative of the performance of threedimensional reconstruction problems. For example, a 256^{3} data set is the same size as a 4096^{2} one.
The CPUbased reconstructions were performed with and without multithreading enabled. All reconstructions were timed eleven times, with the first iteration discarded and the following ten iterations averaged. This avoids biasing the results with startup costs associated with both Jacket and CUDA. Jacket uses JustinTime (JIT) compilation to improve performance of repeated function calls, so the first call to the Jacket library is slowed by this compilation step. The CUDA driver, upon initial invocation, optimizes the lowlevel GPU code for the particular hardware being used. These two processes increase code performance across multiple runs but reduce it for the initial function calls.
For timing experiments, the startup penalty is a confounding factor. Accurate timing thus requires that the reconstruction be “warmed up” with a similar problem before timing the fullscale computation. Here we used the simplest approach of discarding the first iteration. In principle, though, the warmup problem can be much smaller in data size as long as it uses the same set of Jacket functions needed in the fullsize reconstruction.
3. Results
3.1. Baseline
Table 1 shows the result of the matrix multiplication experiment with and without CPU multithreading enabled. In terms of CPU performance, Matlab R2010b accelerated the CPUbased matrix multiplication by ~6 on this sixcore processor using multithreading. Using the GPU then yielded an additional factor of ~5 beyond this, with a combined speedup of ~30 over a single CPU core. Based on the measured GPU single and doubleprecision performance of 650 and 311 GFLOPs, respectively, we can see that Matlab R2010b combined with Jacket 1.8.0 reached 60%, and 63% of the theoretical maximum single and doubleprecision performance, respectively, of the GPU.
3.2. CS MRI Reconstruction
The results of the CS MR image reconstructions are shown in Tables 2 and 3 and Figure 2. The maximum speedup was 27 for a singleprecision image of size 2048^{2}. For a typical doubleprecision MRI acquisition matrix of 256^{2} to 512^{2}, we found speedups of ~7–17.
Figure 2 shows the speed advantage of the GPUbased code over using both one and multiple CPU cores. The difference between enabling multithreading or not was large, and should be considered to fairly evaluate the speed improvement of using a GPU. As can be seen in Figure 2, the speedup was less than one for images smaller than about 64^{2}, followed by a rapid gain in speedup factor for images of 2048^{2} and a decline in speedup factor for the largest matrix sizes. Despite the performance falloff, the largest image was reconstructed over an order of magnitude faster on the GPU. The singleprecision GPU code was able to reconstruct images up to 8192^{2} before running out of GPU memory since singleprecision matrices require half the storage per element of doubleprecision matrices.
4. Discussion
Two notable features are evident in Tables 2 and 3. First, the GPU reconstruction times are very similar for images below 256^{2}, regardless of numerical precision. This is due to the communication overhead in transferring the data from the CPU memory to the GPU memory. For the smallest images, this cost dominated the computation time and, for images below 64^{2}, even caused the GPU reconstruction to take longer than the CPU reconstruction. This suggests that for very small acquisition matrices an efficient GPU reconstruction should combine multiple 2D data sets, such as different slices or echoes, into one reconstruction problem.
The second interesting feature of Tables 2 and 3 is that GPU reconstruction times for images of size 1024^{2} and smaller were effectively instantaneous. A typical rapid gradient echo sequence may employ a repetition time of 5 ms, so a 50% undersampling of an Fourier matrix would require roughly 2.5 ms to acquire (ignoring other acceleration methods for simplicity). The smallest image tested here would thus take 40 ms to acquire, while the CS reconstruction would take only 60 ms to complete, even with the severe communication penalty. The largest doubleprecision image we tested (4096^{2}) would take 15 s to acquire and 10 s to reconstruct. Thus our GPUbased platform has the capacity to produce iteratively reconstructed CSaccelerated gradient echo images in real time for some MRI applications.
The decline in acceleration at 4096^{2} was likely due to memory limitations (Yalamanchili, private communication). The NVIDIA Tesla C2050 card is designed such that each streaming multiprocessor (SM; comprised of 32 floating point units) shares a single 64 KB block of local memory. For a doubleprecision complex matrix multiplication, each matrix element requires 16 bytes of storage, so a maximum of 4096 elements can be stored in the SM’s shared memory.
Matlab stores arrays in column major order, so a matrix column must fit entirely into the shared memory of the SM in order to minimize memory access overhead. Thus memory access patterns will become inefficient above matrix sizes of 2048 for doubleprecision complex matrix multiplication. We do in fact see a leveling off of performance at 1024^{2} and a dramatic decline in performance above 2048^{2}, which is consistent with this prediction.
Many alternate methods of accelerating the CS reconstruction on a GPU platform exist, including writing custom lowlevel code in CUDA C or OpenCL, using free lowlevel libraries like CULA and CUFFT, using Matlab’s builtin GPU library through the Parallel Computing Toolbox (R2010b later only), or using other free highlevel MatlabCUDA interfaces, such as GPUmat.
The pace of algorithmic development is accelerating in compressed sensing, as demonstrated in Figure 3, and one advantage of using a highlevel interface to the GPU is rapid prototyping and implementation. Debugging time is shorter with highlevel languages, and coding effort is reduced, allowing the newest algorithms to be implemented quickly while still retaining a majority of the theoretical computational benefit of the GPU. Matlab’s builtin GPU library does not support the array indexing needed for the gradient operation, so we could not compare it here. GPUmat is free and open source and a viable alternative to Jacket in theory, but we were unable to use it with our reconstruction code due to an unresolvable memory access error. Jacket is the only highlevel software package to support sparse matrix operations on the GPU, which allows classes of compressed sensing algorithms that use sparse matrix operators to be used. (e.g., the gradient operation can be implemented as a bidiagonal matrix.) Also, the overloaded GPU functions in Jacket are implemented as MEX files, which are precompiled C/C++ functions callable from within Matlab; so writing custom CUDA subroutines could only eliminate the function call overhead and not speed up the individual SIMD operations.
5. Conclusion
We have shown that GPU computation significantly accelerated CS MRI reconstruction of all but the smallest of the tested image sizes. The combination of Matlab and Jacket yields a processing package that is able to realize over half of the theoretical maximum performance of the GPU, while requiring minimal code development.
The speedup realized by the GPU for the smallest images was progressively hampered by communication overhead, while the largest images suffered from the limited GPU RAM. The optimal image dimensions, however, seem to be serendipitously close to that of highresolution MRI data; so GPU computing, coupled with the Goldstein and Osher split Bregman algorithm, appears to be a wellsuited platform for rapid CS MRI reconstruction.
Future improvements to these methods include algorithm modifications to allow unified reconstruction of multiple twodimensional or a single threedimensional Fourier data set on the GPU with a single call to the reconstructor, thus reducing the communication penalty. Additional cores on the GPU card could allow higher acceleration, since we found that the split Bregman algorithm parallelizes extremely well. And finally, more RAM onboard the GPU would allow larger data sets to be reconstructed more efficiently.
Acknowledgments
The authors received helpful advice from J. Melonakos and P. Yalamanchili from AccelerEyes, Inc., and funding from the National Institutes of Health through NIBIB T32 EB001628, NIBIB RO1 EB00461, NCI 1U01CA142565, and NCI 5R01CA138599.