#### Abstract

*Purpose*. To develop generic optimization strategies for image reconstruction using graphical processing units (GPUs) in magnetic resonance imaging (MRI) and to exemplarily report on our experience with a highly accelerated implementation of the nonlinear inversion (NLINV) algorithm for dynamic MRI with high frame rates.* Methods*. The NLINV algorithm is optimized and ported to run on a multi-GPU single-node server. The algorithm is mapped to multiple GPUs by decomposing the data domain along the channel dimension. Furthermore, the algorithm is decomposed along the temporal domain by relaxing a temporal regularization constraint, allowing the algorithm to work on multiple frames in parallel. Finally, an autotuning method is presented that is capable of combining different decomposition variants to achieve optimal algorithm performance in different imaging scenarios.* Results*. The algorithm is successfully ported to a multi-GPU system and allows online image reconstruction with high frame rates. Real-time reconstruction with low latency and frame rates up to 30 frames per second is demonstrated.* Conclusion*. Novel parallel decomposition methods are presented which are applicable to many iterative algorithms for dynamic MRI. Using these methods to parallelize the NLINV algorithm on multiple GPUs, it is possible to achieve online image reconstruction with high frame rates.

#### 1. Introduction

Accelerators such as graphical processing units (GPUs) or other multicore vector coprocessors are well-suited for achieving fast algorithm run times when applied to reconstruction problems in medical imaging [1] including computed tomography [2], positron emission tomography [3], and ultrasound [4]. This is because respective algorithms usually apply a massive number of the same independent operations on pixels, voxels, bins, or sampling points. The situation maps well to many-core vector coprocessors and their large number of wide floating-point units that are capable of executing operations on wide vectors in parallel. The memory bandwidth of accelerators is roughly one order of magnitude greater than that of central processing units; they are optimized for parallel throughput instead of latency of a single instruction stream.

Recent advances in magnetic resonance imaging (MRI) pose new challenges to the implementations of associated algorithms in order to keep data acquisition and reconstruction times on par. While acquisition times decrease dramatically as in real-time MRI, at the same time the data grows in size when using up to 64 or even 128 independent receiver channels on modern MRI systems. In addition, new modalities such as model-based reconstructions for (dynamic) parametric mapping increase the computational complexity of reconstruction algorithms because they in general involve iterative solutions to nonlinear inverse problems. As a consequence, accelerators are increasingly used to overcome these challenges [5–13].

Accelerators outperform main processors (CPUs) in all important operations during MRI reconstruction including interpolation [14], filtering [15], and basic linear algebra [16]. The central operation of most MRI reconstruction algorithms is the Fourier transform. Figure 1 shows the performance of the three FFT libraries cuFFT, clFFT, and FFTW [17]. The accelerator libraries cuFFT and clFFT outperform the CPU library FFTW. Consequently, accelerators outperform CPUs for MRI reconstruction, if the algorithm is based on the Fourier transform and if the ratio of computation to data transfer favours computation. Accelerator and CPU typically form distributed memory systems. Figure 2 shows the memory transfer speed for two different PCIe 3.0 systems in dependence of the data transfer size. The amount of computation assigned to an accelerator and the transfer size have a huge impact on the overall performance of the algorithm, which needs to be taken into account when implementing accelerated algorithms for image reconstruction. Another limiting factor for the use of accelerators in MRI is the restricted amount of memory available. Nevertheless, the on-board memory of accelerators increased in recent years and the newest generation of accelerators is equipped with acceptable amounts of memory (compare Table 1).

In this paper, we summarize our experience in developing a low-latency online reconstruction system for real-time MRI over the last eight years. While some of the described techniques have already been reported previously, this article for the first time explains all technical aspects of the complete multi-GPU implementation of the advanced iterative reconstruction algorithm and adds further background information and analysis. Although we focus on the specific application of real-time MRI, many of the techniques developed for this project can be applied to similar tomographic reconstruction problems. In particular, we describe strategies for optimal choice of the grid size for a convolution-based nonuniform FFT, novel parallelization schemes using temporal and spatial decomposition, and automatic tuning of parameters.

#### 2. Theory

##### 2.1. Real-Time MRI

Diagnostic imaging in real time represents a most demanding acquisition and reconstruction problem for MRI. In principle, the data acquisition refers to the recording of a large number of different radiofrequency signals in the time domain which are spatially encoded with the use of magnetic field gradients. The resulting dataset represents the* k*-space (or Fourier space) of the image. MRI acquisition times are determined by the number of different encodings needed for high-quality image reconstruction multiplied by the time required for recording a single MRI signal, that is, the so-called repetition time TR. While TR values could efficiently be reduced from seconds to milliseconds by the invention of low-flip angle gradient-echo MRI sequences, for example, see [18] for an early dynamic application, a further speed-up by reducing the number of encodings was limited by the properties of the Fourier transform which for insufficient coverage of* k*-space causes image blurring and/or aliasing artifacts. Reliable and robust improvements in acquisition speed by typically a factor of two were first achieved when parallel MRI techniques [19, 20] were introduced. These methods compensate for the loss of spatial information due to data undersampling by simultaneously acquiring multiple datasets with different receiver coils. In fact, when such multicoil arrangements are positioned around the desired field-of-view, for example, a head or thorax, each coil provides a dataset with a unique spatial sensitivity profile and thus complementary information. This redundancy may be exploited to recover the image from moderately undersampled* k*-space data and thus accelerate the scan.

Parallel MRI indeed was the first concept which changed the MRI reconstruction from a two-dimensional FFT to the solution of an inverse problem. To keep mathematics simple and computations fast, however, commercially available implementations turned the true nonlinear inverse problem, which emerges because the signal model contains the product of the desired complex image and all coil sensitivity profiles (i.e., complex images in themselves), into a linear inverse problem. This is accomplished by first determining the coil sensitivities with the use of a prescan or by performing a low-resolution Fourier transform reconstruction of acquisitions with full sampling in the center of* k*-space and undersampling only in outer regions.

Recent advances towards real-time MRI with so far unsurpassed spatiotemporal resolution and image quality [21] were therefore only possible by combining suitable acquisition techniques with an adequate image reconstruction. Crucial elements include the use of (i) rapid low-flip angle gradient-echo sequences, (ii) spatial encodings with radial rather than Cartesian trajectories, (iii) different (i.e., complementary) sets of spokes in successive frames of a dynamic acquisition (see Figure 3) [22], (iv) extreme data undersampling for each frame, (v) an image reconstruction algorithm (NLINV) that solves the nonlinear inverse problem (see below), and (vi) temporal regularization to the preceding frame which constrains the ill-conditioned numerical problem with physically plausible a priori knowledge. Real-time MRI with NLINV reconstruction achieves a temporal resolution of 10 to 40 ms per frame (i.e., 25 to 100 frames per second) depending on the actual application (e.g., anatomic imaging at different spatial resolutions or quantitative blood flow studies); for a recent review of cardiovascular applications, see [23].

The spatial encoding scheme is comprised of different sets (turns) of spokes. All sets of spokes taken together cover the -space uniformly. Figure 3 illustrates the real-time MRI encoding scheme.

##### 2.2. Image Reconstruction with NLINV

If the receive coil sensitivities in parallel MRI are known, the image recovery emerges as a linear inverse problem which can efficiently be solved using iterative methods. In practice, however, static sensitivities are obtained through extrapolation and, even more importantly, in a dynamic in vivo setting they change due to coupling with the conductive tissue: this situation applies to a human subject during any type of movement (e.g., breathing or cardiac-related processes) or when dynamically scanning different planes and orientations (e.g., during real-time monitoring of minimally invasive procedures).

In such situations, extrapolating static sensitivities is not sufficient. The sensitivities have to be jointly estimated together with the image, yielding an inverse reconstruction problem that is nonlinear and ill-posed. For real-time MRI, both the dynamic changes during a physiologic process and the need for extremely undersampled datasets (e.g., by a factor of 20) inevitably lead to a nonlinear inverse problem.

A powerful solution to this problem is the regularized nonlinear inverse reconstruction algorithm [24]. NLINV formulates the image reconstruction as a nonlinear least-squares problem The nonlinear forward operator maps the combined vector of the unknown complex-valued image and all coil sensitivities (i.e., complex-valued images in themselves) to the data. multiplies the image with the sensitivities to obtain individual coil images and then predicts the* k*-space data using a nonuniform Fourier transform for all coil elements (this nonuniform FFT is implemented here with a uniform FFT and convolution with the PSF [25], which is especially advantageous for implementation on a GPU [9]). The data fidelity term quantifies the difference between this predicted data and the measured data in the least-squares sense, while the additional regularization term addresses the ill-posedness of the reconstruction problem. Here, is a weighting matrix in the Fourier domain which characterizes the spatial smoothness of the coil sensitivities and does not change the image component of . For dynamic imaging, temporal regularization to the immediately preceding frame , which exploits the temporal continuity of a human movement or physiologic process, allows for a remarkably high degree of data undersampling or image acceleration [21].

In the following, the main steps of an efficient numerical implementation are described; for further details, see [9, 21]. First, a change of variables and is done to improve the conditioning of the reconstruction problem. Starting from an initial guess , the numerical problem is then solved using the iteratively regularized Gauss-Newton method (IRGNM), which uses the following linear update rule:Here, denotes the derivative of at and its adjoint. In each of the Newton steps, this linear system of equations is solved using the method of conjugate gradients (CG). The main operation in a matrix-free implementation is the repeated application of the operator withThe star means complex conjugation. and are the blocks of the block-diagonal operations and which operate on a single channel. Because the nonuniform Fourier transform is paired with its adjoint, it can be implemented efficiently as a truncated convolution with a point-spread function using two applications of an FFT algorithm on a twofold oversampled grid [25]. is implemented using one FFT followed by the application of a diagonal weighting matrix. Thus, a single iteration requires 4 applications of an FFT algorithm per channel, several pixelwise complex multiplications and additions, and one pixel-by-pixel reduction across all channels . A flowchart can be found in Figure 4.

The NLINV algorithm is an iterative algorithm consisting of operators , , and as well as the conjugate-gradient method. Depending on the imaging scenario, the number of Newton steps is fixed to to . is only computed once per Newton step. and are applied in each conjugate-gradient iteration. Each operator applies a 2D FFT twice on per-channel data. computes a scalar product once; the conjugate gradient method contains two scalar products over all channel data. Furthermore each operator applies between and elementwise operations and contains a summation over channel data. The performance of the Fourier transform dominates the run time of a single inner-loop iteration of the algorithm and it has to be applied times in each inner loop.

If channels are assumed, Newton steps are computed (yielding roughly 50 conjugate-gradient iterations); 2D Fourier transformations must be applied to compute a single image. If data is acquired at a frame rate of fps, the reconstruction system must be able to apply 2D Fourier transforms per second.

#### 3. Optimization Methods and Results

The following section outlines several steps undertaken to move a prototype implementation of the NLINV algorithm to a highly accelerated implementation for use in an online reconstruction pipeline. The original implementation in C utilizes the CUDA framework and is capable of reconstructing at speeds of 1 to 5 fps on a single GPU. A typical clinical scenario in real-time MRI is cardiac imaging where a rate of 30 per second is necessary. To incorporate real-time heart examination in a clinical work-flow, the image reconstruction algorithm must perform accordingly.

The optimization techniques are classified into two categories. The first category encompasses platform independent optimization procedures that reduce the overall computational cost of the algorithm by reducing vector sizes while maintaining image quality. In the second category, the algorithm is modified to take advantage of the multi-GPU computer platform by channel decomposition, temporal decomposition, and tuning of the two techniques to yield best performance for a given imaging scenario.

All benchmark results show the minimum wall-clock time of a number of runs. For microbenchmarks, hundreds of runs were performed. For full reconstruction benchmarks, tens of runs were performed. The speed-up is defined as the quotient of the old wall-clock time over the new time and parallel efficiency is defined aswith representing the speed-up achieved when using accelerators over . Perfect efficiency is achieved when or ; sublinear speed-up is measured if and .

If not stated otherwise, all benchmarks were obtained on a Supermicro SuperServer 4027GR-TR system with PCIe 3.0, 2x Intel Xeon Ivy Bridge-EP E5-2650 main processors, 8x NVIDIA Titan Black (Kepler GK110) accelerators with 6 GB graphics memory each, and 128 GB main system memory. Custom GPU power cables were fabricated for the Supermicro system to support consumer-grade accelerators. The CUDA run time environment version was 6.5 with GPU driver version 346.46. The operating system used was Ubuntu 14.04. The combined single-precision performance of the eight GPUs in this system approaches 41 TFLOPS and the combined memory bandwidth is 2688 Gb/s. For a comparison of the performances of some accelerators available at the time of writing of the article, see Table 1.

##### 3.1. Real-Time Pipeline

The real-time NLINV reconstruction algorithm is part of a larger signal processing pipeline that can be decomposed into several stages:(i)Datasource: reading of input data into memory(ii)Preprocessing: interpolating data acquired with a non-Cartesian trajectory onto a rectangular grid, correcting for gradient delays as well as performing the channel compression; this stage contains a calibration phase that calculates the channel compression transform matrix and estimates the gradient delay(iii)Reconstruction: the NLINV algorithm, reconstructing each image(iv)Postprocessing: cropping of the image to the measured field of view, calculation of phase-difference image (in case of phase-contrast flow MRI), and applying temporal and spatial filters(v)Datasink: writing of resulting image to output.

Real-time MRI datasets with high temporal resolution typically contain several hundreds of frames even when covering only a few seconds of imaging. For this reason, it is beneficial to parallelize the pipeline at the level of these functional stages: each stage can work on a different frame, or, in other words, different frames are processed by different stages of the pipeline in parallel. The implementation of such a pipeline follows the actor model [26] where each pipeline stage is one (or multiple, see temporal decomposition in Section 3.3) actor, receiving input data from the previous stage as message and passing results to the following stage as a different message. The first actor and the last actor only produce (datasource) and receive (datasink) messages, respectively.

Figure 5 shows the NLINV pipeline for reconstructing 10 frames. The pipeline has a prologue and an epilogue of 4 frames (number of pipeline stages minus one) where full parallel reconstruction is not possible because the pipeline is filling up or emptying.

##### 3.2. Computational Cost Reduction

The most compute-intensive part of the NLINV algorithm is the application of the Fourier transform. NLINV applies the Fourier transform multiple times in each iteration. Fast execution of the Fourier transform has thus a major impact on the overall run time of the algorithm. The FFT library cuFFT [27] of the hardware vendor is used. Plotting the performance of the FFT libraries against the input vector size does not yield a linear function, but shows significant fluctuation (compare Figure 6). The algorithm performs better for vector sizes which are factorizable by small prime numbers, with the smallest prime numbers yielding the best results.

The grid oversampling ratio defines the ratio of grid sample locations over the side length of the output image . The ratio between and includes an inherent factor of that exists as the convolution with the point-spread function requires twofold oversampling. An additional factor can be set to values between and to prevent aliasing artifacts for situations where the operator chooses a field of view smaller than the object size. The relationship of to is thus defined asIn the following, only is specified. It is found that considering the constraints of the FFT algorithm when choosing can yield calculation speed-ups. A lookup table is generated that maps grid size to FFT performance by benchmarking the Fourier transform algorithm for all relevant input sizes. This lookup table is generated for different accelerator generations as well as for new versions of the FFT library. The grid oversampling ratio is adjusted according to the highest measured FFT performance with a minimum oversampling ratio of . Table 2 shows the speed-up when comparing to a fixed oversampling ratio of . Figure 6 is a graphical representation of the lookup table that is used to select optimal grid sizes. It highlights the run time difference between two very similar vector sizes of and .

The NLINV algorithm estimates not only the spin-density map , but also the coil sensitivity profiles . The regularization term added to the Gauss-Newton solver constrains the coil sensitivity profiles to a few low-frequency components. It is thus possible to not store the entire vector , but to reduce its size to of the original. The number of grid sample locations is thus reduced to for the coil sensitivity profiles . Whenever the original size is required, the vector is padded with zeroes. Figure 7 shows a typical weighting function in 1D as applied to and the cut-off that is not stored after this optimization. This saves computing time as functions applied on have to compute less. For this technique to yield a speed-up, the computation time saved must exceed the time required to crop and pad the vectors. Table 3 shows the speed-up achieved by this optimization for various data sizes.

##### 3.3. High-Level Parallelization with Autotuning

In parallel MRI, the reconstruction problem can be partitioned across the domain of multiple receive channels used for signal reception [10]. The measured data , corresponding coil sensitivity maps , and associated intermediate variables can be assigned to different accelerators . The image has to be duplicated on all accelerators. The sum is partitioned across accelerators according towhere is the subset of channels assigned to accelerator . This summation amounts to an all-reduce operation, since all accelerators require the computed updates to . An alternative decomposition would require communication between GPUs within the FFT operation which is not feasible for the typical vector dimensions in this problem. The resulting speed-up from this parallel decomposition originates from the fact that the FFT of all channels can be computed in parallel. With only one accelerator, all Fourier transforms have to be computed by a single accelerator with a batch size equal to . A batched FFT computes multiple FFTs of the same size and type on multiple blocks of memory either in parallel or in a sequential manner. With multiple accelerators, is divided by the number of accelerators . This is illustrated in Table 4 depicting the computation time of Fourier transforms of 2D vectors of various side lengths for different numbers of GPUs.

The parallel efficiency of distributing a batched FFT across devices decreases from two to three and four accelerators, because an accelerator can compute multiple 2D FFTs at the same time. In addition, the NLINV implementation uses 10 compressed channels which cannot be divided by 3 or 4 with no remainder.

Furthermore, the speed-up is reduced by the communication overhead of calculating which increases with the number of accelerators. Table 5 shows the reconstruction speed and speed-up for differently sized datasets and varying number of accelerators. To reduce the communication overhead, a peer-to-peer communication technique is employed that allows accelerators to directly access memory of neighbouring accelerators. This is only possible, if accelerators share a PCIe domain. The Supermicro system has 2 PCIe domains that each connect 4 accelerators. The maximum effective number of accelerators that different channels are assigned to is therefore .

Because the parallel efficiency of the channel decomposition is limited, investigations focused on decomposing the problem along the temporal domain which results in reconstructions of multiple frames at the same time. The standard formulation of the NLINV algorithm prohibits a problem decomposition along the temporal domain. Frame must strictly follow frame as serves as starting and iterative regularization value for . While maintaining the necessary temporal order, a slight relaxation of the temporal regularization constraint allows for the reconstruction of multiple frames at the same time. The following scheme ensures that the difference in the results of in-order and out-of-order image reconstruction remains minimal. The first frame is defined with set to unity and set to zero. The function maps frame for each and Newton step to initialization and regularizations values :

Due to the complementary data acquisition pattern for sequential frames, the initial frames are of poor quality. The first images of a series are thus reconstructed in a strict in-order sequence. This allows the algorithm to reach the best image quality in the shortest amount of time. From frame on, frames may be reconstructed in parallel. Initialization and regularization values are chosen to be the most recent available frame within the range . The only exception is regularization values in the last Newton step , where must be used for regularization. The algorithm thus waits for frame to be computed before proceeding to the last Newton step (compare Figure 8). Experimental validation shows that the best match of in-order and out-of-order processing while maintaining a speed-up can be achieved by setting to the number of turns and to roughly half the number of turns in the interleaved sampling scheme.

Channel decomposition and temporal decomposition can be applied for different imaging situations in different ways. Depending on the acquisition and reconstruction parameters, the number of reconstruction threads (temporal decomposition) and the number of accelerators per reconstruction thread (channel decomposition) can be adjusted. A prerequisite to this method is that all settings yield the same image quality. The parameters and that have the most impact on image reconstruction speed include(i)the imaging mode: single-slice anatomy, multislice anatomy, and phase-contrast flow,(ii)the data size which depends on the field of view and the chosen resolution,(iii)the number of frames acquired,(iv)the number of virtual channels used.

The number of parameters and the ever-evolving measurement protocols in a research setting make it difficult to come up with a model that maps the set of acquisition and reconstruction parameters to the optimal set of parallelization parameters . Therefore, an autotuning mechanism is employed: The algorithm measures its own run time and stores its performance along with all acquisition, reconstruction, and parallelization parameters in a database . For a given set of ( and ), the autotuning mechanism can select all recorded run times , sort them, and select the set of parallelization parameters that yields the best performance.

The autotuning has an optional learning mode to populate the database with varying parallelization parameters . If the learning mode is active, the algorithm searches the database for matching performance data and chooses parallelization arguments that do not yet exist in the database. The search space for is limited; there are only 16 sets of arguments for the 8-fold GPU reconstruction system used here. The reason for this is the restriction of the channel decomposition stage to the size of the PCIe domain due to peer-to-peer memory access. If then , if , then , and if then . Table 6 shows an excerpt from the autotuning database.

In a clinical setting, all relevant protocols may undergo a learning phase populating the database and covering the entire search space. This can be done in a setup phase to ensure optimal run times for all clinical scans. The algorithm is also capable of sorting acquisition and reconstruction parameters. This allows the autotuning mechanism to find good parameters for new measurement protocols it has never seen before.

#### 4. Discussion

This work describes the successful development of a highly parallelized implementation of the NLINV algorithm for real-time MRI where dynamic image series are reconstructed, displayed, and stored with little or almost no delay. The computationally demanding iterative algorithm for dynamic MRI was decomposed and mapped to massive parallel hardware. This implementation is deployed to several research groups and represents a cornerstone for the evaluation of the clinical relevance of a variety of real-time MRI applications, mainly in the field of cardiac MRI [23, 28], quantitative phase-contrast flow MRI [29, 30], and novel fields such as oropharyngeal functions during swallowing [31], speaking [32], and brass playing [33].

As examples, Supplementary Videos 1 and 2 show T1-weighted radial FLASH MRI acquisitions at 33.32 ms acquisition time (TR = 1.96 ms, 17 radial spokes covering* k*-space) of a human heart in a short-axis view and midsagittal tongue movements of an elite horn player, respectively. The cardiac example employed a mm^{3} field of view, 1.6 mm in-plane resolution (6 mm slice thickness), and data samples per spoke (i.e., grid size) which resulted in a slightly delayed reconstruction speed of about 22 fps versus 30 fps acquisition speed. In contrast, the horn study yielded real-time reconstruction speed of 30 fps because of a slightly smaller mm^{3} field of view at 1.4 mm in-plane resolution (8 mm slice thickness) and only samples per spoke.

Some of the optimization methods discussed here are specific to real-time MRI algorithms, but the majority of techniques translate well to other MRI applications or even to other medical imaging modalities. Functional decomposition can be employed in any signal or image processing pipeline that is built on multiple processing units like multicore CPUs, heterogeneous systems, or distributed systems comprised of FPGAs, digital signal processors, microprocessors, ASICs, and so on. Autotuning is another generic technique that can be useful in many other applications: if an algorithm implementation exposes parallelization parameters (number of threads, number of GPUs, distribution ratios, etc.), it can be tuned for each imaging scenario and each platform individually. The grid size optimization technique can also be universally employed: if an algorithm allows choosing the data size (within certain boundaries) due to regridding or interpolation, the grid size should be chosen such that the following processing steps exhibit optimal performance. This is valuable, if the following processing steps do not exhibit linear performance such as specific FFT implementations.

#### 5. Conclusion

Novel parallel decomposition methods are presented which are applicable to many iterative algorithms for dynamic MRI. Using these methods to parallelize the NLINV algorithm on multiple GPUs, it is possible to achieve online image reconstruction with high frame rates. For suitable parameters choices, real-time reconstruction can be achieved.

#### Conflicts of Interest

Martin Uecker and Jens Frahm hold a patent about the real-time MRI acquisition and reconstruction technique discussed here. All other authors declare that they have no conflicts of interest.

#### Acknowledgments

The authors acknowledge the support by the German Research Foundation and the Open Access Publication Funds of Göttingen University.

#### Supplementary Materials

*Supplementary 1. *Video S1: T1-weighted radial FLASH MRI acquisitions with NLINV reconstruction at 33.32 ms acquisition time (TR = 1.96 ms, 17 radial spokes) of a human heart in a short-axis view ( mm^{3} field of view, 1.6 mm in-plane resolution, and 6 mm slice thickness).

*Supplementary 2. *Video S2: T1-weighted radial FLASH MRI acquisitions with NLINV reconstruction at 33.32 ms acquisition time (TR = 1.96 ms, 17 radial spokes) of midsagittal tongue movements of an elite horn player ( mm^{3} field of view at 1.4 mm in-plane resolution, and 8 mm slice thickness).