Research Article  Open Access
Sebastian Schaetz, Dirk Voit, Jens Frahm, Martin Uecker, "Accelerated Computing in Magnetic Resonance Imaging: RealTime Imaging Using Nonlinear Inverse Reconstruction", Computational and Mathematical Methods in Medicine, vol. 2017, Article ID 3527269, 11 pages, 2017. https://doi.org/10.1155/2017/3527269
Accelerated Computing in Magnetic Resonance Imaging: RealTime Imaging Using Nonlinear Inverse Reconstruction
Abstract
Purpose. To develop generic optimization strategies for image reconstruction using graphical processing units (GPUs) in magnetic resonance imaging (MRI) and to exemplarily report on our experience with a highly accelerated implementation of the nonlinear inversion (NLINV) algorithm for dynamic MRI with high frame rates. Methods. The NLINV algorithm is optimized and ported to run on a multiGPU singlenode server. The algorithm is mapped to multiple GPUs by decomposing the data domain along the channel dimension. Furthermore, the algorithm is decomposed along the temporal domain by relaxing a temporal regularization constraint, allowing the algorithm to work on multiple frames in parallel. Finally, an autotuning method is presented that is capable of combining different decomposition variants to achieve optimal algorithm performance in different imaging scenarios. Results. The algorithm is successfully ported to a multiGPU system and allows online image reconstruction with high frame rates. Realtime reconstruction with low latency and frame rates up to 30 frames per second is demonstrated. Conclusion. Novel parallel decomposition methods are presented which are applicable to many iterative algorithms for dynamic MRI. Using these methods to parallelize the NLINV algorithm on multiple GPUs, it is possible to achieve online image reconstruction with high frame rates.
1. Introduction
Accelerators such as graphical processing units (GPUs) or other multicore vector coprocessors are wellsuited for achieving fast algorithm run times when applied to reconstruction problems in medical imaging [1] including computed tomography [2], positron emission tomography [3], and ultrasound [4]. This is because respective algorithms usually apply a massive number of the same independent operations on pixels, voxels, bins, or sampling points. The situation maps well to manycore vector coprocessors and their large number of wide floatingpoint units that are capable of executing operations on wide vectors in parallel. The memory bandwidth of accelerators is roughly one order of magnitude greater than that of central processing units; they are optimized for parallel throughput instead of latency of a single instruction stream.
Recent advances in magnetic resonance imaging (MRI) pose new challenges to the implementations of associated algorithms in order to keep data acquisition and reconstruction times on par. While acquisition times decrease dramatically as in realtime MRI, at the same time the data grows in size when using up to 64 or even 128 independent receiver channels on modern MRI systems. In addition, new modalities such as modelbased reconstructions for (dynamic) parametric mapping increase the computational complexity of reconstruction algorithms because they in general involve iterative solutions to nonlinear inverse problems. As a consequence, accelerators are increasingly used to overcome these challenges [5–13].
Accelerators outperform main processors (CPUs) in all important operations during MRI reconstruction including interpolation [14], filtering [15], and basic linear algebra [16]. The central operation of most MRI reconstruction algorithms is the Fourier transform. Figure 1 shows the performance of the three FFT libraries cuFFT, clFFT, and FFTW [17]. The accelerator libraries cuFFT and clFFT outperform the CPU library FFTW. Consequently, accelerators outperform CPUs for MRI reconstruction, if the algorithm is based on the Fourier transform and if the ratio of computation to data transfer favours computation. Accelerator and CPU typically form distributed memory systems. Figure 2 shows the memory transfer speed for two different PCIe 3.0 systems in dependence of the data transfer size. The amount of computation assigned to an accelerator and the transfer size have a huge impact on the overall performance of the algorithm, which needs to be taken into account when implementing accelerated algorithms for image reconstruction. Another limiting factor for the use of accelerators in MRI is the restricted amount of memory available. Nevertheless, the onboard memory of accelerators increased in recent years and the newest generation of accelerators is equipped with acceptable amounts of memory (compare Table 1).

In this paper, we summarize our experience in developing a lowlatency online reconstruction system for realtime MRI over the last eight years. While some of the described techniques have already been reported previously, this article for the first time explains all technical aspects of the complete multiGPU implementation of the advanced iterative reconstruction algorithm and adds further background information and analysis. Although we focus on the specific application of realtime MRI, many of the techniques developed for this project can be applied to similar tomographic reconstruction problems. In particular, we describe strategies for optimal choice of the grid size for a convolutionbased nonuniform FFT, novel parallelization schemes using temporal and spatial decomposition, and automatic tuning of parameters.
2. Theory
2.1. RealTime MRI
Diagnostic imaging in real time represents a most demanding acquisition and reconstruction problem for MRI. In principle, the data acquisition refers to the recording of a large number of different radiofrequency signals in the time domain which are spatially encoded with the use of magnetic field gradients. The resulting dataset represents the kspace (or Fourier space) of the image. MRI acquisition times are determined by the number of different encodings needed for highquality image reconstruction multiplied by the time required for recording a single MRI signal, that is, the socalled repetition time TR. While TR values could efficiently be reduced from seconds to milliseconds by the invention of lowflip angle gradientecho MRI sequences, for example, see [18] for an early dynamic application, a further speedup by reducing the number of encodings was limited by the properties of the Fourier transform which for insufficient coverage of kspace causes image blurring and/or aliasing artifacts. Reliable and robust improvements in acquisition speed by typically a factor of two were first achieved when parallel MRI techniques [19, 20] were introduced. These methods compensate for the loss of spatial information due to data undersampling by simultaneously acquiring multiple datasets with different receiver coils. In fact, when such multicoil arrangements are positioned around the desired fieldofview, for example, a head or thorax, each coil provides a dataset with a unique spatial sensitivity profile and thus complementary information. This redundancy may be exploited to recover the image from moderately undersampled kspace data and thus accelerate the scan.
Parallel MRI indeed was the first concept which changed the MRI reconstruction from a twodimensional FFT to the solution of an inverse problem. To keep mathematics simple and computations fast, however, commercially available implementations turned the true nonlinear inverse problem, which emerges because the signal model contains the product of the desired complex image and all coil sensitivity profiles (i.e., complex images in themselves), into a linear inverse problem. This is accomplished by first determining the coil sensitivities with the use of a prescan or by performing a lowresolution Fourier transform reconstruction of acquisitions with full sampling in the center of kspace and undersampling only in outer regions.
Recent advances towards realtime MRI with so far unsurpassed spatiotemporal resolution and image quality [21] were therefore only possible by combining suitable acquisition techniques with an adequate image reconstruction. Crucial elements include the use of (i) rapid lowflip angle gradientecho sequences, (ii) spatial encodings with radial rather than Cartesian trajectories, (iii) different (i.e., complementary) sets of spokes in successive frames of a dynamic acquisition (see Figure 3) [22], (iv) extreme data undersampling for each frame, (v) an image reconstruction algorithm (NLINV) that solves the nonlinear inverse problem (see below), and (vi) temporal regularization to the preceding frame which constrains the illconditioned numerical problem with physically plausible a priori knowledge. Realtime MRI with NLINV reconstruction achieves a temporal resolution of 10 to 40 ms per frame (i.e., 25 to 100 frames per second) depending on the actual application (e.g., anatomic imaging at different spatial resolutions or quantitative blood flow studies); for a recent review of cardiovascular applications, see [23].
The spatial encoding scheme is comprised of different sets (turns) of spokes. All sets of spokes taken together cover the space uniformly. Figure 3 illustrates the realtime MRI encoding scheme.
2.2. Image Reconstruction with NLINV
If the receive coil sensitivities in parallel MRI are known, the image recovery emerges as a linear inverse problem which can efficiently be solved using iterative methods. In practice, however, static sensitivities are obtained through extrapolation and, even more importantly, in a dynamic in vivo setting they change due to coupling with the conductive tissue: this situation applies to a human subject during any type of movement (e.g., breathing or cardiacrelated processes) or when dynamically scanning different planes and orientations (e.g., during realtime monitoring of minimally invasive procedures).
In such situations, extrapolating static sensitivities is not sufficient. The sensitivities have to be jointly estimated together with the image, yielding an inverse reconstruction problem that is nonlinear and illposed. For realtime MRI, both the dynamic changes during a physiologic process and the need for extremely undersampled datasets (e.g., by a factor of 20) inevitably lead to a nonlinear inverse problem.
A powerful solution to this problem is the regularized nonlinear inverse reconstruction algorithm [24]. NLINV formulates the image reconstruction as a nonlinear leastsquares problem The nonlinear forward operator maps the combined vector of the unknown complexvalued image and all coil sensitivities (i.e., complexvalued images in themselves) to the data. multiplies the image with the sensitivities to obtain individual coil images and then predicts the kspace data using a nonuniform Fourier transform for all coil elements (this nonuniform FFT is implemented here with a uniform FFT and convolution with the PSF [25], which is especially advantageous for implementation on a GPU [9]). The data fidelity term quantifies the difference between this predicted data and the measured data in the leastsquares sense, while the additional regularization term addresses the illposedness of the reconstruction problem. Here, is a weighting matrix in the Fourier domain which characterizes the spatial smoothness of the coil sensitivities and does not change the image component of . For dynamic imaging, temporal regularization to the immediately preceding frame , which exploits the temporal continuity of a human movement or physiologic process, allows for a remarkably high degree of data undersampling or image acceleration [21].
In the following, the main steps of an efficient numerical implementation are described; for further details, see [9, 21]. First, a change of variables and is done to improve the conditioning of the reconstruction problem. Starting from an initial guess , the numerical problem is then solved using the iteratively regularized GaussNewton method (IRGNM), which uses the following linear update rule:Here, denotes the derivative of at and its adjoint. In each of the Newton steps, this linear system of equations is solved using the method of conjugate gradients (CG). The main operation in a matrixfree implementation is the repeated application of the operator withThe star means complex conjugation. and are the blocks of the blockdiagonal operations and which operate on a single channel. Because the nonuniform Fourier transform is paired with its adjoint, it can be implemented efficiently as a truncated convolution with a pointspread function using two applications of an FFT algorithm on a twofold oversampled grid [25]. is implemented using one FFT followed by the application of a diagonal weighting matrix. Thus, a single iteration requires 4 applications of an FFT algorithm per channel, several pixelwise complex multiplications and additions, and one pixelbypixel reduction across all channels . A flowchart can be found in Figure 4.
The NLINV algorithm is an iterative algorithm consisting of operators , , and as well as the conjugategradient method. Depending on the imaging scenario, the number of Newton steps is fixed to to . is only computed once per Newton step. and are applied in each conjugategradient iteration. Each operator applies a 2D FFT twice on perchannel data. computes a scalar product once; the conjugate gradient method contains two scalar products over all channel data. Furthermore each operator applies between and elementwise operations and contains a summation over channel data. The performance of the Fourier transform dominates the run time of a single innerloop iteration of the algorithm and it has to be applied times in each inner loop.
If channels are assumed, Newton steps are computed (yielding roughly 50 conjugategradient iterations); 2D Fourier transformations must be applied to compute a single image. If data is acquired at a frame rate of fps, the reconstruction system must be able to apply 2D Fourier transforms per second.
3. Optimization Methods and Results
The following section outlines several steps undertaken to move a prototype implementation of the NLINV algorithm to a highly accelerated implementation for use in an online reconstruction pipeline. The original implementation in C utilizes the CUDA framework and is capable of reconstructing at speeds of 1 to 5 fps on a single GPU. A typical clinical scenario in realtime MRI is cardiac imaging where a rate of 30 per second is necessary. To incorporate realtime heart examination in a clinical workflow, the image reconstruction algorithm must perform accordingly.
The optimization techniques are classified into two categories. The first category encompasses platform independent optimization procedures that reduce the overall computational cost of the algorithm by reducing vector sizes while maintaining image quality. In the second category, the algorithm is modified to take advantage of the multiGPU computer platform by channel decomposition, temporal decomposition, and tuning of the two techniques to yield best performance for a given imaging scenario.
All benchmark results show the minimum wallclock time of a number of runs. For microbenchmarks, hundreds of runs were performed. For full reconstruction benchmarks, tens of runs were performed. The speedup is defined as the quotient of the old wallclock time over the new time and parallel efficiency is defined aswith representing the speedup achieved when using accelerators over . Perfect efficiency is achieved when or ; sublinear speedup is measured if and .
If not stated otherwise, all benchmarks were obtained on a Supermicro SuperServer 4027GRTR system with PCIe 3.0, 2x Intel Xeon Ivy BridgeEP E52650 main processors, 8x NVIDIA Titan Black (Kepler GK110) accelerators with 6 GB graphics memory each, and 128 GB main system memory. Custom GPU power cables were fabricated for the Supermicro system to support consumergrade accelerators. The CUDA run time environment version was 6.5 with GPU driver version 346.46. The operating system used was Ubuntu 14.04. The combined singleprecision performance of the eight GPUs in this system approaches 41 TFLOPS and the combined memory bandwidth is 2688 Gb/s. For a comparison of the performances of some accelerators available at the time of writing of the article, see Table 1.
3.1. RealTime Pipeline
The realtime NLINV reconstruction algorithm is part of a larger signal processing pipeline that can be decomposed into several stages:(i)Datasource: reading of input data into memory(ii)Preprocessing: interpolating data acquired with a nonCartesian trajectory onto a rectangular grid, correcting for gradient delays as well as performing the channel compression; this stage contains a calibration phase that calculates the channel compression transform matrix and estimates the gradient delay(iii)Reconstruction: the NLINV algorithm, reconstructing each image(iv)Postprocessing: cropping of the image to the measured field of view, calculation of phasedifference image (in case of phasecontrast flow MRI), and applying temporal and spatial filters(v)Datasink: writing of resulting image to output.
Realtime MRI datasets with high temporal resolution typically contain several hundreds of frames even when covering only a few seconds of imaging. For this reason, it is beneficial to parallelize the pipeline at the level of these functional stages: each stage can work on a different frame, or, in other words, different frames are processed by different stages of the pipeline in parallel. The implementation of such a pipeline follows the actor model [26] where each pipeline stage is one (or multiple, see temporal decomposition in Section 3.3) actor, receiving input data from the previous stage as message and passing results to the following stage as a different message. The first actor and the last actor only produce (datasource) and receive (datasink) messages, respectively.
Figure 5 shows the NLINV pipeline for reconstructing 10 frames. The pipeline has a prologue and an epilogue of 4 frames (number of pipeline stages minus one) where full parallel reconstruction is not possible because the pipeline is filling up or emptying.
3.2. Computational Cost Reduction
The most computeintensive part of the NLINV algorithm is the application of the Fourier transform. NLINV applies the Fourier transform multiple times in each iteration. Fast execution of the Fourier transform has thus a major impact on the overall run time of the algorithm. The FFT library cuFFT [27] of the hardware vendor is used. Plotting the performance of the FFT libraries against the input vector size does not yield a linear function, but shows significant fluctuation (compare Figure 6). The algorithm performs better for vector sizes which are factorizable by small prime numbers, with the smallest prime numbers yielding the best results.
The grid oversampling ratio defines the ratio of grid sample locations over the side length of the output image . The ratio between and includes an inherent factor of that exists as the convolution with the pointspread function requires twofold oversampling. An additional factor can be set to values between and to prevent aliasing artifacts for situations where the operator chooses a field of view smaller than the object size. The relationship of to is thus defined asIn the following, only is specified. It is found that considering the constraints of the FFT algorithm when choosing can yield calculation speedups. A lookup table is generated that maps grid size to FFT performance by benchmarking the Fourier transform algorithm for all relevant input sizes. This lookup table is generated for different accelerator generations as well as for new versions of the FFT library. The grid oversampling ratio is adjusted according to the highest measured FFT performance with a minimum oversampling ratio of . Table 2 shows the speedup when comparing to a fixed oversampling ratio of . Figure 6 is a graphical representation of the lookup table that is used to select optimal grid sizes. It highlights the run time difference between two very similar vector sizes of and .

The NLINV algorithm estimates not only the spindensity map , but also the coil sensitivity profiles . The regularization term added to the GaussNewton solver constrains the coil sensitivity profiles to a few lowfrequency components. It is thus possible to not store the entire vector , but to reduce its size to of the original. The number of grid sample locations is thus reduced to for the coil sensitivity profiles . Whenever the original size is required, the vector is padded with zeroes. Figure 7 shows a typical weighting function in 1D as applied to and the cutoff that is not stored after this optimization. This saves computing time as functions applied on have to compute less. For this technique to yield a speedup, the computation time saved must exceed the time required to crop and pad the vectors. Table 3 shows the speedup achieved by this optimization for various data sizes.

3.3. HighLevel Parallelization with Autotuning
In parallel MRI, the reconstruction problem can be partitioned across the domain of multiple receive channels used for signal reception [10]. The measured data , corresponding coil sensitivity maps , and associated intermediate variables can be assigned to different accelerators . The image has to be duplicated on all accelerators. The sum is partitioned across accelerators according towhere is the subset of channels assigned to accelerator . This summation amounts to an allreduce operation, since all accelerators require the computed updates to . An alternative decomposition would require communication between GPUs within the FFT operation which is not feasible for the typical vector dimensions in this problem. The resulting speedup from this parallel decomposition originates from the fact that the FFT of all channels can be computed in parallel. With only one accelerator, all Fourier transforms have to be computed by a single accelerator with a batch size equal to . A batched FFT computes multiple FFTs of the same size and type on multiple blocks of memory either in parallel or in a sequential manner. With multiple accelerators, is divided by the number of accelerators . This is illustrated in Table 4 depicting the computation time of Fourier transforms of 2D vectors of various side lengths for different numbers of GPUs.

The parallel efficiency of distributing a batched FFT across devices decreases from two to three and four accelerators, because an accelerator can compute multiple 2D FFTs at the same time. In addition, the NLINV implementation uses 10 compressed channels which cannot be divided by 3 or 4 with no remainder.
Furthermore, the speedup is reduced by the communication overhead of calculating which increases with the number of accelerators. Table 5 shows the reconstruction speed and speedup for differently sized datasets and varying number of accelerators. To reduce the communication overhead, a peertopeer communication technique is employed that allows accelerators to directly access memory of neighbouring accelerators. This is only possible, if accelerators share a PCIe domain. The Supermicro system has 2 PCIe domains that each connect 4 accelerators. The maximum effective number of accelerators that different channels are assigned to is therefore .

Because the parallel efficiency of the channel decomposition is limited, investigations focused on decomposing the problem along the temporal domain which results in reconstructions of multiple frames at the same time. The standard formulation of the NLINV algorithm prohibits a problem decomposition along the temporal domain. Frame must strictly follow frame as serves as starting and iterative regularization value for . While maintaining the necessary temporal order, a slight relaxation of the temporal regularization constraint allows for the reconstruction of multiple frames at the same time. The following scheme ensures that the difference in the results of inorder and outoforder image reconstruction remains minimal. The first frame is defined with set to unity and set to zero. The function maps frame for each and Newton step to initialization and regularizations values :
Due to the complementary data acquisition pattern for sequential frames, the initial frames are of poor quality. The first images of a series are thus reconstructed in a strict inorder sequence. This allows the algorithm to reach the best image quality in the shortest amount of time. From frame on, frames may be reconstructed in parallel. Initialization and regularization values are chosen to be the most recent available frame within the range . The only exception is regularization values in the last Newton step , where must be used for regularization. The algorithm thus waits for frame to be computed before proceeding to the last Newton step (compare Figure 8). Experimental validation shows that the best match of inorder and outoforder processing while maintaining a speedup can be achieved by setting to the number of turns and to roughly half the number of turns in the interleaved sampling scheme.
Channel decomposition and temporal decomposition can be applied for different imaging situations in different ways. Depending on the acquisition and reconstruction parameters, the number of reconstruction threads (temporal decomposition) and the number of accelerators per reconstruction thread (channel decomposition) can be adjusted. A prerequisite to this method is that all settings yield the same image quality. The parameters and that have the most impact on image reconstruction speed include(i)the imaging mode: singleslice anatomy, multislice anatomy, and phasecontrast flow,(ii)the data size which depends on the field of view and the chosen resolution,(iii)the number of frames acquired,(iv)the number of virtual channels used.
The number of parameters and the everevolving measurement protocols in a research setting make it difficult to come up with a model that maps the set of acquisition and reconstruction parameters to the optimal set of parallelization parameters . Therefore, an autotuning mechanism is employed: The algorithm measures its own run time and stores its performance along with all acquisition, reconstruction, and parallelization parameters in a database . For a given set of ( and ), the autotuning mechanism can select all recorded run times , sort them, and select the set of parallelization parameters that yields the best performance.
The autotuning has an optional learning mode to populate the database with varying parallelization parameters . If the learning mode is active, the algorithm searches the database for matching performance data and chooses parallelization arguments that do not yet exist in the database. The search space for is limited; there are only 16 sets of arguments for the 8fold GPU reconstruction system used here. The reason for this is the restriction of the channel decomposition stage to the size of the PCIe domain due to peertopeer memory access. If then , if , then , and if then . Table 6 shows an excerpt from the autotuning database.

In a clinical setting, all relevant protocols may undergo a learning phase populating the database and covering the entire search space. This can be done in a setup phase to ensure optimal run times for all clinical scans. The algorithm is also capable of sorting acquisition and reconstruction parameters. This allows the autotuning mechanism to find good parameters for new measurement protocols it has never seen before.
4. Discussion
This work describes the successful development of a highly parallelized implementation of the NLINV algorithm for realtime MRI where dynamic image series are reconstructed, displayed, and stored with little or almost no delay. The computationally demanding iterative algorithm for dynamic MRI was decomposed and mapped to massive parallel hardware. This implementation is deployed to several research groups and represents a cornerstone for the evaluation of the clinical relevance of a variety of realtime MRI applications, mainly in the field of cardiac MRI [23, 28], quantitative phasecontrast flow MRI [29, 30], and novel fields such as oropharyngeal functions during swallowing [31], speaking [32], and brass playing [33].
As examples, Supplementary Videos 1 and 2 show T1weighted radial FLASH MRI acquisitions at 33.32 ms acquisition time (TR = 1.96 ms, 17 radial spokes covering kspace) of a human heart in a shortaxis view and midsagittal tongue movements of an elite horn player, respectively. The cardiac example employed a mm^{3} field of view, 1.6 mm inplane resolution (6 mm slice thickness), and data samples per spoke (i.e., grid size) which resulted in a slightly delayed reconstruction speed of about 22 fps versus 30 fps acquisition speed. In contrast, the horn study yielded realtime reconstruction speed of 30 fps because of a slightly smaller mm^{3} field of view at 1.4 mm inplane resolution (8 mm slice thickness) and only samples per spoke.
Some of the optimization methods discussed here are specific to realtime MRI algorithms, but the majority of techniques translate well to other MRI applications or even to other medical imaging modalities. Functional decomposition can be employed in any signal or image processing pipeline that is built on multiple processing units like multicore CPUs, heterogeneous systems, or distributed systems comprised of FPGAs, digital signal processors, microprocessors, ASICs, and so on. Autotuning is another generic technique that can be useful in many other applications: if an algorithm implementation exposes parallelization parameters (number of threads, number of GPUs, distribution ratios, etc.), it can be tuned for each imaging scenario and each platform individually. The grid size optimization technique can also be universally employed: if an algorithm allows choosing the data size (within certain boundaries) due to regridding or interpolation, the grid size should be chosen such that the following processing steps exhibit optimal performance. This is valuable, if the following processing steps do not exhibit linear performance such as specific FFT implementations.
5. Conclusion
Novel parallel decomposition methods are presented which are applicable to many iterative algorithms for dynamic MRI. Using these methods to parallelize the NLINV algorithm on multiple GPUs, it is possible to achieve online image reconstruction with high frame rates. For suitable parameters choices, realtime reconstruction can be achieved.
Conflicts of Interest
Martin Uecker and Jens Frahm hold a patent about the realtime MRI acquisition and reconstruction technique discussed here. All other authors declare that they have no conflicts of interest.
Acknowledgments
The authors acknowledge the support by the German Research Foundation and the Open Access Publication Funds of Göttingen University.
Supplementary Materials
Supplementary 1. Video S1: T1weighted radial FLASH MRI acquisitions with NLINV reconstruction at 33.32 ms acquisition time (TR = 1.96 ms, 17 radial spokes) of a human heart in a shortaxis view ( mm^{3} field of view, 1.6 mm inplane resolution, and 6 mm slice thickness).
Supplementary 2. Video S2: T1weighted radial FLASH MRI acquisitions with NLINV reconstruction at 33.32 ms acquisition time (TR = 1.96 ms, 17 radial spokes) of midsagittal tongue movements of an elite horn player ( mm^{3} field of view at 1.4 mm inplane resolution, and 8 mm slice thickness).
References
 A. Eklund, P. Dufort, D. Forsberg, and S. M. LaConte, “Medical image processing on the GPU—Past, present and future,” Medical Image Analysis, vol. 17, no. 8, pp. 1073–1094, 2013. View at: Publisher Site  Google Scholar
 H. Scherl, B. Keck, M. Kowarschik, and J. Hornegger, “Fast GPUbased CT reconstruction using the Common Unified Device Architecture (CUDA),” in Proceedings of the IEEE Nuclear Science Symposium, Medical Imaging Conference, vol. 6, pp. 4464–4466, Honolulu, Hawaii, USA, November 2007. View at: Publisher Site  Google Scholar
 S. Schaetz and M. Kucera, “Integration of gpgpu methods into a pet reconstruction system,” in Proceedings of the 9th IASTED International Conference on Parallel and Distributed Computing and Networks, PDCN 2010, pp. 49–58, aut, February 2010. View at: Google Scholar
 Y. Dai, J. Tian, D. Dong, G. Yan, and H. Zheng, “Realtime visualized freehand 3D ultrasound reconstruction based on GPU,” IEEE Transactions on Information Technology in Biomedicine, vol. 14, no. 6, pp. 1338–1345, 2010. View at: Publisher Site  Google Scholar
 S. S. Stone, J. P. Haldar, S. C. Tsao, W.M. W. Hwu, B. P. Sutton, and Z.P. Liang, “Accelerating advanced MRI reconstructions on GPUs,” Journal of Parallel and Distributed Computing, vol. 68, no. 10, pp. 1307–1318, 2008. View at: Publisher Site  Google Scholar
 M. S. Hansen, D. Atkinson, and T. S. Sorensen, “Cartesian SENSE and kt SENSE reconstruction using commodity graphics hardware,” Magnetic Resonance in Medicine, vol. 59, no. 3, pp. 463–468, 2008. View at: Publisher Site  Google Scholar
 J. Gai, N. Obeid, J. L. Holtrop et al., “More IMPATIENT: A griddingaccelerated Toeplitzbased strategy for nonCartesian highresolution 3D MRI on GPUs,” Journal of Parallel and Distributed Computing, vol. 73, no. 5, pp. 686–697, 2013. View at: Publisher Site  Google Scholar
 T. S. Sorensen, D. Atkinson, T. Schaeffter, and M. S. Hansen, “Realtime reconstruction of sensitivity encoded radial magnetic resonance imaging using a graphics processing unit,” IEEE Transactions on Medical Imaging, vol. 28, no. 12, pp. 1974–1985, 2009. View at: Publisher Site  Google Scholar
 M. Uecker, S. Zhang, and J. Frahm, “Nonlinear inverse reconstruction for realtime MRI of the human heart using undersampled radial FLASH,” Magnetic Resonance in Medicine, vol. 63, no. 6, pp. 1456–1462, 2010. View at: Publisher Site  Google Scholar
 S. Schaetz and M. Uecker, “A MultiGPU Programming Library for RealTime Applications,” in Algorithms and Architectures for Parallel Processing, vol. 7439 of Lecture Notes in Computer Science, pp. 114–128, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. View at: Publisher Site  Google Scholar
 M. Murphy, M. Alley, J. Demmel, K. Keutzer, S. Vasanawala, and M. Lustig, “Fast l1SPIRiT Compressed Sensing Parallel Imaging MRI: Scalable Parallel Implementation and Clinically Feasible Runtime,” IEEE Transactions on Medical Imaging, vol. 31, no. 6, pp. 1250–1262, 2012. View at: Publisher Site  Google Scholar
 D. S. Smith, J. C. Gore, T. E. Yankeelov, and E. B. Welch, “Realtime compressive sensing MRI reconstruction using GPU computing and split Bregman methods,” International Journal of Biomedical Imaging, vol. 2012, Article ID 864827, 2012. View at: Publisher Site  Google Scholar
 M. Uecker, F. Ong, J. I. Tamir et al., “Berkeley Advanced Reconstruction Toolbox,” in Proceedings of the International Society for Magnetic Resonance in Medicine, vol. 2, p. 24863, Toronto, 2015. View at: Google Scholar
 A. Gregerson, “Implementing fast MRI gridding on GPUs via CUDA,” NVidia Whitepaper, 2008, http://cn.nvidia.com/docs/IO/47905/ECE757_Project_Report_Gregerson.pdf. View at: Google Scholar
 J. Liu, K. Huang, D. Zhang et al., “Nonlocal means denoising algorithm accelerated by GPU,” in Proceedings of the Sixth International Symposium on Multispectral Image Processing and Pattern Recognition, p. 749711, Yichang, China. View at: Publisher Site  Google Scholar
 V. Volkov and J. W. Demmel, “Benchmarking GPUs to tune dense linear algebra,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '08), pp. 31:1–31:11, November 2008. View at: Publisher Site  Google Scholar
 M. Frigo and S. G. Johnson, “FFTW: an adaptive software architecture for the FFT,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 3, pp. 1381–1384, IEEE, May 1998. View at: Google Scholar
 J. Frahm, A. Haase, and D. Matthaei, “Rapid NMR imaging of dynamic processes using the FLASII technique,” Magnetic Resonance in Medicine, vol. 3, no. 2, pp. 321–327, 1986. View at: Publisher Site  Google Scholar
 D. K. Sodickson and W. J. Manning, “Simultaneous acquisition of spatial harmonics (SMASH): fast imaging with radiofrequency coil arrays,” Magnetic Resonance in Medicine, vol. 38, no. 4, pp. 591–603, 1997. View at: Publisher Site  Google Scholar
 K. P. Pruessmann, M. Weiger, M. B. Scheidegger, and P. Boesiger, “SENSE: sensitivity encoding for fast MRI,” Magnetic Resonance in Medicine, vol. 42, no. 5, pp. 952–962, 1999. View at: Publisher Site  Google Scholar
 M. Uecker, S. Zhang, D. Voit, A. Karaus, K.D. Merboldt, and J. Frahm, “Realtime MRI at a resolution of 20 ms,” NMR in Biomedicine, vol. 23, no. 8, pp. 986–994, 2010. View at: Publisher Site  Google Scholar
 S. Zhang, K. T. Block, and J. Frahm, “Magnetic resonance imaging in real time: Advances using radial FLASH,” Journal of Magnetic Resonance Imaging, vol. 31, no. 1, pp. 101–109, 2010. View at: Publisher Site  Google Scholar
 S. Zhang, A. A. Joseph, D. Voit et al., “Realtime magnetic resonance imaging of cardiac function and flow  recent progress,” Quantitative Imaging in Medicine and Surgery, vol. 4, no. 5, p. 313, 2014. View at: Google Scholar
 M. Uecker, T. Hohage, K. T. Block, and J. Frahm, “Image reconstruction by regularized nonlinear inversion  Joint estimation of coil sensitivities and image content,” Magnetic Resonance in Medicine, vol. 60, no. 3, pp. 674–682, 2008. View at: Publisher Site  Google Scholar
 F. Wajer and K. Pruessmann, “Major speedup of reconstruction for sensitivity encoding with arbitrary trajectories,” in Proceedings of the International Society for Magnetic Resonance in Medicine, p. 767, 2001. View at: Google Scholar
 I. Greif, Semantics of communicating parallel processes [Ph.D. thesis], Massachusetts Institute of Technology, 1975.
 Nvidia, cuFFT library, 2010, http://docs.nvidia.com/cuda/cufft/.
 S. Zhang, M. Uecker, D. Voit, K.D. Merboldt, and J. Frahm, “Realtime cardiac MRI at high temporal resolution: radial FLASH with nonlinear inverse reconstruction,” Journal of Cardiovascular Magnetic Resonance, vol. 12, p. 39, 2010. View at: Publisher Site  Google Scholar
 A. A. Joseph, K.D. Merboldt, D. Voit et al., “Realtime phasecontrast MRI of cardiovascular blood flow using undersampled radial fast lowangle shot and nonlinear inverse reconstruction,” NMR in Biomedicine, vol. 25, no. 7, pp. 917–924, 2012. View at: Publisher Site  Google Scholar
 M. Untenberger, Z. Tan, D. Voit et al., “Advances in realtime phasecontrast flow MRI using asymmetric radial gradient echoes,” Magnetic Resonance in Medicine, vol. 75, no. 5, pp. 1901–1908, 2016. View at: Publisher Site  Google Scholar
 S. Zhang, A. A. Joseph, L. Gross, M. Ghadimi, J. Frahm, and A. W. Beham, “Diagnosis of Gastroesophageal Reflux Disease Using Realtime Magnetic Resonance Imaging,” Scientific Reports, vol. 5, Article ID 12112, 2015. View at: Publisher Site  Google Scholar
 A. Niebergall, S. Zhang, E. Kunay et al., “Realtime MRI of speaking at a resolution of 33 ms: Undersampled radial FLASH with nonlinear inverse reconstruction,” Magnetic Resonance in Medicine, vol. 69, no. 2, pp. 477–485, 2013. View at: Publisher Site  Google Scholar
 P. W. Iltis, E. Schoonderwaldt, S. Zhang, J. Frahm, and E. Altenmüller, “Realtime MRI comparisons of brass players: A methodological pilot study,” Human Movement Science, vol. 42, pp. 132–145, 2015. View at: Publisher Site  Google Scholar
Copyright
Copyright © 2017 Sebastian Schaetz et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.