Abstract
This paper presents a novel realtime compressive sensing (CS) reconstruction which employs high density fieldprogrammable gate array (FPGA) for hardware acceleration. Traditionally, CS can be implemented using a highlevel computer language in a personal computer (PC) or multicore platforms, such as graphics processing units (GPUs) and Digital Signal Processors (DSPs). However, reconstruction algorithms are computing demanding and software implementation of these algorithms is extremely slow and power consuming. In this paper, the orthogonal matching pursuit (OMP) algorithm is refined to solve the sparse decomposition optimization for partial Fourier dictionary, which is always adopted in radar imaging and detection application. OMP reconstruction can be divided into two main stages: optimization which finds the closely correlated vectors and least square problem. For large scale dictionary, the implementation of correlation is time consuming since it often requires a large number of matrix multiplications. Also solving the least square problem always needs a scalable matrix decomposition operation. To solve these problems efficiently, the correlation optimization is implemented by fast Fourier transform (FFT) and the large scale least square problem is implemented by Conjugate Gradient (CG) technique, respectively. The proposed method is verified by FPGA (Xilinx Virtex7 XC7VX690T) realization, revealing its effectiveness in realtime applications.
1. Introduction
Compressive sensing (CS) is a novel technology which allows sampling of sparse signals under subNyquist rate and reconstructing the signal using computational intensive algorithms. It has received considerable attention and has been successfully applied in many fields, such as signal/image processing, radar imaging, communication, geophysics, and remote sensing [1–6]. In CS, it has been shown that a signal which is sparse or has a sparse representation in some bases can be recovered from a small number of random nonadaptive linear measurements. Unfortunately, highperformance sparse signal recovery algorithms typically require a significant computational effort [7]. While the computational complexity is not a major issue for applications where offline processing on central processing units (CPUs) or graphics processing units (GPUs) can be afforded (e.g., in MRI) [8–10], it becomes extremely challenging for applications requiring realtime processing at high throughput (e.g., in radar detection and imaging). Hence, to meet the stringent throughput, latency, and powerconsumption constraints of realtime CSbased radar systems, developing dedicated hardware implementations, such as application specific integrated circuits (ASICs) or fieldprogrammable gate arrays (FPGAs), is of paramount importance [11].
A common approach to sparse reconstruction is known as Basis Pursuit (BP) [12]. This method uses convex optimization to find a signal representation in an overcomplete dictionary that minimizes norm of the coefficients in the representation. While being known to achieve accurate signal reconstruction, BP is more computationally intensive and has been shown to be significantly slower than other methods [13]. Orthogonal matching pursuit (OMP) which is proposed by Tropp and Gilbert [14] is an efficient and reliable reconstruction algorithm. OMP is a greedy method which identifies the location of one nonzero component of dictionary at a time. In order to converge, it requires a minimum number of samples in the order of , where is the signal’s sparsity and is the original dimension of the problem. Tropp and Gilbert show that, by performing enough nonadaptive measurement, signal recovery is possible with high probability. OMP is widely used in CS signal reconstruction due to its computing efficiency and relative simplicity. Even so, the computation load is very large for realtime applications, in which the signal reconstruction should be done within specified time constraints.
Software implementation of these algorithms is time consuming since they often require massive matrix multiplications. These signal processing applications which require intense computation and simultaneous processing of large amount of data in realtime can make use of large scale fieldprogrammable gate array (FPGA) platform for hardware acceleration. Modern highcapacity FPGA is an attractive alternative to accelerate scientific and engineering applications [15] due to the possible utilization of massive parallelism.
In order to speed up the complex reconstruction algorithms, a number of implementations on GPU, [8] ASICs, or reconfigurable FPGAs have been reported so far. The first ASIC implementation of CS reconstruction algorithms including matching pursuit (MP), gradient pursuit (GP), and OMP is presented in [16, 17] for channel estimation in wireless communication systems. FPGA implementation of OMP for generic CS problems of dimension 32 × 128 is developed in [18], processing signals with the sparsity of . In [19], a reconstruction algorithm similar to OMP is implemented on an FPGA to reconstruct bandsparse signals acquired by the modulated wideband converter. OMPlike implementation for problems of size 64 × 256 is proposed in [20], which, however, does not orthogonalize the estimation in every iteration. The first approximate message passing (AMP) designs [21] perform audio restoration and solve CS problems of size 512 × 1024. Highly parallel FPGA implementation of OMP and AMP reconstruction algorithms are presented in [22], which run on a Xilinx Virtex6 FPGA. The high speed architecture optimized based on [20] is discussed in [23], in which an architectural design and FPGA implementation of lowcomplexity compressive sensing reconstruction hardware are proposed. The proposed architecture supports vectors of length 256. And a thresholding method is applied for reducing the computation latency of dot product. In [24], a singleprecision floatingpoint CS reconstruction engine implemented on a Kintex7 FPGA is presented. In order to achieve high performance with maximum hardware utilization, a highly parallel architecture that shares computing resources among different tasks of OMP by using configurable processing elements (PEs) is presented.
OMP reconstruction can be divided into two main stages: optimization which finds the closely correlated atoms and least square problem. For large scale dictionary, the implementation of correlation is time consuming since it often requires a large number of matrix multiplications. The architectures listed above will take a lot of time for computing due to the path delay in processing the dot product and performing the matrix inverse. This paper aims to optimize the computational complexity according to the characteristics of the dictionary of CSbased radar applications. Some orthonormal transformations are utilized for CSbased radar signal reconstruction and image processing, such as Fourier basis, Discrete Cosine Transform (DCT) basis, and wavelet basis. In radar applications, the partial Fourier dictionary is widely applied for spectrum reconstruction and radar imaging. In [25–27], a framework of highresolution inversed synthetic aperture radar (ISAR) imaging with limited measured data is presented. During CS framework, the ISAR imaging is converted into a problem of signal reconstruction with orthogonal Fourier basis. Novel stepfrequency radar (SFR) systems are proposed in [28, 29], which achieve the same resolution as conventional SFRs, while using significantly reduced bandwidth. This bandwidth reduction is accomplished by employing compressive sampling ideas and exploiting the sparseness of targets in the rangevelocity space with redundancy Fourier basis. Gurbuz et al. proposed a compressive sensing data acquisition and imaging method for step frequency continuous wave (SFCW) ground penetrating radar (GPR) [30, 31], where the sparsity property and limited number of buried objects are successfully utilized for improving the performance of target detection. In [32–34], similar data acquisition and target reconstruction strategies are applied for SFCW throughthewall radar imaging. The abovementioned compressive GPR algorithms are discussed in the framework of SFCW radar. In these systems, the dense partial Fourier dictionary or modulated partial Fourier dictionary is adopted for target reconstruction. In [35], a novel velocity ambiguity resolving method is proposed for moving target indication (MTI), in which the compressive sensing (CS) is applied to recover the unambiguous Doppler spectrum of targets from the random pulse repetition frequency (PRF) jittering pulses. In [36], highfrequency (HF) overthehorizon radar (OTHR) spectrum reconstruction of maneuvering target is proposed. The spectrum is reconstructed from incomplete measurements via CS by using a redundant Fourierchirp dictionary. Therefore, the realtime hardware implementation of OMP for radar applications based on Fourier or modulated Fourier dictionary is very meaningful.
Based on the procedure of standard orthogonal matching pursuit (OMP) algorithm and the characteristics of Fourierclass dictionary, an improved solver named as IOMP is developed to optimize the computation complexity of optimization, which implements the correlation by fast Fourier transform (FFT) and the least squares by Conjugate Gradient (CG), respectively.
This paper is organized as follows. Section 2 reviews the CS theory, the traditional OMP algorithm, and the improved OMP algorithm for partial Fourier dictionary. Section 3 describes the circuit optimization and FPGA architecture for IOMP. In Section 4, several experiments using hardware acceleration are conducted. Then, the calculation efficiency, accuracy, and resources utilization are reported. Finally, Section 5 concludes the paper.
2. CS Theory and IOMP Algorithm
2.1. CS Theory
The developing theory of compressive sensing indicates that an unknown sparse signal can be exactly recovered from a very limited number of measurements with high probability by solving an optimization problem when some special conditions are met [1–6]. Consider , and suppose there exists a basis satisfying , where is a sparse vector. Making a linear measurement process to , we have , where is a random measurement matrix and is defined as dictionary and is additive noise. The CS theory indicates that if the matrix has optimal restricted isometry property (RIP), accurate recovery of with high probability can be achieved by solving a convex problem as where denotes norm and denotes minimization. In the case of low signal to noise ratio (SNR), we should set a high noise level for good estimation. The noise level estimation of radar applications can be found in [25, 26].
2.2. OMP Algorithm
Consider a sparse signal sampled using a random matrix and is the sampled data. Our ultimate goal is to find columns of dictionary which mostly contributed to . To begin with, residual is initialized to . For each iteration, a column of is chosen which has the best correlation with . The residual is then updated by subtracting the correlation from for next iteration. This is repeated for times to find columns of and the estimated signal is obtained by solving an overdetermined least square problem. The procedure of original OMP algorithm [14] is given below:(a)Initialize the residual , the index set , the signal set , and the iteration counter .(b)Find the index which is most correlated to by solving the optimization problem:where is the th column of .(c)Update the index set and signal set :(d)Solve a least square problem to obtain a new signal estimate:(e)Calculate the new residual according to(f)Increment and return to step (b) if is less than or the residual error is larger than a preset noise level .
There are three main tasks performed in each iteration: atom searching (AS) for identifying the active set (step (b)), least square (LS) solving for computing the new signal estimation (step (d)), and residual update (step (e)). At iteration , the number of complex floating point operations (FLOPs) required by each task is , (realize LS by GramSchmidt based QR decomposition), and , respectively [37]. The AS task is found to be the performance bottleneck as it contributes the most computation load in original OMP. Simultaneously, the LS task executed in each loop plays a significant role in the hardware utilization efficiency. This is because the matrix factorization in the LS task involves a variety of operations with complex data flows, which will introduce extra hardware complexity in terms of control and scheduling [38].
2.3. AS Improvement for Partial Fourier Dictionary
The OMP algorithm mentioned above can be implemented efficiently. The LS performed in each iteration in original OMP algorithm (step (d)) can be substituted by GramSchmidt orthogonalization [18]. Then, step (d) to step (f) of original OMP algorithm can be realized by the following steps:(d)Perform modified GramSchmidt orthogonalization by using the column of and in order to determine .(e)Calculate the new residual according to(f)Increment and return to step (b) if is less than or the residual error is larger than a preset value .(g)Solve the least square problem to find for the indices in : where consists of the relevant columns of .
By doing so, the architecture explained above is more efficient because the LS should only be executed once, rather than the original OMP algorithm, in which the LS should be taken in each iteration. Then, the most time consuming steps are AS for selecting the atoms (step (b)) and the LS (step (g)) for estimating the final in large sparsity situation. As we know, most of the reconstruction time is consumed in the optimization problem for finding columns of . When the dimension of dictionary is large, matrixvector multiplications are time and resource demanding to implement in hardware in real time. The algorithm proposed here is based on partial Fourier basis or modulated Fourier basis which is less complex for implementing in hardware. For radar detection and imaging, it is preferable to employ a structured basis, such as partial Fourier basis, Discrete Cosine Transform (DCT) basis, and wavelet basis. In this case, the OMP implementation can be processed efficiently. Most significantly, it is possible to compute the maximum correlation between a signal and the columns of the matrix in real time by using fast transforms. Second, the matrix can be preconstructed and it is only necessary to store only one vector from dictionary matrix.
First, we discuss the AS optimization of partial Fourier dictionary. Define the redundant time frequency dictionary aswhere and are defined as the Fourier basis of size and random measurement matrix of size , respectively. is constructed by randomly selecting rows from an identity matrix. The schematic of partial Fourier dictionary construction is expressed as
Rewrite the optimization formula of finding the index which is most correlated to by solving the following problem:Instead of calculating it one by one by vector dot product, we may obtain a column through the following fast Fourier transform computation: where is a vector based on , with the missing element zero padded according to measurement matrix .
Next, we discuss the AS optimization of Fourierchirp dictionary which is adopted in [36]. By this way, the spectrum is reconstructed from incomplete measurements via CS by using a redundant Fourierchirp dictionary for maneuvering targets. For our convenient derivation, let and stand for the Doppler frequency and chirp rate, respectively. And then the partial Fourierchirp dictionary is able to be constructed:The Fourierchirp basis can be formulated as where “” denotes Hadamard product, , , and is the grid step of chirp rate. and is a subset of . is the dimension of chirp rate.
Suppose we need to estimate the th signal components of . Its Doppler frequency and chirp rate are achieved when the inner product of and the basis in reaches its maximum:Then, we apply IFFT instead of matrixvector multiplications directly to enhance efficiency in inner product computation in (14). Denote the inner product matrix corresponding to the dictionary and the measurement asApparently, is an matrix. Instead of calculating its element one by one by vector dot product, we may obtain a column through the following fast Fourier transform computation: and are vectors based on and , with the missing data zero padded. By seeking the maximum element in , we can get the estimation of . Then, we may achieve the signal estimation.
2.4. LS Improvement
There are two categories of methods for solving linear systems. The first is direct method, where the solution is computed through lower/upper triangular decomposition and solving triangular system. The second is iterative method, where the solution is approximated by performing iterations from an initial vector. Direct method is only feasibly applied for small systems. In contrast, the iterative methods are suited for solving larger scale problem. One wellstudied method that has been proven to be very efficient in software and robust at solving large sparse linear systems is the Conjugate Gradient (CG) algorithm. In this paper we present a hardware architecture of CG method which takes advantage of wide parallelization and deeppipelining of FPGAs.
The pseudocode of the corresponding CG algorithm is listed in Algorithm 1.

In summary, the flowchart of OMP and IOMP algorithm is shown in Figure 1. The IOMP is optimized for partial Fourier dictionary or modulated partial Fourier dictionary which is always adopted in radar detection and imaging.
(a) Flow chart of OMP
(b) Flow chart of IOMP
3. Proposed FPGA Architecture
In this section, FPGAbased hardware architecture is proposed to take a compromise on speed, resource utilization, and accuracy in the eventual circuit implementation. The newly available Xilinx FPGA (XC7VX690T) is utilized to validate the implementation of proposed approach. The block diagram of toplevel design is illustrated in Figure 2.
As shown in Figure 2, the hardware design principally comprises two components: atom searching (AS) and least square solving (LS). AS is used to find out the atoms which are most correlated to residual in dictionary . LS is adopted to figure out sparse solutions based on CG iteration method. Firstly, zero padding is conducted to residual in corresponding positions so as to allow for carrying out IFFT operations which are capable of calculating the correlation between residual and dictionary . Followed by the most correlated atoms which are found out, the GramSchmidt orthogonalization is in succession executed for updating the residual . Repeat the aforementioned steps until all atoms are found out. Finally, the CG iteration method is used for quickly figuring out the sparse solution .
Figure 3(a) illustrates the hardware architecture of AS, whose input parameters are sampled data and dictionary which are stored in external doubledatarate (DDR3) synchronous dynamic random access memory (SDRAM). As depicted in Figure 3(a), processing element (PE) is a modularized circuit which is shared and available for calculating the product of two complex vectors. Its inner structure is depicted in Figure 3(b). Multiply accumulator (MAC) is a Xilinx LogiCORE IP core which provides multiplyaccumulate implementations of two fixedpoint vectors. FP1 and FP2 are floatingpoint IP cores serving as floatingtofixed point conversion and fixedtofloating point conversion, respectively. The output data of FP1 hold a 32bit width with 16 bits reserved for the integer part and 16 bits for the fractional part. To offer sufficient bit width for multiply accumulator operations, the accumulation width of MAC is configured to the default value of 64 bits.
(a)
(b)
Based on comprehensive consideration of computation time and precision, data to be processed by FFT IP core are converted to 32bit fixed point data with 16 bits for the integer part and 16 bits for the fraction part. FFT IP core utilizes XtremeDSP slices to calculate and block RAMs to store intermediate data. Fullprecision unscaled arithmetic is chosen for the accurate computation. To perform 2048point transform, the output data of FFT IP core are 44bit width with 28 integer bits and 16 fractional bits. Consider the realdata test and redundant sign bits; integer bits are intercepted to 20 bits. As shown in Figure 3(a), the output data of FFT are converted to floatingpoint format by using FP2, whose input integer width and fraction width are configured to 20 bits and 16 bits, respectively. Error caused by the conversion between fixedpoint and floating point is about , which is negligible.
Ultimately, AS outputs the most correlated atoms which are subsequently utilized for LS calculations.
The hardware structure of LS is depicted in Figure 4(a). Data to be processed are received from sampled data and the output of AS, that is, matrix . As illustrated in Figure 4(a), matrixmatrix and matrixvector multiplications are implemented by using paralleled PEs which have the same structure as that illustrated in Figure 3(b). The inner structure of basic IP which is utilized for updating sparse solution and residual is depicted in Figure 4(b). Finally, LS outputs a sparse vector which is the optimal solution to the problem with partial Fourier dictionary.
(a)
(b)
FloatingPoint Operator v5.0 is utilized here to perform floatingpoint arithmetic on selected FPGA device. For carrying out addition/subtraction and multiply operations, the core is configured to usage of 2 × DSP48E and 3 × DSP48E, respectively. Both fixpoint division and MAC operations require no DSP48Es, and so does the conversion between floating point and fixed point. PE module which consists of 10 floatingpoint cores and 4 MACs utilizes only 4 DSP48Es despite the increase of input vector length.
As analyzed above, we optimized the circuit in resource utilization and computation speed, respectively. FPGA’s embedded dualport block RAMs are utilized for expediently reading/writing data, reducing data transfer delay, and simplifying sequential control.
4. Experimental Results
In this section, we will provide the comparison between other CS reconstruction algorithms and proposed IOMP algorithm in basic aspects such as processing speed, resource utilization, and computation precision.
4.1. Comparison between Traditional OMP and Proposed IOMP
The Verilog HDL program developed for IOMP algorithm is simulated and implemented based on XC7VX690T FPGA and runs at 165 MHz. In proposed technique, calculations of the correlations between dictionary and residual are conducted by IFFT operations instead of inner product. Assume the number of columns of matrix is four times more than that of rows. Based on the utilization of DSP48Es of IFFT IP core provided by ISE14.3, vectorvector multiplications are calculated in parallel to ensure the same utilization of DSP48Es by inner product and IFFT. Under the aforementioned conditions, experimental results concerning clock cycles consumed by IFFT and inner product are described in Figure 5.
(a)
(b)
Figure 5(a) illustrates the comparative results. Red curve which rises rapidly with the increase of column vector length illustrates the computation time of inner product, while blue curve which depicts the computation time of IFFT operations increases lentamente. Figure 5(b) illustrates the result of enlarged drawing of Figure 5(a) when the vector length is less than 256. Obviously, inner product will take far more time than IFFT does when the vector length is larger than 128. IFFT operations will achieve the speedup of 12× over dot product calculations under the condition of same resource utilization (DSP48E) when the vector length is 2048.
CG iteration method which has excellent properties of small memory space requirement and high iteration speed is utilized for solving the least square problem. As an iteration method, the precision of CG is determined by bit width and iteration times. Normalized mean square error (MSE) of reconstruction result is utilized as the evaluation criterion of computation precision, and is required in this design. Experiment about the relationship between precision and iteration times is carried out. The experimental result is depicted in Figure 6, where is the sparsity. Generally sixtime iterations can exactly achieve the required precision under the condition of .
Mathematical calculations, for example, floatingpoint multiply and add/subtract operations, are realized jointly by DSP48Es and LUTs, and the latency is one clock cycle based on the maximum usage of DSP48Es. Floatingpoint division is realized by LUTs and its computing latency is 18 clock cycles. As components of PE module, FP1/FP2 and MAC have a latency of one clock cycle and clock cycles, respectively, where is the input vector length. Thus, PE module totally requires clock cycles to accomplish the complex vector multiplications. To accomplish one iteration of CG, totally clock cycles described as in (17) are required:where is the sparsity and represents one clock cycle.
As discussed above, to reconstruct the signal of sparsity 12, sixtime iterations can exactly satisfy the demand for precision. Totally clock cycles are required to accomplish the whole iterations:
QR decomposition is another widely used and effective least square method. Unfortunately, because of the considerable computation complexity, its computation load and resource requirement will have a sharp growth with the increase of matrix size. Even though parallel architecture is considered in the hardware implementation, with respect to computation time and resource utilization, QR decomposition can hardly obtain the excellent performance that CG iteration creates. Figure 7 gives the clock cycles that CG iteration took with the increase of sparsity.
According to abovementioned experimental results, the proposed technique is more feasible for processing largescale sparse dictionary and reconstructing signals with large sparsity.
4.2. Calculation Latency and Resource Utilization of Proposed IOMP Algorithm
In this paper, IOMP algorithm is utilized for reconstructing sparse signals. To process a 512length measured vector of sparsity 12, assume the size of Fourier basis matrix and dictionary to be and , respectively. It requires 4245 clock cycles to find out one column of dictionary, 1053 clock cycles for Schmidt orthogonalization and updating residual , and 1080 clock cycles for LS computation, respectively. Thus, totally 64656 clock cycles, that is, 391.8 , are required to accomplish the whole reconstruction work.
Table 1 provides the synthesis and implemented result reported by the Xilinx ISE14.3. Notice that FPGA resources are sufficient for processing reconstruction algorithm in this example.
Radar echoes are complex data containing phase and amplitude information. Theoretically, to accomplish the multiply operations of two complex numbers, totally four multiplications, one addition, and one subtraction are conducted. Relatively, only one multiplication is needed for the multiplication of two real numbers. Thus, processing complex data has considerable computation load over processing real data.
Many works about OMP accelerated reconstruction, including both software implementation and hardware implementation, have been published [9, 18, 22, 24, 39, 40]. Comparisons between them and this work are carried out with respect to computing speed, maximum working frequency, computation accuracy, and so forth. Table 2 expatiates on the comparative results. Compared to other works based on FPGA, a higher working frequency of 165 MHz is achieved in this work. Authors in [18] present VLSI implementation of an optimized OMP algorithm to process the dictionary of size 32 × 128 with sparsity of , which totally takes 24 . In [22], highly parallel FPGA implementations of two CS reconstruction algorithms OMP and AMP are proposed, which run on a Xilinx Virtex6 FPGA. The hardware realization discussed in [24] totally takes , where is the sparsity. It totally takes 478.8 to accomplish CS reconstruction when . Besides, the reconstruction work is also executed on CPU and GPU. A parallel architecture of implementing OMP algorithm based on GTX480 GPU is proposed in [39]. To reconstruct a signal of 8192 points with sparsity 32, totally 15 ms is required. Work [9] proposes a CS method for manycore architectures, for example, the cell processor, GPUs, and CPUs. Compared to the implementation by Intel Core i7 whose computing speed is limited by available dominant frequency, apparently a higher reconstruction speed can be achieved by this work. Meanwhile, CPU and GPU exactly consume much more power. Work [40] executed a MATLAB code of OMP algorithm on Intel Core Duo CPU at 2.8 GHz. To reconstruct a 128length signal of sparsity 5, totally 606 is required. In contrast, our work obtains a speedup of 33 times.
It is worth mentioning that the data processed in our architecture is complex data format, while the architectures proposed in references are processing the real data. In summary, compared to other works, the proposed CS reconstruction technique in this paper has an excellent performance with respect to computing speed, accuracy, and flexible applicability in case that partial Fourier dictionary or modulated partial Fourier dictionary is adopted.
5. Conclusion
In this paper, we focus on the realtime implementation of compressive sensing with partial Fourier dictionary which is always adopted for radar applications. And the high density FPGA is used for hardware acceleration. According to the characteristics of the dictionary, an improved orthogonal matching pursuit algorithm is proposed to solve the sparse optimization efficiently. In this scheme, the correlation is implemented by FFT and the least square is realized by CG, respectively. Fast and areaefficient FPGA realization is provided to meet the realtime requirement of CSbased radar. The hardware architecture, the resource utilization, the computation latency, and the computation precision are analyzed in detail. Finally, a comparison with other works is made to evaluate the effectiveness of proposed approach.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This work was supported by National Natural Science Foundations of China (Grants nos. 61303035 and 61471283).