International Journal of Antennas and Propagation

Volume 2016 (2016), Article ID 1671687, 12 pages

http://dx.doi.org/10.1155/2016/1671687

## FPGA Implementation of Real-Time Compressive Sensing with Partial Fourier Dictionary

National Laboratory of Radar Signal Processing, Xidian University, Xi’an 710071, China

Received 18 July 2015; Accepted 13 December 2015

Academic Editor: Atsushi Mase

Copyright © 2016 Yinghui Quan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

This paper presents a novel real-time compressive sensing (CS) reconstruction which employs high density field-programmable gate array (FPGA) for hardware acceleration. Traditionally, CS can be implemented using a high-level computer language in a personal computer (PC) or multicore platforms, such as graphics processing units (GPUs) and Digital Signal Processors (DSPs). However, reconstruction algorithms are computing demanding and software implementation of these algorithms is extremely slow and power consuming. In this paper, the orthogonal matching pursuit (OMP) algorithm is refined to solve the sparse decomposition optimization for partial Fourier dictionary, which is always adopted in radar imaging and detection application. OMP reconstruction can be divided into two main stages: optimization which finds the closely correlated vectors and least square problem. For large scale dictionary, the implementation of correlation is time consuming since it often requires a large number of matrix multiplications. Also solving the least square problem always needs a scalable matrix decomposition operation. To solve these problems efficiently, the correlation optimization is implemented by fast Fourier transform (FFT) and the large scale least square problem is implemented by Conjugate Gradient (CG) technique, respectively. The proposed method is verified by FPGA (Xilinx Virtex-7 XC7VX690T) realization, revealing its effectiveness in real-time applications.

#### 1. Introduction

Compressive sensing (CS) is a novel technology which allows sampling of sparse signals under sub-Nyquist rate and reconstructing the signal using computational intensive algorithms. It has received considerable attention and has been successfully applied in many fields, such as signal/image processing, radar imaging, communication, geophysics, and remote sensing [1–6]. In CS, it has been shown that a signal which is sparse or has a sparse representation in some bases can be recovered from a small number of random nonadaptive linear measurements. Unfortunately, high-performance sparse signal recovery algorithms typically require a significant computational effort [7]. While the computational complexity is not a major issue for applications where offline processing on central processing units (CPUs) or graphics processing units (GPUs) can be afforded (e.g., in MRI) [8–10], it becomes extremely challenging for applications requiring real-time processing at high throughput (e.g., in radar detection and imaging). Hence, to meet the stringent throughput, latency, and power-consumption constraints of real-time CS-based radar systems, developing dedicated hardware implementations, such as application specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs), is of paramount importance [11].

A common approach to sparse reconstruction is known as Basis Pursuit (BP) [12]. This method uses convex optimization to find a signal representation in an overcomplete dictionary that minimizes norm of the coefficients in the representation. While being known to achieve accurate signal reconstruction, BP is more computationally intensive and has been shown to be significantly slower than other methods [13]. Orthogonal matching pursuit (OMP) which is proposed by Tropp and Gilbert [14] is an efficient and reliable reconstruction algorithm. OMP is a greedy method which identifies the location of one nonzero component of dictionary at a time. In order to converge, it requires a minimum number of samples in the order of , where is the signal’s sparsity and is the original dimension of the problem. Tropp and Gilbert show that, by performing enough nonadaptive measurement, signal recovery is possible with high probability. OMP is widely used in CS signal reconstruction due to its computing efficiency and relative simplicity. Even so, the computation load is very large for real-time applications, in which the signal reconstruction should be done within specified time constraints.

Software implementation of these algorithms is time consuming since they often require massive matrix multiplications. These signal processing applications which require intense computation and simultaneous processing of large amount of data in real-time can make use of large scale field-programmable gate array (FPGA) platform for hardware acceleration. Modern high-capacity FPGA is an attractive alternative to accelerate scientific and engineering applications [15] due to the possible utilization of massive parallelism.

In order to speed up the complex reconstruction algorithms, a number of implementations on GPU, [8] ASICs, or reconfigurable FPGAs have been reported so far. The first ASIC implementation of CS reconstruction algorithms including matching pursuit (MP), gradient pursuit (GP), and OMP is presented in [16, 17] for channel estimation in wireless communication systems. FPGA implementation of OMP for generic CS problems of dimension 32 × 128 is developed in [18], processing signals with the sparsity of . In [19], a reconstruction algorithm similar to OMP is implemented on an FPGA to reconstruct band-sparse signals acquired by the modulated wideband converter. OMP-like implementation for problems of size 64 × 256 is proposed in [20], which, however, does not orthogonalize the estimation in every iteration. The first approximate message passing (AMP) designs [21] perform audio restoration and solve CS problems of size 512 × 1024. Highly parallel FPGA implementation of OMP and AMP reconstruction algorithms are presented in [22], which run on a Xilinx Virtex-6 FPGA. The high speed architecture optimized based on [20] is discussed in [23], in which an architectural design and FPGA implementation of low-complexity compressive sensing reconstruction hardware are proposed. The proposed architecture supports vectors of length 256. And a thresholding method is applied for reducing the computation latency of dot product. In [24], a single-precision floating-point CS reconstruction engine implemented on a Kintex-7 FPGA is presented. In order to achieve high performance with maximum hardware utilization, a highly parallel architecture that shares computing resources among different tasks of OMP by using configurable processing elements (PEs) is presented.

OMP reconstruction can be divided into two main stages: optimization which finds the closely correlated atoms and least square problem. For large scale dictionary, the implementation of correlation is time consuming since it often requires a large number of matrix multiplications. The architectures listed above will take a lot of time for computing due to the path delay in processing the dot product and performing the matrix inverse. This paper aims to optimize the computational complexity according to the characteristics of the dictionary of CS-based radar applications. Some orthonormal transformations are utilized for CS-based radar signal reconstruction and image processing, such as Fourier basis, Discrete Cosine Transform (DCT) basis, and wavelet basis. In radar applications, the partial Fourier dictionary is widely applied for spectrum reconstruction and radar imaging. In [25–27], a framework of high-resolution inversed synthetic aperture radar (ISAR) imaging with limited measured data is presented. During CS framework, the ISAR imaging is converted into a problem of signal reconstruction with orthogonal Fourier basis. Novel step-frequency radar (SFR) systems are proposed in [28, 29], which achieve the same resolution as conventional SFRs, while using significantly reduced bandwidth. This bandwidth reduction is accomplished by employing compressive sampling ideas and exploiting the sparseness of targets in the range-velocity space with redundancy Fourier basis. Gurbuz et al. proposed a compressive sensing data acquisition and imaging method for step frequency continuous wave (SFCW) ground penetrating radar (GPR) [30, 31], where the sparsity property and limited number of buried objects are successfully utilized for improving the performance of target detection. In [32–34], similar data acquisition and target reconstruction strategies are applied for SFCW through-the-wall radar imaging. The above-mentioned compressive GPR algorithms are discussed in the framework of SFCW radar. In these systems, the dense partial Fourier dictionary or modulated partial Fourier dictionary is adopted for target reconstruction. In [35], a novel velocity ambiguity resolving method is proposed for moving target indication (MTI), in which the compressive sensing (CS) is applied to recover the unambiguous Doppler spectrum of targets from the random pulse repetition frequency- (PRF-) jittering pulses. In [36], high-frequency (HF) over-the-horizon radar (OTHR) spectrum reconstruction of maneuvering target is proposed. The spectrum is reconstructed from incomplete measurements via CS by using a redundant Fourier-chirp dictionary. Therefore, the real-time hardware implementation of OMP for radar applications based on Fourier or modulated Fourier dictionary is very meaningful.

Based on the procedure of standard orthogonal matching pursuit (OMP) algorithm and the characteristics of Fourier-class dictionary, an improved solver named as IOMP is developed to optimize the computation complexity of optimization, which implements the correlation by fast Fourier transform (FFT) and the least squares by Conjugate Gradient (CG), respectively.

This paper is organized as follows. Section 2 reviews the CS theory, the traditional OMP algorithm, and the improved OMP algorithm for partial Fourier dictionary. Section 3 describes the circuit optimization and FPGA architecture for IOMP. In Section 4, several experiments using hardware acceleration are conducted. Then, the calculation efficiency, accuracy, and resources utilization are reported. Finally, Section 5 concludes the paper.

#### 2. CS Theory and IOMP Algorithm

##### 2.1. CS Theory

The developing theory of compressive sensing indicates that an unknown sparse signal can be exactly recovered from a very limited number of measurements with high probability by solving an optimization problem when some special conditions are met [1–6]. Consider , and suppose there exists a basis satisfying , where is a sparse vector. Making a linear measurement process to , we have , where is a random measurement matrix and is defined as dictionary and is additive noise. The CS theory indicates that if the matrix has optimal restricted isometry property (RIP), accurate recovery of with high probability can be achieved by solving a convex problem as where denotes norm and denotes minimization. In the case of low signal to noise ratio (SNR), we should set a high noise level for good estimation. The noise level estimation of radar applications can be found in [25, 26].

##### 2.2. OMP Algorithm

Consider a -sparse signal sampled using a random matrix and is the sampled data. Our ultimate goal is to find columns of dictionary which mostly contributed to . To begin with, residual is initialized to . For each iteration, a column of is chosen which has the best correlation with . The residual is then updated by subtracting the correlation from for next iteration. This is repeated for times to find columns of and the estimated signal is obtained by solving an overdetermined least square problem. The procedure of original OMP algorithm [14] is given below:(a)Initialize the residual , the index set , the signal set , and the iteration counter .(b)Find the index which is most correlated to by solving the optimization problem:where is the th column of .(c)Update the index set and signal set :(d)Solve a least square problem to obtain a new signal estimate:(e)Calculate the new residual according to(f)Increment and return to step (b) if is less than or the residual error is larger than a preset noise level .

There are three main tasks performed in each iteration: atom searching (AS) for identifying the active set (step (b)), least square (LS) solving for computing the new signal estimation (step (d)), and residual update (step (e)). At iteration , the number of complex floating point operations (FLOPs) required by each task is , (realize LS by Gram-Schmidt based QR decomposition), and , respectively [37]. The AS task is found to be the performance bottleneck as it contributes the most computation load in original OMP. Simultaneously, the LS task executed in each loop plays a significant role in the hardware utilization efficiency. This is because the matrix factorization in the LS task involves a variety of operations with complex data flows, which will introduce extra hardware complexity in terms of control and scheduling [38].

##### 2.3. AS Improvement for Partial Fourier Dictionary

The OMP algorithm mentioned above can be implemented efficiently. The LS performed in each iteration in original OMP algorithm (step (d)) can be substituted by Gram-Schmidt orthogonalization [18]. Then, step (d) to step (f) of original OMP algorithm can be realized by the following steps:(d)Perform modified Gram-Schmidt orthogonalization by using the column of and in order to determine .(e)Calculate the new residual according to(f)Increment and return to step (b) if is less than or the residual error is larger than a preset value .(g)Solve the least square problem to find for the indices in : where consists of the relevant columns of .

By doing so, the architecture explained above is more efficient because the LS should only be executed once, rather than the original OMP algorithm, in which the LS should be taken in each iteration. Then, the most time consuming steps are AS for selecting the atoms (step (b)) and the LS (step (g)) for estimating the final in large sparsity situation. As we know, most of the reconstruction time is consumed in the optimization problem for finding columns of . When the dimension of dictionary is large, matrix-vector multiplications are time and resource demanding to implement in hardware in real time. The algorithm proposed here is based on partial Fourier basis or modulated Fourier basis which is less complex for implementing in hardware. For radar detection and imaging, it is preferable to employ a structured basis, such as partial Fourier basis, Discrete Cosine Transform (DCT) basis, and wavelet basis. In this case, the OMP implementation can be processed efficiently. Most significantly, it is possible to compute the maximum correlation between a signal and the columns of the matrix in real time by using fast transforms. Second, the matrix can be preconstructed and it is only necessary to store only one vector from dictionary matrix.

First, we discuss the AS optimization of partial Fourier dictionary. Define the redundant time frequency dictionary aswhere and are defined as the Fourier basis of size and random measurement matrix of size , respectively. is constructed by randomly selecting rows from an identity matrix. The schematic of partial Fourier dictionary construction is expressed as

Rewrite the optimization formula of finding the index which is most correlated to by solving the following problem:Instead of calculating it one by one by vector dot product, we may obtain a column through the following fast Fourier transform computation: where is a vector based on , with the missing element zero padded according to measurement matrix .

Next, we discuss the AS optimization of Fourier-chirp dictionary which is adopted in [36]. By this way, the spectrum is reconstructed from incomplete measurements via CS by using a redundant Fourier-chirp dictionary for maneuvering targets. For our convenient derivation, let and stand for the Doppler frequency and chirp rate, respectively. And then the partial Fourier-chirp dictionary is able to be constructed:The Fourier-chirp basis can be formulated as where “” denotes Hadamard product, , , and is the grid step of chirp rate. and is a subset of . is the dimension of chirp rate.

Suppose we need to estimate the th signal components of . Its Doppler frequency and chirp rate are achieved when the inner product of and the basis in reaches its maximum:Then, we apply IFFT instead of matrix-vector multiplications directly to enhance efficiency in inner product computation in (14). Denote the inner product matrix corresponding to the dictionary and the measurement asApparently, is an matrix. Instead of calculating its element one by one by vector dot product, we may obtain a column through the following fast Fourier transform computation: and are vectors based on and , with the missing data zero padded. By seeking the maximum element in , we can get the estimation of . Then, we may achieve the signal estimation.

##### 2.4. LS Improvement

There are two categories of methods for solving linear systems. The first is direct method, where the solution is computed through lower/upper triangular decomposition and solving triangular system. The second is iterative method, where the solution is approximated by performing iterations from an initial vector. Direct method is only feasibly applied for small systems. In contrast, the iterative methods are suited for solving larger scale problem. One well-studied method that has been proven to be very efficient in software and robust at solving large sparse linear systems is the Conjugate Gradient (CG) algorithm. In this paper we present a hardware architecture of CG method which takes advantage of wide parallelization and deep-pipelining of FPGAs.

The pseudocode of the corresponding CG algorithm is listed in Algorithm 1.