Abstract

Because in-place fast Fourier transforms have poor real-time performance, a reconfigurable radix-4 FFT processor based on decimation-in-time and single-precision floating-point computation is studied and designed. The proposed method adopts a “pipeline and parallel” structure for accessing multiple memories to improve the FFT processing speed, and it is then applied to digital pulse compression. The experimental results show that the proposed radix-4 FFT can implement digital pulse compression rapidly without additional hardware resources. The proposed method can also be applied to FFTs of other radices.

1. Introduction

The concept of pulse compression dates from the Second World War. Because of technological difficulties, pulse compression signals were not applied to long-range surveillance and long-range tracking radars until the early 1960s. Since the 1970s, as the theory matured and technology improved, pulse compression has been widely applied to 3D, phased-array, reconnaissance, and fire-control radars, among others, and the performance of these radars has improved markedly. As novel technologies and new devices have been progressively introduced into radar systems, their performance has improved considerably. In particular, the emergence of the fast Fourier transform (FFT) laid a solid foundation. Pulse compression is used to obtain high range resolution, and highly real-time pulse compression is a hotspot in radar design [1–3].

There are two methods for digital pulse compression (DPC): convolution in the time domain and matched filtering in the frequency domain [4]. In engineering design, DPC is mainly implemented in the frequency domain. The matched filtering of a linear frequency modulation (LFM) signal is realized as shown in Figure 1.

Matched filtering in the frequency domain uses the FFT to transform the discrete-time signal into a discrete spectrum, multiplies the spectrum by the frequency response of the filter, and uses the IFFT to transform the product back into a time series, which yields the time-domain signal of DPC.

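Suppose that the transmitted signal is $s(t)$ and the corresponding signal spectrum is $S(f)$; the frequency response of the matched filter is then $H(f) = S^{*}(f)$. Given that the received signal is $x(t)$, the output of pulse compression is

$$y(t) = \mathrm{IFFT}\{\mathrm{FFT}[x(t)] \cdot H(f)\}. \tag{1}$$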

It can be seen that the FFT/IFFT remains the key computation in implementing DPC.
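The frequency-domain procedure can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design: the direct `dft` routine stands in for the FFT, and `lfm_chirp`, its `sweep` parameter, the pulse length, and the echo delay are all hypothetical values chosen for the example.

```python
import cmath

def dft(x, inverse=False):
    """Direct O(N^2) DFT (or inverse DFT) of a list of complex samples."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[i] * cmath.exp(sign * 2j * cmath.pi * i * k / n)
                  for i in range(n))
        out.append(acc / n if inverse else acc)
    return out

def pulse_compress(received, transmitted):
    """Matched filtering in frequency: IDFT(DFT(received) * conj(DFT(transmitted)))."""
    n = len(received)
    padded = list(transmitted) + [0j] * (n - len(transmitted))
    spectrum = dft(received)
    matched = [v.conjugate() for v in dft(padded)]  # matched-filter response
    return dft([a * b for a, b in zip(spectrum, matched)], inverse=True)

def lfm_chirp(length, sweep=0.5):
    """Hypothetical baseband LFM pulse; `sweep` is an illustrative chirp rate."""
    return [cmath.exp(1j * cmath.pi * sweep * t * t / length)
            for t in range(length)]

# Illustrative scenario: a 16-sample pulse echoed back with a 20-sample delay.
pulse = lfm_chirp(16)
delay = 20
echo = [0j] * 64
for i, sample in enumerate(pulse):
    echo[delay + i] = sample

compressed = pulse_compress(echo, pulse)
peak_index = max(range(len(compressed)), key=lambda i: abs(compressed[i]))
print(peak_index)
```

In this noise-free scenario the compressed output peaks at the sample index equal to the echo delay, which is the behaviour the matched filter is designed to produce.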

In the early days, general-purpose high-speed signal processors were mainly adopted to implement DPC, but this approach has gradually been phased out as new radar systems demand high-resolution real-time processing [5]. Meanwhile, programmable hardware logic has become increasingly mature, and an FPGA (field-programmable gate array) [6, 7] can implement DPC while meeting precision requirements and increasing speed.

FFT processors usually adopt one of three structures: constant geometry [8], pipeline [9], or in-place [10, 11]. The constant-geometry FFT requires twice the memory. Pipeline architectures have high throughput, but they consume a large amount of resources. Therefore, an in-place architecture is employed to implement the FFT/IFFT, which is the main computation in DPC.

In this paper, a reconfigurable FFT processor based on a “pipeline and parallel” structure is proposed and applied to DPC.

2. Basic Theory

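Given a length-$N$ complex sequence $x(n)$, $n = 0, 1, \ldots, N-1$, its DFT is also a length-$N$ complex sequence defined by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \tag{2}$$

where $k = 0, 1, \ldots, N-1$ and $W_N = e^{-j2\pi/N}$. In the proposed algorithm, the sequence length $N$ is a composite number equal to $N_1 N_2$; therefore, $n$ can be expressed as

$$n = N_2 n_1 + n_2, \tag{3}$$

where $n_1 = 0, 1, \ldots, N_1 - 1$ and $n_2 = 0, 1, \ldots, N_2 - 1$. $n_1$ and $n_2$ are the column and row indices, respectively, and $N_1$ and $N_2$ are the numbers of columns and rows, respectively.

In a similar way, the frequency index $k$ of the output sequence is expressed as

$$k = k_1 + N_1 k_2, \tag{4}$$

where $k_1 = 0, 1, \ldots, N_1 - 1$ and $k_2 = 0, 1, \ldots, N_2 - 1$. $k_1$ indexes the columns, and $k_2$ indexes the rows.

Equation (2) can be rewritten as

$$X(k_1 + N_1 k_2) = \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1} x(N_2 n_1 + n_2)\, W_N^{(N_2 n_1 + n_2)(k_1 + N_1 k_2)}. \tag{5}$$

Then, since $W_N^{N_2 n_1 k_1} = W_{N_1}^{n_1 k_1}$, $W_N^{N_1 n_2 k_2} = W_{N_2}^{n_2 k_2}$, and $W_N^{N_1 N_2 n_1 k_2} = 1$, we have

$$X(k_1 + N_1 k_2) = \sum_{n_2=0}^{N_2-1} \left[ W_N^{n_2 k_1}\, G(n_2, k_1) \right] W_{N_2}^{n_2 k_2}, \tag{6}$$

where

$$G(n_2, k_1) = \sum_{n_1=0}^{N_1-1} x(N_2 n_1 + n_2)\, W_{N_1}^{n_1 k_1}. \tag{7}$$

From the above, $x(n)$ can be mapped to $x(n_1, n_2)$; that is, $x(n)$ can be decomposed into $N_2$ sets of data with $N_1$ points each. The sets are independent of each other.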
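As a worked instance of this decomposition, take the 4096-point case split into four sets of 1024 points. The symbols $n_1$, $n_2$, $k_1$, $k_2$ and the kernel $W_M = e^{-j2\pi/M}$ are notation assumed for this sketch; the mapping $n = 4n_1 + n_2$ is chosen so that $n_2 = n \bmod 4$ selects the set.

```latex
\begin{aligned}
n &= 4 n_1 + n_2, \qquad n_1 = 0,\dots,1023,\quad n_2 = 0,\dots,3,\\
k &= k_1 + 1024\,k_2, \qquad k_1 = 0,\dots,1023,\quad k_2 = 0,\dots,3,\\
W_{4096}^{nk} &= W_{4096}^{4 n_1 k_1}\; W_{4096}^{4096\, n_1 k_2}\; W_{4096}^{n_2 k_1}\; W_{4096}^{1024\, n_2 k_2}
             = W_{1024}^{n_1 k_1}\; W_{4096}^{n_2 k_1}\; W_{4}^{n_2 k_2},\\
X(k_1 + 1024\,k_2) &= \sum_{n_2=0}^{3} \Bigl[\, W_{4096}^{n_2 k_1}
   \sum_{n_1=0}^{1023} x(4 n_1 + n_2)\, W_{1024}^{n_1 k_1} \Bigr]\, W_{4}^{n_2 k_2}.
\end{aligned}
```

The inner sum is a 1024-point DFT of one set, the factor $W_{4096}^{n_2 k_1}$ is the inter-stage twiddle factor, and the outer sum is a 4-point DFT across the four sets.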

3. Proposed FFT Design

3.1. Structure

A 4096-point radix-4 FFT is mainly discussed. The design is based on a “pipeline and parallel” structure. In the proposed design, only a single radix-4 butterfly unit is needed, while four dual-port memories are used to reduce processing time.

Before discussing the structure, a counter should be designed. It stays synchronized with the input data: each time one datum is input, the counter is incremented by 1. Thus, the counter ranges from 0 to 4095.

First, the 4096 input data are distributed into the four memories. According to the counter value modulo 4, the 4096 input data are assigned to four memories of depth 1024, that is, RAM0~RAM3. Then each set of 1024 data can be computed by a radix-4 FFT. Because the four sets of data are independent of each other according to (6), they can be computed in parallel. At the same time, the four sets undergo the same computation; therefore, the four FFTs can be implemented in a pipeline structure. The pipeline structure is shown in Figure 2, and only one radix-4 butterfly is needed to implement the four 1024-point FFTs.
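The distribution step can be sketched as follows. The text specifies only that the counter value modulo 4 selects the memory; writing each datum at address `counter // 4` is an assumption made here for illustration, and `distribute` is a hypothetical helper name.

```python
def distribute(samples, banks=4):
    """Write sample number n to memory (n mod banks).

    The address (n // banks) inside each memory is an assumed layout.
    """
    depth = len(samples) // banks
    rams = [[None] * depth for _ in range(banks)]
    for counter, value in enumerate(samples):  # counter runs from 0 to len - 1
        rams[counter % banks][counter // banks] = value
    return rams

rams = distribute(list(range(4096)))
print([len(ram) for ram in rams])  # [1024, 1024, 1024, 1024]
print(rams[1][:3])                 # [1, 5, 9]
```

With this layout, RAM0~RAM3 each hold every fourth input sample, which is the grouping needed for the four independent 1024-point FFTs.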

Detailed explanation of each unit is as follows.
(i) RAM0~RAM3: four memories are used for storing the 4096 data. For high precision, the data are stored in single-precision floating-point format. Therefore, the storage size of each memory is 64 Kbits (1024 complex words of 2 × 32 bits).
(ii) Control unit for memories: it controls the order in which the memories are accessed.
(iii) Cache unit 1: it caches the pipeline data loaded from each memory, because the data must be presented in parallel when they are computed.
(iv) Radix-4 butterfly unit: it is the basic radix-4 unit [12] and computes in decimation-in-time. In hardware, the butterfly unit mainly consists of floating-point adders and multipliers [13]. Of the four operands, the first does not need a multiplier, so there are three complex multipliers.
(v) Cache unit 2: it caches the parallel results from the radix-4 butterfly unit. Because the outputs are stored in pipeline, one cache unit is needed to hold them.

After the computation above, one more stage remains. In this stage, the four memories are accessed in parallel and the same butterfly unit is used. The structure is shown in Figure 3. After 1024 radix-4 computations, the data are the results of the 4096-point FFT.

Detailed explanation of each unit is as follows.
(i) The depth of each memory on the left is 1024. The data are the results of the 1024-point FFTs.
(ii) One datum is read from each memory, so four data are obtained as the input of the radix-4 butterfly unit. This is repeated 1024 times.
(iii) The parallel outputs are stored in parallel.

3.2. Pipeline Accessing for Memories

The radix-4 FFT computation consists of two parts: multiplication by the coefficients and the 4-point DFT computation. The four operands of one 4-point butterfly are input in parallel, multiplied by the twiddle factors, and passed through the 4-point DFT.

Before being input into the butterfly unit, four data are read from one memory. The four pipeline data are then cached and output in parallel. Last, the four parallel data are input into the butterfly unit. The whole computing process of the butterfly unit is as follows.
(1) According to the access addresses of the operands and the twiddle factors [14], the operands and the twiddle factors are obtained. The four operands are denoted Op(0)~Op(3).
(2) The operands and the twiddle factors are input into cache unit 1 in pipeline, and the outputs are in parallel.
(3) The four parallel operands are multiplied by the four parallel twiddle factors, and then the additions and subtractions are computed. The results of this computation are O(0)~O(3).
(4) After the butterfly computation, the parallel outputs O(0)~O(3) are input into cache unit 2, converted to pipeline outputs, and stored in place.
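Steps (3) and (4) can be sketched as a small function. This is an illustrative software model, assuming a decimation-in-time butterfly in which the first operand has a unit twiddle factor; `radix4_butterfly` and its argument layout are hypothetical, and the caching of serial data is not modelled.

```python
def radix4_butterfly(ops, twiddles):
    """Radix-4 DIT butterfly: twiddle multiplication, then a 4-point DFT.

    ops      -- the four operands Op(0)..Op(3)
    twiddles -- three twiddle factors for Op(1)..Op(3); Op(0) needs no multiplier
    """
    a = ops[0]
    b = ops[1] * twiddles[0]
    c = ops[2] * twiddles[1]
    d = ops[3] * twiddles[2]
    # The 4-point DFT needs only additions, subtractions and a multiplication
    # by -j, which is a swap of real and imaginary parts with one sign change.
    s0, s1 = a + c, a - c
    s2, s3 = b + d, -1j * (b - d)
    return [s0 + s2, s1 + s3, s0 - s2, s1 - s3]

# With unit twiddle factors the butterfly reduces to a plain 4-point DFT.
outputs = radix4_butterfly([1, 2, 3, 4], [1, 1, 1])
print(outputs)
```

The three entries of `twiddles` correspond to the three complex multipliers of the butterfly unit; the remaining arithmetic is adder-only.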

The timing diagram between the input and the output of the radix-4 butterfly unit is shown in Figure 4.

From the timing diagram above, the butterfly unit is idle for three cycles in each radix-4 butterfly computation.

In the proposed method, the idle cycles are put to use, so one butterfly unit serves four 1024-point FFTs. It is important to arrange the order in which data are loaded from the four sets; the corresponding timing diagram is shown in Figure 5.

Detailed explanations are as follows.
(i) The four operands from RAM0 are Op0(0)~Op0(3), and the complex multiplications with twiddle factors are Op0′(0)~Op0′(3). The parallel outputs of butterfly unit are Op(0)~Op(3).
(ii) The four operands from RAM1 are Op1(0)~Op1(3), and the complex multiplications with twiddle factors are Op1′(0)~Op1′(3). The parallel outputs of butterfly unit are Op(0)~Op(3).
(iii) The four operands from RAM2 are Op2(0)~Op2(3), and the complex multiplications with twiddle factors are Op2′(0)~Op2′(3). The parallel outputs of butterfly unit are Op(0)~Op(3).
(iv) The four operands from RAM3 are Op3(0)~Op3(3), and the complex multiplications with twiddle factors are Op3′(0)~Op3′(3). The parallel outputs of butterfly unit are Op(0)~Op(3).

To implement pipeline access to the four memories, the reads from RAM0~RAM3 are each delayed by one cycle in turn. Because the four sets are independent of each other and undergo the same computation, they use the same access addresses and twiddle factors. The four FFTs therefore share one set of twiddle factors, which simplifies the FFT design.
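The effect of staggering the reads can be illustrated with a simplified timing model. The model is an assumption for this sketch and is not taken from the timing diagrams: each memory delivers one operand per cycle, memory m starts m cycles late, and a butterfly is issued in the cycle after its fourth operand arrives. `butterfly_issue_cycles` is a hypothetical helper.

```python
def butterfly_issue_cycles(num_butterflies, banks=4, radix=4):
    """Map (bank, butterfly index) to the cycle in which the butterfly is issued.

    Assumed model: one operand read per cycle per bank, bank m delayed by
    m cycles, butterfly issued one cycle after its last operand is read.
    """
    schedule = {}
    for bank in range(banks):
        for b in range(num_butterflies):
            first_read = radix * b + bank        # bank m starts m cycles late
            last_read = first_read + radix - 1   # four serial reads
            schedule[(bank, b)] = last_read + 1  # all four operands are cached
    return schedule

single = sorted(butterfly_issue_cycles(3, banks=1).values())
shared = sorted(butterfly_issue_cycles(3, banks=4).values())
print(single)  # one memory: the butterfly is used every fourth cycle
print(shared)  # four staggered memories: the butterfly is used every cycle
```

Under this model a single memory leaves three idle cycles between consecutive butterflies, while four staggered memories fill those cycles without any two butterflies colliding.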

3.3. Parallel Accessing for Memories

The outputs of the 1024-point FFTs above are multiplied by the twiddle factors to form new sequences, which are input into the butterfly unit. The last stage computes 1024 4-point DFTs, and the structure is shown in Figure 6.

For the last stage, the same radix-4 butterfly is used to compute the 1024 4-point DFTs. Because the operands of one butterfly belong to different memories, the four memories are accessed in parallel. The results form the 4096-point FFT.
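The two stages can be checked against a direct DFT with a short software model. This is a sketch under stated assumptions: the direct `dft` routine stands in for each 1024-point radix-4 FFT, results are written in natural order rather than in the order used by the hardware, and `two_stage_fft` is a hypothetical name.

```python
import cmath

def dft(x):
    """Direct O(N^2) DFT, used as a stand-in for the radix-4 FFT hardware."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]

def two_stage_fft(x, banks=4):
    """N-point DFT as `banks` sub-transforms followed by N/banks small DFTs."""
    n = len(x)
    sub_len = n // banks
    # Stage 1 (pipeline access): one transform per memory; memory m holds
    # the samples x[m], x[m + banks], x[m + 2 * banks], ...
    sub = [dft(x[m::banks]) for m in range(banks)]
    out = [0j] * n
    for k1 in range(sub_len):
        # Inter-stage twiddle multiplication.
        column = [sub[m][k1] * cmath.exp(-2j * cmath.pi * m * k1 / n)
                  for m in range(banks)]
        # Stage 2 (parallel access): one 4-point DFT across the memories.
        for k2, value in enumerate(dft(column)):
            out[k1 + sub_len * k2] = value
    return out

samples = [complex(i, (i * i) % 5) for i in range(16)]
reference = dft(samples)
result = two_stage_fft(samples)
print(max(abs(a - b) for a, b in zip(result, reference)))
```

The printed value is the largest deviation from the direct DFT and should be at the level of floating-point rounding. A 16-point input is used only to keep the example fast; the same code applies to 4096 points.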

4. Reconfigurable FFT Design

Because DPC requires IFFT computation, a reconfigurable FFT is designed, which can be configured as either an FFT or an IFFT.

According to the relationship between the FFT and the IFFT, the designed FFT can be configured as an IFFT. The inverse transformation is

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, \qquad n = 0, 1, \ldots, N-1, \tag{8}$$

where $W_N^{-nk} = e^{j2\pi nk/N}$.

From (2) and (8), the main differences are the sign of the exponent in the coefficients and the scaling of the results by $1/N$.

Therefore, the FFT is modified as follows. First, the real and imaginary parts of the input data are exchanged, and the FFT is computed. Then the real and imaginary parts of the result are exchanged and divided by $N$, which yields the IFFT results.

Therefore, a control signal is provided. When it is high, the FFT is computed; when it is low, the IFFT is computed. Thus, the proposed FFT processor is reconfigurable.
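The exchange of real and imaginary parts can be verified numerically. In this sketch the direct `dft` routine stands in for the FFT core, and the boolean `forward` argument is a hypothetical software counterpart of the control signal (true for FFT, false for IFFT).

```python
import cmath

def dft(x):
    """Direct O(N^2) forward DFT, used as a stand-in for the FFT core."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]

def swap(z):
    """Exchange the real and imaginary parts of a complex number."""
    z = complex(z)
    return complex(z.imag, z.real)

def transform(x, forward=True):
    """forward=True computes the DFT; forward=False reuses it for the inverse."""
    if forward:
        return dft(x)
    n = len(x)
    # Swap inputs, run the same forward transform, swap outputs, scale by 1/N.
    return [swap(v) / n for v in dft([swap(v) for v in x])]

data = [1 + 2j, -3 + 0.5j, 0.25 - 1j, 4j]
round_trip = transform(transform(data, forward=True), forward=False)
print(max(abs(a - b) for a, b in zip(round_trip, data)))
```

The trick works because exchanging the real and imaginary parts of $z$ equals $j z^{*}$, so conjugating before and after the forward transform is replaced by two swaps with no extra multipliers.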

5. DPC Design

The actual engineering application requires 4096-point DPC. DPC computes the FFT, multiplies the result by the coefficients, and finally computes the IFFT. A radix-4 FFT is used to compute the 4096-point FFT, and the four memories are accessed in the “pipeline and parallel” structure.

First, the 1024-point FFTs are implemented by one radix-4 butterfly unit, with the four sets processed in pipeline; then the four memories are accessed in parallel. The results of the 4096-point FFT are stored in the RAMs in reversed order.

Second, the results of the 4096-point FFT are read from the memories and multiplied by the matched filter coefficients, and the products are stored back in the memories in pipeline. (In the example, a Hamming window is used for the matched filter coefficients.)

Last, the IFFT is applied to the data after multiplication by the coefficients. This step uses the reconfigurable FFT to compute the IFFT, followed by division by 4096. Because floating-point arithmetic is used and 4096 = 2^12, the division can be replaced by subtracting 12 from the exponent.
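The exponent subtraction can be demonstrated on the IEEE 754 single-precision bit pattern. This is a minimal sketch: `divide_by_4096` is a hypothetical helper, and the fallback to ordinary division for zero, subnormal, infinite, and NaN values is an assumption about how such corner cases might be handled, not a description of the hardware.

```python
import struct

def divide_by_4096(value):
    """Divide a single-precision value by 2**12 via its 8-bit exponent field."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    exponent = (bits >> 23) & 0xFF
    if exponent <= 12 or exponent == 0xFF:
        # Zero, a subnormal input or result, infinity, or NaN:
        # the exponent shortcut does not apply, so divide normally.
        return value / 4096.0
    bits -= 12 << 23  # subtract 12 from the exponent field only
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(divide_by_4096(4096.0))  # 1.0
print(divide_by_4096(-3.5) == -3.5 / 4096)
```

Only an 8-bit subtraction on the exponent is needed; the sign and the 23-bit fraction are left unchanged, so no floating-point divider is required.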

6. Implementations and Analyses

An FPGA is a large-scale programmable logic device with flexible logic cells, high integration, short development time, and low development cost; it is widely used in prototype design and in the early stages of new product development. A Xilinx FPGA is used to verify the proposed FFT and DPC. The timing simulations are run in ModelSim, and the resources are estimated for the Xilinx Virtex-6 device XC6VLX240T (−1).

6.1. Implementation of the FFT Processor

The processing time periods and resources of the proposed FFT are listed in Tables 1 and 2, respectively.

From Table 1, the computation time of the proposed FFT is shorter than that of the compared methods. If the 4096-point FFT is implemented with one memory and one butterfly, the processing time is 24259 clock periods, which is about 4 times that of the proposed scheme. The running times of the other two methods, which are also based on the in-place architecture although on different platforms, are longer than that of the proposed one. Therefore, by adopting “pipeline and parallel” memory access, the proposed design indeed reduces the processing time of the in-place architecture.

From Table 2, the proposed design uses fewer LUTs than the two compared methods. Because the access addresses for the operands and the twiddle factors are generated for a 1024-point FFT rather than a 4096-point FFT, resource usage is reduced.

Therefore, the proposed “pipeline and parallel” structure both uses fewer resources and computes the FFT faster, as the above results show.

6.2. Implementation of DPC

The resources of DPC are listed in Table 3. The maximum frequency is 122.128 MHz and satisfies the engineering demand.

The timing simulation of DPC is shown in Figure 7. The data source of 4096 samples is generated in MATLAB, and the matched filtering coefficients are stored in ROM.

Figure 7 shows that the FFT and the IFFT each take 6292 cycles, and the multiplication by the coefficients takes 4110 cycles. So the total number of processing cycles of DPC is 16694. If the clock frequency is set to 100 MHz, the running time of the 4096-point DPC computation is 166.94 μs.
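The cycle budget can be reproduced with a short calculation. The cycle counts and the 100 MHz clock are the figures quoted above; the variable names are chosen here for illustration.

```python
fft_cycles = 6292        # one FFT or one IFFT, from the timing simulation
multiply_cycles = 4110   # multiplication by the matched filter coefficients
clock_hz = 100e6         # 100 MHz clock

# DPC runs one FFT, one coefficient multiplication, and one IFFT.
total_cycles = 2 * fft_cycles + multiply_cycles
time_us = total_cycles / clock_hz * 1e6

print(total_cycles, round(time_us, 2))  # 16694 166.94
```

At the reported maximum frequency of 122.128 MHz the same cycle count would correspond to a shorter running time, since the time scales inversely with the clock frequency.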

Table 4 shows the processing time of DPC with different schemes. If one butterfly is used with the in-place structure and one memory, DPC takes 52656 clock cycles; if the Xilinx FFT IP core is used to compute the FFT and IFFT, DPC takes 24992 clock cycles. DPC involves many arithmetic operations, such as additions and multiplications. Because the processing delays of these operations differ between the first two schemes, their processing times differ. The arithmetic operations of the proposed scheme have the same processing delay as those of the first scheme, and the results show that the proposed method takes less processing time than the first one. Therefore, the proposed FFT processor improves the processing speed of DPC and has advantages in engineering.

7. Conclusions

The proposed FFT structure based on “pipeline and parallel” access improves processing speed and can be applied to DPC design for single-channel data. In the processing flow of DPC, the FFT and IFFT stages adopt the proposed FFT structure to achieve high speed with fewer resources, so the DPC system obtains high real-time performance.

Meanwhile, the proposed method can also be applied to multiple-channel data. The data of each channel can be stored in one memory, so the multiple sets of multiple-channel data are stored in multiple memories.

Furthermore, the approach can be used for FFTs of other radices and for simple mixed-radix FFTs. When a radix-$r$ FFT adopts the scheme, $r$ memories are usually used, the data are divided among the memories, and a single radix-$r$ FFT is employed. A simple mixed-radix FFT, for example, a radix-2/4 FFT, can also use the proposed method. The data can be stored in four memories, which are then available for radix-2 butterfly computation, so two radix-2 butterflies can be computed in parallel. Because a radix-4 butterfly can be reconfigured into two radix-2 butterflies, a radix-2/4 FFT needs only one radix-4 FFT unit.

Therefore, the proposed method can have wide applications.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.