International Journal of Reconfigurable Computing

Volume 2016, Article ID 3561317, 7 pages

http://dx.doi.org/10.1155/2016/3561317

## Modules for Pipelined Mixed Radix FFT Processors

Computer Science Department, National Technical University of Ukraine, Peremogy Avenue 37, Kiev 03056, Ukraine

Received 26 October 2015; Revised 2 January 2016; Accepted 5 January 2016

Academic Editor: Michael Hübner

Copyright © 2016 Anatolij Sergiyenko and Anastasia Serhienko. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

A set of soft IP cores for the Winograd -point fast Fourier transform (FFT) is considered. The cores are designed by the method of spatial SDF mapping into the hardware, which provides the minimized hardware volume at the cost of slowdown of the algorithm by times. Their clock frequency is equal to the data sampling frequency. The cores are intended for the high-speed pipelined FFT processors, which are implemented in FPGA.

#### 1. Introduction

The fast Fourier transform (FFT) algorithm is widely used in many signal processing and communication systems. Because of its intensive computational requirements, it occupies a large area and consumes high power when implemented in hardware.

The FFT uses a divide-and-conquer approach to reduce the computations of the discrete Fourier transform (DFT). In the Cooley-Tukey radix-2 algorithm, the $N$-point DFT is subdivided into two ($N/2$)-point DFTs, and each ($N/2$)-point DFT is then recursively divided into smaller DFTs down to a two-point DFT. This last procedure, called the radix-2 butterfly, is just an addition and a subtraction of complex numbers.
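The recursive subdivision described above can be illustrated by a minimal software sketch. It is not a hardware description; the function names `fft_radix2` and `dft_direct` are illustrative assumptions, and the recursion is carried down to one-point transforms so that the last nontrivial stage is exactly the two-point butterfly.

```python
import cmath


def dft_direct(x):
    """Reference DFT computed straight from the definition (O(n^2))."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])  # half-size DFT of even-indexed samples
    odd = fft_radix2(x[1::2])   # half-size DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor product
        out[k] = even[k] + t            # butterfly: addition
        out[k + n // 2] = even[k] - t   # butterfly: subtraction
    return out
```

For a length of two the twiddle factor equals one, so the loop body reduces to a single complex addition and a single complex subtraction.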

Higher radix algorithms, such as radix-4 and radix-8, can be employed to reduce the number of complex multiplications, but the butterfly structure becomes more complex. Therefore, the split radix algorithm [1] is adopted to get the benefits of both the radix-2 and radix-4 algorithms.

Prime factor algorithms use the Good-Thomas mapping and the Chinese remainder theorem to decompose the $N$-point DFT into smaller DFTs whose sizes are factors of $N$ and are mutually prime [2]. With this mapping the twiddle factor multiplications are avoided at the cost of an increased number of additions and an irregular structure. A modification of the prime factor algorithm is the Winograd fast Fourier transform algorithm. It is capable of achieving the minimum number of complex multiplications, but the number of additions is increased.
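The index mapping behind the prime factor approach can be sketched in software as follows. This is an illustration under stated assumptions, not the implementation of [2]: the function names and the two-factor split `n1`, `n2` are hypothetical, and the small DFTs are computed directly from the definition. The point of the sketch is that no twiddle factor multiplication occurs between the row and column transforms.

```python
import cmath
from math import gcd


def dft(x):
    """Direct DFT from the definition, used here for the small transforms."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


def good_thomas_dft(x, n1, n2):
    """DFT of length n1*n2 (n1, n2 coprime) via the Good-Thomas mapping."""
    n = n1 * n2
    if len(x) != n or gcd(n1, n2) != 1:
        raise ValueError("need len(x) == n1*n2 with coprime n1 and n2")
    # Input map: sample index (n2*i1 + n1*i2) mod n goes to grid cell (i1, i2).
    grid = [[x[(n2 * i1 + n1 * i2) % n] for i2 in range(n2)] for i1 in range(n1)]
    # n2-point DFTs along the rows, with no twiddle factors afterwards.
    rows = [dft(row) for row in grid]
    # n1-point DFTs along the columns; cols[k2][k1] holds the 2-D result.
    cols = [dft([rows[i1][k2] for i1 in range(n1)]) for k2 in range(n2)]
    # Output map (Chinese remainder theorem): k <-> (k mod n1, k mod n2).
    return [cols[k % n2][k % n1] for k in range(n)]
```

For example, `good_thomas_dft(list(range(15)), 3, 5)` agrees with `dft(list(range(15)))` up to rounding error.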

Pipelined FFT architectures achieve high speed and high throughput by combining parallelism and pipelining [3]. Among them, the single-path delay feedback architecture and the multipath delay commutator architecture are the most popular.

In the first kind of architectures, each of the pipeline stages contains twiddle factor multiplier, -point DFT unit (), and data buffer, which stays in the feedback of the stage. This buffer is filled before the DFT computations. Therefore, such structure could not be fully loaded [4, 5].

In the second kind of architectures, the pipeline stages contain data buffer, -point DFT unit, and twiddle factor multiplier, which are connected in sequence. The data buffer is based on multipath delay commutator and provides sets of complex data which feed the DFT unit in parallel. Such architecture provides the maximum throughput at the cost of the high hardware volume [3, 6]. Besides, it must implement the uniform radix FFT, because for the mixed radix FFT the data buffers become too complex.

A systolic array scheme has also been proposed for FFT computations [7, 8]. The DFT in it is calculated as separate sums of weighted samples. It is attractive because of its regularity, scalability, locality of interconnections, and suitability for non-power-of-two transforms. However, such a processor requires substantial hardware volume. For example, the 16 × 16-point processor for 16-bit data contains 3982 adaptive logic modules (ALMs) and 33 multipliers of the Altera Stratix III FPGA, compared with the usual 20-bit pipelined FFT processor, which contains 4261 ALMs and only 24 multipliers [8].

The implementation of the pipelined FFT architecture in modern field programmable gate arrays (FPGAs) provides a high-speed hardware solution with small energy consumption. One FFT of 256 16-bit complex points dissipates approximately one microjoule in an FPGA [6]. The FFT processor reported in [9], which occupies 36.7 k ALMs and 60 multipliers, provides a speed of up to 90 GFLOPS and an efficiency of nearly 10 GFLOPS per watt, which is many times higher than that of a CPU or GPU implementation. Besides, this architecture can be adapted in an FPGA to the conditions of the problem being solved by trading off the throughput, the transform length, or the computation precision.

The papers [10–12] describe the design and implementation of the radix-$2^2$ single-path delay feedback pipelined FFT.

In most cases, power-of-two FFT processors are implemented in FPGAs. In [13] it was shown that the radix-2 FFT method requires the least number of FPGA slices, the Good-Thomas method is faster than the Cooley-Tukey method, and the Rader method has the lowest operating frequency of the pipelined processor in an FPGA.

Non-power-of-two transforms are widely used in OFDM modems. In such transforms the Winograd algorithm minimizes the number of multiplications in the DFT modules, but it also adds a degree of complexity and significantly increases the total number of adders used in the FPGA [14, 15].

In [16] a parallel architecture of the DFT module has been proposed for the computation of this algorithm. This architecture is able to deal with a large number of FFT sizes that can be decomposed into products of the factors 2, 3, 4, 5, 7, or 8.

In [17] a pipelined processor design is proposed, which uses the Cooley-Tukey FFT algorithm only in those cases where the factors of the transform length are not relatively prime.

The DFT modules, which are used in the examples of the pipelined FFT processors, are designed by the one-to-one mapping of the respective small point FFT algorithms. As a result, they need the data feeding through input ports. Consider the two stage pipeline; the number is factored to factors . Then the buffer has FIFOs of the length of more than , which are fed from inputs in the nonuniform order. Therefore, to provide the proper input data order for these stages, the complex data buffers must be attached to their ports.

Consider the DFT modules, which accept the input data sequentially for steps. Then both data buffers and twiddle factor multipliers are simplified substantially. These modules have the slowdown operation in times. Besides, the hardware volume of the DFT modules can be decreased up to times. The disadvantage of this architecture is the decrease of the FFT processor throughput up to times. But in this situation we can provide the proper system throughput by the increase of the FFT processors number, which are configured in FPGA.

In this paper we propose the design of a set of DFT units that help to implement pipelined FFT processors in which the data flow is a single sample per clock cycle.

#### 2. A Method of Pipelined Datapath Synthesis

In high-level synthesis, a DSP algorithm is usually described by a signal flow graph or a synchronous data flow (SDF) graph. In an SDF, the nodes (actors) and the edges represent the operators and the data transfers between them, respectively. Each actor consumes and generates the same amount of data in each SDF cycle [21, 22].

A uniform SDF has the property that its graph is equal to the graph of the pipelined datapath that implements the algorithm with a period of one clock cycle. The SDF nodes are then mapped into operational resources such as adders and multipliers, the edges are mapped into connections, and the delays in the edges represent the pipeline registers. This property is widely used in the synthesis of DSP modules for FPGAs in many CAD tools, such as Matlab-Simulink System Generator [23].

The synthesis of the pipelined datapath with the period of cycles is usually performed by the steps of resource selection, actor scheduling, and resource assignment. Then the datapath structure is found, and the control unit is synthesized.

It is worth mentioning that most DFT algorithms are acyclic. The most popular scheduling methods for limited resources and execution time consider acyclic SDFs. These methods are list scheduling and force-directed scheduling [24]. Register allocation is effectively implemented by the Tseng heuristic and by left edge scheduling. The use of the cyclic interval graph takes into account the cyclic nature of the SDF algorithm [25]. Retiming methods and graph folding methods simplify the SDF mapping [26, 27].

In [28, 29] the method of the datapath synthesis is proposed, which is based on SDF. This method, adapted to the acyclic SDF, is described below. In this method, SDF is represented in the three-dimensional space in the form of a triple , where is the matrix of vectors-nodes , which mean the operators or actors, is matrix of vectors-edges , performing the links between operators, and is the incidence matrix of SDF.

In the vector the coordinates , , and correspond to the type of operator, the processor unit (PU) number, and the clock cycle. The SDF graph in such representation is called spatial SDF.

Spatial SDF is split into the spatial configuration and event configuration , which correspond to the datapath structure, and its schedule. By the splitting process the vectors are decomposed into vectors , corresponding to the PU coordinates, and vectors , which mean the execution times of the relevant operators in PU . Then the temporal component of the vector is equal to the delay of transfer or processing of the respective variable.

We can assume that the matrix encodes some acceptable structural solution, since the matrix is calculated by

The structural optimization consists in finding such matrix , which minimizes a given quality criterion. It is possible to specify a matrix which provides the minimum value of . Then the vectors are found from a relationshipwhere is the matrix of vectors-nodes and is the incidence matrix of the maximum spanning tree for SDF. When looking for the effective structural solution, the following relations have to be considered. Spatial SDF is valid, if the matrix has no two identical vectors; that is,

The schedule with the period of clock cycles is correct if the operators, which are mapped into the same PU, are performed in different cycles; that is,

This inequality provides the correct circular schedule. Moreover, the next operator has to be executed no earlier than the previous one; that is,

The operators of the same type should be mapped into PU of the same type; that is, where is a set of -type vectors-operators, which are mapped in the th PU of th type ().
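The verbal conditions above (no identical node vectors, no two operators of one PU in the same cycle of the period, no consumer scheduled before its producer, and one operator type per PU) can be checked mechanically. The following is a minimal sketch under stated assumptions and is not the authors' framework: the `Node` record, the function `check_schedule`, the edge list format, and the treatment of PU numbers as global identifiers are all hypothetical.

```python
from collections import namedtuple

# Hypothetical node record: operator type, PU number, clock cycle.
Node = namedtuple("Node", ["kind", "pu", "time"])


def check_schedule(nodes, edges, period):
    """Return a list of violation messages; an empty list means all checks pass.

    nodes  -- list of Node
    edges  -- list of (producer index, consumer index) pairs
    period -- schedule period in clock cycles
    """
    problems = []

    # No two operators may share the same (type, PU, cycle) vector.
    if len(set(nodes)) != len(nodes):
        problems.append("identical node vectors")

    slots = {}    # (PU, cycle modulo period) -> index of the node using it
    pu_kind = {}  # PU -> operator type it executes
    for index, node in enumerate(nodes):
        # Operators on one PU must occupy different cycles modulo the period.
        slot = (node.pu, node.time % period)
        if slot in slots:
            problems.append(
                f"nodes {slots[slot]} and {index} collide on PU {node.pu}"
            )
        else:
            slots[slot] = index
        # Assumption: a PU executes operators of one type only.
        if pu_kind.setdefault(node.pu, node.kind) != node.kind:
            problems.append(f"PU {node.pu} mixes operator types")

    # A consumer must not be executed earlier than its producer.
    for src, dst in edges:
        if nodes[dst].time < nodes[src].time:
            problems.append(f"edge {src}->{dst} goes back in time")

    return problems
```

For instance, two additions placed on the same adder at cycles 0 and 2 are reported as a collision when the period is 2, because both fall into cycle 0 of the circular schedule.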

Then the search for the effective schedule consists in the following. The vectors are assigned the coordinate ; that is, the respective operators have the delays of a single clock cycle. The matrix is found from (2). The remaining elements of the matrix are found from (1). If inequality (5) is not satisfied for some of vectors, then the coordinate is increased for certain vectors , and the schedule search is repeated. The rest of coordinates are found from conditions (3)–(6). In such wise the fastest schedule is built, as each statement is executed in a single clock cycle without unnecessary delays.

The resulting spatial SDF can be described in VHDL, so the pipelined datapath description can be translated into a gate-level description of the FPGA configuration by a suitable compiler-synthesizer [30].

During the structure synthesis, the nodes are placed in the space according to a set of rules that provide the minimum hardware volume for the given number of clock cycles in the algorithm period. The resulting spatial SDF is described in VHDL and is modelled and compiled using suitable CAD tools.

The method is similar to the known method of the SDF folding in times [31]. However, it is distinguished from the intuitive folding procedure in the formal approach and directed optimization process. In this method, the steps of resource selection, operator scheduling, and resource allocation are implemented in a single step, providing more effective optimization.

The method is built into a framework intended for SDF graph input and graphical editing. Both the algorithm and the resulting structure are stored in XML files. The framework can translate the XML description into a synthesizable VHDL model, which can be modelled and synthesized by the usual CAD tools provided by different companies. The present limitation is that the SDF graph is optimized only by hand, using the relations, definitions, and theorems mentioned above [32]. The examples shown below were synthesized with Xilinx ISE, version 13.3.

The method has been successfully proven in the synthesis of a set of pipelined FFT processors, IIR filters, and other pipelined datapaths for FPGAs [33]. A set of DFT modules was designed using it as well.

#### 3. Example of the DFT Module Synthesis

Consider a design example of a DFT module of points. It performs the Winograd DFT algorithm, which is described in [5]:where is the input complex data set, is the complex result set, and . In this algorithm ; therefore, . The algorithm has twelve real additions and four real multiplications.
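The stated operation count of twelve real additions and four real multiplications matches the well-known Winograd algorithm for three points. Assuming that the module is the 3-point one, the data flow can be sketched as follows. The variable names and the sign convention $W = e^{-j2\pi/3}$ are assumptions of this sketch and may differ from the notation of [5].

```python
import cmath
import math


def dft3_direct(x0, x1, x2):
    """Reference 3-point DFT computed from the definition."""
    w = cmath.exp(-2j * cmath.pi / 3)
    x = (x0, x1, x2)
    return tuple(sum(x[m] * w ** (k * m) for m in range(3)) for k in range(3))


def winograd_dft3(x0, x1, x2):
    """3-point Winograd DFT: 6 complex additions, 2 real-by-complex products."""
    c = math.cos(2 * math.pi / 3) - 1.0  # equals -1.5
    s = math.sin(2 * math.pi / 3)        # equals sqrt(3)/2

    t1 = x1 + x2                         # complex addition 1
    y0 = x0 + t1                         # complex addition 2, output 0
    d = x2 - x1                          # complex addition 3
    m1 = c * t1                          # two real multiplications
    # Multiplication by j*s: scale by s (two real multiplications), then
    # swap the real and imaginary parts and negate the new real part.
    m2 = complex(-s * d.imag, s * d.real)
    s1 = y0 + m1                         # complex addition 4
    y1 = s1 + m2                         # complex addition 5, output 1
    y2 = s1 - m2                         # complex addition 6, output 2
    return y0, y1, y2
```

Six complex additions give twelve real additions, and the two products of a real constant with a complex value give four real multiplications, in agreement with the count given in the text.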

To minimize the number of multiply units (MPUs) and to increase the clock frequency, it is worth using application-specific MPUs [34]. Then the coefficient . To minimize the number of addition operations, it is represented by the digits 1, 0, and −1; that is, .

Then the multiplication is implemented as
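A multiplication by a constant written in the digits 1, 0, and −1 reduces to shifts, additions, and subtractions. The following minimal sketch illustrates this idea only in general form: it assumes a nonnegative integer (fixed-point scaled) coefficient and left shifts, whereas the module in the text uses right shifts of the data. The function names `to_csd` and `shift_add_multiply` are hypothetical, and the digit string of the actual coefficient is not reproduced here.

```python
def to_csd(value):
    """Signed-digit form of a nonnegative integer, least significant digit first.

    Each digit is -1, 0, or 1, and no two adjacent digits are nonzero
    (the non-adjacent form), which minimizes the number of nonzero digits.
    """
    digits = []
    while value:
        if value & 1:
            d = 2 - (value & 3)  # +1 when value % 4 == 1, -1 when value % 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits


def shift_add_multiply(x, digits):
    """Multiply integer x by the constant encoded in digits using shifts only."""
    acc = 0
    for position, d in enumerate(digits):
        if d > 0:
            acc += x << position  # digit 1: add the shifted operand
        elif d < 0:
            acc -= x << position  # digit -1: subtract the shifted operand
    return acc
```

For example, the constant 7 is encoded as 8 − 1, so the product needs one subtraction instead of two additions.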

SDF of this algorithm is shown in Figure 1. The black circles in it represent the input-output nodes, circle with a cross does complex addition, and symbols “≫” mean the shift right operation to bits. The edge, which is loaded by , means the multiplication to , which is performed as inversion of the image part of data and swapping the real and image parts of data. Each node performs a delay to a single clock cycle. Therefore, this SDF makes the structure of a module, which computes DFT with the period of cycle.