#### Abstract

A reconfigurable systolic array (RSA) architecture that supports the realization of DSP functions for multicarrier wireless and multirate applications is presented. The RSA consists of coarse-grained processing elements that can be configured as complex DSP functions that are the basic building blocks of Polyphase-FIR filters, phase shifters, DFTs, and Polyphase-DFT circuits. The homogeneous characteristic of the RSA architecture, where each reconfigurable processing element (PE) cell is connected to its nearest neighbors via configurable switch (SW) elements, enables array expansion for parallel processing and facilitates time sharing computation of high-throughput data by individual PEs. For DFT circuit configurations, an algorithmic optimization technique has been employed to reduce the overall number of vector-matrix products to be mapped on the RSA. The hardware complexity and throughput of the RSA-based DFT structures have been evaluated and compared against several conventional modular FFT realizations. Designs and circuit implementations of the PE cell and several RSAs configured as DFT and Polyphase filter circuits are also presented. The RSA architecture offers significant flexibility and computational capacity for applications that require real time reconfiguration and high-density computing.

#### 1. Introduction

In multicarrier wireless and multirate applications, DFT and Polyphase filters are two key functions frequently employed in the processing of digital signals. Orthogonal Frequency Division Multiplexing (OFDM), utilized in numerous multicarrier communication systems including Digital Audio/Video Broadcasting (DAB/DVB), personal area networks (PAN), and mobile communications [1–5], depends on the efficient realization of the DFT. Furthermore, hardware solutions that offer computational flexibility of $N$-point DFTs for OFDM modulation-demodulation of subcarriers enable the systems to operate in a multimode and multistandard environment. For example, the transmission modes of DAB involve FFT computations of 2048, 1024, 512, or 256 subcarriers while DVB-T uses 2K and 8K FFTs for data modulation/demodulation. On the other hand, dedicated DFT hardware solutions for OFDM modulators, where the number of subcarriers is not a power of two, have been considered such as the processor for the Digital Multimedia Broadcast standard (DMB-T) described in [6]. A solution to the multimode multistandard problem could be realized through the implementation of reconfigurable $N$-point DFT/IDFT processors, where $N$ is not restricted to be a power of two.

Polyphase filters have been effectively exploited in a broad range of applications including digital signal interpolation and decimation for speech and image processing as well as multicarrier communications [7–11]. The combination of Polyphase Filter banks with FFTs has resulted in computationally efficient Group Demultiplexer realizations [10]. Multirate signal processing or communication applications incorporating polyphase filter banks frequently need to be reconfigurable [12, 13].

For software defined radio (SDR) [14] applications, digital signal processing is usually carried out using a general purpose processor or digital signal processor (DSP). These processors allow the loading of different software for different system configurations. The need to handle signals with different data rates and protocols as well as multicarrier communications points to the desirability of incorporating hardware reconfigurability in SDR platforms.

Multicarrier processors based on Polyphase-DFT and IDFT-Polyphase architectures offer a computationally effective solution to the multiplexing and demultiplexing of the composite FDM carriers. The application of such multicarrier processing across a frequency band of interest could provide a potential solution to the problem of dynamic channel selection for SDR. Thus, a flexible hardware platform where real time reconfiguration of Polyphase-DFT and IDFT-Polyphase functions is supported may provide a computationally effective solution for high data rate SDR applications as compared to one based on multiple processors.

FFT-based algorithms have been widely used to realize hardware efficient high-throughput DFT circuits [15–18]. However, hardware FFT implementations are typically difficult to reconfigure due to the problems associated with signal routing between successive butterfly stages.

Several flexible FFT architectures have been proposed for the processing of $N$-sample input sequences where $N$ is a power of two [19, 20]. The main feature of these architectures is that the circuits can be extended to support different input sequence sizes. The architecture proposed in [19] consists of a number of identical processing elements connected in a pipeline fashion. This architecture offers some flexibility in terms of configurability, but it requires significant logic resources, and the throughput is limited by the technology constraints inherent in a serial I/O design. The architecture proposed in [20] is modular and consists of identical pipelined butterfly building blocks whose number depends on the size of the transform. However, connections among the butterfly modules include feedback and long signal links, making reconfiguration difficult.

Other systolic array architectures have also been proposed for the computation of $N$-point DFTs, where $N$ is a composite number consisting of two or more cofactors [21–24]. The architectures proposed in [22, 23] offer an efficient way to calculate $N$-point DFTs based on an -by- array where $N$ is a product of two numbers ( = ). The base-4 algorithm proposed in [21] is implemented in a systolic array architecture that offers an efficient way to compute $N$-point DFTs. The base-4 algorithm supports DFT computations for input sequences where $N$ is a multiple of 256. The systolic array proposed in [24] supports DFT computation for $N$-point DFT where = . These architectures share a common characteristic, namely, that the input and output data sequences are reordered throughout the computation process.

DFT circuit designs reflecting vector-matrix multiplication result in modular structures that facilitate circuit reconfiguration and can be realized for input sequences of arbitrary length. However, circuit implementation based on direct DFT computation requires $N$ vector multiplications, each of which consists of $N$ complex multiplications. To reduce hardware requirements for DFT circuit implementations, systolic array designs based on Horner's rule have also been proposed [25, 26]. The limitation of all of the above-mentioned systolic array architectures is that they have been designed specifically for DFT/FFT computations and do not support other functions. Reconfigurable systolic array architectures that support different DSP functions including DFT/FFT have been documented in [27–29] and will be considered in the context of the current work in later sections.

In this paper, a scalable systolic array architecture which can be configured to function as an FIR filter, Filter bank, phase shifter, DFT/IDFT, and Polyphase-DFT circuit is presented. The architecture offers computational flexibility and allows DSP functions to be executed in parallel, in pipelined configurations, or, alternately, the DSP function can be partitioned and processed sequentially by shared elements of the systolic array.

For DFT configurations, the RSA architecture supports processing of input sequences of arbitrary length. To reduce hardware requirements for the $N$-point DFT implementation, the sum of complex products inherent in the computation is regrouped so as to reduce the number of complex multiplications at the cost of increasing the number of additions. The arithmetic configuration required to compute a partial sum of complex products can be mapped on a single PE cell. Circuit implementations based on a combination of common subexpression and bit-slice arithmetic (CSE-BitSlice) techniques have also been incorporated in the PE design to achieve low-complexity and high-throughput performance [30].

The paper is organized as follows. In Section 2, systolic array architecture capable of supporting Polyphase-FIR filtering and DFT computations is presented. Mapping techniques aimed at achieving low-complexity DFT processors on the systolic arrays are also outlined. Configurations for parallel and sequential DFT processor implementations are also presented in this section. In Section 3 complex multiplication requirements are considered for DFTs with or using the proposed factorization approach to achieve hardware reduction. Section 4 presents a comparison of hardware complexity and performance for DFT circuits mapped on the RSA and several modular FFT structures reported in literature. Section 5 deals with circuit design, FPGA implementation, and simulation results for the RSA, the PE cell, and representative RSA circuits configured to implement DFT processors and polyphase filters. In Section 6, the RSA is evaluated and compared to representative configurable systolic arrays reported in literature in terms of flexibility and performance. Summary and conclusions are presented in Section 7.

#### 2. Systolic Array Architecture for DFT and Polyphase Filtering Computations

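The DFT and IDFT can be expressed as follows:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{1}$$

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-nk}, \quad n = 0, 1, \ldots, N-1, \tag{2}$$

where $W_N = e^{-j2\pi/N}$.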

As can be seen from (1) and (2), each output depends on the sum of partial products generated by the multiplication of the input sequence terms and the known coefficients. Direct DFT/IDFT computation requires a total of $N^2$ complex multiplications and $N(N-1)$ complex additions. For large values of $N$, the number of multiplications and additions required for both DFT and IDFT can be rather substantial. However, the number of complex arithmetic operations can be reduced by the application of a factorization and term aggregation technique. For each output, the factorization process first identifies terms that multiply a common coefficient. These terms are then summed together, and the result is multiplied by the common coefficient. This process is repeated and the partial products are accumulated to produce the individual outputs. Furthermore, due to the property of the DFT algorithm, the individual partial products can be used to determine the partial results of one or three other outputs depending upon whether $N$ is odd or even. The factorization technique can be repeated to generate all outputs of the $N$-point DFT/IDFT. Thus it is possible to significantly reduce the number of complex multiplications required to compute DFT/IDFT as indicated in some detail in Appendices A and B for even and odd values of $N$, respectively. Indeed, a total of only complex multiplications are required to compute an $N$-point DFT where the parameter is equal to , , or and parameter is equal to , , or for $N$ a multiple of four, even, and odd, respectively. This factorization approach offers the advantage of reducing the overall number of complex multiplications without resorting to data recursion. The data routing and aggregation inherent in this technique make it particularly suitable for a systolic array implementation incorporating circuit scaling, expansion, and reconfiguration features.
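The aggregation step described above (sum the inputs that share a coefficient, then multiply once) can be illustrated with a minimal Python sketch. It is not the factorization of Appendices A and B: it only groups terms per output and does not reuse partial products across outputs. The function names `direct_dft` and `grouped_dft` and the multiplication counter are assumptions introduced for illustration.

```python
import cmath
from collections import defaultdict

def direct_dft(x):
    """Direct DFT: one complex multiplication per (input, output) pair."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
            for k in range(N)]

def grouped_dft(x):
    """Sum inputs that share a coefficient, then multiply once per coefficient."""
    N = len(x)
    W = [cmath.exp(-2j * cmath.pi * m / N) for m in range(N)]
    outputs, mults = [], 0
    for k in range(N):
        groups = defaultdict(complex)
        for n in range(N):
            groups[(n * k) % N] += x[n]   # aggregate terms with a common coefficient
        acc = 0j
        for m, s in groups.items():
            if m == 0:
                acc += s                  # coefficient W^0 = 1 needs no multiplier
            else:
                acc += W[m] * s           # one multiplication per distinct coefficient
                mults += 1
        outputs.append(acc)
    return outputs, mults

x = [complex(n, -n) for n in range(8)]
X, mults = grouped_dft(x)
```

For this 8-point example the grouped form uses 35 complex multiplications where the direct form uses 64, at the cost of additional additions. The reduction reported in the paper is larger because it also shares partial products among outputs.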

A -tap FIR filter is described by:

The output can be alternately defined as the product of two vectors, namely, an input vector consisting of the data samples and a vector reflecting the filter coefficients. Thus an array of multipliers and adders could be employed to meet the computational requirements of a -tap FIR filter.
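The vector-product view of the FIR filter can be sketched as follows. This is a minimal illustration assuming real-valued data, zero initial conditions, and the conventional form in which the output at time `n` is the dot product of the coefficient vector with the most recent samples. The name `fir_output` is hypothetical.

```python
def fir_output(x, h, n):
    """Output sample n as the dot product of the coefficients with recent inputs."""
    # Samples before the start of the sequence are taken as zero.
    window = [x[n - k] if n - k >= 0 else 0 for k in range(len(h))]
    return sum(hk * xk for hk, xk in zip(h, window))

h = [1, 2, 3]        # filter coefficients (one per tap)
x = [1, 0, 0, 4]     # input samples
y = [fir_output(x, h, n) for n in range(len(x))]
```

Each output sample needs one multiplier per tap and an adder chain, which is why a row of multiply-add cells matches the computation.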

The DFT, on the other hand, corresponds to the product of the input data vector and a matrix consisting of the DFT coefficients. For a DFT realization, an -by- array structure consisting of complex multipliers is required to compute an $N$-point DFT. Figure 1 depicts a systolic array of processing elements (PE) that can be used for parallel computation of an $N$-point DFT based on the expressions in Appendices A and B. This array provides an output sequence at every clock cycle. Each processing element in this array structure can simultaneously compute four real multiplications. If $N$ is a multiple of four, each set of four complex input data shifted into the PE is combined and the partial result is multiplied by the factored coefficient. The set of four partial results generated is then added to the results shifted in from the PE cell in the previous row. As shown in Figure 1, each PE cell in the first row of the array combines the set of input data, , and based on the expression described in (A.3) in Appendix A where and are defined in (A.4), respectively. Similarly, each PE cell in row combines the set of input data , , and . The sum or difference generated is then multiplied by the real and imaginary part of the factored coefficients represented by and as described in (A.3). The partial products generated from the multiplication process are added to results shifted in from PE cells in the previous row. The partial products generated for can be used to calculate partial results of , and simultaneously since all four multiplications involve the same expressions and coefficients as shown in the second part of (A.3) and (A.5). Thus, the PE computes the multiplication-accumulation on a set of four input data and produces a set of four partial results based on one complex multiplication.
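The row-by-row accumulation described above can be sketched in Python under a strong simplification: one PE per matrix entry and one complex multiplication per PE, without the four-input grouping of Appendices A and B. The function name `systolic_dft` and the array dimensions used here are assumptions for illustration only.

```python
import cmath

def systolic_dft(x):
    """Vector-matrix DFT in which each row of PEs adds to results from the row above."""
    N = len(x)
    partial = [0j] * N                        # partial results entering the first row
    for r in range(N):                        # row r processes input sample x[r]
        partial = [partial[c] + x[r] * cmath.exp(-2j * cmath.pi * r * c / N)
                   for c in range(N)]         # PE (r, c): multiply, then accumulate
    return partial                            # column c delivers output X(c)

X_impulse = systolic_dft([1, 0, 0, 0])
X_const = systolic_dft([1, 1, 1, 1])
```

Because every PE only needs the result from the cell directly above it, the data flow uses nearest-neighbor links, which is the property the RSA relies on.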

If $N$ is not a multiple of four, each PE produces two partial results based on the multiplication of the factored coefficient by the sum generated from the set of complex input data. If $N$ is even, the set of input data consists of two or four complex data as described in (A.8) and (A.9) in Appendix A. If $N$ is odd, a set of two complex input data is shifted into each PE as shown in (B.2) and (B.4) in Appendix B.

Since each PE cell is connected to its immediate neighbors, the systolic array structure in Figure 1 supports DFT computation of input sequences where data sequences are sequentially shifted in. There are two possible mapping schemes for sequential computation of an $N$-point DFT/IDFT based on the systolic array architecture depicted in Figure 1. The block diagram in Figure 2 depicts a parallel input–serial output mapping scheme where all input sequences are shifted in simultaneously, and a maximum of 4 outputs is shifted out at every clock cycle. Based on this configuration, an output sequence is shifted out in clock cycles if $N$ is a multiple of four. If $N$ is not a multiple of four, the output sequence is shifted out in clock cycles.

The mapping shown in Figure 3 depicts a sequential synchronous structure where the inputs and outputs operate off the master clock. This serial input-parallel output mapping scheme shifts 4 input sequences into the processing elements at every clock cycle where partial results are combined with data generated in previous clock cycles. All output sequences are available in or clock cycles for the cases of $N$ even and odd, respectively.

The sequential structures in Figures 2 and 3 could provide higher data throughput if the dimension of the array is expanded. If the parallel input-serial output structure consists of columns of PE cells, it would shift out the complete set of output sequences after clock cycles. Similarly, the serial input-parallel output structure would shift out all output sequences after clock cycles, based on rows of PE cells.

A systolic array structure consisting of a 1-by- array of PE cells could be used to realize a polyphase filter bank consisting of four -tap FIR filters. Figure 4 depicts the mapping of such a filter bank onto an array structure consisting of PE cells cascaded together. The input and output signals shown in Figures 2 to 4 are all consistent with the complex data format.

Since each PE cell in the systolic array contains four real multipliers, it could also be used to realize a phase shifter.

#### 3. Estimation of DFT Complexity

In Appendices A and B it is shown that, by factoring and aggregating terms that multiply a common coefficient, the total number of complex multiplications required for an $N$-point DFT can be reduced significantly. For , the four output sequences and for can be calculated concurrently using only complex multiplications. Thus, for parallel computation, the total number of complex multiplications required to compute all output sequences where is . For and , each set of output sequences can be calculated concurrently with and complex multiplications, respectively. Thus, the total number of complex multiplications required is and for DFT computation of input sequences for which and , respectively. The total number of complex operations required to compute an $N$-point DFT/IDFT is summarized in Table 1. From Table 1 we see that for the three cases presented the smallest number of complex multiplications occurs when $N$ is a multiple of four.

#### 4. Comparison of Complexity and Throughput of RSA and Reconfigurable DFT/FFT Architectures

In this section, the number of complex multipliers contained in representative reconfigurable DFT/FFT structures is evaluated and compared with a DFT circuit configured on the RSA architecture. The comparison focuses on complex multipliers as they require significantly more hardware resources than adders. Throughput performance and DFT dimensionality constraints are also considered for each structure. The RSA structure used in this comparison has a serial input-parallel output architecture similar to the one depicted in Figure 3. Table 2 presents throughput and complex multiplier requirements for the RSA and for circuit designs based on a 2D-Systolic array [24], base-4 [21], the RADComm [28], the CSoC [29], and the modular architectures presented in [19]. However, only the RADComm and RSA circuits can support DFT configurations for which $N$ is an odd number. Throughput timing is defined as the number of clock cycles between corresponding outputs of any two sequential $N$-point DFTs.

Figures 5 and 6 illustrate the number of clock cycles per $N$-point DFT computation and the number of complex multipliers required by the RSA and representative DFT implementations identified in Table 2. From Figures 5(a) and 5(b) it is apparent that the RSA-based DFT requires more complex multipliers than the base-4 architecture but provides higher throughput. Furthermore, the number of complex multipliers required by the RSA circuit is a quarter of the number in a 2D-systolic structure for any value of $N$. However, the latter provides a higher throughput for DFTs for which $N$ is greater than 256. Thus, the RSA is more compact and provides a higher throughput than the 2D-systolic array structure for $N$ smaller than 256. The Modular architecture, on the other hand, requires more complex multipliers and has a lower throughput than the RSA. Moreover, the base-4, the 2D-systolic, and the Modular structures impose specific restrictions on the dimensionality of $N$. Specifically, for the 2D-systolic and the Modular architectures, $N$ is limited to values that are a power of two, whereas the base-4 architecture supports only values of $N$ that are multiples of 256, as noted in Section 1. For the purpose of this comparison, the RSA was configured to support DFTs for which $N$ is a power of 2 or a multiple of 4.

In contrast, the RADComm circuit supports DFT implementations for $N$ even or odd. However, in comparison with the RSA, it requires four times the number of multipliers if and twice as many otherwise. Furthermore, for $N$ even, the RADComm circuit has one fourth the throughput of the RSA and, for $N$ odd, the throughput is one half. Figure 6 provides throughput results and hardware requirements in terms of the number of complex multipliers for the RSA and the RADComm circuit.

The CSoC circuit incorporates 16 PE cells per cluster together with fixed arithmetic blocks, which can be configured to perform 4 complex butterfly computations. For a 256-point FFT, approximately 500 clock cycles are needed to perform the complex butterfly computations. Indeed, for FFTs with input sequences for which the throughput is considerably lower than that for the other implementations identified in Table 2.

#### 5. RSA Circuit Design, Implementation, and Simulation Results

The RSA circuit design is based on a systolic array architecture consisting of PEs interconnected via SWs as depicted in Figure 7. Both the PEs and SWs can be reconfigured dynamically, the former as an arithmetic processor and the latter as a flexible router linking the neighboring PE cells. The RSA shifts reconfiguration and input signals into the PEs and SWs on separate data buses, which enables the circuit to continue its operation while the reconfiguration is in process.

##### 5.1. The Processing Element (PE) Cell

The PE cell incorporates arithmetic circuits that can be configured to perform a combination of addition and multiplication operations needed by the RSA to execute a broad range of DSP functions, that is, Polyphase-FIR filter, phase shifter, DFT, IDFT, Group Demultiplexer (Polyphase-DFT), and Group Multiplexer (IDFT-Polyphase). The PE consists of three main building blocks: the arithmetic unit (AU), the control unit (CU), and the memory unit (MU). The input signals shifted into the PE cells via eight data buses identified as to are processed by the AU, and the configuration data is loaded via the bus as indicated in Figure 8. Configuration data consists of control information, saved in the CU, that dictates the PE mode of operation and the MU stored coefficients required by the AU to execute arithmetic computations. The input data is in complex format while the configuration data is in real format.

The eight data buses are partitioned into two sets, to and to . Input data routed over the first set is typically multiplied by the stored coefficients while the partial results available from other cells are shifted in via the second set. These partial results are then combined with the results generated by the multiplication process to produce the PE’s output at the four output ports depicted in Figure 8.

For DFT implementations for which the data block of length $N$ is even, input data is partitioned into four input sequences and shifted into the PE cell via the first set of input buses. The results are shifted out as two or four output sequences. Alternately, for the case that $N$ is odd, the input data and output results consist of two sequences that are clocked in/out synchronously.

For Polyphase/FIR filtering computation, two complex input and output sequences are shifted in and out of the cell at every clock cycle. Partial results from previous cells are shifted over the second input bus of the PE cells and are combined with the partial products generated by the multiplication of input data and the filter coefficients. For phase shifting computation, only one complex input and one complex output sequence need to be shifted in and out of the cell at every clock cycle.

The MU incorporates a block of RAM where coefficients are stored while the CU controls data flow among the sub-blocks within the AU based on the mode of operation selected. A total of nine modes of operation are supported by the CU, which can be configured in real time via the configuration data bus. The AU which contains the PE’s arithmetic functions interfaces with MU and CU as depicted in Figure 9.

###### 5.1.1. The Memory Unit (MU)

The MU is a storage unit where a total of four sets of coefficients, each of which consists of two 9-bit words, can be stored. The four sets of coefficients allow the AU to perform sequential calculations on four different sets of input data. For DFT algorithms requiring a parallel rather than a sequential signal processing implementation, only one set of coefficients is forwarded to the AU. The coefficients can be loaded into the MU or updated via the configuration data bus in real time.

###### 5.1.2. The Arithmetic Unit (AU)

Figure 10 depicts a block diagram of the AU architecture where each of the multiplier blocks performs two real multiplications involving complex data and the real or quadrature part of the coefficient. Thus, a total of four real multiplication operations can be performed by the PE cell.
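The statement that four real multiplications make up one complex multiplication-accumulation can be checked with a short Python sketch. Complex values are represented as `(real, imag)` tuples, and the function names are assumptions. The second function shows, as a general arithmetic illustration rather than the scheme of the Appendices, that the same four real products also give the product with the conjugate coefficient.

```python
def real_products(x, w):
    """The four real multiplications behind one complex multiplication."""
    xr, xi = x
    wr, wi = w
    return xr * wr, xi * wi, xr * wi, xi * wr

def complex_mac(x, w, acc):
    """acc + x * w using four real products and additions only."""
    p1, p2, p3, p4 = real_products(x, w)
    return acc[0] + p1 - p2, acc[1] + p3 + p4

def complex_mac_conj(x, w, acc):
    """acc + x * conj(w), reusing the same four real products with changed signs."""
    p1, p2, p3, p4 = real_products(x, w)
    return acc[0] + p1 + p2, acc[1] + p4 - p3

result = complex_mac((1, 2), (3, 4), (0, 0))
result_conj = complex_mac_conj((1, 2), (3, 4), (0, 0))
```

Here `(1 + 2j)(3 + 4j)` gives `(-5, 10)` and `(1 + 2j)(3 - 4j)` gives `(11, 2)`, both from the products 3, 8, 4, and 6.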

The AU can be configured to support several modes of operation, including complex and real multiplication, addition, or a combination of these arithmetic operations. Complex multiplication-accumulation is required for DFT/IDFT implementations while complex multiplication and real multiplication-accumulation are needed to realize phase shifting and FIR filtering, respectively. The complex multiplication-accumulation configuration is shown in Figure 11 while Figures 12 and 13 depict complex multiplication and real multiplication-accumulation respectively.

###### 5.1.3. The Control Unit (CU)

The CU, depicted in Figure 14, is responsible for the PE modes of operation, which are loaded in via the configuration data bus. It multiplexes the PE cell’s three clock signals, namely, clk, clk2, and clk4 to synchronize the input data and coefficients based on parallel or sequential mode of computation. The clock signal clk2 has a frequency twice the frequency of clk and clk4 has a frequency four times the frequency of the clk signal. If the circuit is configured to compute in sequential mode, the data input is controlled by the clk signal while the computation inside the AU is controlled by the signal clk2 if two complex multiplications are performed, or clk4 if four complex multiplications are performed. The CU incorporates a multiplexer and a RAM block that supports nine modes of operation MO.1 to MO.9, which dictate the flow of signals in the AU in accordance with the identified target function. The first five modes of operation MO.1 to MO.5 are allocated for DFT configurations for (: even or odd), is a multiple of four, is even, and is odd, respectively. The next two modes of operation are allocated for phase shifters configured to operate in sequential fashion where two (MO.6) or four (MO.7) complex multiplications are performed. The modes of operation MO.8 and MO.9 are reserved for FIR filters configured to operate in parallel and sequential modes, respectively.

##### 5.2. The Switch (SW)

The switches route signals from the RSA’s inputs to the PE cells, between the PE cells, and from the PE cells to the RSA’s output [31]. In this section, primary signals are defined as signals at the input and output of the RSA circuit as well as signals at the output of SWs that are routed to other SW cells as indicated in Figure 15. Secondary signals are defined as signals that originate at the output of a PE cell and connect to the adjacent SW. Similarly, signals that originate at the output of an SW and connect to the adjacent PE cell are defined as secondary signals. Each SW interfaces with four neighboring SWs and three nearby PE cells. The SW shifts primary signals out to its neighboring SWs on its east side and south side. It also routes the primary and secondary signals to the PE cell on its east side. The primary signals routed to neighboring PE cells contain data to be further operated on by the PE’s CU. The secondary signals shifted to these PE cells are partial results from the previous stage to be combined with results generated by the multiplication process. Figure 15 depicts interconnections between an SW and its neighboring SW and PE cells.

The SW has two types of input interfaces, namely, data and control. The first, which consists of sixteen data buses in complex format, supports data that is shifted in from system inputs or data that is shifted out of the neighboring PE cell. The sixteen data buses are grouped into two sets of input data representing the primary and secondary signals. The primary input data consists of eight 24-bit data signals, of which four are shifted into the SW from the north side and the other four are shifted in from the west side. The secondary input data consists of eight 34-bit data signals, of which four are shifted into the SW from the west side and the other four from the north side. The control input interface is dedicated to reconfiguration data that determines the modes of operation for the SW. Reconfiguration data can be loaded into the switch in real time via the CONF bus. Signal outputs are shifted out of the SW via twelve data buses represented by two sets of primary and one set of secondary data buses, all in complex format.

As shown in Figure 16, each set of the primary output buses consists of four data signals that are shifted out to PE cells to the east and to the south side of the SW. The set of secondary buses consists of four data signals that are shifted out to the PE cell on the east side of the SW.

The SW consists of two building blocks: the routing unit (RU) and the control unit (SW-CU). The SW-CU contains control data that is used to direct the flow of signals from the inputs to the outputs of the switch. The SW-CU controls the set of clock signals clk, clk2, and clk4 that are used to synchronize the routing of signals between the inputs and outputs of the SW for sequential and parallel modes of operation. The relationships between the signals clk, clk2, and clk4 have been defined in Section 5.1.3. The SW-CU consists of a multiplexer and a RAM block. The RAM block contains eight sets of routing data, each of which represents a specific routing configuration. The multiplexer selects one of the eight routing modes based on a 3-bit control bus. The control mode is loaded via the configuration data bus in real time if necessary. Figures 17 and 18 show the data flow between the SW-CU and RU building blocks inside the SW and the block diagram of the SW-CU building block, respectively.
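The selection of one of eight stored routing configurations by a 3-bit control word can be modeled behaviorally as a table lookup. In the Python sketch below, the port names (`east`, `south`, `west`, `north`) and the contents of `ROUTING_RAM` are hypothetical placeholders; the actual routing data stored in the SW-CU is not specified in this section.

```python
# Eight hypothetical routing configurations: output port -> input port.
ROUTING_RAM = [
    {"east": "west", "south": "north"} if i % 2 == 0
    else {"east": "north", "south": "west"}
    for i in range(8)
]

def select_routing(ram, control):
    """Multiplexer: pick one of eight configurations from a 3-bit control value."""
    if len(ram) != 8 or not 0 <= control < 8:
        raise ValueError("expected eight configurations and a 3-bit control value")
    return ram[control]

def route(config, inputs):
    """Routing unit: drive each output port from its configured input port."""
    return {out_port: inputs[in_port] for out_port, in_port in config.items()}

outputs = route(select_routing(ROUTING_RAM, 1), {"west": 10, "north": 20})
```

Updating an entry of the table while another entry is selected mirrors the idea of loading new routing data without interrupting operation.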

##### 5.3. The RSA Circuit for DFT Computations

For parallel configuration, a 4-by-4 array that can implement a 16-point as well as a 9-point DFT is depicted in Figure 19. Indeed, parallel DFT implementations for the case that $N$ equals 10 and 12 can be realized on RSAs of 3-by-5 and 3-by-3 PE cells, respectively. Each of these arrays could also be configured to perform DFT computations for longer input sequences using serial modes of operation. In fact, the 4-by-4 array could also be configured to perform parallel input-serial output computation of a 14-point DFT for which the complete output sequence would be available every two clock cycles. The 4-by-4 array could also be configured to implement a serial input-serial output realization of a 256-point DFT. In that case, the RAM block in each PE cell has to be expanded to support a set of 256 coefficients, and a complete output sequence would be available every 256 clock cycles.

##### 5.4. The RSA Circuit for Polyphase Filter Implementation

An RSA circuit consisting of a 1-by-8 array of PE cells, as shown in Figure 20, can be configured to realize a 2-channel, 8-tap Polyphase filter.

The configured filter bank consists of four 8-tap FIR filters, each of which operates independently. Two complex input data sequences are simultaneously shifted into the array. Partial results generated by the multiplication of these inputs by the two coefficients in each PE cell are shifted to the next PE cell to be combined with partial products computed in that cell. The filter outputs are available at the output of the last PE cell.
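The cascade in which each cell multiplies the input by its stored coefficient and adds the partial result shifted in from the previous cell can be sketched as a transposed-form FIR chain. This Python model is a simplification assuming one real channel and one coefficient per cell, whereas the PE described here handles two complex inputs and two coefficients. The function name `cascade_fir` is hypothetical.

```python
def cascade_fir(x, h):
    """FIR filter as a chain of multiply-add cells passing partial results onward."""
    taps = len(h)
    regs = [0] * taps                 # regs[i]: partial result registered at cell i
    y = []
    for sample in x:                  # one clock cycle per input sample
        new_regs = [0] * taps
        for i in range(taps):
            shifted_in = regs[i - 1] if i > 0 else 0      # from the previous cell
            new_regs[i] = shifted_in + h[taps - 1 - i] * sample
        regs = new_regs
        y.append(regs[-1])            # filter output leaves the last cell
    return y

y = cascade_fir([1, 0, 0, 4], [1, 2, 3])
```

The result matches the direct convolution sum, and the output is taken from the last cell of the chain as in the text.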

##### 5.5. Circuit Implementations and Simulation Results

In this section, circuit implementation and simulation results are presented for the PE and for the RSA structures configured for parallel computation of $N$-point DFTs, for $N$ equal to 9, 10, 12, and 16, as well as for a Polyphase filter bank consisting of four 8-tap FIR filters. Additional results for RSA circuit implementations including a 32-point sequential DFT, 8-channel Polyphase filter, and 8-channel Polyphase-DFT Group Demultiplexer have been presented in [31].

For the implementations presented here, the PE complex data inputs are 24 bits wide, and the intermediate result inputs to the PEs are 34 bits wide. Real coefficients are represented by 9-bit words while the complex coefficients in the DFT circuit consist of 18-bit words with 9-bit real and 9-bit quadrature parts. Two’s complement number representation has been used throughout, and the arithmetic computations are based on fixed-point representation. The RSA structures have been designed in VHDL, and MATLAB models of the configured functions have been used in circuit functional simulations. The VHDL designs are based on a generic library and can be synthesized and mapped onto an FPGA or an ASIC. The circuit implementations presented have been synthesized and mapped onto Xilinx's xc5vlx330-2ff1760 FPGAs [32].
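The effect of representing a coefficient as a 9-bit two's complement fixed-point word can be illustrated in Python. The scaling used below (8 fractional bits with saturation) is an assumption made for this sketch; the text does not state the position of the binary point. The function names are hypothetical.

```python
import math

def to_fixed(value, bits=9, frac_bits=8):
    """Quantize to a signed integer word; frac_bits is an assumed scaling."""
    scaled = round(value * (1 << frac_bits))
    lowest, highest = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lowest, min(highest, scaled))      # saturate to the 9-bit range

def to_twos_complement(word, bits=9):
    """Bit pattern of a signed word in two's complement."""
    return word & ((1 << bits) - 1)

# Real and quadrature parts of one DFT coefficient, exp(-j*pi/4).
real_word = to_fixed(math.cos(math.pi / 4))
quad_word = to_fixed(-math.sin(math.pi / 4))
```

With this assumed scaling the two words are 181 and -181, and the quantization error of about 0.02/256 per coefficient is the kind of effect a MATLAB reference model would expose in functional simulation.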

Simulations of the PE circuit show that it can shift out four data samples per clock cycle at a clock frequency of 55 MHz. Thus the 4-by-4 RSA circuit computes a 16-point DFT in approximately 18 ns, and similarly the 4-by-4, 3-by-5, and 3-by-3 RSA circuits output 9-, 10-, and 12-point DFTs, respectively, every 18 ns. For the Polyphase filter circuit, the 1-by-8 array offers a throughput of up to 55 MSPS per filter, or 110 MSPS for the two filters. Simulation results for the PE cell are shown in Table 3, and those for the various RSA-based circuit implementations are shown in Tables 4 and 5.
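The figures quoted above follow directly from the 55 MHz clock. As a quick check, and assuming that each data sample shifted out of the PE corresponds to one multiply and accumulate operation:

$$
T_{\text{clk}} = \frac{1}{55\ \text{MHz}} \approx 18.2\ \text{ns}, \qquad
4 \times 55 \times 10^{6} = 220 \times 10^{6}\ \text{operations/s}, \qquad
2 \times 55\ \text{MSPS} = 110\ \text{MSPS}.
$$

The first value is the interval at which a parallel DFT result becomes available, the second is the peak per-PE rate, and the third is the aggregate rate of the two filters on the 1-by-8 array.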

#### 6. Flexibility and Performance Comparison of Reconfigurable Architectures

The RSA was designed to incorporate reconfigurability of the array’s PEs and SWs to enable the mapping and seamless interfacing of selected computationally intensive DSP functions such as those present in multicarrier baseband processors, that is, FIR filters and DFTs. The reconfigurability of the arithmetic processor functionality and of the signal routing within the array enables the RSA to be configured so that certain DSP functions can be computed in parallel or in sequential modes, thus providing a better match between the throughput and logic resource requirements. A row of 8 PEs could, for example, be configured as four 8-tap FIR filters or as one 32-tap FIR filter. Similarly, an 8-by-8 array of PEs could be configured to implement a 32-point DFT or, alternatively, a single row of 8 PEs could be configured to operate in sequential mode with the same functionality at 1/8th the throughput. The latter configuration is possible because no data index mapping is required by the DFT-configured RSA circuit; hence it is feasible to time share the PEs.
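The trade-off between parallel and sequential modes can be illustrated with a plain vector-matrix DFT in which a small pool of accumulators is reused over several passes. This is a minimal sketch under simplifying assumptions: it uses the unoptimized $N \times N$ DFT matrix rather than the reduced decomposition mapped on the RSA, and `dft_direct`, `dft_time_shared`, and `n_pe` are hypothetical names.

```python
import cmath


def dft_matrix(n):
    """Full n-by-n DFT coefficient matrix; n need not be a power of two."""
    return [[cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n)]
            for k in range(n)]


def dft_direct(x):
    """All outputs computed at once, one accumulator per output."""
    n = len(x)
    w = dft_matrix(n)
    return [sum(w[k][m] * x[m] for m in range(n)) for k in range(n)]


def dft_time_shared(x, n_pe):
    """Same product with only n_pe accumulators, reused pass after pass.

    Returns the output sequence and the number of passes needed.
    """
    n = len(x)
    w = dft_matrix(n)
    out, passes = [], 0
    for start in range(0, n, n_pe):
        acc = [0j] * min(n_pe, n - start)
        for m in range(n):                 # input samples stream past the PEs
            for p in range(len(acc)):      # each PE does one multiply-accumulate
                acc[p] += w[start + p][m] * x[m]
        out.extend(acc)
        passes += 1
    return out, passes
```

Because the input samples are used in natural order in every pass, fewer accumulators simply mean more passes, which mirrors the throughput reduction described above.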

In this section, the configuration flexibility and performance of the RSA and of representative configurable architectures reported in the literature [27–29] are considered. The features of interest are the circuit’s architecture, its ability to realize complex functions, its support of real time reconfigurability, its scalability, and its performance. All of the architectures considered support FIR filter implementations; however, not all support FFT or DFT circuits. The DRAW circuit does not support either DFTs or FFTs, and the CSoC supports only FFTs, while the RADComm circuit supports DFT implementations based on the Goertzel algorithm. As previously indicated, the FFT realization unduly restricts the dimensionality of the transform to powers of two. Furthermore, the CSoC circuit with 32 reconfigurable PE cells [29] requires a total of 100 clock cycles to compute a 64-point FFT. Thus, in terms of the ability to support both FIR filters and DFTs, only the RSA and RADComm circuits meet the previously identified requirements. However, the RADComm circuit was not designed for real time reconfiguration, as its control data and coefficients must be shifted in serially during start up and reconfiguration can only be initiated by a circuit reset. To address the issue of reconfigurability in real time, the RSA was designed to support the routing of configuration and coefficient data in real time.

#### 7. Summary

A reconfigurable systolic array architecture that supports computations of DSP functions commonly used in multicarrier and multirate applications has been presented. The architecture consists of a systolic array of identical processing elements and switches that can be configured to perform the arithmetic computations required by FIR filters, phase shifters, and DFT/IDFT transforms of arbitrary dimensionality. The RSA circuit is reconfigurable in real time, and configuration data can be updated without interrupting the circuit operation. The FPGA implementation of the PE circuit shows that it can perform up to 220 million multiply and accumulate operations per second.

The RSA, a regular structure with PE cells and switches connected to their immediate neighbors, can be expanded and scaled in a seamless fashion. The flexibility and configurability of the RSA architecture also make it suitable for embedded or system on chip implementations.

#### Appendices

#### A. DFT Computation for Even $N$

If $N$ is a multiple of four, (1) can be decomposed into a sum of products as follows:

where and the coefficients in (A.1) can be defined as follows:

Substituting the terms shown in (A.2) into (A.1) and simplifying, we obtain

where

From (A.3), the output sequences , and can be inferred as follows:

where .

For , the four outputs are expressed as follows:

If $N$ is even but not a multiple of four, the output sequence is derived from the decomposition of (1) using a similar approach as follows:

Upon substituting in (A.6) with the terms defined in (A.2), the expression becomes

The output sequence can be derived from (A.8) as follows:

where .

For , the two outputs can be obtained based on (A.6).

#### B. DFT Computation for Odd $N$

If $N$ is odd, the output sequence is derived from the decomposition of (1) as follows:

where

The output sequence can be derived from (B.1) as follows:

where .

For , the output can be determined based on (A.6).