Abstract

This paper presents the design and implementation of a 2K/4K/8K multiple mode FFT core for DVB-T/DVB-H receivers. The proposed core is based on a pipeline radix-$2^2$ SDF architecture. The necessary changes in the radix-$2^2$ SDF architecture to achieve an efficient FFT implementation are detailed. Quantization effects and timing design parameters are analyzed for DVB-T/DVB-H. Area and power results are provided for the proposed core.

1. Introduction

DVB-H adapts the successful DVB-T standard for digital terrestrial television to the specific requirements of mobile, handheld, and battery-powered receivers [1-3]. Both standards rely on an orthogonal frequency division multiplexing (OFDM) modulation scheme to achieve high data rates in multipath environments. OFDM uses an inverse fast Fourier transform (IFFT) to modulate the signal and a fast Fourier transform (FFT) to demodulate it. One of the main differences between the two standards is the number of points of the FFT: DVB-H adds a 4K mode to the two DVB-T modes (2K and 8K). This new mode is a tradeoff between mobile reception quality and network size.

As DVB-H is an extension of DVB-T, it is possible to introduce DVB-H services in the bandwidth of DVB-T. For example, one operator can offer two DVB-T services and one DVB-H service to its subscribers. Therefore, digital terrestrial television receivers should be able to receive both DVB-T and DVB-H signals. In these receivers, the FFT processor must be able to work in 2K/4K/8K multiple mode. Moreover, this module must have a high throughput in order to achieve the high data rates required by both standards. Some 2K/4K/8K pipelined FFT architectures have been proposed in the literature [4].

The algorithm and architecture for the FFT core should be chosen by trading off processing speed, area, and power. Monoprocessor architectures such as [5] have to be discarded, as they are not able to fulfill the timing specifications. The throughput can be increased by using either parallel [6, 7] or pipeline architectures [8-14]. Pipeline architectures present smaller latency and lower power consumption [8, 9], which makes them suitable for mobile devices such as DVB-T/DVB-H receivers.

Basically, two types of pipeline architectures can be distinguished: single-path delay feedback (SDF) architectures [10-12] and multipath delay commutator (MDC) architectures [9]. SDF architectures use registers more efficiently, since the outputs of the butterflies can be stored in shift registers. In MDC architectures, the input sequence is divided into several parallel lines that feed the butterfly, applying the appropriate delays to the data. SDF architectures use less memory than MDC. On the other hand, MDC architectures obtain a slightly higher throughput than SDF architectures. The optimal choice depends on the application [9]. In the case of DVB-T/DVB-H receivers, the SDF architecture can achieve the required throughput and needs less area than the MDC architecture.

In addition to the pipeline structure, the radix of the algorithm also influences the complexity of the implementation. A radix-2 algorithm requires more multiplications and achieves a lower throughput than a radix-4 algorithm. However, a radix-4 algorithm can only process FFTs whose number of points is a power of 4, and its butterfly is more complex.

In order to maintain the simplicity of the radix-2 butterfly, radix-$2^2$ (r22) algorithms have been proposed [11, 14]. The r22 algorithm is well suited for DVB-T/DVB-H applications since it can work both with a number of points that is a power of 4 and with a number of points that is a power of 2. However, the implementation of the r22 FFT in [11] must be modified to achieve multiple mode operation.

This paper presents a new 2K/4K/8K FFT core for DVB-T/DVB-H receivers. The r22-SDF pipeline architecture in [11] is adapted to achieve a multiple mode 2K/4K/8K FFT. The effect of the quantization errors of the FFT processor is studied, and design parameters are analyzed according to the DVB-T/DVB-H requirements.

This paper is organized as follows. Section 2 explains the modifications made to the r22-SDF algorithm so that it can operate in multiple mode. Section 3 describes the pipeline r22-SDF architecture proposed for a DVB-T/DVB-H receiver. Section 4 gives signal-to-noise ratio (SNR), area, timing, and power results of the proposed FFT core. Finally, Section 5 summarizes the conclusions of this work.

2. R22-SDF Algorithm

In this section, the r22-SDF algorithm presented in [11] is modified to support multiple mode operation.

Figures 1 to 3 show the flow graph of the radix-$2^2$ algorithm when the length of the FFT is 32, 16, and 8, respectively. In these flow graphs, the index of the input and output data is shown. The black circles represent butterflies, and the arrows are the twiddle factor multiplications, where the number in brackets next to each arrow is the index $m$ of the twiddle factor, $W_N^m$, of an FFT of length $N$.

Three different operations can be distinguished in each stage: BT1, which is the butterfly of type 1 defined in [11]; BT2, which is the butterfly of type 2 defined in [11]; and CM, which performs complex multiplications. We can observe that, for BT2, some input data of the butterflies are multiplied by $-j$. The multiplication by $-j$ can be implemented by exchanging the real and imaginary parts and changing the sign of the new imaginary part.
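As a minimal sketch of this multiplier-free rotation (the function name `mul_minus_j` is illustrative and does not come from [11]), note that $(a + jb)(-j) = b - ja$:

```python
def mul_minus_j(re, im):
    """Return (re + j*im) * (-j) using only a swap and one negation."""
    # (re + j*im) * (-j) = im - j*re, so no multiplier is required.
    return im, -re
```

For example, `mul_minus_j(3, 4)` gives `(4, -3)`, which matches the complex product `(3 + 4j) * (-1j)`.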

The twiddle factors of the CM at the kth stage, with are given by the vector where The index p is defined as where m is equal to

The r22 algorithms considerably reduce the number of arithmetic operations performed by the FFT. The only restriction of the r22-FFT algorithm is that the length of the FFT must be a power of 2. The last stage of the algorithm therefore differs according to the size of the FFT. If the number of points of the FFT is a power of 4, the last stage is composed of BT1 and BT2, as can be seen in Figure 2. However, if the number of points of the FFT is only a power of 2, the last stage is formed by BT1 alone, as can be observed in Figures 1 and 3.
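The stage structure described above can be sketched as a small software reference model. The code below follows the standard radix-$2^2$ decimation-in-frequency decomposition (BT1, BT2 with the $-j$ rotation, then CM) and is an illustration only: the function names are hypothetical, floating-point arithmetic is used instead of the fixed-point arithmetic of the core, and the twiddle ordering is the one obtained from the standard derivation rather than the vector notation of this paper. The input is in natural order and the output is in bit-reversed order.

```python
import cmath

def bit_reverse(k, bits):
    """Reverse the `bits` least significant bits of k."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (k & 1)
        k >>= 1
    return r

def r22_fft(x):
    """Radix-2^2 DIF FFT: natural-order input, bit-reversed output."""
    n = len(x)
    assert n >= 1 and n & (n - 1) == 0, "length must be a power of 2"
    data = [complex(v) for v in x]
    m = n                                   # current sub-FFT length
    while m >= 4:
        half, quarter = m // 2, m // 4
        for base in range(0, n, m):
            blk = data[base:base + m]
            for i in range(half):           # BT1
                u, v = blk[i], blk[i + half]
                blk[i], blk[i + half] = u + v, u - v
            for i in range(quarter):        # BT2
                u, v = blk[i], blk[i + quarter]
                blk[i], blk[i + quarter] = u + v, u - v
                u, v = blk[half + i], blk[half + quarter + i]
                v = complex(v.imag, -v.real)        # multiply by -j
                blk[half + i], blk[half + quarter + i] = u + v, u - v
            for i in range(quarter):        # CM: twiddle multiplications
                w = cmath.exp(-2j * cmath.pi * i / m)
                blk[quarter + i] *= w * w
                blk[half + i] *= w
                blk[half + quarter + i] *= w * w * w
            data[base:base + m] = blk
        m //= 4
    if m == 2:                              # power of 2 only: final BT1
        for base in range(0, n, 2):
            u, v = data[base], data[base + 1]
            data[base], data[base + 1] = u + v, u - v
    return data
```

With this model, `r22_fft(x)[bit_reverse(k, bits)]` holds the DFT bin $k$ of `x`, where `bits` is $\log_2$ of the length. For lengths 8 and 32 the final pass executes only BT1, whereas for length 16 the last stage contains both BT1 and BT2.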

2.1. Multiple Mode Operation

In this section, the multiple mode operation of the r22-SDF algorithm is derived for FFTs of $N$, $N/2$, and $N/4$ points, where $a$ represents the number of stages of an FFT of $N$ points. The resources needed to process the FFT of the largest number of points, $N$, will be implemented.

An FFT of $N/4$ points can be easily obtained from an $N$-point FFT. The first stage of the $N$-point FFT does not need to be processed in order to calculate the $N/4$-point FFT. Additionally, in the following stages of the $N$-point FFT, the twiddle factors are the same as the ones needed for the $N/4$-point FFT, as $W_N^{4m} = W_{N/4}^{m}$. This multiple mode implementation can be deduced by analyzing the flow graphs of a 32-point FFT and an 8-point FFT in Figures 1 and 3, respectively, where we have considered $N = 32$.
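The twiddle factor equivalence that this mode relies on can be checked directly. The short derivation below assumes the usual definition $W_N^m = e^{-j2\pi m/N}$ and the pairing of an $N$-point FFT with an $N/4$-point FFT suggested by the 32-point and 8-point flow graphs:

```latex
\begin{align}
W_N^{4m} &= e^{-j 2\pi (4m)/N} \\
         &= e^{-j 2\pi m/(N/4)} \\
         &= W_{N/4}^{m}
\end{align}
```

Hence a ROM filled with powers of $W_N$ already contains every twiddle factor of the shorter transform, at addresses that are multiples of 4 in terms of the exponent.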

Using the resources of an FFT of length $N$, an $N/2$-point FFT can be obtained. In the latter FFT, $a-1$ stages are needed. The first $a-1$ stages of the $N$-point FFT can be reused to process the $N/2$-point FFT, if only the operations in the even positions are carried out. Thus, half the operations are done in each BT1, BT2, and CM. Moreover, the twiddle factors of the CM in stage $a-2$ (the last of these stages, counting from stage 0) of the $N/2$-point FFT are always one. Thus, the operations of the last CM can be omitted, and the final stage of the FFT will only contain a BT1 and a BT2. This multiple mode implementation can be easily concluded by inspection of Figures 1 and 2, taking into account that $N = 32$.
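The reuse of twiddle factors in the shorter modes can be illustrated with integer exponents only. The sketch below assumes the standard radix-$2^2$ ordering of the CM inputs (four quarters with exponent multipliers 0, 2, 1, 3) and 0-based stage numbering; `stage_exponents` is a hypothetical helper, and each returned value $p$ stands for the twiddle factor $W_n^p$.

```python
def stage_exponents(n, k):
    """Exponents p (twiddle = W_n^p) seen by CM in stage k of an n-point FFT."""
    m = n // 4 ** k                 # sub-FFT length entering stage k
    quarter = m // 4
    exps = []
    for mult in (0, 2, 1, 3):       # order in which the quarters leave BT2
        exps += [4 ** k * mult * i for i in range(quarter)]
    return exps

# Shorter FFT skipping the first stage: stage k+1 of the 32-point FFT
# uses the twiddles of stage k of the 8-point FFT (W_32^{4p} = W_8^p).
quarter_mode_ok = stage_exponents(32, 1) == [4 * p for p in stage_exponents(8, 0)]

# Half-length FFT: the even ROM addresses of stage k of the 32-point FFT
# give the twiddles of stage k of the 16-point FFT (W_32^{2p} = W_16^p).
half_mode_ok = stage_exponents(32, 0)[::2] == [2 * p for p in stage_exponents(16, 0)]

# Last reused stage in the half-length mode: every even address holds W^0 = 1.
last_cm_trivial = all(p == 0 for p in stage_exponents(32, 1)[::2])
```

All three flags evaluate to `True` for the 32/16/8-point example of Figures 1 to 3, which is consistent with omitting the last CM when the half-length FFT is computed.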

3. Pipeline R22-SDF Architecture in Multiple Mode

The proposed FFT core receives the input data DATA_IN in natural order, and it generates the output DATA_OUT in bit-reversed order. This is not a problem as the reordering can be performed by subsequent modules of the DVB-T/DVB-H receiver (e.g., deinterleaver) with no additional cost. Input data arrives at clock rate. All the data needed to compute each FFT arrives as a block. In order to ease the integration of the FFT core within a DVB-T/DVB-H receiver, a validation signal, DATA_IN_VALID, will be set high during the arrival of valid data at the input of the FFT core. Similarly, when valid output data are ready at the output of the FFT core, a validation signal DATA_OUT_VALID is set high.

The FFT processor employs fixed-point arithmetic. Input and output data are represented using dbw bits. The twiddle factors have been quantized with tbw bits. The core scales data appropriately during internal operations to avoid overflow.

In the following, the basic building blocks of the r22-SDF architecture [11] are described and their implementation detailed. Then, the required modifications to achieve a multiple mode 2K/4K/8K FFT are explained. The section finishes with a summary of the main features of the proposed multiple mode FFT core.

3.1. Basic Building Blocks

The FFT processor is a distributed system where every module generates the control signals for the next module in the pipe, as shown in Figure 4. Shadowed lines represent data, and white lines represent control signals. The FFT core has $\lceil \log_4 N \rceil$ stages, where $N$ is the number of FFT points. A typical stage of the architecture consists of three processing elements: B1, B2, and CM; and three memory elements: ROM, FIFO1, and FIFO2. B1 and B2 carry out the processing of the two types of butterflies of the r22 algorithm (BT1 and BT2). FIFO1 and FIFO2 are used to achieve the required data shuffling for proper operation of the butterflies. In stage $k$, the depth of FIFO1 is $N/2^{2k+1}$, and the depth of FIFO2 is $N/2^{2k+2}$. CM computes the complex multiplications between the outputs of B2 and the twiddle factors stored in the corresponding ROM memory. The last stage of the FFT processor does not need the complex twiddle factor multiplication.

As explained before, the r22-SDF architecture can work with a number of points that is a power of 4 and with a number of points that is only a power of 2. When N is just a power of 2, in the last stage, data are only processed by B1.
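The stage count and FIFO sizes can be tabulated with a few lines of code. This is a sketch under the usual SDF assumption that, in stage $k$ (counting from 0), FIFO1 holds $N/2^{2k+1}$ words and FIFO2 holds $N/2^{2k+2}$ words; `stage_plan` is a hypothetical helper name.

```python
def stage_plan(n):
    """Return (number_of_stages, [(fifo1_depth, fifo2_depth), ...]) for n points."""
    bits = n.bit_length() - 1        # log2(n); n is assumed to be a power of 2
    stages = (bits + 1) // 2         # ceil(log4(n))
    plan = []
    for k in range(stages):
        fifo1 = n >> (2 * k + 1)
        fifo2 = n >> (2 * k + 2)     # 0 means the stage has B1 only (no FIFO2)
        plan.append((fifo1, fifo2))
    return stages, plan

stages_8k, plan_8k = stage_plan(8192)
total_words_8k = sum(f1 + f2 for f1, f2 in plan_8k)
```

For $N = 8192$ this yields 7 stages, a first stage with FIFO depths of 4096 and 2048 words, and a last stage whose FIFO2 depth is 0, that is, a stage processed by B1 only. The FIFO depths add up to $N - 1 = 8191$ words.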

3.1.1. Module B1

Figure 5 shows the structure of B1. The DATA_IN input port comes from the previous component in the pipe, normally a CM module. The DATA_OUT output port is connected to the next component in the pipe, usually a B2 module. The FIFO_IN and FIFO_OUT ports connect B1 with FIFO1. The size of the two counters of B1 is , where k is the stage. The implementation of BF1, the butterfly of type 1, is detailed in Figure 6.

Initially, the multiplexers MUX are in position 0, and FIFO1 is empty. During the arrival of the first data, FIFO1 is filled. Then, the multiplexers change to position 1, and the butterfly operations are performed using the input data at port DATA_IN and the data stored in FIFO1. One of the butterfly outputs, X1, is sent out, whereas the other one, X2, is stored in FIFO1. After a number of cycles equal to the depth of FIFO1, the multiplexers switch back to position 0. Data for the next computation are stored in FIFO1, and the results X2 of the previous butterfly operations are sent out.

The selection signal of the multiplexers, MUX_CTRL, is generated by the input counter. This counter increments its value when DATA_IN_VALID is high. When the input counter reaches half of its count, DATA_OUT_VALID is set high, and the output counter is started. The output counter counts until FIFO1 is emptied of valid output data. When the output counter finishes its count, DATA_OUT_VALID is set to zero.
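The fill, compute, and drain sequence of B1 can be mimicked with a short behavioral model. The sketch below is not a description of the actual hardware: `sdf_butterfly` is a hypothetical function, the FIFO is a Python `deque`, the multiplexer position is derived from a simple sample counter, and extra zero samples are appended to flush the last X2 results.

```python
from collections import deque

def sdf_butterfly(samples, depth):
    """Behavioral model of one single-path delay feedback butterfly."""
    samples = list(samples)
    assert len(samples) % (2 * depth) == 0, "whole blocks of 2*depth samples"
    fifo = deque([None] * depth)           # None marks not-yet-valid entries
    out = []
    for count, x in enumerate(samples + [0] * depth):   # zeros flush the FIFO
        mux_ctrl = (count // depth) % 2    # 0: fill/drain, 1: butterfly
        stored = fifo.popleft()
        if mux_ctrl == 0:
            fifo.append(x)                 # store data for the next butterfly
            if stored is not None:
                out.append(stored)         # send out X2 of the previous block
        else:
            out.append(stored + x)         # X1 is sent out
            fifo.append(stored - x)        # X2 goes back into the FIFO
    return out
```

For an 8-sample block `0, 1, ..., 7` and a FIFO depth of 4, the model returns the four sums `4, 6, 8, 10` followed by the four differences `-4, -4, -4, -4`, which is the order in which BT1 results leave the module.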

3.1.2. Module B2

Figure 7 shows an internal diagram of component B2, which is formed by a butterfly of type 2, BF2, and some control logic. The DATA_IN input port of B2 comes from the previous component in the pipe, a B1 module. The DATA_OUT output port is connected to the next component in the pipe, usually a CM module. The FIFO_IN and FIFO_OUT ports are connected to FIFO2.

Figure 8 details the implementation of BF2. Its structure is similar to that of BF1. FIFO2 is filled with the first data. Then, the multiplexers MUX change to position 1, and the input data at port DATA_IN and the data stored in FIFO2 are used to perform the required butterfly operations. The butterfly output X1 is sent out, and X2 is stored in FIFO2. After a number of cycles equal to the depth of FIFO2, the multiplexers MUX switch back to position 0. Data for the next computation are stored in FIFO2, and the results X2 of the previous butterfly operations are sent out.

The multiplexers MUX J are used to handle efficiently the multiplications by $-j$ needed in a butterfly of type 2. Whenever a multiplication by $-j$ must be carried out, the multiplexers MUX J are set to position 1. The signal MINUS_J_CTRL controls the behavior of the multiplexers MUX J.

The input counter generates the signals that control the multiplexers MUX and MUX J. This counter increments its value when DATA_IN_VALID is high. Its size is . The second most significant bit of this counter’s value is used to generate MUX_CTRL. When the input counter is making the last quarter of its count, MINUS_J_CTRL is set to high.

When the input counter arrives to a quarter of its count, DATA_OUT_VALID is set to high, and the output counter starts to count. The output counter will count until FIFO2 is emptied of valid output data. When the output counter finishes its count, DATA_OUT_VALID is set to zero. The size of the output counter is .
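The decoding of the B2 control signals from the input counter can be sketched as follows. The counter width `bits` is a free parameter of this illustration (the model assumes the counter spans one full block entering the stage), and `b2_control` is a hypothetical name.

```python
def b2_control(count, bits):
    """Return (MUX_CTRL, MINUS_J_CTRL) for an input counter value of `bits` bits."""
    assert bits >= 2 and 0 <= count < (1 << bits)
    quarter = count >> (bits - 2)        # two most significant bits: 0..3
    mux_ctrl = quarter & 1               # second most significant bit
    minus_j_ctrl = int(quarter == 3)     # high during the last quarter
    return mux_ctrl, minus_j_ctrl

# Control sequence for a 3-bit counter (one block of 8 samples).
sequence = [b2_control(c, 3) for c in range(8)]
```

In this model the multiplexers MUX are in position 1 during the second and fourth quarters of the count, and the rotation by $-j$ is applied only during the fourth quarter, when the second half of the B1 output is being combined.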

3.1.3. Module CM

The internal structure of CM is shown in Figure 9. This component carries out the twiddle factor multiplications. A one clock cycle complex multiplier has been implemented to perform the complex multiplications between the input data and the twiddle factors. The twiddle factors are read from a synchronous ROM. A counter of size , addr counter, is used to generate the ROM addresses appropriately. A flip-flop is used to synchronize the DATA_OUT_VALID signal with DATA_OUT.
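A fixed-point complex multiplication of this kind can be sketched in software. The quantization and scaling choices below are assumptions made for illustration (twiddle factors scaled by $2^{tbw-1} - 1$ and products truncated by an arithmetic right shift); the paper does not specify the rounding or scaling used inside CM, and the function names are hypothetical.

```python
import math

def quantize_twiddle(m, n, tbw):
    """Quantize W_n^m = exp(-j*2*pi*m/n) to signed integers of tbw bits."""
    scale = (1 << (tbw - 1)) - 1
    angle = -2.0 * math.pi * m / n
    return round(math.cos(angle) * scale), round(math.sin(angle) * scale)

def complex_multiply_fixed(d_re, d_im, w_re, w_im, tbw):
    """Multiply integer data by a quantized twiddle and rescale the result."""
    shift = tbw - 1
    re = (d_re * w_re - d_im * w_im) >> shift    # truncating right shift
    im = (d_re * w_im + d_im * w_re) >> shift
    return re, im

w_re, w_im = quantize_twiddle(2, 8, 11)          # W_8^2 = -j
rotated = complex_multiply_fixed(1000, 0, w_re, w_im, 11)
```

With `tbw = 11`, the twiddle factor $W_8^2 = -j$ is stored as `(0, -1023)`, and the product of `1000 + 0j` with it differs from the ideal result `0 - 1000j` by at most one least significant bit.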

3.2. Multiple Mode Operation

In order to accommodate the 2K/4K/8K multiple mode, some extra elements are needed in the r22-SDF architecture. The proposed architecture is depicted in Figure 10. As can be seen, the resources needed to process the FFT of the largest number of points, $N = 8192$, have been implemented. Thus, there are 7 stages, and the twiddle factors have been calculated for $N = 8192$.

When an 8K point FFT is to be calculated, the multiplexers shown in Figure 10 are configured so that the core works as described above. Multiplexers M4K and M2K are in position 0. When multiplexers are in position 0, they select the signal connected to the upper port.

In order to calculate a 2K point FFT, the first stage of the 8K FFT, stage 0 in Figure 10, does not have to be processed. Thus, multiplexers M2K are set to position 1 to bypass stage 0. Multiplexers M4K remain in position 0.

For the 4K-point FFT, six complete stages (with both types of butterflies) are needed. In order to reuse the existing hardware, B1 of stage $k$ uses FIFO2 of stage $k$, and B2 of stage $k$ uses FIFO1 of stage $k+1$. Additionally, CM of stages 5 and 6 is bypassed. The FIFO reassignment is achieved by setting M4K to position 1 and M2K to position 0. In each stage, CM needs half the twiddle factors of the 8K-point FFT: those with an even address in the ROM memory. A control signal configures CM for proper operation according to the number of points of the FFT to be calculated.
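The multiplexer settings and the FIFO reuse of the three modes can be cross-checked numerically. The sketch below assumes the FIFO depths of a standard r22-SDF pipeline (FIFO1 of stage $k$ holds $N/2^{2k+1}$ words, FIFO2 holds $N/2^{2k+2}$ words); the dictionary and function names are hypothetical.

```python
N_MAX = 8192

# Multiplexer positions per mode, as described for Figure 10.
MODE_CONFIG = {
    "8K": {"M2K": 0, "M4K": 0},
    "4K": {"M2K": 0, "M4K": 1},
    "2K": {"M2K": 1, "M4K": 0},
}

def fifo_depths(n, k):
    """(FIFO1, FIFO2) depths needed by stage k of a stand-alone n-point core."""
    return n >> (2 * k + 1), n >> (2 * k + 2)

def fifo_reuse_4k_ok():
    """4K mode: B1 of stage k uses FIFO2 of stage k, B2 uses FIFO1 of stage k+1."""
    for k in range(6):
        need_b1, need_b2 = fifo_depths(4096, k)
        have_b1 = fifo_depths(N_MAX, k)[1]
        have_b2 = fifo_depths(N_MAX, k + 1)[0]
        if (need_b1, need_b2) != (have_b1, have_b2):
            return False
    return True

def fifo_reuse_2k_ok():
    """2K mode: stage 0 is bypassed, so stage k maps onto hardware stage k+1."""
    return all(fifo_depths(2048, k) == fifo_depths(N_MAX, k + 1) for k in range(6))
```

Both checks return `True`, which is consistent with the statement that the multiple mode core needs no more memory than a single mode 8K FFT.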

3.3. Features of the Proposed FFT Core

Table 1 summarizes the memory requirements of the core. The table shows the total number of memory bits used in the FIFOs and in the twiddle factor ROMs. In order to achieve more compact memories, the real and imaginary parts of each complex number are stored in the higher and lower part of the same memory position. Table 2 is a summary of the arithmetic operators needed in the core. It can be noted that the memory and arithmetic modules needed in the proposed multiple mode architecture are the same as those needed in a single mode 8K FFT.

Once the clock frequency has been selected, the processing time of the FFT module can be determined using

4. Results

4.1. Analysis of the Signal-to-Noise Ratio of the FFT

For an appropriate operation of the receiver, the degradation introduced in the signal due to the fixed-point computation of the FFT must be controlled. Monte Carlo simulations have been carried out comparing a fixed-point model of the proposed architecture with a floating-point FFT. The signal-to-noise-ratio (SNR) has been used to measure the degradation.

Figure 11 shows the SNR for the 2K/4K/8K FFT when the data are quantized and the twiddle factors are left in floating point. It can be observed that both the data bitwidth and the number of points of the FFT influence the SNR. As N increases, the number of arithmetic operations grows and, thus, the SNR decreases. The SNR increases by 6 dB per additional bit of dbw.
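The SNR metric and the 6 dB per bit trend can be illustrated with a toy experiment. The sketch below is not the Monte Carlo model of the proposed architecture: it only quantizes a uniformly distributed test signal to `dbw` bits and compares it with the unquantized reference. All function names and the choice of test signal are assumptions of this example.

```python
import math
import random

def snr_db(reference, measured):
    """SNR in dB of `measured` with respect to `reference`."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - m) ** 2 for r, m in zip(reference, measured))
    return 10.0 * math.log10(signal / noise)

def quantize(value, dbw):
    """Round a value in [-1, 1) to a grid with dbw bits (one sign bit)."""
    step = 2.0 ** -(dbw - 1)
    return round(value / step) * step

def snr_for_bitwidth(dbw, samples=4096, seed=1):
    rng = random.Random(seed)              # fixed seed: repeatable experiment
    ref = [rng.uniform(-1.0, 1.0) for _ in range(samples)]
    return snr_db(ref, [quantize(v, dbw) for v in ref])

gain_per_bit = snr_for_bitwidth(11) - snr_for_bitwidth(10)
```

In this experiment `gain_per_bit` is close to 6 dB, as expected for uniform quantization noise. The absolute SNR of the FFT core is lower than that of this toy signal, since noise accumulates over the arithmetic operations of every stage.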

The effect of the quantization of the twiddle factors on the SNR has been studied for 2K (Figure 12(a)), 4K (Figure 12(b)), and 8K (Figure 12(c)) FFTs. These figures show the SNR for different values of dbw and tbw. It can be seen that, for a given value of dbw and of N, increasing tbw above a certain value does not improve the performance.

Figures 11 and 12 can help the designer in the selection of dbw and tbw. In a multiple mode FFT processor, dbw and tbw must be selected for the maximum number of points of the FFT (8K in a DVB-T/DVB-H receiver). An SNR of at least 40 dB is sufficient for terrestrial TV broadcasting [15]. In order to guarantee an SNR of 40 dB for the 8K FFT, dbw = 16 and tbw = 11 are needed.

4.2. Analysis of the Timing of the FFT for DVB-T/DVB-H

For low-power applications, such as a DVB-T/DVB-H receiver, a low clock frequency is preferable, for example, one equal to the sample rate of the FFT. For DVB-T/DVB-H, the minimum sample rate at the FFT can be 48/7 MHz for a 6 MHz channel, 8 MHz for a 7 MHz channel, and 64/7 MHz for an 8 MHz channel [2]. However, the FFT processor must be able to compute the FFT within the duration of an OFDM symbol plus the duration of the guard interval.

Table 3 presents the maximum allowed time to compute the FFT in DVB-T/DVB-H and the processing time of the proposed FFT core. Results are given for the different channel bandwidths (6 MHz, 7 MHz, and 8 MHz) and for the three lengths of the FFT (2K, 4K, and 8K). The requirement given in the table considers the worst-case scenario: a guard interval with a duration of 1/32 of the OFDM symbol period. The processing time of the FFT core has been calculated for a clock frequency equal to the corresponding sampling frequency fs. It can be observed that the proposed FFT core is able to meet the timing requirements in all cases.
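The timing requirement can be reproduced from the sample rates quoted above. The sketch below computes only the maximum allowed time, that is, the useful OFDM symbol duration plus the guard interval; it does not model the processing time of the core. The names `SAMPLE_RATE_MHZ` and `max_fft_time_us` are hypothetical.

```python
from fractions import Fraction

# Sample rate at the FFT in MHz for each channel bandwidth in MHz.
SAMPLE_RATE_MHZ = {6: Fraction(48, 7), 7: Fraction(8), 8: Fraction(64, 7)}

def max_fft_time_us(n_points, channel_mhz, guard=Fraction(1, 32)):
    """Useful symbol duration plus guard interval, in microseconds."""
    useful = Fraction(n_points) / SAMPLE_RATE_MHZ[channel_mhz]
    return useful * (1 + guard)

worst_case_8k_8mhz = max_fft_time_us(8192, 8)     # guard interval of 1/32
```

For the 8K mode in an 8 MHz channel, the useful symbol lasts 896 µs, so the worst-case requirement with a guard interval of 1/32 is 924 µs. For the 2K mode in the same channel the requirement is 231 µs.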

4.3. Area and Timing Comparison

Table 4 compares the proposed core with reported FFTs that could be used within DVB-T/DVB-H applications. The FFTs presented in [6, 10] have been designed for DVB-T, and they do not implement the 4K mode. The work in [7] only provides results for 8K.

In [4], a 2K/4K/8K FFT architecture is proposed. The twiddle factor multiplication is carried out using a CORDIC, and no ROM is needed for the twiddle factors. The CORDIC carries out 17 iterations to guarantee good performance. For comparison, we will relate the number of iterations to the precision of the twiddle factors. Following [16], we can estimate the precision in the rotated angle as , where represents the number of iterations. A quantization error in the twiddle factors can be seen as an error in rotating an angle. It can be shown that max . Thus, we can perform the following approximation:

The area and timing results shown in Table 4 for the proposed multiple mode FFT core are given for the selected dbw and tbw. They have been obtained using the 0.35 μm XFAB 4-ML technology. The area value given for the proposed FFT processor is an estimation of the core area after layout, making the assumption that the layout area is twice the cell area. For a fairer comparison, the area of [6, 7] has been normalized to 0.35 μm using the same approach as [7]. In [4], the number of equivalent gates is provided. The value given in Table 4 has been estimated for a 0.35 μm technology using that number.

Table 4 shows the parameter AT as well. AT is the product of the area and the processing time. This parameter can be used to assess the efficiency of different cores. The table shows that the proposed core is the most efficient.

In addition, Table 5 presents a comparison of the computational complexity of [6], a pipeline-SDF r2 FFT design, and our proposal. As can be observed, our design requires fewer multipliers and adders. Thus, our FFT core presents a more efficient implementation.

4.4. Area and Timing Results for FPGA Implementation

The FFT core has been prototyped on a Virtex 2V6000FF1517 FPGA. Table 6 presents a comparison of our core with other FFT cores for DVB-T in the literature. Only those proposals that give data about an FPGA implementation have been considered in the comparison. The table shows the working clock frequency, the total number of occupied slices, the necessary block RAMs, and the number of multipliers. One can observe that our FFT core presents the most efficient implementation for an FPGA.

4.5. Area, Timing, and Power Results for ASIC Implementation

The layout of the FFT core, with dbw = 16 and tbw = 11, has been carried out for the 0.35 μm AMS 3-ML technology. The FIFOs of stages 0 to 3 have been implemented using single port RAMs and some additional control logic, whereas the FIFOs of stages 4 to 6 have been implemented using standard cells. A detailed summary of the features of the proposed FFT core is presented in Table 7. The power consumption of the proposed FFT processor has been calculated at synthesis level. The switching activity has been extracted from simulations operating in the 8K mode. A chip photo of the layout of the FFT core for 0.35 μm AMS 3-ML technology is shown in Figure 13.

To sum up, our core simplifies the twiddle factors and, thus, reduces the number of multiplications. Therefore, our FFT proposal for a DVB-T/DVB-H system results in a more efficient ASIC and FPGA implementation than the proposals found in the literature.

5. Conclusion

An FFT core for DVB-T/DVB-H receivers has been designed and implemented. The core implements a pipeline r22-SDF architecture. This architecture can be adapted to achieve an efficient 2K/4K/8K multiple mode FFT processor. The extra hardware needed for multiple mode operation is minimal. In order to guarantee an SNR of 40 dB in all modes of operation, 16 bits and 11 bits are needed for the data bitwidth and the twiddle factor bitwidth, respectively. The architecture of the proposed FFT processor makes it possible to meet the FFT processing time requirements of DVB-T/DVB-H while working at the lowest possible clock frequency. The proposed core is an efficient implementation well suited for DVB-T/DVB-H receivers.

Acknowledgments

This research is supported in part by the Ministerio de Industria, Turismo y Comercio Grant no. FIT330100-2006-43 and by the Basque Government. A. Cortés holds the Torres Quevedo Grant no. PTQ05-02-02455, which was awarded by the Spanish Ministry of Education and Science, by the European Regional Development Fund and by the European Social Fund.