This paper presents a flexible interleaver architecture supporting multiple standards such as WLAN, WiMAX, HSPA+, 3GPP-LTE, and DVB. Algorithmic-level optimizations such as 2D transformation and recursive computation are applied, which prove key to achieving efficient hardware multiplexing among the different interleaver implementations. The presented hardware enables the mapping of the vital types of interleavers, including multiple block interleavers and convolutional interleavers, onto a single architecture. By exploiting the hardware reuse methodology, the silicon cost is reduced: the fully reconfigurable architecture consumes 0.126 mm² in total in a 65 nm CMOS process. It operates at a frequency of 166 MHz, providing a maximum throughput of up to 664 Mbps for multistream systems and 166 Mbps for single-stream communication systems. One of the vital requirements for multimode operation is fast switching between different standards, which this hardware supports with minimal cycle cost overhead. Maximum flexibility and fast run-time switchability among multiple standards make the proposed architecture a strong candidate for radio baseband processing platforms.

1. Introduction

High-performance wireless communication systems have grown drastically over the last few years. Due to rapid advancements and changes in radio communication systems, there is always a need for flexible and general purpose solutions for processing the data. Such a solution must not only accommodate the variants within a particular standard but also cover a range of standards to enable a true multimode environment. Symbol processing is usually done in baseband processors, and a fully flexible and programmable baseband processor [1–3] provides a platform for true multimode communication. To handle fast transitions between different standards, this type of platform is needed in mobile devices and especially in base stations. Other than symbol processing, one of the challenging areas is the provision of flexible subsystems for forward error correction (FEC). FEC subsystems can further be divided into two categories: channel coding/decoding and interleaving/deinterleaving. Among these, interleavers and deinterleavers appear to be more silicon consuming due to the cost of the permutation tables used in conventional approaches. For devices with multistandard support, the silicon cost of the permutation tables can grow much higher, resulting in an inefficient solution. Therefore, hardware reuse among different interleaver modules to support a multimode processing platform is of significance. This paper presents a flexible and low-cost hardware interleaver architecture which covers a range of interleavers adopted in different communication standards such as HSPA Evolution (HSPA+) [4], 3GPP-LTE [5], WiMAX; IEEE 802.16e [6], WLAN; IEEE 802.11a/b/g [7], IEEE 802.11n [8], and DVB-T/H [9].

Interleaving plays a vital role in improving the performance of FEC in terms of bit error rate. The primary function of the interleaver is to improve the distance properties of the coding schemes and to disperse the sequence of bits in a bit stream so as to minimize the effect of burst errors introduced in transmission [10, 11]. The main categories of interleavers are block interleavers and convolutional interleavers. In block interleavers the data are written row-wise into a memory configured as a row-column matrix and then read column-wise after applying certain intra-row and inter-row permutations. They are usually specified in the form of a row-column matrix with row and/or column permutations given in tabular form; however, they can also be specified by a modulo function with more complex subfunctions defining the permutation patterns. On the other hand, convolutional interleavers use multiple first-in first-out (FIFO) cells with different widths and depths. They are defined mainly by two parameters: the depth of the memory cells and the number of branches.
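As a toy illustration of the block-interleaver category, a hypothetical 3 × 4 matrix with no extra row or column permutations can be modeled as follows (the function name and dimensions are ours, chosen for illustration only):

```python
def block_interleave(bits, rows, cols):
    """Write `bits` row-wise into a rows x cols matrix, read column-wise."""
    assert len(bits) == rows * cols
    # Element (r, c) sits at linear write position r*cols + c;
    # reading column by column yields the interleaved order.
    return [bits[r * cols + c] for c in range(cols) for r in range(rows)]

data = list(range(12))
print(block_interleave(data, 3, 4))
```

A burst of adjacent errors in the interleaved stream is thereby spread across different rows of the original order, which is exactly the dispersal property described above.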

Looking at the range of interleavers used in different standards (Table 1), it seems difficult to converge to a single architecture; however, the fact that multimode coverage does not require multiple interleavers to work at the same time provides the opportunity for hardware multiplexing. Multimode functionality is then achieved by fast switching between standards. This work merges the functionality of different types of interleavers into a single architecture to demonstrate a way to reuse hardware across a variety of interleavers with different structural properties. The method in general is the hardware multiplexing technique well presented in [12]. It starts with analyzing and profiling multiple implementation flows, then identifies opportunities for hardware multiplexing, and eventually fine-tunes the microarchitecture, using minimal hardware and maximal reuse across functions.

This paper is organized as follows. Section 2 presents previous work on interleaver algorithm implementations. The challenges involved in covering the wide range of standards are discussed in Section 3, which also presents a shared data flow and the hardware cost associated with different implementations. Section 4 provides a detailed explanation of the unified interleaver architecture and its subblocks. A brief explanation of the algorithmic transformations and optimizations used for efficient mapping onto the single architecture is given in Section 5 with selected example cases. The usage of the proposed architecture when integrated into a baseband system is explained in Section 6. Section 7 provides the VLSI implementation results and a comparison with other work, followed by the conclusion in Section 8.

2. Previous Work

A variety of interleaver implementations with different structural properties have been addressed in the literature, with the main focus on low cost and high throughput. Most of the work covers one or two interleaver implementations, which is not sufficient for true multimode operation. The design of interleaver architectures for the turbo code internal interleaver has been addressed in [13–17], some of which target very low-cost solutions. A recent work [18] provides a good unified design for different standards; however, it covers only the turbo code interleavers and does not meet the complete baseband processing requirements demanding an all-in-one solution. The work in [19–22] covers DVB-related interleaver implementations. The works in [23–27] focus on more than one interleaver implementation, with reconfigurability for multiple variants of wireless LAN and DVB. High-throughput interleaver architectures for emerging wireless communications based on MIMO-OFDM techniques have been addressed in [25, 27]. These techniques require multiple streams to be processed in parallel, thus requiring parallel address generation and memory architecture, as shown in Figure 1.

Some commercial solutions [28–30] from major FPGA vendors are also available for general purpose use. The available literature reveals that they do not compute the row or column permutations on the fly; instead, they take row or column permutation tables in the form of a configuration file as input and use them to generate the final interleaved address. In this way, the complexity of on-the-fly computation of permutation patterns is avoided, but extra memory is needed to store the permutation patterns. As these implementations target FPGAs only, they can also exploit the readily available dual-port block RAM, which is not a good choice for chip implementations.

3. Shared Data Flow and Algorithm Analysis

The motivation of this research is to explore an all-in-one reconfigurable architecture which can help meet the fast time-to-market requirements of industry and customers. A summary of the targeted, widely used interleaver implementations is provided in Table 1. The breadth of the interleaving algorithms gives rise to many challenges when considering a true multimode interleaver implementation. The main challenges are as follows:
(i) on-the-fly computation of permutation patterns,
(ii) wide range of interleaving block sizes,
(iii) wide range of algorithms,
(iv) fast switching between different standards,
(v) sufficient throughput for high-speed communications,
(vi) maximum standard coverage,
(vii) acceptable silicon cost and power consumption.

Exploring the similarities between different interleaving algorithms yields the shared data flow shown in Figure 2, which is common to the interleaver types summarized in Table 1. Many of the interleaver algorithms, for example, [4, 6–9], need some preprocessing before the actual interleaving starts. Therefore the whole data flow is divided into two phases: the precomputation phase shown in Figure 2(a) and the execution phase shown in Figure 2(b). There are many minor differences in both phases across the different types of interleavers; however, one of the main differences stems from the interleaver type, that is, block interleaver or convolutional interleaver. Besides the differences in address calculation, a major difference between the two categories is the memory access mechanism. In the case of a block interleaver, memory writes and reads occur in separate phases, but a convolutional interleaver needs to write and read at the same time. This would demand a dual-port memory; however, it is handled here by dividing the memories and introducing a delay in the read path. To gauge the cost saving achieved by the hardware-multiplexed architecture with shared data flow, each of the algorithms was also implemented separately after applying appropriate algorithmic transformations. Compared with the hardware cost of the individual implementations given in Table 1, the proposed hardware-multiplexed architecture based on the shared data flow provides about 3 times lower silicon cost for address generation and about 5 times lower silicon cost for data memory in shared mode. Going through all the interleaver implementations given in Table 1, the hardware requirements for computing elements and memory are summarized in Table 2. Looking at the modulo computation requirements, the adder appears to be the common computing element for all kinds of implementations. Further observation reveals that the adder is mostly followed by selection logic.
Therefore, a common computing cell, shown in Figure 3, is used to cover all the cases. Table 2 shows that the computational part of the reconfigurable implementation can be restricted to 8 additions, 1 multiplication, and a comparator.
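The add-then-select idiom behind such a cell can be sketched behaviorally as follows (this is our sketch, not the exact cell of Figure 3; the function name and the precondition x < 2n are assumptions):

```python
def cond_sub_mod(x, n):
    """Reduce x into [0, n), assuming 0 <= x < 2*n.

    One subtraction plus a multiplexer: the sign (MSB) of the
    subtraction result drives the select, so no divider is needed.
    """
    y = x - n
    return y if y >= 0 else x  # MSB of y acts as the mux select line

print(cond_sub_mod(7, 5))  # conditional subtract fires
print(cond_sub_mod(3, 5))  # bypass path
```

The same adder, with its inputs and select behavior reconfigured, serves as a plain adder, a subtractor, or a bypass, which is why a single cell type can be shared across all the algorithms.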

The memory requirements of the different implementations also vary widely, due to differing sizes, widths, numbers of memory banks, and ports. The memory organization and address computation are explained in detail in the next section.

4. Multimode Interleaver Architecture

The algorithm analysis provides the basis for multiplexing the hardware-intensive components and combining the functionality of multiple types of interleavers. The architecture of the multimode interleaver is given in Figure 4. The hardware is partitioned such that all computation-intensive components are included in the address generation block. The other partitioned blocks are the register file, the control-FSM, and the memory organization block. These blocks are briefly described in the following subsections.

4.1. Address Generation (ADG) Block

Address generation is the main concern for any kind of interleaving. Unified address generation is achieved by multiplexing the computation-intensive blocks mentioned in Table 2. The address generation hardware is shown in detail in Figure 4. It is surrounded by other blocks such as the control FSM, the register file, and some lookup tables. It utilizes 8 common computing cells together with a multiplier and a comparator. Reconfigurability is mainly achieved by changing the behavior of these cells and by appropriate multiplexer selection. Dedicated control signals define the behavior of each cell; using these signals appropriately, a cell can be configured as an adder, a subtractor, a modulo operation with the MSB of the output as select line, or just a bypass. All the combinations are fully utilized, making it a very useful common computing element. The address generation block takes the configuration vector and configures itself with the help of a decoder block and part of the LUT. The configuration vector is 32 bits wide and defines the block size, interleaver depth, interleaving modes, and modulation schemes.

The ADG block generates the interleaved address based on all the permutations involved when implementing a block interleaver, whereas it generates memory read and write addresses concurrently when implementing a convolutional interleaver. Whether the ADG block acts as an interleaver or a deinterleaver is mainly decided by the controller, which selects the addressing combination (permuted or sequential) used for memory writes and reads.

4.2. Control FSM

Two modes of operation for the hardware are defined: precomputation mode and execution mode. In order to handle the sequence of operations in the two modes, a multistate control-FSM is used; its flow graph is shown in Figure 5. During the precomputation phase, the FSM may perform two main functions: (1) computation of the parameters required for interleaver address computation and (2) initialization of registers to become ready for the execution phase. Other than the IDLE state, 5 states (S1–S4, S8) are assigned to precomputation. The common parameter to be computed in the precomputation phase is the number of rows or columns; however, some specific parameters, such as the prime number p and the intra-row permutation base sequence in the WCDMA turbo code interleaver, are also computed during this phase. For the interleaver functions which do not require precomputation, the initialization steps are bypassed and the control FSM jumps directly to the execution phase. The extra cycle cost associated with the precomputation has been investigated for the current implementation, and the results are presented in a later section. In the execution phase, the control-FSM sequences the loading of data frames into memory and the reading of data frames from memory. In total, 4 states (S5–S7, S9) are assigned to the execution phase. S9 is used for the convolutional interleaver case only, whereas states S5–S7 are reused for all types of interleavers. During the execution phase, the control-FSM also keeps track of the block size by employing row and column counters, thus providing the block synchronization required for each type of interleaver implementation.

4.3. Register File

The need for temporary storage of parameters arises in many types of interleaver implementations. The register requirements of the different implementations are listed in Table 2. Some special usage configurations are also required; for example, the WCDMA turbo code interleaver needs 20 registers forming a circular buffer, the convolutional interleaver in DVB requires 11 registers used as a general purpose register file, and the bit interleaver in DVB requires a long chain of single-bit registers. Due to the small size and special configuration requirements, a general purpose register file is not feasible here, and a fully customized register file is used instead. The register widths are not uniform but are optimized per the requirements of the different implementations. The registers can also be connected to form a chain; thus the single-bit buffer for the bit interleaver is realized by circulating the shifted output inside the register file. The two data input ports of the register file are fed through multiplexers M18 and M19, as shown in Figure 4.

4.4. Memory Organization

Memory requirements for the different types of interleaver implementations differ widely, as listed in Table 2. Also, soft bit processing in the decoder implies different bit-width requirements for different conditions and decoding architectures. The maximum width requirement is 6 bits for symbol interleaving and 8 bits for part of the memory in WCDMA. Multistream transmission requires multiple banks of memories in parallel. The memory size is chosen to accommodate the large block sizes required by 3GPP-LTE, 3GPP-WCDMA, and DVB.

Memory partitioning is mainly motivated by the high-throughput requirements of multistream systems such as 802.11n, which requires four memory banks in parallel; this also appears to be a good choice for meeting the other requirements. Parallel memory banks can be used in series to form one big memory, and partial parallelism can be used where a larger memory width is needed. Another worthwhile benefit of using multiple memory banks is avoiding dual-port RAM, which is not silicon efficient; thus all the memories in the design are single-port memories. The interleaved addresses for block and convolutional interleavers computed by the address generation block are combined according to the configuration requirements to form the final memory address. Figure 6 shows the memory organization with the address selection logic. Particularly for convolutional interleaving, a small delay line with a depth of 6 in the path of read addresses and control signals is used to avoid a write and a read to the same memory in a single clock cycle.
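The two ways of using the four banks can be sketched behaviorally as follows (the mapping functions and the per-bank depth of 2048 are illustrative assumptions, not the exact address-selection logic of Figure 6):

```python
BANKS = 4
BANK_DEPTH = 2048  # assumed per-bank depth, for illustration only

def parallel_map(stream, addr):
    """Multistream mode: each stream owns one bank outright."""
    return stream, addr

def serial_map(addr):
    """Serial mode: the 4 banks are concatenated into one big memory."""
    return addr // BANK_DEPTH, addr % BANK_DEPTH

print(serial_map(5000))  # logical address 5000 lands in bank 2
```

Because the two modes never need the same bank for a concurrent read and write of one stream, single-port memories suffice.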

5. Algorithm Transformation for Efficient Mapping

The main objective is to use a single architecture for interleaver implementation with maximum hardware sharing among the different algorithms. Because of the diversity of the interleaving algorithms, directly mapping the original algorithms onto the same architecture yields an inefficient implementation. On the other hand, transformations based on modular algebra can be applied to the original algorithms to make them hardware efficient, and the same algorithmic transformations help reach an efficient hardware multiplexing among different standards. The following subsections present transformation examples for selected algorithms which are the most demanding from an implementation point of view. They cover channel interleaving for WiMAX and WLAN, including 802.11n with frequency rotation, turbo code block interleaving for LTE, WiMAX, and HSPA Evolution, and the convolutional interleaving used in DVB.

5.1. Channel Interleaving in WiMAX and WLAN

The channel interleaving in 802.11a/b/g (WLAN) and 802.16e (WiMAX) is of the same type. The interleaver function, defined by a set of two equations for two steps of permutations, provides spatial interleaving, whereas the newly evolved standard 802.11n [8], based on MIMO-OFDM, employs frequency interleaving in addition to spatial interleaving. Most of the available literature [31–36] covers the performance and evaluation of WLAN interleaver design for high-speed communication systems; however, some recent work [23–27] focuses on interleaver architecture design, including complexity reduction techniques and the feasibility of higher throughput. The 2D realization of the interleaver functions is exploited here to enable efficient hardware implementation. The two steps of permutations for index k of the interleaver data are expressed by the following equations:

m_k = (N_cbps/16) · (k mod 16) + ⌊k/16⌋, (1)

j_k = s · ⌊m_k/s⌋ + (m_k + N_cbps − ⌊16 · m_k/N_cbps⌋) mod s. (2)

Here N_cbps is the block size, corresponding to the number of coded bits per allocated subchannels, and the parameter s is defined as s = max(N_bpsc/2, 1), where N_bpsc is the number of coded bits per subcarrier (i.e., 1, 2, 4, or 6 for BPSK, QPSK, 16-QAM, or 64-QAM, resp.). The operator mod is the modulo function computing the remainder, and ⌊·⌋ is the floor function, that is, rounding towards zero. The indices k, m_k, and j_k all range from 0 to N_cbps − 1. Direct implementation of the above equations is very hardware inefficient, and mapping them onto the proposed unified interleaver architecture is not possible. Therefore, realization of the two 1D equations in 2D space and computation of the interleaved address in a recursive way are adopted to reduce the hardware complexity, as explained in the following subsections.
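The two permutations can be captured in a short golden model (standard 802.11a form, written by us as a reference against which any recursive hardware realization can be checked):

```python
def wlan_interleave_addr(k, n_cbps, n_bpsc):
    """Reference model of the two WLAN/WiMAX channel-interleaver steps."""
    s = max(n_bpsc // 2, 1)
    m = (n_cbps // 16) * (k % 16) + k // 16                    # first step
    j = s * (m // s) + (m + n_cbps - (16 * m) // n_cbps) % s   # second step
    return j

# 16-QAM over 48 subcarriers: N_cbps = 192, N_bpsc = 4.
addrs = [wlan_interleave_addr(k, 192, 4) for k in range(192)]
assert sorted(addrs) == list(range(192))  # must be a valid permutation
```

The permutation check at the end is the essential sanity test: every output address appears exactly once.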

5.1.1. BPSK-QPSK

As N_bpsc is 1 and 2 for BPSK and QPSK, respectively, s = 1 for both cases, and (2) simplifies to the following form:

j_k = m_k = (N_cbps/16) · (k mod 16) + ⌊k/16⌋. (3)

Considering the interleaver as a block interleaver, 16 is usually taken as the total number of columns and N_cbps/16 as the total number of rows, but the column and row definitions are swapped hereafter: 16 is taken as the total number of rows and N_cbps/16 as the total number of columns. The functionality remains the same, with the benefit that it ends up in a recursive expression for all the modulation schemes. According to the new definitions, the term (k mod 16) provides the behavior of a row counter and the term ⌊k/16⌋ provides the behavior of a column counter. Thus, introducing two new variables i and j as the two dimensions, such that j increments when i expires, the ranges for i and j are

0 ≤ i ≤ 15,  0 ≤ j ≤ N_cbps/16 − 1, (4)

which satisfy k = 16 · j + i. Defining the total number of columns as N_col = N_cbps/16, (3) can be written as

m(i, j) = N_col · i + j. (5)

The recursive form, after handling the wrap-around exception when the accumulated address reaches N_cbps, can be written as

m(k + 1) = m(k) + N_col,  with m(k + 1) = m(k) + N_col − N_cbps + 1 if m(k) + N_col ≥ N_cbps. (6)

With the row counter i and the column counter j so defined, the hardware for (6) is shown in Figure 7(a). The BPSK and QPSK cases do not carry any specific inter-row or inter-column permutation pattern; thus they end up with relatively simple hardware, but they provide the basis for the analysis of the 16-QAM and 64-QAM cases, which are more complicated.
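The recursion can be checked against the direct form with a small model (our Python sketch; the conditional wrap corresponds to the add-then-select cell):

```python
def recursive_addrs(n_cbps):
    """Generate BPSK/QPSK interleaver addresses recursively, per (6)."""
    step = n_cbps // 16          # N_col, added every cycle
    addr, out = 0, []
    for _ in range(n_cbps):
        out.append(addr)
        addr += step
        if addr >= n_cbps:       # wrap-around: move to the next column
            addr = addr - n_cbps + 1
    return out

# Cross-check against the direct evaluation of (3) for N_cbps = 48.
direct = [(48 // 16) * (k % 16) + k // 16 for k in range(48)]
assert recursive_addrs(48) == direct
```

Only one adder and one conditional subtract-with-increment are exercised per address, which is what makes the 2D recursive form attractive in hardware.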

5.1.2. 16-QAM

The 16-QAM scheme has 4 coded bits per subcarrier; thus the parameter s is 2 and (2) becomes

j_k = 2 · ⌊m_k/2⌋ + (m_k + N_cbps − ⌊16 · m_k/N_cbps⌋) mod 2. (7)

Unlike the BPSK/QPSK case, purely algebraic steps cannot be used here due to the presence of the floor and modulo functions. Instead, all the possible block sizes for 16-QAM are analyzed to restructure the above equation. The following structure appears to be equivalent to (7) and at the same time resembles the structure of (3); thus it suits hardware multiplexing well:

The extra term is defined by the following expression:

This term appears because the interleaver for 16-QAM carries specific permutation patterns, making the structure more complicated. Considering the two dimensions i and j with the ranges mentioned in (4), the behavior of the extra term against the row counter i is retained. Thus (8) can be written in 2D representation as follows: where

The extra term can be simplified further, but it is easier to realize the hardware from its current form. The modulo terms can be implemented using the LSBs of the row counter i and column counter j, and the required sequence can be generated with the help of an XOR gate and an adder, as shown in Figure 7(b).

5.1.3. 64-QAM

The parameter s is 3 for 64-QAM; thus (2) becomes

j_k = 3 · ⌊m_k/3⌋ + (m_k + N_cbps − ⌊16 · m_k/N_cbps⌋) mod 3. (12)

The presence of the modulo-3 function makes it much harder to reach a valid mathematical expression algebraically. Different structures for all possible 64-QAM block sizes were analyzed, and a structure similar to (6) and (10) and equivalent to (12) is given as (13), where i and j represent the two dimensions and their range is given by (4). The associated permutation term, given by (14), provides the inter-row and inter-column permutation against row counter i and column counter j. Its expression looks very long and complicated, but eventually it gives a hardware-efficient solution, as the terms inside the braces are easy to generate through a very small lookup table. The generic form of (6), (10), and (13) to compute the interleaved address can be written as

Here an extra parameter distinguishes the different modulation schemes: it is zero for BPSK/QPSK, and for 16-QAM and 64-QAM it is given by (9) and (14), respectively. The hardware realization supporting all modulation schemes is shown in Figure 7(c). It is a highly optimized implementation, as it involves only two additions, some registers, and a very small lookup table.

5.2. Frequency Interleaving in 802.11n

The transmission in 802.11n can be distributed among four spatial streams, as shown in Figure 8. The interleaving requires frequency rotation when more than one spatial stream is transmitted. The frequency rotation is applied to the output j_k of the second permutation. The expression for the frequency rotation of a spatial stream is given as follows:

Here N_rot is the parameter which defines the different frequency rotations for the 20 MHz and 40 MHz cases in 802.11n. The frequency rotation also depends on the index i_ss of the spatial stream; thus each spatial stream undergoes a different frequency rotation. Defining the rotation term as follows,

we have

The rotation term is not bounded and can take values greater than N_cbps; thus a direct implementation cannot be low cost. Analyzing the two terms of the rotation separately, it is observed that the second term provides the starting point for computing the rotation. As the rotation is fixed for a specific spatial stream, this starting value holds for all run-time computations. Equation (18) in combination with (10) can be written as

Here the joint address results after applying both spatial interleaving and frequency interleaving against row index i, column index j, and spatial stream index i_ss. A lookup table can be used for the starting values against the different spatial streams. The starting values for all the cases satisfy a bound which keeps the rotation term within the range of a single conditional correction; therefore, the frequency rotation can be computed with very small hardware, as shown in Figure 9.
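A minimal sketch of the idea, assuming the rotation reduces at run time to subtracting a fixed per-stream offset modulo N_cbps (the offset value 33 below is arbitrary, purely for illustration):

```python
def rotate(j, rot, n_cbps):
    """Apply a fixed frequency-rotation offset with one conditional add.

    Because 0 <= rot < n_cbps is precomputed per stream (the LUT
    starting value), a subtract plus one conditional add replaces the
    general modulo operation.
    """
    r = j - rot
    return r + n_cbps if r < 0 else r

n_cbps = 192
rotated = [rotate(j, 33, n_cbps) for j in range(n_cbps)]
assert sorted(rotated) == list(range(n_cbps))  # still a permutation
```

The per-stream starting value is the only stream-dependent quantity, which is why a small LUT plus an adder suffices in Figure 9.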

5.3. Multistream Interleaver Support in 802.11n

The spatial interleaver address generation block shown in Figure 7(c) is denoted as the Basic Block (BB), and the frequency rotation block shown in Figure 9 is denoted as the Auxiliary Block (AB). Together these blocks form a complete address generation circuit for one spatial stream. In order to support four streams in parallel, one might consider replicating the two blocks four times; however, an optimized solution uses 2 basic blocks and 2 auxiliary blocks while still supporting 4 spatial streams. The hardware block diagram generating the interleaver addresses for multiple streams in parallel is shown in Figure 10. This hardware supports quick configuration changes, thus providing full support for a multitasking environment. If a combination of modulation schemes that is not already supported needs to be implemented, the interfacing processor can schedule the tasks for the different modulation schemes.

5.4. Turbo Code Interleaver for HSPA+

The channel coding block in HSPA+, including WCDMA, uses turbo coding [37] for forward error correction. The 3GPP standard [4] specifies the algorithm for block interleaving in turbo encoding/decoding as given below. Here K is the block size, R is the number of rows, and C is the number of columns of the interleaver matrix.

(i) Find the appropriate number of rows R, prime number p, and primitive root v for the particular block size K, as given in the standard.
(ii) Determine the column size C, which is p − 1, p, or p + 1 depending on K.
(iii) Construct the base sequence s(j) for intra-row permutation by s(j) = (v × s(j − 1)) mod p, for j = 1, 2, …, p − 2, with s(0) = 1.
(iv) Determine the least prime integer sequence q_i for i = 0, 1, …, R − 1, by taking q_0 = 1, such that g.c.d.(q_i, p − 1) = 1, q_i > 6, and q_i > q_(i−1).
(v) Apply the inter-row permutations T(i) to the sequence q_i to find r_i, where r_T(i) = q_i.
(vi) Perform the intra-row permutations U_i(j) = s((j × r_i) mod (p − 1)), for j = 0, 1, …, p − 2 and i = 0, 1, …, R − 1. If C = p: U_i(p − 1) = 0. If C = p + 1: U_i(p − 1) = 0 and U_i(p) = p, and if K = R × C, then exchange U_(R−1)(p) with U_(R−1)(0). If C = p − 1: U_i(j) = s((j × r_i) mod (p − 1)) − 1.
(vii) Perform the inter-row permutations.
(viii) Read the addresses column-wise.

The presence of complex functions such as modulo computation, intra-row and inter-row permutations, multiplications, finding least prime integers, and computing the greatest common divisor makes implementation in the original form inefficient. Further, to produce one interleaved address per cycle, some preprocessing is required, in which parameters such as the total number of rows and columns, the least prime integer sequence q_i, the inter-row permutation patterns, the intra-row permutations, the prime number p, and the associated primitive root v are computed. Some of these parameters can be obtained from lookup tables, while the others need closed-loop or recursive computations. The simplifications considered in the implementation are discussed in the following paragraphs.

One of the main hurdles in generating the interleaved address on the fly is the computation of the intra-row permutation sequence. Before applying the intra-row permutations, the term (j × r_i) mod (p − 1) is computed, which produces pseudorandom values due to r_i and the modulo function. These values appear as the index into the base sequence, so computing them on the fly could require many clock cycles. To resolve this, some precomputations are made and the results are stored in a memory. These precomputations involve a modulo function which would require a divider for direct implementation. To avoid the divider, the modulo function is computed indirectly using the Interleaved Modulo Multiplication Algorithm [38], which computes the result in an iterative way requiring more than one clock cycle. As the maximum operand width is 5 bits, a maximum of 5 iterations is needed to compute one modulo multiplication. The algorithm is shown in Figure 11 and the required hardware in Figure 12. This hardware produces the data for the memory during the precomputation phase, and the same hardware is utilized to generate the address for the memory during the execution phase. The usage of the memory depends on the parameter p, and it is filled up to p − 1 locations.
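The iterative scheme of Figure 11 can be sketched as a generic shift-add modular multiplier in the spirit of [38] (the 5-bit operand width follows the text; the requirement b < n is an assumption of this sketch):

```python
def interleaved_mod_mul(a, b, n, width=5):
    """Compute (a * b) mod n bit-serially, MSB first, with no divider.

    Each iteration doubles the partial result, conditionally adds b,
    and reduces with at most two conditional subtractions of n.
    Requires b < n and a < 2**width.
    """
    r = 0
    for i in reversed(range(width)):
        r <<= 1                 # shift: r = 2*r
        if (a >> i) & 1:
            r += b              # add multiplicand on a set bit of a
        if r >= n:
            r -= n              # first conditional subtract
        if r >= n:
            r -= n              # second conditional subtract
    return r

assert interleaved_mod_mul(13, 22, 31) == (13 * 22) % 31
```

Because the partial result never exceeds 3n before reduction, two conditional subtracts per iteration keep it in range, which maps directly onto the add-then-select cells of the architecture.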

Computing the RAM address recursively, instead of directly evaluating the least prime integer sequence, gives the benefit of avoiding the computation of the modulo function. This idea was introduced in [13] and later used in [14, 16, 17]. The computation can be managed by a subtractor and a lookup table, provided that all the values placed in the lookup table satisfy the required range condition. The similarities between the q_i sequences for all possible values of p are very helpful in improving the efficiency of the lookup table. The parameters p and v are stored in combined fashion in a lookup table addressed via a counter. For each counter value, a condition is checked using a comparator to find the appropriate values of p and v. Once p is found, the total number of columns C can take only three values, that is, p − 1, p, or p + 1; hence C is found in at most three clock cycles by checking the condition K ≤ R × C. The recursive function used to compute the RAM address with the help of the parameter r_i is given by

The data read from the RAM, after passing through some exception handling logic, provide the intra-row permutation pattern for a particular row. The final interleaved address is found by combining the inter-row permutation with the intra-row permutation as follows:

The complete hardware for Turbo Code interleaver address generation is shown in Figure 12. It maps onto the proposed unified interleaver architecture quite efficiently.

5.5. Turbo Code Interleaving in 3GPP-LTE and WiMAX

The newly evolved standard 3GPP LTE [5] involves interleaving in the channel coding and rate matching sections. The interleaving in rate matching is called subblock interleaving and is based on a simple block interleaving scheme. The channel coding in LTE involves a Turbo Code with an internal interleaver. The type of interleaver here is different: it is based on a quadratic permutation polynomial (QPP), which provides a very compact representation. The turbo interleaver in LTE is specified by the following quadratic permutation polynomial:

Π(i) = (f1 · i + f2 · i²) mod K. (25)

Here 0 ≤ i < K, with K as the block size. This polynomial provides deterministic interleaver behavior for different block sizes given appropriate values of f1 and f2. Direct implementation of the permutation polynomial in (25) is hardware inefficient due to the multiplications, the modulo function, and the bit-growth problem. To simplify the hardware, (25) can be rewritten for recursive computation as

Π(i + 1) = (Π(i) + g(i)) mod K, (26)

where g(i) = (f1 + f2 + 2 · f2 · i) mod K. This term can also be computed recursively as

g(i + 1) = (g(i) + 2 · f2) mod K. (27)

The two recursive terms in (26) and (27) are easy to implement in hardware (Figure 13) with the help of a LUT providing the starting values of Π and g.
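The recursions (26) and (27) can be validated against the direct polynomial with a short model (f1 = 3, f2 = 10 correspond to the 3GPP TS 36.212 entry for K = 40; any valid pair works for the check):

```python
def qpp_addresses(K, f1, f2):
    """Generate all K QPP addresses using the recursions (26)-(27).

    In hardware the `% K` steps become conditional subtractions, since
    each sum grows by less than K per iteration.
    """
    pi, g = 0, (f1 + f2) % K      # Pi(0) = 0, g(0) = f1 + f2
    out = []
    for _ in range(K):
        out.append(pi)
        pi = (pi + g) % K          # (26)
        g = (g + 2 * f2) % K       # (27)
    return out

direct = [(3 * i + 10 * i * i) % 40 for i in range(40)]
assert qpp_addresses(40, 3, 10) == direct
```

No multiplier is needed at run time: only two modulo additions per address, matching the datapath of Figure 13.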

The WiMAX standard [6] uses convolutional turbo coding (CTC), also termed duo-binary turbo coding, which offers advantages such as better performance over classical single-binary turbo codes [39]. The parameters defining the interleaver function, as described in [6], are designated as N, P0, P1, P2, and P3. The two steps of interleaving are described as follows.

Step 1. Let the incoming sequence of couples be u0 = [(A0, B0), (A1, B1), ..., (A(N−1), B(N−1))]. For j = 0, 1, ..., N − 1, the couple is switched, that is, (Aj, Bj) → (Bj, Aj), if j mod 2 = 0.
The new sequence of couples is denoted u1.
Step 2. The function P(j) provides the address of the couple from the sequence u1 that will be mapped onto address j of the interleaved sequence. P(j) is defined by the set of four expressions with a switch selection as follows:
switch (j mod 4).
case 0: P(j) = (P0 · j + 1) mod N.
case 1: P(j) = (P0 · j + 1 + N/2 + P1) mod N.
case 2: P(j) = (P0 · j + 1 + P2) mod N.
case 3: P(j) = (P0 · j + 1 + N/2 + P3) mod N.

Combining the four equations provided in Step 2, the interleaver function becomes

P(j) = (Θ(j) + 1 + β(j)) mod N,

where Θ(j) = (P0 · j) mod N can be computed using recursion, that is, by taking Θ(j) = (Θ(j − 1) + P0) mod N with Θ(0) = 0, and β(j) is given by

β(j) = 0, N/2 + P1, P2, or N/2 + P3 for j mod 4 = 0, 1, 2, 3, respectively.
As the range of each term is less than N, the final address can be computed using addition and subtraction with compare-and-select logic, as shown in Figure 13.
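A behavioral sketch of the combined CTC address computation follows: the running term (P0 · j) mod N is kept in an accumulator (one addition plus a conditional subtraction per step), and the offset that depends on j mod 4 is selected as in Step 2. The parameter set (N, P0, P1, P2, P3) = (24, 5, 0, 0, 0) is used purely for illustration; the valid combinations are given by the tables in [6].

```python
def ctc_addresses(N, P0, P1, P2, P3):
    """P(j) for j = 0..N-1, with acc holding (P0*j) mod N."""
    offsets = [1, 1 + N // 2 + P1, 1 + P2, 1 + N // 2 + P3]
    out, acc = [], 0
    for j in range(N):
        out.append((acc + offsets[j % 4]) % N)
        acc += P0
        if acc >= N:             # conditional subtraction instead of '%'
            acc -= N
    return out

# The result is a permutation of the couple addresses 0..N-1
print(sorted(ctc_addresses(24, 5, 0, 0, 0)) == list(range(24)))
```

The multiplication P0 · j never appears explicitly, so the datapath reduces to adders, a comparator, and a 4-way multiplexer for the offset.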

5.6. Convolutional Interleaving in DVB

The convolutional interleaver used in DVB is based on the Forney [40] and Ramsey type III approach [41]. The convolutional interleaver, being part of the outer coding, resides between the RS encoding and the convolutional encoding. The convolutional interleaver for DVB consists of 12 branches, as shown in Figure 14. Each branch j is composed of first-in-first-out (FIFO) shift registers with depth j × M, where M = 17 for DVB. The packets of 204 bytes, each containing one sync byte, enter the interleaver in a periodic way. For synchronization purposes the sync bytes are always routed to branch 0 of the interleaver.

Convolutional interleaving is best suited for real-time applications, with the added benefits of half the latency and lower memory utilization compared to block interleaving. Recently, convolutional interleavers have been analyzed in combination with turbo codes [42–44], with improved performance, which makes them more versatile; thus, a general and reconfigurable convolutional interleaver architecture integrated with block interleaver functionality can be of significance.

Implementation of convolutional interleavers using first-in-first-out (FIFO) register cells is silicon inefficient. To achieve a silicon-efficient solution, a RAM-based implementation is adopted. The memory is partitioned in such a way that, by applying appropriate read/write addresses in a cyclic way, it exhibits the branch behavior required by a convolutional interleaver. RAM write and read addresses are generated by the hardware shown in Figure 15. The hardware components used here are almost the same as those used by the interleaver implementations for the other standards, thus providing the basis for multiplexing the hardware blocks for reuse. To keep track of the next write address for each branch, 11 registers are needed, which motivates the use of cyclic pointers instead of FIFO shift registers. For each branch, the write address is provided by the corresponding pointer register, and the next write address (which is also the current read address) is computed using an addition and a comparison with the branch boundaries. Other reference implementations have used branch boundary tables directly, but to keep the design general, the branch boundaries are computed on-the-fly using an adder and a multiplier in connection with a branch counter.

For implementing a convolutional deinterleaver, the same hardware is used by running the branch counter in reverse order (decrementing by 1). In this way the same branch boundaries are used, and the only difference is that the sync byte in the data is now synchronized with the largest branch, as shown in Figure 14. Keeping the same branch boundaries for the deinterleaver fixes the width of the pointer registers, which gives the additional benefit that this width can be optimized efficiently.
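To make the cyclic-pointer idea concrete, here is a behavioral sketch (hypothetical helper names, not the paper's RTL): each branch j owns a buffer of depth j × M addressed by its own pointer; reading the oldest byte and writing the newest at the same location before advancing the pointer reproduces the FIFO behavior without shift registers.

```python
I, M = 12, 17                  # DVB: 12 branches, unit depth M = 17

def make_state():
    """Per-branch cyclic buffers (branch j has depth j*M) and pointers."""
    return [[0] * (j * M) for j in range(I)], [0] * I

def interleave_byte(state, branch, byte):
    """Pass one byte through the given branch (branch = byte index mod I)."""
    bufs, ptrs = state
    if branch == 0:                        # sync path: zero delay
        return byte
    buf, p = bufs[branch], ptrs[branch]
    out = buf[p]                           # read the oldest byte...
    buf[p] = byte                          # ...then write the newest in place
    ptrs[branch] = p + 1 if p + 1 < len(buf) else 0   # cyclic pointer wrap
    return out
```

Feeding the byte stream round-robin (branch = byte index mod 12) yields the Forney interleaver behavior, with branch j delaying its bytes by j × M commutation periods.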

6. Integration into Baseband System

The multimode interleaver architecture can perform interleaving or deinterleaving for various communication systems. It is targeted for use as an accelerator core alongside a programmable baseband processor. How the multimode interleaver core is used depends on the capability of the baseband processor. For lower throughput requirements, a single core can be attached to the baseband processor and the operations performed sequentially. In practice, however, typical system-level implementations require interleaving at multiple stages; the number of stages can be up to three, for example, in WCDMA (turbo code interleaving, 1st interleaving, and 2nd interleaving). A fully parallel implementation can be realized by using three instances of the proposed multimode interleaver core, but to optimize the hardware cost a wiser choice is to use two instances hooked up to the main bus of the processor, as shown in Figure 16. In this way the interleaving stages can be categorized as channel interleaving and coding/decoding interleaving. Further optimizations can be made in the two cores to fit particular requirements, for example, one interleaver core dedicated to coding/decoding and the second dedicated to channel interleaving. Doing so does not significantly reduce the silicon cost associated with address generation; however, the memory sizes can be tailored to the targeted implementations, which can reduce the silicon cost significantly. In the current implementation of the multimode interleaver, the input memory used for any kind of decoding is considered part of the baseband processor's data memory. In this way, extra memory inside the interleaver core, which would be redundant in many cases, is avoided. The integration of the input memory into the main decoding operation is still facilitated by the interleaver core, which provides the addresses for the input memory.
In this way the interleaved/deinterleaved data can be fed to the decoder block in a synchronized manner.

Although the main focus is on supporting the targeted standards, the programmability of the processor may target some other type of interleaver implementation that is not directly supported by this core. To keep the core usable in such cases, support for indirect implementation of any block interleaver, with or without row or column permutations, is also provided. In this case the interleaver core is configured to implement a general interleaver with external permutation patterns. The permutation patterns are computed inside the baseband processor, using its programmability, and loaded into a couple of the interleaver memories during the precomputation phase. Excluding these memories, a restriction on the maximum block size (i.e., 4096) is imposed in this case. This approach is adopted by all commercially available interleaver implementations, such as Xilinx [28], Altera [29], and Lattice Semiconductor [30]. Computing the interleaver permutations on the processor side and loading them into memory imposes extra computation and time overheads on the processor. Another drawback is that it does not support fast switching between different interleaver implementations. A real multimode processor may require fast transition from one standard to another; therefore, this approach is not a perfect choice for a true multimode environment. Nevertheless, it is supported by the proposed multimode interleaver core for completeness of the design.

7. Implementation Results

The reconfigurable hardware interleaver design shown in Figure 4 provides a complete solution for multimode radio baseband processing. The wide range of supported standards is its key benefit. The RTL code for the reconfigurable interleaver design was written in Verilog HDL, and the correctness of the design was verified by testing the maximum possible number of cases. Targeting the use of the interleaver core with a multimode baseband processor, one of the important parameters to investigate is the precomputation cycle cost. A lower precomputation cycle cost enables fast switching between different standards. Table 3 shows the worst-case cycle cost during precomputation for the different interleavers. The cycle cost in WCDMA is higher for some block sizes, but it is still acceptable, as it is less than the frame size and can easily be hidden behind the first SISO decoding pass of the turbo decoder. The worst-case precomputation cycle cost for the other interleaver implementations is not very high. Therefore, the design supports fast switching among different standards and is well suited to a multimode environment.

The multimode interleaver design was implemented in a 65 nm standard CMOS technology and consumes 0.126 mm² in total. The chip layout is shown in Figure 17, and a summary of the implementation results is provided in Table 4. The design can run at a frequency of 166 MHz and consumes 11.7 mW of power in total. Therefore, with 4-bit parallel processing for four spatial streams (e.g., 802.11n), the maximum throughput can reach up to 664 Mbps. This throughput is limited to 166 Mbps for single-stream communication systems. Table 5 provides a comparison of the proposed design to others in terms of standard coverage, silicon cost, and power consumption. The reference implementations have lower standard coverage than the proposed design. Though more silicon is needed for wider standard coverage, our solution still provides a good trade-off with acceptable silicon cost and power consumption.

8. Conclusion

This paper presents a flexible and reconfigurable interleaver architecture for a multimode communication environment. The presented architecture supports a number of standards, including WLAN, WiMAX, HSPA+, 3GPP-LTE, and DVB, thus providing wide coverage. To meet the design challenges, algorithmic-level simplifications such as the 2D transformation of interleaver functions and the recursive computation of different implementations are used. The major focus has been to compute the permutation patterns on-the-fly with full flexibility. Architecture-level results have shown that the design provides a good trade-off in terms of silicon cost and reconfigurability when compared with reference designs of narrower standard coverage. Compared to individual implementations for the different standards, the proposed unified address generation offers a reduction in silicon by a factor of three. Finally, the basic requirement of a multimode processor platform, that is, fast switching between different standards, has been met with minimal precomputation cycle cost. This enables the processor to use the interleaver core for one standard at one time and for another standard in the next time slot, at the cost of only changing the configuration vector and a small preprocessing overhead.