Journal of Electrical and Computer Engineering

Volume 2012 (2012), Article ID 614259, 14 pages

http://dx.doi.org/10.1155/2012/614259

## VLSI Architectures for Sliding-Window-Based Space-Time Turbo Trellis Code Decoders

School of Electronic and Electrical Engineering, University of Leeds, Leeds LS2 9JT, UK

Received 15 July 2011; Revised 17 February 2012; Accepted 4 April 2012

Academic Editor: Zhiyuan Yan

Copyright © 2012 Georgios Passas and Steven Freear. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The VLSI implementation of SISO-MAP decoders used for traditional iterative turbo coding has been investigated in the literature. In this paper, a complete architectural model of a space-time turbo code receiver that includes elementary decoders is presented. These architectures are based on newly proposed building blocks such as a recursive add-compare-select-offset (ACSO) unit, A-, B-, Γ-, and LLR output calculation modules. Measurements of complexity and decoding delay of several sliding-window-technique-based MAP decoder architectures and a proposed parameter set lead to defining equations and comparison between those architectures.

#### 1. Introduction

The ability of turbo codes (TCs) to achieve very low BER that approaches the Shannon limit is very attractive. These channel capacity approaching codes have been proposed by Berrou et al. [1, 2]. Iterative turbo decoding involves the parallel concatenation of two or more component codes, which operate on the data and exchange information in order to progressively reduce the error rate. The exchanged information is in the form of log-likelihood ratio (LLR) soft decisions and can be exchanged between elementary decoders, which apply either the soft output Viterbi algorithm (SOVA) [3, 4] or the maximum a posteriori (MAP) probability algorithm [5].

The principles of iterative turbo decoding can be combined with those of space-time coding, resulting in a bandwidth-efficient, low-error-rate channel coding scheme named space-time turbo trellis coding (STTuTC) [6–10]. The latter scheme benefits from the coding gain of turbo codes and the diversity gain of space-time codes to obtain a very power-efficient system. These power gains are very important for high-performance communication systems, particularly in scenarios where operation at low signal-to-noise ratio is required.

Despite the complexity reduction, iterative turbo codes still have prohibitively high implementation complexity and suffer from large decoding delay, motivating researchers to seek efficient implementations [11–19]. The sliding window technique (SWT) presented in [17–21] enables the early acquisition of the state metric values without having to scan the whole trellis frame in one direction before scanning the other, which reduces the elementary decoder latency as well as the state metric memory requirements.

#### 2. Architectural Overview

##### 2.1. Turbo Transmitter

A space-time turbo trellis code is a parallel concatenation scheme using two identical elementary encoders to operate on an information message , where each symbol consists of *b* bits [8, 9]. The same message is sent to both component encoders, but each one receives this information in a different order through an interleaver.

Elementary encoders using recursive space-time trellis coding (STTC) benefit from interleaver gain and iterative decoding [6–8]. These encoders add redundancy to the encoded messages, and the second sequence is deinterleaved back to its original order. The two sequences of codewords are multiplexed and truncated to preserve bandwidth efficiency and form a single sequence of length *P* denoted by , where each space-time vector consists of *K* *b*-bit modulated symbols. In vector *X*, half of the symbols (odd locations) come from component encoder 1 () and the other half (even locations) come from component encoder 2 () [6]. The sequence is forwarded for carrier modulation and transmission since baseband modulation occurs within the encoders. The structure results in a long block code from small memory convolutional codes, a fact that enables the turbo code to approach the Shannon limit.

Note that the recursive STTCs are trellis based, meaning that during the encoding process, in each elementary encoder a trellis path is created [7]. Let be a state row vector containing all the states in a trellis path chosen by the information message. The combined trellis paths will be decomposed and exploited at the receiver side to decode the message.

##### 2.2. Proposed Turbo Receiver

Consider the baseband MIMO receiver shown in Figure 1. The received frame , where , can be used to estimate the original sequence *C*. Using the baseband signal representation, the received noisy signal from antenna *j* at time *t* over a MIMO (possibly fading, when ) Gaussian channel is given by the equation
where represents the noiseless version of the signal [7–9]. The receiver, after estimating the channel fading coefficients, separates the signals using maximal ratio receiver combining (MRRC), which also equalises the signals by removing the effect of the channel. With MRRC, costly joint detection of space-time symbols is avoided.

The equalised frame symbols are delivered to a soft-demapper to calculate the soft channel output given in

The received symbols have already been prescaled by an SNR scaler in the analogue domain [12]. A frame formation block (FBB) separates the even and odd symbols and inserts zeros, ensuring that each log-MAP decoder core receives information only from the corresponding component encoder.

At the algorithmic level, the STTuTC decoder consists of two soft-in soft-out (SISO) component decoders, which exchange soft information and, over many iterations, progressively produce a better estimate of the values of the information symbols [11, 12]. Since the two component decoders do not operate on data at the same time, a single silicon IP core can be used to implement both. As illustrated in Figure 1, the decoder contains a symbol-based SISO component core that applies the log-MAP algorithm on a nonbinary trellis. It also uses two memories, one for storing the exchanged soft information and another to perform symbol interleaving/deinterleaving.

##### 2.3. MAP and Log-MAP Algorithms

Maximum likelihood decoding using Viterbi algorithm determines the most likely path through the trellis and from that determines the information symbol sequence, whereas MAP decoding determines the latter sequence by considering each information symbol independently using the channel observations . The MAP algorithm is used to compute the a posteriori LLR soft decisions (3), about the value of the symbols in the information message [6, 8]: where . This equation expresses the excess probability of a symbol being in the logarithmic domain over the probability of being equal to a reference symbol 0 [11, 13]. The log-MAP decoder calculates the LLRs of all the possible modulation symbols in the constellation diagram using the forward and backward state metric probabilities as well as the branch metric probabilities in the logarithmic domain:

The metric represents the logarithmic probability that the state s at stage *t* has been reached from the beginning of the trellis. Similarly, the metric represents the logarithmic probability that the state at stage has been reached from the end of the trellis [13, 15]. The branch metric probabilities can be computed using
where . The above equation gives the probability of a transition at time * t * from state to state * s * in the trellis diagram [13]. Translating (5) in the logarithmic domain [15],
where is a constant, the term is the soft channel output, and the term is the a priori information coming from the other component decoder:

Finally, (7) gives the elementary decoder output, which is the a posteriori probability LLR expressed in terms of the logarithmic domain state and branch metrics [13].
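The log-domain recursions rest on evaluating ln(e^a + e^b) without leaving the logarithmic domain. The following minimal Python sketch shows this operation, commonly called the Jacobian logarithm or max*; the function name `max_star` and the test values are chosen here for illustration and do not come from the paper.

```python
import math

def max_star(a: float, b: float) -> float:
    """Jacobian logarithm: ln(exp(a) + exp(b)) as a maximum plus a correction term."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

# Compare against the direct form, which is less robust for large |a| or |b|.
a, b = -1.3, 0.4
direct = math.log(math.exp(a) + math.exp(b))
agrees = abs(max_star(a, b) - direct) < 1e-12
print(agrees)
```

Dropping the correction term gives the max-log-MAP approximation; keeping it, exactly or through a small table, gives log-MAP.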

#### 3. Soft-In Soft-Out-MAP Component Decoder

##### 3.1. Decoder Module Architecture

The proposed elementary decoder core is illustrated in Figure 2. The main modules within this decoder are the branch metric calculation module (BM-CM), the A- and B-state metric calculation modules (A-SM-CM and B-SM-CM, resp.), the LLR calculation module (LLR-CM), and a large A-metric RAM to temporarily store metric values.

The BM-CM calculates all possible normalised branch metric values (see Section 3.2) and stores them in a register file. The latter consists of registers to store the possible branch metric values. These metrics are given to A-SM-CM, B-SM-CM, and LLR-CM. The A-metric CM scans the entire trellis diagram in the forward direction, starting from stage zero and ending at stage , calculating all A-state metrics and storing them in the A-metric RAM.

After this first trellis pass, the B-metric CM and the LLR-CM simultaneously compute the B-state metrics and the LLR outputs, respectively. The B-SM-CM scans the entire trellis in the backward direction, from stage to 0. At this time, the B-metrics do not need to be stored in a memory, since every time a B-metric is calculated the corresponding A-metric is extracted from the A-metric RAM and the LLR is computed.

Therefore, only the A-state metrics are stored in a RAM memory, the size of which must be , for continuous mode decoders. In a decoder operating in the terminated mode, where the trellis path begins and ends at state s(0), that is, and , the memory requirement is slightly reduced to , whereas in truncated operating mode (i.e., must hold for every frame, but there is no termination condition), the storage requirement is , where is the word length of the data format used to represent the internal data of the MAP decoder. Table 1 shows the complexity and decoding delay of a log-MAP decoder operating in different modes. The metrics are calculated using data flow units (DFUs) that operate in decoding steps. As can be seen the log-MAP decoder results in very high memory requirements and long elementary decoder latencies.

##### 3.2. Branch Metric Calculation Module

The branch metric calculation module computes the normalised branch metrics . The branch metric appears on both the numerator and denominator of the a posteriori LLR formula, thus any constants will cancel out and play no role in the calculation:

The module receives as input the soft output of the channel and the a priori information from the other component decoder and computes all possible values of branch metrics as shown in Figure 3.
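As a rough illustration of this module, the sketch below forms one normalised branch metric per branch label by adding the soft channel term to the a priori term, with the common constant dropped. It assumes, for simplicity, that each branch is identified by a single label index; the function name and the numeric values are hypothetical.

```python
def normalised_branch_metrics(channel_soft, a_priori):
    """Return one normalised branch metric per label: channel term plus a priori term."""
    if len(channel_soft) != len(a_priori):
        raise ValueError("one channel term and one a priori term per branch label")
    return [c + p for c, p in zip(channel_soft, a_priori)]

# Illustrative inputs for four branch labels (values are made up).
channel_soft = [0.0, -1.5, 0.25, -2.0]
a_priori = [0.0, 0.5, -0.25, 1.0]
gammas = normalised_branch_metrics(channel_soft, a_priori)

# A constant common to all branches shifts every metric equally,
# so metric differences (and therefore the LLRs) are unchanged.
K = 4.0
shifted = [g + K for g in gammas]
same_differences = all(
    (s - shifted[0]) == (g - gammas[0]) for s, g in zip(shifted, gammas)
)
print(gammas, same_differences)
```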

##### 3.3. State Metric Calculation Module

In the literature, the proposed architectures that recursively calculate the state metrics are based on the so-called ACSO (add-compare-select-offset) processing unit, equivalent to the add-max*() presented in [18]. It is responsible for the computation of a state metric based on the previous stage metrics and the branch metrics arriving at the state in question.

Several different but similar architectures have been proposed. The diagram in Figure 4 suggests a new type of ACSO unit which, instead of employing more multiplexers to cope with the more than two input state metrics in nonbinary trellises, recursively accepts the input state and branch metrics and calculates the output state metric in a number of cycles. A look-up table can be used to store precalculated correction factors; its quantization is discussed in [18, 22].

This module requires clock cycles to calculate the output state metric. In the first clock cycle, the feedback register is reset, and the and are applied at the inputs. Also, the SW1 switch is connected to the zero register. The result is that is stored in the feedback register. In the remaining cycles, the LUT output is connected through SW1 to the output adder. In the second cycle, the + is applied at the input; therefore the is stored in the feedback register. Similarly, the is stored in the third cycle, whereas the output state metric equal to is ready at the output port during the fourth clock cycle. A-metric calculation is illustrated, but the architecture operates equally well for B-metric calculation in a backward recursion. This unit will be referred to as the recACSO unit.
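The cycle-by-cycle behaviour described above can be mimicked in software. The sketch below is a behavioural model only, not the circuit of Figure 4: one state/branch metric pair is consumed per loop iteration (one clock cycle), the first sum is stored directly (SW1 selecting zero), and later cycles apply compare, select, and a table-based offset. The LUT step size and depth (`STEP`, 32 entries) are assumptions made for this example.

```python
import math

STEP = 0.25  # assumed quantisation step of the LUT address
LUT = [math.log1p(math.exp(-i * STEP)) for i in range(32)]  # assumed 32-entry table

def correction(diff: float) -> float:
    """Look up the quantised correction ln(1 + exp(-|diff|)); zero beyond the table."""
    idx = int(abs(diff) / STEP)
    return LUT[idx] if idx < len(LUT) else 0.0

def rec_acso(prev_state_metrics, branch_metrics):
    """Recursively combine (state metric + branch metric) pairs, one pair per cycle."""
    if not prev_state_metrics or len(prev_state_metrics) != len(branch_metrics):
        raise ValueError("need equally many state and branch metrics, at least one")
    feedback = None  # models the reset feedback register
    for a, g in zip(prev_state_metrics, branch_metrics):
        s = a + g  # add
        if feedback is None:
            feedback = s  # first cycle: no offset is applied
        else:
            # compare, select, then add the LUT offset
            feedback = max(feedback, s) + correction(feedback - s)
    return feedback

# Four incoming branches, as in a trellis with four transitions per state.
out = rec_acso([0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])
print(out, math.log(4.0))
```

With four equal inputs the exact result would be ln 4; the table-based version lands close to it, the gap being the quantisation error of the assumed LUT.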

The forward/backward recursion computations may lead to an overflow. If a state metric value grows constantly, the finite word length will not be sufficient to hold this value and an overflow will occur. Since the operation is linear and shift invariant, a global shift of the state metric values does not change the LLR output value; it is the difference between the state metrics and not their absolute values that is important. Rescaling of the state metrics can therefore be readily performed to avoid overflow. The rescaling technique is the same as that used in other SISO-log MAP algorithms [11, 18, 23].
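A small sketch of one possible rescaling rule follows. Subtracting the largest metric of the stage is an assumption made here for illustration; the paper only states that the usual technique of [11, 18, 23] is used. The point being shown is that differences between state metrics survive the shift.

```python
def rescale(state_metrics):
    """Shift all state metrics of one stage so that the largest becomes zero."""
    ref = max(state_metrics)
    return [m - ref for m in state_metrics]

metrics = [12.5, 10.0, 11.25, 9.0]  # made-up values for a four-state stage
scaled = rescale(metrics)

# Pairwise differences are what the LLR depends on; they are preserved.
diffs_before = [m - metrics[0] for m in metrics]
diffs_after = [m - scaled[0] for m in scaled]
print(scaled, diffs_before == diffs_after)
```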

The precision of the state metrics determines the word length . The precision depends on the dynamic range of the state metrics.

Thus, recACSO units can be employed to calculate all the states of a trellis stage in one step, as depicted in Figure 5 for a state trellis. This unit receives possible branch metrics and the previous state metrics and computes all the state metrics of the current stage. Configuring the module this way speeds up computation since it avoids reading the previous stage state metrics from the RAM memory. At the input of the module two selectors are used to drive the recACSO units with the appropriate inputs.

##### 3.4. LLR Output Calculation Module

The LLR-CM depicted in Figure 6 is responsible for computing the output reliability estimates given the state and branch metric calculations according to

A careful observation of (9) reveals a close relation with the recACSO unit calculation. The LLR computation has two terms that can be computed separately and then subtracted. If and are added in advance, the recACSO unit can be used to calculate , where . Because the when has to be subtracted from all other LLRs, an inverter is included to negate this term.

As already mentioned, need not be calculated since zero is the reference symbol and therefore its LLR is zero. The remaining log-likelihood ratios are computed by the proposed architecture and output from the symbol-based log-MAP component decoder.
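The structure of this calculation can be sketched for a single trellis stage as follows. For every symbol the sums A + Γ + B over the branches carrying that symbol are combined with max*, and the term of the reference symbol 0 is subtracted from the others. The toy two-state trellis, the branch tuple layout, and all names are assumptions for illustration, and exact max* is used in place of a LUT.

```python
import math
from functools import reduce

def max_star(a, b):
    """ln(exp(a) + exp(b)) computed in the logarithmic domain."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def llr_outputs(alpha_prev, beta_next, branches, num_symbols):
    """Symbol LLRs relative to symbol 0 for one trellis stage.

    branches: iterable of (s_prev, s_next, symbol, gamma); every symbol is
    assumed to label at least one branch.
    """
    sums = {c: [] for c in range(num_symbols)}
    for s_prev, s_next, c, gamma in branches:
        sums[c].append(alpha_prev[s_prev] + gamma + beta_next[s_next])
    terms = {c: reduce(max_star, values) for c, values in sums.items()}
    # Symbol 0 is the reference, so only num_symbols - 1 values are produced.
    return {c: terms[c] - terms[0] for c in range(1, num_symbols)}

# Toy stage: 2 states, 4 symbols, symbol c drives the trellis to state c % 2.
alpha = [0.0, -1.0]
beta = [-0.5, 0.0]
gamma_by_symbol = [0.0, 1.0, -2.0, 0.5]
branches = [(s, c % 2, c, gamma_by_symbol[c]) for s in range(2) for c in range(4)]
llrs = llr_outputs(alpha, beta, branches, 4)
print(llrs)
```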

#### 4. SWT-Based Architectures for the SISO-MAP Decoder

##### 4.1. The Concept

SISO-MAP decoders suffer from very long latency and high memory requirements. A different organisation (scheduling) of the computation and exploitation of the convergence property [24–26] of the decoder lead to a reduction in the latency and in the number of state metrics that need to be stored. For this reason, the sliding window technique (SWT) has been proposed in the literature [18, 19].

SWTs are divided into single-flow structure (SFS), double-flow structure (DFS), and pointer-based (PNT) techniques. In these techniques the trellis or transmitted frame is divided into windows of size , where is the convergence length [25]. Care must be taken to ensure that the whole frame can be exactly divided into an integer number of windows. The convergence property of the MAP decoder allows the recursive calculation of a state metric to result in a good approximation of the actual metric value after recursion steps, even if the initial data is a dummy value [8, 9].
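The divisibility requirement can be made explicit with a short helper. This is only a sketch: the window size is passed in as a plain parameter, and the function name and the half-open index convention are choices made for the example.

```python
def split_into_windows(frame_len: int, window: int):
    """Return half-open (start, end) stage ranges covering the frame exactly."""
    if window <= 0 or frame_len % window != 0:
        raise ValueError("frame length must be a positive multiple of the window size")
    return [(start, start + window) for start in range(0, frame_len, window)]

windows = split_into_windows(8, 2)
print(windows)
```

A frame whose length is not a multiple of the window size is rejected rather than silently padded, which mirrors the care the text asks for.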

Three important metrics for comparison of MAP decoders can be considered: (i) decoding delay (latency), (ii) memory requirements (or storage lifetime), and (iii) computational complexity (number of recursion modules). Several different schedulings of the MAP algorithm computation reveal trade-offs between these quantities. Since the window size is recursion steps, a DFU proposed in Section 3 can be employed to calculate all the state metric vectors in a window in steps. A DFU can perform forward, backward, or dummy metric calculations.

Defining the following parameter set will assist in the analysis of SWT-based MAP decoders (technique, , , shift, ). The “technique” element is the type of computation organisation with possible values being SFS, DFS, SFS/PNT, and DFS/PNT.

“” is the relative position between (A- and B-) forward and backward coupled metric recursive calculations and therefore determines when the B-metric calculation begins relative to A-metric calculation. It is a continuous value variable, but only three values are a meaningful choice. means that the backward calculation starts after the forward ends, means that the forward calculation starts after the backward ends, and means that the two calculations take place simultaneously and switch role in the middle of the convergence block calculation.

“” is the ratio of the valid over the invalid (dummy) metric recursive calculation. To compute the backward metrics before the end of the whole frame a dummy metric calculation is required to estimate an intermediate value using the convergence property. This means that for every A- and B-metric pair one dummy module is required, but that can vary depending on the technique, is the ratio of invalid metrics over valid A- and B-metrics.

The computation can take place in two directions (two flows) simultaneously, each calculating half of the frame. The “Shift” is the time shift in decoding steps between the two flows. In the pointer technique, pointers are used to reduce memory requirements; is the number of pointers, and its presence in the parameter set indicates the pointer technique (PNT).

This set of SWT-based architectures has been fully investigated and some results are given in Table 2. These indicate the hardware requirements and decoding delay of some common implementation structures.

##### 4.2. Single-Flow Structure SISO-MAP Decoders

Figures 7 and 8 illustrate, as an example, the data flow graph (DFG) of a single flow and the architecture of SM-CM of (SFS, = 0, = 1) and the architecture of (SFS, = 0, = 1/2) single-flow structures. The flow diagram explains the computation.

In the DFG the horizontal axis represents time quantized in terms of symbol periods, whereas the vertical axis represents the symbol number within the frame. In this specific diagram a frame of *P* = symbols decoded by a trellis of stages is presented, where is the convergence length of the decoder.

Three data flow units can be seen in the diagram. The dashed arrow represents recursions of a forward DFU calculating A-metric, which are stored as calculated. The continuous arrow represents recursion of a backward DFU calculating B-metrics. This calculation is done simultaneously with the LLR soft output calculation. The dotted arrow represents recursions of a dummy metric DFU.

In the diagrams to follow the dotted arrow always represents dummy metric recursive use of DFUs, whereas for forward and backward metric DFUs the arrow is continuous or dashed when the metric is not stored or stored in the memory, respectively. The dummy metric DFU results in a valid metric after the backward recursions.

The fact that the dummy metric DFU is working backwards can be understood from the fact that its projection on the vertical axis points downwards. DFUs whose projection on the vertical axis points upwards are operating on the data in a forward manner. Also note that the maximum number of arrows a vertical line can cross in this diagram gives the number of DFUs employed in this structure [21].
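The rule that the largest number of arrows crossed by a vertical line gives the DFU count can be checked mechanically. In the sketch below each recursion is a half-open time interval, and the count is the peak number of overlapping intervals. The tile timing in `sfs_schedule` (dummy and forward recursions running together, the backward recursion in the following window period, time origin at zero) is an assumed simplification of the single-flow schedule, not a transcription of Figure 7.

```python
def dfu_count(intervals):
    """Peak number of simultaneously active recursions (half-open [start, end))."""
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # At equal times, releases (-1) sort before starts (+1),
    # so a unit freed at time t can be reused at time t.
    events.sort()
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak

def sfs_schedule(num_windows: int, L: int):
    """Assumed single-flow tile timing with window period L (in symbol periods)."""
    intervals = []
    for k in range(num_windows):
        t = k * L
        intervals.append((t, t + L))          # dummy metric recursion
        intervals.append((t, t + L))          # forward recursion, A-metrics stored
        intervals.append((t + L, t + 2 * L))  # backward recursion with LLR output
    return intervals

units = dfu_count(sfs_schedule(num_windows=4, L=8))
print(units)
```

Under this timing the backward recursion of one tile overlaps the dummy and forward recursions of the next, which is why three units are busy in steady state.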

Storage is also represented by the shaded rectangular areas. Taking into account one of these areas (they are all the same), the projection of this rectangular area on the vertical line gives the amount of state metric vectors to be stored [18, 21], whereas the projection on the horizontal line gives the time required to store the vectors.

Decoding delay (or latency) is the horizontal distance between the acquisition curve (always ) and the decoding curve in DFG. Since the decoding curve in this structure is at , the decoding delay is symbol periods. The symbol period is denoted by .

DFGs and tile graphs model the same thing: the resource-time scheduling of the recursions of the algorithm. A diagram is a DFG when viewed as a concrete graph, or a tile graph when viewed as tile repetition [17]. On the right of Figure 7 a tile, which consists of 3 DFUs, is illustrated. This tile is repeated as many times as required to form a complete DFG.

Let us concentrate more on the scheduling of the operations described by the DFG in Figure 7. During the symbol periods to , the dummy metric DFU (dotted arrow 1) starting from a zero vector assigned to state metric vector computes invalid metrics from down to . This calculation results in a valid state metric vector , because convergence is reached. All other metrics calculated during these recursions are invalid. At the same time the forward DFU (dashed arrow 2) calculates and stores a sequence of A-metrics from to .

In the next symbol periods from to the backward DFU (continuous arrow 3) comes into play to calculate to . During the backward calculation the A-state metrics from to are extracted from the memory and together with the B-state metrics are used to calculate the first soft outputs.

Thus, a set of 3 recursion units (called a “tile” in a tile diagram as indicated in DFG) results in the calculation of A- and B-metrics required to produce soft output LLR values. Note that the LLR values are calculated in a reverse manner. At symbol periods to , the order of soft output values is reversed [18]. In a turbo coding scheme, the interleaver reordering can be exploited to perform this reversing. If this last LLR reversing step is done by the MAP decoder, then the latency of this core is symbol periods. If the reversing is done by the interleaver, the decoding delay is symbol periods.

The above process (tile) is repeated until the end of the frame. The set of tiles in the tile graph (or DFG) represents a single flow passing through the data of a frame (or trellis). Hence, the proposed structure is designated as a single-flow structure (SFS). Structures where the above process occurs in both directions are called double-flow structures (DFSs) and will be discussed later in this paper.

In single-flow structures the benefit of structures with compared to is the use of a single RAM memory as opposed to two RAM memories of half size each, required in the latter structures. Although the memory requirements as well as other requirements are the same, less complex datapath circuits are needed in the (SFS, = 0, ) case.

Also the (SFS, = , ) structure experiences smaller decoding delay than the (SFS, = 2, ); LLR reversing is performed in the interleaver rather than in the MAP decoder. Thus, for SFS, = 0 is clearly the best choice. For (SFS, = 0, ) decoders,

Note that, when , both the storage requirements and the decoding delay are increased without any reduction in the number of DFUs; therefore architectures with are not suggested.

For , the storage requirements and decoding delay are decreasing as decreases, but the number of required DFUs is increasing. The decoding delay is when MAP decoder reverses the LLR outputs and when the interleaver reverses the LLR outputs.

##### 4.3. Double-Flow Structure SISO-MAP Decoders

The DFG and the state metric calculation modules for a (DFS, , , shift = /4) structure are shown in Figures 9 and 10, respectively. The double-flow structures use all of the concepts of single flows, in the sense that they also have parameters with the interesting values being , and and 1/2, 1 and 2. Some DFSs have the benefit of operating at twice the speed without doubling the hardware requirements.

For example, when the RAM memories used by the data flow units to store the state metrics A and B of the SISO-MAP algorithm operate in the write mode for half of the time. The remaining half of the time they operate in read mode. This read period can be exploited by a second flow which runs over half of the frame data in the opposite direction. This is illustrated in Figure 9 for = , = 1, which shows a structure that decodes a frame of = 8 symbols in half time.

Figure 13 depicts a DFS, with , = 1/2, . Note that for RAM sharing between the two data flows a shift of is required in this case, as opposed to shift used in other DFSs. This means that the appropriate shift is not constant but depends on the other parameters of the structure. In this diagram it is clearly indicated that two flows are operating splitting the frame into two symbol blocks. The first flow is using 4 DFUs, one forward, one backward, and two dummy metric DFUs. The same resources are employed by the second flow, therefore in total 8 DFUs are required.

Double-flow structures lead to efficiency only for = , since only for this value of memory sharing can take place. For all other values of the memory requirements and the number of DFUs double for double-flow structures, with the added advantage of doubling the MAP decoder speed, since the frame decoding is split between the two flows. For (DFS,, , shift) decoders,

Setting above 1 results in prohibitively large decoding delay. Even if is only 3, the decoding delay is , which is unacceptably large; therefore only architectures for are of interest.

##### 4.4. Pointer Technique-Based SISO-MAP Decoders

Figures 11 and 12 illustrate the DFG and the SM-CM for a (SFS/PNT, , , Ptrs) structure.

In all of the above structures all the necessary metrics are stored in the RAM memories. Another idea is to selectively store some of the state metrics and recompute all the others. This idea of selective recomputation leads to reduced storage requirements.

The pointers are nothing but the actual state metric values, which are stored in and extracted from a storage device whenever necessary. The values to be stored in pointers are depicted with small white circles in Figure 11.

In this structure, during time to , after dummy metric recursion steps, the obtained valid value is stored in a register. This value is also used to initialise a backward recursion, which operates during the symbol periods to . As it progresses, this backward B-metric DFU stores in registers the values and and in the RAM memory the last state metric values from down to . Thus, only state vectors are stored in RAM. An extra backward DFU will then extract the pointer to recalculate the other B-metric values down to and store them in the memory. This procedure will be repeated until the calculation of all B-metric vectors in the window. As the backward metrics are calculated, a forward DFU computes the corresponding A-state metrics.

As can be observed from Figure 11 at any given time only state metric vectors of size each are stored in the RAM. This means the storage requirements of this structure are a RAM of size bits and 3 pointer vectors of bits each.
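The selective recomputation idea can be sketched in a few lines. The model below assumes that the pointers split a window of `L` stages into `n_ptrs + 1` equal blocks, which is consistent with the 75% RAM saving quoted for three pointers; the generic `step` callback stands in for one backward recursion step, and all names are hypothetical.

```python
def backward_with_pointers(b_init, step, L, n_ptrs):
    """Backward recursion over one window that stores only pointers plus one block.

    b_init : metric at stage L (for example, delivered by the dummy recursion)
    step   : step(b_next, t) -> metric at stage t, given the metric at stage t + 1
    Returns (metrics for stages 0..L-1, peak RAM entries, number of pointers).
    """
    if n_ptrs < 1 or L % (n_ptrs + 1) != 0:
        raise ValueError("need n_ptrs >= 1 and L a multiple of n_ptrs + 1")
    block = L // (n_ptrs + 1)
    pointers = {L: b_init}  # the valid starting value is the first pointer
    ram = {}
    b = b_init
    # First pass: run down the whole window, keeping block boundaries in
    # registers and only the lowest block in RAM.
    for t in range(L - 1, -1, -1):
        b = step(b, t)
        if t < block:
            ram[t] = b
        elif t % block == 0 and t >= 2 * block:
            pointers[t] = b
    results, peak_ram = dict(ram), len(ram)
    # Later passes: recompute one block at a time from its pointer.
    for top in range(2 * block, L + 1, block):
        ram = {}
        b = pointers[top]
        for t in range(top - 1, top - block - 1, -1):
            b = step(b, t)
            ram[t] = b
        peak_ram = max(peak_ram, len(ram))
        results.update(ram)
    return results, peak_ram, len(pointers)

# Stand-in recursion step (not a real state metric update).
step = lambda b_next, t: 0.5 * b_next + t
results, peak_ram, num_pointers = backward_with_pointers(1.0, step, L=8, n_ptrs=3)
print(peak_ram, num_pointers)
```

Each metric above the lowest block is computed twice, once in the first pass and once during recomputation, which is the extra DFU work traded for the smaller RAM.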

For pointer-based SFS and the storage of state metrics starts right after the dummy metric calculation has computed a valid metric, so there is no time space for pointers. is equally well with the case regarding pointers in the sense that the trade-offs mentioned before are not affected.

For (SFS/PNT, , , shift = −, ) structures,

Increasing (the number of pointers) reduces the RAM memory but has no effect on the number of DFUs. The size of the required memory is given by . Of course, some registers need to be allocated for each pointer, whose size is . The decoding delay is only affected by the number of pointers used if interleaver LLR reversing is employed. In this case the decoding delay is and holds unless .

If the structures (SFS/PNT, , = 1, shift = −, = 3) with , are compared with the corresponding SFS structures that use no pointers, it is observed that using three pointers and one more DFU results in 75% reduction in the required RAM memory. The decoder delay of the pointer-based structures when interleaver LLR reversing is used is increased by 0.75 for the first two cases .

For , if , the pointer-based techniques (single flow and double flow) are not applicable. If , the resultant architecture does not reduce memory size but has higher throughput (computation speeds up times). In this case, (12a) and (12c) hold as they are, but in (12b) the number of DFUs is given by .

is the best choice for pointer-based SFS, whereas = is a better choice for pointer-based DFS, since the latter allows memory sharing. For (DFS/PNT, , , shift = −, ), Again, note that increasing reduces the storage requirement but not the data flow unit count of the structures.

Figure 13 shows a complete SISO-MAP decoder VLSI architecture using a double flow with = , = 1, 3Ptrs, shift = . This architecture requires 8 DFUs (2 dummy, 2 pointer saving, and 2 forward and 2 backward) and two RAM memories each and has a decoding delay of . With selective recomputation only a fraction of the required state metric vectors are stored in the RAM memories. If pointers are used, then only of the state metrics need to be stored in the RAM memory. The remaining is recalculated in blocks of .

#### 5. Conclusions

The VLSI architectures of space-time turbo trellis coding decoders as well as of a set of SISO-MAP component channel decoders used in turbo coding are proposed and investigated. The space-time turbo code receiver, as opposed to binary turbo codes, is based on nonbinary trellises, which imposes a number of differences. Apart from the fact that channel estimation and MRRC combining are needed to cope with the multiple diversity received symbols, the frame formation block is different from that in a traditional binary turbo receiver, since in the latter the systematic and parity redundant information can be separated and stored in separate memory banks, whereas in the former it cannot, so the whole symbol is stored in a memory bank. In the STTuTC receiver, the equalised symbols are demultiplexed and stored in memory banks, zeros are inserted at even or odd locations, and one memory bank is sent to the decoder each time. This ensures that the SISO-MAP decoder accepts the information from the corresponding encoder. The symbols are demapped by a symbol hard or soft demapper.

The proposed STTuTC architectures are based on a different ACSO unit from that of the binary turbo codes. This ACSO unit can handle the more than two state and transition metric pairs iteratively. In nonbinary trellises there are more than two state and transition metric pairs, because of the increased number of transitions between states. Thus, the ACSO unit must either grow or work iteratively. Many ACSO units can be used to calculate all state metrics in one step without the need for storing those values in a state metric memory. Thus, the state metric calculation modules are considerably different from those in the binary turbo decoders. The LLR calculation module is also different because more than one LLR value needs to be calculated in every decoding step. “*f*” ACSO units appropriately connected, rather than two, are required to calculate the appropriate number of LLR values for all possible symbols. This also means that the number of soft LLR outputs the SISO-APP demapper delivers every time depends on the “*f*” possible values. That is, *f* − 1 LLRs must be output, so the LLR memory size is different from the binary turbo decoder case.

A parameter set (FS, , , shift, ) helped in the comparison and led to defining equations for many different cases: single-flow, double-flow, and pointer-based techniques. Table 3 gives a list of formulae, which determine the memory size, number of deployed DFUs, and decoding delay of the most efficient techniques among those discussed in this paper.

#### References

- C. Berrou, A. Glavieux, and P. Thitimajshima, “Near Shannon limit error-correcting coding and decoding: turbo-codes,” in *Proceedings of the IEEE International Conference on Communications (ICC '93)*, vol. 2, pp. 1064–1070, Geneva, Switzerland, May 1993.
- C. Berrou and A. Glavieux, “Near optimum error correcting coding and decoding: turbo codes,” *IEEE Transactions on Communications*, vol. 44, no. 10, pp. 1261–1271, 1996.
- J. Hagenauer and P. Hoher, “A Viterbi Algorithm with soft-decision outputs and its applications,” in *Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM '89)*, vol. 3, pp. 1680–1686, 1989.
- C. Berrou, P. Adde, E. Angui, and S. Faudeil, “Low complexity soft-output viterbi decoder architecture,” in *Proceedings of the IEEE International Conference on Communications (ICC '93)*, pp. 737–740, May 1993.
- L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding of linear codes for minimizing symbol error rate,” *IEEE Transactions on Information Theory*, vol. 20, no. 2, pp. 284–287, 1974.
- D. Tujkovic, “Recursive space-time trellis codes for turbo coded modulation,” in *Proceedings of the IEEE Global Telecommunications Conference (GLOBECOM '00)*, San Francisco, Calif, USA, 2000.
- D. Cui and A. M. Haimovich, “Performance of parallel concatenated space-time codes,” *IEEE Communications Letters*, vol. 5, no. 6, pp. 236–238, 2001.
- W. Firmanto, B. Vucetic, J. Yuan, and Z. Chen, “Space-time turbo trellis coded modulation for wireless data communications,” *Eurasip Journal on Applied Signal Processing*, vol. 2002, no. 5, pp. 459–470, 2002.
- Y. Hong, J. Yuan, Z. Chen, and B. Vucetic, “Space-time turbo trellis codes for two, three, and four transmit antennas,” *IEEE Transactions on Vehicular Technology*, vol. 53, no. 2, pp. 318–328, 2004.
- W. Firmanto, Z. Chen, B. Vucetic, and J. Yuan, “Design of space-time turbo trellis coded modulation for fading channels,” in
*IEEE Global Telecommunicatins Conference (GLOBECOM '01)*, pp. 1093–1097, November 2001. View at Scopus - G. Masera, G. Piccinini, M. R. Roch, and M. Zamboni, “VLSI architectures for turbo codes,”
*IEEE Transactions on Very Large Scale Integration Systems*, vol. 7, no. 3, pp. 369–379, 1999. View at Publisher · View at Google Scholar · View at Scopus - Y. Tong, T. H. Yeap, and J. Y. Chouinard, “VHDL implementation of a turbo decoder with log-MAP-based iterative decoding,”
*IEEE Transactions on Instrumentation and Measurement*, vol. 53, no. 4, pp. 1268–1278, 2004. View at Publisher · View at Google Scholar · View at Scopus - C. M. Wu, M. D. Shieh, C. H. Wu, Y. T. Hwang, and J. H. Chen, “VLSI architectural design tradeoffs for sliding-window Log-MAP decoders,”
*IEEE Transactions on Very Large Scale Integration Systems*, vol. 13, no. 4, pp. 439–447, 2005. View at Publisher · View at Google Scholar · View at Scopus - C. Chaikalis and J. Noras, “Reconfigurable turbo decoding for 3G applications,”
*Signal Processing*, vol. 84, no. 10, pp. 1957–1972, 2004. View at Publisher · View at Google Scholar - C. Chaikalis, “Implementation of a reconfigurable turbo decoder in 3GPP for flat rayleigh fading,”
*Digital Signal Processing*, vol. 18, no. 2, pp. 189–208, 2008. View at Publisher · View at Google Scholar · View at Scopus - L. Yonghui, B. Vucetics, Z. Qishan, and H. Yi, “Assembled space-time turbo trellis codes,”
*IEEE Transactions on Vehicular Technology*, vol. 54, no. 5, pp. 1768–1772, 2005. View at Publisher · View at Google Scholar · View at Scopus - M. M. Mansour and N. R. Shanbhag, “VLSI Architectures for SISO-APP Decoders,”
*IEEE Transactions on Very Large Scale Integration Systems*, vol. 11, no. 4, pp. 627–650, 2003. View at Publisher · View at Google Scholar · View at Scopus - E. Boutillon, W. J. Gross, and P. G. Gulak, “VLSI architectures for the MAP algorithm,”
*IEEE Transactions on Communications*, vol. 51, no. 2, pp. 175–185, 2003. View at Publisher · View at Google Scholar · View at Scopus - K. C. Nguyen, U. Gunawardana, and R. Liyana-Pathirana, “Optimization of sliding window algorithm for space-time turbo trellis codes,” in
*Proceedings of the International Symposium on Communications and Information Technologies (ISCIT '07)*, pp. 1182–1186, October 2007. View at Publisher · View at Google Scholar · View at Scopus - A. Worm, H. Lamm, and N. Wehn, “VLSI architectures for high-speed MAP decoders,” in
*Proceedings of the 14th International Conference on VLSI Design*, pp. 446–453, January 2001. View at Scopus - C. Schurgers, F. Catthoor, and M. Engels, “Memory optimization of MAP turbo decoder algorithms,”
*IEEE Transactions on Very Large Scale Integration Systems*, vol. 9, no. 2, pp. 305–312, 2001. View at Publisher · View at Google Scholar · View at Scopus - S. S. Pietrobon, “Implementation and performance of a turbo/MAP decoder,”
*International Journal of Satellite Communications*, vol. 16, no. 1, pp. 23–46, 1998. View at Google Scholar · View at Scopus - A. Worm, H. Michel, F. Gilbert, G. Kreiselmaier, M. Thul, and N. Wehn, “Advanced implementation issues of turbo decoders,” in
*Proceedings of the 2nd International Symposium on Turbo Codes*, pp. 351–354, Brest, France, September 2000. - S. ten Brink, “Convergence of iterative decoding,”
*Electronics Letters*, vol. 35, no. 13, pp. 1117–1118, 1999. View at Google Scholar · View at Scopus - D. Divsalar, S. Dolinar, and F. Pollara, “Iterative turbo decoder analysis based on density evolution,”
*IEEE Journal on Selected Areas in Communications*, vol. 19, no. 5, pp. 891–907, 2001. View at Publisher · View at Google Scholar · View at Scopus - S. ten Brink, “Convergence behavior of iteratively decoded parallel concatenated codes,”
*IEEE Transactions on Communications*, vol. 49, no. 10, pp. 1727–1737, 2001. View at Publisher · View at Google Scholar · View at Scopus