Abstract

The VLSI implementation of the SISO-MAP decoders used in traditional iterative turbo coding has been investigated in the literature. In this paper, a complete architectural model of a space-time turbo code receiver, including its elementary decoders, is presented. These architectures are based on newly proposed building blocks such as a recursive add-compare-select-offset (ACSO) unit and A-, B-, Γ-, and LLR output calculation modules. Measurements of the complexity and decoding delay of several sliding-window-technique-based MAP decoder architectures, together with a proposed parameter set, lead to defining equations and a comparison between those architectures.

1. Introduction

The ability of turbo codes (TCs) to achieve very low BER approaching the Shannon limit is very attractive. These channel-capacity-approaching codes were proposed by Berrou et al. [1, 2]. Turbo coding involves the parallel concatenation of two or more component codes, and iterative decoding lets the component decoders operate on the data and exchange information in order to progressively reduce the error rate. The exchanged information takes the form of log-likelihood ratio (LLR) soft decisions and is exchanged between elementary decoders, which apply either the soft-output Viterbi algorithm (SOVA) [3, 4] or the maximum a posteriori (MAP) probability algorithm [5].

The principles of iterative turbo decoding can be combined with those of space-time coding, resulting in a bandwidth-efficient, low-error-rate channel coding scheme named space-time turbo trellis coding (STTuTC) [6–10]. The latter scheme combines the impressive coding gain of turbo codes with the diversity gain of space-time codes to obtain a very power-efficient system. These power gains are important for high-performance communication systems, particularly in scenarios that demand operation at low signal-to-noise ratio.

Despite the complexity reductions proposed in the literature, iterative turbo decoders still have prohibitively high implementation complexity and suffer from large decoding delay, motivating researchers to seek efficient implementations [11–19]. The sliding window technique (SWT) presented in [17–21] enables early acquisition of the state metric values without having to scan the whole trellis frame in one direction before scanning it in the other; this results in reduced elementary decoder latency as well as smaller state metric memory requirements.

2. Architectural Overview

2.1. Turbo Transmitter

A space-time turbo trellis code is a parallel concatenation scheme using two identical elementary encoders to operate on an information message $C = [c_0, c_1, \ldots, c_t, \ldots, c_{P-1}]$, where each symbol consists of $b$ bits [8, 9]. The same message is sent to both component encoders, but each one receives this information in a different order through an interleaver.

Elementary encoders using recursive space-time trellis coding (STTC) benefit from interleaver gain and iterative decoding [6–8]. These encoders add redundancy to the encoded messages, and the second sequence is deinterleaved back to its original order. The two sequences of codewords are multiplexed and truncated to preserve bandwidth efficiency and form a single sequence of length $P$ denoted by $X = [X_0, X_1, \ldots, X_t, \ldots, X_{P-1}]$, where each space-time vector $X_t = [x_t^{1,p}, x_t^{2,p}, \ldots, x_t^{i,p}, \ldots, x_t^{K,p}]^T$ consists of $K$ $b$-bit modulated symbols. In vector $X$, half of the symbols (odd locations) come from component encoder 1 ($p = 1$) and the other half (even locations) come from component encoder 2 ($p = 2$) [6]. The sequence is forwarded for carrier modulation and transmission, since baseband modulation occurs within the encoders. The structure results in a long block code built from small-memory convolutional codes, a fact that enables the turbo code to approach the Shannon limit.

Note that the recursive STTCs are trellis based, meaning that during the encoding process a trellis path is created in each elementary encoder [7]. Let $S = [s_0, s_1, \ldots, s_t, \ldots, s_{P-1}, s_P]$ be a state row vector containing all the states in a trellis path chosen by the information message. The combined trellis paths are decomposed and exploited at the receiver side to decode the message.

2.2. Proposed Turbo Receiver

Consider the baseband MIMO receiver shown in Figure 1. The received frame $R = [R_0, R_1, \ldots, R_t, \ldots, R_{P-1}]$, where $R_t = [r_t^1, r_t^2, \ldots, r_t^j, \ldots, r_t^{\psi}]^T$, can be used to estimate the original sequence $C$. Using the baseband signal representation, the received noisy signal from antenna $j$ at time $t$ over a MIMO (possibly fading, when $h_t^{j,i} \neq 1$) Gaussian channel is given by
$$r_t^j = v_t^j + n_t^j = \sum_{i=1}^{K} h_t^{j,i}\, x_t^{i,p} + n_t^j, \tag{1}$$
where $v_t^j$ represents the noiseless version of the $r_t^j$ signal and $h_t^{j,i}$ denotes the channel coefficient between transmit antenna $i$ and receive antenna $j$ [7–9]. The receiver, after estimating the channel fading coefficients, separates the signals using maximal ratio receiver combining (MRRC), which also equalises the signals by removing the effect of the channel. With MRRC, costly joint detection of space-time symbols is avoided.

The equalised frame symbols are delivered to a soft demapper to calculate the soft channel output given by
$$H_t(s', s) = \ln \eta_t(s', s) = \ln e^{-E_s |R_t - V_t|^2 / 2\sigma^2} = -\frac{E_s |R_t - V_t|^2}{2\sigma^2}. \tag{2}$$
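
As an illustration, the soft channel output of (2) reduces to a scaled negative squared Euclidean distance. The following minimal Python sketch (function and variable names are our own, not part of the architecture) computes $H_t$ for one candidate noiseless vector:

```python
import numpy as np

def soft_channel_output(R_t, V_t, Es, sigma2):
    """Soft channel output H_t(s', s) of (2): -Es*|R_t - V_t|^2 / (2*sigma^2).

    R_t: received (equalised) vector for one symbol period.
    V_t: candidate noiseless vector for the transition (s' -> s).
    Illustrative floating-point model, not the fixed-point datapath.
    """
    d2 = np.sum(np.abs(np.asarray(R_t) - np.asarray(V_t)) ** 2)  # squared distance
    return -Es * d2 / (2.0 * sigma2)
```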

The received symbols have already been prescaled by $E_s/2\sigma^2$ by an SNR scaler in the analogue domain [12]. A frame formation block (FBB) separates the even and odd symbols and inserts zeros, ensuring that each log-MAP decoder core receives information only from the corresponding component encoder.

The STTuTC decoder at the algorithmic level consists of two soft-in soft-out (SISO) component decoders, which exchange soft information and, progressively through many iterations, produce a better estimate of the values of the information symbols [11, 12]. Since the two component decoders do not operate on data at the same time, a single silicon IP core can be used to implement both. As illustrated in Figure 1, the decoder contains a symbol-based SISO component core that applies the log-MAP algorithm to a nonbinary trellis. It also uses two memories, one for storing the exchanged soft information and another to perform symbol interleaving/deinterleaving.

2.3. MAP and Log-MAP Algorithms

Maximum likelihood decoding using the Viterbi algorithm determines the most likely path through the trellis and from that the information symbol sequence, whereas MAP decoding determines the latter sequence by considering each information symbol $c_t$ independently, using the channel observations $R$. The MAP algorithm computes the a posteriori LLR soft decisions (3) about the value of the symbols in the information message [6, 8]:
$$\lambda^f(u_t \mid R) = \ln\frac{\Pr(u_t = f \mid R)}{\Pr(u_t = 0 \mid R)} = \ln\frac{\sum_{(s',s):\,u_t = f} \beta_t(s)\, \gamma_t^f(s',s)\, \alpha_{t-1}(s')}{\sum_{(s',s):\,u_t = 0} \beta_t(s)\, \gamma_t^0(s',s)\, \alpha_{t-1}(s')}, \tag{3}$$
where $f = 0, 1, 2, \ldots, 2^b - 1$. This equation expresses, in the logarithmic domain, the excess probability of a symbol $u_t$ being $f$ over the probability of its being equal to the reference symbol $0$ [11, 13]. The log-MAP decoder calculates the LLRs of all the possible modulation symbols in the constellation diagram using the forward $A_t(s) = \ln(\alpha_t(s))$ and backward $B_t(s) = \ln(\beta_t(s))$ state metric probabilities as well as the branch metric probabilities $\Gamma_t^f(s',s) = \ln(\gamma_t^f(s',s))$ in the logarithmic domain:
$$A_t(s) = \ln\sum_{s'=0}^{2^G-1}\sum_{f=0}^{2^b-1} e^{\Gamma_t^f(s',s) + A_{t-1}(s')}, \qquad B_{t-1}(s') = \ln\sum_{s=0}^{2^G-1}\sum_{f=0}^{2^b-1} e^{\Gamma_t^f(s',s) + B_t(s)}. \tag{4}$$

The $A_t(s)$ metric represents the logarithmic probability that state $s$ at stage $t$ has been reached from the beginning of the trellis. Similarly, the $B_{t-1}(s')$ metric represents the logarithmic probability that state $s'$ at stage $t-1$ has been reached from the end of the trellis [13, 15]. The branch metric probabilities can be computed using
$$\gamma_t^f(s',s) = \Pr(R_t, s' \to s) = \Pr(R_t \mid s' \to s)\,\Pr(s' \to s) = \Pr(R_t \mid V_t)\,\frac{\Pr_t(f)}{\Pr_t(0)} = \frac{1}{2\pi\sigma^2}\, e^{-E_s|R_t - V_t|^2/2\sigma^2}\,\frac{\Pr_t(f)}{\Pr_t(0)} = C_1\, \eta_t(s',s)\,\frac{\Pr_t(f)}{\Pr_t(0)}, \tag{5}$$
where $C_1 = 1/2\pi\sigma^2$. The above equation gives the probability of a transition at time $t$ from state $s'$ to state $s$ in the trellis diagram [13]. Translating (5) into the logarithmic domain [15],
$$\Gamma_t^f(s',s) = \ln\gamma_t^f(s',s) = C_1' + H_t(s',s) + \ln\frac{\Pr_t(f)}{\Pr_t(0)}, \tag{6}$$
where $C_1' = \ln(C_1)$ is a constant, the $H_t(s',s)$ term is the soft channel output, and the $\ln(\Pr_t(f)/\Pr_t(0))$ term is the a priori information coming from the other component decoder:
$$\begin{aligned}
\lambda^f(u_t \mid R) &= \ln\frac{\sum_{(s',s):\,u_t=f} \beta_t(s)\,\gamma_t^f(s',s)\,\alpha_{t-1}(s')}{\sum_{(s',s):\,u_t=0} \beta_t(s)\,\gamma_t^0(s',s)\,\alpha_{t-1}(s')} \\
&= \ln\sum_{(s',s):\,u_t=f} e^{B_t(s) + \Gamma_t^f(s',s) + A_{t-1}(s')} - \ln\sum_{(s',s):\,u_t=0} e^{B_t(s) + \Gamma_t^0(s',s) + A_{t-1}(s')} \\
&= \max^*_{(s',s):\,u_t=f}\left[B_t(s) + \Gamma_t^f(s',s) + A_{t-1}(s')\right] - \max^*_{(s',s):\,u_t=0}\left[B_t(s) + \Gamma_t^0(s',s) + A_{t-1}(s')\right],
\end{aligned} \tag{7}$$
where $\max^*$ denotes the Jacobian logarithm, $\max^*(x, y) = \max(x, y) + \ln(1 + e^{-|x-y|})$, applied recursively over the indicated set.

Finally, (7) gives the elementary decoder output, which is the a posteriori LLR expressed in terms of logarithmic-domain state and branch metrics [13].
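
To make the data flow of (3)-(7) concrete, the sketch below implements a floating-point reference model of the symbol-based log-MAP decoder in Python. The exact $\ln\sum e^{(\cdot)}$ form is used (the recursive $\max^*$ of the hardware computes the same quantity); the array shapes, the terminated-trellis initialisation, and all names are our assumptions, not the paper's RTL:

```python
import numpy as np

def log_map_decode(Gamma, num_states, b):
    """Floating-point reference model of the symbol-based log-MAP decoder
    of (3)-(7). Gamma[t][s_prev, s, f] holds the branch metric
    Gamma_t^f(s', s), with a very large negative value for transitions
    that do not exist. Terminated mode is assumed (trellis starts and
    ends at state 0)."""
    P = len(Gamma)
    M = 2 ** b
    NEG = -1e30

    # Forward recursion (4): A_t(s) = ln sum_{s',f} exp(Gamma + A_{t-1}(s'))
    A = np.full((P + 1, num_states), NEG)
    A[0, 0] = 0.0
    for t in range(P):
        for s in range(num_states):
            terms = Gamma[t][:, s, :] + A[t][:, None]
            A[t + 1, s] = np.logaddexp.reduce(terms.ravel())

    # Backward recursion (4): B_{t-1}(s') = ln sum_{s,f} exp(Gamma + B_t(s))
    B = np.full((P + 1, num_states), NEG)
    B[P, 0] = 0.0
    for t in range(P, 0, -1):
        for sp in range(num_states):
            terms = Gamma[t - 1][sp, :, :] + B[t][:, None]
            B[t - 1, sp] = np.logaddexp.reduce(terms.ravel())

    # Soft outputs (7): LLR of each symbol value f against reference 0
    llr = np.zeros((P, M))
    for t in range(P):
        num = np.empty(M)
        for f in range(M):
            terms = A[t][:, None] + Gamma[t][:, :, f] + B[t + 1][None, :]
            num[f] = np.logaddexp.reduce(terms.ravel())
        llr[t] = num - num[0]        # lambda^0 is zero by construction
    return llr
```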

3. Soft-In Soft-Out-MAP Component Decoder

3.1. Decoder Module Architecture

The proposed elementary decoder core is illustrated in Figure 2. The main modules within this decoder are the branch metric calculation module (BM-CM), the A- and B-state metric calculation modules (A-SM-CM and B-SM-CM, resp.), the LLR calculation module (LLR-CM), and a large A-metric RAM to temporarily store metric values.

The BM-CM calculates all possible normalised branch metric values (see Section 3.2) and stores them in a register file. The latter consists of $M = 2^b$ registers to hold the $M$ possible branch metric values. These metrics are fed to the A-SM-CM, B-SM-CM, and LLR-CM. The A-metric CM scans the entire trellis diagram in the forward direction, starting from stage zero and ending at stage $P$, calculating all A-state metrics and storing them in the A-metric RAM.

After this first trellis pass, the B-metric CM and the LLR-CM simultaneously compute the B-state metrics and the LLR outputs, respectively. The B-SM-CM scans the entire trellis in the backward direction, from stage $P$ to $0$. The B-metrics do not need to be stored in a memory, since every time a B-metric is calculated the corresponding A-metric is extracted from the A-metric RAM and the LLR calculation is performed.

Therefore, only the A-state metrics are stored in a RAM, the size of which must be $(P+1)\cdot 2^G \cdot w_L$ bits for continuous-mode decoders. In a decoder operating in the terminated mode, where the trellis path begins and ends at state $s^{(0)}$, that is, $s_0 = s^{(0)}$ and $s_P = s^{(0)}$, the memory requirement is slightly reduced to $[(P-3)\cdot 2^G + 2\cdot 2^b + 2]\, w_L$, whereas in the truncated operating mode (i.e., $s_0 = s^{(0)}$ must hold for every frame, but there is no termination condition) the storage requirement is $[(P-1)\cdot 2^G + 2^b + 1]\, w_L$, where $w_L$ is the word length of the data format used to represent the internal data of the MAP decoder. Table 1 shows the complexity and decoding delay of a log-MAP decoder operating in the different modes. The metrics are calculated by data flow units (DFUs) that operate in decoding steps. As can be seen, the log-MAP decoder results in very high memory requirements and long elementary decoder latencies.
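
For reference, the mode-dependent A-metric RAM sizes above can be captured in a small helper; a minimal sketch with hypothetical names:

```python
def a_metric_ram_bits(P, G, b, wL, mode="continuous"):
    """A-metric RAM size in bits for the three operating modes of
    Section 3.1. P: frame length, 2**G: trellis states, b: bits per
    symbol, wL: internal word length."""
    if mode == "continuous":
        return (P + 1) * 2 ** G * wL
    if mode == "terminated":      # trellis starts and ends at s(0)
        return ((P - 3) * 2 ** G + 2 * 2 ** b + 2) * wL
    if mode == "truncated":       # known start state, no termination
        return ((P - 1) * 2 ** G + 2 ** b + 1) * wL
    raise ValueError(f"unknown mode: {mode}")
```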

3.2. Branch Metric Calculation Module

The branch metric calculation module computes the normalised branch metrics $\Gamma_t^f(s',s)$. The branch metric appears in both the numerator and the denominator of the a posteriori LLR formula, so any constants cancel out and play no role in the calculation:
$$\Gamma_t^f(s',s) = H_t(s',s) + \ln\frac{\Pr_t(f)}{\Pr_t(0)}. \tag{8}$$

The module receives as inputs the soft channel output $H_t(s',s)$ and the a priori information from the other component decoder, and computes all possible branch metric values as shown in Figure 3.
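
Behaviourally, the BM-CM of (8) amounts to a vector addition of the soft channel outputs and the a priori LLRs; a minimal sketch, with the register file modelled as the returned array:

```python
import numpy as np

def branch_metrics(H_t, apriori_llr):
    """Normalised branch metrics of (8) for all M = 2**b symbol values.

    H_t[f]        : soft channel output for a transition labelled f.
    apriori_llr[f]: ln(Pr_t(f) / Pr_t(0)) from the other component
                    decoder (apriori_llr[0] == 0).
    The returned array models the M-entry register file of the BM-CM.
    """
    return np.asarray(H_t, dtype=float) + np.asarray(apriori_llr, dtype=float)
```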

3.3. State Metric Calculation Module

In the literature, the proposed architectures that recursively calculate the state metrics are based on the so-called ACSO (add-compare-select-offset) processing unit, equivalent to the add-max*() unit presented in [18]. It is responsible for the computation of a state metric based on the previous-stage state metrics and the branch metrics arriving at the state in question.

Several different but similar architectures have been proposed. The diagram in Figure 4 suggests a new type of ACSO unit which, instead of employing more multiplexers to cope with the more than two input state metrics of nonbinary trellises, recursively accepts the input state and branch metrics and calculates the output state metric over a number of cycles. A look-up table can be used to store precalculated correction factors; its quantisation is discussed in [18, 22].

This module requires $M$ clock cycles to calculate the output state metric. In the first clock cycle, the feedback register is reset, and $A_{t-1}(s_1)$ and $\Gamma_t^0(s_1,s)$ are applied at the inputs. Also, the SW1 switch is connected to the zero register. The result is that $a_1 = A_{t-1}(s_1) + \Gamma_t^0(s_1,s)$ is stored in the feedback register. In the remaining cycles, the LUT output is connected through SW1 to the output adder. In the second cycle, $a_2 = A_{t-1}(s_2) + \Gamma_t^1(s_2,s)$ is applied at the input, so $\max^*(a_1, a_2)$ is stored in the feedback register. Similarly, $\max^*(a_3, \max^*(a_1, a_2))$ is stored in the third cycle, whereas the output state metric, equal to $\max^*(a_4, \max^*(a_3, \max^*(a_1, a_2)))$, is ready at the output port during the fourth clock cycle. A-metric calculation is illustrated, but the architecture works equally well for B-metric calculation in a backward recursion. This unit will be referred to as the recACSO unit.
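
The cycle-by-cycle behaviour of the recACSO unit can be modelled as an iterated $\max^*$ fold, with the LUT replaced by the exact correction term $\ln(1 + e^{-|x-y|})$. A behavioural Python sketch (names and calling convention are our own):

```python
import numpy as np

def max_star(x, y):
    """max*(x, y) = max(x, y) + ln(1 + e^{-|x - y|}); the second term is
    the correction factor that the ACSO unit reads from its LUT."""
    return max(x, y) + np.log1p(np.exp(-abs(x - y)))

def rec_acso(prev_metrics, branch_metrics):
    """Behavioural model of the recACSO unit: each 'clock cycle' folds
    one (previous state metric + branch metric) candidate into the
    feedback register. After M cycles the register holds the new state
    metric. Input ordering follows the Section 3.3 description."""
    acc = prev_metrics[0] + branch_metrics[0]       # cycle 1: SW1 selects zero
    for a_prev, gamma in zip(prev_metrics[1:], branch_metrics[1:]):
        acc = max_star(acc, a_prev + gamma)         # cycles 2..M: SW1 selects LUT
    return acc

# Example: with M = 4 candidates the unit needs 4 cycles, matching the
# cycle-by-cycle walk-through above.
```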

The forward/backward recursion computations may lead to an overflow. If a state metric value grows without bound, the finite word length $w_L$ will not be sufficient to hold it and an overflow will occur. Since the max*() operation is shift invariant, a global shift of the state metric values does not change the LLR output: it is the difference between the state metrics, not their absolute values, that matters. Rescaling of the state metrics can therefore be readily performed to avoid overflow. The rescaling technique is the same as that used in other SISO log-MAP implementations [11, 18, 23].

The precision of the state metrics, which depends on their dynamic range, determines the word length $w_L$.

Thus, $2^G$ recACSO units can be employed to calculate all the state metrics of a trellis stage in one step, as depicted in Figure 5 for a $2^G$-state trellis. This module receives the possible branch metrics and the previous-stage state metrics and computes all the state metrics of the current stage. Configuring the module this way speeds up computation, since it avoids reading the previous-stage state metrics from the RAM. At the input of the module, two selectors drive the recACSO units with the appropriate inputs.

3.4. LLR Output Calculation Module

The LLR-CM depicted in Figure 6 is responsible for computing the output reliability estimates $\lambda^f(u_t \mid \mathbf{R})$ from the state and branch metric calculations according to
$$\lambda^f(u_t \mid \mathbf{R}) = \max^*_{(s',s):\,u_t=f}\left[B_t(s) + \Gamma_t^f(s',s) + A_{t-1}(s')\right] - \max^*_{(s',s):\,u_t=0}\left[B_t(s) + \Gamma_t^0(s',s) + A_{t-1}(s')\right]. \tag{9}$$

A careful observation of (9) reveals a close relation to the recACSO unit calculation. The LLR computation has two terms that can be computed separately and then subtracted. If $A_{t-1}(s')$ and $B_t(s)$ are added in advance, the recACSO unit can be used to calculate the $\max^*$ terms for $f = 0, 1, \ldots, 2^b - 1$. Because the $\max^*$ term for $f = 0$ has to be subtracted from all the other LLRs, an inverter is included to negate this term.

As already mentioned, $\lambda^0(u_t \mid \mathbf{R})$ need not be calculated, since zero is the reference symbol and its LLR is therefore zero. The remaining $2^b - 1$ log-likelihood ratios are computed by the proposed architecture and output by the symbol-based log-MAP component decoder.
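
Putting the pieces together, the LLR-CM behaviour of (9) amounts to one $\max^*$ accumulation per symbol value followed by subtraction of the $f = 0$ term; a behavioural sketch under an assumed edge-list trellis description:

```python
import numpy as np

def llr_outputs(A_prev, B_cur, Gamma, transitions, b):
    """Behavioural sketch of the LLR-CM of (9). 'transitions' is an
    assumed edge list of (s_prev, s, f) tuples describing one trellis
    stage; one max*-style accumulation runs per symbol value f, then
    the negated f = 0 term is added to all others."""
    M = 2 ** b
    acc = np.full(M, -np.inf)
    for sp, s, f in transitions:
        x = A_prev[sp] + Gamma[sp, s, f] + B_cur[s]
        acc[f] = np.logaddexp(acc[f], x)        # exact max* accumulation
    return (acc - acc[0])[1:]                   # the 2**b - 1 nonreference LLRs
```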

4. SWT-Based Architectures for the SISO-MAP Decoder

4.1. The Concept

SISO-MAP decoders suffer from very long latency and high memory requirements. A different organisation (scheduling) of the computation, together with exploitation of the convergence property [24–26] of the decoder, leads to a reduction in the latency and in the amount of state metrics that need to be stored. For this reason, the sliding window technique (SWT) has been proposed in the literature [18, 19].

SWTs are divided into single-flow structure (SFS), double-flow structure (DFS), and pointer-based (PNT) techniques. In these techniques the trellis, or transmitted frame, is divided into windows of size $L$, where $L$ is the convergence length [25]. Care must be taken to ensure that the whole frame divides exactly into an integer number of windows. The MAP decoder convergence property allows the recursive calculation of a state metric to yield a good approximation of the actual metric value after $L$ recursion steps, even if the initial value is a dummy one [8, 9].

Three important metrics for comparison of MAP decoders can be considered: (i) decoding delay (latency), (ii) memory requirements (or storage lifetime), and (iii) computational complexity (number of recursion modules). Several different schedulings of the MAP algorithm computation reveal trade-offs between these quantities. Since the window size is 𝐿 recursion steps, a DFU proposed in Section 3 can be employed to calculate all the state metric vectors in a window in 𝐿 steps. A DFU can perform forward, backward, or dummy metric calculations.

Defining the parameter set (technique, $\Pi$, $\pi$, shift, $\varepsilon$) will assist in the analysis of SWT-based MAP decoders. The "technique" element is the type of computation organisation, with possible values SFS, DFS, SFS/PNT, and DFS/PNT.

"$\Pi$" is the relative position between the coupled forward (A-) and backward (B-) metric recursive calculations and therefore determines when the B-metric calculation begins relative to the A-metric calculation. It is a continuous-valued variable, but only three values are meaningful choices: $\Pi = 0$ means that the backward calculation starts after the forward ends, $\Pi = 2L$ means that the forward calculation starts after the backward ends, and $\Pi = L$ means that the two calculations take place simultaneously and switch roles in the middle of the convergence block calculation.

"$\pi$" is the ratio of invalid (dummy) over valid metric recursive calculations. To compute the backward metrics before the end of the whole frame, a dummy metric calculation is required to estimate an intermediate value using the convergence property. Typically one dummy module is required for every A- and B-metric pair, but this can vary depending on the technique; $\pi$ captures the ratio of invalid metric calculations to valid A- and B-metric calculations.

The computation can take place in two directions (two flows) simultaneously, each flow calculating half of the frame. "Shift" is the time shift, in decoding steps, between the two flows. In the pointer technique, pointers are used to reduce memory requirements; "$\varepsilon$" is the number of pointers, whose presence indicates a pointer technique (PNT).

This set of SWT-based architectures has been fully investigated and some results are given in Table 2. These indicate the hardware requirements and decoding delay of some common implementation structures.

4.2. Single-Flow Structure SISO-MAP Decoders

Figures 7 and 8 illustrate, as an example, the data flow graph (DFG) of a single flow together with the SM-CM architecture of the (SFS, $\Pi = 0$, $\pi = 1$) structure, and the SM-CM architecture of the (SFS, $\Pi = 0$, $\pi = 1/2$) single-flow structure. The flow diagram explains the computation.

In the DFG the horizontal axis represents time, quantised in symbol periods, whereas the vertical axis represents the symbol number within the frame. In this specific diagram, a frame of $P = 4L$ symbols decoded by a trellis of $P + 1 = 4L + 1$ stages is presented, where $L$ is the convergence length of the decoder.

Three data flow units can be seen in the diagram. The dashed arrow represents $L$ recursions of a forward DFU calculating A-metrics, which are stored as they are calculated. The continuous arrow represents $L$ recursions of a backward DFU calculating B-metrics; this calculation is done simultaneously with the LLR soft output calculation. The dotted arrow represents $L$ recursions of a dummy metric DFU.

In the diagrams that follow, the dotted arrow always represents dummy metric recursive use of DFUs, whereas for forward and backward metric DFUs the arrow is continuous when the metric is not stored and dashed when it is stored in the memory. The dummy metric DFU yields a valid metric after the $L$ backward recursions.

That the dummy metric DFU works backwards can be seen from its projection on the vertical axis, which points downwards. DFUs whose projection on the vertical axis points upwards operate on the data in the forward direction. Also note that the maximum number of arrows a vertical line can cross in this diagram gives the number of DFUs employed in the structure [21].

Storage is also represented, by the shaded rectangular areas. Taking any one of these areas (they are all the same), the projection of the rectangle on the vertical axis gives the number of state metric vectors to be stored [18, 21], whereas the projection on the horizontal axis gives the time required to store them.

Decoding delay (or latency) is the horizontal distance between the acquisition curve (always $y_1 = t$) and the decoding curve in the DFG. Since the decoding curve in this structure is at $y_2 = t - 4L$, the decoding delay is $y_1 - y_2 = t - (t - 4L) = 4L$ symbol periods. The symbol period is denoted by $t_S$.

DFGs and tile graphs model the same thing: the resource-time scheduling of the recursions of the algorithm. A diagram is a DFG when viewed as a concrete graph, or a tile graph when viewed as a tile repetition [17]. On the right of Figure 7 a tile, which consists of 3 DFUs, is illustrated. This tile is repeated as many times as required to form the complete DFG.

Let us concentrate on the scheduling of the operations described by the DFG in Figure 7. During symbol periods $2L$ to $3L-1$, the dummy metric DFU (dotted arrow 1), starting from a zero vector assigned to state metric vector $B_{2L}$, computes invalid metrics from $B_{2L-1}$ down to $B_L$. This calculation results in a valid state metric vector $B_L$, because convergence has been reached. All other metrics calculated during these $L$ recursions are invalid. At the same time, the forward DFU (dashed arrow 2) calculates and stores a sequence of $L$ A-metrics from $A_0$ to $A_{L-1}$.

In the next $L$ symbol periods, from $3L$ to $4L-1$, the backward DFU (continuous arrow 3) comes into play to calculate $B_{L-1}$ to $B_0$. During the backward calculation, the A-state metrics from $A_{L-1}$ to $A_0$ are extracted from the memory and, together with the B-state metrics, are used to calculate the first $L$ soft outputs.

Thus, a set of 3 recursion units (called a "tile" in a tile diagram, as indicated in the DFG) results in the calculation of the $L$ A- and B-metrics required to produce $L$ soft output LLR values. Note that the LLR values are calculated in reverse order; at symbol periods $4L$ to $5L-1$, the order of the soft output values is reversed [18]. In a turbo coding scheme, the interleaver reordering can be exploited to perform this reversal. If this last LLR reversing step is done by the MAP decoder, the latency of the core is $4L$ symbol periods; if the reversal is done by the interleaver, the decoding delay is $3L$ symbol periods.

The above process (tile) is repeated until the end of the frame. The set of tiles in the tile graph (or DFG) represents a single flow passing through the data of a frame (or trellis). Hence, the proposed structure is designated as a single-flow structure (SFS). Structures where the above process occurs in both directions are called double-flow structures (DFSs) and will be discussed later in this paper.

In single-flow structures, the benefit of structures with $\Pi = 0$ compared to $\Pi = L$ is the use of a single RAM as opposed to the two half-size RAMs required by the latter. Although the memory requirements as well as the other requirements are the same, less complex datapath circuits are needed in the (SFS, $\Pi = 0$, $\pi$) case.

Also, the (SFS, $\Pi = L$, $\pi$) structure experiences a smaller decoding delay than the (SFS, $\Pi = 2L$, $\pi$) one; LLR reversing is performed in the interleaver rather than in the MAP decoder. Thus, for SFS, $\Pi = 0$ is clearly the best choice. For (SFS, $\Pi = 0$, $\pi$) decoders,
$$\text{storage req. [bits]} = \pi L \cdot 2^G \cdot w_L,$$
$$\text{number of DFUs} = \begin{cases} 3, & \pi \geq 1, \\ 2 + \dfrac{1}{\pi}, & \pi \leq 1, \end{cases}$$
$$\text{decoding delay} = \begin{cases} 3\pi L \ \text{or}\ 2\pi L, & \pi > 1, \\ (2 + 2\pi)L \ \text{or}\ (2 + \pi)L, & \pi \leq 1. \end{cases} \tag{10}$$

Note that, when $\pi > 1$, both the storage requirements and the decoding delay increase without any reduction in the number of DFUs; therefore, architectures with $\pi > 1$ are not suggested.

For $\pi < 1$, the storage requirements and decoding delay decrease as $\pi$ decreases, but the number of required DFUs increases. The decoding delay is $(2 + 2\pi)L$ when the MAP decoder reverses the LLR outputs and $(2 + \pi)L$ when the interleaver does.
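
The defining equations (10) are easily tabulated for candidate values of $\pi$; a direct transcription into Python, offered as a sketch (parameter names are ours):

```python
def sfs_cost(pi, L, G, wL, interleaver_reverses=True):
    """Storage (bits), DFU count, and decoding delay (symbol periods)
    for (SFS, Pi = 0, pi) decoders, per (10)."""
    storage = pi * L * 2 ** G * wL
    dfus = 3 if pi >= 1 else 2 + 1 / pi
    if pi > 1:
        delay = (2 * pi if interleaver_reverses else 3 * pi) * L
    else:
        delay = ((2 + pi) if interleaver_reverses else (2 + 2 * pi)) * L
    return storage, dfus, delay

# Example: pi = 1/2 gives 4 DFUs and a 2.5L delay with interleaver reversing.
```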

4.3. Double-Flow Structure SISO-MAP Decoders

The DFG and the state metric calculation modules for a (DFS, $\Pi = L$, $\pi = 1/2$, shift $= L/4$) structure are shown in Figures 9 and 10, respectively. Double-flow structures use all of the concepts of single flows, in the sense that they also have the parameters $\Pi$ and $\pi$, with the interesting values being $\Pi = 0$, $L$, and $2L$ and $\pi = 1/2$, $1$, and $2$. Some DFSs have the benefit of operating at twice the speed without doubling the hardware requirements.

For example, when $\Pi = L$ the RAM memories used by the data flow units to store the A- and B-state metrics of the SISO-MAP algorithm operate in write mode for half of the time and in read mode for the remaining half. The read period can be exploited by a second flow, which runs over half of the frame data in the opposite direction. This is illustrated in Figure 9, which shows a structure that decodes a frame of $P = 8L$ symbols in half the time.

Figure 9 depicts this DFS, with $\Pi = L$, $\pi = 1/2$, shift $= L/4$. Note that for RAM sharing between the two data flows a shift of $L/4$ is required in this case, as opposed to the $L/2$ shift used in other DFSs; the appropriate shift is not constant but depends on the other parameters of the structure. The diagram clearly indicates that two flows operate, splitting the $8L$ frame into two $4L$-symbol blocks. The first flow uses 4 DFUs: one forward, one backward, and two dummy metric DFUs. The same resources are employed by the second flow; therefore, in total 8 DFUs are required.

Double-flow structures lead to efficiency only for $\Pi = L$, since only for this value of $\Pi$ can memory sharing take place. For all other values of $\Pi$, the memory requirements and the number of DFUs double, with the added advantage of doubling the MAP decoder speed, since the frame decoding is split between the two flows. For (DFS, $\Pi = L$, $\pi$, shift) decoders,
$$\text{storage req. [bits]} = \pi L \cdot 2^G \cdot w_L,$$
$$\text{number of DFUs} = \begin{cases} 6, & \pi \geq 1, \\ 2\left(2 + \dfrac{1}{\pi}\right), & \pi < 1, \end{cases}$$
$$\text{decoding delay} = \begin{cases} \left(4\pi - \dfrac{3}{2}\right)L \ \text{or}\ \left(3\pi - \dfrac{3}{2}\right)L, & \pi > 1, \\[4pt] \left(\dfrac{5\pi}{2} + 2\right)L \ \text{or}\ \left(\dfrac{3\pi}{2} + 2\right)L, & \pi \leq 1, \end{cases}$$
$$\text{shift} = \begin{cases} \dfrac{(2\pi - 1)L}{2}, & \pi > 1, \\[4pt] \dfrac{\pi L}{2}, & \pi \leq 1. \end{cases} \tag{11}$$

Setting $\pi$ above 1 results in prohibitively large decoding delay. Even if $\pi$ is only 3, the decoding delay is $(4\pi - 3/2)L = 10.5L$ (or $(3\pi - 3/2)L = 7.5L$), which is unacceptably large; therefore, only architectures with $\pi \leq 1$ are of interest.
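
A corresponding transcription of (11), under the same assumptions as the SFS sketch above (the DFU-count branch for $\pi < 1$ follows the 8-DFU example of the $\pi = 1/2$ structure):

```python
def dfs_cost(pi, L, G, wL, interleaver_reverses=True):
    """Storage (bits), DFU count, decoding delay, and inter-flow shift
    for (DFS, Pi = L, pi, shift) decoders, per (11)."""
    storage = pi * L * 2 ** G * wL
    dfus = 6 if pi >= 1 else 2 * (2 + 1 / pi)
    if pi > 1:
        delay = ((3 * pi - 1.5) if interleaver_reverses else (4 * pi - 1.5)) * L
        shift = (2 * pi - 1) * L / 2
    else:
        delay = ((1.5 * pi + 2) if interleaver_reverses else (2.5 * pi + 2)) * L
        shift = pi * L / 2
    return storage, dfus, delay, shift
```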

4.4. Pointer Technique-Based SISO-MAP Decoders

Figures 11 and 12 illustrate the DFG and the SM-CM for an (SFS/PNT, $\Pi = 2L$, $\pi = 1$, $\varepsilon$ Ptrs) structure.

In all of the above structures, all the necessary metrics are stored in RAM. Another idea is to selectively store some of the state metrics and recompute all the others. This idea of selective recomputation leads to reduced storage requirements.

The pointers are nothing but actual state metric values, which are stored in and extracted from a storage device whenever necessary. The values to be stored as pointers are depicted with small white circles in Figure 11.

In this structure, during time $t = 2L\,t_S$ to $(3L-1)\,t_S$, after $L$ dummy metric recursion steps, the obtained valid $B_L$ value is stored in a register. This value is also used to initialise a backward recursion, which operates during symbol periods $3L$ to $4L-1$. As it progresses, this backward B-metric DFU stores in registers the values $B_{3L/4}$ and $B_{L/2}$, and in the RAM the last $L/4$ state metric values, from $B_{L/4-1}$ down to $B_0$. Thus, only $L/4$ state vectors are stored in RAM. An extra backward DFU then extracts the pointer $B_{L/2}$ to recalculate the next $L/4$ B-metric values ($B_{L/2-1}$ down to $B_{L/4}$) and store them in the memory. This procedure is repeated until all $L$ B-metric vectors in the window have been calculated. As the backward metrics are calculated, a forward DFU computes the corresponding A-state metrics.

As can be observed from Figure 11, at any given time only $L/4$ state metric vectors of size $2^G w_L$ each are stored in the RAM. This means the storage requirements of this structure are a RAM of size $(L/4)\cdot 2^G \cdot w_L$ bits and 3 pointer vectors of $2^G w_L$ bits each.
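
The selective-recomputation schedule just described can be sketched as follows; the indices and phase ordering follow the Figure 11 walk-through, while the function itself is purely illustrative:

```python
def pointer_schedule(L, eps=3):
    """B-metric RAM occupancy for one window of the (SFS/PNT, Pi = 2L,
    pi = 1, eps Ptrs) structure of Section 4.4. The first backward pass
    saves eps pointer vectors and stores only the lowest L/(eps+1)
    metrics; each later pass restarts from a pointer (or from B_L) and
    recomputes one segment. Behavioural sketch only."""
    seg = L // (eps + 1)
    schedule = []
    for k in range(eps + 1):
        restart = (k + 1) * seg            # B-index the recursion restarts from
        indices = list(range(restart - 1, k * seg - 1, -1))   # segment in RAM
        schedule.append((restart, indices))
    return schedule

# For L = 16 and eps = 3: segments [3..0], [7..4], [11..8], [15..12],
# restarting from B_4, B_8, B_12, and B_16 respectively; only L/4 = 4
# vectors occupy the RAM at any time.
```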

For pointer-based SFS with $\Pi = 0$, the storage of state metrics starts right after the dummy metric calculation has produced a valid metric, so there is no time window for pointers. $\Pi = 2L$ works equally well as $\Pi = L$ with regard to pointers, in the sense that the trade-offs mentioned before are not affected.

For (SFS/PNT, $\Pi = 2L$, $\pi = 1$, shift $= -$, $\varepsilon$) structures,
$$\text{storage req. [bits]} = \left(\frac{L}{\varepsilon + 1} + \varepsilon\right) 2^G w_L, \tag{12a}$$
$$\text{number of DFUs} = 4, \tag{12b}$$
$$\text{decoding delay} = 4L \ \text{or}\ \left(4 - \frac{1}{\varepsilon + 1}\right)L. \tag{12c}$$

Increasing $\varepsilon$ (the number of pointers) reduces the RAM size but has no effect on the number of DFUs. The size of the required RAM is $(L/(\varepsilon+1))\cdot 2^G \cdot w_L$; of course, a register of size $2^G w_L$ must also be allocated for each pointer. The decoding delay is affected by the number of pointers only if interleaver LLR reversing is employed; in this case the decoding delay is $4L - (1/(\varepsilon+1))L$, and this holds unless $\pi > \varepsilon$.

If the structures (SFS/PNT, $\Pi$, $\pi = 1$, shift $= -$, $\varepsilon = 3$) with $\Pi = 0, L, 2L$ are compared with the corresponding SFS structures that use no pointers, it is observed that using three pointers and one more DFU results in a 75% reduction in the required RAM. The decoding delay of the pointer-based structures, when interleaver LLR reversing is used, is increased by $0.75L$ for the first two cases, $\Pi = 0, L$.

For $\pi \neq 1$: if $\operatorname{mod}(\varepsilon + 1, 1/\pi) \neq 0$, the pointer-based techniques (single flow and double flow) are not applicable. If $\operatorname{mod}(\varepsilon + 1, 1/\pi) = 0$, the resulting architecture does not reduce the memory size but has higher throughput (the computation speeds up $1/\pi$ times). In this case, (12a) and (12c) hold as they are, but in (12b) the number of DFUs is given by $1/\pi + 3$.

$\Pi = 2L$ is the best choice for pointer-based SFS, whereas $\Pi = L$ is the better choice for pointer-based DFS, since the latter allows memory sharing. For (DFS/PNT, $\Pi = L$, $\pi = 1$, shift, $\varepsilon$) decoders,
$$\text{storage req. [bits]} = \left(\frac{L}{\varepsilon + 1} + \varepsilon\right) 2^G w_L,$$
$$\text{number of DFUs} = 8,$$
$$\text{decoding delay} = \left(4 + \frac{\varepsilon}{2(\varepsilon + 1)}\right)L \ \text{or}\ \left(4 - \frac{1}{2(\varepsilon + 1)}\right)L,$$
$$\text{shift} = \frac{L}{2(\varepsilon + 1)}. \tag{13}$$
Again, note that increasing $\varepsilon$ reduces the storage requirement but not the data flow unit count of the structures.
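
Equations (12a)-(12c) and (13) can likewise be transcribed for quick comparison; a sketch under the same naming assumptions as the earlier cost helpers:

```python
def pnt_cost(L, G, wL, eps, double_flow=False):
    """Storage (bits), DFU count, decoding delay (symbol periods), and
    shift for the pointer-based structures with pi = 1, per (12a)-(12c)
    and (13)."""
    storage = (L / (eps + 1) + eps) * 2 ** G * wL   # RAM plus pointer registers
    if not double_flow:                 # (SFS/PNT, Pi = 2L, pi = 1, eps)
        dfus = 4
        delay = (4 - 1 / (eps + 1)) * L             # or 4L with decoder-side reversing
        shift = None
    else:                               # (DFS/PNT, Pi = L, pi = 1, eps)
        dfus = 8
        delay = (4 - 1 / (2 * (eps + 1))) * L       # or (4 + eps/(2(eps+1)))L
        shift = L / (2 * (eps + 1))
    return storage, dfus, delay, shift
```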

Figure 13 shows a complete SISO-MAP decoder VLSI architecture using a double flow with $\Pi = L$, $\pi = 1$, 3 Ptrs, shift $= L/8$. This architecture requires 8 DFUs (2 dummy, 2 pointer-saving, 2 forward, and 2 backward) and two RAMs of $(L/8)\cdot 2^G \cdot w_L$ bits each, and has a decoding delay of $4L$. With selective recomputation, only a fraction of the required state metric vectors is stored in the RAMs: if $\varepsilon$ pointers are used, only $L/(\varepsilon+1)$ of the state metric vectors need to be stored, and the remaining $L - L/(\varepsilon+1)$ are recalculated in blocks of $L/(\varepsilon+1)$.

5. Conclusions

The VLSI architectures of space-time turbo trellis coding decoders, as well as of a set of SISO-MAP component channel decoders used in turbo coding, have been proposed and investigated. The space-time turbo code receiver, as opposed to a binary turbo code receiver, is based on nonbinary trellises, which imposes a number of differences. Apart from the fact that channel estimation and MRRC combining are needed to cope with the multiple diversity received symbols, the frame formation block is different from that of a traditional binary turbo receiver: in the latter, systematic and parity (redundant) information can be separated and stored in separate memory banks, whereas in the former it cannot, so the whole symbol is stored in a memory bank. In the STTuTC receiver, the equalised symbols are demultiplexed and stored in memory banks, zeros are inserted at even or odd locations, and one memory bank is sent to the decoder at a time. This ensures that the SISO-MAP decoder accepts the information from the corresponding encoder. The symbols are demapped by a symbol hard or soft demapper.

The proposed STTuTC architectures are based on a different ACSO unit from that of binary turbo codes. This ACSO unit can handle more than two state and transition metric pairs iteratively. In nonbinary trellises the state and transition metric pairs number more than two, because of the increased number of transitions between states; thus, the ACSO unit must either grow or work iteratively. Many ACSO units can be used to calculate all state metrics in one step, without the need to store those values in a state metric memory, so the state metric calculation modules are considerably different from those of binary turbo decoders. The LLR calculation module is also different, because more than one LLR value needs to be calculated in every decoding step: $f$ appropriately connected ACSO units, rather than two, are required to calculate the LLR values for all possible symbols. This also means that the number of soft LLR outputs the SISO-APP demapper delivers each time depends on the $f$ possible values; that is, $f - 1$ LLRs must be output, so the LLR memory size differs from the binary turbo decoder case.

The parameter set (technique, $\Pi$, $\pi$, shift, $\varepsilon$) helped in the comparison and led to defining equations for many different cases: single-flow, double-flow, and pointer-based techniques. Table 3 lists the formulae that determine the memory size, the number of deployed DFUs, and the decoding delay of the most efficient techniques among those discussed in this paper.