Abstract

Multi-input multi-output (MIMO) systems combined with orthogonal frequency-division multiplexing (OFDM) have gained wide popularity in wireless applications because they can provide increased channel capacity and robustness against multipath fading channels. However, these advantages come at the cost of very high processing complexity, and the efficient implementation of MIMO-OFDM receivers is today a major research topic. In this paper, efficient architectures are proposed for the hardware implementation of the main building blocks of a MIMO-OFDM receiver. A sphere decoder architecture that adapts to different modulations without any loss in BER performance is presented, while the proposed matrix factorization implementation achieves the highest throughput specified in the IEEE 802.11n standard. Finally, a novel sphere decoder approach is presented, which allows for the realization of a new golden space-time trellis coded modulation (GST-TCM) scheme. Implementation cost and offered throughput are provided for the proposed architectures synthesized on a 0.13 μm CMOS standard cell technology or on advanced FPGA devices.

1. Introduction

MIMO-OFDM (multi-input multi-output orthogonal frequency-division multiplexing) is a very promising communication technique that enables very high throughput and reliable wireless links. To achieve this goal, space-time (ST) codes are used, since they can combine transmission rate and reliability enhancements of the communication system. ST codes have been considered for some recently proposed standards such as IEEE 802.11n WLAN and 802.16e WMAN.

However, the computational complexity of MIMO-OFDM receivers is much higher than in the single-input single-output (SISO) OFDM approach; as a consequence the potentials offered by MIMO-OFDM are still far from being fully exploited in actual implementations.

Figure 1 depicts the structure of a MIMO-OFDM communication scheme with multiple transmit and receive antennas. At the receiving side, after the RF/analog front-end, multiple OFDM demodulation stages, implemented as FFT processors (one per antenna), are allocated, followed by the MIMO signal detector. The adoption of a full-rate and full-diversity ST code demands specific demapping and decoding capabilities, which are covered in Figure 1 by the "ST-code decoder and demapper" block. Finally, a trellis coded modulation (TCM) channel decoder implements forward error correction.

The MIMO channel is modeled by its impulse response between each transmit-receive antenna pair. Let $h_{ij}$ denote the time-varying channel fading coefficient between the $j$th transmit antenna and the $i$th receive antenna; the MIMO channel with $n_t$ transmit and $n_r$ receive antennas is then described through a matrix $\mathbf{H} = [h_{ij}]$, where $\mathbf{H} \in \mathbb{C}^{n_r \times n_t}$. Transmitted space-time codewords are $n_t \times T$ matrices $\mathbf{X}$, where $T$ is the number of channel uses required by the ST code. Assuming the "block fading" channel model, each transmitted $\mathbf{X}$ is affected by an independently varying channel matrix $\mathbf{H}$. Then, the received matrix is

$$\mathbf{Y} = \mathbf{H}\mathbf{X} + \mathbf{Z}, \qquad (1)$$

where $\mathbf{Z}$ is the additive white Gaussian noise matrix, with independent zero-mean complex Gaussian entries.

When data symbols belong to a $Q$-QAM modulation, it is convenient to represent the codewords in vectorized form, separating the real and imaginary components of each $Q$-QAM symbol into two $\sqrt{Q}$-PAM symbols; this results in real-valued codeword vectors $\mathbf{x}$. Consequently, the channel matrix is rearranged into a real-valued matrix $\mathbf{H}$ and the received matrix is replaced with the real-valued vector $\mathbf{y}$. For a linear ST block code, $\mathbf{x}$ can be obtained as $\mathbf{x} = \mathbf{G}\mathbf{s}$, where $\mathbf{G}$ is the ST block code generator matrix and $\mathbf{s}$ is the vectorized data vector with entries in $\sqrt{Q}$-PAM. Note that the set of admissible data vectors $\mathbf{s}$ is a hypercubic-shaped constellation carved from a multidimensional integer grid.

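Provided that the channel matrix $\mathbf{H}$ is perfectly known at the receiver, the optimal detector, which minimizes the codeword error rate in a MIMO channel, is the maximum likelihood (ML) detector, which solves the problem

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \|\mathbf{y} - \mathbf{M}\mathbf{s}\|^2, \qquad (2)$$

where $\mathbf{y}$ is the real-valued received vector, $\mathbf{M} = \mathbf{H}\mathbf{G}$ is the lattice generator matrix, and the minimization runs over all admissible data vectors $\mathbf{s}$. The cardinality of the search space depends on the number of receive antennas, the chosen modulation scheme, and the number of channel uses, while a factor 2 in the number of dimensions comes from the decomposition into real and imaginary components.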
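As an illustration of the exhaustive search implied by the ML problem (2), the following minimal sketch enumerates every candidate vector and keeps the closest one. The 2 x 2 matrix `M`, the noise values, and the function name `ml_exhaustive` are hypothetical choices made only for this example; they are not taken from the paper.

```python
from itertools import product

def ml_exhaustive(y, M, pam):
    """Return (s, d): the vector s with entries in `pam` minimizing d = ||y - M s||^2."""
    n = len(M[0])
    best_s, best_d = None, float("inf")
    for s in product(pam, repeat=n):          # visit every point of the constellation
        d = sum((y[i] - sum(M[i][j] * s[j] for j in range(n))) ** 2
                for i in range(len(M)))
        if d < best_d:
            best_s, best_d = s, d
    return best_s, best_d

# Hypothetical toy example: 2 real dimensions, 4-PAM per dimension.
M = [[1.0, 0.5],
     [0.2, 1.5]]
pam = (-3, -1, 1, 3)
s_true = (1, -3)
noise = (0.1, -0.2)
y = [sum(M[i][j] * s_true[j] for j in range(2)) + noise[i] for i in range(2)]

s_hat, d_hat = ml_exhaustive(y, M, pam)
```

The loop visits all points of the constellation, which is exactly the cost that the sphere decoder is meant to avoid.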

Hereinafter, to consider a currently practical situation, we assume a system with two transmit and two receive antennas and an ST block code with two channel uses ($T = 2$). An example of such a code is the golden code proposed in [1–3] and adopted by the IEEE 802.16e WMAN standard. In this case, the data and received vectors are real vectors with 8 components and the lattice generator matrix is a real-valued $8 \times 8$ matrix. Thus, when using 16-QAM symbols, the direct computation of (2) results in the evaluation of $4^8 = 65536$ possible solutions.
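The size of the exhaustive search can be checked with a few lines of arithmetic. This is only a sketch: it assumes square QAM constellations and a real-valued search space with 2 x (antennas) x (channel uses) dimensions, which matches the 8-level real tree discussed later; the function and parameter names are hypothetical.

```python
import math

def search_space_size(qam_order, n_antennas, channel_uses):
    """Number of candidate vectors for a square QAM of the given order."""
    pam_points = math.isqrt(qam_order)            # points per real dimension
    assert pam_points * pam_points == qam_order   # square QAM assumed
    real_dims = 2 * n_antennas * channel_uses     # factor 2: real and imaginary parts
    return pam_points ** real_dims

sizes = {q: search_space_size(q, 2, 2) for q in (4, 16, 64)}
```

For the 2 x 2, two-channel-use case the count grows from hundreds of points for 4-QAM to millions for 64-QAM, which motivates the tree-search formulation that follows.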

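Due to the high complexity of the exhaustive search, more efficient methods have been proposed. Most of these approaches rely on a rearrangement of (2). In particular, a linear transformation such as the QR or Cholesky decomposition allows the lattice generator matrix $\mathbf{M}$ to be rewritten as the product of two matrices, $\mathbf{M} = \mathbf{Q}\mathbf{R}$, one of which ($\mathbf{R}$) is upper triangular [4]. Imposing $\mathbf{y}' = \mathbf{Q}^T\mathbf{y}$, (2) can be rewritten as

$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \|\mathbf{y}' - \mathbf{R}\mathbf{s}\|^2, \qquad (3)$$

where we have exploited the orthogonality of $\mathbf{Q}$.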
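The equivalence between the original metric and the rotated one can be verified numerically. The sketch below builds a hypothetical 2 x 2 example from a rotation matrix `Q` and an upper triangular `R` (both chosen arbitrarily for illustration) and checks that the two distances coincide for every 4-PAM candidate.

```python
import math

theta = 0.7  # arbitrary rotation angle, used only to obtain an orthogonal Q
Q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]
R = [[2.0, 0.5],
     [0.0, 1.5]]          # upper triangular factor (illustrative values)

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def sq_norm(v):
    return sum(x * x for x in v)

# M = Q R, and y_rot = Q^T y
M = [[sum(Q[i][k] * R[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
Qt = [list(col) for col in zip(*Q)]
y = [0.3, -1.2]
y_rot = matvec(Qt, y)

pam = (-3, -1, 1, 3)
# Largest difference between ||y - M s||^2 and ||y_rot - R s||^2 over all candidates
max_gap = max(
    abs(sq_norm([a - b for a, b in zip(y, matvec(M, (s1, s2)))])
        - sq_norm([a - b for a, b in zip(y_rot, matvec(R, (s1, s2)))]))
    for s1 in pam for s2 in pam)
```

Because the rotation preserves distances, the search can work on the triangular system only, which is what enables the level-by-level tree traversal.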

One of the most interesting consequences of this interpretation is that the exploration of the constellation lattice can be thought of as a tree traversal. The search tree has one level per real dimension of the data vector, and each node has exactly as many sons as there are points in one PAM dimension. This traversal can be done with polynomial average complexity by adopting the so-called sphere decoder (SD) [5].

Recently proposed concatenated ST coding schemes [6] offer a further reliability enhancement by adopting a combined forward error correction approach based on a high-rate, bandwidth-efficient trellis coded modulation (TCM) scheme. This golden ST TCM (GST-TCM) scheme for MIMO provides ML decoding at a reasonable complexity by using the Viterbi algorithm and a branch metric computer based on several parallel sphere decoders. A modified sphere decoder is required to support this kind of concatenated scheme, and its implementation has not yet been investigated.

This paper deals with the implementation of the main processing tasks that enable the development of MIMO receivers. A MIMO detector is organized in two key processing tasks, matrix factorization and sphere decoding (or tree traversal), and we propose efficient architectures for both. The latter is the core function of a high-performance MIMO detector, and its hardware implementation tends to be critical in terms of both throughput and complexity, especially in high data rate systems.

The matrix factorization task operates on the lattice generator matrix. Since the code generator matrix is constant, the processing must be performed at the channel estimation update frequency, which can change significantly according to the scenario and is generally one or two orders of magnitude lower than the signaling rate. However, in a MIMO-OFDM scheme, space-time decoding has to be carried out independently on each subcarrier, which dramatically increases the throughput demand even for matrix factorization.

In Section 2, the sphere decoding algorithm is briefly overviewed, while Section 3 deals with the hardware design of three key building blocks: a sphere decoder, a matrix factorization architecture, and an enhanced sphere decoder for GST-TCM. Finally, Section 4 points out the implementation results achieved for the proposed architectures.

2. The Sphere Decoding Algorithm

Sphere decoding algorithms are a family of algorithms originally proposed to search for the closest point to a given one in a lattice. Their use in wireless communications was suggested for the first time in [5], where the lattice structure of multidimensional constellations is exploited to find the closest point to the received vector.

When solving the minimization problem (2), sphere decoding algorithms achieve a polynomial average complexity by exploring only a subset of the solution space [4].

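In particular, a hypersphere is constructed around the received vector and only the points inside it are taken into account. This constraint can be expressed as

$$\|\mathbf{y}' - \mathbf{R}\mathbf{s}\|^2 \le C, \qquad (4)$$

where $\mathbf{y}'$ and $\mathbf{R}$ are the rotated received vector and the upper triangular matrix of (3), and $C$ is the squared radius of the hypersphere [5, 7, 8].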

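The upper triangular structure of the matrix in (3) enables every component to be considered separately when computing the distance between the two points, so the distance can be computed recursively. We consider the partial metrics

$$T_i = T_{i+1} + |e_i|^2, \qquad (5)$$

where $e_i = y'_i - \sum_{j=i}^{n} r_{ij} s_j$, with $i = n, n-1, \ldots, 1$, $T_{n+1} = 0$, and $n$ the number of tree levels. Since the term $|e_i|^2 \ge 0$ for every $i$, we have $T_i \ge T_{i+1}$. After $n$ steps, the distance is obtained as $T_1$.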

As an example, a three-level tree for a 4-PAM modulation is depicted in Figure 2, where $T_i$ is the distance metric at level $i$ defined in (5). At every level, the radius constraint (4) must be satisfied; otherwise the branch is pruned. In general, the radius is progressively reduced every time a leaf is reached at a distance smaller than the current radius.
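The traversal with pruning and radius reduction can be sketched in software as follows. This is a minimal recursive illustration, not the hardware architecture of the paper: it assumes an upper triangular `R`, an odd-integer PAM alphabet, and an initially infinite radius, and all names and numeric values are hypothetical.

```python
def sphere_decode(y_rot, R, pam, radius_sq=float("inf")):
    """Depth-first, best-first search of min ||y_rot - R s||^2 with s in pam^n."""
    n = len(R)
    best = {"s": None, "d": radius_sq}   # current best leaf and squared radius
    s = [0] * n

    def descend(level, metric):
        # Remove the contribution of the symbols already decided at deeper levels.
        psi = y_rot[level] - sum(R[level][j] * s[j] for j in range(level + 1, n))
        # Visit the sons in order of increasing metric increment (best first).
        for cand in sorted(pam, key=lambda p: (psi - R[level][level] * p) ** 2):
            new_metric = metric + (psi - R[level][level] * cand) ** 2
            if new_metric >= best["d"]:
                break                    # remaining sons are even worse: prune
            s[level] = cand
            if level == 0:
                best["s"], best["d"] = tuple(s), new_metric   # leaf: shrink radius
            else:
                descend(level - 1, new_metric)

    descend(n - 1, 0.0)                  # start from the last dimension
    return best["s"], best["d"]

# Hypothetical three-level example with 4-PAM.
R = [[2.0, 0.4, -0.3],
     [0.0, 1.5,  0.6],
     [0.0, 0.0,  1.0]]
y_rot = [1.1, -2.0, 2.7]
s_hat, d_hat = sphere_decode(y_rot, R, (-3, -1, 1, 3))
```

In this toy case the first leaf reached is already the closest point, and every alternative branch is pruned as soon as its partial metric exceeds the updated radius.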

Several algorithms have been studied in order to make the tree traversal efficient. The first algorithm, proposed by Fincke and Pohst in [7], requires an explicit choice of the initial radius. A more efficient solution was proposed by Schnorr and Euchner (SE) [9]. In this case, the initial radius is selected as the distance from the zero-forcing decision-feedback equalization (ZF-DFE) solution and a "depth and best first" traversal of the tree is performed. Originally conceived for infinite lattices, the SE algorithm was then adapted to finite lattices [4, 10].

The SE algorithm has an intrinsically variable throughput, which makes it poorly suited to hardware implementation. The key to making this algorithm efficient or, at least, to giving it a predictable throughput, is effective pruning. Many theoretical studies in the recent literature aim at this goal [11]. A very interesting approach consists in an effective column reordering, which uses heuristic methods to reduce the search complexity with limited performance loss [12]. This technique results in very efficient tree search circuits, but additional area is necessary for the preprocessing phase.

By contrast, the approach proposed in this paper is based on reducing the computational complexity of the tree search algorithm with no column reordering. This solution is suitable for a flexible implementation that can adapt to different modulation sizes.

3. VLSI Architectures

Implementation architectures for two key building blocks in MIMO detectors are presented in this section (tree search processing and matrix factorization). An enhanced sphere decoder is then described to be applied in the concatenated GST-TCM scheme.

3.1. Tree Search Processing Block

Given the choice of adopting a fully ML detection algorithm for (2), several implementation options have been proposed in the literature.

A first classification can be made with respect to the choice of a real- or complex-valued tree construction. In the real case, the tree is twice as deep as the complex one. In complex trees, by contrast, the number of sons of every node is the square of the number of sons in the real tree. As an example, with two transmit antennas, two channel uses, and 16-QAM modulation, a complex-valued tree construction would lead to a 4-level tree where each node has 16 sons, while 8 levels and 4 sons per node appear in the corresponding real-valued tree. Although [13] demonstrates that a complex-valued tree results in a lower number of visited nodes, the construction of a real-valued tree allows for a more flexible solution, adaptable to different modulation schemes.
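The relation between the two tree constructions can be made concrete with a small helper. It is a sketch under the assumption of square QAM constellations; the function name `tree_shape` and its parameters are hypothetical.

```python
import math

def tree_shape(qam_order, n_tx, channel_uses, real_valued):
    """Return (levels, sons_per_node) of the search tree."""
    complex_levels = n_tx * channel_uses      # one level per complex symbol
    if real_valued:
        # Each complex symbol is split into two PAM symbols.
        return 2 * complex_levels, math.isqrt(qam_order)
    return complex_levels, qam_order

real_tree = tree_shape(16, 2, 2, real_valued=True)
complex_tree = tree_shape(16, 2, 2, real_valued=False)
# Both constructions enumerate the same number of leaves.
leaves_real = real_tree[1] ** real_tree[0]
leaves_complex = complex_tree[1] ** complex_tree[0]
```

Changing the modulation only changes the number of sons per node in the real-valued tree, which is what makes a run-time selectable modulation straightforward.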

Another classification criterion is with respect to the implementation parallelism:

(i) parallelism at the level of tree exploration;
(ii) parallelism at the level of the metric computation for all sons of a given node and in the selection of the most probable son.

The first technique can be adopted only with suboptimal algorithms, while the second approach is not feasible with large-cardinality QAM modulation schemes, as it implies a large number of concurrent multiplications. Hence, parallelism is not viable for the implementation of flexible architectures. A serial architecture, designed for high throughput, can achieve both flexibility and low area cost.

Detailed descriptions of the proposed architecture can be found in [14, 15]. The proposed architecture adopts a real-valued tree construction and a serial organization. The key advantage offered by this choice is the possibility of a run-time selection of the modulation scheme. The system is furthermore adaptable to different transmitting schemes, including the golden code, through the use of some instantiation parameters, which allow the choice of the datapath width and of the number of levels of the search tree.

The SE algorithm adopts the "depth and best first" traversal of the tree, so at every node the metric must be minimized according to the problem formulation given in (5). Computing the metric for all possible sons tends to become infeasible as the order of the modulation increases, due to the large number of required operations. The core of the proposed approach is the direct selection of the son that minimizes the metric by means of a division.

In particular, the iterative evaluation of (5) is rearranged in two steps. At the first step, the value is received as an input from the previous iteration and the desired for the analyzed node is directly obtained through the division ; moreover, the output is calculated for the selected as . The second processing step receives and , to actually compute , according to (5). The two operations are performed by units U_psi_Unit and Metric_Compute in Figure 3, where memories required to store amounts and metrics are also shown.

It is worth noting that the result of the division is rounded to the closest PAM constellation point. As a consequence, a general-purpose hardware divider is not necessary and the required operation can be executed by means of the first steps of a successive subtraction divider [16]. This divider has a very simple architecture that employs only shifts and subtractions; although it tends to be very slow for a complete division, it can be used effectively when only a few elementary shift-and-subtract operations are required.
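The idea that only a few compare and add steps are needed, rather than a full division, can be sketched as follows. This is an illustrative software model and not the circuit of [16]: it assumes an odd-integer PAM alphabet, a positive divisor, and fixed-point integers with a common scaling; the function name is hypothetical.

```python
def nearest_pam(psi, r, pam_size):
    """Nearest odd-integer PAM point to psi / r using only compare, add and shift.

    psi, r: fixed-point integers with the same scaling, r > 0 (assumption).
    pam_size: number of PAM points, e.g. 4 for the alphabet {-3, -1, 1, 3}.
    """
    magnitude = -psi if psi < 0 else psi
    step = r << 1            # 2 * r: spacing between decision thresholds
    threshold = step
    point = 1
    # At most pam_size / 2 - 1 iterations: far fewer than a full division.
    while point < pam_size - 1 and magnitude >= threshold:
        point += 2
        threshold += step
    return -point if psi < 0 else point
```

For example, with `r = 512` the call `nearest_pam(1000, 512, 4)` selects the point 1 and `nearest_pam(-1700, 512, 4)` selects -3, while inputs beyond the constellation edge saturate to the outermost point.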

In a high throughput sphere decoder, a new metric must be evaluated at each clock cycle. In order to achieve this target, the two steps exploit a pipelined architecture. Additionally, an alternative metric must always be ready, even when a pruning of the tree occurs; therefore, in the proposed architecture, two "candidate" nodes are selected in parallel when processing a given father node. The first one is a direct son of the current node, selected by the U_psi unit of Figure 3 by descending along the tree. The "alternative" node, selected by the U_psi_Step Unit, is placed at a higher level in the tree and is chosen when the branch has to be pruned, that is, when the current metric exceeds the best metric found so far in the tree traversal. The procedure adopted to select the alternative node is described below.

In the U_psi Unit, the evaluation of the direct son of the current node makes use of the division, and the result is rounded either down or up to the nearest PAM constellation point: the best choice is given by (6) (see Figure 4), which also defines a correction term. The sign of the correction term is exploited to select the second (and following) nearest point in the PAM constellation according to rule (7), which moves by the distance between two consecutive points.

Thus, U_psi_Step Unit simply computes (7) to find the second most probable value of . Figure 4 shows the sequence of alternative nodes selected at a given tree level, after the occurrence of pruning.
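The selection of successive candidates at one tree level can be illustrated with the standard Schnorr-Euchner zig-zag ordering. Since equations (6) and (7) are not reproduced here, the sketch below is an assumption about their intent: it uses an odd-integer PAM alphabet, takes the sign of the rounding residual as the correction sign, and skips points that fall outside the finite constellation. The function name `se_order` is hypothetical.

```python
import math

def se_order(quotient, pam_size):
    """PAM points (odd integers) sorted by increasing distance from `quotient`."""
    top = pam_size - 1
    # Nearest odd integer, clipped to the finite constellation.
    first = 2 * math.floor(quotient / 2) + 1
    first = max(-top, min(top, first))
    correction = quotient - first
    step = 2 if correction >= 0 else -2   # side on which the second-best point lies
    order = [first]
    k = 1
    while len(order) < pam_size:
        # Zig-zag: alternate between the two sides, moving one point further each time.
        for cand in (first + k * step, first - k * step):
            if -top <= cand <= top:
                order.append(cand)
        k += 1
    return order
```

With a quotient of 0.6 and 4-PAM, the visiting order is 1, -1, 3, -3; when the quotient lies outside the constellation, for example 5.0, the order degenerates to a one-sided sweep 3, 1, -1, -3.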

In summary, we have the following.

(i) The division approach achieves low complexity and flexibility in terms of supported modulation schemes.
(ii) The concurrent evaluation of two "candidate" nodes provides a significant speed-up to the inherently serial SE sphere decoding algorithm and has a limited impact on complexity.

3.2. Matrix Factorization

Understanding the throughput requirements is fundamental in the architectural study of this processing block. The IEEE 802.11n WLAN standard, which adopts space-time coding, implies that a new channel estimation is performed whenever a packet arrives; this means that the number of matrix factorizations ranges from a minimum of 64 in a time period of 36 microseconds to a maximum of 128 in 28 microseconds.

In the design of the matrix factorization block, a first choice has to be made between Householder transformations and algorithms based on Givens rotations [17]. The latter approach results in a sequence of rotation operations that cancel the elements under the main diagonal of the matrix. Givens rotations require a larger number of floating-point operations than Householder transformations; nevertheless, they can be implemented using parallel systolic arrays, and for this reason they are usually preferred for hardware implementation.

Every single processing element (PE) of the systolic array must perform the angle calculation and the rotation to cancel the matrix elements. Several alternatives exist to accomplish these two tasks, and the most common ones are

(1) computation of sine and cosine of the angle by means of operations including square roots and divisions;
(2) direct angle calculation and rotation using CORDIC processors [18].

The main advantage of the sine and cosine approach is that the primitives can be optimized, resulting in an efficient, although expensive, implementation. The second technique is less expensive, but outputs are generated with longer latencies and with data dependencies between operations. The very high throughput required by this application can hardly be achieved by iterative CORDIC-based algorithms, so other alternatives have to be explored to reduce the latency of every single processor. Among the square-root-free algorithms, the squared Givens rotations (SGR) proposed by Döhler [19] constitute a good compromise between complexity and speed [20, 21].

Let us indicate with the row of an matrix, where a 0 must be introduced in the th position and with another row having the same number of leading zeros; the standard Givens rotations (StdGR) algorithm employs this set of updating equations to cancel the element :

The SGR algorithm takes advantage of the observation that introduces the matrix and exploits the relations and . Then, to simplify the notation, the new vectors and are introduced for some . After some algebra, we can express (8) with a new set of updating equations:

When compared to StdGR, the SGR algorithm requires half the number of multiplications and no square-root operations. The updating sequence can be arranged in a systolic array of PEs performing the aforementioned computations.
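For reference, the standard (textbook) Givens rotation procedure can be sketched in software as follows. This illustrates StdGR only, not the SGR variant adopted in the paper, and it uses the sine and cosine formulation with a square root that SGR avoids. Feeding the identity matrix through the same rotations yields the transpose of the orthogonal factor. The function name and the test matrix are hypothetical.

```python
import math

def givens_qr(A):
    """QR factorization by standard Givens rotations. Returns (Qt, R) with Qt = Q^T."""
    n = len(A)
    R = [row[:] for row in A]
    Qt = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for col in range(n):
        for row in range(n - 1, col, -1):        # cancel sub-diagonal entries bottom-up
            a, b = R[row - 1][col], R[row][col]
            if b == 0.0:
                continue                         # nothing to cancel
            norm = math.hypot(a, b)              # the square root that SGR avoids
            c, s = a / norm, b / norm
            for mat in (R, Qt):                  # apply the same rotation to both
                upper, lower = mat[row - 1], mat[row]
                mat[row - 1] = [c * u + s * l for u, l in zip(upper, lower)]
                mat[row] = [-s * u + c * l for u, l in zip(upper, lower)]
    return Qt, R

A = [[2.0, -1.0, 0.5],
     [1.0,  3.0, -2.0],
     [0.5,  1.5,  4.0]]
Qt, R = givens_qr(A)
```

Each rotation touches only two rows, which is why the computation maps naturally onto an array of small processing elements.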

The PE array can be arranged according to different structures, namely the triangular (TA), square, and linear (LA) shapes: each of them shows a different percentage of PE reuse and a different throughput. Slightly different functions are then associated in the array organization to boundary and internal PEs.

Figure 5 depicts a generic systolic array layout able to perform the QR decomposition of a matrix. The identity matrix must enter the systolic array immediately after the matrix to be processed in order to produce the Q matrix. During the processing of the input matrix, the coefficients of R are already computed and stored in the internal registers.

Depending on (9), boundary and internal processing elements must behave differently when a diagonal element of the matrix enters a node. Table 1 lists the computations performed by the nodes in the different operating modes. In the table, Reg and Reg2 are two registers needed to store the parameters between different steps. The subscript "in" indicates that a parameter originates from the preceding PE, in accordance with the connections in Figure 5, while the subscript "out" indicates that a parameter originates in the current PE. Note also that the parameter is updated only in diagonal mode, while in the other modes it maintains the registered value.

The internal processing element (IPE) appears to be the most computationally intensive block of the entire system. Figure 6 depicts the architecture of the IPEs derived from Table 1. Although the divider has a latency of two clock cycles and two divisions are needed in the diagonal mode, a proper overlapping with the nondiagonal mode guarantees a total latency of three clock cycles.

The method proposed in [22] is adopted to realize the division operation. Using a Taylor series expansion, the divisor, expressed on $2m$ bits, is decomposed into two $m$-bit groups, the higher bits and the lower bits. Since the higher group dominates the lower one, the quotient can be written as a truncated series with a bounded maximum fractional error. This divider takes two clock cycles to complete the division on 16-bit fixed-point data [23]; it requires a multiplier, an adder/subtracter, and a LUT with 256 8-bit entries to store the required inverse. The overall complexity of the internal PE is therefore given by two 16-bit multipliers, two adders/subtracters, and a LUT.
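A LUT-based Taylor divider of this kind can be modeled in a few lines. The sketch below uses one common two-term formulation, x / y ≈ x (y_high - y_low) / y_high², with the table indexed by the upper 8 bits of the divisor; whether this matches the exact formulation of [22] is an assumption, and the table here stores floating-point values rather than 8-bit words. All names are hypothetical.

```python
M_BITS = 8  # the 16-bit divisor is split into two 8-bit groups

# Table indexed by the upper 8 bits of the divisor; entry i holds 1 / (i * 2^8)^2.
# Entry 0 is a placeholder that is never used for a normalized divisor.
LUT = [0.0] + [1.0 / float((i << M_BITS) ** 2) for i in range(1, 1 << M_BITS)]

def taylor_divide(x, y):
    """Approximate x / y for a 16-bit divisor normalized so that its MSB is set."""
    assert (1 << 15) <= y < (1 << 16)
    y_high = y & 0xFF00      # higher 8 bits, kept in place
    y_low = y & 0x00FF       # lower 8 bits
    # One subtraction, one table lookup, and multiplications: no division at run time.
    return x * (y_high - y_low) * LUT[y_high >> M_BITS]
```

Under these assumptions the relative error equals (y_low / y_high)², which stays below 2^-14 for a normalized 16-bit divisor.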

In this paper, we consider $8 \times 8$ real matrices, as required by the MIMO system with two channel uses per codeword. With a plain triangular architecture, which provides the highest throughput, a new matrix can enter the array after 16 steps (8 for computing the R matrix and 8 for Q), that is, every 48 clock cycles given the three-cycle PE latency. In order to factorize 64 matrices in 28 microseconds we need to maintain the clock period shorter than 9 nanoseconds, while a period of 4.5 nanoseconds is required to factorize 128 matrices.
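The clock period requirement follows from simple arithmetic, sketched below. The calculation assumes that matrices enter the array back to back, 16 steps per matrix and 3 clock cycles per step, and a 28 microsecond window in both cases, as in the paragraph above; the function name is hypothetical.

```python
def max_clock_period_ns(n_matrices, window_us, steps_per_matrix=16, cycles_per_step=3):
    """Longest clock period (in ns) that fits n_matrices in the given time window."""
    total_cycles = n_matrices * steps_per_matrix * cycles_per_step
    return window_us * 1000.0 / total_cycles

period_64 = max_clock_period_ns(64, 28.0)
period_128 = max_clock_period_ns(128, 28.0)
```

The results are roughly 9.1 ns and 4.6 ns, in line with the 9 ns and 4.5 ns figures quoted in the text.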

3.3. Enhanced Sphere Decoder for $E_8$ Lattices

In this section, we address concatenated bandwidth efficient coding schemes for MIMO channels, where a space-time code with nonvanishing determinant is used as inner code and an outer trellis code is concatenated to further increase the reliability of the communication [6].

This TCM exploits the basic idea of partitioning the inner constellation; at each channel use, a signal is selected from one of the partitions. In standard TCM for AWGN channels, the Euclidean distance between points in the same subset is made as large as possible [24]. Full-rank ST code design is instead based on the maximization of the minimum determinant, computed over all pairs of distinct codeword matrices. This pseudo-distance takes the role of the Euclidean distance. In [6], the scheme is optimized using a set partitioning that increases the minimum determinant along the partition chain. The lattice structure of the inner golden code is used, so that sublattices and their cosets serve as partitions. The outer convolutional encoder guarantees that signals are selected properly from different cosets. Among the possible 8-dimensional sublattices considered in GST-TCM, we choose the Gosset lattice $E_8$ (the densest packing in 8 dimensions [25]).

Any received point has to be decoded to one of the 16 possible cosets of $E_8$ composing $\mathbb{Z}^8$. The decoder needs to compute the branch metrics of the inner code to perform Viterbi ML decoding of the concatenated codeword. This is obtained by ML lattice decoding of the received vector in each coset of the sublattice.

In order to decode the $E_8$ lattice, we consider that $E_8 \subset \mathbb{Z}^8$ and adapt the classical sphere decoder (such as that in [14]) operating on $\mathbb{Z}^8$.

Consequently, this decoding problem can be solved by thinking of $E_8$ as a punctured $\mathbb{Z}^8$ lattice and setting proper constraints to discriminate the relevant points within $\mathbb{Z}^8$. This means that, at a given tree level, the integer signal vector cannot assume all values; it is constrained by the selections that have already been made at upper levels.

These constraints can be derived directly from construction A of $E_8$ based on the (8,4,4) extended Hamming code [6]. Let $\mathbf{c}$ denote one of the 16 binary codewords that are used as coset leaders of $2\mathbb{Z}^8$ to obtain $E_8$.

Taking into account that the tree must be traversed starting from the last dimension, we have

If, at a given level, the corresponding bit is free, then the signal can assume any value in the original QAM constellation; otherwise its value is constrained.

In order to perform the ML detection, we have to derive the proper evolution of the received signal among the different sublattices. In particular, we consider the output of the convolutional encoder, which is related to the current state of the encoder, and one of the 16 coset leaders of $E_8$ in $\mathbb{Z}^8$. Combining the encoder output with the coset leader, we obtain a binary vector that gives the 256 distinct coset leaders of $2\mathbb{Z}^8$ in $\mathbb{Z}^8$. Thus, all such vectors identify the actual allowed points inside $\mathbb{Z}^8$. From the practical point of view, the coset is fixed for the considered decoder, while the allowed and forbidden values of the signal depend on the value of the corresponding bit of this binary vector. If the bit is "0", the signal is restricted to one subset of the PAM values; otherwise it is restricted to the complementary subset. The bounds of these sets depend on the constellation used for the transmission. It is worth noting that, when the bit is free, it can assume both the values 0 and 1, allowing the signal to assume any value in the original PAM constellation.

Figure 7 shows levels 3 and 4 of a tree for the sphere decoding of 4-PAM systems: solid lines are admissible edges, while dashed lines correspond to the forbidden ones. For this cross-section, the assumed bits are "1", "0", and "1", resulting in the derived bits "0", "1", and "1". Therefore, only the values [−1, 3] are allowed in this example. At level 3, instead, the bit is free; as a consequence, it can assume both values "0" and "1", and the four branches are all admissible.
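The effect of the constraint on the admissible sons of a node can be sketched as a simple filter on the PAM alphabet. The mapping between integer-grid coordinates and PAM values used below is an assumption, chosen so that a constraint bit equal to 1 reproduces the allowed values [−1, 3] of the 4-PAM example; the function name is hypothetical and this is not the "constraint maker" circuit itself.

```python
def allowed_pam_values(pam_size, constraint_bit=None):
    """PAM values admitted at a tree level.

    constraint_bit: 0 or 1 when the level is constrained, None when the bit is free.
    """
    values = []
    for index in range(pam_size):                 # integer-grid coordinate 0..pam_size-1
        if constraint_bit is None or index % 2 == constraint_bit:
            values.append(2 * index - (pam_size - 1))   # assumed grid-to-PAM mapping
    return values

constrained_one = allowed_pam_values(4, 1)
constrained_zero = allowed_pam_values(4, 0)
free_level = allowed_pam_values(4)
```

A constrained level therefore halves the number of sons to be examined, while a free level keeps the full PAM alphabet.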

The proposed scheme makes it possible to realize, with a single circuit, the branch metric computer unit required by the Viterbi algorithm for decoding the TCM transmission scheme in [6]. Note that, at each stage of the trellis, 16 different decoders are required.

The adopted architecture is very similar to the architecture described in Section 3.1 and in [14]. The only difference is an additional functional block, the "constraint maker," which implements (12).

4. Implementation Results

Sphere decoder performance in the golden code scenario described in Section 1 is reported in Figure 8 in terms of bit error rate (BER) versus SNR, for 4-, 16-, and 64-QAM modulations. Fixed-point results are also plotted for the case of a 16-bit data representation (7 bits for the integer part and 9 for the fractional part): in accordance with [23], these results show that, for this particular application, a 16-bit representation is sufficient to achieve floating-point performance, and it has therefore been adopted for all the processing blocks described here.
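The 16-bit representation can be modeled with a short quantization helper. The sketch assumes a two's complement format in which the sign bit is counted among the 7 integer bits, with rounding to nearest and saturation on overflow; these details are not stated in the text and are assumptions, as are the function names.

```python
def to_fixed(x, int_bits=7, frac_bits=9):
    """Quantize x to a signed fixed-point integer with saturation."""
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, round(x * scale)))

def to_float(q, frac_bits=9):
    """Convert a fixed-point integer back to a real value."""
    return q / (1 << frac_bits)

quantized = to_fixed(0.3)
error = abs(to_float(quantized) - 0.3)
```

Within the representable range the quantization error is at most half a least significant bit, that is 2^-10 with 9 fractional bits.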

The proposed architectures have been synthesized on a 0.13 μm commercial CMOS standard cell technology with Synopsys Design Compiler. The synthesis results are presented in Table 2: the sphere decoder results are obtained with a flexible architecture able to decode 4- to 64-QAM modulations, while the matrix factorization block has been realized with a triangular array architecture (Mmat/s indicates millions of matrices processed per second). Note that the synthesis results differ from those in [14], although they refer to the same implementation, due to the use of different synthesis libraries.

For comparison purposes, the tree search block has been also synthesized on 0.25 μm CMOS Standard cell technology (Table 3): we then compare our architecture to the ML implementation described in [26] and the quasi-ML implementation in [27]. It must be noted that two different implementations are presented in [26], one is ML, while the other has close to ML BER performance: as the latter implementation adopts a completely different approach and maps a suboptimal algorithm, only the first implementation figures are included in Table 3 for comparison purposes.

Analyzing data in Table 3, it can be observed that our rearranged approach for the sphere decoder with a single metric computation per cycle allows a significant complexity reduction (approx. 50% for 16 QAM modulation) with respect to parallel structures. At the same time, thanks to the pipelined architecture, we can achieve a remarkable average decoding throughput without any highly specialized structure. Moreover, our flexible decoder is not limited to a single modulation scheme, but it can adapt to different modulations (4-, 16-, and 64-QAM).

Fair comparisons to other implementations cannot be made for the matrix factorization block, as published solutions adopt completely different architectures. For the sake of completeness, we report here two FPGA developments, [21, 28], which implement the SGR algorithm. The main features of these architectures are summarized in Table 4, together with the synthesis results of our solution mapped onto a Xilinx Virtex4 (xc4vlx200) FPGA device. Both [21, 28] carry out the computation of complex matrices, while we process real-valued ones. This means that, while the single PE complexity will be greater in the complex scenario, the number of PEs the data flow passes through is twice and with the basic TA topology, while for a matrix there will be 8 PEs, and 32 PEs are required for a matrix.

Another difference among these implementations is related to the processing topology; while our solution adopts a TA processing topology with 32 PEs, [28] makes use of a linear array (LA) organization with 4 PEs and two single PEs are used in [21], one for boundary processing and the second one for internal processing.

A further difference with respect to [28] is that in our implementation weight is updated according to (9) while in [28] it is fixed to a constant value.

In conclusion, the standard cell version fully reaches both the 64 matrices in 36 microseconds and the 128 matrices in 28 microseconds goals, and the throughput of the proposed approach compares favourably with that of the other implementations, showing high performance at a limited additional cost. By contrast, the FPGA implementation only reaches the 64 matrices in 36 microseconds goal.

The $E_8$ decoder, instead, adopts the same architecture as the sphere decoder and therefore presents a comparable complexity. A small increase in area is due to the additional "constraint maker" functional block, which brings the overall complexity to 62 kGates and the maximum achievable frequency to 196 MHz.

5. Conclusions

The hardware implementation of key building blocks in a MIMO-OFDM receiver has been presented. The analysis of the blocks shows their high level of complexity, which justifies the ASIC design approach. The sphere decoder architecture handles different modulations without any loss in BER performance, while the proposed matrix factorization algorithm and arrangement achieve the highest throughput specified in the 802.11n standard. Finally, the design of an enhanced sphere decoder, capable of supporting decoding in GST-TCM concatenated schemes, has been proposed.