Multi-input multi-output (MIMO) systems combined
with orthogonal frequency-division multiplexing (OFDM)
have gained wide popularity in wireless applications because they
can provide increased channel capacity and robustness
against multipath fading. However, these advantages
come at the cost of very high processing complexity, and
the efficient implementation of MIMO-OFDM receivers is today
a major research topic. In this paper, efficient architectures
are proposed for the hardware implementation of the main
building blocks of a MIMO-OFDM receiver. A sphere decoder
architecture that adapts to different modulations without any loss in
BER performance is presented, while the proposed matrix factorization
implementation achieves the highest throughput
specified in the IEEE 802.11n standard. Finally, a novel sphere
decoder approach is presented, which allows for the realization of
the new golden space-time trellis coded modulation (GST-TCM)
scheme. Implementation cost and offered throughput are provided
for the proposed architectures synthesized on a 0.13 μm CMOS
standard cell technology or on advanced FPGA devices.
1. Introduction
MIMO-OFDM (multi-input multi-output orthogonal frequency-division multiplexing) is a very promising communication technique that makes it possible to establish very high
throughput and reliable wireless links. In order to achieve this goal,
space-time (ST) codes are used, since they can combine both transmission rate
and reliability enhancement of the communication system. ST codes have been
considered for some recently proposed standards such as IEEE 802.11n WLAN and 802.16e WMAN.
However, the computational complexity of MIMO-OFDM
receivers is much higher than in the single-input single-output (SISO) OFDM
approach; as a consequence the potentials offered by MIMO-OFDM are still far
from being fully exploited in actual implementations.
Figure 1 depicts the structure of a multiple transmit and receive antenna MIMO-OFDM
communication scheme. At the receiving side, after the RF/Analog front-end,
multiple OFDM demodulation stages, implemented as FFT processors (one per
antenna) are allocated, followed by the MIMO signal detector. The adoption of a
full-rate and full-diversity ST code demands specific demapping and decoding
capabilities, which are covered in Figure 1 by the “ST-code decoder and
demapper” block. Finally, a trellis coded modulation (TCM) channel decoder
implements forward error correction.
Figure 1: ST-Code MIMO System.
The MIMO channel is modeled by its impulse response between each transmit-receive antenna pair. Assuming $h_{ij}$ represents the time-varying channel fading coefficient between the $j$th transmit antenna and the $i$th receive antenna, the MIMO channel with $M$ transmit and $N$ receive antennas is described through an $N \times M$ matrix $\mathbf{H}$, where $[\mathbf{H}]_{ij} = h_{ij}$. Transmitted space-time codewords are $M \times T$ matrices $\mathbf{X}$, where $T$ is the number of channel uses required by the ST code. Assuming the “block fading” channel model, each transmitted $\mathbf{X}$ will be affected by an independently varying channel matrix $\mathbf{H}$. Then, the received $N \times T$ matrix is
$$\mathbf{Y} = \mathbf{H}\mathbf{X} + \mathbf{Z}, \tag{1}$$
where $\mathbf{Z}$ is the additive white Gaussian noise matrix with independent, identically distributed complex Gaussian entries.
When data symbols belong to a $Q^2$-QAM modulation, it is convenient to represent the codewords in vectorized form, where the real and imaginary components of the $Q^2$-QAM are separated into two $Q$-PAM modulations, resulting in real component codewords $\mathbf{x}$. Consequently, the channel matrix $\mathbf{H}$ is rearranged into a real-valued matrix and the received matrix is replaced with the real-valued vector $\mathbf{y}$. For a linear ST block code, $\mathbf{x}$ can be obtained as $\mathbf{x} = \mathbf{G}\mathbf{u}$, where $\mathbf{G}$ is the ST-block code generator matrix and $\mathbf{u}$ is the vectorized data vector with entries in $Q$-PAM. Note that the set of vectors $\mathbf{u}$ with $Q$-PAM entries is a hypercubic-shaped constellation carved from a multidimensional integer grid.
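Provided that the channel matrix $\mathbf{H}$ is perfectly known at the receiver, the optimal detector, able to minimize the codeword error rate in a MIMO channel, is the maximum likelihood (ML) detector, which solves the problem
$$\hat{\mathbf{u}} = \arg\min_{\mathbf{u}} \|\mathbf{y} - \mathbf{B}\mathbf{u}\|^2, \tag{2}$$
where $\mathbf{B} = \mathbf{H}\mathbf{G}$ is the product of the real-valued channel matrix and the code generator matrix, $\mathbf{y}$ is the real-valued received vector, and $\mathbf{u}$ ranges over all data vectors with $Q$-PAM entries. The cardinality of the search space depends on the number of receive antennas, the chosen modulation scheme, and the number of channel uses, while the factor 2 in the number of real dimensions comes from the decomposition in real and imaginary components.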
Hereinafter, in order to consider a currently practical situation, we will consider a system with two transmit and two receive antennas and an ST block code with two channel uses ($T = 2$). An example of such a code is the golden code proposed in [1–3] and adopted by the IEEE 802.16e WMAN standard. In this case, the received vector and the data vector are real vectors with 8 components, and the matrix that relates them is a real-valued $8 \times 8$ matrix. Thus, when using 16-QAM symbols, the direct computation of (2) results in the evaluation of $4^8 = 65536$ possible solutions.
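As an illustration of why the exhaustive search does not scale, the following minimal Python sketch enumerates every candidate data vector and keeps the one closest to the received vector. The function names, the argument names `B`, `y`, and `q`, and the toy 2-dimensional matrix are illustrative assumptions, not part of the proposed architecture.

```python
from itertools import product

def pam_alphabet(q):
    """Return the q-PAM alphabet, e.g. [-3, -1, 1, 3] for q = 4."""
    return [2 * i - (q - 1) for i in range(q)]

def ml_exhaustive(B, y, q):
    """Brute-force ML detection: test every data vector u, keep the closest B*u."""
    m = len(B[0])
    best_u, best_d = None, float("inf")
    for u in product(pam_alphabet(q), repeat=m):
        # squared Euclidean distance between y and B*u
        d = sum((y[i] - sum(B[i][j] * u[j] for j in range(m))) ** 2
                for i in range(len(B)))
        if d < best_d:
            best_u, best_d = u, d
    return list(best_u), best_d

# Number of candidates for 16-QAM (4-PAM per real dimension) and 8 real dimensions
search_space = 4 ** 8

# Toy 2-dimensional example (hypothetical values)
u_hat, dist = ml_exhaustive([[2, 1], [0, 1]], [-1, -3], 4)
```

The loop body runs once per candidate, so the 8-dimensional 16-QAM case would require `search_space` metric evaluations per received vector.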
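Due to the high complexity of the exhaustive search, more efficient methods were proposed. Most of these approaches rely on the rearrangement of (2). In particular, a linear transformation such as QR or Cholesky decomposition makes it possible to rewrite the matrix $\mathbf{B}$ that multiplies the data vector $\mathbf{u}$ in (2) as the product $\mathbf{B} = \mathbf{Q}\mathbf{R}$ of two matrices, one of which ($\mathbf{R}$) is upper triangular [4]. Imposing $\mathbf{y}' = \mathbf{Q}^T\mathbf{y}$, (2) can be rewritten as
$$\hat{\mathbf{u}} = \arg\min_{\mathbf{u}} \|\mathbf{y}' - \mathbf{R}\mathbf{u}\|^2, \tag{3}$$
where we have exploited the orthogonality of $\mathbf{Q}$, which leaves the Euclidean norm unchanged.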
One of the most interesting consequences of this interpretation is that the exploration of the constellation lattice can be thought of as a tree traversal. This search tree has one level per real dimension of the data vector, and each node has exactly $Q$ sons, representing the points of the $Q$-PAM in that dimension. This traversal can be done with polynomial average complexity by adopting the so-called sphere decoder (SD) [5].
Recently proposed concatenated ST coding schemes
[6] offer a further
reliability enhancement by adopting a combined forward error correction
approach based on a high rate bandwidth-efficient trellis coded modulation
(TCM) scheme. This Golden ST TCM (GST-TCM) scheme for MIMO provides a
reasonable ML decoding complexity solution by using the Viterbi algorithm and a
branch metric computer based on several parallel sphere decoders. A modified
sphere decoder is required to support this kind of concatenated scheme, which
is an unexplored subject of investigation, from the implementation point of
view.
This paper deals with the implementation issues of the
main processing tasks that enable the development of MIMO receivers. A MIMO
detector is organized in two key processing tasks, matrix factorization and sphere decoding (or tree traversal): we then propose efficient
architectures for these two key functions. The latter function is the core
function of a high performance MIMO detector and its hardware implementation
tends to be critical in terms of both throughput and complexity, especially in
high data rate systems.
The matrix factorization task operates on the lattice generator matrix, that is, the product of the channel matrix and the code generator matrix. Since the code generator matrix is constant, the processing must be performed at the channel estimation update frequency, which can change significantly according to the scenario and is generally one or two orders of magnitude lower than the signaling rate. However, in a MIMO-OFDM scheme, space-time decoding has to be carried out independently on each subcarrier, determining a dramatic growth of the throughput demand even for matrix factorization.
In Section 2, the sphere decoding algorithm is briefly overviewed, while Section 3 deals with the hardware design of three
key building blocks: a sphere decoder, a matrix factorization architecture, and
an enhanced sphere decoder for GST-TCM. Finally, Section 4 points out the
implementation results achieved for the proposed architectures.
2. The Sphere Decoding Algorithm
Sphere decoding algorithms are a family of algorithms
originally proposed to search for the closest point to a given one in a lattice.
Their use in wireless communications was suggested for the first time in
[5], where the lattice
structure of multidimensional constellations is exploited to find the closest
point to the received vector.
When solving the minimization problem (2), sphere
decoding algorithms achieve a polynomial average complexity by exploring only a
subset of the solution space [4].
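In particular, a hypersphere is constructed around the received vector and only points inside it are taken into account. This constraint can be expressed as
$$\|\mathbf{y}' - \mathbf{R}\mathbf{u}\|^2 \le C, \tag{4}$$
where $\mathbf{y}'$ is the rotated received vector, $\mathbf{R}$ is the upper triangular matrix of (3), $\mathbf{u}$ is the candidate data vector, and $C$ is the squared radius of the hypersphere [5, 7, 8].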
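The upper triangular structure of the matrix $\mathbf{R}$ in (3) enables every component to be separately considered for the computation of the distance between the two points. The distance can also be computed recursively as follows. We consider the partial metrics
$$T_k = T_{k+1} + |\psi_k - r_{kk}u_k|^2, \tag{5}$$
where $\psi_k = y'_k - \sum_{j=k+1}^{m} r_{kj}u_j$, with $T_{m+1} = 0$; here $m$ is the number of tree levels, $r_{kj}$ are the entries of $\mathbf{R}$, and $y'_k$ and $u_k$ are the components of the rotated received vector and of the data vector. Since the term $|\psi_k - r_{kk}u_k|^2 \ge 0$ for $k = 1, \ldots, m$, then we have $T_k \ge T_{k+1}$. After $m$ steps, the distance is obtained as $T_1$.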
As an example, a three-level tree for a 4-PAM
modulation is depicted in Figure 2, where $T_k$ is the distance
metric at level $k$ defined in (5).
At every level, the radius constraint (4) must be verified and satisfied, otherwise
the branch is pruned. In general, the radius is progressively reduced every
time a leaf is reached at a distance that is smaller than current radius.
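The pruning rule can be sketched in a few lines of Python. This is a simplified software model under stated assumptions, not the hardware architecture: the initial squared radius is set to infinity, sons are visited in order of increasing metric increment, and the names `sphere_decode`, `R`, `y`, and `alphabet` are illustrative.

```python
def sphere_decode(R, y, alphabet):
    """Depth-first search for u minimizing ||y - R u||^2, with R upper triangular."""
    m = len(R)
    best = {"u": None, "d": float("inf")}  # best leaf so far and current squared radius
    u = [0] * m

    def visit(k, partial):
        # received component at level k after cancelling the already decided symbols
        psi = y[k] - sum(R[k][j] * u[j] for j in range(k + 1, m))
        # visit the sons in order of increasing metric increment ("best first")
        for s in sorted(alphabet, key=lambda a: (psi - R[k][k] * a) ** 2):
            metric = partial + (psi - R[k][k] * s) ** 2
            if metric >= best["d"]:
                break  # prune: the remaining sons are at least as far
            u[k] = s
            if k == 0:
                best["u"], best["d"] = list(u), metric  # leaf reached: shrink radius
            else:
                visit(k - 1, metric)

    visit(m - 1, 0.0)  # the traversal starts from the last dimension
    return best["u"], best["d"]

# Toy example with hypothetical values
u_hat, dist = sphere_decode([[2, 1], [0, 1]], [-0.6, -2.7], [-3, -1, 1, 3])
```

In this toy case the first leaf reached already gives the smallest distance, so every other branch is pruned at the first comparison.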
Figure 2: Tree organization for the sphere
decoder. The vector of symbol values in 4-PAM is [−3, −1, 1, 3].
Several algorithms have been studied in order to make
the tree traversal efficient. The first algorithm, proposed by Fincke and Pohst in [7], requires an explicit choice of
the initial radius. A more efficient solution was proposed by Schnorr and
Euchner (SE) [9]. In
this case, the initial radius is selected as the distance from the
zero-forcing decision-feedback equalization (ZF-DFE) solution and a “depth and best first” traversal of the tree is
performed. Originally conceived for infinite lattices, the SE algorithm was then
adapted to finite lattices [4, 10].
The SE algorithm has intrinsically variable throughput
and this makes it not very suitable for hardware implementation. The key to
make this algorithm efficient or, at least, with a predictable throughput, is
to make an effective pruning. Many theoretical studies in recent literature aim
at reaching this goal [11]. A very interesting approach consists in an effective
column reordering, which uses heuristic methods to reduce the search complexity
with limited performance loss [12]. This technique results in very efficient tree search
circuits but additional area is necessary for the preprocessing phase.
On the contrary, the approach proposed in this paper
is based on the computational complexity reduction of the tree search algorithm
with no column reordering. This solution is suitable for a flexible
implementation that can adapt to different modulation sizes.
3. VLSI Architectures
Implementation architectures for two key building blocks
in MIMO detectors are presented in this section (tree search processing and
matrix factorization). An enhanced sphere decoder is then described to be
applied in the concatenated GST-TCM scheme.
3.1. Tree Search Processing Block
Given the choice of adopting a fully ML detection algorithm for
(2), several implementation options have been proposed in the
literature.
A first classification can be made with respect to the
choice of real- or complex-valued tree construction. In the real case, the tree
is twice as deep as the complex one. In complex trees, on the contrary, every
node has the square of the number of sons with respect to the real tree. As an
example, with the two-antenna, two-channel-use ST code considered here and 16-QAM
modulation, a complex-valued tree construction would lead to a 4-level tree,
where each node has 16 sons, while 8 levels and 4 sons per node appear in the
corresponding real-valued tree. Although [13] demonstrates that a complex-valued tree results in a
lower number of visited nodes, the construction of a real-valued tree allows
for a more flexible solution, adaptable to different modulation schemes.
Another classification criterion is with respect to
the implementation parallelism:
(i) parallelism at the level of tree exploration;
(ii) parallelism at the level of the metric computation for all sons of a given node and in the selection of the most probable son.
The first technique can be adopted only with suboptimal algorithms, while the second
approach is not feasible with large cardinality QAM modulation schemes, as it
implies a large number of concurrent multiplications. Hence, parallelism is not
viable for the implementation of flexible architectures. A serial architecture,
designed for high throughput, can achieve both flexibility and low area cost.
Detailed descriptions of the proposed architecture can
be found in [14, 15]. The proposed
architecture adopts a real-valued tree construction and a serial organization.
The key advantage offered by this choice is the possibility of a run-time
selection of the modulation scheme. The system is furthermore adaptable to
different transmitting schemes, including the golden code, through the use of
some instantiation parameters, which set the datapath width and the
number of levels of the search tree.
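The SE algorithm adopts the “depth and best first” traversal of the tree, and at each level $k$ the minimization of the partial metric $T_k$ is required according to the problem formulation given in (5). The computation of the $T_k$ values for all possible symbols $u_k$ tends to become infeasible when the order of the modulation increases, due to the large number of required operations. The core of the proposed approach is the selection of the $u_k$ that minimizes $T_k$ by means of the division $\psi_k / r_{kk}$, where $\psi_k$ is the received component at level $k$ after cancellation of the symbols already decided at upper levels and $r_{kk}$ is the corresponding diagonal entry of the triangular matrix.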
In particular, the iterative evaluation of (5) is rearranged in two steps. At the first step, the value $\psi_k$ (the level-$k$ received component after cancellation of the symbols decided at upper levels) is received as an input from the previous iteration and the desired symbol $u_k$ for the analyzed node is directly obtained through the division of $\psi_k$ by the diagonal entry $r_{kk}$ of the triangular matrix; moreover, the $\psi$ output for the next level is calculated for the selected $u_k$. The second processing step receives $\psi_k$ and $u_k$ to actually compute the partial metric $T_k$, according to (5). The two operations are performed by units U_psi_Unit and Metric_Compute in Figure 3, where memories required to store $\psi$ amounts and $T$ metrics are also shown.
Figure 3: Sphere decoder block scheme (case of a
node expanded in the depth-first mode, with no pruning).
It is worth noting that the result of the division is rounded to
the closest $Q$-PAM
constellation point. As a consequence, a general purpose hardware divider
is not necessary and the required operation can be executed by means of the
first few steps of a
successive subtraction divider [16]. This divider has a very simple architecture that
employs only shifts and subtractions; although it tends to be very slow for a
complete division, this solution can be effectively used when only a few elementary shift
and subtract operations are required.
In a high throughput sphere decoder, a new metric must be
evaluated at each clock cycle. In order to achieve this target, the two steps
exploit a pipelined architecture. Additionally, an alternative metric must always
be ready, also when a pruning of the tree occurs; therefore, in the proposed
architecture, two “candidate” nodes are selected in parallel when processing
a given father node. The first one is a direct son of the current node,
selected by the U_psi unit of
Figure 3 by descending along the tree. The “alternative” node, selected by U_psi_Step Unit, is placed at a higher
level in the tree and it is chosen when the branch has to be pruned, that is
when the current metric exceeds the best current metric evaluated in the tree
traversal. The procedure adopted to select the alternative node is described
below.
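In the U_psi Unit, the evaluation of the direct son of the current node makes use of the division $\psi_k / r_{kk}$ of the level-$k$ received component by the diagonal entry of the triangular matrix, and the result is approximated either by defect or by excess to the nearest PAM constellation point: the best choice for the symbol $u_k$ is given by (see Figure 4)
$$\hat{u}_k = \frac{\psi_k}{r_{kk}} - \delta, \tag{6}$$
where $\delta$ is the correction term. The sign of $\delta$ is exploited to select the second (and following) nearest point in the PAM constellation, according to the following rule:
$$u_k' = \hat{u}_k + \operatorname{sign}(\delta)\,\Delta, \tag{7}$$
where $\Delta$ is the distance between two consecutive points.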
Figure 4: Method used to select alternative nodes in U_psi_step unit.
Thus, the U_psi_Step Unit simply computes (7) to
find the second most probable value of the symbol at the current level. Figure 4 shows the sequence of alternative nodes
selected at a given tree level, after the occurrence of pruning.
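The order in which alternative nodes are visited at one tree level can be modeled with a short Python sketch. It is a software illustration under stated assumptions: the PAM points are the odd integers with spacing 2, points outside the finite constellation are skipped, and the function name `se_candidates` and its arguments are hypothetical.

```python
def se_candidates(z, q):
    """List the q-PAM points in order of non-decreasing distance from the quotient z."""
    lo, hi, step = -(q - 1), q - 1, 2        # constellation bounds and point spacing
    nearest = 2 * round((z - 1) / 2) + 1     # closest odd integer to z
    nearest = max(lo, min(hi, nearest))      # clip to the finite constellation
    delta = z - nearest                      # correction term
    direction = 1 if delta >= 0 else -1      # its sign selects the side tried first
    out, offset = [nearest], step
    while len(out) < q:
        # same side as the correction term first, then the opposite side
        for cand in (nearest + direction * offset, nearest - direction * offset):
            if lo <= cand <= hi:
                out.append(cand)
        offset += step
    return out

order = se_candidates(0.4, 4)  # quotient 0.4 with 4-PAM
```

For the quotient 0.4 the nearest point is 1, the correction term is negative, and the sequence continues with −1, 3, and −3.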
In summary, we have the following.
(i) The division approach achieves low complexity and flexibility in terms of supported modulation schemes.
(ii) The concurrent evaluation of two “candidate” nodes provides a significant speed-up to the inherently serial SE sphere decoding algorithm and has a limited impact on complexity.
3.2. Matrix Factorization
Understanding the throughput requirements is fundamental in the architectural study of this processing block. The IEEE 802.11n WLAN standard, which adopts space-time coding, implies that a new channel estimation is performed whenever a packet arrives; this means that the number of matrix factorizations ranges from a minimum of 64 in a time period of 36 microseconds,
to a maximum of 128 in 28 microseconds.
In the design of the matrix factorization block, a
first choice between Householder transformations and Givens-rotations-based
algorithms [17] has to
be made. The latter approach results in a sequence of rotation operations that
cancel elements under the main diagonal of the matrix. Givens rotations require
a larger number of floating-point operations compared to Householder
transformations; nevertheless they may be implemented using parallel systolic
arrays and for this reason they are usually preferred for hardware
implementation.
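To make the sequence of rotations concrete, the following Python sketch triangularizes a small real matrix with standard Givens rotations on adjacent rows. It is a plain floating-point reference model, not the systolic or square-root-free implementation discussed in this section; the function name `givens_qr` and the test matrix are illustrative assumptions.

```python
import math

def givens_qr(A):
    """Return (R, Qt) with Qt * A = R upper triangular, using Givens rotations."""
    n = len(A)
    R = [row[:] for row in A]
    Qt = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for col in range(n):
        # cancel the elements under the diagonal, from the bottom row upwards
        for row in range(n - 1, col, -1):
            a, b = R[row - 1][col], R[row][col]
            if b == 0.0:
                continue  # element already zero, no rotation needed
            h = math.hypot(a, b)
            c, s = a / h, b / h  # cosine and sine of the rotation angle
            # apply the same rotation to rows (row-1, row) of R and of Q^T
            for mat in (R, Qt):
                for j in range(n):
                    x, z = mat[row - 1][j], mat[row][j]
                    mat[row - 1][j] = c * x + s * z
                    mat[row][j] = -s * x + c * z
    return R, Qt

R, Qt = givens_qr([[3.0, 1.0], [4.0, 2.0]])
```

Each rotation needs one square root and two divisions to obtain the sine and cosine, which is the cost that square-root-free variants aim to remove.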
Every single processing element (PE) of the systolic
array must perform the angle calculation and the rotation to cancel the matrix
elements. Several alternatives exist to accomplish these two tasks, and the most common ones are
(1) computation of sine and cosine of the angle by means of operations including square roots and divisions;
(2) direct angle calculation and rotation using CORDIC processors [18].
The main advantage of the sine and cosine approach is that primitives can be optimized resulting in an
efficient, although expensive, implementation. The second technique is less expensive, but outputs are generated with longer latencies and data dependency between operations. The very high throughput required by this application can hardly be achieved by iterative CORDIC-based algorithms. Other alternatives have to be explored to reduce the latency of every single processor. Among the square root-free algorithms, the squared Givens rotations (SGR) proposed by
Döhler [19] constitute
a good compromise between complexity and speed [20, 21].
Let us indicate with the row of an matrix, where a
0 must be introduced in the th position and
with another row
having the same number of leading zeros; the standard Givens rotations (StdGR)
algorithm employs this set of updating equations to cancel the element :
The SGR algorithm takes advantage of the observation
that introduces the
matrix and exploits
the relations and . Then, to simplify the notation, the new vectors and are introduced
for some . After some algebra, we can express (8) with a new
set of updating equations:
When compared to StdGR, SGR algorithm shows half the
number of multiplications and no square-root operation. The updating sequence
can be arranged in a systolic array of PEs performing the aforementioned
computations.
The PE array
can be arranged according to different structures, namely the triangular (TA),
square, and linear (LA) shapes: each of them shows a different percentage of PE
reuse and a different throughput. Slightly different functions are then
associated in the array organization to boundary and internal PEs.
Figure 5 pictures a generic systolic array layout,
able to perform QR decomposition of a matrix. The
identity matrix must enter the systolic array immediately after the matrix to
be processed, in order to produce the orthogonal factor $\mathbf{Q}$. During
the processing of the input matrix, the coefficients of the triangular factor $\mathbf{R}$ are already
computed and stored in the internal registers.
Figure 5: Systolic array for QR decomposition of a matrix.
Depending on (9), boundary and internal processing
elements must behave differently when a diagonal element of the matrix enters a node. In Table 1, the computations performed by the nodes in the different
operating modes are listed. In the table, Reg and Reg2 are two registers needed
to store the parameters between different steps. The subscript in indicates that a parameter originates
from the preceding PE in accordance with the connections in Figure 5, while
the subscript out indicates that a
parameter originates in the current PE. It must also be noted that the
parameter is updated only
in diagonal mode, while in the other modes it maintains the registered value.
Table 1: Operations performed by the PE's.
The internal processing element (IPE) appears to be
the most computationally intensive block of the entire system. Figure 6 depicts
the architecture of the IPEs derived from Table 1. Although the divisor has a
latency of two clock cycles and two divisions are needed in the diagonal mode,
a proper overlapping with the nondiagonal mode guarantees a total latency of
three clock cycles.
Figure 6: Block diagram of internal PE.
The method proposed in [22] is adopted to realize the division operation. Using a Taylor series, the divisor $Y$, expressed on $2m$ bits, is
decomposed into two $m$-bit groups, the higher bits $Y_h$ and the lower bits $Y_l$. Since $Y_h \gg Y_l$, we can write $X/Y \approx X(Y_h - Y_l)/Y_h^2$ with maximum fractional error $Y_l^2/Y_h^2 < 2^{-2m}$ for a divisor normalized to $[1, 2)$. This divider takes two clock cycles to complete the
division on 16 bit fixed-point data [23]; it requires a multiplier, an adder/subtracter, and a
LUT with 256 8-bit entries to store the inverse of $Y_h^2$. The overall complexity of the internal PE is
therefore given by two 16 bit multipliers, two adders/subtracters and a LUT.
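The behavior of this kind of table-based divider can be checked with a small floating-point model. The sketch below assumes a divisor normalized to the interval [1, 2) and split into an upper part selected by 8 bits and a lower remainder; the table entries are kept at full floating-point precision rather than 8 bits, and all names (`M_BITS`, `LUT`, `approx_divide`) are illustrative.

```python
M_BITS = 8  # half of the divisor word length (16-bit divisor, two 8-bit groups)

# Table with 2**M_BITS entries holding 1 / Yh**2 for Yh in [1, 2)
LUT = [1.0 / (1.0 + i / 2 ** M_BITS) ** 2 for i in range(2 ** M_BITS)]

def approx_divide(x, y):
    """Approximate x / y for 1 <= y < 2 as x * (Yh - Yl) / Yh**2."""
    index = int((y - 1.0) * 2 ** M_BITS)  # upper bits select the table entry
    y_h = 1.0 + index / 2 ** M_BITS       # higher-order part of the divisor
    y_l = y - y_h                         # lower-order part, 0 <= Yl < 2**-M_BITS
    return x * (y_h - y_l) * LUT[index]   # one subtraction, two multiplications

quotient = approx_divide(3.0, 1.37)
relative_error = abs(quotient - 3.0 / 1.37) / (3.0 / 1.37)
```

The relative error equals the squared ratio of the lower part to the upper part of the divisor, which stays below 2 to the power of −16 in this configuration.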
In this paper, we considered real $8 \times 8$ matrices,
as required by the $2 \times 2$ MIMO system
with two channel uses per codeword. With a plain triangular architecture, which
yields the highest throughput, a new matrix can enter the array after
16 steps (8 for computing the triangular factor $\mathbf{R}$ and 8
for the orthogonal factor $\mathbf{Q}$), that is,
every 48 clock cycles. In order to factorize 64 matrices in 28 microseconds we need to
maintain the clock period shorter than 9 nanoseconds, while a period of 4.5 nanoseconds is required to factorize 128 matrices.
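The clock period figures quoted above follow from a simple budget: the time window divided by the total number of clock cycles. The Python sketch below reproduces the arithmetic; the helper name `max_clock_period_ns` is hypothetical, and reading the 48 cycles as 16 steps of 3 cycles each is an assumption based on the PE latency stated earlier.

```python
def max_clock_period_ns(num_matrices, window_us, cycles_per_matrix=48):
    """Longest clock period (ns) that fits num_matrices factorizations in window_us."""
    return window_us * 1000.0 / (num_matrices * cycles_per_matrix)

steps, cycles_per_step = 16, 3               # assumed: 16 array steps, 3-cycle latency
cycles_per_matrix = steps * cycles_per_step  # 48 clock cycles per matrix

period_64 = max_clock_period_ns(64, 28, cycles_per_matrix)    # about 9.11 ns
period_128 = max_clock_period_ns(128, 28, cycles_per_matrix)  # about 4.56 ns
print(round(period_64, 2), round(period_128, 2))
```

Both values are slightly above the rounded 9 and 4.5 nanosecond targets, so meeting those targets leaves a small margin.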
3.3. Enhanced Sphere Decoder for Lattices
In this section, we address concatenated bandwidth
efficient coding schemes for MIMO channels, where a space-time code with
nonvanishing determinant is used as inner code and an outer trellis code is
concatenated to further increase the reliability of the communication [6].
This TCM exploits the basic idea of partitioning the inner constellation; at each channel use, a signal is selected from one of the partitions. In standard TCM for AWGN channels, the Euclidean distance between points in the same subset is made as large as possible [24]. Full rank ST code design is based on the maximization of the minimum determinant
$$\Delta_{\min} = \min_{\mathbf{X} \ne \mathbf{X}'} \left|\det(\mathbf{X} - \mathbf{X}')\right|^2, \tag{10}$$
where $\mathbf{X}$, $\mathbf{X}'$ are distinct codeword matrices. This pseudo-distance replaces the role of the Euclidean distance. In [6], $\Delta_{\min}$ is optimized using set partitioning that increases the minimum determinant within the partitions. The lattice structure of the inner golden code is used, so that sublattices and their cosets are used as partitions. The outer convolutional encoder guarantees that signals are selected properly from different cosets. Among the possible 8-dimensional sublattices considered in GST-TCM, we choose the Gosset lattice $E_8$ (the densest packing in 8 dimensions [25]).
Any received point has to be decoded to one of the 16 possible cosets of $E_8$ compounding $\mathbb{Z}^8$. The decoder needs to compute the branch metrics of the inner code to perform Viterbi ML decoding of the concatenated codeword. This is obtained by ML lattice decoding of the received vector in each coset of the sublattice.
In order to decode the $E_8$ lattice, we consider that $2\mathbb{Z}^8 \subset E_8 \subset \mathbb{Z}^8$ and adapt the classical sphere decoder (as that in [14]) operating on $\mathbb{Z}^8$.
Consequently, this decoding problem can be solved by thinking of $E_8$ as a punctured $\mathbb{Z}^8$ lattice and setting proper constraints to discriminate the relevant points within $\mathbb{Z}^8$. This means that at a given tree level, the integer signal vector cannot assume all values; actually it is constrained by the selections that have already been made at upper levels.
These constraints can be derived directly from the construction A of $E_8$ based on the (8,4,4) extended Hamming code [6]. Let $\mathbf{c}$ denote one of the 16 binary codewords that are used as coset leaders of $2\mathbb{Z}^8$ to obtain $E_8$.
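Construction A can be illustrated with a short Python sketch: an integer vector belongs to the lattice when its componentwise reduction modulo 2 is a codeword. The generator matrix below is one possible form of the (8,4,4) code (the first-order Reed-Muller form) and is an assumption; the code and bit ordering actually used in [6] may differ, and the names `G`, `CODEWORDS`, and `in_construction_a` are illustrative.

```python
from itertools import product

# One possible generator matrix of the (8,4,4) extended Hamming code
# (assumed form; the ordering used in the referenced scheme may differ)
G = [
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]

# The 16 binary codewords, obtained from all 4-bit messages
CODEWORDS = {
    tuple(sum(b * g for b, g in zip(bits, col)) % 2 for col in zip(*G))
    for bits in product((0, 1), repeat=4)
}

def in_construction_a(x):
    """True if the integer vector x equals a codeword plus an element of 2Z^8."""
    return tuple(v % 2 for v in x) in CODEWORDS

allowed = in_construction_a([3, -1, 1, 1, 0, 0, 0, 0])   # reduces to 11110000
rejected = in_construction_a([1, 0, 0, 0, 0, 0, 0, 0])   # weight 1 is not a codeword
```

Since only 16 of the 256 binary patterns are codewords, the parity of the symbols chosen at the upper tree levels restricts the values that remain admissible at the lower ones.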
Taking into account that the tree must be traversed
starting from the last dimension, we have
If, at level , is , then the signal can assume any value in the original
QAM constellation, otherwise its value is constrained.
In order to perform the ML detection, we have to
derive the proper evolution of the received signal among the different
sublattices. In particular, we can define as the output
of the convolutional encoder, which is related to the current state of the
encoder, and as one of the
16 coset leaders of in . Combining with the coset
leader , we obtain a binary vector that gives the
256 distinct coset leaders of in . Thus, all vectors
identify the actual allowed points inside . From the practical point of view, is fixed for
the considered decoder, while
the allowed and interdicted values of the signal depend on the
value of . If = “0,” then can take the
values , otherwise it
can take the values ; the bounds
of these sets depend on the constellation used for the transmission. It is worth
noting that, when is , can assume both
the values 0 and 1, leading to assume any
value in the original PAM constellation.
Figure 7 shows levels 3 and 4 of a tree for the sphere
decoding of 4-PAM systems: solid lines are practicable edges, while dashed
lines correspond to the interdicted ones. For this cross-section, we assume = “1,” = “0,” and = “1,”
resulting in = “0,” = “1,” and = “1.”
Therefore, values [−1, 3] are allowed in this example. At level 3, instead, is free, and as
a consequence can assume both
values “0” and “1” and the four branches are all admissible.
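Figure 7: Cross-section of the search tree at levels 4 and 3, with practicable (solid) and interdicted (dashed) edges for the bit assignment discussed in the text. The constraint bits are obtained from the output of the convolutional encoder, which represents a coset leader, combined with the coset leader fixed for the considered decoder.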
The proposed scheme makes it possible to realize with a single
circuit the branch metric computer unit required in the Viterbi algorithm
necessary for the decoding of the TCM
transmission scheme in [6]. Note that, at each stage of the trellis, 16 different decoders are
required.
The adopted architecture is very similar to the
architecture described in Section 3.1 and in [14]. The only difference is the additional functional
block, the “constraint maker,” able to realize (12).
4. Implementation Results
Sphere decoder performance in the golden code
scenario described in Section 1 is reported in Figure 8 in terms of bit error
rate (BER) versus SNR, for 4-, 16-, and 64-QAM modulations. Fixed-point results
are also plotted for the case of a 16-bit data representation (7 bits for
integer and 9 for fractional parts): in accordance with [23], these results prove that, for this particular
application, 16-bit representation is sufficient to achieve the floating-point
performances and thus it has been adopted for all the processing blocks here
described.
Figure 8: Proposed system performance with different modulations.
The proposed architectures have been synthesized on a 0.13 μm commercial CMOS standard cell technology with Synopsys Design Compiler. The synthesis
results are presented in Table 2: the sphere decoder synthesis results here
listed are obtained with a flexible architecture able to decode 4 to 64 QAM
modulations, while the matrix factorization block has been realized with a
triangular array architecture (Mmat/s indicates millions of matrices processed
in a second). It must be noticed that synthesis results differ from those in
[14], although
referred to the same implementation, due to the use of different synthesis
libraries.
Table 2: Synthesis results at 0.13 μm technology for SD and matrix factorization blocks.
For comparison purposes, the tree search block has
been also synthesized on 0.25 μm CMOS Standard
cell technology (Table 3): we then compare our architecture to the ML
implementation described in [26] and the quasi-ML implementation in [27]. It must be noted that two different implementations are presented in [26], one is ML, while the other has close to ML BER performance: as the latter implementation adopts a completely different approach and maps a suboptimal algorithm, only the first implementation figures are included in Table 3 for comparison purposes.
Table 3: Comparison results for SD building block.
Analyzing data in Table 3, it can be observed that our rearranged approach for the sphere decoder with a single metric computation per cycle allows a significant complexity reduction (approx. 50% for 16 QAM
modulation) with respect to parallel structures. At the same time, thanks to
the pipelined architecture, we can achieve a remarkable average decoding
throughput without any highly specialized structure. Moreover, our flexible
decoder is not limited to a single modulation scheme, but it can adapt to
different modulations (4-, 16-, and 64-QAM).
Fair comparisons to other implementations cannot be
made for the matrix factorization block, as published solutions adopt
completely different architectures. For the sake of completeness, we report
here two FPGA developments, [21, 28], which implement the SGR algorithm. The main features
of these architectures are summarized in Table 4, together with the synthesis
results of our solution mapped onto a Xilinx Virtex4 (xc4vlx200) FPGA device.
Both [21, 28] carry out the
computation of complex
matrices, while we process real-valued
ones. This means that, while the single PE complexity will be greater in the
complex scenario, the number of PE the data flow pass through is twice and with
the basic TA topology, while for a matrix there will be 8 PEs, and 32 PEs
are required for a matrix.
Table 4: FPGA synthesis results for matrix factorization building block.
Another difference among these implementations is
related to the processing topology; while our solution adopts a TA processing
topology with 32 PEs, [28] makes use of a linear array (LA) organization with 4
PEs and two single PEs are used in [21], one for boundary processing and the second one for
internal processing.
A further difference with respect to [28] is that in our
implementation weight is updated
according to (9) while in [28] it is fixed to a constant value.
In conclusion, the standard cell version fully reaches
both the 64 matrices in 36 microseconds and the 128
matrices in 28 microseconds goals, and the
throughput of the proposed approach compares favourably to that of the other
implementations, showing high performance at a limited additional cost. In
contrast, the FPGA implementation only reaches the 64 matrices in 36 microseconds goal.
The enhanced decoder for GST-TCM,
instead, adopting the same architecture as the sphere decoder,
presents a comparable complexity. A small increase in area is due to the
addition of the functional block “constraint maker,” leading the overall
complexity to 62 kGates, and the maximum achievable
frequency to 196 MHz.
5. Conclusions
The hardware implementation of key building blocks in a MIMO-OFDM receiver has been presented. The analysis of the blocks shows their high level of complexity, which justifies the ASIC design approach. The sphere decoder architecture supports different modulations without any loss in BER performance,
while the proposed matrix factorization algorithm and arrangement
achieve the highest throughput specified in the 802.11n standard. Finally, the
design of an enhanced sphere decoder, capable of supporting decoding in
ST-TCM concatenated schemes, has been proposed.