Abstract

This paper presents a VLSI architecture for the sub-optimal hard-output Vertical-Bell Laboratories Layered Space-Time (V-BLAST) algorithm in the context of Spatial Multiplexing Multiple-Input Multiple-Output (SM-MIMO) systems operating in Rayleigh fading channels. The design and implementation of its data-path and control-path components on FPGA devices are described. Synthesis results, bit error rate performance, and data throughput are reported.

1. Introduction

Multiple-Input Multiple-Output (MIMO) communication systems enhance spectral efficiency and bit error rate (BER) performance over wireless communication links [1–3]. MIMO has already been adopted as the transmission scheme for emerging wireless communication standards such as 802.11n (WiFi), 802.16d/e (WiMAX), and 802.11ac (multi-user MIMO WLAN) [4]. Digital signal processing (DSP) algorithms for symbol decoding in these systems, operating in Rayleigh fading channels, pose design trade-offs among BER performance, data throughput, and complexity. Sub-optimal hard-output Spatial-Multiplexing MIMO (SM-MIMO) demodulation techniques [1, 2] offer low complexity with very high data throughput, at the cost of a BER performance degradation with respect to Maximum-Likelihood (ML) performance. The V-BLAST (Vertical-Bell Laboratories Layered Space-Time) algorithm [5, 6] is an adequate sub-optimal hard-output SM-MIMO demodulation technique because of these properties. To the best of the authors' knowledge, previous state-of-the-art VLSI implementations of hard-output V-BLAST-based sub-optimal SM-MIMO demodulation architectures are reported in [7, 8]. Although they target low power consumption and cost-effectiveness, these ASIC-based (Application-Specific Integrated Circuit) approaches offer little flexibility for prototyping this kind of DSP solution. In this work, a VLSI architecture for the sub-optimal hard-output V-BLAST detection algorithm is presented. The contribution of this paper is a novel VLSI architecture of the V-BLAST algorithm, implemented on FPGA devices, that meets SM-MIMO demodulation requirements regarding BER performance, hardware complexity, and data throughput while operating in Rayleigh fading channels, and that behaves competitively against earlier approaches. The paper is organized as follows: Section 2 presents the MIMO communication model. The V-BLAST algorithm is presented in Section 3. Section 4 describes the proposed architecture for the V-BLAST algorithm. Implementation results and a comparison analysis are presented in Section 5. Conclusions are given in Section 6.

2. MIMO Communication Model

The MIMO communication model consists of an antenna array of $N_T$ elements at the transmitter end and $N_R$ elements at the receiver, in the presence of a ccs-iid AWGN (circularly symmetric complex, independent and identically distributed Additive White Gaussian Noise) Rayleigh fading channel [1–3]. The information signal vector $\mathbf{s}$ of dimension $N_T$, whose entries are symbols drawn from an $M$-QAM ($M$-ary Quadrature Amplitude Modulation) constellation (also known in this context as a Gaussian finite-integer lattice), is transmitted through these antennas. The received signal vector $\mathbf{x}$ of dimension $N_R$ can be described as

$$\mathbf{x} = \mathbf{H}\mathbf{s} + \mathbf{n}, \tag{1}$$

where $\mathbf{s}$ and $\mathbf{x}$ were defined above, $\mathbf{n}$ is an $N_R$-dimensional AWGN vector, and $\mathbf{H}$ is the $N_R \times N_T$ MIMO channel matrix (entry $h_{mn}$ of $\mathbf{H}$ corresponds to the fading between the $m$th receiver and the $n$th transmitter antennas). System statistics are assumed to be invariant during a MIMO channel realization [1]. Without loss of generality, $N_T = N_R$ will be considered in the sequel. Applying the QR decomposition [9, 10] to $\mathbf{H}$ in (1), that is, $\mathbf{H} = \mathbf{Q}\mathbf{R}$, yields

$$\tilde{\mathbf{x}} = \mathbf{R}\mathbf{s} + \tilde{\mathbf{n}}, \tag{2}$$

where $\mathbf{Q}$ and $\mathbf{R}$ are complex orthogonal (unitary) and upper-triangular matrices, respectively; moreover, $\tilde{\mathbf{x}} = \mathbf{Q}^{H}\mathbf{x}$ and $\tilde{\mathbf{n}} = \mathbf{Q}^{H}\mathbf{n}$, where $(\cdot)^{H}$ is the conjugate-transpose operator. Clearly,

$$\mathbf{Q}^{H}\mathbf{Q} = \mathbf{I}, \tag{3}$$

which reveals that $\mathbf{Q}^{H}$ is equivalent to the inverse of $\mathbf{Q}$, that is, $\mathbf{Q}^{H} = \mathbf{Q}^{-1}$. The problem to solve in this MIMO communication scenario is to find the transmitted vector $\mathbf{s}$ among $M^{N_T}$ possible candidates, given that vector $\mathbf{x}$ was received and matrix $\mathbf{H}$ has been accurately estimated. Because the number of candidates grows exponentially with $N_T$, an exhaustive search is prohibitively complex. A fast decoding procedure alleviates this complexity constraint by exploiting the upper-triangular structure of $\mathbf{R}$ in (2) under a successive interference cancellation (SIC) strategy [5, 6], thus yielding high data throughput for symbol decoding. That is why sub-optimal decoding algorithms are preferred. One of the best sub-optimal hard-output decoding algorithms is V-BLAST, which is explained next.
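To make the cost of the exhaustive search concrete, the following minimal sketch simulates the linear model (received vector = channel matrix times transmitted vector plus noise) and performs a brute-force ML search. The function names (`rayleigh_channel`, `transmit`, `ml_detect`, `search_size`), the QPSK alphabet used in the demonstration, and the unit-variance channel normalization are illustrative assumptions and are not taken from the paper.

```python
import itertools
import random

QPSK = (1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j)  # small alphabet, for illustration only


def rayleigh_channel(n_rx, n_tx, rng):
    """Channel matrix with i.i.d. circularly symmetric complex Gaussian entries."""
    sigma = 0.5 ** 0.5  # unit variance per complex entry (assumed normalization)
    return [[complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
             for _ in range(n_tx)] for _ in range(n_rx)]


def transmit(H, s, noise_std, rng):
    """Received vector: H times s plus complex AWGN with the given per-axis std."""
    return [sum(h * sk for h, sk in zip(row, s))
            + complex(rng.gauss(0.0, noise_std), rng.gauss(0.0, noise_std))
            for row in H]


def search_size(M, n_tx):
    """Number of candidate vectors an exhaustive ML search must visit."""
    return M ** n_tx


def ml_detect(H, x, constellation):
    """Brute-force ML detection: minimize the squared distance over all candidates."""
    n_tx = len(H[0])

    def cost(candidate):
        return sum(abs(xk - sum(h * ck for h, ck in zip(row, candidate))) ** 2
                   for row, xk in zip(H, x))

    return min(itertools.product(constellation, repeat=n_tx), key=cost)
```

For example, `search_size(16, 4)` evaluates to 65536 candidates for 16-QAM with four transmit antennas, which illustrates why the exhaustive search quickly becomes impractical in hardware.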

3. The V-BLAST Algorithm

The main idea behind the V-BLAST algorithm, as a sub-optimal hard-output SM-MIMO demodulation technique, is that, instead of performing an exhaustive search, symbol decoding of $\mathbf{s}$ is performed in an ordered, iterative, back-propagation fashion (also known as OSIC: Ordered Successive Interference Cancellation), in which noise (associated with $\mathbf{n}$) and co-channel interference (related to elements in $\mathbf{H}$) are treated through an SNR (Signal-to-Noise Ratio) optimization criterion that determines the order in which the symbol entries of $\mathbf{s}$ are decoded [5, 6]. At each iteration, one entry of $\mathbf{s}$ is decided by slicing to an $M$-QAM value at the receiver end, assuming that only one transmitter element possesses the highest SNR (a choice based on the optimization criterion); the remaining transmitter elements are treated as interference (besides the presence of AWGN). With these ideas in place, the V-BLAST algorithm is defined by the following steps (Steps 1–8).

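Step 1 (initial conditions). Perform the assignments

$$\mathbf{x}_1 = \mathbf{x}, \qquad \mathbf{H}_1 = \mathbf{H}. \tag{4}$$

Let $i = 1$, and let $\mathbf{p}_1 = [1, 2, \ldots, N_T]$ be an index vector.

Step 2 (pre/post-detection). While $i \le N_T$, and for $j = 1, \ldots, N_T - i + 1$, an SNR optimization criterion dictates the order in which the symbol entries of $\mathbf{s}$ are decoded according to index $k_i$. The criterion is stated as

$$j^{\star} = \arg\min_{j} \left\| (\mathbf{G}_i)_j \right\|^2, \tag{5}$$

where the rows $(\mathbf{G}_i)_j$ are obtained from the Generalized-Inverse (or Left-Pseudoinverse) [11] of the deflated versions $\mathbf{H}_i$ of the MIMO channel matrix, that is,

$$\mathbf{G}_i = \mathbf{H}_i^{+}, \tag{6}$$

with

$$\mathbf{H}_i^{+} = \left(\mathbf{H}_i^{H}\mathbf{H}_i\right)^{-1}\mathbf{H}_i^{H}. \tag{7}$$

Let $k_i$ be the $j^{\star}$-th element of $\mathbf{p}_i$.

Step 3 (nulling). Co-channel interference is mitigated by the use of a nulling vector $\mathbf{w}_{k_i}$, obtained from the rows of (7), through the operation

$$y_{k_i} = \mathbf{w}_{k_i}\,\mathbf{x}_i, \tag{8}$$

with

$$\mathbf{w}_{k_i} = (\mathbf{G}_i)_{j^{\star}}, \tag{9}$$

where $y_{k_i}$ is the signal to be sliced.

Step 4 (quantization). A slicing operator $\mathcal{Q}(\cdot)$ (which maps the signal onto a constellation point of the $M$-QAM modulation lattice) is applied to (8), yielding

$$\hat{s}_{k_i} = \mathcal{Q}\left(y_{k_i}\right). \tag{10}$$

This outcome is the $k_i$-th entry of the decoded signal vector $\hat{\mathbf{s}}$. If $i = N_T$, go to Step 8; otherwise, go to Step 5.

Step 5 (cancellation). The contribution of the symbol decided through (8)–(10) at iteration $i$ is cancelled out of the received vector according to

$$\mathbf{x}_{i+1} = \mathbf{x}_i - \hat{s}_{k_i}\,(\mathbf{H})_{k_i}, \tag{11}$$

where $(\mathbf{H})_{k_i}$ is the $k_i$-th column of $\mathbf{H}$.

Step 6 (updating). By removing, respectively, the $j^{\star}$-th element of $\mathbf{p}_i$ and the $k_i$-th column of $\mathbf{H}$, the deflated version of the MIMO channel matrix is created and the index vector is reshaped:

$$\mathbf{H}_{i+1} = \mathbf{H}_i \text{ with column } (\mathbf{H})_{k_i} \text{ removed}, \tag{12}$$

$$\mathbf{p}_{i+1} = \mathbf{p}_i \text{ with element } k_i \text{ removed}. \tag{13}$$

Step 7. Set $i \leftarrow i + 1$ and go to Step 2.

Step 8. $\hat{\mathbf{s}}$ is the decoded estimate of the transmitted vector.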
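Steps 1–8 can be sketched in software as follows. This is a floating-point reference sketch under stated assumptions, not the hardware implementation: the received vector is called `x`, the channel matrix `H`, and the decoded vector `s_hat`; indices are 0-based; the deflation removes columns explicitly; and the slicer assumes an unnormalized 16-QAM grid with per-axis levels ±1 and ±3. All helper names (`hermitian`, `matmul`, `inverse`, `left_pinv`, `slice_16qam`, `vblast_detect`) are hypothetical.

```python
LEVELS = (-3, -1, 1, 3)  # per-axis amplitudes of an unnormalized 16-QAM grid (assumed)


def hermitian(A):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[A[r][c].conjugate() for r in range(len(A))] for c in range(len(A[0]))]


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def inverse(A):
    """Gauss-Jordan inverse with partial pivoting (A must be non-singular)."""
    n = len(A)
    M = [[complex(v) for v in row] + [complex(r == c) for c in range(n)]
         for r, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


def left_pinv(H):
    """Left pseudoinverse (H^H H)^-1 H^H of a full-column-rank matrix."""
    Hh = hermitian(H)
    return matmul(inverse(matmul(Hh, H)), Hh)


def slice_16qam(y):
    """Map y to the nearest point of the assumed 16-QAM grid."""
    re = min(LEVELS, key=lambda v: abs(y.real - v))
    im = min(LEVELS, key=lambda v: abs(y.imag - v))
    return complex(re, im)


def vblast_detect(H, x):
    """Ordered successive interference cancellation following Steps 1-8."""
    n_tx = len(H[0])
    p = list(range(n_tx))                  # Step 1: index vector (0-based)
    Hi = [list(row) for row in H]
    xi = list(x)
    s_hat = [0j] * n_tx
    while p:
        G = left_pinv(Hi)                  # Step 2: generalized inverse
        # Smallest row norm corresponds to the highest post-detection SNR.
        j = min(range(len(G)), key=lambda r: sum(abs(g) ** 2 for g in G[r]))
        k = p[j]
        y = sum(w * v for w, v in zip(G[j], xi))                 # Step 3: nulling
        s_hat[k] = slice_16qam(y)                                # Step 4: slicing
        xi = [v - s_hat[k] * row[j] for v, row in zip(xi, Hi)]   # Step 5: cancellation
        Hi = [row[:j] + row[j + 1:] for row in Hi]               # Step 6: deflation
        p.pop(j)                                                 # Step 6: index update
    return s_hat
```

In the noiseless case the nulling step recovers each symbol exactly (up to rounding), so `vblast_detect(H, x)` returns the transmitted vector whenever `H` has full column rank. The final cancellation in the last pass is redundant; the algorithm as stated skips it by jumping to Step 8.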

4. Architecture Proposal

The block diagram of the proposed architecture for the sub-optimal hard-output V-BLAST-based SM-MIMO algorithm is shown in Figure 1. The overall architecture consists of two components: the Data-Path (DP), made up of the processing elements that implement all necessary mathematical operations, and the Control-Path (CP), which provides the signaling that synchronizes the data flow for the appropriate decoding of vector $\mathbf{s}$ (represented by output $\hat{\mathbf{s}}$). Figure 1 also shows the system clock and the inputs $\mathbf{H}$ and $\mathbf{x}$. Additional external signals are employed for the initialization of V-BLAST operations through a reset signal, for the detection of valid information at the inputs (signal DV_inputs), and for the indication of valid information at the output (flag DV_output). Recall that the V-BLAST architecture is designed to decode symbols of $\mathbf{s}$ drawn from a 16-QAM constellation in AWGN Rayleigh fading channels, for the antenna configuration adopted in this work. The design of the data-path and control-path components is detailed in the subsequent sections.

4.1. Finite-Precision Analysis

The configuration of the DP component in the V-BLAST architecture follows design specifications based on fixed-point arithmetic for the inputs $\mathbf{H}$ and $\mathbf{x}$ (results were obtained from floating-point and fixed-point MATLAB simulations). Figure 2 shows the BER performance of the V-BLAST algorithm under these specifications. The ML solution (ML-D legend, red line) gives the optimal performance. The figure shows that a 16-bit fixed-point word length, applied to both the real and imaginary parts of each input entry, gives an acceptable performance of the V-BLAST algorithm as compared to its floating-point model (blue and blue-dotted lines), with a loss of less than 0.01 dB. In contrast, reducing the precision to an 8-bit word length caused a noticeable performance degradation of more than 1 dB (black-dotted line).
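The effect of the word length can be illustrated with a simple quantization sketch. The paper does not state the fixed-point format here, so the split between integer and fractional bits below (`frac_bits`), the round-to-nearest rule, and the saturation behavior are assumptions chosen only for illustration; `quantize` and `quantize_complex` are hypothetical helper names.

```python
import math


def quantize(value, word_bits, frac_bits):
    """Round to a signed two's-complement fixed-point grid and saturate."""
    scale = 1 << frac_bits
    lowest = -(1 << (word_bits - 1))
    highest = (1 << (word_bits - 1)) - 1
    code = math.floor(value * scale + 0.5)      # round half up
    code = max(lowest, min(highest, code))      # saturate on overflow
    return code / scale


def quantize_complex(z, word_bits, frac_bits):
    """Quantize the real and imaginary parts independently."""
    return complex(quantize(z.real, word_bits, frac_bits),
                   quantize(z.imag, word_bits, frac_bits))


# Example: the same value on an assumed 16-bit grid and an assumed 8-bit grid.
err_16 = abs(quantize(0.3, 16, 12) - 0.3)
err_8 = abs(quantize(0.3, 8, 4) - 0.3)
```

With these assumed formats the 8-bit representation has a much larger rounding error than the 16-bit one, which is the mechanism behind the BER gap between the two word lengths.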

4.2. Data-Path Architecture

The architecture of the DP component is illustrated in Figure 3 and consists of the following elements: (i) data multiplexers MR and MYG, which select between $\mathbf{H}$ and $\mathbf{H}_i$ and between $\mathbf{x}$ and $\mathbf{x}_i$, respectively; (ii) registers REG_Hi and REG_Xi, which store, respectively, $\mathbf{H}_i$ in (6) and $\mathbf{x}_i$ in (11); (iii) a block InviPseudo for matrix inversion and left-pseudoinversion [9–11] as presented in (7); (iv) a block P/P-D that implements the optimization criterion (5); (v) a block that computes the nulling operation in (8); (vi) a slicer block that implements the slicing operation for quantization (10); (vii) a block that performs the cancellation (11); (viii) a block So that assigns the decoded symbol entries to vector $\hat{\mathbf{s}}$; and (ix) a block that manages the deflated matrix versions in (12) and the index-vector reshaping in (13).

The V-BLAST architecture provides one decoded symbol entry of $\hat{\mathbf{s}}$ per iteration $i$ (exactly $N_T$ iterations are required), meaning that multiplexers MYG and MR and registers REG_Hi and REG_Xi regulate the flow of the data contained in the channel matrix, the received vector, and their updated versions. For $i = 1$, MR selects matrix $\mathbf{H}$ and MYG selects vector $\mathbf{x}$. Whenever $i > 1$, the vector $\mathbf{x}_i$ is updated into $\mathbf{x}_{i+1}$ following (11) through REG_Xi; the same happens to the deflated matrix versions following (12) through REG_Hi. Moreover, the deflated matrix versions are handled by simply zeroing the $k_i$-th column of $\mathbf{H}$ after each iteration $i$, yielding the matrices in (14), in which $\mathbf{0}$ denotes a column vector filled with zeros and the remaining column entries are reordered according to the entries of the index vector $\mathbf{p}_i$. The P/P-D block computes (5) for every row entry of $\mathbf{G}_i$. With the aid of a Batcher sorting network [12], a nulling vector is chosen from the set of candidate vectors based on the minimum squared row norm $\min_j \|(\mathbf{G}_i)_j\|^2$, which is equivalent to selecting the maximum post-detection SNR. Each of these results is labeled consecutively, starting from 1. The index $k_i$ then adopts the label attached to the selected nulling vector. The nulling block uses the selected nulling vector to perform (8), implementing all complex multiply-and-accumulate operations related to $y_{k_i}$. The slicer block transforms $y_{k_i}$ into a signal point belonging to an $M$-QAM constellation ($\hat{s}_{k_i}$ is coded into a $\log_2 M$-bit word). For $i < N_T$, the cancellation block carries out the complex additions and multiplications of (11) with the sliced symbol $\hat{s}_{k_i}$, the multiplexed vector $\mathbf{x}_i$, and the selected column of $\mathbf{H}$. The So block registers the decoded symbol entries and assigns them to the elements of vector $\hat{\mathbf{s}}$ based on index $k_i$; in other words, the value of $\hat{s}_{k_i}$ is the $k_i$-th element of vector $\hat{\mathbf{s}}$. The updating block generates the deflated versions of $\mathbf{H}$ and keeps track of indexes $i$ and $k_i$. It also provides the pertinent value of $k_i$ at every iteration $i$. As mentioned before, in order to perform the deflating operations in (12), the $k_i$-th column of matrix $\mathbf{H}$ is removed and substituted by an $N_R$-dimensional zero vector. For the index vector update in (13), the $j^{\star}$-th element of the $(N_T - i + 1)$-dimensional array $\mathbf{p}_i$ is assigned to index $k_i$, that is, $k_i = \mathbf{p}_i(j^{\star})$; this element is then removed from $\mathbf{p}_i$, which is re-sized into an $(N_T - i)$-dimensional array $\mathbf{p}_{i+1}$.
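The selection of the nulling vector through a sorting network can be sketched as below. The comparator schedule is the standard 4-input Batcher odd-even merge network; fixing the network to four inputs, padding unused inputs with infinity, and the helper names (`NETWORK_4`, `sort4`, `select_nulling_row`) are assumptions made for illustration, and the paper's actual network size and wiring may differ.

```python
# Comparator schedule of the 4-input Batcher odd-even merge sorting network.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort4(pairs):
    """Sort four (key, label) pairs by ascending key using compare-exchange steps."""
    v = list(pairs)
    for a, b in NETWORK_4:
        if v[a][0] > v[b][0]:
            v[a], v[b] = v[b], v[a]
    return v


def select_nulling_row(G):
    """Return the 1-based label of the row of G with the smallest squared norm."""
    pairs = [(sum(abs(g) ** 2 for g in row), label)
             for label, row in enumerate(G, start=1)]
    # Pad to four inputs so that fewer candidates can reuse the same network.
    pairs += [(float("inf"), None)] * (4 - len(pairs))
    return sort4(pairs)[0][1]
```

Because the comparator schedule is data-independent, each compare-exchange pair maps naturally onto a fixed hardware comparator, which is what makes sorting networks attractive for this selection step.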

The InviPseudo block is the key element for computing the iterative generalized inverses in (7). The heavy and critical operation is the matrix inversion $\left(\mathbf{H}_i^{H}\mathbf{H}_i\right)^{-1}$, since $\mathbf{H}_i^{+} = \left(\mathbf{H}_i^{H}\mathbf{H}_i\right)^{-1}\mathbf{H}_i^{H}$. InviPseudo implements a strategy for computing this inverse based on a block-matrix Left-Pseudoinverse approach (hereinafter referred to as the LPI kernel), as proposed in [11]. For the case developed in this V-BLAST architecture, the LPI kernel is divided into a set of entities, all of which represent complex-valued matrices that are reassembled once all V-BLAST iterations are completed. Depending on the V-BLAST iteration $i$, the data inside each entity is initialized for the correct computation of the block-matrix inversion within the kernel; in every case, the elements are taken from $\mathbf{H}_i$ at the corresponding V-BLAST iteration. Additionally, all arithmetic divisions in the LPI kernel and throughout the iterations concerning (7) were implemented with CORDIC (COordinate Rotation DIgital Computer) processors [13]. For this purpose, the CORDIC processor (or CORDIC engine) is structured as the micro-rotation recurrence (15), which is parameterized by the polarity of each micro-rotation, a set of initial conditions, and an approximated scaling factor; the engine is customized to perform division.
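As an illustration of how a CORDIC-style engine can compute a quotient with shifts and additions only, the sketch below uses the textbook linear vectoring mode. It is a generic floating-point model and not the engine of (15): the linear mode needs no scaling factor, whereas the engine described above mentions one, so the authors' configuration may differ. The function name `cordic_divide` and the iteration count are assumptions.

```python
def cordic_divide(y, x, iterations=32):
    """Approximate y / x with linear-mode CORDIC (valid for |y / x| <= 2)."""
    if x == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    z = 0.0
    for k in range(iterations):
        step = 2.0 ** -k                  # a right shift by k bits in hardware
        # Polarity of the micro-rotation: drive the residual y towards zero.
        d = 1.0 if (y >= 0) == (x > 0) else -1.0
        y -= d * x * step
        z += d * step
    return z
```

Each pass halves the step, so after `iterations` passes the result is within about `2 ** -(iterations - 1)` of the exact quotient; operands outside the convergence range would first need to be pre-scaled.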

4.3. Control-Path

The CP component is a finite state machine that handles all the control signals required by the DP to properly decode $\mathbf{s}$ during a time period in which the channel statistics remain invariant, that is, a MIMO channel realization in which inputs $\mathbf{H}$ and $\mathbf{x}$ do not change in time. Referring to Figures 1 and 3, the CP employs the following signals: (i) signals get_Hi and get_Xi capture data $\mathbf{H}_i$ and $\mathbf{x}_i$ in registers REG_Hi and REG_Xi, respectively; (ii) signal sasiSo captures every decoded entry of $\hat{\mathbf{s}}$ at block So; (iii) the re-sizing in (13) is regulated by signals get_sXYZ, get_sAB, and get_sL12, while the deflated versions of $\mathbf{H}$ in (12) are generated through signals get_sREG_Hi01, get_sREG_Hi10, and get_sREG_Hi11. In addition, the role of the iteration-index signal $i$ is fundamental for synchronizing the V-BLAST iterations, because it selects the elements in MYG and MR, controls the data flow in the DP, allows the generation of the generalized inverses at InviPseudo, and regulates the choice of nulling vectors.
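The behavior of such a controller can be sketched as a small state machine. The state names (`IDLE`, `LOAD`, `ITERATE`, `DONE`), the one-cycle-per-iteration timing, and the default of four iterations are hypothetical choices for illustration only; the paper does not list the states of the actual CP, and only the reset, DV_inputs, and DV_output signals are taken from the text.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()      # waiting for valid inputs
    LOAD = auto()      # capture the channel matrix and the received vector
    ITERATE = auto()   # one V-BLAST iteration per visit (assumed timing)
    DONE = auto()      # decoded vector is available


def step(state, i, dv_inputs, reset, n_iter=4):
    """Return (next_state, next_i) for one clock edge of the hypothetical FSM."""
    if reset:
        return State.IDLE, 1
    if state is State.IDLE:
        return (State.LOAD if dv_inputs else State.IDLE), 1
    if state is State.LOAD:
        return State.ITERATE, 1
    if state is State.ITERATE:
        if i < n_iter:
            return State.ITERATE, i + 1
        return State.DONE, i
    return State.IDLE, 1   # DONE returns to IDLE for the next realization


def dv_output(state):
    """Moore-style output flag: high only while the result is valid."""
    return state is State.DONE
```

Keeping the iteration index as an explicit part of the controller state mirrors the role the index plays in steering the multiplexers and registers of the data path.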

5. Results

The sub-optimal hard-output V-BLAST architecture was designed to operate with transmitted signal vectors generated by 16-QAM modulators in AWGN Rayleigh fading channels, for the antenna configuration adopted in this work. Simulations for functional validation were programmed in MATLAB. Each simulation ran for one million MIMO channel realizations or until one thousand error blocks were found, where an error block is a decoded block of bits that contains at least one error. Synthesis was performed with the Altera Quartus II IDE tool for Cyclone III FPGA devices.

5.1. BER Performance

The BER performance results shown in Figure 2 were corroborated on the V-BLAST architecture for the finite-precision analysis. That is, the FPGA implementation showed the same performance as the fixed-point simulation, confirming a negligible degradation with respect to the theoretical fixed-point performance. Two simulation scenarios were considered: (i) a MATLAB-based simulation model used to obtain the floating-point and fixed-point (restricted to 16-bit precision) ML and V-BLAST performances; (ii) a testbench designed to evaluate the performance of the device under test, whose test vectors were generated with MATLAB.

5.2. Synthesis Results

The Cyclone III device from the Altera FPGA family was selected as the implementation target for the V-BLAST architecture. Different synthesis and fitting modes were evaluated within the Quartus II IDE tool: synthesis considered speed (sp), balanced (bd), and area (ar) optimization techniques, while fitting considered standard (std) and fast (fst) fitter efforts. Therefore, six different modes were evaluated for implementation purposes, namely sp-std, bd-std, ar-std, sp-fst, bd-fst, and ar-fst. In all of these cases, hardware complexity resided in logic elements (LEs) and embedded multipliers (eMults). The whole V-BLAST architecture required 27% of the LEs and 72% of the eMults available in the FPGA device. Specifically, (a) regarding LE usage, 62.98% belonged to the InviPseudo block; 24.94%, 3.67%, 3.29%, and 2.39% to four other DP processing blocks; and less than 0.5% to the remaining blocks (including MR, MYG, REG_Hi, REG_Xi, and So); (b) regarding eMult usage, 84.61% belonged to the InviPseudo block and 15.39% to one other DP block. The best timing performance of the V-BLAST architecture is offered by the bd-std implementation mode, which exhibits a critical path of 10.056 ns and a data throughput of 265.182 Mbps, since 4 bits were decoded every 1.5 clock cycles. This timing performance suits the SM-MIMO requirements for Rayleigh fading channels stipulated in 802.11n, whose specifications are around 100–200 Mbps [14]. The V-BLAST architecture worked at a maximum clock frequency of 99.443 MHz with an overall decoding latency of 12 clock cycles, equivalent to 120.672 ns. Furthermore, Table 1 shows comparative results against other related works. The floorplan of the complete V-BLAST architecture is provided in Figure 4: the overall V-BLAST architecture (top view, label A); MR, MYG, REG_Hi, and REG_Xi (left middle view, label B1); So and four other processing blocks (right middle view, label B2); InviPseudo (left bottom view, label C1); and one further block (right bottom view, label C2).
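The timing figures quoted above are mutually consistent, as the following short calculation shows. It only re-derives the reported numbers from the reported critical path, bits per output, and cycle counts; the variable names are illustrative.

```python
critical_path_ns = 10.056     # reported critical path (bd-std mode)
bits_per_output = 4           # reported bits decoded per output event
cycles_per_output = 1.5       # reported clock cycles per output event
latency_cycles = 12           # reported overall decoding latency in cycles

# Maximum clock frequency is the reciprocal of the critical path.
f_max_mhz = 1e3 / critical_path_ns

# Throughput in Mbps: bits per nanosecond times 1000.
throughput_mbps = bits_per_output / (cycles_per_output * critical_path_ns) * 1e3

# Latency in nanoseconds.
latency_ns = latency_cycles * critical_path_ns
```

Evaluating these expressions reproduces the reported 99.443 MHz, 265.182 Mbps, and 120.672 ns to within rounding.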

5.3. Comparison Analysis

Earlier architectural implementations of sub-optimal hard-output V-BLAST-based SM-MIMO solutions are reported in [7, 8]. Whereas those implementations target ASIC devices, which limits design flexibility and cost-effectiveness, the FPGA-based VLSI architecture developed in this work is a modular, portable, and scalable implementation of the V-BLAST algorithm, as depicted in Figure 3. The modules constituting the V-BLAST architecture are described at the RTL level, which provides moderate support for high-dimensional lattices (i.e., $M$-QAM with $M$ = 16, 64, 256) and for high-dimensional MIMO communication scenarios. The performance results reported in Table 1 are competitive against other state-of-the-art solutions: a low-dimensional lattice (i.e., QPSK) V-BLAST [7] and a high-dimensional lattice (i.e., 16-QAM) V-BLAST [8]. The heaviest hardware complexity lies in how the iterative operations are performed. For instance, the LPI kernel and the operations handled in [7] have the same complexity; however, the LPI kernel avoids unitary matrix transformations, which demand a more sophisticated and complex control-path design. Also, the use of more CORDIC engines in [7, 8] affects both data throughput and symbol-decoding latency. On the other hand, the operations in [8] are alleviated through level-thresholding, which incurs a BER performance degradation for sub-optimal hard-output SM-MIMO demodulation purposes.

6. Conclusions

In this work, a VLSI architecture for the sub-optimal hard-output SM-MIMO V-BLAST algorithm was proposed. The architecture was designed to operate with symbols drawn from 16-QAM modulators under AWGN Rayleigh fading channels, for the antenna configuration adopted in this work. Simulation testing and hardware implementation on Altera Cyclone III FPGA devices validated the functionality of the V-BLAST architecture.

Acknowledgment

This work was supported by CONACYT (National Science and Technology Council, Reg. 332852/229015) under the supervision, revision, and sponsorship of ITESM Campus Guadalajara and ITESM Campus Estado de México universities.