Journal of Electrical and Computer Engineering
Volume 2012 (2012), Article ID 938490, 9 pages
A Power-Efficient Soft-Output Detector for Spatial-Multiplexing MIMO Communications
Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan
Received 15 June 2011; Revised 14 September 2011; Accepted 17 November 2011
Academic Editor: Zhiyuan Yan
Copyright © 2012 Hsiao-Chi Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
VLSI implementation of a configurable power-efficient MIMO detector is proposed to support spatial multiplexing and modulation from QPSK to 64-QAM. A novel tree search algorithm is proposed to enable the detector to provide soft outputs and to be implemented in parallel and pipelined hardware architecture. The frame error rate (FER) of the detector approaches the quasi-optimal sphere decoder, with 0.5-dB degradation. Moreover, the proposed detector can operate at the optimal voltage under different configurations and detect/recover timing error at run time by a novel adaptive voltage scaling technique with double sampling circuitry. The proposed detector, using TSMC 0.18 μm single-poly six-metal CMOS process with a core area of mm2, provides fixed throughput of 45 Mbps in 64-QAM configuration, 120 Mbps in 16-QAM configuration, and 60 Mbps in QPSK configuration. The normalized power efficiency of the design for 64-QAM and 16-QAM configurations is 1.56 Mbps/mW and 2.53 Mbps/mW, respectively. Compared with the conservative margin-based design, the proposed design achieves a 48.8% power saving.
1. Introduction
Multiple-input multiple-output (MIMO) techniques combined with high constellation orders have been identified as a promising approach to systems with high spectral efficiency. Effective detection in spatial multiplexing is essential for system performance. Maximum likelihood (ML) detection is the optimal method in spatial multiplexing systems. Sphere detection (SD) methods [1–3] compute the ML solution by considering only the lattice points inside a sphere with a given radius. Because a soft-in-soft-out error-correction-code (ECC) decoder can achieve better error-correction performance than a conventional hard-decision ECC decoder, a coded communication system needs a soft-output MIMO detector, which cooperates better with the ECC decoder. Soft-output MIMO detection algorithms, such as the list sphere decoder (LSD) and the soft single tree search sphere decoder (STS-SD), have quasi-optimal performance with ECC. However, the complexity of the LSD grows rapidly as the list is extended, and the throughput of the STS-SD varies with the channel condition. The high computational intensity and the variable throughput of these iterative methods prevent current practical implementations from meeting the requirements for chip area, latency, and power consumption.
The fixed-complexity sphere decoder (FSD) algorithm is another practical solution to MIMO detection. The FSD is similar to the SD but uses different search criteria. It performs a fixed tree search that visits all nodes in the top level and only one node in each of the other levels, which simplifies the tree search of the SD. Therefore, the complexity of the FSD can be lower than that of the SD. The FSD algorithm presents quasi-ML performance and fixed complexity in an uncoded system. However, the FSD is not compatible with powerful soft-input ECC decoders without modification. The work in provides a good reference. In this paper, a modified FSD scheme reduces the number of visited nodes by using a different distribution of node visits across the tree levels and finds the local minima needed for soft output, trading off performance against complexity. Furthermore, a novel method to simplify the Schnorr-Euchner (SE) enumeration is proposed for the FSD. Therefore, the proposed FSD does not need to sort all points of the constellation.
To achieve more power saving, a full-pipelined parallel architecture is proposed according to the modification of the algorithm. The parallel design and adaptive voltage scaling technique can provide a power-efficient ASIC solution.
In this paper, a robust and power-efficient solution for spatial-multiplexing MIMO detection is proposed. The proposed soft MIMO detector can provide LLR output to ECC decoder and provide a turbo-MIMO solution. The proposed MIMO detector has the following features.
(a) Provision of High Order Modulations
The proposed MIMO detector supports multiple modulation configurations, including QPSK, 16-QAM, and 64-QAM modulation.
(b) Soft-Output Performance
With a modified tree search algorithm, the proposed MIMO detector, which provides soft-valued outputs, is compatible with soft-in-soft-out ECC decoders to attain enhanced detection performance.
(c) Parallel and Pipelined Architecture
Iterative sphere decoding methods prevent decoders from efficient hardware implementation. The proposed soft-output fixed-complexity sphere decoder (SFSD) retains the advantage of the fixed-complexity sphere decoder (FSD)  and therefore can be implemented in parallel and full pipelined designs to increase hardware efficiency.
(d) Fixed Throughput
General soft-output sphere decoding solutions only provide variable throughput, which makes imperfect use of hardware resources due to the sequential nature. The SFSD provides fixed throughput and achieves better performance than usual sphere decoders when the throughput is fixed. The maximum throughput is 120 Mbps in the proposed decoder.
(e) Error-Recovered Adaptive Voltage Scaling
A novel adaptive voltage scaling method is applied to the detector to reduce power dissipation. With double sampling circuitry, a timing error is detected and recovered at run time. Therefore, the detector can operate at an optimal voltage while its functionality is protected against timing violations. With these techniques, a 48.8% saving in power is achieved.
(f) Configurable, Complexity-Efficient, and Power-Efficient Hardware Implementation
The configurable ASIC provides fixed throughput of 45 Mbps in 64-QAM configuration, 120 Mbps in 16-QAM configuration, and 60 Mbps in QPSK configuration. The normalized power efficiency is 1.56 Mbps/mW and 2.53 Mbps/mW for 64-QAM and 16-QAM configurations, respectively. The complexity efficiency is 1.19 Mbps/K-gate for 16-QAM configuration.
The remainder of the paper is organized as follows. Section 2 reviews the conventional sphere decoding algorithm. The proposed SFSD algorithm with related simulation results is introduced in Section 3. Section 4 presents the logic design. Section 5 reports the hardware implementation. Finally the paper is concluded in Section 6.
2. Conventional Sphere Decoding Algorithm
Let us consider a MIMO system with transmitting and receiving antennas (). streams of bits of data, , and , are mapped to an -dimensional transmitted symbol vector , using -QAM modulation. The received complex vector, , is given by where is an channel matrix, which is assumed to be known in advance, and is the complex Gaussian noise vector.
The a posteriori log-likelihood ratio (LLR) of the bit , conditioned on the received symbol vector, , provides a soft-in-soft-out detector information for decision and can be expressed as By Bayes’ theorem, the max-log approximation, and proof in [5, 6], (2) can be rewritten as where and are search spaces, and , respectively. The maximum-likelihood (ML) minima in (3) is expressed as where is the set of constellation symbols in the -dimensional complex space. Letting be the binary complement of the th bit in the th data symbol of , the other minima in (3) can be expressed as
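As a minimal sketch of the max-log LLR computation described above, the following Python function takes a list of candidate symbol vectors, each given by its bit labels and its squared Euclidean distance, and forms each LLR as the difference between the two per-hypothesis minima. The sign convention (positive values favor bit 1), the omission of noise-variance scaling, the clipping value `llr_max`, and all names are assumptions made for illustration and are not taken from the paper.

```python
def max_log_llrs(candidates, num_bits, llr_max=8.0):
    """Max-log LLRs from (bits, squared_distance) candidates.

    Assumed convention: LLR_k = min over candidates with bit k = 0
    minus min over candidates with bit k = 1, clipped to +/- llr_max
    (the clipping value is a hypothetical choice).
    """
    inf = float("inf")
    best0 = [inf] * num_bits  # smallest distance seen with bit k = 0
    best1 = [inf] * num_bits  # smallest distance seen with bit k = 1
    for bits, dist in candidates:
        for k in range(num_bits):
            if bits[k] == 0:
                best0[k] = min(best0[k], dist)
            else:
                best1[k] = min(best1[k], dist)
    llrs = []
    for d0, d1 in zip(best0, best1):
        llr = d0 - d1
        # A missing counter-hypothesis gives +/- infinity, so clip it.
        llrs.append(max(-llr_max, min(llr_max, llr)))
    return llrs
```

One of the two minima for every bit is always the overall (ML) minimum, so only one counter-hypothesis distance per bit has to be found in addition to the ML solution.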
By means of the QR decomposition of the channel matrix , (4) and (5) can be reformulated as respectively, where is an matrix with orthogonal unit-norm columns, is an upper-triangular matrix, and denotes the Hermitian transpose operator. Note that the noise term keeps the same statistical properties because is a unitary matrix .
In addition, sorted QR decomposition (SQRD) is applied, and the Euclidean distance (ED) in (6) and (7) is computed recursively as where , , and and are, respectively, elements in and . Lattice points whose partial Euclidean distance (PED), , is greater than the square of a given radius are invalid. Therefore, the candidate search is confined to lattice points within a sphere to reduce the detection complexity. Nevertheless, the sequential search and the variable computational complexity hinder an efficient hardware design of the conventional SD.
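The recursive distance computation can be sketched as follows, assuming the commonly used form in which each layer adds the squared residual between the rotated received value and the contribution of the symbols already chosen in that layer and the layers above it. The names `R`, `y_tilde`, `s`, and `radius` are illustrative and are not the paper's notation.

```python
def partial_distances(R, y_tilde, s):
    """Return PEDs per layer, accumulated from the top layer down.

    R is upper triangular (list of rows), y_tilde is the rotated
    received vector, and s is a candidate symbol vector. peds[0] is
    the full squared Euclidean distance of the candidate.
    """
    n = len(s)
    peds = [0.0] * n
    acc = 0.0
    for i in range(n - 1, -1, -1):  # top layer first
        contribution = sum(R[i][j] * s[j] for j in range(i, n))
        acc += abs(y_tilde[i] - contribution) ** 2
        peds[i] = acc
    return peds


def inside_sphere(peds, radius):
    """A candidate is valid only if no PED exceeds the squared radius."""
    return all(d <= radius ** 2 for d in peds)
```

Because the PEDs never decrease from the top layer to the leaf layer, a partial candidate that already violates the radius can be discarded without expanding its children.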
3. Proposed Soft-Output FSD (SFSD)
3.1. Algorithm Description
The FSD algorithm is another solution to MIMO detection. It is similar to the SD but has two major differences.
(i) A fixed tree search is performed: the FSD visits all nodes in the top level and only one node in each of the other levels, which simplifies the tree search of the SD.
(ii) The channel matrix ordering is modified so that the column of the channel matrix with the smallest norm is placed in the first (top) level of the tree.
The FSD algorithm presents quasi-ML performance and fixed complexity in an uncoded system. However, the FSD is not directly compatible with powerful soft-input ECC decoders.
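The fixed tree search of the hard-output FSD can be sketched as below: every constellation point is tried at the top layer, and each lower layer keeps only the constellation point nearest to the interference-cancelled value. This is a simplified illustration under the assumption of an upper-triangular `R` with nonzero real diagonal; the function and variable names are hypothetical.

```python
def fsd_detect(R, y_tilde, constellation):
    """Hard-output FSD-style search.

    Full expansion at the top layer (last index), a single child at
    every other layer. Returns (best_symbols, best_distance).
    """
    n = len(y_tilde)
    best_s, best_d = None, float("inf")
    for top in constellation:
        s = [0j] * n
        s[n - 1] = top
        d = abs(y_tilde[n - 1] - R[n - 1][n - 1] * top) ** 2
        for i in range(n - 2, -1, -1):
            # Cancel the contribution of the already decided upper layers.
            interference = sum(R[i][j] * s[j] for j in range(i + 1, n))
            center = (y_tilde[i] - interference) / R[i][i]
            # Keep only the nearest constellation point at this layer.
            s[i] = min(constellation, key=lambda c: abs(c - center))
            d += abs(y_tilde[i] - interference - R[i][i] * s[i]) ** 2
        if d < best_d:
            best_s, best_d = list(s), d
    return best_s, best_d
```

The number of paths equals the constellation size regardless of the channel, which is what gives the search its fixed complexity. A soft-output variant needs additional branches so that counter-hypotheses for each bit are available.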
For soft-output MIMO detection, the quasi-ML solution in (6) and the local minima in (7) are essential to calculate the LLR. Therefore, the proposed SFSD modifies the FSD to find the local minima in (7) for soft output. Accordingly, a soft-output SD requires more branches than the FSD. Monte Carlo simulations of the number of branches that must be visited were performed for the SD and FSD algorithms. Tables 1 and 2 show the mean and variance of the number of branches for a visited node at each layer in a MIMO system using 16-QAM and 64-QAM, respectively. The distribution helps the FSD algorithm to fix the number of branches in tree traversal. Both tables show a higher mean and variance of visited branches with SD ordering than with FSD ordering in all layers except layer 4. This can reduce the number of searched nodes and save computational complexity. By denoting the number of branches at the th layer as , we can denote the distribution of branch number for 4 layers by . Figure 1 illustrates the idea. We can see that the means in Tables 1 and 2 are in descending order from the third level to the first level. Then we can find that the property of is The SFSD may follow this property, and can be fixed to , , where is equal to the number of the constellation points (), or to other distributions which follow (9).
Figure 2 shows the FER performance of different SFSD distributions with LLR maximum in a 4 × 4 MIMO 16-QAM system. The simulated distributions in the figure are selected based on (9) and the mean/variance of the visited branches in Tables 1 and 2. The SD can be considered as the optimal performance here. We can see that the number of branches per node affects the performance of the SFSD, and more branches directly give better performance. However, a distribution with too many branches increases the tree search complexity. Table 3 presents the total visited nodes in different SFSD distributions with 16-QAM modulation. The distribution is used in the hard-output FSD algorithm. The total number of visited nodes may represent the computational complexity of the SFSD. We can find that the = SFSD needs the least branch expansion and has only 0.5 dB performance degradation compared to the SD. The = SFSD has better performance, which degrades by about 0.1 dB, but has the highest complexity. It is a tradeoff between performance and complexity. Therefore, the branch distribution = for tree search has low complexity and is adopted for the fully pipelined parallel hardware architecture of the proposed SFSD. The top layer, , is set to the number of all constellation points.
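To make the complexity comparison concrete, the sketch below counts visited nodes for a given branch distribution under one plausible counting rule: every surviving path at a layer expands the same fixed number of branches at the next layer, and each expanded branch counts as one visited node. The counting rule and the function name are assumptions; the entries of Table 3 may be counted differently.

```python
def visited_nodes(branches_top_down):
    """Total visited nodes for branch counts listed from the top layer down."""
    total = 0
    paths = 1
    for branches in branches_top_down:
        paths *= branches  # nodes expanded at this layer
        total += paths
    return total
```

For example, a hard-output FSD on four layers with 16-QAM expands 16 nodes at the top layer and one node per path below it, so `visited_nodes([16, 1, 1, 1])` gives 64, while doubling the branches at the second layer, `visited_nodes([16, 2, 1, 1])`, gives 112.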
The usual sphere decoders adopt the SE enumeration , which sorts all of the constellation points by their Euclidean distances to the received signal. The point with the smallest distance has the first priority, and the others are enumerated in ascending order of distance. Because the proposed SFSD only needs to enumerate the two nearest points of the constellation, the SE enumeration can be simplified here. Therefore, the proposed SFSD does not need to sort all points. The simplified enumeration is presented as follows. Figure 3 is an example of the 16-QAM constellation. The red point is the point of received signal . The complex-valued constellation can be separated into a real part and an imaginary part. The nearest point can be found by comparing with the values of the dotted lines in the real part and the imaginary part, respectively. Then the nearest point, which is inside the green circle, is confirmed. The second nearest points of the real part and the imaginary part can be easily found by comparing with the values of the gray lines which include the nearest point. The two points and inside the blue circles are the second nearest points of the real part and the imaginary part, respectively. Then PED computations between those points and are required. The distances between the nearest point and the received signal point in the real part and imaginary part are denoted as and , respectively. The distances between the second nearest point and the received signal point in the real part and imaginary part are denoted as and , respectively. Then, the second nearest point can be decided by . Only a few comparisons are needed for the SFSD to achieve the SE enumeration. Note that these PED computations of the real part and the imaginary part are equivalent to the PED computations of two points in the complex-valued constellation. Therefore, the simplified SE enumeration costs just two PED computations.
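The simplified enumeration can be sketched per axis as follows. The sketch assumes a square QAM constellation whose per-axis levels are the odd integers (for 16-QAM: -3, -1, 1, 3) and decides between the two second-nearest candidates by comparing the extra squared distance on each axis; the exact decision rule used in the paper is not reproduced here, and all names are hypothetical.

```python
import math


def two_nearest_levels(x, m):
    """Nearest and second-nearest of the m odd-integer levels on one axis."""
    top = m - 1
    nearest = 2 * math.floor(x / 2) + 1        # nearest odd integer
    nearest = max(-top, min(top, nearest))      # clamp to the constellation
    step = 2 if x >= nearest else -2            # neighbor on the side of x
    second = nearest + step
    if abs(second) > top:                       # fell off the edge
        second = nearest - step
    return nearest, second


def two_nearest_points(z, m=4):
    """Nearest and second-nearest constellation points to z, without sorting."""
    re1, re2 = two_nearest_levels(z.real, m)
    im1, im2 = two_nearest_levels(z.imag, m)
    nearest = complex(re1, im1)
    # Each candidate shares one coordinate with the nearest point, so only
    # the extra squared distance on the changed axis has to be compared.
    extra_re = (z.real - re2) ** 2 - (z.real - re1) ** 2
    extra_im = (z.imag - im2) ** 2 - (z.imag - im1) ** 2
    second = complex(re2, im1) if extra_re <= extra_im else complex(re1, im2)
    return nearest, second
```

On a square grid the second-nearest point always differs from the nearest point in exactly one coordinate, which is why two candidates are sufficient.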
The sequential design for tree search has variable throughput due to the unpredictable number of visited nodes, which causes hardware inefficiency. The early termination (ET)  can solve the problem. It confines the frame run time to and lets every frame have the same delay during MIMO detections. The function confines the maximum number of visited nodes to a value in each tree search as where denotes the number of visited nodes for the th MIMO symbol vector.
3.2. Simulation Results
The simulation environment is Rayleigh flat fading channel with AWGN. The algorithm is evaluated in terms of the number of visited parent nodes in tree search and frame-error-rate (FER) performance. All simulations are performed in a MIMO system with Gray mapping QAM constellation modulation. In the simulations, data are coded in a rate convolutional code with constraint length 7 and a polynomial generator. A soft-input Viterbi decoder (max-log BCJR algorithm ) is employed in the receiver. A frame consists of MIMO symbols. In other words, a frame consists of randomly interleaved bits after outer encoding.
Figure 4 illustrates the frame-error-rate (FER) performance of the optimal SD, STS-SD , and proposed SFSD. At 2%-FER, only a 0.5-dB loss is observed between the SFSD and the optimal SD. The performance degradation is even smaller between the SFSD and the STS-SD.
Recursive computation of the PED as in (8) for tree search leads to considerable computational complexity and latency. Hence, the number of visited parent nodes in tree search is an indicator of searching complexity and latency. Figure 5 compares the average number of visited parent nodes of the STS-SD algorithm and the proposed SFSD. The proposed method reduces the average number of visited parent nodes by 20%, which shows the low-complexity advantage of the SFSD over the STS-SD.
4. Logic Design
The proposed MIMO detector, shown in Figure 6, consists of a preprocessing block and an SFSD block. The preprocessing block performs sorted QR decomposition (SQRD) and FSD ordering. The matrix is decomposed by SQRD and consists of real-valued diagonal elements, , and complex-valued off-diagonal elements, . The computation of is also performed in the block. In each iteration of the SQRD algorithm, FSD ordering chooses the column with the second smallest column norm instead of the smallest column norm. The column of the channel matrix with the smallest norm is placed in the top level of the tree; the norms of the other columns decrease from right to left. In other words, the second top level has the greatest column norm, and the leaf level has the second smallest column norm.
The SFSD block performs the proposed SFSD algorithm and adopts a parallel design for high throughput and a fully pipelined design for a high clock rate. Figure 7 shows the detailed block diagram of the proposed SFSD. The SFSD block consists of several PED modules (PED4, PED3, PED2, and PED1), a list administration unit (LAU), and an LLR module. The PED4, PED3, PED2, and PED1 modules calculate PEDs for layers 4, 3, 2, and 1, respectively. The LAU module compares and sorts the Euclidean distances (EDs) which are accumulated by the PED modules. After times ( is the constellation size) of sequential update checking, the LAU module outputs the calculated local minima for each layer and the ML solution (Figure 7). The LLR computation module subtracts the smallest distance value from the local minimum distance values, which correspond to the local minima with the binary complement of the th bit of , and computes each LLR value. The order of the LLR values is not the same as the order of the original input data because SQRD preprocessing reorders the matrix . Hence, order recovery is also implemented in the LLR computation module.
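The resulting layer order can be illustrated with a static simplification that sorts the original column norms once. The actual SQRD recomputes the remaining column norms after every orthogonalization step, so this sketch only shows the target ordering (second smallest norm at the leaf, greatest norm at the second top level, smallest norm at the top) and not the authors' procedure. The function name is hypothetical.

```python
def fsd_column_order(H):
    """Column indices of H listed from the leaf layer to the top layer.

    Simplified, static version: norms are computed once from H
    (a list of rows) rather than updated per SQRD iteration.
    """
    n_cols = len(H[0])
    norms = [sum(abs(row[c]) ** 2 for row in H) for c in range(n_cols)]
    by_norm = sorted(range(n_cols), key=lambda c: norms[c])
    weakest, rest = by_norm[0], by_norm[1:]
    # rest is ascending in norm: leaf layer first, second top layer last.
    return rest + [weakest]
```

The permutation returned here is also what an order recovery step would invert so that the LLR values are delivered in the original stream order.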
4.1. Adaptive Voltage Scaling with Double Sampling
Dynamic voltage scaling (DVS) is a common technique to adjust system voltage and reduce power consumption. Some methods have been proposed for pessimistic designs that require large safety margins . Instead of relying on a guard band from the safety margin, Razor [15, 16] proposed an in situ timing error detection and correction mechanism, which can tolerate process, voltage, and temperature (PVT) variations and significantly reduce the power consumption. In Razor-based DVS, each flip-flop in the critical path is augmented with a shadow latch, which is triggered by a delayed clock. By comparing the data sampled by the main flip-flop and the shadow latch, a timing error can be detected, and the corrupted value is replaced by the correct value in the shadow latch.
An adaptive voltage scaling with double sampling scheme is applied to the proposed SFSD. As shown in Figure 8, the main flop is clocked by clk and the delayed flop is clocked by clk_del, which is delayed with respect to clock clk. The delay between the clock edges of clk and clk_del is designed such that the correct value is obtained at the next clock edge of clk_del, even in the presence of unpredictability in the wire delay. Therefore, the delayed flop is designed to operate in an error-free manner. The XOR gate is used to compare the data captured by the main flop and delayed flop. When the outputs of the two flops differ, the signal errq is active to indicate a timing error and turn the main flop to the error mode. Afterward, the correct value captured by the delayed flop will be sent by the main flop in the next cycle, causing a one-cycle penalty for error recovery. Figure 9 shows the timing diagram of the error detection and recovery.
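The behavior of one double-sampling stage can be sketched with a cycle-level model. Each input word is described by the value captured by the main flop at the clk edge and the value captured by the delayed flop at the clk_del edge, which is assumed to be correct as stated above. This is an illustrative behavioral model, not the circuit of Figure 8, and the function name is hypothetical.

```python
def run_double_sampling_stage(samples):
    """Simulate one stage.

    samples: iterable of (delayed_value, main_value) integer pairs.
    Returns (outputs, cycles, errors).
    """
    outputs, cycles, errors = [], 0, 0
    for delayed_value, main_value in samples:
        cycles += 1
        errq = (main_value ^ delayed_value) != 0  # XOR comparison
        if errq:
            errors += 1
            cycles += 1  # one-cycle penalty: main flop re-sends the value
        outputs.append(delayed_value)  # downstream always sees correct data
    return outputs, cycles, errors
```

The model makes the cost of recovery explicit: every detected error adds exactly one cycle, so throughput degrades gracefully as the supply voltage is lowered and errors become more frequent.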
At each pipeline stage, the error control circuit presented in Figure 10 is added to generate suitable control signals when an error is detected. Let the number of bit lines in the pipeline be lines. The XOR outputs (errq signals) generated at all the bit lines of each pipeline stage are ORed together and fed as an input to the error control circuit. In the error OR-tree, the error signals from all pipeline flip-flops are ORed together to generate a unified error signal. If the design has many double sampling flip-flops, the fan-in of the OR gate can be very large, requiring the OR signal to be pipelined. In some cases, the error signal may become metastable. Therefore, two flip-flops are added at the global error output to stabilize the error signal.
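The structure of the error collection can be sketched in software as an OR-tree with bounded fan-in followed by a two-flop synchronizer. The fan-in of 4 and the function names are assumptions for illustration; the number of tree levels indicates how many register stages would be needed if every level were pipelined.

```python
def or_tree(errq_bits, fan_in=4):
    """OR-reduce error bits with gates of bounded fan-in.

    Returns (global_error, levels), where levels is the tree depth.
    """
    level = [bool(b) for b in errq_bits]
    levels = 0
    while len(level) > 1:
        level = [any(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
        levels += 1
    return (level[0] if level else False), levels


def two_flop_synchronizer(raw_error_stream):
    """Pass the raw error flag through two flip-flops in series.

    Returns the second flop's output after each clock edge.
    """
    ff1 = ff2 = False
    out = []
    for raw in raw_error_stream:
        ff2, ff1 = ff1, bool(raw)  # both flops update on the same edge
        out.append(ff2)
    return out
```

The synchronizer trades a delay in reporting the global error for a much lower probability that a metastable value reaches the control logic.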
4.2. Modified Cell-Based Flow
The timing paths of each submodule are analyzed after top-down synthesis. The critical paths of the target design are in the PED1 and PED2 modules. Therefore, double sampling flip-flops are inserted into the critical paths, and all the error signals of each double sampling pipeline stage are ORed together. Figure 11 shows the block diagram of the SFSD after double sampling pipeline insertion.
5. Hardware Implementation
The proposed DVS soft-output MIMO detector, using TSMC 0.18 μm single-poly six-metal (1P6M) CMOS technology, is shown in Figure 12. Table 4 lists the ASIC specification. All the power and throughput figures in Table 4 are reported at 1.8 V and 120 MHz.
This design surpasses previous work [6, 17] in power efficiency. A comparison (with the proposed postlayout simulation results) is given in Table 5, where the power efficiency is normalized to mitigate the impact of different process factors and throughputs. When normalizing power efficiency using (11), the nominal supply voltage of the proposed design is taken as 1.8 V, instead of the scaled voltage, to show the power-saving effect of the applied voltage scaling technique.
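As a quick consistency check on the reported throughput figures, the sketch below derives how many clock cycles each symbol vector occupies. It assumes four spatial streams (as in the 4 × 4 system used in the simulations) and the 120 MHz clock stated above; the function name is hypothetical.

```python
def cycles_per_vector(throughput_mbps, bits_per_symbol,
                      n_streams=4, clock_mhz=120.0):
    """Clock cycles per detected symbol vector implied by the throughput."""
    bits_per_vector = n_streams * bits_per_symbol
    vectors_per_us = throughput_mbps / bits_per_vector
    return clock_mhz / vectors_per_us


print(cycles_per_vector(45, 6))    # 64-QAM
print(cycles_per_vector(120, 4))   # 16-QAM
print(cycles_per_vector(60, 2))    # QPSK
```

Under these assumptions the 64-QAM and 16-QAM configurations correspond to 64 and 16 cycles per vector, which equals the constellation size and would be consistent with one top-layer candidate being processed per cycle. The QPSK figure corresponds to 16 cycles per vector.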
By scaling down the supply voltage, the power consumption of the design can be reduced, as shown in Figure 13. The lowest supply voltage for error-free operations at 100-MHz clock rate for 64-QAM and 16-QAM modulation is 1.44 V and 1.26 V, leading to 34.7% and 49.22% power reduction, respectively. When voltage scales to 1.26 V for 64-QAM modulation or 1.08 V for 16-QAM modulation, timing error occurs on the critical path, which is pipelined by double sampling circuitry. The duplicate circuitry (delayed flip-flop) is triggered by the error signal and recovers the critical path from the timing error. While further scaling down the supply voltage, some subcritical paths may suffer from timing error and cannot be recovered. When the supply voltage decreases to 1.08 V for 64-QAM modulation, the functionality of the design fails due to incorrect data transferred by subcritical paths. With the proposed adaptive voltage scaling method with double sampling, the supply voltage can scale down to 1.08 V and 1.26 V, and the power reduction approaches 48.8% and 61.25%, for 16-QAM and 64-QAM modulation, respectively. Figure 14 shows the maximum clock rate and the current of the design in 16-QAM modulation. The power consumption ratio of each module is shown in Figure 15.
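A first-order estimate helps to put these numbers in context. Dynamic CMOS power scales roughly with the square of the supply voltage at a fixed clock rate, so the saving relative to 1.8 V can be approximated as below. The estimate ignores leakage and short-circuit power and the cost of error recovery cycles, so it is an idealization rather than a model of the measured chip.

```python
def dynamic_power_saving(v_scaled, v_nominal=1.8):
    """Fractional dynamic-power saving from supply scaling (P ~ V^2)."""
    return 1.0 - (v_scaled / v_nominal) ** 2


for v in (1.44, 1.26, 1.08):
    print(f"{v:.2f} V: {100 * dynamic_power_saving(v):.1f}% estimated saving")
```

The estimate gives about 36%, 51%, and 64% at 1.44 V, 1.26 V, and 1.08 V. The reported reductions at the error-free operating points (34.7% at 1.44 V and 49.22% at 1.26 V) are slightly below this idealized bound.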
By using the error monitoring circuit, timing error information can be reported to the voltage controller at run time, which in turn controls the voltage regulator to raise or lower the supply voltage of the system. If the supply voltage is inadequate to meet the timing required by the critical paths, the power controller sends a decision bit to indirectly control the voltage source, which is provided by a DC-DC converter. The power management of the platform is completed through these mechanisms. The double sampling scheme functions correctly, provides a solution to timing error detection and recovery, and accomplishes a robust DVS platform.
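The control loop described above can be sketched as a simple up/down rule driven by the error count in an observation window. The error-rate budget, the 0.18 V step, and the voltage limits are hypothetical parameters chosen for illustration; the paper does not specify the controller policy.

```python
def decision_bit(error_count, window_cycles, max_error_rate):
    """Return 1 to request a higher supply voltage, 0 to request a lower one."""
    return 1 if error_count > max_error_rate * window_cycles else 0


def next_voltage(v, bit, step=0.18, v_min=1.08, v_max=1.8):
    """Apply one regulator step in the direction given by the decision bit."""
    v = v + step if bit else v - step
    return round(min(v_max, max(v_min, v)), 2)
```

For example, starting at 1.44 V with no errors in the window, `next_voltage(1.44, 0)` moves to 1.26 V, and if the error budget is then exceeded, `next_voltage(1.26, 1)` returns to 1.44 V.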
6. Conclusion
A power-efficient SFSD for MIMO detection is presented in this paper. The configurable SFSD with a novel tree search algorithm achieves soft-output decoding with fixed throughput of 120 Mbps in 16-QAM modulation. The FER of the detector approaches the optimal sphere decoder, with 0.5-dB degradation. A novel adaptive voltage scaling method with double sampling circuitry provides error detection/recovery and a 48.8% power saving.
The authors would like to thank the National Chip Implementation Center (CIC) of the National Applied Research Laboratories, Taiwan, for EDA tool support, as well as fabrication and measurement of the proposed chip. They would also like to thank the anonymous reviewers and editor for their valuable suggestions, which helped them improve this paper. This research is supported in part by the National Science Council, Taiwan, under Grants NSC-99-2221-E-007-050, NSC-99-2220-E-007-024, NSC-98-2219-E-007-003, supported in part by NTHU/ITRI joint research center, and supported in part by MediaTek-NTHU Joint Lab.
[1] M. O. Damen, H. E. Gamal, and G. Caire, “On maximum-likelihood detection and the search for the closest lattice point,” IEEE Transactions on Information Theory, vol. 49, no. 10, pp. 2389–2402, 2003.
[2] U. Fincke and M. Pohst, “Improved methods for calculating vectors of short length in a lattice, including a complexity analysis,” Mathematics of Computation, vol. 44, no. 5, pp. 463–471, 1985.
[3] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bölcskei, “VLSI implementation of MIMO detection using the sphere decoding algorithm,” IEEE Journal of Solid-State Circuits, vol. 40, no. 7, pp. 1566–1577, 2005.
[4] J. Hagenauer and P. Hoeher, “Viterbi algorithm with soft-decision outputs and its applications,” in Proceedings of the IEEE Global Telecommunications Conference & Exhibition (GLOBECOM '89), pp. 1680–1686, November 1989.
[5] B. M. Hochwald and S. Ten Brink, “Achieving near-capacity on a multiple-antenna channel,” IEEE Transactions on Communications, vol. 51, no. 3, pp. 389–399, 2003.
[6] C. Studer, A. Burg, and H. Bölcskei, “Soft-output sphere decoding: algorithms and VLSI implementation,” IEEE Journal on Selected Areas in Communications, vol. 26, no. 2, pp. 290–300, 2008.
[7] L. G. Barbero and J. S. Thompson, “Fixing the complexity of the sphere decoder for MIMO detection,” IEEE Transactions on Wireless Communications, vol. 7, no. 6, Article ID 4543065, pp. 2131–2142, 2008.
[8] L. G. Barbero, T. Ratnarajah, and C. Cowan, “A low-complexity soft-MIMO detector based on the fixed-complexity sphere decoder,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '08), pp. 2669–2672, April 2008.
[9] C. P. Schnorr and M. Euchner, “Lattice basis reduction: improved practical algorithms and solving subset sum problems,” Mathematical Programming, Series B, vol. 66, no. 2, pp. 181–199, 1994.
[10] D. Wübben, R. Böhnke, J. Rinas, V. Kühn, and K. D. Kammeyer, “Efficient algorithm for decoding layered space-time codes,” Electronics Letters, vol. 37, no. 22, pp. 1348–1350, 2001.
[11] C. Hess, M. Wenk, A. Burg et al., “Reduced-complexity MIMO detector with close-to ML error rate performance,” in Proceedings of the 17th Great Lakes Symposium on VLSI (GLSVLSI '07), pp. 200–203, March 2007.
[12] L. R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, “Optimal decoding of linear codes for minimizing symbol error rate,” IEEE Transactions on Information Theory, vol. 20, no. 2, pp. 284–287, 1974.
[13] M. Horowitz, E. Alon, D. Patil, S. Naffziger, R. Kumar, and K. Bernstein, “Scaling, power, and the future of CMOS,” in Proceedings of the IEEE International Electron Devices Meeting (IEDM '05), pp. 9–15, December 2005.
[14] T. D. Burd, T. A. Pering, A. J. Stratakos, and R. W. Brodersen, “Dynamic voltage scaled microprocessor system,” IEEE Journal of Solid-State Circuits, vol. 35, no. 11, pp. 1571–1580, 2000.
[15] D. Ernst, N. S. Kim, S. Das et al., “Razor: a low-power pipeline based on circuit-level timing speculation,” in Proceedings of the 36th Annual International Symposium on Microarchitecture, pp. 7–18, 2003.
[16] S. Das, D. Roberts, S. Lee et al., “A self-tuning DVS processor using delay-error detection and correction,” IEEE Journal of Solid-State Circuits, vol. 41, no. 4, pp. 792–804, 2006.
[17] Z. Guo and P. Nilsson, “Algorithm and implementation of the K-best sphere decoding for MIMO detection,” IEEE Journal on Selected Areas in Communications, vol. 24, no. 3, pp. 491–503, 2006.
[18] C. H. Liao, T. P. Wang, and T. D. Chiueh, “A 74.8 mW soft-output detector IC for 8 × 8 spatial-multiplexing MIMO communications,” IEEE Journal of Solid-State Circuits, vol. 45, no. 2, Article ID 5405138, pp. 411–421, 2010.