Abstract

A VLSI implementation of a configurable, power-efficient MIMO detector is proposed to support 4×4 spatial multiplexing with modulation from QPSK to 64-QAM. A novel tree search algorithm enables the detector to provide soft outputs and to be implemented in a parallel, pipelined hardware architecture. The frame error rate (FER) of the detector approaches that of the quasi-optimal sphere decoder, with 0.5-dB degradation. Moreover, a novel adaptive voltage scaling technique with double sampling circuitry lets the detector operate at the optimal voltage under different configurations and detect and recover from timing errors at run time. The proposed detector, implemented in a TSMC 0.18 μm single-poly six-metal CMOS process with a core area of 1.17×1.17 mm², provides fixed throughput of 45 Mbps in 64-QAM configuration, 120 Mbps in 16-QAM configuration, and 60 Mbps in QPSK configuration. The normalized power efficiency of the design for 64-QAM and 16-QAM configurations is 1.56 Mbps/mW and 2.53 Mbps/mW, respectively. Compared with the conservative margin-based design, the proposed design achieves a 48.8% power saving.

1. Introduction

Multiple-input multiple-output (MIMO) techniques in combination with high constellation orders have been identified as a promising approach to high-spectral-efficiency systems. Effective detection in spatial multiplexing is essential for system performance. Maximum likelihood (ML) detection is the optimal method in spatial multiplexing systems. Sphere detection (SD) methods [1–3] compute the ML solution by considering only the lattice points inside a sphere with a given radius. Because a soft-in-soft-out error-correction-code (ECC) decoder [4] can achieve better error-correction performance than a conventional hard-input ECC decoder, a soft-output MIMO detector, which cooperates better with the ECC, is needed for coded communication systems. Soft-output MIMO detection algorithms, such as the list sphere decoder (LSD) [5] and the soft single tree search sphere decoder (STS-SD) [6], have quasi-optimal performance with ECC. However, LSDs become highly complex when the list is extended, and the throughput of STS-SD varies with the channel condition. The high computational intensity and the variable throughput of these iterative methods prevent current practical implementations from meeting the requirements for chip area, latency, and power consumption.

The fixed-complexity sphere decoder (FSD) algorithm [7] is another practical solution to MIMO detection. The FSD is similar to SD but uses different search criteria: it performs a fixed tree search that visits all nodes in the top level and only one node in each of the other levels. Therefore, the complexity of FSD can be lower than that of SD. The FSD algorithm presents quasi-ML performance and fixed complexity in an uncoded system. However, FSD is not compatible with powerful soft-input ECC decoders without modification. The work in [8] provides a good reference. In this paper, using a modified FSD scheme, we reduce the number of visited nodes through a different distribution of node visits across levels and find the local minima needed for soft output, trading off performance against complexity. Furthermore, a novel method to simplify the Schnorr-Euchner (SE) enumeration [9] is proposed with the FSD, so that the proposed FSD does not need to sort all points of the constellation.

To save more power, a fully pipelined parallel architecture is proposed that matches the modified algorithm. The parallel design and the adaptive voltage scaling technique together provide a power-efficient ASIC solution.

In this paper, a robust and power-efficient solution for spatial-multiplexing MIMO detection is proposed. The proposed soft MIMO detector provides LLR output to the ECC decoder and thereby supports a turbo-MIMO solution. The proposed MIMO detector has the following features.

(a) Provision of High Order Modulations
The proposed MIMO detector supports multiple modulation configurations, including QPSK, 16-QAM, and 64-QAM modulation.

(b) Soft-Output Performance
With a modified tree search algorithm, the proposed MIMO detector, which provides soft-valued outputs, is compatible with soft-in-soft-out ECC decoders to attain enhanced detection performance.

(c) Parallel and Pipelined Architecture
Iterative sphere decoding methods prevent decoders from being implemented efficiently in hardware. The proposed soft-output fixed-complexity sphere decoder (SFSD) retains the advantage of the fixed-complexity sphere decoder (FSD) [7] and therefore can be implemented in parallel and fully pipelined designs to increase hardware efficiency.

(d) Fixed Throughput
General soft-output sphere decoding solutions only provide variable throughput, which makes imperfect use of hardware resources due to the sequential nature. The SFSD provides fixed throughput and achieves better performance than usual sphere decoders when the throughput is fixed. The maximum throughput is 120 Mbps in the proposed decoder.

(e) Error-Recovered Adaptive Voltage Scaling
A novel adaptive voltage scaling method is applied to the detector to reduce power dissipation. With double sampling circuitry, a timing error is detected and recovered at run time. Therefore, the optimal voltage can be reached while the processing is protected from functional failure. With these techniques, a 48.8% saving in power is achieved.

(f) Configurable, Complexity-Efficient, and Power-Efficient Hardware Implementation
The configurable ASIC provides fixed throughput of 45 Mbps in 64-QAM configuration, 120 Mbps in 16-QAM configuration, and 60 Mbps in QPSK configuration. The normalized power efficiency is 1.56 Mbps/mW and 2.53 Mbps/mW for 64-QAM and 16-QAM configurations, respectively. The complexity efficiency is 1.19 Mbps/K-gate for 16-QAM configuration.

The remainder of the paper is organized as follows. Section 2 reviews the conventional sphere decoding algorithm. Section 3 introduces the proposed SFSD algorithm with related simulation results. Section 4 presents the logic design. Section 5 reports the hardware implementation. Finally, Section 6 concludes the paper.

2. Conventional Sphere Decoding Algorithm

Let us consider a MIMO system with $M$ transmitting and $N$ receiving antennas ($N \ge M$). $M$ streams of $Q$ bits of data, $x_{j,b}$, $j = 1, 2, \ldots, M$ and $b = 1, 2, \ldots, Q$, are mapped to an $M$-dimensional transmitted symbol vector $\mathbf{s} = [s_1\ s_2\ \cdots\ s_M]^T$ using $2^Q$-QAM modulation. The received complex vector, $\mathbf{y}$, is given by
$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{n}, \tag{1}$$
where $\mathbf{H}$ is an $N \times M$ channel matrix, which is assumed to be known in advance, and $\mathbf{n}$ is the complex Gaussian noise vector.

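The a posteriori log-likelihood ratio (LLR) of the bit $x_{j,b}$, conditioned on the received symbol vector $\mathbf{y}$, is the soft information that a soft-output detector passes on for decision and can be expressed as
$$L(x_{j,b} \mid \mathbf{y}) = \ln \frac{\Pr[x_{j,b} = +1 \mid \mathbf{y}]}{\Pr[x_{j,b} = -1 \mid \mathbf{y}]}. \tag{2}$$
By Bayes' theorem, the max-log approximation, and the proof in [5, 6], (2) can be rewritten as
$$L(x_{j,b} \mid \mathbf{y}) \approx \min_{\mathbf{s} \in \mathcal{S}^{(-1)}_{j,b}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 - \min_{\mathbf{s} \in \mathcal{S}^{(+1)}_{j,b}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2, \tag{3}$$
where $\mathcal{S}^{(-1)}_{j,b}$ and $\mathcal{S}^{(+1)}_{j,b}$ are the search spaces $\{\mathbf{s} \mid x_{j,b} = -1\}$ and $\{\mathbf{s} \mid x_{j,b} = +1\}$, respectively. One of the two minima in (3) is attained by the maximum-likelihood (ML) solution
$$\mathbf{s}^{\mathrm{ML}} = \arg\min_{\mathbf{s} \in \mathcal{C}^M} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2, \tag{4}$$
where $\mathcal{C}^M$ is the set of constellation symbols in the $M$-dimensional complex space. Letting $\overline{x^{\mathrm{ML}}_{j,b}}$ be the binary complement of the $b$th bit in the $j$th data symbol of $\mathbf{s}^{\mathrm{ML}}$, the other minimum in (3) is attained by
$$\mathbf{s}^{\overline{\mathrm{ML}}}_{j,b} = \arg\min_{\mathbf{s} \in \mathcal{S}^{(\overline{x^{\mathrm{ML}}_{j,b}})}_{j,b}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2. \tag{5}$$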
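To make (3) through (5) concrete, the following minimal sketch evaluates the max-log LLRs by exhaustive search for a small 2×2 QPSK system. The bit-to-symbol labelling in `qpsk`, the function names, and the omission of any noise-variance scaling are assumptions made for illustration; an exhaustive search is used here only as a reference and is not the detector proposed in this paper.

```python
from itertools import product

def qpsk(b1, b2):
    """Map two bits in {+1, -1} to a QPSK symbol (assumed labelling)."""
    return complex(b1, b2)

def sq_dist(y, H, s):
    """Squared Euclidean distance ||y - H s||^2."""
    total = 0.0
    for i, row in enumerate(H):
        residual = y[i] - sum(h * sj for h, sj in zip(row, s))
        total += abs(residual) ** 2
    return total

def max_log_llrs(y, H, M=2):
    """Return (ML bits, LLRs) by searching all 2^(2M) QPSK bit vectors."""
    best = {}                      # (bit index, bit value) -> smallest distance
    d_ml, x_ml = float("inf"), None
    for bits in product((+1, -1), repeat=2 * M):
        s = [qpsk(bits[2 * j], bits[2 * j + 1]) for j in range(M)]
        d = sq_dist(y, H, s)
        if d < d_ml:               # ML solution as in (4)
            d_ml, x_ml = d, bits
        for k, b in enumerate(bits):
            if d < best.get((k, b), float("inf")):
                best[(k, b)] = d
    # Difference of the two minima as in (3): over x = -1 minus over x = +1.
    llrs = [best[(k, -1)] - best[(k, +1)] for k in range(2 * M)]
    return x_ml, llrs
```

For every bit, one of the two minima equals the ML distance, so the sign of each LLR agrees with the corresponding ML bit and only the counter-hypothesis minimum in (5) has to be found separately.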

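By means of the QR decomposition of the channel matrix $\mathbf{H} = \mathbf{Q}\mathbf{R}$, (4) and (5) can be reformulated as
$$\mathbf{s}^{\mathrm{ML}} = \arg\min_{\mathbf{s} \in \mathcal{C}^M} \|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2, \tag{6}$$
$$\mathbf{s}^{\overline{\mathrm{ML}}}_{j,b} = \arg\min_{\mathbf{s} \in \mathcal{S}^{(\overline{x^{\mathrm{ML}}_{j,b}})}_{j,b}} \|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2, \tag{7}$$
respectively, where $\tilde{\mathbf{y}} = \mathbf{Q}^H \mathbf{y} = \mathbf{R}\mathbf{s} + \tilde{\mathbf{n}}$, $\mathbf{Q}$ is an $N \times M$ matrix with orthogonal unit-norm columns, $\mathbf{R}$ is an $M \times M$ upper-triangular matrix, and $(\cdot)^H$ denotes the Hermitian transpose. Note that the noise term $\tilde{\mathbf{n}} = \mathbf{Q}^H \mathbf{n}$ keeps the same statistical properties as $\mathbf{n}$ because the columns of $\mathbf{Q}$ are orthonormal [1].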

In addition, sorted QR decomposition (SQRD) [10] is applied, and the Euclidean distance (ED) $\|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2$ in (6) and (7) is computed recursively as
$$d_i(\mathbf{s}^i) = d_{i+1}(\mathbf{s}^{i+1}) + \left| \tilde{y}_i - \sum_{j=i}^{M} R_{i,j} s_j \right|^2, \quad i = M, M-1, \ldots, 1, \tag{8}$$
where $d_{M+1}(\mathbf{s}^{M+1}) = 0$, $\mathbf{s}^i = [s_i, s_{i+1}, \ldots, s_M]^T$, and $\tilde{y}_i$ and $R_{i,j}$ are, respectively, elements of $\tilde{\mathbf{y}}$ and $\mathbf{R}$. Lattice points with partial Euclidean distance (PED), $d_i(\mathbf{s}^i)$, greater than the square of a given radius $r$ are invalid. Therefore, the candidate search is confined to lattice points within a sphere to reduce the detection complexity. Nevertheless, the sequential search property and the variable computational complexity prevent the conventional SD from being implemented efficiently in hardware.
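The recursion in (8) can be sketched in a few lines. The function name and the 0-based list layout below are illustrative assumptions; the code only accumulates the PED row by row from the last row of $\mathbf{R}$ upward.

```python
def ped_recursion(y_tilde, R, s):
    """Accumulate the PEDs of (8) from row i = M up to row i = 1.

    y_tilde: list of M complex values, R: M x M upper-triangular matrix
    (list of rows), s: candidate symbol vector. Indices are 0-based here.
    Returns [d_M, d_{M-1}, ..., d_1].
    """
    M = len(s)
    d = 0.0                          # d_{M+1} = 0
    partial = []
    for i in range(M - 1, -1, -1):
        interference = sum(R[i][j] * s[j] for j in range(i, M))
        d += abs(y_tilde[i] - interference) ** 2
        partial.append(d)
    return partial
```

The last entry equals the full distance $\|\tilde{\mathbf{y}} - \mathbf{R}\mathbf{s}\|^2$, and since each increment is nonnegative the PED never decreases along a path, which is what allows a branch to be pruned as soon as its PED exceeds $r^2$.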

3. Proposed Soft-Output FSD (SFSD)

3.1. Algorithm Description

The FSD algorithm [7] is another solution to MIMO detection. It is similar to SD but has two major differences.
(i) A fixed tree search is performed: FSD visits all nodes in the top level and only one node in each of the other levels, which simplifies the tree search in SD.
(ii) The channel matrix ordering [11] is modified so that the column of the channel matrix $\mathbf{H}$ with the smallest norm is placed in the first level of the tree.

The FSD algorithm presents quasi-ML performance and fixed complexity in an uncoded system. However, FSD in its original form is not compatible with powerful soft-input ECC decoders.

For soft-output MIMO detection, the quasi-ML solution in (6) and the local minima in (7) are essential to calculate the LLR. Therefore, the proposed SFSD modifies the FSD so that it also finds the local minima in (7) for soft output. Accordingly, a soft-output detector requires more branches than the FSD. Monte Carlo simulations of the number of visited branches required by the SD and FSD algorithms were performed. Tables 1 and 2 show the mean and variance of the number of branches for a visited node at each layer in a 4×4 MIMO system using 16-QAM and 64-QAM, respectively. This distribution helps the FSD algorithm to fix the number of branches in the tree traversal. Both tables show a higher mean and variance of visited branches with SD ordering than with FSD ordering in all layers except layer 4. FSD ordering can therefore reduce the number of searched nodes and save computational complexity. Denoting the number of branches at the $i$th layer as $n_i$, the distribution of the branch number over the 4 layers is $\mathbf{n} = [n_1, n_2, n_3, n_4]^T$. Figure 1 illustrates the idea. The means in Tables 1 and 2 are in descending order from the third level to the first level, so $n_i$ has the property
$$E[n_3] \ge E[n_2] \ge E[n_1]. \tag{9}$$
The SFSD may follow this property and can be fixed to, for example, $\mathbf{n} = [2,3,4,P]^T$ or $\mathbf{n} = [3,3,3,P]^T$, or to another distribution that follows (9), where $P$ is equal to the number of constellation points ($2^Q$).

Figure 2 shows the FER performance of different SFSD distributions with LLR maximum $L_{\max} = 16$ in a 4×4 MIMO 16-QAM system. The simulated distributions $\mathbf{n} = [n_1, n_2, n_3, n_4]^T$ in the figure are selected based on (9) and the mean and variance of the visited branches in Tables 1 and 2. SD can be regarded as the optimal performance here. The number of branches per node affects the performance of the SFSD: more branches give better performance, but too large a distribution $\mathbf{n}$ increases the tree search complexity. Table 3 presents the total visited nodes for different SFSD distributions with 16-QAM modulation. The distribution $[1,1,1,16]^T$ is the one used in the hard-output FSD algorithm. The total number of visited nodes can serve as a measure of the computational complexity of the SFSD. The $\mathbf{n} = [1,2,2,16]^T$ SFSD needs the least branch expansion and has only 0.5 dB performance degradation compared to SD. The $\mathbf{n} = [2,3,4,16]^T$ SFSD performs better, degrading by only about 0.1 dB, but has the highest complexity. This is a tradeoff between performance and complexity. Therefore, the branch distribution $\mathbf{n} = [1,2,2,2^Q]^T$, which has low complexity, is adopted for the fully pipelined parallel hardware architecture of the proposed SFSD. The top layer, $n_4$, is set to the number of all constellation points.
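As a rough illustration of how the branch distribution drives complexity, the sketch below counts the nodes expanded by a fixed tree search with distribution $\mathbf{n} = [n_1, n_2, n_3, n_4]^T$. It assumes every surviving node at layer $i+1$ expands exactly $n_i$ children and that every expanded node is counted once; the counting convention behind Table 3 is not spelled out here and may differ, so the numbers are indicative only.

```python
def visited_nodes(n):
    """Count expanded nodes for n = [n1, n2, n3, n4].

    The tree is searched from the top layer (n4) down to layer 1, and the
    number of paths multiplies by the branch count at every layer.
    """
    total, width = 0, 1
    for branches in reversed(n):
        width *= branches      # paths alive after expanding this layer
        total += width
    return total

for dist in ([1, 1, 1, 16], [1, 2, 2, 16], [2, 3, 4, 16]):
    print(dist, visited_nodes(dist))
```

Under this counting, the hard-output distribution expands 64 nodes, $[1,2,2,16]^T$ expands 176, and $[2,3,4,16]^T$ expands 656, which matches the qualitative ranking of complexity discussed above.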

Conventional sphere decoders adopt the SE enumeration [9], which sorts all constellation points by their Euclidean distance to the received signal $y_i$. The point with the smallest distance has the first priority, and the others are enumerated in ascending order of distance. Because the proposed SFSD with $\mathbf{n} = [1,2,2,P]^T$ only needs to enumerate the two nearest constellation points, the SE enumeration can be simplified, and the proposed SFSD does not need to sort all points. The simplified enumeration works as follows. Figure 3 shows an example for the 16-QAM constellation, where the red point is the received signal $y_i$. The complex-valued constellation is separated into its real part and imaginary part. The nearest point is found by comparing $y_i$ with the values of the dotted lines in the real part and imaginary part, respectively; this identifies the nearest point, which lies inside the green circle. The second nearest values in the real part and imaginary part are then easily found by comparing with the values of the gray lines that pass through the nearest point. The two points $2a$ and $2b$ inside the blue circles are the second nearest candidates in the real part and imaginary part, respectively. PED computations between those points and $y_i$ are then required. The distances between the nearest point and the received signal point in the real part and imaginary part are denoted as $d_{R1}$ and $d_{I1}$, respectively. The distances between the second nearest point and the received signal point in the real part and imaginary part are denoted as $d_{R2}$ and $d_{I2}$, respectively. The second nearest point is then decided by $\min\{(d_{R1} + d_{I2}), (d_{R2} + d_{I1})\}$. Only a few comparisons are therefore needed for the SFSD to achieve the SE enumeration. Note that these PED computations of the real part and imaginary part are equivalent to the PED computations of two points in the complex-valued constellation. Therefore, the simplified SE enumeration costs just two PED computations.
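The simplified enumeration can be sketched as follows. The unnormalised 16-QAM levels `PAM4`, the function names, and the use of squared per-axis distances for $d_{R1}$, $d_{R2}$, $d_{I1}$, $d_{I2}$ are assumptions made for this example; the scaling by $R_{i,i}$ and the interference cancellation that precede slicing in a real detector are omitted.

```python
PAM4 = (-3, -1, 1, 3)   # assumed per-axis levels of an unnormalised 16-QAM grid

def two_nearest_levels(v, levels=PAM4):
    """Return (nearest, second nearest) level using threshold comparisons only."""
    idx = 0
    for k in range(1, len(levels)):
        if v > (levels[k - 1] + levels[k]) / 2:   # decision boundary (dotted line)
            idx = k
    # The second nearest level is the neighbour on the side where v lies,
    # or the inner neighbour when the nearest level is at the grid edge.
    if v >= levels[idx]:
        idx2 = idx + 1 if idx + 1 < len(levels) else idx - 1
    else:
        idx2 = idx - 1 if idx > 0 else idx + 1
    return levels[idx], levels[idx2]

def two_nearest_points(y):
    """Return the nearest and second nearest constellation points to y."""
    r1, r2 = two_nearest_levels(y.real)
    i1, i2 = two_nearest_levels(y.imag)
    d_r1, d_r2 = (y.real - r1) ** 2, (y.real - r2) ** 2
    d_i1, d_i2 = (y.imag - i1) ** 2, (y.imag - i2) ** 2
    first = complex(r1, i1)
    # Candidate 2a changes only the real part, candidate 2b only the imaginary part.
    second = complex(r2, i1) if d_r2 + d_i1 <= d_r1 + d_i2 else complex(r1, i2)
    return first, second
```

Because the squared distance separates into a real and an imaginary term, any point other than the nearest one must differ in at least one coordinate, so the smaller of the two candidate sums is the second smallest distance overall and no full sort is needed.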

The sequential design for tree search has variable throughput due to the unpredictable number of visited nodes, which causes hardware inefficiency. Early termination (ET) [6] can solve the problem. It confines the frame run time to $N_{\mathrm{frame}} D_{\mathrm{avg}}$ and lets every frame have the same delay over $N_{\mathrm{frame}}$ MIMO detections. The function confines the maximum number of visited nodes in each tree search to
$$D_{\max}(k) = N_{\mathrm{frame}} D_{\mathrm{avg}} - \sum_{i=1}^{k-1} D(i) - (N_{\mathrm{frame}} - k) M, \quad k = 1, 2, \ldots, N_{\mathrm{frame}}, \tag{10}$$
where $D(i)$ denotes the number of visited nodes for the $i$th MIMO symbol vector.
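The node budget in (10) can be sketched as below. This assumes (10) is read as the frame budget $N_{\mathrm{frame}} D_{\mathrm{avg}}$ minus the nodes already spent and minus a reserve of $M$ nodes for each symbol vector still to come; the function and argument names are illustrative.

```python
def max_nodes(k, visited, n_frame, d_avg, M):
    """Node budget D_max(k) for the k-th symbol vector, k = 1..n_frame.

    visited holds D(1), ..., D(k-1), the node counts already spent
    in the current frame.
    """
    if len(visited) != k - 1:
        raise ValueError("visited must contain exactly k - 1 entries")
    return n_frame * d_avg - sum(visited) - (n_frame - k) * M

# Example: 4 symbol vectors per frame, average budget 10 nodes, M = 4.
print(max_nodes(1, [], n_frame=4, d_avg=10, M=4))   # 28
```

Under this reading, an expensive early detection shrinks the budget of later ones, while the reserve term leaves enough nodes for each remaining symbol vector to reach at least one leaf.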

3.2. Simulation Results

The simulation environment is a Rayleigh flat fading channel with AWGN. The algorithm is evaluated in terms of the number of visited parent nodes in the tree search and the frame-error-rate (FER) performance. All simulations are performed in a 4×4 MIMO system with Gray-mapped QAM constellations. In the simulations, data are coded with a rate $R = 1/2$ convolutional code with constraint length 7 and generator polynomials $[133_o, 171_o]$. A soft-input Viterbi decoder (max-log BCJR algorithm [12]) is employed in the receiver. A frame consists of $N_{\mathrm{frame}} = 64$ MIMO symbols. In other words, a frame consists of $N_{\mathrm{frame}} M Q$ randomly interleaved bits after outer encoding.
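For reference, a rate-1/2, constraint-length-7 convolutional encoder with the octal generators 133 and 171 can be sketched as follows. The tap ordering (newest bit at the most significant register position), the zero initial state, and the absence of termination or puncturing are assumptions of this sketch; the simulation setup described above may use a different but equivalent convention.

```python
G = (0o133, 0o171)   # generator polynomials in octal
K = 7                # constraint length

def conv_encode(bits):
    """Rate-1/2 encoder: emits two coded bits per input bit."""
    reg, out = 0, []
    for b in bits:
        reg = (reg >> 1) | (b << (K - 1))   # newest bit enters at the MSB
        for g in G:
            out.append(bin(reg & g).count("1") % 2)   # parity of tapped bits
    return out

# The impulse response reproduces the generator bit patterns, interleaved.
print(conv_encode([1, 0, 0, 0, 0, 0, 0]))
```

With $N_{\mathrm{frame}} = 64$, $M = 4$, and $Q = 4$ (16-QAM), a frame carries $64 \times 4 \times 4 = 1024$ coded bits, that is, 512 information bits at rate 1/2 when tail bits are ignored.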

Figure 4 illustrates the frame-error-rate (FER) performance of the optimal SD, STS-SD [6], and proposed SFSD. At 2%-FER, only a 0.5-dB loss is observed between the SFSD and the optimal SD. The performance degradation is even smaller between the SFSD and the STS-SD.

Recursive computation of the PED as in (8) for the tree search leads to considerable computational complexity and latency. Hence, the number of visited parent nodes in the tree search is an indicator of search complexity and latency. Figure 5 compares the average number of visited parent nodes of the STS-SD algorithm [6] and the proposed SFSD. The proposed method reduces the average number of visited parent nodes by 20%, which shows the low-complexity advantage of the SFSD over the STS-SD.

4. Logic Design

The proposed MIMO detector, shown in Figure 6, consists of a preprocessing block and an SFSD block. The preprocessing block performs sorted QR decomposition (SQRD) and FSD ordering. The matrix $\mathbf{R}$ produced by SQRD consists of real-valued diagonal elements, $R_r$, and complex-valued off-diagonal elements, $R_c$. The computation of $\tilde{\mathbf{y}} = \mathbf{Q}^H \mathbf{y}$ is also performed in this block. In each iteration of the SQRD algorithm, FSD ordering chooses the column with the second smallest column norm instead of the smallest one. The column of the channel matrix with the smallest norm is placed in the top level of the tree, and the other columns are ordered with decreasing norm from right to left. In other words, the second top level has the greatest column norm, and the leaf level has the second smallest column norm. The SFSD block performs the proposed SFSD algorithm and adopts a parallel design for high throughput and a fully pipelined design for high clock rate. Figure 7 shows the detailed block diagram of the proposed SFSD. The SFSD block consists of several PED modules (PED4, PED3, PED2, and PED1), a list administration unit (LAU), and an LLR module. The PED4, PED3, PED2, and PED1 modules calculate the PEDs for layers 4, 3, 2, and 1, respectively. The LAU module compares and sorts the Euclidean distances (EDs) accumulated by the PED modules. After $2^Q$ sequential update checks ($2^Q$ is the constellation size), the LAU module outputs the calculated local minima $\lambda$ for each layer and the ML solution $\mathbf{x}^{\mathrm{ML}}$ (Figure 7). The LLR computation module subtracts the smallest distance value $\lambda^{\mathrm{ML}}$ from each local minimum distance value $\lambda_k$, which corresponds to the local minimum with the binary complement of the $k$th bit of $\mathbf{x}^{\mathrm{ML}}$, and computes each LLR value. The order of the LLR values differs from the order of the original input data because SQRD preprocessing reorders the matrix $\mathbf{R}$. Hence, order recovery is also implemented in the LLR computation module.
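The last two steps, LLR computation and order recovery, can be sketched in software as follows. The sign convention (the LLR takes the sign of the ML bit), the clipping of the magnitude at $L_{\max} = 16$, and the permutation convention in `restore_order` are assumptions made for this illustration, not a description of the actual LLR module.

```python
def llr_from_minima(x_ml, lam_ml, lam, l_max=16.0):
    """Compute LLRs from the ML distance and the counter-hypothesis minima.

    x_ml: ML bits in {+1, -1}; lam_ml: smallest distance;
    lam[k]: smallest distance found with bit k complemented.
    """
    llrs = []
    for x_k, lam_k in zip(x_ml, lam):
        magnitude = min(lam_k - lam_ml, l_max)   # lam_k >= lam_ml by definition
        llrs.append(x_k * magnitude)
    return llrs

def restore_order(per_stream, perm):
    """Undo the SQRD reordering.

    perm[i] is the original stream index of detector position i
    (assumed convention).
    """
    out = [None] * len(per_stream)
    for i, p in enumerate(perm):
        out[p] = per_stream[i]
    return out
```

For example, `llr_from_minima((+1, -1, +1), 1.0, [3.0, 2.5, 40.0])` gives `[2.0, -1.5, 16.0]`: the third bit has a distant counter-hypothesis, so its LLR saturates at the clipping level.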

4.1. Adaptive Voltage Scaling with Double Sampling

Dynamic voltage scaling (DVS) [13] is a common technique to adjust system voltage and reduce power consumption. Some methods have been proposed for pessimistic designs that require large safety margins [14]. Instead of relying on the guard band of a safety margin, Razor [15, 16] proposed an in situ timing error detection and correction mechanism, which can tolerate process, voltage, and temperature (PVT) variation and significantly reduce the power consumption. In Razor-based DVS, each flip-flop in the critical path is augmented with a shadow latch, which is triggered by a delayed clock. By comparing the data sampled by the main flip-flop and the shadow latch, a timing error can be detected, and the corrupted value is replaced by the correct value held in the shadow latch.

An adaptive voltage scaling with double sampling scheme is applied to the proposed SFSD. As shown in Figure 8, the main flop is clocked by clk and the delayed flop is clocked by clk_del, which is delayed with respect to clock clk. The delay between the clock edges of clk and clk_del is designed such that the correct value is obtained at the next clock edge of clk_del, even in the presence of unpredictability in the wire delay. Therefore, the delayed flop is designed to operate in an error-free manner. The XOR gate is used to compare the data captured by the main flop and delayed flop. When the outputs of the two flops differ, the signal errq is active to indicate a timing error and turn the main flop to the error mode. Afterward, the correct value captured by the delayed flop will be sent by the main flop in the next cycle, causing a one-cycle penalty for error recovery. Figure 9 shows the timing diagram of the error detection and recovery.

At each pipeline stage, the error control circuit presented in Figure 10 is added to generate suitable control signals when an error is detected. Let the number of bit lines in the pipeline be $w$. The XOR outputs (errq signals) generated at all $w$ bit lines of a pipeline stage are ORed together and fed as an input to the error control circuit. In this error OR-tree, the error signals from all pipeline flip-flops are ORed together to generate a unified error signal. If the design has many double sampling flip-flops, the fan-in of the OR gate can be very large, and the OR signal must then be pipelined. In addition, the error signal can in some cases become metastable, so two flip-flops are added at the global error output to stabilize it.
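The detection and recovery behaviour can be illustrated with a small behavioural model. This is a software sketch under simplifying assumptions (one settling time per bit line, ideal clocks, no metastability), with hypothetical function and argument names; it is not a model of the actual circuit in Figures 8 to 10.

```python
def sample_stage(old, new, arrival, period, delay):
    """Model one pipeline stage of w bit lines for a single clock cycle.

    old/new: previous and intended bit values per line;
    arrival: time at which each line settles to its new value;
    period: clock period of clk; delay: offset of clk_del after clk.
    """
    # The main flop samples at the clk edge, the delayed flop at the clk_del edge.
    main = [n if t <= period else o for o, n, t in zip(old, new, arrival)]
    delayed = [n if t <= period + delay else o
               for o, n, t in zip(old, new, arrival)]
    errq = [m ^ d for m, d in zip(main, delayed)]   # per-bit XOR comparison
    stage_error = any(errq)                          # OR of all errq signals
    # On an error, the delayed-flop value is forwarded in the next cycle.
    recovered = delayed if stage_error else main
    return main, errq, stage_error, recovered

main, errq, err, rec = sample_stage(
    old=[0, 0, 1], new=[1, 1, 0],
    arrival=[0.8, 1.1, 0.9], period=1.0, delay=0.3)
print(main, errq, err, rec)
```

In this example the second bit line settles after the clk edge but before the clk_del edge, so the main flop holds a stale value, errq is raised for that line, and the correct word is recovered from the delayed flop at the cost of one cycle.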

4.2. Modified Cell-Based Flow

The timing paths of each submodule are analyzed after top-down synthesis. The critical paths of the target design are in the PED1 and PED2 modules. Therefore, double sampling flip-flops are inserted into these critical paths, and all the error signals of each double sampling pipeline stage are ORed together. Figure 11 shows the block diagram of the SFSD after double sampling pipeline insertion.

5. Hardware Implementation

The proposed DVS soft-output MIMO detector, implemented in TSMC 0.18 μm single-poly six-metal (1P6M) CMOS technology, is shown in Figure 12. Table 4 lists the ASIC specification. All power and throughput figures in Table 4 are given for operation at 1.8 V and 120 MHz.

This design surpasses previous work [6, 17] in power efficiency. A comparison (using the postlayout simulation results of the proposed design) is given in Table 5, where the power efficiency is normalized to mitigate the impact of different process technologies and throughputs using
$$\text{Normalized power efficiency} = \left[\text{Power} \times \left(\frac{1.8}{V_{DD}}\right)^{2} \times \frac{0.18}{\text{Process}} \times \frac{1}{\text{Throughput}}\right]^{-1}. \tag{11}$$
When normalizing the power efficiency using (11), the nominal $V_{DD}$ of the proposed design, 1.8 V, is used instead of the scaled voltage, to show the power-saving effect of the applied voltage scaling technique.
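The normalization in (11) can be written as a small helper. It assumes (11) is read as the inverse of the power scaled to 1.8 V and 0.18 μm and divided by throughput; the function name, argument names, and the numbers in the example are hypothetical and are not taken from Table 4 or Table 5.

```python
def normalized_power_efficiency(power_mw, vdd, process_um, throughput_mbps):
    """Normalized power efficiency in Mbps/mW, referred to 1.8 V and 0.18 um."""
    normalized_energy = (power_mw
                         * (1.8 / vdd) ** 2
                         * (0.18 / process_um)
                         / throughput_mbps)
    return 1.0 / normalized_energy

# Hypothetical design: 10 mW at 1.8 V in 0.18 um delivering 20 Mbps.
print(normalized_power_efficiency(10.0, 1.8, 0.18, 20.0))
```

A design already at 1.8 V and 0.18 μm is left unchanged by the scaling factors, so its normalized value is simply throughput divided by power, while a design reported at a lower voltage or a smaller process has its power scaled up before the comparison.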

By scaling down the supply voltage, the power consumption of the design can be reduced, as shown in Figure 13. The lowest supply voltage for error-free operation at a 100-MHz clock rate for 64-QAM and 16-QAM modulation is 1.44 V and 1.26 V, leading to 34.7% and 49.22% power reduction, respectively. When the voltage scales to 1.26 V for 64-QAM modulation or 1.08 V for 16-QAM modulation, timing errors occur on the critical path, which is pipelined by double sampling circuitry. The duplicate circuitry (delayed flip-flop) is triggered by the error signal and recovers the critical path from the timing error. When the supply voltage is scaled down further, some subcritical paths may suffer from timing errors that cannot be recovered. When the supply voltage decreases to 1.08 V for 64-QAM modulation, the functionality of the design fails due to incorrect data transferred by subcritical paths. With the proposed adaptive voltage scaling method with double sampling, the supply voltage can scale down to 1.08 V and 1.26 V, and the power reduction approaches 48.8% and 61.25%, for 16-QAM and 64-QAM modulation, respectively. Figure 14 shows the maximum clock rate and the current of the design in 16-QAM modulation. The power consumption ratio of each module is shown in Figure 15.
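As a first-order sanity check on these voltage points, the sketch below estimates the saving from voltage scaling alone, assuming dynamic power proportional to $V_{DD}^2$ at a fixed clock rate. This model ignores leakage, short-circuit power, and the overhead of the double sampling circuitry, so it only approximates the postlayout figures quoted above.

```python
def dynamic_power_saving(v_scaled, v_nominal=1.8):
    """Fractional saving if power scales with the square of the supply voltage."""
    return 1.0 - (v_scaled / v_nominal) ** 2

for v in (1.44, 1.26, 1.08):
    print(f"{v:.2f} V: {100 * dynamic_power_saving(v):.0f}% estimated saving")
```

The estimates come out at roughly 36%, 51%, and 64% for 1.44 V, 1.26 V, and 1.08 V, respectively.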

With the error monitoring circuit, timing error information can be reported to the voltage controller at run time so that it can direct the voltage regulator to raise or lower the supply voltage of the system. If the supply voltage is inadequate to meet the required timing of the critical paths, the power controller sends a decision bit that indirectly controls the voltage source, which is provided by a DC-DC converter. The power management of the platform is completed through these mechanisms. The double sampling scheme behaves as intended, provides a solution to timing error detection and recovery, and yields a robust DVS platform.

6. Conclusion

A power-efficient SFSD for MIMO detection is presented in the paper. The configurable SFSD with a novel tree search algorithm achieves soft-output decoding with fixed throughput of 120 Mbps in 16-QAM modulation. The FER of the detector approaches the optimal sphere decoder, with 0.5-dB degradation. A novel adaptive voltage scaling method with double sampling circuitry provides error detection/recovery and a 48.8% power saving.

Acknowledgments

The authors would like to thank the National Chip Implementation Center (CIC) of the National Applied Research Laboratories, Taiwan, for EDA tool support, as well as fabrication and measurement of the proposed chip. They would also like to thank the anonymous reviewers and editor for their valuable suggestions, which helped them improve this paper. This research is supported in part by the National Science Council, Taiwan, under Grants NSC-99-2221-E-007-050, NSC-99-2220-E-007-024, NSC-98-2219-E-007-003, supported in part by NTHU/ITRI joint research center, and supported in part by MediaTek-NTHU Joint Lab.